Structured Temporal Regularization and Curriculum Optimization for Visual Speech Recognition
- In low-resource VSR, models often memorize speaker-specific spatial shortcuts, which yields high training accuracy but weak validation performance. We add SimMIM self-supervised pre-training and adopt a Swin V2-style encoder with residual post-normalization and scaled cosine attention, which curbs identity leakage and stabilizes the visual features passed to temporal modeling.
- Batch normalization in temporal branches can leak batch statistics across speakers, which hurts generalization under small batches. Group normalization replaces it and is tuned jointly with the new visual stack. Hierarchical temporal regularization, learnable mixed pooling, and a staged curriculum further smooth the train-validation curves and delay overfitting.
- The AICLD-500 benchmark lacked a strong baseline that improves both accuracy and the generalization gap. Comparisons and ablations over pre-training, backbone, regularization, and curriculum combinations show that the proposed method improves Top-1 accuracy by 1.91 absolute percentage points over the SwinLip baseline while substantially narrowing the train-validation gap.
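
The batch-statistics point in the highlights above can be illustrated with a minimal sketch. It is not the paper's implementation: each sample is reduced to a single vector of channel values (no time axis, no learnable scale or shift), and the names `batch_norm`, `group_norm`, and the `speaker_*` vectors are hypothetical. The sketch only shows that a group-normalized output for one sample does not depend on which other samples share its batch, whereas a batch-normalized output does.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize each channel with statistics pooled over the whole batch."""
    n, c = len(batch), len(batch[0])
    out = [[0.0] * c for _ in range(n)]
    for j in range(c):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / math.sqrt(var + eps)
    return out

def group_norm(batch, num_groups, eps=1e-5):
    """Normalize each sample on its own, over contiguous groups of channels."""
    c = len(batch[0])
    assert c % num_groups == 0, "channels must divide evenly into groups"
    size = c // num_groups
    out = []
    for row in batch:
        normed = []
        for g in range(num_groups):
            chunk = row[g * size:(g + 1) * size]
            mean = sum(chunk) / size
            var = sum((v - mean) ** 2 for v in chunk) / size
            normed.extend((v - mean) / math.sqrt(var + eps) for v in chunk)
        out.append(normed)
    return out

# Hypothetical per-speaker feature vectors (4 channels each).
speaker_a = [1.0, 2.0, 3.0, 4.0]
speaker_b = [10.0, 20.0, 30.0, 40.0]
speaker_c = [-5.0, 0.0, 5.0, 100.0]

# Normalize speaker A alongside two different batch partners.
gn_ab = group_norm([speaker_a, speaker_b], num_groups=2)[0]
gn_ac = group_norm([speaker_a, speaker_c], num_groups=2)[0]
bn_ab = batch_norm([speaker_a, speaker_b])[0]
bn_ac = batch_norm([speaker_a, speaker_c])[0]

gn_is_batch_independent = gn_ab == gn_ac  # per-sample statistics only
bn_is_batch_independent = bn_ab == bn_ac  # depends on the batch partner
```

With a batch of two, the batch-normalized value of speaker A flips sign depending on its partner, which is an extreme case of the small-batch sensitivity the highlights describe.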
Abstract
Visual Speech Recognition (VSR) in low-resource settings is highly vulnerable to overfitting and shortcut learning, resulting in severe train-validation divergence. To address this issue, we propose a structured optimization framework that jointly improves representation stability, temporal modeling reliability, and training dynamics. Specifically, we integrate SimMIM-based self-supervised pre-training to reduce identity-dependent spatial memorization, migrate the visual backbone from Swin Transformer V1 to a Swin V2-style design with Residual-Post-Normalization and scaled cosine attention to stabilize deep feature propagation, and replace Batch Normalization with Group Normalization in temporal branches to avoid batch-induced temporal leakage. We further introduce hierarchical temporal regularization, learnable mixed temporal pooling, and a stage-wise curriculum strategy with dynamic augmentation and plateau-aware adaptation to progressively shift learning from early discriminative fitting to robust generalization. Extensive experiments on the low-resource AICLD-500 benchmark demonstrate that the proposed method achieves a state-of-the-art Top-1 accuracy of 25.67%, outperforming a strong SwinLip baseline by 1.91 absolute percentage points, while significantly narrowing the generalization gap. These results indicate that structured temporal regularization coupled with curriculum optimization provides an effective and scalable solution for robust VSR under data-scarce conditions.
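
The abstract does not define learnable mixed temporal pooling, so the following is only a sketch under an assumption: that it is a convex blend of max pooling and average pooling over the time axis, with the blend weight obtained from a single learnable scalar through a sigmoid. The names `mixed_temporal_pool` and `mix_logit` are hypothetical, and a real model would learn `mix_logit` by gradient descent rather than set it by hand.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mixed_temporal_pool(frames, mix_logit):
    """Blend max and average pooling over the time axis.

    frames: list of T per-frame feature vectors, each of length C.
    mix_logit: assumed learnable scalar; sigmoid maps it to a weight in (0, 1).
    Returns one pooled vector of length C.
    """
    alpha = sigmoid(mix_logit)
    num_frames = len(frames)
    pooled = []
    for channel in zip(*frames):  # one tuple of T values per channel
        avg = sum(channel) / num_frames
        peak = max(channel)
        pooled.append(alpha * peak + (1.0 - alpha) * avg)
    return pooled

# Three frames, two channels (toy values).
frames = [[0.0, 2.0], [4.0, 2.0], [2.0, 8.0]]

balanced = mixed_temporal_pool(frames, mix_logit=0.0)     # alpha = 0.5
mostly_max = mixed_temporal_pool(frames, mix_logit=10.0)  # alpha close to 1
mostly_avg = mixed_temporal_pool(frames, mix_logit=-10.0) # alpha close to 0
```

Under this assumption, a large positive `mix_logit` keeps the strongest frame response per channel, while a large negative value falls back to a plain temporal average; intermediate values let training choose between the two.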