Structured Temporal Regularization and Curriculum Optimization for Visual Speech Recognition

Abstract

Visual Speech Recognition (VSR) in low-resource settings is highly vulnerable to overfitting and shortcut learning, which leads to severe divergence between training and validation performance. To address this issue, we propose a structured optimization framework that jointly improves representation stability, temporal modeling reliability, and training dynamics. Specifically, we (i) integrate SimMIM-based self-supervised pre-training to reduce identity-dependent spatial memorization; (ii) migrate the visual backbone from Swin Transformer V1 to a Swin V2-style design with residual post-normalization and scaled cosine attention to stabilize deep feature propagation; and (iii) replace Batch Normalization with Group Normalization in the temporal branches to avoid batch-induced temporal leakage. We further introduce hierarchical temporal regularization, learnable mixed temporal pooling, and a stage-wise curriculum strategy with dynamic augmentation and plateau-aware adaptation, which progressively shifts learning from early discriminative fitting to robust generalization. Extensive experiments on the low-resource AICLD-500 benchmark show that the proposed method achieves a state-of-the-art Top-1 accuracy of 25.67%, outperforming a strong SwinLip baseline by 1.91 percentage points while significantly narrowing the generalization gap. These results indicate that structured temporal regularization coupled with curriculum optimization provides an effective and scalable solution for robust VSR under data-scarce conditions.
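
As an illustration of the scaled cosine attention mentioned above, the following is a minimal pure-Python sketch of the general Swin V2-style formulation, in which attention logits are cosine similarities divided by a temperature and optionally offset by a position bias. It is not the implementation used in this work: the function names, the fixed temperature `tau=0.1`, and the list-based tensors are assumptions made for readability, and in practice the temperature would be a learnable parameter per head.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two vectors (eps guards zero norms)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (max(norm_u, eps) * max(norm_v, eps))


def softmax(row):
    """Numerically stable softmax over a list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_cosine_attention(queries, keys, values, tau=0.1, bias=None):
    """Single-head attention with logits cos(q, k) / tau + bias.

    tau is a fixed number here (assumption); it is clamped from below at
    0.01 so the logits stay bounded. bias, if given, is a
    len(queries) x len(keys) table added to the logits.
    """
    tau = max(tau, 0.01)
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        logits = [cosine_similarity(q, k) / tau for k in keys]
        if bias is not None:
            logits = [logit + bias[i][j] for j, logit in enumerate(logits)]
        weights = softmax(logits)
        dim = len(values[0])
        out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights
```

Because the logits depend only on the angle between query and key, rescaling the activations leaves the attention weights unchanged, which is the property that helps keep deep feature propagation stable.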

Detailed contributions

Abstract

Machine-vision lip reading: algorithm design and system development