The Evolution of Visual Speech Recognition: From Deep Spatio-Temporal Modeling to LLM-Guided Reasoning
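- The visual speech recognition literature is published piecemeal, organized by model architecture, and lacks an end-to-end perspective from preprocessing to decoding, so researchers struggle to grasp what characterizes each stage of the field's technical evolution. This survey organizes representative methods and typical architectures into five technological eras, traces the progression from statistical models and RNN/CNN to Transformers and LLM-guided reasoning, and provides a clear era-to-method comparison table that supports the narrative throughout.
- Datasets and benchmark evaluation protocols are inconsistent, mixing controlled and in-the-wild, monolingual and multilingual settings, which hinders side-by-side comparison. This survey builds a taxonomy organized by granularity, collection environment, language, and modality, and traces how benchmarks have moved from laboratory settings to large-scale in-the-wild and multilingual evaluation, so that readers can locate suitable data and evaluation settings for their scenario.
- Merely listing methods offers little guidance for engineering deployment; robustness, low-resource settings, annotation quality, and long-form continuous speech remain common bottlenecks. This survey includes a section on open problems and future directions that summarizes issues such as environmental adaptability and deployment efficiency and connects them to multimodal and large-model trends, forming a citable, systematic reference framework.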
Abstract
Visual Speech Recognition (VSR) has rapidly evolved from handcrafted feature pipelines to deep spatio-temporal architectures and, more recently, LLM-guided reasoning systems. This survey provides a systematic review of that evolution, covering core components of the VSR pipeline, including preprocessing, visual feature extraction, spatio-temporal enhancement, sequence modeling, and decoding. We organize representative methods into five technological eras and analyze their structural shifts from statistical models and recurrent networks to temporal convolutions, Transformer-based global attention, and LLM-empowered generative refinement. We further present a comprehensive dataset taxonomy across granularity, collection environment, language, and modality, and summarize benchmark trends from controlled settings to large-scale in-the-wild and multilingual scenarios. Comparative analysis highlights that while modern visual encoders and attention mechanisms significantly improve discriminative capability, intrinsic viseme ambiguity remains a central bottleneck for visual-only recognition, motivating stronger linguistic priors and multimodal integration. Finally, we discuss key open challenges, including robustness under real-world perturbations, low-resource language coverage, annotation quality, long-form continuous speech modeling, and deployment efficiency, and outline future directions toward reliability-aware decoding, language-agnostic transfer, and scalable multimodal VSR systems.