Machine-vision lip reading: algorithm design and system development

SRTP project: built AICLD, a fully automated, incrementally growing corpus (~1.4M samples, 5k+ speakers), and designed and trained lip-reading models. Outputs include papers in IEEE TIP and other Q1 journals, plus invention patents and software copyrights.

Computer Vision

Background

As AI, computer vision, and NLP converge, lip reading offers audio-free communication with broad impact. Mandarin lip reading, however, faces a dual bottleneck. On the data side, corpora are limited in scale and depend on heavy manual labeling. On the model side, generalization in low-resource settings is weak and temporal modeling is insufficient. There are also few systematic surveys of how the field has evolved. This SRTP (College Student Innovation Training Program) project focuses on corpus automation and algorithmic renewal: building an incrementally scalable Mandarin lip-reading dataset, modernizing low-resource architectures, and clarifying technical lineages to fill these gaps and provide practical support.

  1. To address high labeling cost, poor A/V sync, and slow scaling in Mandarin lip-reading corpora, built a distributed, AI-assisted incremental pipeline: FFmpeg preprocessing, shot-boundary detection and SyncNet alignment, hierarchical forced alignment with Aeneas and MFA, MTCNN+KCF ROI extraction, and ResNet-18 speaker clustering. The resulting AICLD corpus holds 1,400,000+ samples from 5,238 speakers over 110+ hours and grows by 3,000+ samples per day; it underpins the IEEE TIP dataset paper and the invention patents.
  2. Video-only releases made layered experiments hard to run and let publicly reported statistics drift. Defined a unified metadata schema (pose, key frames, reliability) backed by sampling-based QA, and built a multi-dimensional AICLD matrix over scale, preprocessing, temporal resolution, and key-frame sampling. Ablations across this matrix were used to select the best-performing processing paradigm, so that public metrics are verifiable.
  3. Led AICLD platform architecture and full-stack development: a Streamlit portal for indexing, versioning, and task management, plus cloud-based incremental ingest, annotation and QA workflows, real-time monitoring, and technical and data-request documentation. The platform supports daily updates and compliant team access (software copyright).
  4. To address low-resource overfitting and large train–validation gaps, built a PyTorch stack combining SimMIM pre-training, Swin V2, GN temporal branches, and staged curricula. On AICLD-500, Top-1 accuracy improved by 1.91 points over SwinLip with a much narrower generalization gap, supporting the TASLP paper.
  5. To address a scattered VSR literature that lacks an end-to-end view from preprocessing to decoding, helped organize the field into five technological eras with their typical architectures, established a dataset taxonomy over granularity, environment, language, and modality, and drafted the open-problem and future-direction sections. Together these form a citable reference framework for the ARC survey.
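
As an illustration of items 2 and 3, the metadata schema, incremental ingest, and sampling QA could be sketched roughly as follows. This is a minimal sketch, not the AICLD implementation: the class `SampleMeta`, its field names, the functions `ingest` and `qa_sample`, and the thresholds are all hypothetical and chosen only to mirror the pose / key-frame / reliability fields named above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleMeta:
    """Hypothetical per-clip metadata record; field names are illustrative."""
    sample_id: str
    speaker_id: str
    yaw_deg: float            # head pose (assumed here: yaw in degrees)
    key_frames: tuple = ()    # indices of key frames within the clip
    reliability: float = 1.0  # assumed score in [0, 1]


def ingest(index, batch, min_reliability=0.8):
    """Incrementally add records to `index` (a dict keyed by sample_id).

    Skips duplicates and records below the reliability threshold, and
    returns the list of records that were actually added.
    """
    added = []
    for meta in batch:
        if meta.sample_id in index or meta.reliability < min_reliability:
            continue
        index[meta.sample_id] = meta
        added.append(meta)
    return added


def qa_sample(records, rate=0.1, seed=0):
    """Draw a reproducible random subset of records for manual spot checks."""
    records = list(records)
    k = max(1, round(len(records) * rate)) if records else 0
    return random.Random(seed).sample(records, k)


if __name__ == "__main__":
    index = {}
    batch = [
        SampleMeta("a", "s1", 5.0, (0, 12), 0.95),
        SampleMeta("b", "s2", 30.0, (3,), 0.50),   # below threshold, skipped
        SampleMeta("a", "s1", 5.0, (0, 12), 0.95),  # duplicate, skipped
    ]
    added = ingest(index, batch)
    print([m.sample_id for m in added])
    print([m.sample_id for m in qa_sample(index.values())])
```

Keying the index by sample ID makes a re-run over the same batch a no-op, which is the property a daily incremental ingest needs; the fixed seed in `qa_sample` keeps the spot-check subset reproducible across reviewers.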

Research outputs

Stack

Python, PyTorch, Swin Transformer, FFmpeg tooling