AI-Assisted Automated Construction Method for Large-Scale Lip-Reading Datasets
- To address the high cost of manual labeling and the difficulty of audio-visual synchronization in lip-reading datasets, a distributed automated pipeline was built. After FFmpeg standardization, color-histogram shot-boundary detection (a cut is declared when the histogram difference D > 30) and SyncNet cosine similarity (threshold 0.3) were used to correct audio-video offset. This enabled automated collection and cleaning of large video corpora with consistent alignment at scale.
- Coarse temporal labels do not cover lip motion precisely, so a hierarchical forced-alignment scheme running from sentence level down to word level was developed. Aeneas handled sentence-level speech-to-text alignment; MFA then matched word-level timing, with millisecond-level adjustment of the time axis according to word length. Each word is mapped to its video clip, and label files carrying a unique ID and the Pinyin transcription provide structured data for fine-grained lip reading.
- Multi-speaker interference and unstable ROI under complex poses were tackled with a dual mechanism of MTCNN detection and KCF tracking. ResNet-18 embeddings were clustered to separate identities, and an SVM rejected non-speaking faces and faces with extreme pose drift (e.g., large yaw). The lip ROI was scaled dynamically from 68 facial landmarks, which preserved lip detail while reducing label noise and improving dataset robustness.
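The dynamic ROI scaling in the last item can be sketched as follows. This is a minimal illustration, not the invention's implementation: it assumes the landmarks follow the common 68-point (iBUG 300-W) layout, in which indices 48 to 67 outline the mouth, and the function name `lip_roi` and the padding ratio `margin` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]

# In the common 68-point (iBUG 300-W) layout, indices 48-67 outline the mouth.
MOUTH = range(48, 68)


def lip_roi(landmarks: List[Point], margin: float = 0.4) -> Tuple[int, int, int, int]:
    """Return a square (left, top, right, bottom) crop centered on the mouth.

    `margin` is a hypothetical padding ratio; the source gives no value for it.
    """
    if len(landmarks) != 68:
        raise ValueError("expected 68 landmarks")
    xs = [landmarks[i][0] for i in MOUTH]
    ys = [landmarks[i][1] for i in MOUTH]
    cx = (min(xs) + max(xs)) / 2
    cy = (min(ys) + max(ys)) / 2
    # The crop size follows the larger mouth extent, so it grows when the
    # face is closer to the camera or the mouth opens wide.
    extent = max(max(xs) - min(xs), max(ys) - min(ys))
    half = extent * (1 + margin) / 2
    return (round(cx - half), round(cy - half), round(cx + half), round(cy + half))
```

Because the crop is derived from the mouth landmarks of each frame rather than from a fixed box, it stays centered on the lips as the tracked face moves. In practice the result would still need clamping to the frame boundaries.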
This invention discloses an AI-assisted method and system for the automated construction of large-scale Mandarin lip-reading datasets. Distributed crawlers harvest video, and FFmpeg extracts the audio and video streams. Shot-boundary detection and SyncNet align audio with video. Aeneas and MFA handle speech transcription and timestamp alignment. MTCNN, a KCF tracker, and ResNet-18 support face detection and speaker clustering. Lip landmarks yield ROI crops that are stored by class, and multi-model verification filters for high-quality samples. The method mitigates the high construction cost, poor quality, and difficult synchronization of existing datasets, and improves both build efficiency and corpus quality for lip-reading model training.
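The alignment stage (shot-boundary detection followed by an audio-video offset check) might look like the sketch below. Several points are assumptions rather than details from the source: the source does not define D, so it is taken here as the L1 distance between normalized per-channel color histograms, rescaled to 0-100; the embeddings passed to `best_av_offset` are stand-ins for SyncNet outputs, which this code does not compute; all function names, `bins`, and `max_shift` are hypothetical. Only the thresholds 30 and 0.3 come from the text.

```python
import math
from typing import List, Optional, Sequence, Tuple

Pixel = Tuple[int, int, int]


def color_histogram(frame: Sequence[Pixel], bins: int = 8) -> List[float]:
    """Per-channel histogram, normalized so each channel sums to 1."""
    hist = [0.0] * (3 * bins)
    for pixel in frame:
        for channel, value in enumerate(pixel):
            hist[channel * bins + value * bins // 256] += 1.0
    n = max(len(frame), 1)
    return [h / n for h in hist]


def histogram_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed definition of D: 0 for identical histograms, 100 for disjoint ones."""
    # Each of the 3 channels contributes an L1 distance of at most 2.
    return 100.0 * sum(abs(x - y) for x, y in zip(a, b)) / 6.0


def shot_boundaries(frames: Sequence[Sequence[Pixel]], threshold: float = 30.0) -> List[int]:
    """Indices i where a cut is declared between frame i-1 and frame i."""
    hists = [color_histogram(f) for f in frames]
    return [i for i in range(1, len(hists))
            if histogram_distance(hists[i - 1], hists[i]) > threshold]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def best_av_offset(video_emb: Sequence[Sequence[float]],
                   audio_emb: Sequence[Sequence[float]],
                   max_shift: int = 5,
                   min_similarity: float = 0.3) -> Optional[Tuple[int, float]]:
    """Return (shift, mean cosine similarity), or None if no shift reaches the threshold.

    A shift of k pairs video step i with audio step i + k.
    """
    best: Optional[Tuple[int, float]] = None
    for shift in range(-max_shift, max_shift + 1):
        pairs = [(video_emb[i], audio_emb[i + shift])
                 for i in range(len(video_emb))
                 if 0 <= i + shift < len(audio_emb)]
        if not pairs:
            continue
        score = sum(cosine(v, a) for v, a in pairs) / len(pairs)
        if best is None or score > best[1]:
            best = (shift, score)
    if best is not None and best[1] >= min_similarity:
        return best
    return None
```

A clip for which `best_av_offset` returns `None` would be discarded as unsynchronizable under this sketch; otherwise the returned shift is the correction to apply. A production version would likely also require a minimum number of overlapping steps per shift, since very short overlaps can produce misleadingly high averages.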