AICLD: AI-assisted Incremental Chinese Lip-reading Database
- Chinese lip-reading datasets have long relied on manual annotation, which scales too slowly to support reproducible benchmarking and fair evaluation. This paper proposes a fully automated, AI-assisted construction pipeline covering audiovisual processing, lip-region detection and tracking, ROI alignment, metadata ingestion, and versioning. Built with this pipeline, AICLD is currently the world's largest Mandarin lip-reading dataset, with over 1,400,000 samples from 5,238 speakers, more than 110 hours of speech, and sustained growth of more than 3,000 samples per day; real-time statistics are available at the open dataset portal: http://aicld.ifzu.vip/.
- Video alone is insufficient for fine-grained error attribution and stratified experiments; downstream work needs searchable metadata such as pose and reliability. The paper defines a unified metadata schema with pose angles, key frames, and reliability fields, backed by sampling-based QA, so that ingested data stays searchable and experiment-ready.
- On AICLD, the paper builds a multi-dimensional experiment matrix spanning scale gradients, preprocessing comparisons, temporal-resolution sensitivity, and key-frame sampling. Systematic ablations quantify the gains from additional data and identify the best-performing processing pipeline among those compared, probing the limits of learned representations on large-scale data and providing a benchmark that supports the paper's claims about scale.
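To make the idea of searchable, experiment-ready metadata concrete, the following is a minimal sketch of what a per-sample record and a metadata filter might look like. The field names (`yaw_deg`, `key_frames`, `reliability`, and so on), the assumed [0, 1] reliability range, and the thresholds are all hypothetical; the actual AICLD schema is not specified here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SampleMetadata:
    """Hypothetical per-sample record; field names are illustrative only."""
    sample_id: str
    speaker_id: str
    word: str
    yaw_deg: float    # head pose angles, assumed to be in degrees
    pitch_deg: float
    roll_deg: float
    key_frames: List[int] = field(default_factory=list)  # frame indices
    reliability: float = 0.0  # assumed to lie in [0, 1]


def select_samples(records, max_abs_yaw_deg=30.0, min_reliability=0.8):
    """Keep near-frontal samples whose reliability meets the threshold."""
    return [
        r for r in records
        if abs(r.yaw_deg) <= max_abs_yaw_deg and r.reliability >= min_reliability
    ]


# Toy records, invented purely for illustration.
records = [
    SampleMetadata("s0001", "spk01", "你好", 5.0, -2.0, 0.5, [3, 9, 14], 0.95),
    SampleMetadata("s0002", "spk02", "谢谢", 48.0, 1.0, 2.0, [4, 11], 0.90),
    SampleMetadata("s0003", "spk01", "再见", -10.0, 3.0, -1.0, [2, 8], 0.40),
]

selected = select_samples(records)
print([r.sample_id for r in selected])
```

A filter of this kind is what would allow pose-stratified or reliability-stratified subsets to be drawn without touching the video files themselves.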
Abstract
Lip-reading interprets spoken language by visually analyzing lip movements and has a wide range of practical applications. Combined with deep learning and large-scale annotated datasets, its performance improves substantially, enough to meet the requirements of practical use. To advance the field, this paper presents a large-scale, continuously expanding, in-the-wild lip-reading dataset with a natural distribution, termed the AI-assisted Incremental Chinese Lip-reading Database (AICLD). At the time of this study, AICLD, the world’s largest Mandarin lip-reading dataset to date, contained 1,255,108 video samples from 5,238 speakers, with a total duration of over 110 hours and coverage of 53,958 Chinese words. Each sample is accompanied by multi-dimensional metadata annotations, including pose angles, key frames, and reliability. The dataset continues to grow at a rate of over 3,000 samples per day (real-time statistics are available at http://aicld.ifzu.vip/). A series of experiments on this dataset explores lip-reading methods along multiple dimensions and is expected to provide a solid foundation for subsequent studies. This paper also introduces a fully automated, AI-assisted data collection system that converts source videos into lip-reading datasets with complete metadata. This study provides a core open-access resource for Chinese visual speech research and is expected to serve as a methodological framework for constructing multilingual lip-reading datasets worldwide.
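The headline figures above imply a few derived quantities that are useful as a sanity check. The sketch below is back-of-the-envelope arithmetic only: it treats "over 110 hours" and "over 3,000 samples per day" as exact values although both are lower bounds, and the helper `days_to_reach` is a hypothetical name introduced for illustration.

```python
import math

# Figures as stated at the time of the study.
SAMPLES_AT_STUDY = 1_255_108
SPEAKERS = 5_238
WORDS = 53_958
HOURS = 110            # stated as "over 110 hours", so a lower bound
DAILY_GROWTH = 3_000   # stated as "over 3,000 per day", so a lower bound

avg_samples_per_speaker = SAMPLES_AT_STUDY / SPEAKERS
avg_samples_per_word = SAMPLES_AT_STUDY / WORDS
avg_clip_seconds = HOURS * 3600 / SAMPLES_AT_STUDY


def days_to_reach(target, current=SAMPLES_AT_STUDY, per_day=DAILY_GROWTH):
    """Whole days needed to grow from `current` to `target` at a constant rate."""
    return math.ceil(max(0, target - current) / per_day)


print(f"samples per speaker: {avg_samples_per_speaker:.1f}")
print(f"samples per word:    {avg_samples_per_word:.1f}")
print(f"mean clip length:    {avg_clip_seconds:.2f} s")
print(f"days to 1.4M:        {days_to_reach(1_400_000)}")
```

Under these assumptions, growing from the count at the time of the study to the 1,400,000 samples quoted in the highlights takes roughly seven weeks at the stated daily rate.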