AICLD: AI-assisted Incremental Chinese Lip-reading Database

Abstract

Lip-reading interprets spoken language by visually analyzing lip movements and has a wide range of practical applications. Combined with deep learning and large-scale annotated datasets, its performance improves significantly, to the point of meeting the requirements of practical applications. To advance the field, this paper presents a large-scale, in-the-wild lip-reading dataset with a natural distribution, the AI-assisted Incremental Chinese Lip-reading Database (AICLD), which is continuously expanding. At the time of this study, AICLD was the world's largest Mandarin lip-reading dataset to date, containing 1,255,108 video samples from 5,238 speakers, with a total duration of over 110 hours and coverage of 53,958 Chinese words. Each sample carries comprehensive multi-dimensional metadata annotations, including posture angles, key frames, and reliability. The dataset continues to grow at a rate of over 3,000 samples per day (real-time data is available at http://aicld.ifzu.vip/). A series of innovative experiments conducted on this dataset explores lip-reading methods from multiple dimensions and is expected to provide a solid foundation for subsequent studies. This paper also introduces a fully automated, AI-assisted data collection system that converts source videos into lip-reading datasets with complete metadata. This study provides a core open-access resource for Chinese visual speech research and is expected to serve as a methodological framework for constructing multilingual lip-reading datasets worldwide.

Detailed contributions

Abstract

Machine-vision lip reading: algorithm design and system development