AICLD: AI-assisted Incremental Chinese Lip-reading Database
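- Chinese lip-reading datasets have long depended on manual annotation and grow slowly, which makes reproducible scale comparisons and fair evaluation difficult. We propose a fully automated, AI-assisted construction method that covers audio-video processing, lip-region detection and tracking with ROI alignment, and metadata ingestion with version management. Using this method, we build AICLD, currently the world's largest Chinese lip-reading dataset, with more than 1,400,000 samples from 5,238 speakers and over 110 hours of footage, growing by more than 3,000 samples per day; live statistics are available on the dataset's open website: http://aicld.ifzu.vip/.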
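- Video data alone cannot support fine-grained error attribution or stratified experiments; downstream evaluation needs searchable metadata such as pose and reliability. We define a unified metadata schema with fields for pose angles, key frames, and reliability, together with spot-check and consistency-validation mechanisms, so that ingested data are searchable and experiments are comparable, improving the engineering usability of the open dataset.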
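- On AICLD, we build a multi-dimensional experiment matrix covering scale gradients, preprocessing comparisons, temporal-resolution sensitivity, and key-frame sampling. Systematic ablations quantify the gains from additional data, identify the best-performing processing pipeline, and probe the limits of what models can learn from large-scale data, providing solid benchmark support for the rigor of the paper's claims about scale.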
Abstract
Lip-reading interprets spoken language by visually analyzing lip movements and has a wide range of practical applications. Combined with deep learning and large-scale annotated datasets, its performance improves substantially and can meet the requirements of practical applications. To advance the field, this paper presents a large-scale, continuously expanding, in-the-wild lip-reading dataset with a natural distribution, termed the AI-assisted Incremental Chinese Lip-reading Database (AICLD). At the time of this study, AICLD, the world's largest Mandarin lip-reading dataset to date, contained 1,255,108 video samples from 5,238 speakers, with a total duration of over 110 hours and coverage of 53,958 Chinese words. Each sample is accompanied by multi-dimensional metadata annotations, including pose angles, key frames, and reliability. The dataset continues to grow at a rate of over 3,000 samples per day (real-time statistics are available at http://aicld.ifzu.vip/). A series of experiments on this dataset explores lip-reading methods along multiple dimensions and is intended to provide a solid foundation for subsequent studies. This paper also introduces a fully automated, AI-assisted data collection system that converts source videos into lip-reading datasets with complete metadata. The study provides a core open-access resource for Chinese visual speech research and is expected to serve as a methodological framework for constructing multilingual lip-reading datasets worldwide.