AICLD: AI-assisted Incremental Chinese Lip-reading Database
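- Chinese lip-reading datasets have long relied on manual annotation and are slow to expand, which makes reproducible scale comparisons and fair evaluation difficult. This work proposes an AI-assisted, fully automated construction method covering audio-visual processing, lip-region detection and tracking with ROI alignment, and metadata ingestion with version management. Using this method, we build AICLD, currently the world's largest Chinese lip-reading dataset, with more than 1,400,000 samples and a corpus covering 5,238 speakers and more than 110 hours, growing by more than 3,000 samples per day. Live statistics are available on the dataset's open website: http://aicld.ifzu.vip/.
- Video data alone cannot support fine-grained error attribution or stratified experiments; downstream evaluation needs searchable metadata such as pose and reliability. We define a unified metadata schema with fields for pose angles, key frames, and reliability, together with spot-check and consistency-verification mechanisms. This ensures that ingested data are searchable and experiments are comparable, improving the engineering usability of the open dataset.
- Based on AICLD, we construct a multi-dimensional experiment matrix covering scale gradients, preprocessing comparisons, temporal-resolution sensitivity, and key-frame sampling. Systematic ablations quantify the gains from additional data, establish the best-performing processing paradigm, and probe the representational limits of models on large-scale data, providing a solid baseline that supports the rigor of the paper's claims about scale.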
Abstract
Lip-reading is a key technology that interprets spoken language by visually analyzing lip movements, and it has a wide range of practical applications. Combined with deep learning and large-scale annotated datasets, its performance improves significantly, to the point of meeting the requirements of practical applications. To advance the field, this paper presents a large-scale, in-the-wild lip-reading dataset with a natural distribution, the AI-assisted Incremental Chinese Lip-reading Database (AICLD), which is continuously expanding. At the time of this study, AICLD, the world's largest Mandarin lip-reading dataset to date, contained 1,255,108 video samples from 5,238 speakers, with a total duration of over 110 hours and coverage of 53,958 Chinese words. Each sample is accompanied by multi-dimensional metadata annotations, including pose angles, key frames, and reliability. The dataset continues to grow at a rate of over 3,000 samples per day (real-time data is available at http://aicld.ifzu.vip/). A series of experiments conducted on this dataset explores lip-reading methods along multiple dimensions and is expected to provide a solid foundation for subsequent studies. This paper also introduces a fully automated, AI-assisted data collection system that converts source videos into lip-reading datasets with complete metadata. The study provides a core open-access resource for Chinese visual speech research and is expected to serve as a methodological framework for the construction of multilingual lip-reading datasets worldwide.
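To make the per-sample metadata and its consistency checks more concrete, the following is a minimal Python sketch of what such a record and validator might look like. It is not the AICLD schema: every field name (`sample_id`, `yaw_deg`, `key_frames`, `reliability`, and so on), the use of degrees for pose angles, the [0, 1] range for reliability, and the thresholds in `select_frontal` are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SampleMetadata:
    """Hypothetical per-sample record; field names are illustrative only."""
    sample_id: str
    speaker_id: str
    label: str                  # the word spoken in the clip
    num_frames: int             # number of frames in the clip
    yaw_deg: float              # pose angles, assumed to be in degrees
    pitch_deg: float
    roll_deg: float
    key_frames: List[int] = field(default_factory=list)  # frame indices
    reliability: float = 1.0    # assumed to lie in [0, 1]


def validate(meta: SampleMetadata) -> List[str]:
    """Return a list of consistency problems; an empty list means the record passes."""
    problems = []
    if meta.num_frames <= 0:
        problems.append("num_frames must be positive")
    for name in ("yaw_deg", "pitch_deg", "roll_deg"):
        angle = getattr(meta, name)
        if not -180.0 <= angle <= 180.0:
            problems.append(f"{name} out of range: {angle}")
    if not 0.0 <= meta.reliability <= 1.0:
        problems.append(f"reliability out of range: {meta.reliability}")
    if any(k < 0 or k >= meta.num_frames for k in meta.key_frames):
        problems.append("key frame index outside clip")
    if meta.key_frames != sorted(set(meta.key_frames)):
        problems.append("key frames must be strictly increasing")
    return problems


def select_frontal(records: List[SampleMetadata],
                   max_abs_yaw: float = 30.0,
                   min_reliability: float = 0.8) -> List[SampleMetadata]:
    """Keep valid records with near-frontal pose and high reliability.

    The default thresholds are arbitrary placeholders, not values from the paper.
    """
    return [
        r for r in records
        if not validate(r)
        and abs(r.yaw_deg) <= max_abs_yaw
        and r.reliability >= min_reliability
    ]
```

A filter such as `select_frontal` illustrates how searchable metadata could support stratified experiments, for example comparing recognition accuracy on near-frontal clips against the full set.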