Emilia: A Large-Scale Multilingual Speech Generation Dataset for Advancing Human-Like Communication

Saturday 15 March 2025


Recent advances in speech generation have been driven by the availability of large-scale training datasets. However, models trained on these datasets still struggle to capture the spontaneity and variability inherent in real-world human speech. To address this limitation, researchers have turned to in-the-wild data, which offers a more natural representation of human communication.


One such dataset is Emilia, a multilingual speech generation dataset derived from in-the-wild speech data. It comprises over 101k hours of speech across six languages (English, Chinese, German, French, Japanese, and Korean), making it one of the largest open-source speech generation datasets available.


To build Emilia, the authors developed Emilia-Pipe, a preprocessing pipeline that extracts high-quality training data from valuable yet underexplored in-the-wild sources. The pipeline includes speaker diarization, noise reduction, and segmentation into intervals of 3 to 30 seconds.
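The segmentation step can be illustrated with a minimal Python sketch. It is not the Emilia-Pipe implementation: the `Turn` type, the `segment_turns` function, and the `max_gap` parameter are hypothetical names introduced here, and the only values taken from the text are the 3 and 30 second bounds. The sketch assumes diarization has already produced speaker-labelled time spans, merges neighbouring spans from the same speaker, and keeps only chunks inside the bounds.

```python
from dataclasses import dataclass

MIN_LEN = 3.0   # seconds, lower bound mentioned in the article
MAX_LEN = 30.0  # seconds, upper bound mentioned in the article


@dataclass
class Turn:
    """A speaker-labelled time span (hypothetical type for this sketch)."""
    speaker: str
    start: float  # seconds
    end: float    # seconds


def split_long(turn):
    """Cut a turn longer than MAX_LEN into consecutive pieces of at most MAX_LEN."""
    pieces = []
    start = turn.start
    while turn.end - start > MAX_LEN:
        pieces.append(Turn(turn.speaker, start, start + MAX_LEN))
        start += MAX_LEN
    pieces.append(Turn(turn.speaker, start, turn.end))
    return pieces


def segment_turns(turns, max_gap=0.5):
    """Merge same-speaker turns into chunks of MIN_LEN to MAX_LEN seconds.

    `max_gap` is an assumed tolerance: the longest pause (in seconds)
    allowed between two turns that end up in the same chunk.
    """
    chunks = []
    current = None

    def flush(chunk):
        # Keep a finished chunk only if it is long enough.
        if chunk is not None and chunk.end - chunk.start >= MIN_LEN:
            chunks.append(chunk)

    for turn in sorted(turns, key=lambda t: t.start):
        for piece in split_long(turn):
            can_merge = (
                current is not None
                and piece.speaker == current.speaker
                and piece.start - current.end <= max_gap
                and piece.end - current.start <= MAX_LEN
            )
            if can_merge:
                current = Turn(current.speaker, current.start, piece.end)
            else:
                flush(current)
                current = piece
    flush(current)
    return chunks


if __name__ == "__main__":
    demo = [Turn("A", 0.0, 2.0), Turn("A", 2.25, 5.0), Turn("B", 5.0, 6.0)]
    for chunk in segment_turns(demo):
        print(chunk)
```

The hard cut at 30 seconds is a simplification. A production pipeline would more plausibly cut at detected pauses so that words are not split.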


To evaluate Emilia, the researchers conducted extensive experiments with various speech generation models. The results showed that models trained on Emilia significantly outperform those trained on traditional audiobook datasets at generating spontaneous, human-like speech. In particular, they better capture the diverse speaker timbres and speaking styles of real-world human speech.


The relationship between dataset size and speech generation performance was also investigated. The findings revealed consistent improvements with data scaling, although the trend becomes less pronounced once the dataset size exceeds 100k hours. This suggests diminishing returns from further increases in the size of the training dataset.


Emilia’s effectiveness in supporting multilingual and crosslingual speech generation was also demonstrated. The dataset’s large scale and linguistic diversity make it an ideal resource for researchers seeking to develop models capable of generating high-quality speech in multiple languages.


The authors of this study emphasize the importance of scaling dataset size to advance speech generation research. They suggest that future work could focus on enhancing model adaptability to address crosslingual challenges, as well as exploring the potential applications of Emilia in areas such as synthetic spoken misinformation detection and singing voice generation.


In summary, the creation of Emilia represents a significant step forward in the development of high-quality speech generation models. By leveraging in-the-wild data and a sophisticated preprocessing pipeline, the researchers have built a dataset that can help advance our understanding of human communication and improve the performance of speech generation systems.


Cite this article: “Emilia: A Large-Scale Multilingual Speech Generation Dataset for Advancing Human-Like Communication”, The Science Archive, 2025.


Speech Generation, Emilia, Dataset, Multilingual, Speech Data, Human Communication, Preprocessing Pipeline, Speaker Diarization, Noise Reduction, Segmentation


Reference: Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, et al., “Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation” (2025).

