Tuesday 25 February 2025
A new language dataset has been released, providing a massive boost to research and development in natural language processing (NLP) for low-resource languages like Yoruba.
The dataset, known as Yankari, consists of over 30 million tokens of text from diverse sources, including news outlets, blogs, educational websites, and Wikipedia. The data was collected using a careful approach that prioritized quality, diversity, and ethical considerations.
One of the main challenges in developing NLP resources for low-resource languages is the lack of high-quality training data. This can make it difficult to train accurate language models, which are essential for tasks like machine translation, text summarization, and question-answering. The Yankari dataset aims to address this challenge by providing a large-scale, high-quality corpus of Yoruba text.
The dataset was created using a combination of automated and manual processes. Automated systems were used to collect and clean the data, while native Yoruba speakers reviewed the content to ensure its accuracy and quality. This approach allowed the researchers to scale up the collection process while maintaining the integrity of the data.
The Yankari dataset is designed to be versatile and can be used for a range of NLP tasks. It includes a mix of formal and informal language, making it suitable for training models that need to handle different text styles. The dataset also contains a wide range of topics, from news and current events to culture and education.
The release of the Yankari dataset is significant because it provides a new resource for researchers working on NLP applications in low-resource languages. It can be used to train language models that are more accurate and effective than those currently available. This, in turn, could lead to the development of more sophisticated AI systems that can understand and generate text in Yoruba.
The dataset is not without its limitations, however. For example, it may contain biases and errors that need to be addressed through further processing and evaluation. Additionally, the dataset’s focus on written language means that it may not fully capture the nuances of spoken Yoruba.
Despite these challenges, the release of the Yankari dataset represents a major step forward in the development of NLP resources for low-resource languages. It provides researchers with a new tool to work with and could help to accelerate progress in areas like machine translation and language understanding.
Cite this article: “Yankari Dataset: A Game-Changer for NLP Research on Low-Resource Languages”, The Science Archive, 2025.
Natural Language Processing, Yoruba, Low-Resource Languages, Dataset, Machine Translation, Text Summarization, Question-Answering, Nlp Tasks, Ai Systems, Language Models
Reference: Maro Akpobi, “Yankari: A Monolingual Yoruba Dataset” (2024).







