Unlocking Human-Like Language Understanding in Low-Resource Languages

Thursday 23 January 2025


Researchers have spent years building machines that can understand and generate human-like language, but one major obstacle remains: the variety of human dialects, many of which have very little written data for a model to learn from.


A recent study, cited at the end of this article, tackles this challenge by creating what its authors describe as the first large-scale text-to-SQL dataset in Moroccan Darija, the Arabic dialect spoken in Morocco, which borrows heavily from French among other languages. The project, called Dialect2SQL, aims to make text-to-SQL systems, which turn a natural language question into a database query, usable for a low-resource language like Darija.


The dataset consists of 9,428 natural language questions paired with their corresponding SQL queries, spread across 69 databases in various domains. What sets Dialect2SQL apart from previous attempts is that its questions are written in Moroccan Darija and checked by native speakers, rather than relying on automatic translation alone.
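To make the idea of a question paired with an SQL query concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the record fields (db_id, question, sql), the students table, and the question itself, which is an English placeholder and not an entry from Dialect2SQL.

```python
import sqlite3

# Hypothetical record layout; field names and contents are illustrative only.
# The question is an English placeholder, not a real Dialect2SQL entry.
pair = {
    "db_id": "school",
    "question": "How many students are older than 20?",
    "sql": "SELECT COUNT(*) FROM students WHERE age > 20",
}


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with a made-up schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)"
    )
    conn.executemany(
        "INSERT INTO students (name, age) VALUES (?, ?)",
        [("A", 19), ("B", 22), ("C", 25)],
    )
    return conn


def run_query(conn: sqlite3.Connection, sql: str) -> list:
    """Execute the SQL half of a pair and return all result rows."""
    return conn.execute(sql).fetchall()


result = run_query(build_demo_db(), pair["sql"])
print(result)  # [(2,)]
```

A text-to-SQL model would be given the question and the database schema and asked to produce the query; running the reference query against the database, as above, is one common way to check a model's answer.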


To create this dataset, the researchers used a two-step process. First, they used a machine learning model to translate English questions into Darija. Then three computer science students who are native Darija speakers reviewed and edited the translations to ensure that each question was linguistically accurate and still matched its SQL query.


The results are promising. According to the study, the automatically translated dataset had an average character error rate of 17%, while manual editing brought this number down to 2%. This suggests that machine translation followed by native-speaker review is a workable way to build datasets for low-resource languages like Darija.
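For readers unfamiliar with the metric, character error rate is usually defined as the number of character insertions, deletions and substitutions needed to turn one text into a reference text, divided by the length of the reference. The sketch below follows that standard definition; the article does not say exactly how the study computed its figures, so the function names and the toy strings are assumptions for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete ca
                    curr[j - 1] + 1,     # insert cb
                    prev[j - 1] + cost,  # substitute ca with cb, or keep it
                )
            )
        prev = curr
    return prev[-1]


def character_error_rate(hypothesis: str, reference: str) -> float:
    """Edit distance normalised by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(hypothesis, reference) / len(reference)


# Toy strings: one wrong character out of four.
print(character_error_rate("abxd", "abcd"))  # 0.25
```

Under this definition, a rate of 17% means roughly one character in six would have to change to match the reference, and 2% means about one in fifty.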


But what does this mean for the future of AI? Dialect2SQL opens up new possibilities for building systems that can understand and interact with speakers of Moroccan Darija, a step towards more inclusive and user-friendly language interfaces. A model trained on the dataset could, for example, let a Darija speaker query a database in their own dialect without writing any SQL.


In the meantime, researchers are already exploring ways to expand Dialect2SQL into other Arabic dialects, paving the way for a new era of cross-linguistic communication.


Cite this article: “Unlocking Human-Like Language Understanding in Low-Resource Languages”, The Science Archive, 2025.


Artificial Intelligence, Language, Dataset, Moroccan Darija, Text-to-SQL, Machine Learning, Translation, SQL Queries, Linguistic Accuracy, Low-Resource Languages


Reference: Salmane Chafik, Saad Ezzini, Ismail Berrada, “Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija” (2025).

