Evaluating Large Language Models in Southeast Asian Languages

Friday 31 January 2025


Large language models have transformed natural language processing, enabling machines to understand and generate fluent, human-like text. However, these models were developed primarily for English, leaving many other languages behind. A new study aims to narrow this gap with a comprehensive benchmark for evaluating large language models in multiple Southeast Asian languages.


The researchers created a multilingual, multi-task benchmark called SailCompass, which includes datasets and tasks in four Southeast Asian languages: Thai, Indonesian, Vietnamese, and Malay. The dataset consists of 99.8k examples covering a range of linguistic styles and genres. The team also developed a set of prompt variants to assess the models’ ability to answer questions within context.


The study evaluated several large language models, including Qwen-1.5-7B, Llama-2-7B, Mistral-7B, Gemma-7B, Typhoon-8B, VinaLLaMA-7B, BLOOM-7B1, Sailor-7B, and SeaLLM-7B. The models were tested on a variety of datasets and tasks, including natural language inference (NLI), multiple-choice questions, and text classification.


The results showed that the models performed differently across languages, with some excelling in certain tasks while struggling in others. The team also observed that the models tended to predict only one or two labels, indicating a severe label prediction imbalance. However, after applying contextual calibration methods, this imbalance was significantly mitigated.
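To make the idea of calibration concrete, here is a minimal sketch of one common form of contextual calibration. It assumes the model's bias toward each label can be estimated from the probabilities it assigns on a content-free input (for example, a prompt whose question is replaced by "N/A"), and that each prediction is corrected by dividing by that bias and renormalizing. The label set, the probabilities, and the function names are all hypothetical illustrations, not values or code from the study, and the study's exact procedure may differ.

```python
from collections import Counter

def normalize(scores):
    """Rescale non-negative scores so they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]

def calibrate(label_probs, content_free_probs):
    """Divide each label probability by the probability the model gives
    the same label on a content-free input, then renormalize."""
    rescaled = [p / b for p, b in zip(label_probs, content_free_probs)]
    return normalize(rescaled)

def predict(label_probs, labels):
    """Return the label with the highest probability."""
    best = max(range(len(labels)), key=lambda i: label_probs[i])
    return labels[best]

def label_distribution(predictions):
    """Fraction of predictions assigned to each predicted label."""
    counts = Counter(predictions)
    return {label: counts[label] / len(predictions) for label in counts}

# Hypothetical NLI label set and made-up model outputs (assumptions).
LABELS = ["entailment", "neutral", "contradiction"]

# Assumed probabilities on a content-free prompt: the model favors
# "entailment" before it has seen any real input.
CONTENT_FREE = [0.70, 0.20, 0.10]

# Assumed per-example label probabilities from the uncalibrated model.
EXAMPLES = [
    [0.60, 0.30, 0.10],
    [0.65, 0.15, 0.20],
    [0.75, 0.15, 0.10],
    [0.55, 0.25, 0.20],
]

raw_preds = [predict(p, LABELS) for p in EXAMPLES]
cal_preds = [predict(calibrate(p, CONTENT_FREE), LABELS) for p in EXAMPLES]

print("raw:       ", label_distribution(raw_preds))
print("calibrated:", label_distribution(cal_preds))
```

In this toy data the uncalibrated model predicts "entailment" for every example, while the calibrated predictions spread across all three labels, which is the kind of imbalance reduction the paragraph above describes.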


The study’s findings have important implications for language model development and deployment in Southeast Asia. By creating a comprehensive benchmark, researchers can identify areas where models need improvement and develop more effective training strategies. This will ultimately lead to more accurate and reliable language models that can be used in various applications, such as chatbots, virtual assistants, and machine translation.


The study’s results also highlight the importance of considering cultural and linguistic nuances when developing language models for non-English languages. By taking these factors into account, researchers can create models that are better suited to specific languages and cultures, leading to more effective communication and understanding between humans and machines.


Cite this article: “Evaluating Large Language Models in Southeast Asian Languages”, The Science Archive, 2025.


Language Models, Southeast Asia, Multilingual, Benchmark, Natural Language Processing, Machine Learning, Large Language Models, Cultural Nuances, Linguistic Styles, Label Prediction Imbalance


Reference: Jia Guo, Longxu Dou, Guangtao Zeng, Stanley Kok, Wei Lu, Qian Liu, “SailCompass: Towards Reproducible and Robust Evaluation for Southeast Asian Languages” (2024).

