Saturday 01 March 2025
A team of researchers has made significant progress in closing the multilingual mathematical reasoning gap, a long-standing challenge in artificial intelligence. For years, language models have demonstrated exceptional performance on complex reasoning tasks in high-resource languages such as English and Chinese, but have struggled to replicate this success in lower-resource languages like Korean.
To tackle this issue, the researchers developed HRM8K, a benchmark comprising 8,011 bilingual math problems in both Korean and English. This dataset was carefully curated from existing benchmarks and Korean examinations to create a perfectly parallel evaluation structure.
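A parallel benchmark pairs each problem with versions in both languages and a single gold answer, so accuracy can be compared across languages on identical problems. A minimal sketch of how such an item might be represented (the field names and schema here are illustrative assumptions, not the dataset's actual format):

```python
# Hypothetical representation of a parallel HRM8K-style benchmark item.
# Field names are illustrative, not the dataset's actual schema.
from dataclasses import dataclass

@dataclass
class ParallelProblem:
    problem_id: int
    question_ko: str   # Korean version of the problem
    question_en: str   # English version of the same problem
    answer: str        # shared gold answer, language-independent

item = ParallelProblem(
    problem_id=1,
    question_ko="사과 3개와 배 2개가 있습니다. 과일은 모두 몇 개입니까?",  # same problem in Korean
    question_en="There are 3 apples and 2 pears. How many pieces of fruit are there in total?",
    answer="5",
)

# Because both versions share one gold answer, any accuracy gap between
# languages reflects the language, not the problems being compared.
print(item.answer)  # → 5
```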
Through systematic analysis of model behavior on HRM8K, the team found that the performance disparities stem primarily from difficulties in comprehending non-English inputs rather than from limitations in reasoning capability. This finding challenges earlier approaches that simply applied English chain-of-thought reasoning to multilingual questions.
To address this, the researchers proposed UST (Understand, Solve, and Translate), a training method that strategically uses English as an anchor: the model learns to understand the question, solve it in English, and translate the result back into the question's language. Fine-tuning models on HRM8K with UST yielded a 10.91% improvement on the benchmark and reduced the multilingual performance gap from 11.6% to 0.7%.
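The three-stage idea can be sketched as a prompt template. This is a hedged illustration only: the prompt wording and the helper function are assumptions, not the authors' actual implementation.

```python
# Illustrative sketch of a UST-style (Understand, Solve, Translate) prompt.
# The exact wording is an assumption, not the paper's actual prompt.
def build_ust_prompt(question_ko: str) -> str:
    """Compose a three-stage prompt that anchors reasoning in English."""
    return (
        "Understand: restate the following Korean problem in English.\n"
        f"Problem: {question_ko}\n"
        "Solve: reason step by step in English to reach the answer.\n"
        "Translate: state the final answer in Korean.\n"
    )

prompt = build_ust_prompt("사과 3개와 배 2개가 있습니다. 과일은 모두 몇 개입니까?")
print(prompt)
```

The prompt would then be passed to a language model; the key design choice is that only the understanding and final-answer stages touch the lower-resource language, while the reasoning itself happens in English.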
The researchers also demonstrated that reward models originally developed for English or Chinese can be applied to new languages without additional training. This approach showed promising results, with some models even outperforming their English-trained counterparts.
To further evaluate model performance, the team developed a suite of prompts that mimic real-world scenarios: translating math problems between Korean and English, generating step-by-step solutions, and conducting pairwise response comparisons.
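A pairwise response comparison asks a judge model to pick the better of two candidate solutions. A minimal sketch of such a prompt, with wording that is an illustrative assumption rather than the team's actual evaluation prompt:

```python
# Hypothetical pairwise-comparison prompt for judging two candidate solutions.
# The wording is illustrative, not the authors' actual evaluation prompt.
def build_pairwise_prompt(question: str, response_a: str, response_b: str) -> str:
    """Ask a judge model to choose between two candidate solutions."""
    return (
        f"Question: {question}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response solves the problem correctly? Reply with 'A' or 'B'.\n"
    )

prompt = build_pairwise_prompt(
    "There are 3 apples and 2 pears. How many pieces of fruit in total?",
    "3 + 2 = 5, so there are 5 pieces of fruit.",
    "3 - 2 = 1, so there is 1 piece of fruit.",
)
print(prompt)
```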
The results show that fine-tuning language models on HRM8K with UST can substantially improve their multilingual mathematical reasoning. This has significant implications for education and research, enabling machines to better assist students and scholars across diverse linguistic contexts.
In addition to its practical applications, this work underscores the importance of developing a more nuanced understanding of language models’ limitations and potential biases. By shedding light on these issues, researchers can continue to push the boundaries of AI-powered language processing and develop more effective solutions for real-world problems.
Cite this article: “Closing the Multilingual Gap in Mathematical Reasoning with HRM8K and UST”, The Science Archive, 2025.
Multilingual, Mathematical Reasoning, AI, Language Models, Korean, English, Benchmark, HRM8K, UST, Translation







