Evaluating Large Language Models with the XCOMPS Dataset

Sunday 30 March 2025


A new benchmark for evaluating large language models (LLMs) has been created to give a more comprehensive picture of their abilities and limitations. The XCOMPS dataset, developed by researchers at Columbia University and the Munich Center for Machine Learning, is designed to assess LLMs’ conceptual understanding, linguistic competence, and ability to generalize across languages.


The dataset covers 17 languages, including Arabic, Chinese, French, German, Japanese, and many others. For each language, concepts and their properties are combined into minimal pairs of sentences: a positive example that attributes a property to a concept that has it, and a negative example that attributes the same property to a concept that does not. Because the two sentences differ only in this respect, the pairs test whether an LLM has grasped the underlying concept rather than surface features of the language.
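To make the idea of a minimal pair concrete, here is a minimal sketch of how one item might be represented. The class name, field names, sentence template, and the robin/penguin example are all hypothetical illustrations; they are not taken from the actual XCOMPS files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinimalPair:
    """One benchmark item (hypothetical schema, for illustration only)."""
    language: str          # language code, e.g. "en"
    concept: str           # concept that has the property
    negative_concept: str  # concept that lacks the property
    prop: str              # the property being tested
    positive: str          # sentence that should be judged correct
    negative: str          # sentence that should be judged incorrect


def make_pair(language: str, concept: str, negative_concept: str, prop: str) -> MinimalPair:
    """Build both sentences from the same template so they differ only in the concept."""
    template = "{concept} {prop}."
    return MinimalPair(
        language=language,
        concept=concept,
        negative_concept=negative_concept,
        prop=prop,
        positive=template.format(concept=concept, prop=prop),
        negative=template.format(concept=negative_concept, prop=prop),
    )


example = make_pair("en", "A robin", "A penguin", "can fly")
print(example.positive)
print(example.negative)
```

Building both sentences from one template keeps everything except the concept constant, which is what makes the pair "minimal".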


One of the key features of XCOMPS is its multi-stage construction process. The dataset was built using a combination of manual translation, machine translation, and human review. This approach is intended to keep the translations accurate and culturally relevant, which in turn allows a more precise evaluation of the LLMs’ abilities.


The researchers used three methods to evaluate the performance of the LLMs on XCOMPS: metalinguistic prompting, direct probability measurement, and neurolinguistic probing. Metalinguistic prompting asks the LLM an explicit question about the sentences, such as which of the two is correct, and scores its answer. Direct probability measurement skips the question and instead compares the probabilities the model itself assigns to the two sentences of a pair; the model is counted as correct when it assigns the higher probability to the positive sentence.
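The two behavioral methods can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the prompt wording is invented, and `toy_logprob` is a made-up unigram scorer standing in for a real LLM's sentence log-probability.

```python
import math
from typing import Callable, Iterable, Tuple

Pair = Tuple[str, str]  # (positive sentence, negative sentence)


def metalinguistic_prompt(positive: str, negative: str) -> str:
    """Build an explicit question about the pair (wording is hypothetical)."""
    return (
        "Which sentence is correct?\n"
        f"A: {positive}\n"
        f"B: {negative}\n"
        "Answer with A or B."
    )


def direct_probability_accuracy(
    pairs: Iterable[Pair], sentence_logprob: Callable[[str], float]
) -> float:
    """Fraction of pairs where the positive sentence gets the higher log-probability."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(
        1 for pos, neg in pairs if sentence_logprob(pos) > sentence_logprob(neg)
    )
    return correct / len(pairs)


# Hypothetical word probabilities; a real evaluation would query an LLM instead.
TOY_PROBS = {
    "a": 0.2, "can": 0.1, "fly": 0.02, "cut": 0.02,
    "robin": 0.05, "penguin": 0.01, "knife": 0.01, "spoon": 0.05,
}


def toy_logprob(sentence: str) -> float:
    """Sum of per-word log-probabilities under the toy unigram table."""
    words = sentence.lower().rstrip(".").split()
    return sum(math.log(TOY_PROBS.get(word, 1e-6)) for word in words)


TOY_PAIRS = [
    ("A robin can fly.", "A penguin can fly."),
    ("A knife can cut.", "A spoon can cut."),
]

print(metalinguistic_prompt(*TOY_PAIRS[0]))
print(direct_probability_accuracy(TOY_PAIRS, toy_logprob))
```

The toy scorer gets the first pair right and the second wrong, because it only knows word frequencies and nothing about which objects can cut. That is the kind of failure a minimal-pair benchmark is meant to expose.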


Neurolinguistic probing is a more nuanced approach that assesses the LLMs’ internal representations of language. Rather than looking at what the model outputs, it examines the model’s internal activations, typically by training a small classifier (a probe) to tell positive from negative sentences using those activations alone. If the probe succeeds, the information needed to distinguish the two is encoded inside the model, whether or not it shows up in the model’s answers. The results show that the LLMs performed well on tasks that required them to generate grammatically correct sentences, but struggled with tasks that required deeper understanding of linguistic structures and relationships.
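A layer-by-layer probe can be sketched as below. Everything here is an assumption made for illustration: the activation vectors are hand-written toy numbers rather than real model states, and a nearest-centroid classifier is swapped in for brevity; it is not necessarily the kind of probe used in the study.

```python
from typing import Dict, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (activation vector, label: 1 = positive sentence)


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def probe_accuracy(train: List[Example], test: List[Example]) -> float:
    """Fit class centroids on train, then classify test vectors by the nearer centroid."""
    pos = centroid([vec for vec, label in train if label == 1])
    neg = centroid([vec for vec, label in train if label == 0])
    correct = 0
    for vec, label in test:
        predicted = 1 if squared_distance(vec, pos) < squared_distance(vec, neg) else 0
        correct += int(predicted == label)
    return correct / len(test)


def layerwise_accuracy(
    layers: Dict[int, Tuple[List[Example], List[Example]]]
) -> Dict[int, float]:
    """Run one probe per layer; `layers` maps layer index to (train, test) examples."""
    return {layer: probe_accuracy(train, test) for layer, (train, test) in layers.items()}


# Toy activations: layer 0 does not separate the classes, layer 1 does.
TOY_LAYERS = {
    0: (
        [([1.0, 0.0], 1), ([0.0, 1.0], 1), ([1.0, 0.0], 0), ([0.0, 1.0], 0)],
        [([1.0, 0.0], 1), ([0.0, 1.0], 0)],
    ),
    1: (
        [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.9], 0)],
        [([0.8, 0.2], 1), ([0.2, 0.8], 0)],
    ),
}

print(layerwise_accuracy(TOY_LAYERS))
```

Comparing probe accuracy across layers indicates where in the network the relevant distinction becomes readable, which is a separate question from whether the model's outputs reflect it.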


The study also found that the performance of the LLMs varied significantly across languages. The models performed better in languages such as English and German and struggled in languages such as Japanese and Chinese. The gap may reflect differences in grammatical structure, writing systems, and how well each language is represented in the models’ training data.


The XCOMPS dataset provides a valuable tool for researchers and developers who want to evaluate the performance of LLMs and improve their abilities.


Cite this article: “Evaluating Large Language Models with the XCOMPS Dataset”, The Science Archive, 2025.


Large Language Models, XCOMPS Dataset, Conceptual Understanding, Linguistic Competence, Generalization, Machine Translation, Human Review, Metalinguistic Prompting, Probability Measurement, Neurolinguistic Probing, Language Evaluation


Reference: Linyang He, Ercong Nie, Sukru Samet Dindar, Arsalan Firoozi, Adrian Florea, Van Nguyen, Corentin Puffay, Riki Shimizu, Haotian Ye, Jonathan Brennan, et al., “XCOMPS: A Multilingual Benchmark of Conceptual Minimal Pairs” (2025).

