Machine Translation Metrics Exposed: A Study Reveals Hidden Biases

Sunday 02 February 2025


A recent study has shed new light on the evaluation of machine translation systems, highlighting a previously overlooked problem: the score a metric assigns can depend on which system produced the translation. Researchers have long recognized that automated evaluation metrics are imperfect and can be biased towards certain types of translations or languages. This study, however, shows that some metrics exhibit a subtler form of bias, judging the output of different systems by effectively different standards, so that their scores are not directly comparable across systems.


The researchers analyzed data from the 2023 Machine Translation Shared Task (WMT23), which brought together top-performing machine translation systems and evaluated them using a range of automated metrics. They found that many metrics scored translations differently depending on which system had produced them, even when the systems were evaluated on the same set of source texts.
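To make the idea concrete, here is a minimal sketch using synthetic scores rather than the study's actual data or its proposed measure. The system names, score ranges, and per-system offsets below are all hypothetical; the point is only that a constant system-specific gap between metric scores and human judgments is enough to distort comparisons between systems.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical setup: three systems translate the same 200 source
# sentences; human_scores[s][i] is the human judgment of system s on
# sentence i, and the toy "metric" tracks it with a system-specific offset.
systems = ["system_A", "system_B", "system_C"]
human_scores = {s: [random.gauss(0.70, 0.10) for _ in range(200)] for s in systems}
offset = {"system_A": 0.00, "system_B": +0.06, "system_C": -0.08}  # made-up biases
metric_scores = {
    s: [h + offset[s] + random.gauss(0, 0.03) for h in human_scores[s]]
    for s in systems
}

# A system-independent metric would show roughly the same metric-human
# gap for every system; a system-dependent one does not.
for s in systems:
    gap = mean(metric_scores[s]) - mean(human_scores[s])
    print(f"{s}: mean human {mean(human_scores[s]):.3f}, "
          f"mean metric {mean(metric_scores[s]):.3f}, gap {gap:+.3f}")
```

In this toy example the metric flatters system_B and penalizes system_C regardless of actual quality, so ranking systems by the metric alone can contradict the human ranking.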


One metric, XCOMET, stood out in this regard. Although it performed well overall, its scores were strongly system-dependent: a translation might receive high praise from XCOMET when it came from one system, while a translation of similar quality from another system would be panned. This inconsistency raises serious questions about using XCOMET scores to compare systems against one another.
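For readers who want to run this kind of check themselves, the sketch below shows how XCOMET-style scores are typically obtained with the open-source unbabel-comet package. The model name, example sentences, and batch settings are illustrative, and the call signature reflects the package's documented interface rather than anything specific to this study.

```python
# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

# Download and load a COMET-family checkpoint (XCOMET-XL here is an
# example model name, not a recommendation made by the study).
model_path = download_model("Unbabel/XCOMET-XL")
model = load_from_checkpoint(model_path)

# Score the same source sentence as translated by two different systems.
data = [
    {"src": "Der Zug hat Verspätung.", "mt": "The train is delayed.",
     "ref": "The train is running late."},
    {"src": "Der Zug hat Verspätung.", "mt": "The train has delay.",
     "ref": "The train is running late."},
]
output = model.predict(data, batch_size=8, gpus=0)  # gpus=1 if a GPU is available
print(output.scores)        # one score per segment
print(output.system_score)  # average over the segments
```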


The researchers also found that other metrics, such as GEMBA-MQM and MetricX-23, exhibited similar biases. Their variability was less extreme than XCOMET's, but they still tended to favor certain systems over others.


So what does this mean for the development of machine translation systems? It suggests that researchers should be cautious when selecting evaluation metrics, checking that a metric's judgments are robust to which system produced the output. It also highlights the need for more comprehensive evaluations that consider multiple metrics, as in the sketch below, and take into account the complex interactions between systems and languages.
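One cheap sanity check along these lines is to compare the system rankings induced by two different metrics: if they disagree substantially, conclusions drawn from either metric alone should be treated with suspicion. The sketch below uses made-up system names and scores and computes Kendall's tau rank correlation between the two rankings.

```python
from itertools import combinations

def kendall_tau(ranking_a, ranking_b):
    # Each ranking maps system name -> score; higher is better.
    systems = list(ranking_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        da = ranking_a[s] - ranking_a[t]
        db = ranking_b[s] - ranking_b[t]
        if da * db > 0:
            concordant += 1
        elif da * db < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical system-level scores from two different metrics.
metric_1 = {"system_A": 0.82, "system_B": 0.79, "system_C": 0.75}
metric_2 = {"system_A": 0.64, "system_B": 0.71, "system_C": 0.66}

print(f"Kendall tau between metric rankings: {kendall_tau(metric_1, metric_2):.2f}")
# A low or negative tau is a warning sign: the two metrics order the
# systems differently, so neither ranking should be trusted on its own.
```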


Ultimately, the study’s findings underscore the importance of transparency and accountability in machine translation evaluation. By acknowledging and addressing these biases, researchers can work towards more accurate and reliable metrics, and in turn towards better-performing machine translation systems that benefit everyone.


Cite this article: “Machine Translation Metrics Exposed: A Study Reveals Hidden Biases”, The Science Archive, 2025.


Machine, Translation, Evaluation, Metrics, Bias, Systems, Languages, Performance, Reliability, Accountability


Reference: Pius von Däniken, Jan Deriu, Mark Cieliebak, “A Measure of the System Dependence of Automated Metrics” (2024).

