Thursday 20 March 2025
DeepL and Supertext, two prominent machine translation providers, have been pitted against each other in a blind A/B test by professional translators. The results suggest that while both systems are capable of producing high-quality translations, Supertext may have an edge when it comes to consistency across longer texts.
The study evaluated DeepL and Supertext on four language directions involving English, German, French, and Italian. Translators were presented with source texts from news websites, including the New York Times and Neue Zürcher Zeitung, and asked to rate the translations produced by each system.
At the segment level, both systems performed similarly well, with translators preferring DeepL’s output in only one language direction. However, when evaluating the translations at the document level, a different picture emerged. Supertext was preferred in three out of four language directions, suggesting that it may be better equipped to handle longer texts and maintain consistency across paragraphs.
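The gap between segment-level and document-level results can be illustrated with a small sketch. It assumes a simplified protocol in which each document gets a single vote, decided by a majority of its segment preferences. The study's actual rating procedure may differ. The system labels "A" and "B", the function names, and all judgments below are hypothetical and are not data from the study.

```python
from collections import Counter

# Hypothetical blind A/B judgments: (document_id, segment_id, preferred system).
# These values are invented for illustration and are NOT data from the study.
judgments = [
    ("doc1", 1, "A"), ("doc1", 2, "A"), ("doc1", 3, "A"),
    ("doc1", 4, "A"), ("doc1", 5, "B"),
    ("doc2", 1, "B"), ("doc2", 2, "B"), ("doc2", 3, "A"),
    ("doc3", 1, "B"), ("doc3", 2, "B"), ("doc3", 3, "A"),
]


def segment_level_tally(judgments):
    """Count how often each system was preferred across all segments."""
    return Counter(pref for _, _, pref in judgments)


def document_level_tally(judgments):
    """Give each document one vote: the system preferred in most of its segments."""
    per_doc = {}
    for doc_id, _, pref in judgments:
        per_doc.setdefault(doc_id, Counter())[pref] += 1

    winners = Counter()
    for counts in per_doc.values():
        (top, top_n), *rest = counts.most_common()
        # Record a tie when the runner-up has the same count as the leader.
        if rest and rest[0][1] == top_n:
            winners["tie"] += 1
        else:
            winners[top] += 1
    return winners


print("segment level: ", dict(segment_level_tally(judgments)))
print("document level:", dict(document_level_tally(judgments)))
```

In this toy data, system A is preferred in more individual segments, yet system B wins more documents, because A's segment wins are concentrated in one long document. That is one way the two levels of analysis can point in different directions.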
One possible explanation for this difference is that Supertext uses an open-source large language model (LLM) as its foundation, whereas DeepL’s technology remains proprietary. An LLM-based system can draw on a broader context window during translation, which could result in more consistent output over the course of a longer text. Because DeepL’s architecture is not public, this explanation remains speculative.
Another factor may be the way each system handles regional differences and nuances. Supertext supports three different German target language variants, which could lead to inconsistencies if a variant’s conventions are not applied uniformly throughout a document. Whether DeepL’s approach to regional variation helps or hurts its consistency is not something this comparison settles.
The study also highlights the importance of evaluating machine translation systems at the document level, rather than just segment by segment. This becomes more important as LLM-based systems grow more prevalent, because they are designed to use broader context windows and produce more consistent translations over longer texts.
Beyond its findings on Supertext’s performance, the study underscores the limitations of current evaluation methods. A/B tests like this one show which output translators prefer, but they do not necessarily capture the full range of errors or nuances that may arise in real-world usage. To better understand the strengths and weaknesses of machine translation systems, researchers will need to develop more comprehensive evaluation methodologies.
Ultimately, the results of this study suggest that Supertext may be well-suited for applications where consistency across longer texts is critical, such as technical documentation or academic papers.
Cite this article: “Supertext Edges DeepL in Consistency and Performance in Longer Texts”, The Science Archive, 2025.
Machine Translation, Supertext, DeepL, Consistency, Longer Texts, Professional Translators, A/B Test, Language Directions, Open-Source, Proprietary Technology.







