Saturday 08 March 2025
A team of researchers has made significant strides in developing a high-quality optical character recognition (OCR) tool for Yiddish, a language that has been largely overlooked in the field of artificial intelligence.
Yiddish is a Germanic language that developed from Middle High German, with substantial Hebrew-Aramaic and Slavic components, and it is written in the Hebrew alphabet. Because of its complex history and limited digital presence, developing an OCR system for Yiddish has proven challenging.
The researchers created an annotated OCR corpus for Yiddish, which consists of over 660 pages of text from the Steven Spielberg Digital Yiddish Library. They then developed a new OCR tool called Jochre 3, which uses neural networks to recognize handwritten and printed text in Yiddish.
One of the key challenges in developing an OCR system for Yiddish is the script. Yiddish is written from right to left in the Hebrew alphabet, with its own set of diacritics and letter combinations, so OCR models built for other languages do not transfer well. The researchers addressed this challenge with a custom-made recognition model.
Jochre 3 uses a top-down approach to page layout analysis: it first recognizes blocks of text and then segments them into individual words and characters. It then uses a combination of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to recognize the text.
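To make the top-down idea concrete, here is a minimal Python sketch of a page hierarchy that is walked from blocks down to individual characters. It is an illustration only and not Jochre 3's actual code: the class names (`Box`, `Word`, `Line`, `Block`), the intermediate line level, and the function `glyphs_in_reading_order` are all hypothetical, and the sorting rule simply assumes a single-column page read top to bottom and right to left.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Box:
    """Axis-aligned rectangle in page pixel coordinates (hypothetical type)."""
    left: int
    top: int
    width: int
    height: int


@dataclass
class Word:
    box: Box
    glyphs: List[Box] = field(default_factory=list)


@dataclass
class Line:
    # The line level is an assumption of this sketch, not taken from the article.
    box: Box
    words: List[Word] = field(default_factory=list)


@dataclass
class Block:
    box: Box
    lines: List[Line] = field(default_factory=list)


def glyphs_in_reading_order(blocks: List[Block]) -> Iterator[Box]:
    """Walk the hierarchy top-down and yield glyph boxes in reading order.

    Assumes one column: blocks and lines are ordered top to bottom,
    words and glyphs right to left (Yiddish is written right to left).
    """
    for block in sorted(blocks, key=lambda b: b.box.top):
        for line in sorted(block.lines, key=lambda l: l.box.top):
            for word in sorted(line.words, key=lambda w: w.box.left, reverse=True):
                yield from sorted(word.glyphs, key=lambda g: g.left, reverse=True)
```

In a layout like this, each yielded glyph box would be cropped from the page image and handed to the character recognizer, so segmentation errors at the block or word level carry through to everything below them.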
The researchers evaluated Jochre 3 on a test corpus of over 186,000 tokens, where it achieved a character error rate (CER) of 1.5%, or roughly 15 wrong characters per 1,000. Other OCR systems available for Yiddish have CERs ranging from 9% to 19% on the same measure.
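For readers unfamiliar with the metric, CER is commonly computed as the edit distance between the OCR output and the correct text, divided by the length of the correct text. The Python sketch below shows that standard definition. The article does not say exactly how the researchers computed their figure (for example, how spaces or diacritics were counted), so the function names `levenshtein` and `character_error_rate` and the sample words are illustrative assumptions.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `hypothesis` into `reference`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the number of characters in the reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one wrong letter in a four-letter word.
print(character_error_rate("שלום", "שלוס"))  # 0.25
```

By this definition, a CER of 1.5% means about three wrong characters in a 200-character paragraph, while a CER of 9% means about 18.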
The implications of this research are significant not only for the preservation and accessibility of Yiddish texts but also for the broader field of artificial intelligence. The development of high-quality OCR tools for minority languages like Yiddish can help to promote linguistic diversity and cultural heritage.
In addition, the techniques developed in this research could be applied to other minority languages that lack digital resources and OCR systems. This could have far-reaching consequences for the preservation and promotion of endangered languages worldwide.
The researchers are now working on integrating Jochre 3 with a search engine, which will enable users to search and access Yiddish texts online. They also plan to expand their OCR corpus to include more texts from the YIVO Institute for Jewish Research and other sources.
Cite this article: “Unlocking the Secrets of Yiddish: A Breakthrough in Optical Character Recognition”, The Science Archive, 2025.
Yiddish, OCR, Artificial Intelligence, Language Preservation, Cultural Heritage, Minority Languages, Linguistic Diversity, Neural Networks, Character Recognition, Digital Archives







