Saturday 01 February 2025
Text simplification, the task of rewriting complex text so it is easier to understand, is an important problem in natural language processing. Researchers have been working on effective text simplification systems for years, but most of these systems target high-resource languages such as English and Spanish, leaving behind languages like Sinhala, which is spoken by millions of people, primarily in Sri Lanka.
A team of researchers has taken up this challenge and created a dataset and evaluation framework specifically designed for Sinhala text simplification. The dataset contains over 1,000 pairs of complex and simplified sentences in Sinhala, making it one of the largest resources of its kind for the language.
The researchers then fine-tuned two state-of-the-art multilingual models, mBART and mT5, on this dataset. These sequence-to-sequence models use advanced neural networks and learn patterns and relationships in the data that help them simplify complex sentences more effectively.
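Models like mT5 are trained in a text-to-text fashion, so fine-tuning typically means converting each complex/simple pair into an input/target record. A minimal sketch of that data preparation step is below; the `simplify:` task prefix and the field names are illustrative assumptions, not details from the paper:

```python
# Prepare complex/simple sentence pairs for seq2seq fine-tuning.
# The task prefix and record layout here are assumptions for
# illustration, not the authors' exact setup.

def make_training_examples(pairs, prefix="simplify: "):
    """Turn (complex, simple) sentence pairs into input/target records
    suitable for a text-to-text model such as mT5."""
    return [
        {"input": prefix + complex_sent, "target": simple_sent}
        for complex_sent, simple_sent in pairs
    ]

pairs = [
    ("The committee deliberated at considerable length before reaching a verdict.",
     "The committee talked for a long time before deciding."),
]
examples = make_training_examples(pairs)
```

Each record's `input` would be tokenized and fed to the encoder, with the `target` used as the decoder's training objective.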
But how do you evaluate whether a text simplification system is working well? The researchers used several established metrics to measure the quality of the simplified output, including SARI, which compares the system's output against both the input sentence and human-written references. Another metric, BERTScore, uses contextual embeddings to calculate the similarity between tokens in the output and reference sentences.
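The intuition behind SARI is that a good simplification should add the right new words, keep the right source words, and delete the right source words, as judged against the references. A toy, unigram-only sketch of that idea follows; real SARI averages n-gram scores (n = 1 to 4) over multiple references, so this is a simplification for clarity, not the official metric:

```python
# Toy, unigram-only illustration of the idea behind SARI.
# Real SARI averages n-gram scores (n = 1..4) over multiple
# references; this sketch uses single words and one reference.

def toy_sari(source, prediction, reference):
    src, out, ref = set(source.split()), set(prediction.split()), set(reference.split())

    def f1(p, r):
        return 2 * p * r / (p + r) if p + r else 0.0

    def ratio(hits, total):
        return len(hits) / len(total) if total else 1.0

    # Addition: new words in the output that the reference also added.
    add_p = ratio((out - src) & (ref - src), out - src)
    add_r = ratio((out - src) & (ref - src), ref - src)

    # Keep: source words retained by both the output and the reference.
    keep_p = ratio(out & src & ref, out & src)
    keep_r = ratio(out & src & ref, src & ref)

    # Deletion: source words that both the output and the reference
    # dropped (real SARI uses precision only for this component).
    del_p = ratio((src - out) & (src - ref), src - out)

    return (f1(add_p, add_r) + f1(keep_p, keep_r) + del_p) / 3

score = toy_sari(
    "the cat perched upon the mat",
    "the cat sat on the mat",
    "the cat sat on the mat",
)
```

When the output matches the reference exactly, as above, the score is 1.0; outputs that add wrong words or keep too much of the complex source score lower.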
Testing their models with these metrics, the researchers found that fine-tuning significantly improved text simplification quality compared to previous approaches. They also identified several common errors that can occur during text simplification, such as fluency errors, hallucinations, anaphora resolution issues, and bad substitutions.
To create the dataset, the researchers recruited native Sinhala speakers to annotate complex sentences with simplified versions. The annotators were instructed to preserve the meaning of the original sentence while making it easier to understand.
The researchers also conducted two rounds of error analysis with different evaluators to ensure that the dataset was reliable and accurate. This step is crucial in text simplification, as annotation errors can have significant consequences for the quality of the output.
In terms of implementation, the researchers fine-tuned the pretrained models using large-scale computational resources. They also developed a novel approach to paraphrase mining, which involves generating multiple paraphrases for each sentence and then selecting the best one.
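The "generate then select" idea can be sketched as scoring several candidate simplifications and keeping the best one. The heuristics below, a minimum word-overlap threshold for meaning preservation and average word length as a complexity proxy, are our assumptions for illustration, not the selection criteria used in the paper:

```python
# Illustrative sketch of selecting the best paraphrase from several
# candidates. The overlap threshold and average-word-length heuristic
# are assumptions, not the authors' actual method.

def select_best(source, candidates, min_overlap=0.5):
    src = set(source.lower().split())

    def overlap(c):
        # Fraction of source words preserved: a rough meaning check.
        return len(src & set(c.lower().split())) / len(src)

    def complexity(c):
        # Average word length: a rough proxy for reading difficulty.
        words = c.split()
        return sum(len(w) for w in words) / len(words)

    # Keep only candidates that preserve enough of the source meaning,
    # then pick the simplest of those.
    viable = [c for c in candidates if overlap(c) >= min_overlap]
    return min(viable or candidates, key=complexity)

source = "the physician administered the medication"
candidates = [
    "the physician administered the medication",
    "the doctor gave the medicine",
    "the physician gave the medication",
]
best = select_best(source, candidates)
```

Here the second candidate is simplest but drops too much of the source vocabulary, so the heuristic settles on the third, which simplifies the verb while preserving the rest.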
Overall, this research is an important step towards developing effective text simplification systems for languages like Sinhala.
Cite this article: “Developing Text Simplification Systems for Sinhala Language”, The Science Archive, 2025.
Text Simplification, Natural Language Processing, Sinhala, SARI, Neural Networks, mBART, mT5, BERTScore, Evaluation Metrics, Paraphrase Mining