Improving Language Processing Systems with ViSoLex: A Comprehensive System for Normalizing Non-Standard Words in Vietnamese

Thursday 06 March 2025


A new tool called ViSoLex aims to improve the accuracy of language processing systems, particularly those that must handle the informal, non-standard text found on social media. It combines pre-trained models with weakly supervised learning techniques to identify and normalize Non-Standard Words (NSWs) in Vietnamese.


Language processing systems, such as chatbots and virtual assistants, rely on their ability to understand and generate human language. However, these systems often struggle with informal and non-standard text found in social media, which can contain a wide range of linguistic variations, including abbreviations, colloquialisms, and misspelled words.


ViSoLex addresses this challenge with a single system that both identifies NSWs in Vietnamese text and transforms them into their standard forms, with pre-trained models and weakly supervised learning doing the underlying work.
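
To make the task concrete, here is a minimal sketch of what "identify and normalize" means at the input and output level. It is not how ViSoLex works internally: the article describes pre-trained models, whereas this sketch substitutes a hand-written lookup table. The lexicon entries, the `TOY_LEXICON` name, and the `normalize` function are all assumptions made for illustration.

```python
# Hypothetical toy lexicon: non-standard form -> standard form.
# These entries are illustrative and are not taken from ViSoLex.
TOY_LEXICON = {
    "ko": "không",
    "dc": "được",
    "j": "gì",
}


def normalize(text, lexicon=TOY_LEXICON):
    """Return the normalized text and the (NSW, standard form) pairs found.

    Tokens are split on whitespace and matched case-insensitively.
    A real system would rely on a trained model rather than a fixed table.
    """
    found = []
    output_tokens = []
    for token in text.split():
        standard = lexicon.get(token.lower())
        if standard is None:
            output_tokens.append(token)  # treated as already standard
        else:
            found.append((token, standard))
            output_tokens.append(standard)
    return " ".join(output_tokens), found


normalized, pairs = normalize("mình ko biết làm j")
print(normalized)  # mình không biết làm gì
print(pairs)       # [('ko', 'không'), ('j', 'gì')]
```

A fixed table like this cannot handle unseen spellings or words whose standard form depends on context, which is the gap that learned models are meant to close.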


One of the key features of ViSoLex is its use of weak supervision. Rather than depending solely on a large, manually labeled dataset, which is scarce for social media text, the model is trained on the labeled examples that are available and also learns from unlabeled data. This approach helps the model generalize to new and unseen data, improving its overall performance.
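
The general idea behind weak supervision can be sketched as follows. This is a generic pattern (simple heuristic "labeling functions" combined by majority vote), not the training procedure used by ViSoLex, which the article does not detail. Every name, word list, and heuristic below is an assumption made for illustration.

```python
import re
from collections import Counter

NSW, STANDARD, ABSTAIN = 1, 0, -1

# Hypothetical word lists, for illustration only.
KNOWN_STANDARD = {"không", "được", "đẹp", "mình"}
KNOWN_ABBREVIATIONS = {"ko", "dc"}


def lf_in_vocabulary(token):
    """Vote STANDARD if the token is a known standard word."""
    return STANDARD if token.lower() in KNOWN_STANDARD else ABSTAIN


def lf_known_abbreviation(token):
    """Vote NSW if the token is a known abbreviation."""
    return NSW if token.lower() in KNOWN_ABBREVIATIONS else ABSTAIN


def lf_stretched_letters(token):
    """Vote NSW if any character appears three or more times in a row."""
    return NSW if re.search(r"(.)\1{2,}", token) else ABSTAIN


LABELING_FUNCTIONS = [lf_in_vocabulary, lf_known_abbreviation, lf_stretched_letters]


def weak_label(token):
    """Combine the labeling functions by majority vote; abstain on no votes or ties."""
    votes = [lf(token) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


for token in ["mình", "ko", "đẹppppp", "xyz"]:
    print(token, weak_label(token))
```

The noisy labels produced this way would then serve as additional training signal for a model alongside whatever hand-labeled data exists. Tokens on which every heuristic abstains simply stay unlabeled.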


The system has been tested on a range of Vietnamese texts, including social media posts, news articles, and online forums. According to the reported results, ViSoLex improves the accuracy of language processing systems by up to 3.74% compared to traditional approaches.


ViSoLex also includes a range of other features, such as a dictionary lookup service that allows users to search for NSWs and retrieve their standard equivalents, definitions, and examples from a comprehensive dictionary. This feature can be particularly useful for researchers and developers who need to work with Vietnamese language data.
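
A lookup of this kind might return a record shaped like the one below. This is a hedged sketch of the data structure only: the `Entry` fields mirror the three things the article says a lookup returns (standard equivalent, definition, example), but the class, the `lookup` function, and the sample entries are hypothetical and do not reflect the actual ViSoLex interface or dictionary contents.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    nsw: str         # the non-standard form as typed by users
    standard: str    # its standard equivalent
    definition: str  # a short gloss
    example: str     # a usage example containing the NSW


# Hypothetical sample entries, for illustration only.
_ENTRIES = [
    Entry("ko", "không", "no; not", "mình ko đi"),
    Entry("dc", "được", "can; okay", "vậy cũng dc"),
]
_INDEX = {entry.nsw: entry for entry in _ENTRIES}


def lookup(word: str) -> Optional[Entry]:
    """Return the entry for a non-standard word, or None if it is unknown."""
    return _INDEX.get(word.strip().lower())


entry = lookup("ko")
if entry is not None:
    print(entry.standard, "|", entry.definition, "|", entry.example)
```

Returning `None` for unknown words keeps the caller in charge of what to do next, for example falling back to a model-based normalizer.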


Overall, ViSoLex represents an important step forward in the development of language processing systems that can handle the informal, non-standard text found on social media. The reported accuracy gain of up to 3.74% makes it a valuable tool for researchers and developers working on natural language processing projects.


The system’s flexibility and adaptability also make it an attractive option for a range of applications, from chatbots and virtual assistants to text analysis and machine translation tools. As the use of social media continues to grow, the need for systems like ViSoLex will only increase, making it an important tool for anyone working with Vietnamese language data.


Cite this article: “Improving Language Processing Systems with ViSoLex: A Comprehensive System for Normalizing Non-Standard Words in Vietnamese”, The Science Archive, 2025.


Keywords: Language Processing, ViSoLex, Non-Standard Words, Normalization, Weak Supervision, Pre-Trained Models, Vietnamese, Social Media, Natural Language Processing, Chatbots


Reference: Anh Thi-Hoang Nguyen, Dung Ha Nguyen, Kiet Van Nguyen, “ViSoLex: An Open-Source Repository for Vietnamese Social Media Lexical Normalization” (2025).

