Improving Vietnamese Legal Document Retrieval with Synthetic Data

Friday 31 January 2025


A team of researchers reports improved Vietnamese legal document retrieval by training on synthetic data. The project develops a model that efficiently retrieves relevant documents from a large corpus of Vietnamese legal texts.


The approach generates synthetic queries from the legal texts themselves, rather than relying on manual annotations or human-written questions. This saves time and resources, and it produces training queries tied directly to the passages the model must retrieve, which helps the model adapt to the complexities of the Vietnamese language.


To generate these synthetic queries, the researchers employed Llama 3, a large language model pre-trained on a large and varied text corpus. The model handles the nuances of the Vietnamese language well enough to generate questions that are both relevant and challenging.
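The query-generation step described above can be sketched roughly as follows. This is a minimal illustration, not the researchers' pipeline: the prompt wording, the `make_training_pairs` helper, and the `generate` callable (a stand-in for whatever interface serves Llama 3) are all assumptions introduced here.

```python
from typing import Callable, Dict, List

# Hypothetical prompt template; the article does not give the actual wording.
PROMPT_TEMPLATE = (
    "Read the Vietnamese legal passage below and write one question, "
    "in Vietnamese, that the passage answers.\n\n"
    "Passage:\n{passage}\n\nQuestion:"
)


def build_prompt(passage: str) -> str:
    """Fill the template with a single legal passage."""
    return PROMPT_TEMPLATE.format(passage=passage.strip())


def make_training_pairs(
    passages: List[str],
    generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each passage with one synthetic query.

    `generate` is any function that maps a prompt to generated text
    (assumed here; in practice it would wrap a call to the language model).
    """
    pairs = []
    for passage in passages:
        query = generate(build_prompt(passage)).strip()
        if query:  # skip passages for which the generator returned nothing
            pairs.append({"query": query, "positive": passage})
    return pairs


def stub_generate(prompt: str) -> str:
    """Stand-in for the language model so the sketch runs offline."""
    return "Câu hỏi mẫu?" if "Điều" in prompt else ""
```

Each resulting pair holds a synthetic query and the passage it was generated from, which then serves as the positive example when training the retriever.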


The generated queries were then used to train a dense retrieval model designed for passage-level retrieval. The model draws on BERT, a Transformer encoder, and ColBERT, a BERT-based retriever that scores queries against passages through late interaction over token embeddings.
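The article does not state which training objective was used. One common choice for training dense retrievers on (query, passage) pairs is a contrastive loss with in-batch negatives, sketched below in plain Python under that assumption. The function names and the `temperature` parameter are illustrative, and the embeddings are assumed to come from some encoder that is not shown.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity between two embeddings."""
    return sum(x * y for x, y in zip(a, b))


def in_batch_contrastive_loss(
    query_vecs: List[Vector],
    passage_vecs: List[Vector],
    temperature: float = 1.0,
) -> float:
    """Mean cross-entropy where passage i is the positive for query i.

    Every other passage in the batch acts as a negative for that query.
    """
    if not query_vecs or len(query_vecs) != len(passage_vecs):
        raise ValueError("need equally sized, non-empty batches")
    total = 0.0
    for i, q in enumerate(query_vecs):
        scores = [dot(q, p) / temperature for p in passage_vecs]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(scores)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_denominator - scores[i]
    return total / len(query_vecs)
```

The loss falls as each query embedding moves closer to its own passage and away from the other passages in the batch, which is the behavior a retriever needs at search time.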


In their experiments, the researchers found that fine-tuning the model on synthetic data significantly improved its performance in both in-domain and out-of-domain evaluations. The model retrieved relevant documents accurately even for queries and passages it had not seen during training.
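The article does not name the metrics behind these results. As an illustration of how retrieval quality is commonly measured, the sketch below computes recall@k and mean reciprocal rank (MRR). The choice of metrics, the function names, and the data layout (`run` maps a query ID to its ranked passage IDs, `qrels` maps a query ID to its set of relevant passage IDs) are assumptions made here.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant passage, or 0.0 if none is retrieved."""
    for rank, passage_id in enumerate(ranked, start=1):
        if passage_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    run: Dict[str, List[str]],
    qrels: Dict[str, Set[str]],
    k: int = 10,
) -> Dict[str, float]:
    """Average both metrics over every query that has relevance labels."""
    if not qrels:
        return {"recall@k": 0.0, "mrr": 0.0}
    recall_sum = 0.0
    rr_sum = 0.0
    for query_id, relevant in qrels.items():
        ranked = run.get(query_id, [])  # a missing query scores zero
        recall_sum += recall_at_k(ranked, relevant, k)
        rr_sum += reciprocal_rank(ranked, relevant)
    n = len(qrels)
    return {"recall@k": recall_sum / n, "mrr": rr_sum / n}
```

Running the same function on an in-domain test set and on an out-of-domain one gives directly comparable numbers for the two settings.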


The implications of this research are significant, particularly for countries like Vietnam where access to legal information is crucial but often limited by language barriers. By developing a model that can efficiently retrieve relevant documents in Vietnamese, the researchers aim to improve public services and facilitate greater accessibility to legal resources.


Furthermore, synthetic data generation opens up new possibilities for training language models for low-resource languages like Vietnamese. The approach could potentially be applied to other domains, such as healthcare or education, where accurate information retrieval is critical but limited by language barriers.


The research highlights the potential of synthetic data in improving language model performance and expands our understanding of how these models can be trained and fine-tuned for specific tasks. As language technology continues to evolve, the development of more advanced models like this one will play a crucial role in bridging language gaps and making information more accessible to people around the world.


Cite this article: “Improving Vietnamese Legal Document Retrieval with Synthetic Data”, The Science Archive, 2025.


Vietnamese Legal Documents, Synthetic Data, Language Model, Llama 3, Dense Retrieval Model, BERT, ColBERT, Passage-Level Retrieval, Low-Resource Languages, Information Accessibility


Reference: Son Pham Tien, Hieu Nguyen Doan, An Nguyen Dai, Sang Dinh Viet, “Improving Vietnamese Legal Document Retrieval using Synthetic Data” (2024).

