Unlocking the Secrets of Language: A Deep Dive into Pre-Trained Models’ Ability to Identify Discourse Cohesion

Tuesday 08 April 2025


Language models have become far better at understanding and generating human-like text in recent years, but how they do it is still poorly understood. A recent study sheds light on the strengths and weaknesses of pre-trained language models, specifically when it comes to identifying and generating discourse cohesion.


Discourse cohesion refers to the ways in which language is used to connect ideas across different parts of a text or conversation. It’s what makes writing flow smoothly from one sentence to the next, and what allows us to pick up on subtle connections between seemingly unrelated ideas. In natural language processing (NLP), identifying and generating discourse cohesion is essential for tasks like summarization, question answering, and text generation.


The study focused on seven types of discourse cohesion phenomena: repetition, synonyms, reference, substitution, ellipsis, conjunction, and lexical cohesion. The researchers used a dataset of annotated text to train and test several language models, including BERT, RoBERTa, and BART, to see how well they could identify and generate these phenomena.


The results were mixed. On the one hand, the language models handled repetition and synonyms well, readily spotting similar words or phrases and using them to connect ideas. They also did a good job with reference, which involves using pronouns or other referring expressions to link back to previous sentences or ideas.


On the other hand, the models struggled with more complex types of cohesion, such as substitution and ellipsis. Substitution involves replacing a word or phrase with a stand-in form such as “one” or “do so” rather than repeating it, while ellipsis refers to the omission of words or phrases that are understood but not explicitly stated. The language models had trouble identifying these types of cohesion, often relying on superficial similarities rather than deeper semantic connections.
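
To make the gap between surface similarity and deeper links concrete, here is a minimal Python sketch of a word-overlap heuristic. It is an illustration only, not the method used in the study; the function name `shared_content_words`, the tiny stopword list, and the example sentences are all assumptions made for this sketch. The heuristic catches repetition but has nothing to latch onto when a stand-in word replaces the original.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch, not from the study).
STOPWORDS = {"a", "an", "the", "i", "it", "is", "was", "and", "so",
             "to", "of", "my", "one", "do", "did"}

def shared_content_words(first: str, second: str) -> set[str]:
    """Return the lowercase non-stopword tokens that occur in both sentences."""
    def content_words(sentence: str) -> set[str]:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return {t for t in tokens if t not in STOPWORDS}
    return content_words(first) & content_words(second)

# Repetition: "bike" recurs, so surface overlap detects the link.
repetition = shared_content_words("I bought a bike.", "The bike was expensive.")

# Substitution: "one" stands in for "bike", so no content word is shared.
substitution = shared_content_words("I bought a red bike.", "My sister wanted a blue one.")

print(repetition)    # {'bike'}
print(substitution)  # set()
```

A system that leaned only on this kind of overlap would treat the second pair as unconnected, even though a human reader links “one” back to “bike” immediately.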


The researchers also found that the language models performed better when generating cohesion phenomena between adjacent sentences (those that appear right next to each other) than between non-adjacent sentences (those separated by other sentences or ideas). This suggests that the models are weaker at connecting ideas across longer distances and may need more training data to close the gap.
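
The adjacent versus non-adjacent distinction can be sketched in a few lines of Python. This is a hypothetical illustration of how sentence pairs might be bucketed for such a comparison, not the study's actual evaluation code; the function name `split_pairs_by_adjacency` and the sample document are assumptions.

```python
from itertools import combinations

def split_pairs_by_adjacency(sentences: list[str]):
    """Split all sentence index pairs (i, j), i < j, into adjacent and non-adjacent."""
    adjacent, non_adjacent = [], []
    for i, j in combinations(range(len(sentences)), 2):
        # A pair is adjacent when the two sentences sit right next to each other.
        (adjacent if j - i == 1 else non_adjacent).append((i, j))
    return adjacent, non_adjacent

# Made-up four-sentence document for illustration.
doc = [
    "Mia bought a bike.",
    "It was red.",
    "She rode it home.",
    "Her brother wanted one too.",
]
adjacent, non_adjacent = split_pairs_by_adjacency(doc)

print(adjacent)      # [(0, 1), (1, 2), (2, 3)]
print(non_adjacent)  # [(0, 2), (0, 3), (1, 3)]
```

In this toy document, the link between “one” in the last sentence and “bike” in the first spans a non-adjacent pair, which is the kind of longer-distance connection the models reportedly found harder.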


The study’s findings have implications for a range of applications, from natural language generation to text summarization and question answering.


Cite this article: “Unlocking the Secrets of Language: A Deep Dive into Pre-Trained Models’ Ability to Identify Discourse Cohesion”, The Science Archive, 2025.


Language Models, Discourse Cohesion, NLP, BERT, RoBERTa, BART, Summarization, Question Answering, Text Generation, Natural Language Processing


Reference: Jie He, Wanqiu Long, Deyi Xiong, “Evaluating Discourse Cohesion in Pre-trained Language Models” (2025).

