Evaluating Controllability in Large Language Models

Saturday 15 March 2025


The quest for controllable language models has reached a new milestone, as researchers have developed a benchmark that tests the ability of AI systems to generate text according to specific instructions.


Large language models (LLMs) have made tremendous progress in recent years and can now generate human-like text on a wide range of topics. However, their output often fails to follow explicit constraints consistently, making it difficult for users to rely on them for specific tasks.


The new benchmark, called LCTG Bench, aims to address this issue by evaluating the controllability of LLMs across various tasks, including summarization, ad text generation, and pros and cons generation. Within these tasks, it tests three aspects of controllability: format control, keyword inclusion, and phrase removal.


Format control assesses the ability of LLMs to generate text that meets structural constraints, such as a character-count limit or a required sentence structure. Keyword inclusion evaluates their capacity to incorporate specific words or phrases into the generated text. Phrase removal tests their ability to keep specified phrases out of the output.
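To make these three checks concrete, here is a minimal sketch in Python of how a generated text could be tested against each one. It is an illustration only, not the benchmark's own evaluation code: the `Constraints` class, the `check_output` and `pass_rates` functions, and the field names are all hypothetical, and format control is reduced here to a simple character limit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Constraints:
    """Hypothetical constraint set for one prompt (not the benchmark's schema)."""
    max_chars: Optional[int] = None                           # format control, as a length cap
    required_keywords: List[str] = field(default_factory=list)  # keyword inclusion
    banned_phrases: List[str] = field(default_factory=list)     # phrase removal


def check_output(text: str, c: Constraints) -> Dict[str, bool]:
    """Return a pass/fail flag for each of the three aspects."""
    return {
        # len() counts Unicode code points, so it also works for Japanese text.
        "format": c.max_chars is None or len(text) <= c.max_chars,
        # Every required keyword must appear as a substring.
        "keyword": all(k in text for k in c.required_keywords),
        # No banned phrase may appear as a substring.
        "removal": not any(p in text for p in c.banned_phrases),
    }


def pass_rates(results: List[Dict[str, bool]]) -> Dict[str, float]:
    """Fraction of outputs that satisfied each aspect."""
    if not results:
        return {}
    return {key: sum(r[key] for r in results) / len(results) for key in results[0]}


if __name__ == "__main__":
    c = Constraints(max_chars=20, required_keywords=["sale"], banned_phrases=["free"])
    outputs = ["Big sale today", "Free sale: everything is free now"]
    print(pass_rates([check_output(o, c) for o in outputs]))
```

The matching in this sketch is case-sensitive and purely substring-based, which is a simplifying assumption. A real evaluation would need to decide how to treat spelling variants, casing, and any extra text the model wraps around its answer.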


The researchers evaluated the controllability of nine Japanese-specific and multilingual LLMs, including GPT-4. The results showed significant gaps between models, with some excelling in specific tasks while struggling in others.


For example, one model, GPT-NeoX, performed exceptionally well in summarization and ad text generation tasks, but struggled with pros and cons generation. Another model, Gemini-Pro, demonstrated strong keyword inclusion capabilities, but had difficulty removing phrases from the generated text.


The findings of this study highlight the importance of controllability in LLMs, particularly for applications where accuracy and consistency are crucial, such as content creation or customer service chatbots.


The development of LCTG Bench provides a valuable tool for researchers and developers to evaluate the controllability of their language models. By understanding the strengths and weaknesses of different models, they can design more effective training methods and improve the overall performance of AI systems.


As the field of natural language processing continues to evolve, the need for controlled language generation will only grow more pressing. With LCTG Bench, researchers are one step closer to achieving this goal, paving the way for more sophisticated AI applications that can generate high-quality text with precision and consistency.


Cite this article: “Evaluating Controllability in Large Language Models”, The Science Archive, 2025.


Language Models, Controllability, Benchmark, LCTG Bench, Summarization, Ad Text Generation, Pros And Cons Generation, Format Control, Keyword Inclusion, Phrase Removal


Reference: Kentaro Kurihara, Masato Mita, Peinan Zhang, Shota Sasaki, Ryosuke Ishigami, Naoaki Okazaki, “LCTG Bench: LLM Controlled Text Generation Benchmark” (2025).

