Evaluating Large Language Models' Ability to Generate Code for Data Visualization

Sunday 02 February 2025


A new benchmark has been created to test the capabilities of Large Language Models (LLMs) in generating code for visualizing data, a crucial task in data analysis and exploration. The PandasPlotBench dataset consists of 175 unique tasks that simulate real-world scenarios where users provide natural language instructions to generate code for plotting data from a Pandas DataFrame.
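To make the task format concrete, here is a minimal sketch of what one such task might look like: a natural language instruction, a DataFrame, and plotting code of the kind a model would be expected to produce. The data, the instruction, and the function name `plot_monthly_sales` are all hypothetical illustrations and are not taken from PandasPlotBench.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical task data (assumption: not from the benchmark itself).
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "sales": [120, 135, 160, 150],
})

# Hypothetical natural language instruction a user might give.
instruction = "Plot monthly sales as a bar chart with labeled axes and a title."


def plot_monthly_sales(frame: pd.DataFrame):
    """Plotting code of the kind a model might generate for the instruction."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(frame["month"], frame["sales"])  # one bar per row
    ax.set_xlabel("Month")
    ax.set_ylabel("Sales")
    ax.set_title("Monthly sales")
    return fig, ax


fig, ax = plot_monthly_sales(df)

# Render to an in-memory PNG; a file path would work the same way.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

In a benchmark setting, the instruction and a description of the DataFrame would go to the model, and the returned code would be executed and its plot inspected.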


The benchmark is designed to evaluate the effectiveness of LLMs as assistants in visual data exploration, filling a gap in current evaluation tools and expanding their scope. The dataset includes tasks of varying complexity, such as summarizing plots, and covers code generation for popular libraries like Matplotlib and Seaborn as well as for less well-represented libraries like Plotly, with which models struggle.


The benchmark’s results show that state-of-the-art proprietary LLMs and large open models perform well in generating code for plotting data from Pandas DataFrames. However, their knowledge of Plotly is still limited, with around 22% of attempts failing. The study also explores the impact of task length on plotting capability, finding that substantially shortening the task descriptions does not significantly affect performance.
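The arithmetic behind a failure rate like this can be sketched as follows. The article does not say how a failed attempt is defined, so this sketch assumes a failure means the generated code raises an exception when executed. The helper names `run_snippet` and `failure_rate` and the sample snippets are hypothetical and are not part of the benchmark's code.

```python
def run_snippet(code: str) -> bool:
    """Return True if the snippet executes without raising an exception."""
    try:
        exec(code, {})  # fresh namespace; only appropriate for trusted code
        return True
    except Exception:
        return False


def failure_rate(snippets) -> float:
    """Fraction of snippets that fail to execute."""
    if not snippets:
        raise ValueError("no snippets to evaluate")
    failures = sum(not run_snippet(code) for code in snippets)
    return failures / len(snippets)


# Hypothetical stand-ins for model-generated code (assumption, not real outputs).
generated = [
    "values = [1, 2, 3]\ntotal = sum(values)",  # runs cleanly
    "undefined_function()",                      # raises NameError
    "x = {'a': 1}['b']",                         # raises KeyError
    "y = 1 + 1",                                 # runs cleanly
]

rate = failure_rate(generated)
print(f"{rate:.0%} of snippets failed")
```

A real harness would run untrusted model output in a sandboxed process rather than calling `exec` directly.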


The authors hope that this benchmark will help researchers improve user experience in data visualization and analysis by providing insights into how LLMs can be used as assistants in visual data exploration. The dataset is available online, along with the code for running the benchmark and generating plots.


This benchmark has significant implications for the development of LLMs and their applications in data analysis. By testing the models’ ability to generate code for plotting data from Pandas DataFrames, researchers can better understand how these models can be used as assistants in visual data exploration. The results also highlight areas where LLMs need improvement, such as working with less popular visualization libraries.


The study’s findings suggest that LLMs are capable of generating high-quality code for plotting data from Pandas DataFrames, even when provided with concise instructions. However, the models still struggle with more complex plots and with less well-represented libraries like Plotly.


Overall, this benchmark provides a valuable tool for evaluating the capabilities of LLMs in generating code for visualizing data and highlights areas where these models need improvement. As LLMs continue to evolve, this benchmark will be an essential resource for researchers seeking to improve user experience in data visualization and analysis.


Cite this article: “Evaluating Large Language Models' Ability to Generate Code for Data Visualization”, The Science Archive, 2025.


Keywords: Large Language Models, Code Generation, Data Visualization, Pandas DataFrame, Plotting, Matplotlib, Seaborn, Plotly, Benchmarking, Natural Language Processing


Reference: Timur Galimzyanov, Sergey Titov, Yaroslav Golubev, Egor Bogomolov, “Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code” (2024).

