Tuesday 18 March 2025
Researchers have long lacked a comprehensive benchmark for gauging how well large language models (LLMs) solve complex undergraduate-level physics problems. A new paper takes a significant step towards filling that gap by introducing UGPhysics, a large-scale benchmark designed specifically for evaluating undergraduate-level physics reasoning with LLMs.
The researchers behind UGPhysics argue that existing benchmarks fall short of capturing the breadth and depth of undergraduate-level physics. They set out to build a more comprehensive evaluation tool, one that tests an LLM’s ability to reason about physical concepts, apply mathematical techniques, and derive solutions to complex problems.
UGPhysics comprises 5,520 undergraduate-level physics problems, available in both English and Chinese, covering 13 subjects with seven answer types and four distinct physics reasoning skills. The subjects range from mechanics and thermodynamics to electromagnetism and quantum mechanics. To guard against data contamination, the researchers also rigorously screened the dataset for leakage into existing training data.
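Although the paper’s exact data schema is not described here, the benchmark can be pictured as a collection of tagged problem records. The sketch below is purely illustrative; the field names and example values are assumptions, not the authors’ actual format:

```python
from dataclasses import dataclass

# Hypothetical record for one UGPhysics problem. Field names and example
# values are illustrative assumptions, not the dataset's actual schema.
@dataclass
class PhysicsProblem:
    subject: str           # one of the 13 subjects, e.g. "Electrodynamics"
    language: str          # "en" or "zh" (problems are provided in both languages)
    statement: str         # the problem text
    reference_answer: str  # the expected solution
    answer_type: str       # one of the seven answer types, e.g. a numerical value
    reasoning_skill: str   # one of the four physics reasoning skills
```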
To evaluate LLMs on UGPhysics, the researchers developed a Model-Assistant Rule-based Judgment (MARJ) pipeline specifically designed for assessing answer correctness. The pipeline compares each model’s output against the reference solution and scores its accuracy, combining rule-based checks with model assistance for answers the rules cannot resolve on their own.
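In outline, such a pipeline tries cheap rule-based checks first (exact string match, numerical comparison within a tolerance) and falls back to a judge model only when the rules cannot decide. The sketch below illustrates that idea under those assumptions; the tolerance and the judge_with_llm callable are placeholders, not the authors’ implementation:

```python
def rule_based_match(predicted: str, reference: str, rel_tol: float = 1e-3):
    """Return True/False when a rule can decide correctness, or None if undecided."""
    if predicted.strip() == reference.strip():
        return True
    try:
        p, r = float(predicted), float(reference)
        return abs(p - r) <= rel_tol * max(abs(r), 1e-12)
    except ValueError:
        return None  # symbolic or multi-part answers need the model assistant

def marj_judge(predicted: str, reference: str, judge_with_llm) -> bool:
    """Rules first, model assistant as a fallback for undecided cases."""
    verdict = rule_based_match(predicted, reference)
    if verdict is not None:
        return verdict
    return judge_with_llm(predicted, reference)  # hypothetical LLM-judge callable
```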
The results are striking: while some models performed reasonably well across skills and subjects, others struggled significantly. The best-performing model achieved an overall accuracy of just 49.8%, underscoring the need for LLMs with stronger physics reasoning skills, beyond mathematical ability alone.
One notable example is a problem involving electromagnetic eddy currents, which most models answered incorrectly. This highlights the need for LLMs to develop a deeper understanding of physical concepts and how to apply them.
Another challenge lies in semiconductor physics, where models often struggle to derive the effective mass of a hole from the effective Rydberg formula. In one such problem, even the best-performing model arrived at 0.07 m₀, well off the correct value of approximately 9.52 × 10⁻² m₀ (about 0.095 m₀).
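For context, the effective Rydberg (hydrogen-like) model relates a carrier’s binding energy to its effective mass as E ≈ R_H (m*/m₀)/ε_r², so the mass follows by simple rearrangement. The snippet below sketches that rearrangement; the binding energy and dielectric constant are placeholder values, since the article does not reproduce the benchmark problem’s actual inputs:

```python
RYDBERG_EV = 13.6057  # hydrogen Rydberg energy in eV

def hole_effective_mass(binding_energy_ev: float, eps_r: float) -> float:
    """Effective mass in units of m0, from the hydrogen-like impurity model:
    E = R_H * (m*/m0) / eps_r**2  =>  m*/m0 = E * eps_r**2 / R_H."""
    return binding_energy_ev * eps_r**2 / RYDBERG_EV

# Placeholder inputs for illustration only -- not the benchmark problem's data.
print(hole_effective_mass(0.012, 12.0))  # -> ~0.127 (in units of m0)
```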
These findings underscore the importance of developing more advanced LLMs capable of tackling complex physics problems. By pushing these models to improve their performance on UGPhysics, researchers can ultimately create more accurate and reliable tools for a wide range of applications, from scientific research to education.
Cite this article: “Developing Comprehensive Benchmarks for Large Language Models in Physics: The UGPhysics Challenge”, The Science Archive, 2025.
Large Language Models, Benchmark, Physics Problems, Undergraduate Level, Evaluation Tool, Physical Concepts, Mathematical Techniques, Solution Derivation, Accuracy, Performance Assessment.







