Assessing Artificial Intelligence's Understanding of Physical Social Norms

Monday 31 March 2025


Giving machines a human-like understanding of physical social norms has long been an elusive goal in artificial intelligence research. Machines excel at specific tasks, but their ability to grasp the subtleties of human behavior remains limited. A new benchmark, EGONORMIA, addresses this gap by providing an evaluation framework that measures how well vision-language models (VLMs) understand physical social norms.


EGONORMIA is built upon a dataset of 1,853 egocentric videos featuring commonplace human activities in various contexts. Each video is paired with two related questions that evaluate both the prediction and justification of normative actions. The benchmark encompasses seven categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility.
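To make this structure concrete, one benchmark entry can be pictured as a small record: a video, the norm categories it touches, and two multiple-choice questions. The following is only an illustrative sketch. The class names, field names, and example values are hypothetical and are not taken from the EGONORMIA release, which may organize its data differently.

```python
from dataclasses import dataclass
from typing import List

# The seven norm categories named in the article.
CATEGORIES = (
    "safety",
    "privacy",
    "proxemics",
    "politeness",
    "cooperation",
    "coordination/proactivity",
    "communication/legibility",
)


@dataclass
class MultipleChoiceQuestion:
    """One question: a list of candidate answers and the index of the correct one."""
    options: List[str]
    correct_index: int

    def is_correct(self, chosen_index: int) -> bool:
        return chosen_index == self.correct_index


@dataclass
class BenchmarkItem:
    """Hypothetical layout of one entry: a video plus its two related questions."""
    video_id: str
    categories: List[str]
    action_question: MultipleChoiceQuestion          # which behavior is normative?
    justification_question: MultipleChoiceQuestion   # why is it normative?

    def __post_init__(self) -> None:
        # Reject category labels outside the seven listed above.
        unknown = [c for c in self.categories if c not in CATEGORIES]
        if unknown:
            raise ValueError(f"unknown categories: {unknown}")


# A made-up entry, for illustration only.
example_item = BenchmarkItem(
    video_id="example-0001",
    categories=["safety", "cooperation"],
    action_question=MultipleChoiceQuestion(
        options=["Hold the ladder steady", "Walk away", "Start filming"],
        correct_index=0,
    ),
    justification_question=MultipleChoiceQuestion(
        options=["It keeps the climber from falling", "It saves time"],
        correct_index=0,
    ),
)
```

Keeping the action question and the justification question as separate fields mirrors the article's point that the benchmark scores both what a model would do and why.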


To create EGONORMIA, the authors developed a pipeline that generates action descriptions, correct behaviors, distractor behaviors, correct justifications, and distractor justifications. Human annotators then reviewed the dataset to ensure its quality and consistency. The resulting benchmark tests VLMs’ ability to understand and reason about physical social norms in a variety of situations.


The results are striking: current state-of-the-art VLMs struggle on EGONORMIA. The paper reports a maximum score of about 45%, against a human baseline of about 92%. The shortfall appears to reflect the models’ limited understanding of physical social norms rather than a lack of computational resources. Performance also varies considerably between categories: safety and coordination/proactivity are relatively easier, while communication/legibility and politeness pose greater challenges.


Further analysis reveals that even among the top-performing models, there is a 10% gap in performance between the best- and worst-scoring taxonomy categories, so norm understanding is uneven as well as weak. VLMs are therefore still far from a human-level understanding of physical social norms. The results also show that closed-source models outperform open-source alternatives, with a mean accuracy of 40.3% compared to 28.3%.
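The two statistics mentioned here, the gap between the best- and worst-scoring categories and the mean accuracy of a group of models, are simple to compute from graded answers. The sketch below is a minimal illustration with made-up toy data. The function names are hypothetical, and this is not the authors' evaluation code.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """Compute accuracy per category from (category, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in results:
        totals[category] += 1
        if is_correct:
            correct[category] += 1
    return {category: correct[category] / totals[category] for category in totals}


def category_gap(per_category):
    """Difference between the best- and worst-scoring categories."""
    return max(per_category.values()) - min(per_category.values())


def mean_accuracy(model_scores):
    """Average accuracy over a group of models (e.g. all open-source models)."""
    return sum(model_scores) / len(model_scores)


# Toy graded answers for one model; these values are invented for illustration.
toy_results = [
    ("safety", True), ("safety", True), ("safety", False), ("safety", True),
    ("politeness", True), ("politeness", False),
    ("politeness", False), ("politeness", True),
]

per_category = accuracy_by_category(toy_results)
gap = category_gap(per_category)
```

On the toy data, safety scores 3 of 4 and politeness 2 of 4, so the gap is 0.25. A gap like the one the article reports is computed the same way, over all seven categories.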


EGONORMIA offers insight into the current limitations of VLMs and provides a framework for future research. Using this benchmark, researchers can design and test approaches to improve VLMs’ normative reasoning capabilities. The dataset’s wide range of contexts also makes it a useful resource for studying the complexities of human behavior.


Cite this article: “Assessing Artificial Intelligence's Understanding of Physical Social Norms”, The Science Archive, 2025.


Artificial Intelligence, Social Norms, Benchmark, Vision-Language Models, Human Behavior, Egocentric Videos, Action Descriptions, Justifications, Normative Reasoning, Machine Learning


Reference: MohammadHossein Rezaei, Yicheng Fu, Phil Cuvin, Caleb Ziems, Yanzhe Zhang, Hao Zhu, Diyi Yang, “EgoNormia: Benchmarking Physical Social Norm Understanding” (2025).

