Saturday 27 September 2025
A unified framework for video scene graph generation has long been elusive. Researchers have traditionally tackled the problem with separate approaches, focusing either on coarse-grained box-level or fine-grained panoptic pixel-level representations of visual content. However, these siloed methods often require task-specific architectures and multi-stage training pipelines, making it difficult to generalize across different levels of visual granularity.
A new study aims to bridge this gap by introducing UNO, a single-stage, unified framework that jointly addresses both box-level and panoptic pixel-level video scene graph generation within an end-to-end architecture. At the core of UNO is an extended slot attention mechanism that decomposes visual features into object and relation slots, which the model then uses to capture temporal interactions between objects.
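For readers who want a more concrete picture of how slot attention carves a frame into slots, here is a minimal sketch in PyTorch. The module, the object/relation split, and all dimensions are illustrative assumptions, not UNO's published implementation.

```python
# Minimal slot-attention sketch (PyTorch). Assumes a frame's backbone features
# have already been flattened to (batch, num_tokens, dim). All names, sizes and
# the object/relation split below are illustrative, not UNO's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    def __init__(self, num_slots, dim, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, num_slots, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, num_slots, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm_inputs, self.norm_slots, self.norm_mlp = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, inputs):                            # inputs: (B, N, D)
        B, N, D = inputs.shape
        inputs = self.norm_inputs(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        # Sample initial slots from a learned Gaussian.
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(
            B, self.num_slots, D, device=inputs.device)
        for _ in range(self.iters):                       # iterative competition over tokens
            q = self.to_q(self.norm_slots(slots))
            attn = F.softmax(torch.einsum('bsd,bnd->bsn', q, k) * self.scale, dim=1)
            attn = attn / attn.sum(dim=-1, keepdim=True)  # weighted mean per slot
            updates = torch.einsum('bsn,bnd->bsd', attn, v)
            slots = self.gru(updates.reshape(-1, D), slots.reshape(-1, D)).view(B, -1, D)
            slots = slots + self.mlp(self.norm_mlp(slots))
        return slots

# Decompose each frame into object slots and relation slots.
num_obj, num_rel, dim = 8, 16, 256
slot_attn = SlotAttention(num_slots=num_obj + num_rel, dim=dim)
frame_features = torch.randn(2, 1024, dim)               # e.g. a flattened 32x32 feature map
slots = slot_attn(frame_features)
object_slots, relation_slots = slots[:, :num_obj], slots[:, num_obj:]
```

The design choice worth noting is the softmax over slots rather than over tokens: slots compete for image regions, which is what pushes each slot toward a single object or relation.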
To ensure robust temporal modeling, the researchers have introduced object temporal consistency learning, which enforces consistent object representations across frames without relying on explicit tracking modules. Additionally, a dynamic triplet prediction module links relation slots to corresponding object pairs, capturing evolving interactions over time.
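To make these two mechanisms more tangible, the sketch below shows one plausible way to implement them: a consistency loss that matches object slots across adjacent frames with Hungarian matching, and a triplet head that scores every (subject, relation, object) combination of slots. The matching strategy, the `pair_head` module, and all shapes are assumptions for illustration, not the authors' exact design.

```python
# Hedged sketch of temporal consistency learning and dynamic triplet prediction
# (PyTorch + SciPy). Both functions are illustrative choices, not UNO's design.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def temporal_consistency_loss(slots_t, slots_t1):
    """Encourage matched object slots in adjacent frames to stay similar,
    without an explicit tracker. slots_*: (num_obj, dim) for one video."""
    sim = F.normalize(slots_t, dim=-1) @ F.normalize(slots_t1, dim=-1).T
    row, col = linear_sum_assignment((-sim).detach().cpu().numpy())  # best one-to-one match
    matched = sim[torch.as_tensor(row), torch.as_tensor(col)]
    return (1.0 - matched).mean()                  # 1 - cosine similarity of matched slots

def score_triplets(object_slots, relation_slots, pair_head):
    """Link every relation slot to every (subject, object) pair of object slots.
    pair_head maps a concatenated [subj, rel, obj] vector to a compatibility score."""
    n, d = object_slots.shape
    r = relation_slots.shape[0]
    subj = object_slots[:, None, None, :].expand(n, n, r, d)
    obj = object_slots[None, :, None, :].expand(n, n, r, d)
    rel = relation_slots[None, None, :, :].expand(n, n, r, d)
    return pair_head(torch.cat([subj, rel, obj], dim=-1)).squeeze(-1)  # (n, n, r)

# Example wiring with toy sizes.
pair_head = torch.nn.Sequential(torch.nn.Linear(3 * 256, 256), torch.nn.ReLU(),
                                torch.nn.Linear(256, 1))
loss = temporal_consistency_loss(torch.randn(8, 256), torch.randn(8, 256))
triplet_scores = score_triplets(torch.randn(8, 256), torch.randn(16, 256), pair_head)
```

Because the matching is recomputed every pair of frames, object identity emerges from the representation itself rather than from a dedicated tracking module, which is the point of the consistency objective.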
The authors evaluate UNO on standard benchmarks for box-level and pixel-level video scene graph generation, demonstrating that the unified framework achieves competitive performance across both tasks while offering improved efficiency. These results have significant implications for applications such as video understanding, video reasoning, and robotic reasoning, where the ability to extract structured representations of dynamic visual content is crucial.
The key innovation here lies in UNO’s ability to minimize task-specific modifications and maximize parameter sharing, enabling generalization across different levels of visual granularity. By leveraging object-centric representation learning and temporal consistency learning, the framework can effectively model complex spatio-temporal relationships between objects without requiring explicit tracking or segmentation.
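As a rough illustration of what that parameter sharing might look like, the sketch below feeds the same object slots into a lightweight box head and a dot-product mask head; both heads and their shapes are hypothetical, not UNO's published architecture.

```python
# Minimal sketch of sharing one set of object slots between box-level and
# panoptic pixel-level outputs (PyTorch). Heads and shapes are assumptions.
import torch
import torch.nn as nn

class SharedSlotDecoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                      nn.Linear(dim, 4))       # (cx, cy, w, h) per slot
        self.mask_proj = nn.Linear(dim, dim)                   # slots -> mask queries

    def forward(self, object_slots, pixel_features):
        # object_slots: (B, S, D); pixel_features: (B, D, H, W) from the shared backbone.
        boxes = self.box_head(object_slots).sigmoid()          # box-level graphs
        queries = self.mask_proj(object_slots)                 # pixel-level graphs
        masks = torch.einsum('bsd,bdhw->bshw', queries, pixel_features)
        return boxes, masks

decoder = SharedSlotDecoder()
boxes, masks = decoder(torch.randn(2, 8, 256), torch.randn(2, 256, 64, 64))
```

Only the final heads differ between the two tasks; the backbone, slot attention, and relation machinery are shared, which is what lets one set of parameters serve both granularities.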
The authors’ approach builds on existing work in computer vision and machine learning, drawing on advances in object detection, segmentation, and relation prediction. The resulting architecture is robust, efficient, and able to capture the nuances of dynamic visual content.
As researchers continue to push the boundaries of what is possible with video scene graph generation, UNO serves as a testament to the power of unified frameworks in unlocking new possibilities for computer vision and machine learning. By providing a single, end-to-end solution that can tackle both box-level and panoptic pixel-level representations, UNO has opened up new avenues for exploration and innovation in this rapidly evolving field.
Cite this article: “UNO: A Single-Stage, Unified Framework for Video Scene Graph Generation”, The Science Archive, 2025.
Video Scene Graph Generation, Unified Framework, Computer Vision, Machine Learning, Object Detection, Segmentation, Relation Prediction, Spatio-Temporal Relationships, Temporal Consistency Learning, Object-Centric Representation Learning