Saturday 24 May 2025
A massive dataset of YouTube videos has been released, providing a treasure trove of information for researchers seeking to understand the online harm that can spread through social media platforms. The dataset, dubbed MetaHarm, comprises over 60,000 potentially harmful YouTube videos, carefully annotated by domain experts and crowdworkers.
To create this comprehensive resource, researchers employed a three-step approach. First, they searched for videos using search queries related to various forms of online harm, including hate speech, misinformation, and addiction. Next, they manually reviewed the top results to ensure they were indeed harmful. Finally, they enlisted the help of 544 crowdworkers recruited through Amazon Mechanical Turk to annotate the selected videos.
The resulting dataset includes binary classification (harmful vs. harmless) and multi-label categorization across six harm categories: information harms (such as misinformation), hate speech, addictive content, clickbait, sexual content, and physical harm. Because a single video can carry more than one category label, researchers can study how different kinds of harm overlap, identifying patterns and trends that a simple harmful/harmless label would hide.
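To make the two label types concrete, here is a minimal Python sketch of how one annotated video might be represented. It is not the dataset's actual schema: the class name, the `video_id` field, the category identifiers, and the rule that a video counts as harmful when it has at least one category are all assumptions made for illustration.

```python
from dataclasses import dataclass

# The six categories named in the article; these identifiers are illustrative,
# not the dataset's real column names.
HARM_CATEGORIES = (
    "information",
    "hate_speech",
    "addictive",
    "clickbait",
    "sexual",
    "physical",
)


@dataclass
class VideoAnnotation:
    video_id: str                        # hypothetical identifier field
    categories: frozenset = frozenset()  # zero or more harm categories

    def __post_init__(self):
        # Reject labels outside the six known categories.
        unknown = set(self.categories) - set(HARM_CATEGORIES)
        if unknown:
            raise ValueError(f"unknown harm categories: {sorted(unknown)}")
        self.categories = frozenset(self.categories)

    @property
    def is_harmful(self) -> bool:
        # Assumption: the binary label is derived from the multi-label set.
        return bool(self.categories)

    def to_multi_hot(self) -> list[int]:
        # One 0/1 entry per category, in HARM_CATEGORIES order.
        return [int(c in self.categories) for c in HARM_CATEGORIES]


example = VideoAnnotation("vid1", {"clickbait", "hate_speech"})
print(example.is_harmful, example.to_multi_hot())
```

The multi-hot vector is the form most multi-label classifiers expect as a training target, while the derived binary flag corresponds to the harmful vs. harmless task.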
The dataset’s creators acknowledge the limitations of their work. They note that their approach may not capture the full breadth of harmful content on YouTube, particularly videos that have already been removed or are otherwise inaccessible. They also emphasize the need for responsible data release practices to protect individual privacy and prevent misuse of the data.
The MetaHarm dataset has far-reaching implications for researchers seeking to develop more effective methods for detecting and mitigating online harm. With a large-scale, annotated dataset available, they can now explore various approaches to automated content classification, such as machine learning algorithms or natural language processing techniques. What they learn could also be applied to other social media platforms, helping to stem the spread of harmful content and promote a safer online environment.
Moreover, the dataset’s release underscores the importance of transparency in research and data sharing. By making their methodology and results publicly available, the researchers demonstrate a commitment to accountability and reproducibility, paving the way for future studies that can build upon this foundation.
Ultimately, the MetaHarm dataset represents a significant step forward in our understanding of online harm and its consequences. As researchers continue to work with the dataset, we can expect new insights into the complex interplay between social media platforms, user behavior, and the spread of harmful content.
Cite this article: “MetaHarm: A Comprehensive Dataset for Understanding Online Harm on YouTube”, The Science Archive, 2025.
YouTube, Online Harm, Dataset, MetaHarm, Social Media, Hate Speech, Misinformation, Addiction, Clickbait, Sexual Content







