Saturday 15 March 2025
Language is a powerful tool that can be used for good or for harm, and in recent years researchers have been working on ways to detect hate speech online. One of the biggest challenges in this area is building a dataset that reflects the complexity of language and the many forms hateful content can take.
A recently published paper addresses this challenge with a dataset called STATE TOXICN, built for span-level, target-aware toxicity extraction in Chinese hate speech detection. The dataset is a step toward more effective methods for detecting hate speech online.
The dataset includes over 2,000 examples of toxic language, including insults, slurs, and other forms of hateful content. Each example is labeled with information about the target group being attacked, such as race, ethnicity, or gender, as well as the type of toxicity present in the language.
One of the key features of STATE TOXICN is its focus on span-level annotation: rather than labeling a whole post as toxic or not, it marks the specific phrases or sentences within a larger piece of text that are toxic. This is important because hate speech often takes the form of subtle and nuanced language that can be easy to miss if you’re not looking for it.
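To make the idea of span-level, target-aware labeling concrete, here is a minimal Python sketch of what one annotated example might look like. The class name `ToxicSpan`, its fields, and the placeholder text are all assumptions made for illustration; the article does not describe the dataset's actual file format.

```python
from dataclasses import dataclass


@dataclass
class ToxicSpan:
    """Hypothetical record for one toxic span (not the dataset's real schema)."""
    start: int          # character offset where the span begins (inclusive)
    end: int            # character offset where the span ends (exclusive)
    target_group: str   # group being attacked, e.g. a race or gender label
    toxicity_type: str  # kind of toxicity, e.g. "insult"


def extract_spans(text: str, spans: list[ToxicSpan]) -> list[str]:
    """Return the substrings of `text` covered by each annotated span."""
    return [text[s.start:s.end] for s in spans]


# Placeholder text: the bracketed phrase stands in for a toxic expression.
text = "Nice weather today. <insult aimed at GROUP>. See you later."
toxic_phrase = "<insult aimed at GROUP>"

# Compute offsets from the text itself rather than hard-coding them.
start = text.index(toxic_phrase)
span = ToxicSpan(start, start + len(toxic_phrase), "hypothetical-group", "insult")

print(extract_spans(text, [span]))
```

Only the bracketed phrase is marked, so the harmless sentences around it stay unlabeled. That is the practical difference between span-level labeling and a single toxic/non-toxic label for the whole post.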
The dataset also includes a lexicon of Chinese hateful slang, which is a critical component in detecting hate speech online. The lexicon contains over 1,000 hateful terms and phrases, along with their meanings and the groups they target.
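As a rough illustration of how such a lexicon could be used, the sketch below scans a text for known terms and reports where each one occurs and which group it targets. The function `find_lexicon_hits`, the dictionary layout, and the placeholder terms `slangA` and `slangB` are assumptions for the example, not the paper's actual lexicon or method.

```python
def find_lexicon_hits(text: str, lexicon: dict[str, dict[str, str]]):
    """Find every occurrence of each lexicon term in `text`.

    Returns a sorted list of (start, end, term, target_group) tuples.
    Plain substring search is used because Chinese text has no spaces
    between words, so whitespace tokenization would not apply.
    """
    hits = []
    for term, entry in lexicon.items():
        pos = text.find(term)
        while pos != -1:
            hits.append((pos, pos + len(term), term, entry["target_group"]))
            pos = text.find(term, pos + 1)
    return sorted(hits)


# Hypothetical lexicon entries with neutral placeholder terms.
lexicon = {
    "slangA": {"meaning": "placeholder meaning A", "target_group": "group-1"},
    "slangB": {"meaning": "placeholder meaning B", "target_group": "group-2"},
}

text = "xx slangA yy slangB zz slangA"
for start, end, term, group in find_lexicon_hits(text, lexicon):
    print(start, end, term, group)
```

A lookup like this only catches terms already in the lexicon. New or disguised slang would slip past it, which is one reason a lexicon is typically paired with trained models rather than used alone.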
To evaluate the effectiveness of STATE TOXICN, researchers used it to train several different models for hate speech detection. They found that these models were able to achieve high levels of accuracy in detecting toxic language, even when faced with complex or subtle forms of hate speech.
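The article does not say which metric the researchers used. One common way to score span extraction is exact-match F1, where a predicted span counts as correct only if its start and end offsets both match a gold annotation. The sketch below shows that calculation under this assumption; the function name `span_f1` and the example offsets are made up for illustration.

```python
def span_f1(predicted: set[tuple[int, int]], gold: set[tuple[int, int]]) -> float:
    """Exact-match F1 between predicted and gold (start, end) spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: one span matches exactly, one is off by a character.
gold = {(0, 5), (10, 15)}
predicted = {(0, 5), (11, 15)}

print(span_f1(predicted, gold))
```

Exact matching is strict: a prediction that is off by a single character earns no credit. Softer variants that reward partial overlap also exist, and which one applies here would need to be checked against the paper itself.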
The results of this study have important implications for the development of artificial intelligence systems designed to detect and prevent hate speech online. By creating a dataset like STATE TOXICN, researchers can develop more effective methods for identifying and removing hateful content from online platforms.
Overall, STATE TOXICN is an important step forward in the development of hate speech detection technology. Its focus on span-level detection and its inclusion of a lexicon of Chinese hateful slang make it a powerful tool for researchers and developers looking to create more effective systems for detecting and preventing hate speech online.
Cite this article: “STATE TOXICN: A Novel Dataset for Detecting Hate Speech in Chinese Language”, The Science Archive, 2025.
Hate Speech, Language Detection, Chinese, Toxicity Extraction, Dataset, Span-Level Detection, Artificial Intelligence, Online Platforms, Language Technology, Machine Learning.







