Enhancing Offensive Language Detection with Data Augmentation and Knowledge Distillation

Offensive language detection has received considerable attention and plays a crucial role in promoting healthy communication on social platforms, as well as in the safe deployment of large language models. Training data is the basis for developing detectors; however, the available offense-related datasets in Chinese are severely limited in scale and coverage compared to English resources. This significantly affects the accuracy of Chinese offensive language detectors in practical applications, especially on hard cases and out-of-domain samples. To alleviate these limitations, we introduce AugCOLD (Augmented Chinese Offensive Language Dataset), a large-scale unsupervised dataset containing 1 million samples gathered by data crawling and model generation. Furthermore, we employ a multiteacher distillation framework to enhance detection performance with the unsupervised data. That is, we build multiple teachers with publicly accessible datasets and use them to assign soft labels to AugCOLD. The soft labels serve as a bridge through which knowledge is distilled from both AugCOLD and the multiple teachers to the student network, i.e., the final offensive detector. We conduct experiments on multiple public test sets and our well-designed hard test set, demonstrating that our proposal can effectively improve the generalization and robustness of the offensive language detector.


Introduction
In this era of booming social media, offensive content has become increasingly common on the web, such as racial discrimination, sexism, and violent crime, leading to a series of negative impacts. Moreover, as large language models (e.g., Blenderbot [1], EVA [2,3], PanguBot [4], GLM [5], and ChatGPT [6]) evolve into new human-computer interaction platforms, they are inevitably hindered by offensive content during deployment. It has therefore become crucial to build detectors that identify and filter inappropriate content automatically [7][8][9][10][11].
The performance of an offensive detector depends heavily on the quality and quantity of its training data [9,12,13]. For Chinese offensive detection, previous works mainly focus on building supervised datasets and compiling benchmark detectors, covering sexism [14], profanity [15], offensiveness [16], and targeted bias [17]. However, when these benchmark detectors are deployed in real-world applications, their performance suffers significantly in more diverse and complex scenarios. This is mainly caused by the following 2 factors.
•The first is the limited data coverage of the training corpus. Owing to the complexity and diversity of offensive language, it is challenging to cover all cases in the training data; thus, the model may encounter unexpected situations in actual deployment, resulting in a decrease in detection accuracy. Moreover, the data scale of available Chinese datasets ranges from 9k to 37k (as shown in Table 1), lagging far behind English datasets such as Jigsaw's 2 million samples (https://www.kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification/data). Insufficient data exacerbates the distribution differences between the training and deployment environments, resulting in a limited detectable scope [18]. For instance, while the Chinese Offensive Language Dataset (COLDataset) [16] focuses on offensive language related to race, gender, and region, it leaves out-of-domain topics, such as disability and body shaming, underexplored.
•The second is that detectors struggle with hard samples. We discovered that existing detectors are usually tricked by implicit samples: they are overly sensitive to counterspeech samples containing blacklisted words, or fooled by microaggressions, resulting in mispredictions and weakened robustness [12,16,17,19]. We call these implicit samples with covert expression hard cases. Their difficulty stems largely from the fact that existing training data are dominated by easy cases, and the proportion of hard cases is too small for the detector to learn to recognize them.
The most practical way to improve detector performance in real-world deployments is to train on large-scale, high-quality supervised data [9,12,13]. Nevertheless, very few public Chinese datasets are available, and the cost of creating large-scale supervised datasets is prohibitively high due to the scarce distribution of undesirable content in the real world [11] and the time and labor required for manual annotation. This has significantly hampered the research and development of Chinese offensive detection, leading to the absence, to date, of universally acknowledged detectors such as the Perspective API for English (https://perspectiveapi.com/).
The aim of this study is to develop a robust and generalizable Chinese offensive detector. To this end, we propose a large-scale automatically labeled dataset, AugCOLD, which contains 1 million samples and is an expansion of the previously proposed COLDataset [16]. AugCOLD is gathered from 2 data sources: crawling of real-world data and prompt-based generation from large language models. This design follows 2 considerations. First, enormous amounts of real-world data cover a broad range of topics, and integrating them as candidates can expand data coverage. Second, prompt-based generation can increase data diversity, particularly when augmenting hard samples.
To maximize the information extracted from AugCOLD, we apply multiteacher knowledge distillation to distill knowledge from both the teachers and the unsupervised data into the student detector, thus boosting its performance. The multiple teachers are trained with public Chinese datasets [16,17] and translated English datasets (https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification) [20]. With these teacher models, soft labels for AugCOLD are generated and then serve as training signals to guide the student model. We conduct experiments on various test benchmarks to verify the efficacy of the proposed AugCOLD dataset and multiteacher knowledge distillation framework. The results show that our solution improves the robustness and generalization of the offensive language detector, whose performance even surpasses that of the teacher models.
The contributions of this work are 3-fold: •We create and release AugCOLD (Augmented Chinese Offensive Language Dataset), which contains 1 million unsupervised samples gathered from real-world data crawling and model generation.
•We present a multiteacher knowledge distillation framework to maximize the utilization of unsupervised data and enhance the detector's performance.
•We conduct extensive experiments on several benchmark datasets, and the results show that our proposal can effectively improve the robustness and generalization of the offensive detector.

Offensive language detection
Detecting offensive language, also known as toxicity detection, is crucial to maintaining a healthy conversational environment on social platforms. In addition, the increasing popularity of large models in recent years has brought broad attention to inappropriate content, particularly offensive language, making offensive detection a vital component of the safe deployment of large models.
Offensive language detection aims to recognize and identify offensive content, such as insults, rudeness, profanity, and hate speech [7,16,21,22]. This task has drawn substantial attention from academia and industry. Recent studies have demonstrated that deep learning models achieve superior performance, and data-driven methods are gradually becoming mainstream for offensive detection [9,12,13,18,23]. Many works are committed to the development of supervised datasets. Wulczyn et al. [24] formulate this task as a binary classification problem and propose the Wikipedia Toxic Comments dataset to investigate personal attacks in social media. For identifying condescension in context, the TalkDown dataset is proposed [25]. Dinan et al. [9] collect adversarial data using the build-break-fix method to build a more robust safety detector; these data are manually collected during human-detector interactions and subsequently used to enhance the detector. Xu et al. [23] collect the Bot-Adversarial Dialogue dataset by eliciting unsafe responses from conversational models with their Bot-Adversarial Dialogue system; the collected data are used to refine the detector, which then filters unsafe content from generation. Besides binary classification, some works focus on more fine-grained classification of offensive language, such as the Offensive Language Target Identification dataset [21], the Unhealthy Comment Corpus [26], the AdHomInTweets dataset [19], and the offensive language and stance classification dataset (ToxiChat) [27]. The Kaggle competition (https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification) provides a large-scale dataset with finer toxicity types, including toxic, severe toxic, obscene, threat, insult, and identity hate, giving researchers a detailed taxonomy reference for future work. For detecting and classifying malevolent responses, Zhang et al. [28] present the Malevolent Dialogue Response Detection and Classification benchmark, proposing a finer-grained taxonomy that includes 10 kinds of malevolent responses, such as unconcernedness, threat, and obscenity. These works and publicly available datasets have significantly advanced the study of offensive language.

Offensiveness in Chinese
Although offensive language has been studied extensively, little emphasis has been placed on offensiveness in Chinese, mostly because of limited resources. Baidu Text Censor (https://ai.baidu.com/tech/textcensoring) is currently one of the most popular tools for identifying potentially harmful Chinese content, including pornography, violence, terrorism, political sensitivity, and abuse. However, recent studies have revealed that its accuracy in detecting offensive content is only about 63%, owing to its sensitivity to keywords and its inability to handle implicitly harmful utterances [16].
Most recently, several base resources have been built to alleviate this scarcity. Table 1 lists, to the best of our knowledge, all the relevant datasets. Yang et al. [15] focus on profane keywords such as "Bi*tch" and "h*ll" in Taiwanese local dialects and propose the TOCP (NTOU Chinese Profanity) dataset for detecting and rewriting Chinese profanity; TOCP has 16k sentences and augments their previous 2k-sentence corpus [30]. Tang et al. [29] develop COLA, a Chinese dataset for identifying offensive language that covers fine-grained insulting, antisocial, and criminal language; it is highly relevant to the scope of our research but currently unavailable to the public. Jiang et al. [14] present the first Chinese sexism dataset, the Sina Weibo Sexism Review (SWSR) dataset, for identifying gender-related inappropriate content, considering 4 sexist expressions: appearance-based stereotypes, culture-based stereotypes, microaggression, and sexual offense. Observing the data, we found that the offense in SWSR is better hidden, making its detection more challenging. Deng et al. [16] have released the first open-source Chinese offensive language dataset, COLDataset, including 37k samples and covering the topics of gender, race, and region; they also account for attacks on individuals and groups, anti-bias content, and other non-offensive cases. Zhou et al. [17,31] present CDialBias, a Chinese dialogue bias dataset that explores implicit attitudes toward target groups, accounting for bias at the sentence and context levels with detailed annotations: biased, anti-bias, neutral, and bias-irrelevant content.
The efforts of these works have significantly advanced the study of inappropriate content in Chinese. However, the quantity and coverage of Chinese resources remain far inferior to those of English resources. Therefore, this paper develops and releases a large-scale unsupervised dataset, AugCOLD. We expect it to cover data as diverse as possible, easing resource restrictions and encouraging further research on Chinese offensive language.

Knowledge distillation
Knowledge distillation is a common model-compression approach [32] that improves the performance of a small network by transferring knowledge from a larger neural network. This method has proven effective for a variety of tasks [33][34][35], including image classification and speech recognition [36]. Moreover, related studies have shown that distilling from multiple teacher networks can outperform distilling from a single teacher [37], because different teachers usually focus on different fields and thus provide more information. When reliably labeled training data are insufficient for knowledge distillation, some researchers suggest combining distillation with unsupervised learning [38,39]: the teacher network assigns soft labels to unsupervised data, which are then used as supervision signals to guide the optimization of the student model. For instance, Li et al. [40] apply this method to semisupervised relation extraction and demonstrate that it improves the basic model with minimal computation.
Motivated by these works, we explore a multiteacher knowledge distillation framework to enhance the final offensive detector. Specifically, we use existing relevant datasets to train multiple teachers and use them to assign soft labels to AugCOLD, thus distilling knowledge from both the teachers and AugCOLD into the final detector.
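As a minimal sketch of this scheme (function names and the pure-Python setting are ours, not the paper's), the teachers' averaged probability serves as the soft label for an unlabeled sample, and the student is trained on a weighted mixture of soft and pseudo hard supervision:

```python
import math

def teacher_soft_label(teacher_probs):
    """Average the offensive probabilities assigned by the N teachers
    to one unlabeled sample; this average is the soft label."""
    return sum(teacher_probs) / len(teacher_probs)

def kd_loss(student_prob, soft_label, hard_label, gamma):
    """Binary cross-entropy mixture of soft and hard supervision.
    gamma weights the soft (teacher) signal; 1 - gamma weights the
    pseudo hard label (e.g., the soft label thresholded at 0.5)."""
    eps = 1e-12  # guard against log(0)

    def bce(p, y):
        return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

    return gamma * bce(student_prob, soft_label) + (1 - gamma) * bce(student_prob, hard_label)
```

In practice the student would be a fine-tuned transformer and `student_prob` its sigmoid output; the sketch only shows how the two supervision signals are combined.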

Datasets
To evaluate the performance of the proposed model, we conduct experiments on 3 public datasets: COLDataset, the Chinese social bias dialogue dataset (CDialBias), and the Chinese sexism dataset (SWSR).
CDialBias includes dialogue-level context-sensitive samples and sentence-level samples. Since this work mainly focuses on offensiveness at the sentence level, only the sentence-level data in CDialBias are chosen as a test set.
Moreover, we create 2 additional test sets to validate the detector more thoroughly. One is AugTest, AugCOLD-like model-generated synthetic data containing 200 manually labeled samples, which evaluates the detector's capacity to monitor offensive generations of large models. The other is HardTest, a more challenging test set of 1,315 samples developed to evaluate performance on hard samples; the details are given in Robustness on hard samples.

Multiteachers and student model
In the multiteacher distillation framework, we fine-tune a pretrained language model on diverse datasets to obtain multiple teacher models.
The student model is the final detector MuDA, which is trained by knowledge distillation with AugCOLD.All experiments in this work are executed using a single NVIDIA V100 32G GPU.
The MacBERT-base model (https://huggingface.co/hfl/chinesemacbert-base) is adopted as the backbone for both the student model and the teacher models. We build 6 teacher models using 2 Chinese datasets and several translated English datasets.
• COLD-R Mac . COLDataset [16] is proposed for Chinese offensive language detection and contains 37k comments with binary offensive labels. Considering that the training data in COLDataset are semiautomatically labeled, we recheck the labels and correct noticeable errors. COLD-R Mac is fine-tuned on this revised version, COLD-R.
• CDialBias Mac . CDialBias focuses on social bias in dialogue and consists of 28k context-response pairs. During fine-tuning, the context and response are concatenated and fed into the model, which outputs a binary label indicating whether a biased attitude is present.
• TransJigsaw Mac . The Jigsaw dataset includes varied toxicity subtype attributes (e.g., severe toxicity, obscene, threat, insult, identity attack, and sexually explicit) and covers diverse identity attributes. We pick 109k samples and translate them into Chinese with the Baidu General Translation API; they are then used to fine-tune the MacBERT-base model.
• TransSIBC Mac . The Social Bias Inference Corpus (SIBC) contains 27,957 samples and is proposed to study why some statements are deemed potentially unjust. We translate this dataset into Chinese using the Baidu General Translation API and then use its offensiveness labels to fine-tune the MacBERT-base model.
• TransCN Mac . Counterspeech is a type of response to hateful speech that tries to counter the negative message and prevent the spread of the hate conveyed by the original speaker. Previous research has shown that sensitive terms are commonly used in counterspeech, such as when emphasizing the harmfulness of hate speech, causing detectors to mistake the content for offensive. To this end, we select and translate 2 counterspeech datasets into Chinese, the CONAN dataset [42] and the hate speech intervention dataset [43], containing 32k samples in total. We believe these data enable the teacher model TransCN Mac to distinguish hate speech from counterspeech.
• MixData Mac . Although the aforementioned datasets differ in annotation dimension and schema, they all relate to offensive language in some way. Combining them into a larger dataset lets the model acquire more information and perform better on related tasks. Therefore, we mix the aforementioned supervised datasets and train the sixth teacher model on the mixture.

Compared methods
To evaluate the effectiveness of the knowledge distillation framework based on the augmented dataset, we compare it against a number of related offensive detectors. The first group consists of the aforementioned 6 teacher models trained on various datasets, which represent the performance of related benchmark detectors. In addition, we compare the following popularly employed detection methods: • BaiduTC: Baidu Text Censor is a public API that identifies harmful content including pornography, violence, terrorism, political sensitivity, and abuse (https://ai.baidu.com/tech/textcensoring).
• GPT3.5 (text-davinci-003): Recent research has shown that large language models perform satisfactorily on zero-shot classification tasks when prompted with instructions. Thus, we explore the performance of GPT3.5 on Chinese offensive detection. An example of prompt-based zero-shot classification is as follows: - English translation: Please decide whether the following text is offensive, biased, or unethical. Text: Women are secret slaves in the current marriage system. Answer: It contains insults, prejudices, and content that violates moral ethics.
• COLD Mac : the MacBERT-base model fine-tuned on COLDataset using the original labels of the training set. This version differs from COLD-R Mac in that the latter performs additional label checking on the semiautomatically labeled training data to boost label reliability.
• MultiT Avg : Ensemble of multiple teachers. The average score assigned by the teachers is taken as the final offensive score, which is then used to generate the final prediction: offensive if the score p ≥ 0.5.
• MultiT Maj : Ensemble of multiple teachers. The final prediction is assigned by majority voting: if at least 3 of the 6 teachers predict the sample as offensive, offensive is the predicted label.
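The two ensembling rules can be sketched as follows (a minimal illustration with our own function names; the paper's implementation details are not published):

```python
def ensemble_avg(scores, threshold=0.5):
    """MultiT-Avg style: average the teachers' offensive scores
    and threshold the mean."""
    mean = sum(scores) / len(scores)
    return "offensive" if mean >= threshold else "safe"

def ensemble_majority(scores, threshold=0.5, min_votes=3):
    """MultiT-Maj style: each teacher casts a binary vote; the sample
    is offensive if at least min_votes teachers flag it."""
    votes = sum(1 for s in scores if s >= threshold)
    return "offensive" if votes >= min_votes else "safe"
```

The two rules can disagree: three borderline votes of 0.6 against three votes of 0.1 give a sub-threshold mean (safe under averaging) but still reach the 3-vote majority (offensive under voting).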

Main results of experiment
We conduct experiments to verify whether the proposed unsupervised dataset AugCOLD and the multiteacher distillation framework MuDA can effectively improve the performance of offensive detection. The experimental results are presented in Table 2.
When γ = 1.0, i.e., when only the soft labels generated by the multiple teachers are used as the supervision signal during knowledge distillation, the proposed MuDA outperforms the 6 teacher models in most cases. In particular, the average accuracy/F1-score of MuDA on the 4 test sets is 0.7961/0.7529, much better than COLD-R Mac trained on 3.2k supervised samples (0.7685/0.7393), and even better than the MixData Mac model (0.7723/0.7461), which is trained on all supervised data (about 216k samples). In addition, the performance of MuDA is comparable to that of the multiteacher ensemble MultiT Avg (0.7971/0.7549), despite having just one-sixth the number of parameters. This indicates that, through knowledge distillation, the student model MuDA successfully inherits knowledge from the multiple teachers and the unsupervised dataset AugCOLD.
MuDA Mix is obtained by fine-tuning MuDA (γ = 0.7) on all supervised data and achieves further gains, reaching the best average accuracy (0.8023) and the best performance on 3 Chinese datasets (COLDataset, CDialBias, and AugTest). Nonetheless, these gains are not large. This is because, in the first knowledge distillation stage, knowledge from the multiple teachers and the unsupervised data AugCOLD has already been distilled into MuDA; when optimizing MuDA Mix , the teachers' training data are used a second time, so they provide limited new information. We believe that if the supervised data used in retraining were data the teachers had not seen before, the gains would be more satisfying. This perspective is verified in Analysis of generalization.
We further investigate the importance of soft labels during knowledge distillation. As shown in Eq. 3, γ is the weight of the soft labels in the loss function during model training. We therefore compare the impact of γ on distillation performance; the results are shown in Fig. 1. We divide the distillation process into 2 steps: the first involves knowledge distillation on the unsupervised dataset AugCOLD, whereas the second continues distillation on all supervised data. In the first step, as γ increases, the overall performance of MuDA on each dataset trends upward and stabilizes when γ is between 0.7 and 1.0. Notably, when γ = 0, i.e., when only the pseudo hard labels are used as the supervisory signal, the average accuracy/F1 is 0.7865/0.7220; when γ increases to 1.0, the average scores rise to 0.7961/0.7529, which clearly demonstrates the significance of soft labels. In the second step, when γ ≠ 0, i.e., when combining hard and soft labels with a particular weight, the overall performance is more stable and satisfactory.
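Eq. 3 is not reproduced in this version of the text; based on the description of γ above, a plausible form of the training objective (assuming the standard soft/hard cross-entropy mixture) is:

```latex
\mathcal{L} \;=\; \gamma\,\mathcal{L}_{\text{soft}} \;+\; (1-\gamma)\,\mathcal{L}_{\text{hard}},
```

where $\mathcal{L}_{\text{soft}}$ is the cross-entropy between the student's prediction and the teachers' averaged soft label, and $\mathcal{L}_{\text{hard}}$ is the cross-entropy against the pseudo hard label; γ = 1.0 recovers pure soft-label distillation and γ = 0 recovers pure pseudo-label training.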

Analysis of generalization
To further validate the generalization of the proposed MuDA, we conduct further experiments on the SWSR dataset, which contains 8,969 comments labeled as sexist or nonsexist; the sexist comments cover subcategories of appearance-based stereotypes, culture-based stereotypes, microaggression, and sexual offense. Neither the original COLDataset nor the expanded AugCOLD gives this data type special consideration, so SWSR serves as out-of-domain data for investigating the generalizability of the detectors. We investigate the generalization of MuDA on the SWSR test set, as well as MuDA's performance when fine-tuned with varying amounts of SWSR training data, using the cross-entropy loss for optimization.
Experimental results are shown in Table 3. The accuracy of the initial MuDA on the SWSR test set reaches 0.7489. When fine-tuned on the same quantity of data, the updated MuDA is always superior to the updated MacBERT-base. Notably, with only 2k training samples, the updated MuDA reaches an SWSR accuracy of 0.8025, comparable to SWSR Mac fine-tuned on the entire dataset (accuracy 0.8013 with 7k samples). This reveals that MuDA, distilled from the multiteacher network, transfers well to other domains and can be improved further by fine-tuning with a minimal quantity of supervised data.

Collection of HardTest
To further evaluate the model's robustness, we construct a more difficult test set of hard samples. We gather data using the following guidelines: •Samples with covert offense that are difficult for detectors to process, such as microaggressions.
•Samples that are easily mispredicted, such as counterspeech, which is frequently mispredicted as offensive due to the presence of blacklisted keywords or offense-related phrases.
To this end, we select hard samples from the test sets of the available datasets, including COLDataset, CDialBias, SWSR, and the translated version of SIBC, to further investigate the effectiveness of the proposed MuDA. We gather a total of 1,315 samples, 652 of which are safe and 663 offensive. The specific data sources are as follows.
1. COLDataset: We pick 200 safe samples with the label AntiBias and 200 samples with offensive scores between 0.33 and 0.67, assigned by COLDetector [8]. Finally, we gather 300 offensive and 100 safe hard samples.
2. CDialBias: We pick 200 safe samples with the label AntiBias or Neutral and 300 offensive samples with the label Bias from the utterance-level data.
3. SWSR: We select 201 samples with the label Micro-aggressive and 101 hard safe samples identified with COLDetector.
4. SIBC: SIBC provides manually labeled offensive scores. We select samples with offensive scores between 0.33 and 0.77 and then manually pick 113 samples (51 safe and 62 offensive) to avoid noise introduced by the translation process and cultural differences.

Performance analysis on hard samples
In this section, we analyze offensive detection performance on hard samples. The results are shown in Table 4, and some cases are given in Table 5. Compared with COLD Mac , the performance of MuDA on hard samples improves steadily, with accuracy up to 63.50% (+4.03%) and Macro-F1 up to 63.42% (+4.23%). On the overall metrics, MuDA outperforms all teacher models except MixData Mac and is even comparable with the ensemble models MultiT Maj and MultiT Avg . MuDA's performance is further enhanced after fine-tuning on the supervised data: MuDA Mix reaches 0.6350 accuracy and 0.6342 F1-score. This shows that multiteacher knowledge distillation with AugCOLD can effectively enhance the robustness of the offensive detector.
Hard samples, however, continue to pose substantial challenges to present detectors. Our detector achieves an average accuracy of 0.8023 on the general test sets (as shown in Table 2) but only 0.6350 on the hard samples (as shown in Table 4). This suggests that understanding and detecting hard samples deserves further study toward more powerful detectors.

Conclusion
In this paper, we presented an unsupervised offensive language dataset, AugCOLD, containing 1 million samples acquired through data augmentation techniques. In terms of quantity and variety, it significantly surpasses related publicly available Chinese datasets. Furthermore, to maximize the utilization of unsupervised data, we developed a multiteacher knowledge distillation framework to distill knowledge from both the multiple teachers and AugCOLD into the resulting detector. Through extensive experiments, we demonstrated that our proposal can effectively enhance the generalization and robustness of the offensive language detector.


Prompt design
In this work, we perform data augmentation by generating synthetic data with few-shot prompts on GLM-10B and GLM-large [5]. We construct various 2-shot prompts to broaden the scope and diversify the variety of the augmented data. Prompts consist of seed samples with annotated labels from COLDataset [16] and CDialBias [17]. Two types of prompts are designed based on the following 2 strategies.
•Prompt with binary label constraint. We create prompts by randomly selecting seed samples with the same label (offensive or not), for example, 2 offensive samples referring to different topics or target groups. Such prompts steer the model to generate offensive content while retaining its potential to produce data on a wider range of topics and target groups, which helps expand data coverage.
•Prompt for triggering hard cases. Seed samples for augmenting hard cases are mainly picked from CDialBias. This dataset focuses on social bias and considers several attitudes, including biased, neutral, and anti-bias. Among them, biased expressions are comparatively subtle compared with other offenses such as insults, while neutral and anti-bias expressions are nonoffensive but more likely to be misclassified as offensive than other safe expressions. Therefore, these data are well suited as seed samples for augmenting hard cases.
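The first strategy can be sketched as follows. The pairing helper and prompt layout are our own illustration; the actual prompt template used with GLM is not published in the paper:

```python
from itertools import combinations

def pick_same_label_pairs(samples):
    """Group seed samples by binary label and yield all within-label
    pairs; each pair can seed one 2-shot prompt."""
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)
    for label, texts in by_label.items():
        for a, b in combinations(texts, 2):
            yield a, b, label

def build_two_shot_prompt(seed_a, seed_b, instruction):
    """Concatenate two same-label seeds followed by a continuation
    instruction (hypothetical layout)."""
    return f"{seed_a}\n{seed_b}\n{instruction}"
```

Pairing seeds that share a label but differ in topic or target group is what nudges the model toward new topics while preserving the intended offensiveness polarity.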

Quality filtering
The quality of synthetic data generated by a language model is difficult to guarantee. Therefore, we use perplexity (PPL) to control text fluency. PPL is usually used to evaluate language models: for fluent, well-formed sentences, the model assigns a lower perplexity, denoting that it is not perplexed by them and understands them well. We therefore believe that if a relatively reliable model scores the generated text, the PPL value reflects its fluency to a certain extent. However, recent work finds that a very low PPL does not imply very high quality [41], because repetition of words or phrases, which often occurs in generated text, sharply lowers the PPL. Accordingly, we use the PPL metric cautiously to filter disfluent generations, keeping only synthetic data with PPL values between 10 and 100. Some example prompts for data augmentation are shown in Table 6.
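The two-sided filter can be sketched as follows (a simplified illustration with our own names; in practice the per-token log-probabilities would come from a scoring language model rather than being passed in directly):

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-mean log-probability) over the tokens of a sentence."""
    avg = sum(token_logprobs) / len(token_logprobs)
    return math.exp(-avg)

def keep_by_ppl(sentences_with_logprobs, low=10.0, high=100.0):
    """Keep only generations whose PPL lies in [low, high]: the upper
    bound drops disfluent text, while the lower bound drops degenerate
    repetition, which artificially deflates PPL."""
    return [s for s, lps in sentences_with_logprobs
            if low <= perplexity(lps) <= high]
```

The lower bound is the key design choice: without it, a filter that simply minimizes PPL would preferentially keep the repetitive generations the paper warns about.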

Selection from real-world data
Besides model generation, we collect real-world data to enlarge the diversity of the AugCOLD dataset, mainly in the following 2 ways.
We take the above datasets as candidates and score them with the classifier COLDetector [16], which assigns each sample a score between 0 and 1 indicating the probability that the sentence is offensive. We then pick samples relatively uniformly from each score interval (0-0.1, 0.1-0.2, etc.). These samples with varying scores are added to the AugCOLD dataset, making the data more varied.
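This stratified selection can be sketched as follows (names and the equal-width binning are our assumptions; the paper only states that samples are picked "relatively uniformly" from each interval):

```python
import random

def sample_uniform_over_scores(scored, per_bin, n_bins=10, seed=0):
    """Partition candidates by their classifier score into n_bins
    equal-width intervals over [0, 1] and draw up to per_bin samples
    from each, so every score range is represented."""
    rng = random.Random(seed)
    bins = [[] for _ in range(n_bins)]
    for text, score in scored:
        idx = min(int(score * n_bins), n_bins - 1)  # score == 1.0 -> last bin
        bins[idx].append(text)
    picked = []
    for bucket in bins:
        rng.shuffle(bucket)
        picked.extend(bucket[:per_bin])
    return picked
```

Compared with thresholding on the score alone, this keeps mid-score (ambiguous) candidates in the pool, which is exactly where hard cases concentrate.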

Data selection with keywords
According to prior research, the automatic detection of offensive content can be hindered by the presence of sensitive keywords [16]. Sensitive words appear in both offensive and nonoffensive samples; even the most offensive terms have a high likelihood of appearing in safe samples, such as anti-bias statements. Nevertheless, because the majority of keyword-containing samples in the training data are offensive, once the model detects sensitive words in the input, it tends to disregard other features and incorrectly predict the input as offensive. This yields a high recall but low precision for the offensive detector.
To alleviate this problem and further increase data coverage, we collect data with a keyword-matching method. Specifically, we crawl a large amount of data from platforms such as Weibo and Zhihu. Owing to the low density of offense-related data, we manually collect 2.6k blacklist terms covering keywords related to offenses such as abusive, discriminatory, pornographic, and intimidating content. We then select 96k candidate offensive samples by keyword matching and randomly select 24k candidate safe samples.
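A minimal sketch of the keyword-matching split is shown below; the function name is hypothetical, and a naive substring scan is used for clarity (for 2.6k terms over large crawls, a trie or Aho-Corasick automaton would be the practical choice).

```python
def match_blacklist(posts, blacklist):
    """Split crawled posts into candidate-offensive (contains any
    blacklisted term) and candidate-safe (contains none)."""
    terms = tuple(blacklist)
    candidates, safe = [], []
    for post in posts:
        (candidates if any(t in post for t in terms) else safe).append(post)
    return candidates, safe
```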

AugCOLD dataset
We develop the AugCOLD dataset, which includes 1,090k samples and is almost 29 times larger than the initial COLDataset. Detailed data statistics of the AugCOLD dataset are presented in Table 7.

Lexical diversity
We investigate lexical diversity: the number of unique unigrams in AugCOLD is double that of COLDataset (9.4k vs. 4.6k), and the number of unique 5-grams is about 33 times that of COLDataset (45,794k vs. 1,363k). This demonstrates that AugCOLD greatly increases sample diversity and coverage. This is owed, in part, to the inclusion of real-world data, which brings the augmented dataset closer to the actual deployment scenario.
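Unique n-gram counting of this kind can be sketched as follows. This is an illustration with our own function name; whitespace splitting stands in for a real Chinese tokenizer, which the actual measurement would require.

```python
def unique_ngrams(texts, n):
    """Count distinct word n-grams across a corpus; a larger count
    indicates higher lexical diversity."""
    grams = set()
    for text in texts:
        tokens = text.split()  # stand-in for a proper Chinese tokenizer
        for i in range(len(tokens) - n + 1):
            grams.add(tuple(tokens[i:i + n]))
    return len(grams)
```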

Offensiveness
To better explore the offensive distribution, we analyze the offensiveness of the AugCOLD dataset. Utilizing N teacher models, we obtain multiple probability outputs (P_1, P_2, …, P_N) for each sample and then calculate the average (AvgScore) and maximum (MaxScore) offensive scores for each sample: AvgScore = (1/N) ∑_{i=1}^{N} P_i and MaxScore = max_i(P_i). We count the number of examples whose offensive score falls within each range, and the results are shown in Fig. 2. In general, samples with an average offensive score (AvgScore) between 0.3 and 0.7 can be considered more challenging for the detector, and this portion of the data accounts for approximately 42%, showing that simple samples do not overwhelm the dataset.
It can be observed that AugCOLD dataset can cover a wider range of offensive levels, hence satisfying the diversity requirement of offensiveness distribution.
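The per-sample statistics above translate directly into code. This is a small sketch under our own naming; `teacher_probs` holds the offensive probabilities P_1..P_N from the N teacher models.

```python
def offense_stats(teacher_probs):
    """Given per-teacher offensive probabilities P_1..P_N for one
    sample, return (AvgScore, MaxScore)."""
    return sum(teacher_probs) / len(teacher_probs), max(teacher_probs)

def hard_fraction(all_probs, lo=0.3, hi=0.7):
    """Fraction of samples whose AvgScore lies in [lo, hi] -- the
    range treated as 'hard' for the detector."""
    hard = sum(1 for p in all_probs if lo <= sum(p) / len(p) <= hi)
    return hard / len(all_probs)
```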

Quality of generated data
Due to the limitations of the language model's generation capability, the augmented synthetic data may contain repetitions and grammatical faults. To verify the quality of the augmented synthetic data, we randomly select 200 samples and manually evaluate their fluency. Of them, 185/200 (92.50%) are judged fluent and could easily be mistaken for human-written data. After PPL filtering, 191 samples remain, of which 183/191 (95.81%) are fluent. This reveals that PPL filtering can effectively exclude poor-quality samples and enhance the quality of the remaining data. Examples of augmented synthetic data are shown in Table 2.

Multiteacher Knowledge Distillation Framework
Limited by the quality and quantity of training data, existing Chinese offensive detectors confront significant challenges in terms of generalization to new topics and robustness to hard cases when they are deployed. Recent studies have shown that unsupervised data with pseudo-labels can improve the performance of detectors. Motivated by this, we construct a large-scale unsupervised dataset, AugCOLD, and explore Multiteacher Knowledge Distillation with the Augmented dataset (MuDA). With such a framework, as shown in Fig. 3, we can distill knowledge from both the unsupervised data and multiple teachers to boost the performance of the student model. To achieve the above goals, the construction of the unsupervised dataset and the training of the multiteacher network are the 2 most important parts.

Construction of unsupervised dataset
Unsupervised data should be diversified and broad in scope, yet gathering such data is a significant undertaking. First, because social platforms maintain a healthy communication environment, the spread of offensive samples in the real world is highly limited. Second, available datasets are dominated by simple examples, making it challenging for the resulting detector to deal with complex samples such as concealed toxicity and counterspeech. To address these difficulties, we construct the unsupervised dataset AugCOLD, an extension of COLDataset [16]. To maximize data coverage and diversity, we collect data from 2 sources: real-world data crawling and data augmentation with generation models.
It is important to highlight that during the data collection process of AugCOLD, we obtain raw labels that are automatically assigned, based on the label constraints in prompt-based generation and on the predictions from detector- and keyword-based data selection. However, in our pilot experiments, we identified inherent inaccuracies in these raw labels, attributable to the limitations of the detectors and to the possibility that generated samples do not strictly adhere to the labeling instructions in the prompt. Therefore, we decided to discard these raw labels and rely solely on the augmented samples themselves. The details of AugCOLD development are given in section AugCOLD Development.

Building the multiteacher network
The multiteacher network consists of multiple independent offensive detectors, usually trained on different datasets, guaranteeing that together they can successfully handle a variety of inputs, even hard cases, and thus give the student model strong robustness and generalization. Considering that Chinese data are limited in quantity and scope, we employ both Chinese data and English translation data to train the teacher models so that they are capable of handling a variety of input cases.
With the pretrained teacher models, the unsupervised data can be scored to generate soft labels. These soft labels serve as a training signal and guide the training of the student model, thereby improving the detector's robustness and generalization.
Specifically, in our distillation framework, N independent binary classification models serve as teachers T_1, …, T_N, each assigning an offensive probability P_i to a sample. The student is trained with the loss L = CE(y, ŷ_s) + γ ∑_{i=1}^{N} w_i · KL(P_i ‖ ŷ_s), in which y is the gold label (where available), ŷ_s is the predicted probability of the student model, CE(•) is the cross-entropy loss, KL(•) is the Kullback-Leibler divergence loss, and w_i and γ are hyperparameters. In experiments, w_i is set to 1/N.
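A minimal sketch of such a loss for the binary case is given below. The exact combination used in the paper is our assumption based on the symbols described (CE plus γ-weighted KL terms with w_i = 1/N by default); `kd_loss` and its signature are hypothetical names for illustration.

```python
import math

def kd_loss(y, student_p, teacher_ps, weights=None, gamma=1.0, eps=1e-8):
    """Binary distillation loss sketch: cross-entropy against the gold
    label plus gamma-weighted KL from each teacher's soft label to the
    student's prediction (w_i defaults to 1/N)."""
    n = len(teacher_ps)
    w = weights or [1.0 / n] * n
    # Cross-entropy term against the gold label y (0 or 1)
    ce = -(y * math.log(student_p + eps) + (1 - y) * math.log(1 - student_p + eps))
    def kl(p, q):  # KL(p || q) between two Bernoulli distributions
        return (p * math.log((p + eps) / (q + eps))
                + (1 - p) * math.log((1 - p + eps) / (1 - q + eps)))
    return ce + gamma * sum(wi * kl(pi, student_p) for wi, pi in zip(w, teacher_ps))
```

On unlabeled AugCOLD samples, only the teacher-derived soft-label terms would provide the training signal; the CE term applies in the subsequent distillation step on supervised data.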

Fig. 1 .
Fig. 1. Statistics of offensive scores in the AugCOLD dataset. We count the number of examples for which the average score (AvgScore) or maximum score (MaxScore) of the N teacher models falls inside each range, where AvgScore = (1/N) ∑_{i=1}^{N} P_i, MaxScore = max_i(P_i), and P_i is the offensive score assigned by teacher model T_i.

Fig. 3 .
Fig. 3. Accuracy and macro-F1 score with varying weights γ of soft labels in the loss function. The results of 2-step knowledge distillation are shown: distillation on AugCOLD, followed by continued distillation on all supervised data.

Table 1 .
Comparison between the proposed AugCOLD and other related Chinese datasets.

Table 2 .
Examples of generated samples in AugCOLD. The content marked in blue is from the datasets CDialBias and COLDataset.

Table 5 .
Analysis of MuDA's generalization on the SWSR dataset. SWSR Mac and MuDA SWSR are the models obtained by fine-tuning MacBertBase and MuDA with varying volumes of SWSR training data.

Table 6 .
Experimental results on HardTest.Overall denotes the macro scores.The highest scores are highlighted in bold.

Table 7 .
Case study on gathered HardTest. Each example has a binary "True Label", with "1" denoting offensive content. This table includes the offensiveness probability assigned by COLD-R Mac and MuDA Mix, as well as the predictions from InstructGPT and BaiduTC.