Indonesian Hate Speech Text Classification Using Improved K-Nearest Neighbor with TF-IDF-ICSρF

Purpose: Freedom of expression on social media creates the possibility of users being disturbed by the sentences others post, behavior that is restricted by the Electronic Information and Transactions Law (UU ITE). This research aims to find an effective method for classifying hate speech text data, especially in Indonesian, across many categories, which is expected to help minimize such cases. Methods: This study used 1,000 tweets from Twitter with five labels: religion, race, physical, gender, and other (invective or slander). The process started with several preprocessing steps, followed by data transformation using TF-IDF-ICSρF term weighting and data mining using the Improved KNN algorithm. The results were then compared with the TF-IDF and KNN methods to evaluate the differences. Result: Using the TF-IDF-ICSρF and Improved KNN algorithms gives an average accuracy of 88.11%, 17.81% higher than the K-Nearest Neighbor and TF-IDF algorithms with the same data and parameters, which achieved 70.30%. Novelty: Based on the comparison results, the TF-IDF-ICSρF and Improved KNN methods can effectively classify hate speech sentences with many labels at fairly good accuracy.


INTRODUCTION
Today's digitalization impacts almost every sector, including social life, education, government, the economy, and communication. Digitalization is advancing especially rapidly in communication, with social media now serving as a means of public communication through which various opinions are widely and freely expressed. Social media is a manifestation of digital development that cannot be separated from everyday human life; it functions as a platform for communicating, exchanging information, and voicing opinions, and its use can have a positive or negative impact on a person's life. Sentences expressed on social media are sometimes conveyed in an unstructured manner, which makes the information difficult to analyze, especially when large amounts of data must be processed. Opinions expressed by the public through social media can be used as a reference to determine the public's subjective perceptions of various discussion themes such as public figures, places, conditions, and others. For example, when a widely reported event draws public responses, those responses can be used to measure the subjective perceptions or emotions conveyed by the public and, indirectly, the level of public awareness in using social media [1]-[5].
One of the social media platforms often used to express hate speech is Twitter [6], [7]. The hate speech expressed is often inseparable from SARA (ethnicity, religion, race, and inter-group relations) motives, since users use Twitter to convey ideas, criticize, disseminate information quickly, and argue with fellow social media users [8]-[11]. Indonesia has the Electronic Information and Transactions Law (UU ITE), which can be used to bring perpetrators of hate speech into the legal realm. Screening users before they socialize can be another way to reduce hate speech. Every year, cases of violations of the ITE Law continue to show significant dynamics, and the continued daily growth in the number of internet users suggests that such cases have the potential to keep increasing [12]-[16].
Text classification is a way to group text into predetermined categories or classes based on certain criteria. It is part of text mining and one way to predict the class category of text data. Classification methods often used for text data include the Support Vector Machine (SVM), Naive Bayes Classifier (NBC), Decision Tree, Neural Network, Logistic Regression, Random Forest, and K-Nearest Neighbor (KNN) methods [17]-[26]. Based on [27], the Improved K-Nearest Neighbor (Improved KNN) algorithm, a variant of the K-Nearest Neighbor method, can classify data with many labels quite well.
Apart from the classification method used, text classification is also influenced by the transformation method that converts a set of text data into numbers so that a classification algorithm can process it; this is called term weighting. Several term weighting methods have been used, including TF-IDF, TF-IDF-ICF, TF-IDF-ICSρF, TF-IDF-ICSρF-IHSρF, TF-IGF, TF-RF, and BM-25F [28]-[32]. Based on [30], the TF-IDF-ICSρF method is quite good at classifying data with many classes and unbalanced class sizes because the density of each word in each class is calculated. In that research, several variants of the TF-IDF method were used to classify hadith data with many classes, and TF-IDF-ICSρF obtained the highest accuracy among them, namely 88%.
Based on several studies of text data classification, the methods applied at each step, from data selection to weighting, greatly influence the accuracy obtained. The Improved K-Nearest Neighbor algorithm was chosen for this study because its k-value is not fixed for each class, so it classifies data spread over various classes more efficiently. The difference this research proposes is the use of the TF-IDF-ICSρF weighting method to classify the text of hate speech sentences on Twitter. The general reason for carrying out this study is to reduce the large amount of hate speech being spread through social media platforms and so create comfort among social media users; the specific reason is to test a text classification model for hate speech on Twitter using the Improved KNN algorithm with TF-IDF-ICSρF term weighting, so as to learn how good a model this method produces.

METHODS
The processing carried out in this study is explained in this section: dataset preparation, a series of text preprocessing stages to clean the data of anything that interferes with processing, transformation of the data into numerical form using a term weighting method, and finally data mining and evaluation. The stages of this study are visualized in Figure 1.

Dataset
This study uses the Indonesian multi-label hate speech and abusive language detection dataset from [12]. The dataset is visualized in Figure 2, which shows its data distribution. The dataset totals 13,169 tweets containing hate speech, abusive sentences, or neither, divided into three types of labeling: by level, target, and category. Labels based on level are divided into three classes: weak, medium, and strong. Labels based on target are divided into two classes: hate speech aimed at individuals and at groups. Labels based on category are divided into five classes: religion, race, physical, gender, and others (invective or slander).

Figure 2. Multi-Label Hate Speech and Abusive Language Dataset visualization
The type of labeling used in this study is the category-based one, with five classes: religion, race, physical, gender, and others (invective or slander). Sample data are presented in Table 1. The data of this type totals 5,401 tweets, but some of these tweets have more than one class category. This study uses only the data with a single class category so that the classification is more specific; of these, 1,000 tweets were used for processing.

Text Preprocessing
Text preprocessing is the stage of changing unstructured data into structured data, with the aim of making the data easier to process at the next stage. Figure 3 shows the text preprocessing stages that are often carried out. The stages are as follows:
1) Cleaning is the process of removing attributes unrelated to the data's information, such as hashtags, URLs, mentions, and emoticons.
2) Case folding changes all letters to lowercase; it only processes the alphabetic letters "a" to "z".
3) Tokenizing separates the text into the individual words that make it up; each word will later be given a weight based on its value.
4) Normalization changes misspelled or non-standard words into standard language, such as the slang word "ngeliat" into its standard form meaning "to see."
5) Stemming changes all words into their base words, following the rules of morphology in Indonesian.
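The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation (which uses PHP with the Sastrawi library for stemming), and the one-entry normalization dictionary is a hypothetical example.

```python
import re

# Hypothetical slang-to-standard map; a real system would load a full dictionary.
NORMALIZATION = {"ngeliat": "melihat"}

def preprocess(tweet):
    # 1) Cleaning: remove mentions, hashtags, URLs, then all non-letter symbols
    text = re.sub(r"(@\w+|#\w+|https?://\S+)", " ", tweet)
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # 2) Case folding: lowercase every letter
    text = text.lower()
    # 3) Tokenizing: split the sentence into its individual words
    tokens = text.split()
    # 4) Normalization: map non-standard words to their standard forms
    tokens = [NORMALIZATION.get(t, t) for t in tokens]
    # 5) Stemming (reducing each token to its Indonesian base word, e.g. with
    #    Sastrawi) is omitted here to keep the sketch dependency-free.
    return tokens
```

For example, `preprocess("@user ngeliat berita ini!! #trending")` yields `['melihat', 'berita', 'ini']`.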

TF-IDF-ICSρF
TF-IDF-ICSρF is a development of the TF-IDF method. TF-IDF is generally used to determine the relationship of a word (term) to a document or sentence by assigning a weight to each word: it combines the frequency of a term in a document (Term Frequency) with the Inverse Document Frequency of the term across the document collection. The TF-IDF calculation is explained by the following equations:
1) The Term Frequency value $tf(t, d)$ is the frequency of appearance of term $t$ in document $d$.
2) The Inverse Document Frequency value is the logarithm of the number of documents $n$ divided by the number of documents $df(t)$ containing term $t$:
$$idf(t) = \log \frac{n}{df(t)}$$
3) The TF-IDF weight $W(t, d)$ is obtained by multiplying the TF value by the IDF:
$$W(t, d) = tf(t, d) \times idf(t)$$
ICSρF calculates document density in the category space for each term; it is multiplied by TF-IDF to produce TF-IDF-ICSρF. This term weighting method is used to improve classification performance, since identical documents associated with a particular term may be concentrated in a particular category. TF-IDF-ICSρF begins by calculating the class density $C\rho(t_i, c_k)$, namely the proportion of the $N_{c_k}$ documents in category $c_k$ that contain term $t_i$:
$$C\rho(t_i, c_k) = \frac{df(t_i, c_k)}{N_{c_k}}$$
This continues with the category-space density $CS\rho(t_i)$, namely the sum of the densities over all $|C|$ existing classes:
$$CS\rho(t_i) = \sum_{k=1}^{|C|} C\rho(t_i, c_k)$$
The category-space density is then inverted:
$$ICS\rho F(t_i) = \log\left(\frac{|C|}{CS\rho(t_i)} + 1\right)$$
The final step multiplies the inverse category-space density of term $t_i$ by TF-IDF:
$$W(t_i, d_j) = tf(t_i, d_j) \times \log\frac{n}{df(t_i)} \times \log\left(\frac{|C|}{CS\rho(t_i)} + 1\right)$$

Improved KNN
Improved K-Nearest Neighbor (Improved KNN) is KNN with a different k-value for each class, unlike standard KNN classification, which uses a fixed k-value. The distribution of documents over the classes in the training data set is generally uneven; some classes may have more samples than others, so a fixed value of k may cause classes with a larger number of samples to be selected even when this is not correct. Using $k$ initial nearest neighbors, the probability that a document belongs to a particular class is determined from $n(c_m)$ nearest neighbors of that class, where $n(c_m)$ (the new k-value for class $c_m$) is derived from the initial $k$ in proportion to the size of that class in the training data set:
$$n(c_m) = \left\lfloor k \times \frac{N(c_m)}{\max_{1 \le j \le |C|} N(c_j)} \right\rfloor$$
where $N(c_j)$ is the number of training documents in class $c_j$. Next, the probability that test document $x$ belongs to each available class is calculated from its similarity to the $n(c_m)$ most similar training documents of that class, denoted $\mathrm{KNN}_{n(c_m)}(x, c_m)$:
$$p(x, c_m) = \frac{1}{n(c_m)} \sum_{d_i \in \mathrm{KNN}_{n(c_m)}(x,\, c_m)} \mathrm{sim}(x, d_i)$$
The test document is assigned to the class with the highest probability.

K-Fold Cross Validation
K-Fold Cross Validation is a method that can check for overfitting in a model. Overfitting is a large deviation when predicting data, while model fit is essential in classification and regression problems; this aspect shows how well a model predicts new data on which it was not previously trained. Figure 4 illustrates the process of implementing K-Fold Cross Validation.
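As a concrete illustration, the term weighting and per-class neighbor selection described above can be sketched in Python on a toy corpus. This is a hedged reconstruction, not the paper's code (the paper's implementation uses PHP with the PHPML library); the function names, the cosine similarity measure, and the mean-similarity scoring of each class's top neighbors are assumptions made for the sketch.

```python
import math
from collections import Counter

def fit_tf_idf_icsrhof(train_docs, labels):
    """train_docs: lists of tokens; labels: class label per doc.
    Returns a function mapping a token list to {term: weight}."""
    n = len(train_docs)
    classes = sorted(set(labels))
    df = Counter()                                   # docs containing term t
    for doc in train_docs:
        df.update(set(doc))
    by_class = {c: [d for d, l in zip(train_docs, labels) if l == c]
                for c in classes}

    def cs_rho(t):
        # Category-space density: sum over classes of the fraction of
        # that class's documents containing term t
        return sum(sum(t in d for d in docs) / len(docs)
                   for docs in by_class.values())

    def weigh(doc):
        tf = Counter(doc)
        w = {}
        for t, f in tf.items():
            if t not in df:
                continue                             # unseen term: skip
            idf = math.log(n / df[t])                # idf(t) = log(n / df(t))
            icsrhof = math.log(len(classes) / cs_rho(t) + 1)
            w[t] = f * idf * icsrhof                 # tf x idf x ICSrhoF
        return w
    return weigh

def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def improved_knn(train_weights, labels, test_weight, k=3):
    """Improved KNN: the neighbor count n_c is scaled per class by class
    size, instead of the single fixed k of standard KNN."""
    sims = [(cosine(test_weight, w), l) for w, l in zip(train_weights, labels)]
    counts = Counter(labels)
    largest = max(counts.values())
    scores = {}
    for c, size in counts.items():
        n_c = max(1, math.floor(k * size / largest))  # class-specific k
        top = sorted((s for s, l in sims if l == c), reverse=True)[:n_c]
        scores[c] = sum(top) / len(top)               # mean neighbor similarity
    return max(scores, key=scores.get)
```

A small class is allotted proportionally fewer neighbors, so a large class cannot win simply by flooding the fixed-k neighborhood.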

Figure 4. K-fold cross validation
In K-Fold Cross Validation, the data is divided into k parts of equal or nearly equal size. The model's accuracy is tested using the test data of each fold in turn, continuing to the next fold until every part has served as test data. The accuracies are then totaled and divided by k.
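The fold construction described above can be sketched as follows, assuming simple contiguous splits; the function names are illustrative, and the number of samples need not divide evenly by k.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))      # held-out fold
        test_set = set(test)
        train = [i for i in range(n_samples) if i not in test_set]
        yield train, test
        start += size

def cross_validate(evaluate_fold, n_samples, k=10):
    """evaluate_fold(train_idx, test_idx) -> accuracy; return the mean over k."""
    scores = [evaluate_fold(tr, te) for tr, te in k_fold_indices(n_samples, k)]
    return sum(scores) / k
```

With k=10, as in this study, each tweet is used as test data exactly once and as training data nine times.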

Confusion Matrix
A confusion matrix is a tool for analyzing how well a classification model recognizes tuples from different classes. It represents the predictions versus the actual conditions of the data produced by machine learning, and it provides comparative information on classification results, such as the accuracy, precision, recall, and F-measure values. The illustration is shown in Figure 5. Its four values are: TP (True Positive), positive data correctly predicted as positive; TN (True Negative), negative data correctly predicted as negative; FP (False Positive), negative data incorrectly predicted as positive; and FN (False Negative), positive data incorrectly predicted as negative.
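The metrics derived from these four counts can be computed directly. This sketch scores one class of a multi-class prediction in a one-vs-rest fashion; the helper name and example labels are illustrative assumptions.

```python
def binary_metrics(y_true, y_pred, positive):
    """Confusion-matrix counts and derived metrics for one target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # hits
    fp = sum(t != positive and p == positive for t, p in pairs)  # false alarms
    fn = sum(t == positive and p != positive for t, p in pairs)  # misses
    tn = len(pairs) - tp - fp - fn                               # correct rejections
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For a five-class problem like this study's, the per-class scores would then be averaged to obtain the reported precision, recall, and f1-score values.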

RESULTS AND DISCUSSIONS
This section explains the results of processing the data with the planned method. The method and its visualization were implemented using the PHP and JavaScript programming languages, with the PHPML library assisting the machine learning implementation and the Sastrawi library used for stemming in the preprocessing process. After the tweet data was prepared, it entered the text preprocessing stage to clean the tweet corpus. Table 2 shows the text preprocessing results for sample data that has gone through the cleaning, case folding, tokenizing, normalization, and stemming processes. Once the data was ready to be processed, it entered the transformation stage using the term weighting methods, in this case TF-IDF and TF-IDF-ICSρF, which produce a weight for each term contained in the corpus. Table 3 shows an example of the weight produced for each word. From this data transformation, 2,601 terms were extracted; in other words, 2,601 features represent each corpus row. The data was then processed using the k-fold cross-validation method with a value of k=10 under two schemes, the first using TF-IDF with KNN and the second using TF-IDF-ICSρF with Improved KNN. After the results of each scheme had been obtained, the next step was the evaluation process of comparing the averages of these results. Table 6 shows comparative statistics of the results, indicating that the TF-IDF-ICSρF term weighting method with Improved KNN produces higher accuracy values when classifying data with a large number of classes whose sizes are not balanced. The comparison shows that the TF-IDF-ICSρF term weighting method and the Improved KNN classification algorithm are quite effective for classifying text data with many class labels, in this case five classes. Several previous studies have advantages and disadvantages in how their methods assign data to the existing classes. Some use a term weighting method that carries no class-specific weights but apply a classification algorithm that accounts for class structure; some use a term weighting method that represents class weights but do not implement such an algorithm; some have used both, but on data without many classes. In this study, a term weighting method that calculates each class's weight representation is combined with an algorithm that can solve multiclass problems, and on data with many classes this series of methods proves quite effective.

CONCLUSION
A study on the classification of hate speech sentences using the Improved K-Nearest Neighbor and TF-IDF-ICSρF algorithms has been carried out successfully and obtained good results. Using the Improved K-Nearest Neighbor and TF-IDF-ICSρF algorithms to classify hate speech sentences yielded an average accuracy of 88.11%, 17.81% higher than the 70.30% obtained by the K-Nearest Neighbor and TF-IDF algorithms with the same data and parameters. Based on these accuracy results, it can be concluded that the Improved K-Nearest Neighbor and TF-IDF-ICSρF algorithms can effectively classify hate speech sentences with quite good accuracy. Future studies are expected to use more relevant data, with class labels assigned according to linguistic and other specified rules, and to apply the feature extraction method so that processing remains effective with larger datasets, since more data means more extracted features and therefore longer processing time.

Table 2. Text preprocessing results

Table 3. Weight of each term

In the first scheme, the highest accuracy was obtained from the eighth fold, 76.53%, with precision, recall, and f1-score values of 77.82%, 76.27%, and 76.64%; the lowest accuracy came from the seventh fold, 62.24%, with precision, recall, and f1-score values of 64.35%, 63.92%, and 63.96%. In the second scheme, the highest accuracy was obtained from the tenth fold, 92.09%, with precision, recall, and f1-score values of 94.96%, 91.32%, and 92.41%; the lowest accuracy came from the seventh fold, 82.90%, with precision, recall, and f1-score values of 85.70%, 82.42%, and 83.43%. Table 4 and Table 5 show the details of the results of the two schemes.

Table 6. Comparison of average results