Sentiment Analysis for Detecting Cyberbullying Using TF-IDF and SVM

Social media has become a primary means of communication in the digital era. Children and adults alike use social media extensively to interact with others, shifting conventional communication into digital form. This shift brings a serious problem: acts of cyberbullying are increasingly common. Cyberbullying can harm victims psychologically, causing depression and even suicide, and its dangers are a growing concern in the community. This study therefore analyzes the comments posted on social media platforms to determine their sentiment value. The comment data are processed in a sentiment analysis pipeline consisting of preprocessing, Term Frequency-Inverse Document Frequency (TF-IDF) weighting, and classification with the Support Vector Machine (SVM) method. The data set consists of 1500 comments collected by crawling with Python libraries and divided into 80% training data and 20% testing data. In testing, the model achieved an accuracy of 93%, a precision of 95%, and a recall of 97%. This research also includes a system model design in which the system can be integrated with a browser to open a user page showing the classification of comments that have been input into the system.


Introduction
For decades, the internet has been a part of life that can dynamically change the nature of a person, whether child or adult [1], [2]. The internet is a network that connects information and communication globally, and it is also an alternative way to obtain information directly from its sources [3]. The rapid growth of social networks has changed the meaning of friendship, relationships, and social communication. People interact through social media such as Facebook, Twitter, Myspace, and YouTube, which are accessed simultaneously [4]. With this rapid growth, cyberbullying has become one of the serious problems in social networks, especially for teenagers and adults [2]. Cyberbullying is defined as an aggressive and deliberate act committed by a group or individual, using electronic forms of contact repeatedly or over time, against a victim who cannot easily defend himself or herself [5]. People have begun to realize that the incidence of cyberbullying has increased in recent decades, and some research shows that half of teenagers and society experience cyberbullying [6]. The effects of cyberbullying contribute to depressive stress, decreased self-esteem, despair, and suicidal desire among adolescents [7].
Social media lets users communicate not only through text but also through images and video. Such content spreads quickly and widely across the internet. This capability creates many opportunities, but there are concerns that increased online activity could lead to deliberate crime and harassment such as cyberbullying. Social media apps are already very popular, and as the popularity of these platforms grows, so does the cyberbullying that occurs through them [8], [9]. This phenomenon has drawn particular attention from the public and from social media users, and it has motivated researchers to develop information technology for detecting cases of cyberbullying. To detect cyberbullying, researchers can apply data mining concepts to find text patterns, analyze text, and summarize useful information [10]. In research using the naïve Bayes method, cyberbullying words were classified into three categories, namely physical bullying, social bullying, and verbal bullying [11].
Many researchers have studied the identification of cyberbullying patterns in social media and on the internet using sentiment analysis, and a variety of approaches have been taken. Research conducted by Hernandez Li [12] used a semantic approach based on Spanish linguistic rules to detect the polarity of cyberbullying messages in the Spanish language. The data consisted of 60,798 Twitter messages, each given a sentiment label indicating whether the message was Very Positive, Positive, Neutral, Negative, Very Negative, or expressed no feeling. The sentiment analysis was able to distinguish these messages and found that 35.22% expressed no feelings, 34.12% expressed positive feelings, and 18.56% expressed negative feelings.
Research conducted by Dina [13] detected Twitter posts related to cyberbullying and sought the highest detection accuracy using data in the modern Arabic, Egyptian, and Arabic dialects. To improve accuracy, the researchers added two factors to the sentiment analysis, namely emoji and user history, which represents user data. The work comprised four stages: Twitter data collection, feature extraction (i.e., Twitter-based features, sentiment analysis, emotions, and user history), cyberbullying detection, and classification. The experiments showed an accuracy of 73% for negative cyberbullying words and 85% for neutral and positive data.
Research conducted by AlHarbi [14] proposed automatic detection of cyberbullying using sentiment analysis and a lexicon approach. The project used Java programming and data sets prepared for testing. Data were collected from Twitter APIs, Microsoft-Flow, and YouTube comments. Once the data were classified, the researchers applied PMI, Chi-square, and Entropy. On the tweet comments used, PMI performed best with a value of 81%, compared with 62.11% for Chi-square and 39.14% for Entropy. These results show that, of the three, the PMI approach provides the best performance for detecting cyberbullying.
Another study was conducted by Ahmad et al. [15], who performed a sentiment analysis of Twitter data using a Support Vector Machine. They analyzed the performance of SVM on two classified data sets, comparing recall, precision, and F-measure. The results show that SVM performance depends on the data set and the input, so recall, precision, and F-measure differed between the two data sets.
Based on the previous research, cyberbullying detection with sentiment analysis has been studied by Hernandez, Dina, AlHarbi, and Ahmad. However, these studies used different approaches and gave different results. This study is expected to establish the accuracy with which a system can identify cyberbullying comments and to serve as the best approach for detecting such comments on social media platforms. It examines cases of cyberbullying using comments from various social media, with Term Frequency-Inverse Document Frequency (TF-IDF) and Support Vector Machine (SVM). TF-IDF is a measure of the importance of a word in document data [16]. It is used to weight a term in any document within a corpus. The method combines two weighting concepts: TF (Term Frequency), which counts how often a word appears in a document, and IDF (Inverse Document Frequency), which reduces the weight of a term that appears in many documents across the corpus [17]. SVM can provide good classification accuracy compared to other data classification algorithms [18]. Introduced by Vapnik in 1992, SVM is a method for analyzing data and identifying patterns, used for classification and regression analysis. SVM finds the best hyperplane to separate two data classes by minimizing classification errors and maximizing the geometric margin [19]. This study applies TF-IDF and SVM to sentiment analysis data in order to distinguish the cyberbullying and non-cyberbullying classes, so that the accuracy of the SVM algorithm in cyberbullying cases can be tested. SVM yields good accuracy values and is applied in data classification and regression [20], [21].
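To make the margin idea concrete, the standard soft-margin formulation of a linear SVM is sketched below. The notation is the usual textbook convention and is an assumption here, not taken from the cited works: $w$ is the weight vector, $b$ the bias, $\xi_i$ the slack variables, $C$ the error penalty, and $y_i \in \{-1, +1\}$ the class labels, to which the 1/0 labels of this study would be mapped.

```latex
\min_{w,\, b,\, \xi} \ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n
```

Minimizing $\lVert w \rVert$ maximizes the geometric margin $2 / \lVert w \rVert$ between the two classes, while the slack terms penalize classification errors.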

Research Method
The study proceeded through several structured stages. In the first stage, researchers collected data sets from the Instagram accounts most widely discussed on the internet. The data are comments in the Indonesian language. Indonesian has developed considerably through language contact and borrowing between Indonesian and slang. Code-mixing and interference between Indonesian and slang are what drive the development of slang in the community [22]. The comments were taken from recent posts, and only one sample account was targeted in the experiment. Sample Instagram comments from one account can be seen in Table 1.

Data Labeling
For data labeling, the researchers prepared a questionnaire that was distributed to 100 respondents who were experts in proper and correct Indonesian. The sample consists of 1500 comments, each of which the respondents labeled 1 or 0. Label 1 means that the comment contains negative words and constitutes bullying; label 0 means that the comment contains positive words and does not constitute bullying. The data labeling can be seen in Table 2.
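The study does not state how the answers of several respondents are combined into one label per comment. A minimal sketch, assuming a simple majority vote (the function name, the tie rule, and the sample votes are all hypothetical):

```python
from collections import Counter

def majority_label(votes):
    """Return the label (0 or 1) chosen by most respondents.

    Ties go to label 1 (bullying); this tie rule is an assumption
    made only for this illustration.
    """
    counts = Counter(votes)
    return 1 if counts[1] >= counts[0] else 0

# Hypothetical votes from five respondents for two comments
votes_per_comment = {
    "comment_a": [1, 1, 0, 1, 0],
    "comment_b": [0, 0, 0, 1, 0],
}
labels = {cid: majority_label(v) for cid, v in votes_per_comment.items()}
print(labels)
```

Other aggregation rules, such as requiring unanimous agreement, would fit the same interface.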

Preprocessing
Text preprocessing is a stage that reduces several word forms to a single word form. Its main purpose is to obtain the key features of the collected documents in order to increase the relevance between words and documents and between words and classes [23]. The preprocessing stage consists of several steps, namely case folding, data cleansing, tokenizing, word replacing, stopword removal, and stemming [24]-[26]. The preprocessing of the data can be seen in Table 3 and Table 4.

Case Folding
At this stage, any uppercase letters in a comment are changed to lowercase, as can be seen in Table 3.

Data Cleansing
At this stage, all punctuation marks (delimiters), symbols, and emoticon codes are deleted from the comment, so that the text can be read clearly without symbols, as can be seen in Table 3.
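The case folding and data cleansing stages can be sketched together as follows. This is an illustration under stated assumptions rather than the study's implementation: the function names, the sample comment, and the decision to also strip URLs are hypothetical.

```python
import re

def case_fold(text):
    """Convert every character to lowercase."""
    return text.lower()

def clean(text):
    """Remove mentions, URLs, punctuation, and symbols from lowercased text."""
    text = re.sub(r"@\S+", " ", text)          # user mentions
    text = re.sub(r"https?://\S+", " ", text)  # links (an assumption)
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # punctuation, symbols, emoticon codes
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

sample = "@user Kamu JELEK banget!!! :( http://example.com"
print(clean(case_fold(sample)))
```

Because the character filter keeps only lowercase letters, digits, and whitespace, `clean` expects text that has already been case folded.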

Tokenizing
Tokenizing splits a sentence into individual word tokens at the whitespace between them, so that each comment word can undergo language normalization, as can be seen in Table 3.

Word Replacing
This stage converts non-standard words into standard language according to the Kamus Besar Bahasa Indonesia (KBBI), so that the comment words become language that the system can read. Table 4 shows the result of word replacing.
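One simple way to picture word replacing is a dictionary lookup over tokens. The sketch below assumes the comment has already been tokenized; the mapping `SLANG` and the function name are hypothetical, and the study's own slang dictionary is not reproduced here.

```python
# Hypothetical slang-to-standard mapping; a real dictionary would be far larger
SLANG = {"gak": "tidak", "yg": "yang", "udah": "sudah"}

def replace_slang(tokens, slang=SLANG):
    """Swap each token for its standard form when the dictionary has one."""
    return [slang.get(tok, tok) for tok in tokens]

print(replace_slang(["aku", "gak", "suka", "yg", "itu"]))
```

Tokens that are missing from the dictionary pass through unchanged, so the quality of this step depends entirely on dictionary coverage.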

Stopword Removal
This stage removes words that are not important in the comment, leaving only the meaningful words. For this purpose, the researchers referred to an Indonesian stopword dictionary obtained from various sources. The result can be seen in Table 4.
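Stopword removal amounts to filtering tokens against a set. A minimal sketch, assuming a tiny hypothetical stopword subset in place of the fuller Indonesian dictionary that the study refers to:

```python
# Hypothetical stopword subset, for illustration only
STOPWORDS = {"yang", "dan", "di", "itu", "ini"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [tok for tok in tokens if tok not in stopwords]

print(remove_stopwords(["aku", "tidak", "suka", "yang", "itu"]))
```

Using a set keeps each membership check constant-time, which matters once the stopword list grows to several hundred entries.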

Stemming
Stemming reduces affixed words in the comment data to their base words. The stemming stage can be seen in Table 4. Case folding, data cleansing, tokenizing, word replacing, stopword removal, and stemming together make up the preprocessing stage, the results of which can be seen in Table 5.

Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF is a numerical statistic that shows the relevance of keywords to the documents that have been collected, so that the documents can be identified or categorized. TF-IDF combines two terms, Term Frequency and Inverse Document Frequency. TF measures the number of times a term occurs in a document. IDF gives lower weight to words that occur frequently and larger weight to words that occur rarely [27]. At this stage, TF-IDF weighting is applied to each word that appears in the comments. The data in Table 6 is a sample of the TF-IDF calculation.
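The weighting described above can be written in one common textbook form. The exact variant used in the study is not stated, so the formula below is an assumption: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents that contain $t$.

```latex
\mathrm{tf}(t, d) = f_{t,d}, \qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}, \qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Under this form, a term that occurs in every document has $\mathrm{idf}(t) = \log 1 = 0$ and so contributes no weight, while a term found in only a few documents is weighted more heavily. Library implementations often add smoothing and normalization on top of this basic form.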

Result and Discussion
At this stage, researchers built a system that specifically detects cyberbullying comments from various social media sources. The system takes comment data from Twitter, Instagram, and Facebook.

Data Labeling
In the next stage, the collected dataset was labeled by 100 respondents through questionnaires. Label 1 marks comments that contain cyberbullying content, and label 0 marks comments that do not. Sample labels can be seen in Table 8.

Preprocessing
This stage normalizes the collected datasets for the data classification process. The first step is case folding, whose implementation can be seen in the case folding algorithm. The next step is data cleansing, which removes punctuation (delimiters) or symbols contained in comments. The following is a sample implementation of data cleansing.

Sample of data cleansing
    import re
    # Remove @mentions when the remove_mention flag is set
    if remove_mention:
        cleantext = re.sub(r'@\S+', '', cleantext)

The next stage is tokenizing, which splits the collected comments into individual word terms, as can be seen in the following tokenizing algorithm.

Sample of tokenizing algorithm
    import nltk
    # Split the cleaned text into tokens, then rejoin them with single spaces
    if tokenization:
        tokens = nltk.word_tokenize(cleantext)
        cleantext = ' '.join(tokens)

The word replacing stage changes non-standard language into standard language according to the KBBI. Slang words are replaced manually using a separate dictionary, as can be seen in the following word replacing algorithm.

Sample of word replacing
    # Replace each whole-word slang term with its formal equivalent
    for slang, formal in slangs.items():
        cleantext = re.sub(r'\b%s\b' % re.escape(slang), formal, cleantext)

Stopword removal deletes unnecessary or meaningless words. This stage requires a stopword dictionary library, so that unimportant words in a comment are deleted automatically, as can be seen in the sample stopword removal algorithm.

The last stage is stemming, which changes affixed comment words into base words. Stemming works by removing the prefix, the suffix, or a combination of prefix and suffix. This stage also requires a library for removing these affixes, and can be seen in the sample stemming algorithm.
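The prefix and suffix removal described for stemming can be illustrated with a deliberately naive sketch. It is not the library used in the study: the affix lists, the minimum stem length, and the function name are assumptions, and the sketch ignores the sound changes that real Indonesian affixation involves.

```python
# Hypothetical, incomplete affix lists for illustration only
PREFIXES = ("meng", "mem", "men", "me", "ber", "di", "ter")
SUFFIXES = ("kan", "an", "i")

def naive_stem(word, min_len=4):
    """Strip at most one listed prefix and one listed suffix.

    An affix is removed only if at least min_len characters remain,
    which keeps short base words such as "makan" intact.
    """
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word

print([naive_stem(w) for w in ["bermain", "dimakan", "makanan"]])
```

A production stemmer would check candidate stems against a base-word dictionary instead of relying on a length threshold.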

TF-IDF
This stage weights the words/terms and requires the sklearn.feature_extraction.text library. Here is an example of the TF-IDF algorithm.

Support Vector Machine (SVM)
At this stage, the weights computed for each comment in the TF-IDF stage are classified using SVM, as in the SVM algorithm sample.
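The classification step can be sketched with scikit-learn as follows. Several choices here are assumptions, since the study does not specify them: the linear kernel, the use of a pipeline, the stratified split, and the ten toy comments. Only the 80%/20% split ratio comes from the study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data: 1 = cyberbullying, 0 = not cyberbullying
comments = [
    "kamu bodoh", "dasar jelek", "kamu jelek bodoh", "bodoh sekali", "jelek sekali",
    "foto bagus", "kamu hebat", "bagus sekali", "hebat sekali", "foto kamu bagus",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# 80% training and 20% testing; stratify keeps both classes in each part
X_train, X_test, Y_train, Y_test = train_test_split(
    comments, labels, test_size=0.2, stratify=labels, random_state=0
)

# TF-IDF weighting followed by a linear-kernel SVM (kernel choice is an assumption)
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(X_train, Y_train)
prediction = model.predict(X_test)
print(list(zip(X_test, prediction)))
```

Fitting the vectorizer inside the pipeline means the TF-IDF vocabulary is learned from the training comments only, so the test comments do not leak into the weights.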

Data Testing Confusion Matrix
The final step in this research is to test the accuracy of the model using the confusion matrix. The confusion matrix is a method for calculating the accuracy of document classification so that algorithm learning can be maximized [29]. The goal is to validate the model produced at the training stage. The test produces a matrix of false negative, false positive, true negative, and true positive values, from which the accuracy, precision, and recall values are obtained [30]. The Python confusion matrix implementation uses the Sklearn library. The following is a sample of the program code for the confusion matrix process.
Sample of confusion matrix
    from sklearn.metrics import classification_report, confusion_matrix
    # Compare the true test labels with the model's predictions
    print(confusion_matrix(Y_test, prediction))
    print(classification_report(Y_test, prediction))

The dataset used in this study was taken from various social media platforms, namely Twitter, Facebook, and Instagram, totaling 1500 comments. Data labeling was done manually by several Indonesian language experts at various universities with Indonesian language majors. Labeling distinguishes comments containing cyberbullying from non-cyberbullying comments. Table 9 shows a data sample from the crawling process.

The results show an accuracy of 93%, a precision of 95%, and a recall of 97%. To complete this study, the researchers designed a system for this cyberbullying data. In this system, the user can input the data set analyzed in the previous stage. The tested data set is displayed in the system, and the user sees the classification results, consisting of usernames, comments, and labels.
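Accuracy, precision, and recall follow directly from the four confusion matrix counts. The sketch below uses the standard definitions; the function name is hypothetical and the counts are invented for illustration, so they are not the study's confusion matrix.

```python
def metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # share of all predictions that are right
    precision = tp / (tp + fp)                  # share of predicted positives that are right
    recall = tp / (tp + fn)                     # share of actual positives that are found
    return accuracy, precision, recall

# Hypothetical counts for illustration only
acc, prec, rec = metrics(tp=40, fp=10, tn=45, fn=5)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

Here the positive class is the cyberbullying label 1, so recall measures how many bullying comments the model catches and precision measures how many flagged comments are truly bullying.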

Conclusion
In this study, the data labeling, preprocessing, TF-IDF, and SVM process used 80% of the data for training and 20% for testing. The preprocessing stage consisted of six steps, namely case folding, data cleansing, tokenizing, word replacing, stopword removal, and stemming, which aim to reduce the comment words to clear standard words. The comment data collected comprised 1500 comments, with 1200 comments as training data and 300 comments as testing data. From the results of the experiments, it can be concluded that the model can separate comments containing cyberbullying from comments that do not, with high accuracy. The results obtained by the sentiment analysis method using data labeling, preprocessing, TF-IDF, and SVM differ in value from those of previous research. In particular, this study experimented with classified data and obtained a high degree of accuracy. Although previous studies used different methods, the results presented by each researcher can distinguish cyberbullying comments on social media. Based on this study, it can also be concluded that the SVM algorithm is applicable for classifying cyberbullying and non-cyberbullying comments in application programs.