Hate Speech Detection for Indonesian Tweets Using Word Embedding and Gated Recurrent Unit



INTRODUCTION
Social media as a means of communication can disseminate information quickly and widely; it serves not only as a means of friendship and information sharing, but also as a means of trading, disseminating government policies, running political campaigns, and religious preaching [1]. With the increasing activity of social media users, cyber crimes such as the dissemination of information containing hate speech become more likely. Hate speech on social media can take the form of written words that contain hatred and are aimed at individuals or groups to the detriment of the targeted party. Detecting hate speech is very important for analyzing the sentiments of certain groups towards other groups, so that unwanted actions can be prevented or minimized [2].
Detection of hate speech in the Indonesian language has been done before using bag-of-words features, namely word n-grams and character n-grams. The machine learning algorithms used for classification were Bayesian Logistic Regression, Naive Bayes, Support Vector Machine, and Random Forest Decision Tree. The highest F-measures were achieved using word n-grams, especially when combined with Random Forest Decision Tree (93.5%), Bayesian Logistic Regression (91.5%), and Naive Bayes (90.2%) [3]. Hate speech in Indonesian has also been detected using a backpropagation neural network with a combination of lexicon-based and bag-of-words features, with the highest accuracy obtained at 78.81% [4]. In this paper, we propose the combination of word embedding as our feature and Gated Recurrent Unit (GRU) as our classifier for hate speech detection in Indonesian tweets.

METHODS
In this section, we discuss the architecture and methods used to detect hate speech in Indonesian tweets. This research consists of three main stages: preprocessing, feature extraction, and classification, as can be seen in Figure 1. Each of these parts is described in the following subsections.

Preprocessing
The preprocessing stage is very important for obtaining the best classification model. Tweet preprocessing consists of several steps: 1) escaping HTML characters; 2) removal of punctuation; 3) splitting attached words; 4) case folding; 5) tokenization; 6) converting slang words; 7) removal of stop-words. Escaping HTML characters removes URL links and HTML entities that are often found in tweets. Removal of punctuation deletes special tokens that are common in tweets, such as hashtags (#), @user mentions, and retweet markers (RT), as well as ordinary punctuation marks. Splitting attached words is needed because text generated in social forums is completely informal in nature; many tweets contain compounds such as RainyDay or PlayingInTheCold, which can be split into their normal forms using simple rules and regular expressions. Case folding is the process of converting all characters into lowercase. Tokenization is the task of splitting text into smaller units. Converting slang words transforms the majority of slang words into standard words. The last step is stop-word removal: stop-words are uninformative words, which are removed based on an existing stoplist dictionary. This research uses the stop-word list from Rahmawan [5].
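To make these steps concrete, the following is a minimal Python sketch of the pipeline; the slang dictionary and stop-word set shown are illustrative stand-ins for the real resources (including the stop-word list from Rahmawan [5]):

import re

# Illustrative resources: in practice the slang dictionary and the
# stop-word list from Rahmawan [5] would be loaded from files.
SLANGWORDS = {"gak": "tidak", "yg": "yang"}
STOPWORDS = {"yang", "di", "ke", "dan"}

def preprocess(tweet):
    # 1) Escape HTML characters: drop URLs and HTML entities
    text = re.sub(r"https?://\S+", " ", tweet)
    text = re.sub(r"&\w+;", " ", text)
    # 2) Remove Twitter tokens (RT, @user, #) and punctuation
    text = re.sub(r"\bRT\b|@\w+|#", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    # 3) Split attached words such as "RainyDay" -> "Rainy Day"
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    # 4) Case folding
    text = text.lower()
    # 5) Tokenization
    tokens = text.split()
    # 6) Convert slang words into standard words
    tokens = [SLANGWORDS.get(t, t) for t in tokens]
    # 7) Stop-word removal
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("RT @user: Gak suka yg begini https://t.co/xyz"))
# -> ['tidak', 'suka', 'begini']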

Feature Extraction
Word2vec is the word vector representation defined by Mikolov et al. [8]. The main component for generating vector values in word2vec is an artificial neural network built from the CBOW and Skip-gram architectures. Before word2vec can represent the vector value of each word, it first builds a model of the word distribution by training on Indonesian documents collected from Wikipedia; the number of documents used is 1,120,973. Building the word2vec feature model involves three processes: the vocabulary builder, the context builder, and the neural network. Figure 2 shows the three processes in the word2vec model building.

Vocabulary builder
The vocabulary builder is the first building block of the word2vec model. It takes raw text data, mostly in the form of sentences, and builds the vocabulary of the text corpus by collecting all the unique words in it. In this research, the data used by the vocabulary builder are the documents downloaded from Wikipedia. The result of the vocabulary builder process is a dictionary of words with a word index and the occurrence count of each word [7].
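As an illustration, a minimal vocabulary builder can be sketched in a few lines of Python; the toy corpus here is only for demonstration:

from collections import Counter

def build_vocabulary(sentences):
    # Count the occurrences of every unique word in the corpus.
    counts = Counter(word for sent in sentences for word in sent)
    # Assign each unique word an integer index.
    word_index = {word: i for i, (word, _) in enumerate(counts.most_common())}
    return word_index, counts

corpus = [["saya", "suka", "membaca"], ["saya", "suka", "menulis"]]
word_index, counts = build_vocabulary(corpus)
print(word_index)      # {'saya': 0, 'suka': 1, 'membaca': 2, 'menulis': 3}
print(counts["saya"])  # 2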

Context builder
The context builder uses the output of the vocabulary builder. It is a process that finds the relationship between the appearance of one word and the other words around it, using the concept of a context window, also called a sliding window. In general, the size of context windows in NLP is 5 to 8 neighboring words; if we choose a window size of 5, then the 5 words on each of the left and the right of the center word are taken as context. In this research, the context window size used is 5. Table 1 gives an example of a context window with a window size of 1; the underlined word is the center word. The results of the context builder are used in the next process, the neural network.
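The sliding window can be sketched as follows; context_pairs is an illustrative helper, not code from the paper:

def context_pairs(tokens, window=5):
    # Pair each center word with up to `window` neighbors on each side.
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Window size of 1, as in the Table 1 example:
print(context_pairs(["saya", "suka", "membaca"], window=1))
# [('saya', 'suka'), ('suka', 'saya'), ('suka', 'membaca'), ('membaca', 'suka')]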

Neural networks (CBOW and Skip-Gram architecture)
Word2vec uses an artificial neural network formed from the CBOW and Skip-gram architectures. This neural network is trained so that each word can be represented by a vector. The network architecture consists of 3 layers: an input layer, a hidden layer, and an output layer [8]. In this research, the hidden layer contains 200 neurons and the output layer has the same size as the input layer. The input to the network is each word converted into a one-hot encoding. Figure 3 shows the neural network architecture used to generate word2vec.
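The paper does not name the word2vec implementation it used; as one common possibility, a model with the parameters above could be trained with gensim (4.x API), assuming wiki_sentences holds the tokenized Wikipedia corpus:

from gensim.models import Word2Vec

# wiki_sentences is assumed to hold the tokenized Wikipedia corpus
# (1,120,973 documents) described above.
model = Word2Vec(
    sentences=wiki_sentences,
    vector_size=200,  # 200 hidden-layer neurons, as in this research
    window=5,         # context window size used in this research
    sg=1,             # 1 = Skip-gram, 0 = CBOW
    min_count=1,
)
vector = model.wv["indonesia"]  # a 200-dimensional word vector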

Classification
This research used GRU to detect hate speech in the Indonesian language. GRU is a variation of the LSTM that is simpler than the LSTM and, in some cases, produces equally excellent results. Like the LSTM, the GRU (Gated Recurrent Unit) aims to solve the vanishing gradient problem that comes with a standard recurrent neural network. The GRU combines the forget gate and input gate into a single update gate and has an additional reset gate, as shown in Figure 4. The GRU is increasingly popular and is widely used to solve NLP problems [9]. The update gate $z_t$ decides how much of the previous units must be kept, as can be seen in Equation (2):

$$z_t = \sigma(W_z x_t + U_z h_{t-1}) \qquad (2)$$

When $x_t$ is plugged into the network unit, it is multiplied by its own weight $W_z$. The same goes for $h_{t-1}$, which holds the information of the previous $t-1$ units and is multiplied by its own weight $U_z$.

The reset gate $r_t$ is used by the model to decide how much of the past information to forget, as can be seen in Equation (3). This function has the same form as the one for the update gate $z_t$; the difference lies in the weights $W_r$, $U_r$ and in how the gate is used:

$$r_t = \sigma(W_r x_t + U_r h_{t-1}) \qquad (3)$$

The candidate output activation $\tilde{h}_t$ calculates the value of the unit before it is decided whether to update it or not, where $\circ$ denotes the element-wise Hadamard product. The candidate output activation can be seen in Equation (4):

$$\tilde{h}_t = \tanh(W x_t + U(r_t \circ h_{t-1})) \qquad (4)$$
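As an illustration of the classifier described here, a two-layer GRU model might be assembled in Keras as follows; the framework, the Adam optimizer, and the variables vocab_size and embedding_matrix (the word2vec vectors arranged by word index) are assumptions, not details taken from the paper:

import tensorflow as tf

# vocab_size and embedding_matrix (one 200-dim word2vec vector per
# vocabulary index) are assumed to be prepared beforehand.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 200,
                              weights=[embedding_matrix], trainable=False),
    tf.keras.layers.GRU(200, return_sequences=True),  # first GRU layer
    tf.keras.layers.GRU(200),                         # second GRU layer
    tf.keras.layers.Dense(1, activation="sigmoid"),   # hate / non-hate
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="binary_crossentropy", metrics=["accuracy"])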

RESULTS AND DISCUSSION
This research used Twitter hate speech data in the Indonesian language that have been collected and labelled by [3]. The dataset contains 713 tweets, of which 260 are labeled as hate speech and 453 as non hate speech.

A. Comparison Word2vec with TF and TF-IDF
In the first experiment, we compare the word2vec feature with TF and TF-IDF to find out the ability of word2vec as a feature in the classification model. The supervised algorithms used in this experiment are Support Vector Machine, Naive Bayes, Bayesian Logistic Regression, and Random Forest. The experiment is carried out based on the assumption that word2vec has a better ability to detect hate speech compared to the other features, namely TF and TF-IDF. The results can be seen in Table 2. The table shows that the highest accuracy of the word2vec feature is achieved using the random forest algorithm, with an accuracy of 79.80%. This is lower than the accuracy achieved with TF and TF-IDF on all algorithms. These baseline experiments show that, for the classical algorithms used in this research, the word2vec feature results in lower accuracy than the TF and TF-IDF features.
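The paper does not state how the word2vec vectors were aggregated into one feature vector per tweet; a common assumption is to average them, which the following hedged sketch uses together with scikit-learn (tweets, raw_tweets, and labels are assumed to hold the preprocessed dataset):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

def tweet_vector(tokens, wv, dim=200):
    # Average the word2vec vectors of the words in one tweet.
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# tweets (token lists), raw_tweets (strings), and labels are assumed
# to come from the preprocessed dataset of 713 tweets.
X_w2v = np.array([tweet_vector(t, model.wv) for t in tweets])
X_tfidf = TfidfVectorizer().fit_transform(raw_tweets)

rf = RandomForestClassifier()
print(cross_val_score(rf, X_w2v, labels, cv=10).mean())    # word2vec feature
print(cross_val_score(rf, X_tfidf, labels, cv=10).mean())  # TF-IDF feature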

B. Determining Learning Rate
Experiments to determine the learning rate were performed with a single GRU layer, setting the number of neurons to 200 and the number of epochs to 100. The learning rate should be neither too large nor too small: a learning rate that is too large can prevent the learning process from converging to an optimal solution, while a learning rate that is too small makes the training process unnecessarily slow. The learning rate value is set to 0.001, 0.0001, and 0.00001; the effect on the loss value is shown in Figure 5.
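A sketch of such a sweep, reusing the Keras setup assumed earlier (vocab_size, embedding_matrix, X_train, and y_train are assumed inputs):

for lr in (0.001, 0.0001, 0.00001):
    # Rebuild the single-layer GRU from scratch for each candidate rate.
    m = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 200,
                                  weights=[embedding_matrix], trainable=False),
        tf.keras.layers.GRU(200),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    m.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
              loss="binary_crossentropy", metrics=["accuracy"])
    history = m.fit(X_train, y_train, epochs=100, verbose=0)
    print(lr, history.history["loss"][-1])  # compare final loss values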

C. Determining the number of neurons in the hidden layer
This experiment is used to determine the optimal number of neurons in the hidden layer, with the learning rate set to 0.001. We use GRU architectures with 1 and 2 layers. The numbers of neurons in the hidden layer to be tested are 128, 200, 250, and 300. The results of our experiment can be seen in Figure 6. The results show that adding or removing neurons can affect the accuracy of the model. The initial accuracy of the GRU with 1 layer is 90.28% and increases as the number of neurons grows. In contrast, the GRU with 2 layers obtains its highest accuracy of 92.96% when the number of neurons is 200; adding more neurons does not increase the accuracy.
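A hedged sketch of this sweep, parameterizing the earlier Keras setup by the number of layers and neurons (X_val and y_val are an assumed held-out split):

def build_model(n_units, n_layers, lr=0.001):
    # Stack n_layers GRU layers with n_units neurons on the embedding.
    layers = [tf.keras.layers.Embedding(vocab_size, 200,
                                        weights=[embedding_matrix],
                                        trainable=False)]
    for i in range(n_layers):
        layers.append(tf.keras.layers.GRU(n_units,
                                          return_sequences=i < n_layers - 1))
    layers.append(tf.keras.layers.Dense(1, activation="sigmoid"))
    m = tf.keras.Sequential(layers)
    m.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
              loss="binary_crossentropy", metrics=["accuracy"])
    return m

for n_layers in (1, 2):
    for n_units in (128, 200, 250, 300):
        m = build_model(n_units, n_layers)
        m.fit(X_train, y_train, epochs=100, verbose=0)
        print(n_layers, n_units, m.evaluate(X_val, y_val, verbose=0)[1])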

D. Overall model performance
Table 4 shows the best performance of the GRU and the classical algorithms used in our experiment. All of the values in this table are averages over 10-fold cross validation on the 713 training data. The best GRU performance is achieved by the GRU with 2 layers, a learning rate of 0.001, and 200 neurons in the hidden layer, which has an accuracy of 92.96%. This shows that the GRU model performs better because it is built with an update gate and a reset gate that can keep or discard previous data. These gates allow the GRU model to combine information from previous time steps with the current input, which increases its accuracy in determining the class of a tweet.
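The cross-validation protocol might look like the following sketch, reusing the assumed build_model helper from the previous sketch; the stratified split and the random seed are assumptions:

from sklearn.model_selection import StratifiedKFold

# X (padded token-index sequences) and y (0/1 labels) are assumed.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
accuracies = []
for train_idx, test_idx in skf.split(X, y):
    m = build_model(n_units=200, n_layers=2, lr=0.001)
    m.fit(X[train_idx], y[train_idx], epochs=100, verbose=0)
    accuracies.append(m.evaluate(X[test_idx], y[test_idx], verbose=0)[1])
print(sum(accuracies) / len(accuracies))  # average accuracy over 10 folds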

Figure 3. Neural network input and output structures to create word2vec

Figure 5. The effect of learning rate on loss value

Figure 6. Effect of the number of hidden layer neurons on accuracy

Table 2. Comparison of word2vec against TF and TF-IDF