Detecting spam comments on Indonesia’s Instagram posts

Abstract

In this paper we experimented with several feature sets for detecting spam comments on social media content authored by Indonesian public figures. We define spam comments as comments with promotional purposes (e.g. referring other users to products and services) that are thus unrelated to the content on which they are posted. Three sets of features are evaluated for detecting spam: (1) hand-engineered features such as comment length, number of capital letters, and number of emojis; (2) keyword features such as whether the comment contains advertising words or product-related words; and (3) text features, namely bag-of-words, TF-IDF, and fastText embeddings, each combined with latent semantic analysis. With 24,000 manually annotated comments scraped from Instagram posts authored by more than 100 Indonesian public figures, we compared the performance of these feature sets and their combinations using three popular classification algorithms: Naïve Bayes, SVM, and XGBoost. We find that using all three feature sets (with fastText embeddings for the text features) gave the best F1-score of 0.9601 on a holdout dataset. More interestingly, fastText embeddings combined with hand-engineered features (i.e. without keyword features) yielded a similar F1-score of 0.9523, and McNemar's test failed to reject the hypothesis that the two results are not significantly different. This result is important as keyword features are largely dependent on the dataset and may not be as generalisable as the other feature sets when applied to new data. For future work, we hope to collect a bigger and more diverse dataset of Indonesian spam comments, improve our model's performance and generalisability, and publish a programming package for others to reliably detect spam comments.


Introduction
As reported in [1], Instagram has 22 million monthly active users in Indonesia out of 500 million users worldwide. With 95 million photos and videos posted every day, Instagram has become a prominent photo and video sharing platform. However, just like e-mail, this high level of activity means that people can attract more attention even when they post totally irrelevant comments, which we know as spam.
Indonesia, in particular, has this problem where some users post advertisements as comments on public figures' posts. The problem has gone so far that the first 50 comments on a public figure's post can all be spam (e.g. the post with id BIjHGG3BUEK by the user @cita citata). Like e-mail spam, we believe that these comments can be very annoying and should be automatically filtered.

To the best of our knowledge, there has been no prior investigation of Instagram spam on Indonesian accounts. The problem might be platform-specific, but the insights from this study could hopefully be applied to similar cases, e.g. spam comments on blogs or e-mail spam, since the machine learning algorithms we used have been proven to work well in general. Our focus is less on finding the algorithm that works best on our dataset and more on identifying the salient features for detecting spam comments.

Related Work
While we might be able to detect spam to some extent using origin-based filters (e.g. based on IP or e-mail address), content-based filtering is more common in the case of e-mail spam [3]. The same applies to spam comments, as people can easily create new accounts for spamming purposes. However, content-based filtering also comes with some problems. For instance, spammers can obfuscate their messages (e.g. by writing "f r 3 3" instead of "free") to trick the filter [2]. Machine learning approaches, which can automatically filter spam by building adaptive models, have become more popular nowadays. According to [2], before the features are fed to the learning algorithms, the structure of a spam filter can be grouped into: (i) tokenization, which extracts the words in the message body; (ii) lemmatization, reducing words to their root forms; (iii) stop-word removal, eliminating words that occur in many messages; (iv) representation, which converts the set of words present in the message to the specific format required by the machine learning algorithm used.
Nevertheless, as that study also pointed out, not all of these steps are mandatory. As the name suggests, words are the features in content-based spam filtering. The problem is how to represent them properly for the algorithms. Some representations that have been used in content-based spam filtering are bag-of-words (BoW) [5], term frequency-inverse document frequency (TF-IDF) [6], and a binary representation of word occurrences [4]. In [4], the use of upper-case words was also considered as a feature.
Since using BoW and TF-IDF can result in a sparse matrix, we can reduce the dimension using Latent Semantic Analysis (LSA). As explained in [9], this method applies Singular Value Decomposition (SVD) to the matrix so that we can learn the "expected contextual usage of words in passages of discourse." LSA also makes the learning process faster.
More recent work [7] presents an improved version of the skip-gram model [8], "where each word is represented as a bag of character n-grams." This method enriches the word vectors of rare words with subword information. It also has the advantage of not needing any preprocessing of the data.

Features
We used several techniques for representing the comments as follows: (i) Binary Bag-of-Words with LSA; (ii) TF-IDF with LSA; (iii) Word2Vec using skip-gram model.
The first two representations were built using scikit-learn [10]. We set the minimum document frequency (the min_df parameter) to 5. We then reduced the dimension of the BoW and TF-IDF representations to 100 using LSA. The chosen dimension is the same as the default output dimension of the Word2Vec library we used in this study, i.e. fastText [7]. We did not apply any lemmatization or stop-word removal in our study.
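This step can be sketched with scikit-learn as follows; the toy corpus is ours, and min_df and the LSA dimension are scaled down so the small example works (the paper uses min_df=5 and 100 components on the full corpus):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Toy comments standing in for the scraped Instagram data
comments = [
    "cheap bags follow us",
    "cheap shoes follow us",
    "nice photo love it",
    "love this photo so much",
    "follow for cheap promo",
]

# Binary bag-of-words followed by LSA (truncated SVD)
bow_lsa = make_pipeline(
    CountVectorizer(binary=True, min_df=2),
    TruncatedSVD(n_components=3, random_state=0),
)
X = bow_lsa.fit_transform(comments)
print(X.shape)  # one 3-dimensional dense vector per comment
```

The TF-IDF variant is identical except that TfidfVectorizer replaces CountVectorizer.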
Word2Vec produces a 100-dimensional vector for each word in the document. We then took the average of all word vectors that constitute each document, yielding 100-dimensional document vectors. These vectors can then be used as features in our experiment.
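A minimal sketch of this averaging step; the random lookup table below is a stand-in for a trained fastText model, not the actual embeddings:

```python
import numpy as np

DIM = 100  # fastText's default output dimension

# Hypothetical word-vector lookup standing in for a trained fastText model
rng = np.random.default_rng(0)
vocab = {w: rng.standard_normal(DIM) for w in "follow us for cheap bags".split()}

def doc_vector(tokens, dim=DIM):
    """Average the vectors of a comment's known tokens (zeros if none)."""
    vecs = [vocab[t] for t in tokens if t in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

v = doc_vector("follow us for cheap bags".split())
print(v.shape)  # (100,)
```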
These features, alongside some numerical features that can be easily extracted from the text, e.g. the number of tokens, number of upper-case words, number of numerical characters, percentage of emojis, and the length of the text (see Table 1), were then passed to a number of machine learning algorithms. We call these the basic features in our experiment. We also utilised some hand-engineered keyword patterns (see Table 2), which are referred to in the rest of this paper as keyword features.
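The basic features can be computed directly from the raw comment string. The sketch below uses feature names and an approximate emoji pattern of our own choosing, not necessarily the paper's exact definitions:

```python
import re

# Rough emoji pattern (common emoji and symbol blocks; an approximation)
EMOJI = re.compile(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def basic_features(comment: str) -> dict:
    """Count-based features in the spirit of Table 1 (names are ours)."""
    tokens = comment.split()
    n_emoji = len(EMOJI.findall(comment))
    return {
        "n_tokens": len(tokens),
        "n_upper_words": sum(t.isupper() for t in tokens),
        "n_digits": sum(c.isdigit() for c in comment),
        "pct_emoji": n_emoji / max(len(comment), 1),
        "length": len(comment),
    }

feats = basic_features("CHEAP bags!! only 50k, DM us \U0001F525")
print(feats["n_tokens"], feats["n_digits"])  # 7 2
```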

Algorithms
In this study, we used Naïve Bayes (NB), Support Vector Machine (SVM) with an RBF kernel, and XGBoost as the classification algorithms. We based this selection on a generic review of machine learning algorithm performance in [11], in which SVM with an RBF kernel and Random Forest performed quite well in many cases. XGBoost [12], a decision-tree method that shares some similarities with Random Forest, such as column sampling and tree ensembles, became our choice of implementation because it has proven successful in many machine learning competitions. We also included Naïve Bayes as the baseline for our study because of its simplicity and efficacy in practice [13]. We held out 20% of the dataset to be used as the test set later in our study. We evaluated our models using cross-validation while concurrently tuning the hyperparameters, using a consistent random number generator; the hyperparameters we tested can be seen in Table 3.
The best models were then evaluated on the test set to see how well they perform in general.
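A sketch of this tune-then-test protocol with scikit-learn; the data is synthetic and the parameter grid is illustrative only (the actual grid is listed in Table 3):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the comment feature matrix and spam/ham labels
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Hold out 20% as the test set, as in the paper
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Cross-validated grid search over hyperparameters, scored by F1
search = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
    scoring="f1",
    cv=5,
)
search.fit(X_tr, y_tr)

# Final evaluation of the best model on the held-out set
test_score = search.score(X_te, y_te)
```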

Evaluation Metrics
Since the class distribution in the dataset is imbalanced, we used precision, recall, and F1-score to evaluate our models. These metrics are formulated as

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}},$$

where TP, TN, FP, FN denote true positives, true negatives, false positives, and false negatives respectively.
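In code, the three metrics follow directly from the confusion-matrix counts; the counts below are made up for illustration:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# e.g. 90 spam caught, 10 ham flagged as spam, 30 spam missed
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(p, 2), round(r, 2), round(f1, 4))  # 0.9 0.75 0.8182
```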
In this study, we focused on getting the highest F1-score, so most of the comparisons are based on F1-score. We report precision and recall only for the best model achieved.

Dataset
We collected 24,602 comments from 500 posts authored by 104 Indonesian public figures. We annotated the comments ourselves, checking each against its original post to judge its relevance. We categorised as spam the comments that promote websites or products. In the end, the dataset contains 22,743 ham and 1,859 spam comments.

Results
Our preliminary result using only the basic features listed in Table 1 was promising: an F1-score of 0.7775 with XGBoost. As Figure 1 shows, the logarithm of the number of characters can be a discerning feature by itself. We achieved an even better result using the hand-engineered keyword-pattern features (see Table 4). However, as spammers can alter the keywords in the future to trick the classifiers, we cannot rely solely on this result. Thus, we continued our experiments with the representations mentioned in subsection 3.1.
Using only word vectors extracted from the text, Table 5 shows that all the algorithms except Naïve Bayes managed to beat their counterparts trained on our hand-engineered keywords. Word vectors from fastText resulted in the best models among the three representations.
Our next effort was combining the features in Table 1 with the word vectors. We wanted to see how adding the basic features and our hand-engineered keyword patterns could improve the models. The results for this experiment can be seen in Table 6, where "+b" denotes that the basic features were added to the word vectors. We can see that adding our proposed basic features improved all the F1-scores. However, adding keyword patterns to word vectors with basic features did not improve the performance significantly. We confirmed this using McNemar's test, defined as

$$\chi^2 = \frac{(|b - c| - 1)^2}{b + c},$$

where b and c denote the numbers of discordant predictions between the two classifiers (test 1 negative and test 2 positive, and vice versa). The p-value we got from McNemar's test on the SVM classifiers is 0.22, which means the difference is not statistically significant (p > 0.05). This result indicates that the word vector representation with semantic analysis is sufficient to capture the important features of the text. Having said that, the best F1-score in our experiment, 0.9601, came from adding both the basic and the keyword-pattern features to SVM with fastText.
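A small sketch of McNemar's test on paired classifier predictions; the discordant counts b = 12 and c = 20 are invented for illustration and are not the paper's:

```python
from math import comb

def mcnemar(b: int, c: int):
    """McNemar's test from the two discordant counts b and c.

    Returns the chi-squared statistic (with continuity correction) and an
    exact two-sided binomial p-value.
    """
    n = b + c
    stat = (abs(b - c) - 1) ** 2 / n
    # Exact two-sided p-value: binomial test with p = 0.5
    k = min(b, c)
    p = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n)
    return stat, p

stat, p = mcnemar(b=12, c=20)  # invented counts
print(round(stat, 3), p > 0.05)
```

With these counts the difference between the two classifiers would not be significant at the 5% level, mirroring the finding above.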

Conclusions and Future Work
Our experiments show that employing fastText can produce more robust models. This technique also minimises the time needed for the feature-extraction step. However, the document vectors could still be improved by taking the maximum or signed-maximum value in each dimension of the word vectors instead of averaging them.
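The pooling alternatives mentioned here can be sketched as follows; the random matrix stands in for the word vectors of one comment:

```python
import numpy as np

rng = np.random.default_rng(0)
word_vecs = rng.standard_normal((5, 100))  # 5 word vectors for one comment

avg_doc = word_vecs.mean(axis=0)  # averaging, as used in this paper
max_doc = word_vecs.max(axis=0)   # plain max pooling per dimension

# Signed max: per dimension, keep the entry with the largest magnitude
idx = np.abs(word_vecs).argmax(axis=0)
signed_max_doc = word_vecs[idx, np.arange(word_vecs.shape[1])]

print(avg_doc.shape, max_doc.shape, signed_max_doc.shape)
```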
This study also corroborates the thorough investigation in [11], which found that SVM-RBF and Random Forest perform quite well in many cases. These algorithms combined with