Site Agnostic Approach to Early Detection of Cyberbullying on Social Media Networks

The rise in the use of social media networks has increased the prevalence of cyberbullying, and time is paramount to reduce the negative effects that derive from those behaviours on any social media platform. This paper studies the early detection problem from a general perspective by carrying out experiments over two independent datasets (Instagram and Vine), exclusively using users' comments. We used textual information from comments over baseline early detection models (fixed, threshold, and dual models) to apply two different methods of improving early detection. First, we evaluated the performance of Doc2Vec features. Second, we introduced multiple instance learning (MIL) into the early detection models and assessed its performance. We applied time-aware precision (TaP) as an early detection metric to assess the performance of the presented methods. We conclude that the inclusion of Doc2Vec features improves the performance of baseline early detection models by up to 79.6%. Moreover, multiple instance learning shows an important positive effect for the Vine dataset, where smaller post sizes and less use of the English language are present, with a further improvement of up to 13%, but no significant enhancement is shown for the Instagram dataset.


Introduction
The use of social networks has been continuously increasing, with millions of people communicating through them every day. Although social networks are powerful tools that have a positive effect on many users, they also have the potential to negatively affect many others. Smith et al. in [1] define cyberbullying as "an aggressive, intentional act carried out by a group or individual using electronic forms of contact, repeatedly or over time against a victim that cannot easily defend him or herself", which has a special impact among teenagers due to their mistaken threat perception [2,3].
The negative effects of cyberbullying can range from lower school performance, absenteeism, truancy, dropping out, and violent behaviour [4] to potentially devastating psychological effects such as depression, low self-esteem, suicide ideation, and even suicide [5]. Therefore, the early detection of cyberbullying in social networks has become a critical task to mitigate and reduce the negative effects on its victims [6]. Moreover, the frequency and high propagation rate allowed by technology make it crucial to identify a cyberbullying scenario in order to stop the aggression and support the victims, as well as to identify the aggressors.
The main aim of this work is to improve the early detection of cyberbullying on social networks, focusing only on users' comments, in order to provide a general approach that could potentially be applied to any social media. As presented in [6], the combined use of comment and user features has been explored; here, we take those results as a baseline but, to keep the approach site agnostic, we use no user features.
We consider the results previously obtained in [6] using the fixed, threshold, and dual early detection models and standard syntactic and semantic features (i.e., term frequency-inverse document frequency and latent Dirichlet allocation) for cyberbullying as baselines, and we explore several alternatives to improve the outcome. For this purpose, we extend our experiments to two different datasets, and the features extracted are limited to the comments (without considering user or media features). Moreover, we propose a new set of features (i.e., Doc2Vec) and we provide two improvements to the baseline models.
In particular, we identify and address the following research questions:
• RQ1: Can Doc2Vec features significantly improve performance for the early detection models?
• RQ2: Can multiple instance learning improve results for the early detection models?
Our contributions can be summarised as follows. (1) We study the cyberbullying early detection problem on social media, from a generic perspective focusing exclusively on users' comments. (2) We analyse the positive impact of Doc2Vec features on early detection models. (3) We introduce multiple instance learning (MIL) to the early detection models and evaluate the performance improvements.

Related Work
The detection of cyberbullying using machine learning algorithms has attracted considerable attention from researchers in the last few years, with a growing need for the automatic detection of cyberbullying, especially on social media.
Among the interesting works in this field, the authors of [7] detect cyberbullying using textual, audio, and visual features; their results suggest that audiovisual features can help improve the performance of purely textual cyberbullying detectors. Dinakar et al. [8] use term frequency-inverse document frequency in combination with the identification of profane words to detect cyberbullying in YouTube comments. Zhong et al. [9] combine bag of words (BoW) features with Word2Vec to detect cyberbullying on Instagram. They also include image and caption features, concluding that these can serve as a powerful predictor of future cyberbullying. Other authors use support vector machines (SVM) and a lexical syntactical feature to predict offensive language, achieving high precision values [10]. Also interesting is the work presented in [11], which focuses on network-based features and shows how they can be very helpful for the detection of aggressive user behaviour.
Rafiq et al. provide a dataset for the Vine social network and explore the detection of cyberbullying using several machine learning models, achieving the best results with AdaBoost, closely followed by random forest [12,13]. Following this work, Hosseinmardi et al. [14] used an Instagram dataset to further explore cyberbullying detection by incorporating multimodal text and image features as well as media session data. The results obtained by SVM outperformed other classifiers.
Arif in [25] and Singh et al. in [26] recently provided extensive reviews of relevant research works using machine learning techniques to detect cyberbullying on social media. From a generic perspective, most works use content-based features (such as Tf-idf or Word2Vec) and sentiment analysis as a basis, while others also incorporate different features extracted from the user, such as social features (e.g., followers) or their profile (e.g., age or gender).
Interestingly, although some works consider time features relevant for cyberbullying detection [27][28][29][30], to the best of our knowledge, very little effort is devoted to the early detection of cyberbullying. In fact, most works focus on improving performance regarding how successfully a cyberbullying post or session is identified, and therefore, standard evaluation metrics are used, such as accuracy, precision, recall, F-measure, or area under the curve [31][32][33].
Among those works that provide a time-aware evaluation, we find [34], where the authors focus on the early detection of cyberbullying to minimise the damage caused to victims. A basic time-aware evaluation is defined by dividing the original dataset into 10 independent subsets (or chunks), which are sequentially processed by the models. Performance is measured using the F1-score at each chunk, but models are not penalised for late detections, and the best scores are obtained mainly after processing at least half the dataset.
One of the research questions in [35] refers explicitly to the capacity of the proposed method, named HENIN, to provide early detection of cyberbullying. However, the authors perform a basic time-aware evaluation using precision and accuracy at k, and no penalty is introduced for late detection. Although the authors in [36] claim to be working on the early detection of cyberbullying, they in fact focus on computer network cyberattacks, and the time-aware performance evaluation is just based on the time required to make the prediction. Also in [37], an evaluation based on precision, recall, and F1 is used for the early detection of stress and depression.
Finally, this work is mostly related to our previous work [38], where we study the early detection of cyberbullying on the Vine social network considering owner, media session, video and comment features. The threshold and dual early detection models, which constitute our baselines, are proposed, and a strict time-aware evaluation is contemplated, which is also used in this work. However, in this article, we aim to provide a generic approach to the early detection of cyberbullying that could potentially be applied to any social network.

Methods
The main purpose of this work is to provide a general perspective for the early detection of cyberbullying on social media. For this reason, we focused our research on the common ground provided by different social media: user comments.
For all experiments, we considered a 5-fold cross-validation scheme to assess the performance and robustness of the models. For each model configuration, each experiment was repeated 5 times by drawing new training and test splits, and results were averaged over the 5 rounds. Significance tests were performed using the Welch two-sample t-test, and differences were considered statistically significant when the p-value was below 0.01.
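As a rough illustration of the significance test used, the Welch statistic and its degrees of freedom can be computed with the standard library alone; this is a generic sketch, not the evaluation code from our experiments, and obtaining the p-value additionally requires a t-distribution CDF (e.g., `scipy.stats.t.sf`).

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch two-sample t statistic and degrees of freedom (unequal variances)."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances
    se2 = va / na + vb / nb            # squared standard error of the mean difference
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```

With the resulting t and df, the two-sided p-value is compared against the 0.01 threshold.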

Problem Definition
This section provides a formal definition of the cyberbullying early detection problem on social media sessions.
We define S = {s_1, s_2, ..., s_n} as the set of media sessions that may suffer from cyberbullying, where |S| = n denotes the total number of sessions. Each social media session s ∈ S is composed of a sequence of posts, denoted as P_s, along with a binary indicator l_s that specifies whether the session is considered cyberbullying (l_s = true) or not (l_s = false).
For each social media session, the sequence of posts grows over time and is denoted as P_s = {⟨P_1^s, t_1^s⟩, ..., ⟨P_m^s, t_m^s⟩}, where the tuple ⟨P_i^s, t_i^s⟩, i ∈ [1, m], represents the i-th post of media session s and t_i^s is the timestamp when the post P_i^s was produced. We define each post, P_i^s, by considering the set of features described in Section 3.2. Each social media session is independent, and therefore the number of posts per session, m, may differ.
The objective of the cyberbullying early detection problem consists of, given a social media session s, detecting whether the session can be considered cyberbullying while examining as few posts from P_s as possible. Therefore, the output of an early detection model is binary (i.e., 0 for non-cyberbullying or negative; 1 for cyberbullying or positive), along with the number of posts processed, k, to reach the final decision. Note that when processing post P_i^s (and all previous posts P_1^s, ..., P_{i−1}^s), the model may produce a non-definitive decision (i.e., delay), which implies that more posts must be processed to reach a final decision.
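The prefix-by-prefix decision process above can be sketched as follows; the function names and the forced negative decision at the end of a session are illustrative assumptions, not part of the formal definition.

```python
from typing import Callable, Sequence, Tuple

DELAY, NEGATIVE, POSITIVE = -1, 0, 1

def early_detect(posts: Sequence[str],
                 step: Callable[[Sequence[str]], int]) -> Tuple[int, int]:
    """Feed posts to the model one at a time. `step` sees the prefix
    P_1..P_i and returns POSITIVE, NEGATIVE, or DELAY. Returns the
    final label and k, the number of posts processed to decide."""
    for i in range(1, len(posts) + 1):
        decision = step(posts[:i])
        if decision != DELAY:
            return decision, i
    # Forced decision once no more posts are available (assumption for the sketch)
    return NEGATIVE, len(posts)
```

For example, a toy `step` that flags any prefix containing an insult would stop after the second post of a session whose second post is offensive.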

Features
Previous works, as seen, for example, in [6], have used platform-specific social media features in their models, but this is precisely what we intend to avoid. Therefore, we did not consider user features such as the number of followers, media session features such as the number of likes or shares, or media-specific features such as content or emotions extracted from multimedia (e.g., pictures or videos).
On the contrary, our main and only focus is users' comments on the media provided by the different social networks. Therefore, from the comments in both the Instagram and Vine datasets, we considered the following groups of features based on previous works:
• Impolite comments: Following [14,39], we identified profane words for each post, where profane words are obtained from a dictionary. For each comment, we calculated its percentage of profane words, and we defined a comment as rude if it contained at least one profane word. For each session, we computed the percentage of rude comments.
• Sentiment analysis: Following [13], we measured polarity (positive and negative) and subjectivity for each comment. For each session, we calculated average values and the percentage of negative and subjective comments.
• Tf-idf: For each comment, we built a Tf-idf (term frequency-inverse document frequency) representation, following [6]. We performed a grid search for optimal hyperparameters using 10% of the Instagram dataset. Stopwords were removed, and unigrams, bigrams, and trigrams were considered. The 100 most frequent n-grams were extracted as features, a minimum document frequency of 2 was required, and terms appearing in more than 90% of the documents were ignored. Features were computed for each individual post and partial session.
• Latent Dirichlet allocation (LDA): Following [6,13], we computed semantic similarity using LDA. Optimal hyperparameters were obtained through grid search: 10 topics and a learning rate of 0.9. Features were computed for each individual post and partial session.
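As an illustration of the first group of features, the following sketch computes the rude-comment percentage and the mean profane-word ratio for a session; the `PROFANE` set is a toy stand-in for the profanity dictionary used in [14,39].

```python
PROFANE = {"idiot", "loser", "stupid"}  # toy stand-in for a real profanity dictionary

def rude_features(comments):
    """Session-level 'impolite comment' features: a comment is rude if it
    contains at least one profane word; return the share of rude comments
    and the mean share of profane words per comment."""
    rude, profane_ratios = 0, []
    for c in comments:
        words = [w.strip(".,!?") for w in c.lower().split()]
        hits = sum(w in PROFANE for w in words)
        profane_ratios.append(hits / len(words) if words else 0.0)
        rude += hits > 0
    n = len(comments)
    return {"pct_rude": rude / n,
            "mean_profane_ratio": sum(profane_ratios) / n}
```

The sentiment features are computed analogously per comment and then averaged per session.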
One of the contributions of this work is to explore the impact of Doc2Vec features on cyberbullying early detection. Le and Mikolov [40] propose an unsupervised method to provide fixed-length representations from variable-length pieces of text. We considered that it may be suitable for extracting information from social media posts, as it has proved useful for text feature extraction in topic analysis [41,42].
As with the previous features, we performed a grid search to tune hyperparameters: a distributed bag of words algorithm with negative sampling was used for training, the maximum window distance between two terms was set to 5, and 100 features were extracted. Doc2Vec features were extracted for each individual post and for each partial session.

Datasets
In this research, we used two datasets tagged for cyberbullying from two different social networks: Instagram and Vine.
The Instagram social network is a media-based social network mainly focused on images, where users can post and comment on them. The Vine social network is based on a mobile application that allows users to record and edit short videos to publish on their profiles, while other users can watch and comment on these videos.
The Instagram dataset was collected and described in detail by Hosseinmardi et al. [14,39]. The authors started with a sample of 25 thousand Instagram users with public profiles, and from there, a collection of more than 3 million media sessions was extracted. Sessions were limited to those with at least 15 comments, and to achieve a more balanced dataset, only sessions with at least one profane word in their comments were considered.
As a result, 2218 media sessions (with their comments) were labelled for cyberbullying by human contributors. Each contributor received a level of trust based on a series of tests and quizzes. A weighted version of the majority voting method was applied, and sessions below a minimum trust of 60% were discarded from the final set of Instagram sessions.
A similar approach was followed to create the Vine dataset [12,13]. The sample started with more than 650 thousand media sessions, which, after removing sessions with fewer than 15 comments, was reduced to 436 thousand sessions. The sessions were then grouped by the percentage of comments containing profane words, and a representative subsample of all groups was selected for labelling, yielding a total of 969 media sessions. Each of these sessions was tagged by 5 reviewers, and sessions with a confidence level below 60% were discarded, leaving a final number of 747 Vine sessions.

Table 1 presents the main statistics for both datasets. The Instagram dataset is bigger, both in terms of sessions and comments, while the Vine dataset provides a higher rate of comments per session than Instagram. We also analysed the main language used on the different social media networks. It is interesting to note that Vine sessions show a higher variability of languages, while most Instagram comments are in English. We consider that this could be motivated by the smaller size of the comments in the Vine dataset (see Figure 1), which points towards the use of more abbreviations and slang.

Figure 1 shows the distribution of the number of words per comment for both datasets, whereas Figure 2 shows the distribution of profane words. From the graphs, we observe that Instagram comments are much longer than those on Vine. With an average of nearly 10 words per comment on Instagram, and a long tail of posts reaching nearly 400 words, a relatively high standard deviation of 17.46 is produced. On the other hand, Vine averages 5.7 words per comment (the maximum is around 30 words) with a standard deviation of 5.41, all in all indicating a much smaller post size for this social network.
Moreover, Figure 1a shows a slight tendency to use more words in bullying comments on Instagram, which is only marginally present at the tail of the word distribution for the Vine dataset (Figure 1b).

Early Detection Metrics
Time-aware precision (TaP), as presented in [43] for cybersecurity evaluation purposes, is used as the evaluation metric.
TaP is defined in [43] at a decision point o for each individual entity e_i; the per-entity equations are omitted here. The overall TaP for a set of entities E is obtained by aggregating the individual scores of the entities e_i via an average. In its definition, λ is the decay parameter that determines how harsh the applied penalisation is, and k corresponds to the number of items processed at that specific point. Moreover, as previously studied in [43] but not used here for the sake of simplicity, an α parameter is included to allow the balancing of cases in different problems.
In contrast to other time-aware metrics, the definition of TaP is problem-related: its parameters do not depend on the specific dataset but on the problem being evaluated. This characteristic allows for a more comprehensive approach to the evaluation of early detection models.
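The exact TaP equations are given in [43]. Purely to illustrate the idea of decay-penalised precision, the sketch below applies an exponential decay exp(−λ(k − 1)) to each correct decision and averages over entities; this is a simplified stand-in, not the actual TaP formula.

```python
import math

def decay_penalised_precision(hits, ks, lam=0.1):
    """Illustrative decay-penalised score (NOT the exact TaP of [43]):
    hits[i] is 1 if entity i was classified correctly, 0 otherwise;
    ks[i] is the number of items processed before deciding. Each hit is
    down-weighted by exp(-lam * (k - 1)), so early correct decisions
    score close to 1 and late ones decay towards 0; scores are averaged."""
    scores = [h * math.exp(-lam * (k - 1)) for h, k in zip(hits, ks)]
    return sum(scores) / len(scores)
```

As in TaP, a larger λ penalises late decisions more harshly, and deciding after a single item incurs no penalty at all.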

Baseline
As early detection baseline models, we consider those described in previous works [6,44,45] that have reported relevant results when evaluated with early detection metrics: the fixed, threshold, and dual models. To the best of our knowledge, no other previous results using time-aware metrics such as TaP to evaluate early detection models have been presented.
As introduced, there are few cyberbullying early detection studies that employ early detection metrics and solely use comment data. Regarding state-of-the-art results, we can cite the following research using the Instagram and Vine datasets. The authors of [27] obtained the best results for the Instagram dataset using linear regression, with an F1 of 0.64 ± 0.03 (precision 0.79 ± 0.03 and recall 0.55 ± 0.03). A similar outcome is shown in [29], with an F1 value of around 0.65 (precision 0.873 ± 0.04 and recall 0.517 ± 0.05). In the case of the Vine dataset, an F1 of 0.59 ± 0.04 was also reached with linear regression (precision 0.62 ± 0.05 and recall 0.57 ± 0.05). These values, although not directly comparable, provide a reference for the values obtained in [6] for the Vine dataset.
For all our experiments, the following machine learning models were used for the fixed, threshold, and dual approaches: AdaBoost (ADA), extra trees (ET), logistic regression (LR), naïve Bayes (NB), and support vector machine (SVM).
• A fixed early detection model is a simple adaptation of a standard machine learning model that considers a fixed number of input comments. A delay is produced until this number is reached; at that point, a final prediction is generated by the model [6]. In our experiments, we consider different values for the number of input comments: 1, 5, 10, 15, and 25. A different performance result is obtained for each fixed point and for each machine learning model. For simplicity, we denote each model with its decision point as a subscript (e.g., ADA_5 or ET_15).
• A threshold early detection model uses a machine learning model and sets a positive threshold (th+) and a negative threshold (th−) on the class probability returned by the model to decide when to make a final decision (i.e., cyberbullying or non-cyberbullying) [6]. That is, for a specific media session s, after processing post i of s, P_i, the model tests whether the class probability is higher than or equal to th+ to emit a cyberbullying prediction and, if not, tests whether it is higher than or equal to th− to produce a non-cyberbullying prediction. Otherwise, a delay output is generated, and further posts must be processed.
• A dual early detection model relies on two independent machine learning models: one model (m+) trained to detect cyberbullying cases and another model (m−) trained to detect non-cyberbullying cases. Positive and negative thresholds are also required, but in this case each is associated with its own model. Therefore, after processing P_i with m+, the class probability must be higher than or equal to th+ for the session to be considered cyberbullying; likewise, when processed by m−, it must be higher than or equal to th− to be considered non-cyberbullying. In any other case, a delay output is produced.
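The threshold rule above can be sketched as a small decision function; here we assume, as one plausible reading, that th− is applied to the negative-class probability (1 − p).

```python
def threshold_decision(p_cyber, th_pos=0.5, th_neg=0.9):
    """Threshold early detection rule: p_cyber is the model's cyberbullying
    class probability after the posts seen so far. Return 1 (cyberbullying)
    if p_cyber >= th_pos; 0 (non-cyberbullying) if the complementary
    probability 1 - p_cyber >= th_neg; otherwise -1 (delay: read more posts)."""
    if p_cyber >= th_pos:
        return 1
    if 1 - p_cyber >= th_neg:
        return 0
    return -1
```

With th+ = 0.5 and th− = 0.9, a weakly positive session is flagged early, while a non-cyberbullying prediction requires high confidence; intermediate probabilities keep the model reading further posts.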
In this work, we also study the use of multiple instance learning (MIL) in the early detection models. The MIL paradigm assumes that labels are assigned to sets or bags (i.e., media sessions, in our case), while training examples are considered ambiguous: a single object is expected to have multiple alternative instances (i.e., posts, in our case) that describe it, and only some of those may be responsible for the object's classification [46]. More specifically, we focus on instance-space methods that obtain bag labels by aggregating individual instance features [47]. In our experiments, we considered the following bag aggregation functions: minimum, maximum, average, median, and the average of the minimum and maximum values. Therefore, an initial MIL phase is introduced to provide a new media session representation, after which the baseline early detection models can be executed as usual.
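A minimal sketch of this instance-space aggregation step, applied feature-wise over the per-post vectors of a bag:

```python
from statistics import mean, median

def aggregate_bag(instances, how="avg"):
    """Instance-space MIL: collapse a bag of per-post feature vectors into
    one session-level vector by applying the chosen function feature by
    feature (i.e., column-wise across the posts in the bag)."""
    funcs = {
        "min": min,
        "max": max,
        "avg": mean,
        "median": median,
        "minmax": lambda xs: (min(xs) + max(xs)) / 2,  # midpoint of min and max
    }
    f = funcs[how]
    return [f(col) for col in zip(*instances)]
```

The resulting session-level vector is then fed to the baseline early detection models exactly as any other feature representation.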

Performance Evaluation
The following sections present the results for each of the three sets of experiments performed. First, we present the baseline results. Then, we study how the inclusion of Doc2Vec features affects all the proposed models. Finally, we incorporate multiple instance learning into the fixed and threshold early detection models.

Baseline
In this section, we present the baseline results using the TaP metric, as introduced in [43], for all three models used in the experiments: the fixed, threshold, and dual models.
Firstly, in Table 2, it must be noticed that, even if the general values differ, the best point of decision for each fixed model is almost the same across all models and datasets. This can be explained by the fact that the amount of information processed is almost the same for all models and both datasets (i.e., point 10 in eight cases out of ten), which leads to an optimal point of decision related to the penalisation applied by the metric. On Instagram, the best baseline performance is obtained with LDA for naïve Bayes (at point 10), reaching 0.3755. On the other hand, for the Vine dataset, the best baseline performance combines Tf-idf and LDA features for naïve Bayes (also at point 10), obtaining TaP = 0.3794.

Table 2. TaP for the fixed model using only baseline features: Tf-idf (TI) and latent Dirichlet allocation (LDA). The first column corresponds to features and the second column to the point of decision. Results are grouped by dataset (i.e., Instagram and Vine), and each column corresponds to one machine learning model. The best score for each model is underlined, and the best score for each dataset is highlighted in bold.

Table 3 shows the results for the threshold model, where more differences can be seen, as variation in thresholds does not correlate directly with the number of items processed before making a decision. This contrasts with the results shown previously, but can be explained by the fact that a higher threshold may mean a lower probability of making the wrong choice, but may also increase the number of items processed before taking the final decision. As for the results, focusing on Instagram, the best baseline score is 0.7561, obtained with Tf-idf features using naïve Bayes, and the threshold early detection model parameters are th+ = 0.5 and th− = 0.9. For the Vine dataset, LDA features are used, also with the naïve Bayes model, and the threshold parameters are likewise th+ = 0.5 and th− = 0.9. The performance obtained for the best baseline features in this dataset is TaP = 0.399.

Finally, in Table 4, even taking into account the difficulty of interpreting the results due to the complexity of the dual model (comprising one model for positive cases and another for negative cases, each with its own features and thresholds), it can be seen that the same pattern applies, with slight differences in the configuration of the models, as stated for the fixed and threshold baselines. For instance, the same values are obtained for all combinations of negative features in the same models, as is the case of NB for both the Instagram and Vine datasets. For these results, different threshold configurations were tested, using low values for the positive threshold and high values for the negative threshold, as in the threshold early detection model. However, for the sake of simplicity, Table 4 shows the results for th+ = 0.5 and th− = 0.9, as this configuration achieved the best results.

Table 3. TaP for the threshold model using only baseline features: Tf-idf (TI) and latent Dirichlet allocation (LDA). The first column corresponds to features, the second column to the positive threshold, and the third to the negative threshold. Results are grouped by dataset (i.e., Instagram and Vine), and each column corresponds to one machine learning model. The best score for each model is underlined, and the best score for each dataset is highlighted in bold.

Table 4. TaP for the dual model using only baseline features: Tf-idf (TI) and latent Dirichlet allocation (LDA). The first column corresponds to positive features and the second column to negative ones. Results are grouped by dataset (i.e., Instagram and Vine), and each column corresponds to one machine learning model. The best score for each model is underlined, and the best score for each dataset is highlighted in bold.

In this case, for the Instagram dataset, the best TaP outcome is 0.7561 when only Tf-idf features are used as positive features, regardless of the combination of negative features selected for the naïve Bayes model. In the case of the Vine dataset, all combinations of negative features also produce the same output, reaching the highest value of 0.399 for naïve Bayes when LDA features are used.

Doc2Vec Features
In this first set of experiments, we evaluated whether Doc2Vec features can significantly improve performance. In all the experiments, sentiment and profanity analysis features were included, and different combinations of syntactic and semantic features (i.e., Tf-idf, LDA, and Doc2Vec) were tested. Figure 3a,b show the results for the fixed early detection model at points 1, 5, 10, 15, and 25 for the Instagram and Vine datasets, respectively. The X-axis represents the syntactic and semantic features extracted from the posts. The first three items correspond to the standard features (i.e., Tf-idf, LDA, and the combination of both), which constitute our baseline. The remaining items correspond to Doc2Vec and its combinations with Tf-idf and LDA features. From Figure 3a, we observe how the use of Doc2Vec features (stand-alone or in combination) improves performance, especially for the logistic regression and SVM models, independently of the point at which the fixed model makes its decision. This behaviour is clearer on the Vine dataset (Figure 3b), as there is a higher improvement for all machine learning models except naïve Bayes.
Introducing Doc2Vec features immediately improves performance: the best-performing model on the Instagram dataset is SVM_10, combining Doc2Vec with LDA and reaching 0.529 (versus the previous value of 0.3755), while for Vine, the use of Doc2Vec reaches up to 0.5716 with SVM (versus the previously obtained 0.3794), also at point 10.
Following this, we replicated the same experiments for the threshold early detection model. Figure 4a,b show the results for both datasets. Since this model requires positive and negative threshold values, we show different line types for different thresholds. We limited our results to low positive thresholds (i.e., 0.5 and 0.6) and high negative thresholds (i.e., 0.8 and 0.9), since they have provided good results in previous works [6]. The rationale behind these values is clear: to provide an easy-to-surpass threshold for the early detection of cyberbullying cases with little information, while negative cases require a higher degree of confidence.
From the figures, we observe similar behaviour to the previous case, with an important performance improvement. Introducing Doc2Vec, in combination with the other features, significantly improves performance, except in the case of naïve Bayes and extra trees, where the difference is smaller. With this model, on the Instagram dataset, that improvement is only clearly present for the th+ = 0.6 and th− = 0.8 configuration, while on Vine, a slight increase in the results is present for th+ = 0.5 with the inclusion of Doc2Vec in some of the feature combinations; in this last case, results are worse without the use of TI features. It must be noted that, in any case, the best results for naïve Bayes are higher than those achieved with the other models. On Instagram, naïve Bayes is the best-performing model, reaching up to 0.7585 using Doc2Vec with Tf-idf features when a low positive threshold is used (th+ = 0.5), regardless of the negative threshold (th− = 0.8 or th− = 0.9).
On the other hand, Vine's best-performing threshold model uses AdaBoost, and requires Doc2Vec and LDA features to score 0.7097 (th+ = 0.5 and th− = 0.9), significantly improving the baseline performance.
Finally, in this set of experiments, we validated the Doc2Vec behaviour for the dual model, and the results are shown on Figure 5a,b. Each column represents the results for one machine learning model, while rows correspond to positive features. Since the best scores were achieved using Doc2Vec as positive features, we limited the rows to those cases. As in the previous figures, negative features are represented on the X-axis, and different values for positive and negative thresholds are represented on each graph.
Focusing on the Instagram dataset (Figure 5a), the best-performing models are AdaBoost and naïve Bayes, and th− = 0.9 tends to achieve better results in most cases. The best score is 0.7585, which corresponds to naïve Bayes using Doc2Vec and Tf-idf as positive features for all negative feature combinations with a low positive threshold (th+ = 0.5).
With respect to the Vine dataset (Figure 5b), we observed a limited impact on all machine learning models for some feature combinations, both positive and negative. Only low negative thresholds (th− = 0.8) present a notable variation, for the logistic regression model. In the rest of the cases, the performance remains quite stable. AdaBoost and extra trees achieve good results in general and present consistent behaviour, independently of the positive and negative thresholds. In fact, the best score (0.7167) is obtained by AdaBoost, using all features as negative and Doc2Vec and LDA as positive, with th+ = 0.5 and th− = 0.9. This result significantly improves the dual model baseline performance for the Vine dataset.

Table 5 summarises the main results for this first set of experiments, comprising the inclusion of Doc2Vec features. We observe that its incorporation significantly improves the results obtained by the early detection models for both datasets. Interestingly, the performance improvement is greater on the Vine dataset. We consider that this may be motivated by the smaller post sizes (in terms of word count) and the limitations of Tf-idf and LDA in extracting valid features, while Doc2Vec is better able to capture the information available. Regarding the machine learning models, naïve Bayes for the Instagram dataset and AdaBoost for Vine provide the best results. Another interesting point is that the threshold and dual models significantly improve performance over the fixed model in all cases, except for the threshold model on the Vine dataset; however, there is no significant statistical difference in performance between them. In the remaining experiments, we focus on the threshold early detection model, since it requires a simpler configuration, and its performance can be considered equivalent to that of the dual model.

Multiple Instance Learning
In the final set of experiments, we study whether the use of MIL has a positive impact on the early detection of cyberbullying, measured with the TaP early detection metric. With this aim, we incorporated MIL into the fixed early detection model by adding a bag representation of the processed posts. These bags were generated in a step prior to ML model training. MIL addresses the ambiguity of labelling single objects by instead representing each labelled entity as a bag of multiple alternative instances. The features from the posts in a bag are aggregated using the following functions: minimum, maximum, average of maximum and minimum, arithmetic mean, and median.
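A minimal sketch of these five aggregation functions (our own illustration, not the authors' code; the "midrange" name for the average of maximum and minimum is ours) applied to a bag of per-post feature vectors:

```python
import numpy as np

def aggregate_bag(bag, how="mean"):
    """Collapse a bag of per-post feature vectors (n_posts x n_features)
    into a single feature vector, column by column."""
    bag = np.asarray(bag, dtype=float)
    if how == "min":
        return bag.min(axis=0)
    if how == "max":
        return bag.max(axis=0)
    if how == "midrange":  # average of maximum and minimum
        return (bag.max(axis=0) + bag.min(axis=0)) / 2.0
    if how == "mean":
        return bag.mean(axis=0)
    if how == "median":
        return np.median(bag, axis=0)
    raise ValueError(f"unknown aggregation: {how}")
```

For example, for a bag `[[1, 2], [3, 4], [5, 0]]`, the minimum yields `[1, 0]`, the maximum `[5, 4]`, and the mean, midrange, and median all yield `[3, 2]`.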
To aid comparison, we included an aggregation function, denoted as None, that represents a fixed early detection model without MIL. In all our experiments, we ran models with the combination of all features defined: profane words, sentiment analysis, Tf-idf, LDA, and Doc2Vec. Regarding the Instagram dataset (Figure 6a), the use of MIL, in many cases, reduces the performance of the model. There are some slight improvements for all models (except logistic regression) at the higher points. In fact, the best score using MIL is obtained by SVM at point 10 using the minimum as the aggregation function, and corresponds to 0.3384. However, there is no statistical difference in comparison with the fixed model without MIL, and it is significantly worse than the threshold and dual models.
However, when considering the results for the Vine dataset (Figure 6b), there is a significant change, especially for the AdaBoost and extra trees models, with MIL models providing major improvements. The best score in this case is 0.7926, achieved by AdaBoost at point 10 using the arithmetic mean, which is significantly better than the results without MIL for the fixed, threshold, and dual models.
We consider that this important performance improvement by using multiple instance learning for the Vine dataset is directly related to the reduced post sizes on this social network and the lower use of standard English. On average, each post is composed of about 5 words, which provides limited information to be extracted by the different syntactic and semantic features for each of them. However, the use of a simple aggregation function (e.g., average) for a number of posts allows for better combination of the information from the different posts than the concatenation of the posts themselves. On the other hand, Instagram posts are larger and mostly in English, and therefore, the features extracted provide sufficient information from each post, making the aggregation less relevant.
We also conducted experiments testing different sampling alternatives for the training set, instead of using all post sessions. In particular, we analysed using the first 10 posts, the last 10 posts, or 10 random posts, with no relevant improvements obtained for the fixed early detection model using MIL.
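The three sampling alternatives for selecting training posts from a session can be sketched as follows (a hypothetical helper of our own, assuming a session is an ordered list of posts):

```python
import random

def sample_posts(session, k=10, strategy="first", seed=None):
    """Select k posts from a session: the first k, the last k,
    or k chosen uniformly at random (seeded for reproducibility)."""
    if len(session) <= k:
        return list(session)
    if strategy == "first":
        return session[:k]
    if strategy == "last":
        return session[-k:]
    if strategy == "random":
        return random.Random(seed).sample(session, k)
    raise ValueError(f"unknown strategy: {strategy}")
```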
We also explored the impact of MIL on the threshold early detection model's performance. The experiments were limited to the AdaBoost and extra trees models, since they provided the largest improvements in our previous experiments, and again, we used all features in our models. In this case, due to the important change in the features introduced by MIL, we considered the whole range for positive and negative thresholds: 0.5, 0.6, 0.7, 0.8, and 0.9. The behaviour on the Instagram dataset (Figure 7a), as expected, is below the results achieved by the baselines provided in Section 4.4.2 for the fixed, threshold, and dual models (being significantly worse than the last two). Increasing the negative threshold improves performance, and the median aggregation function consistently provides the best results in all cases, but there is no significant improvement in TaP results for this dataset.
Regarding the Vine dataset (Figure 7b), we observe that AdaBoost provides better performance than extra trees. As in the previous case, increasing the negative threshold improves performance, while for the positive threshold, the best values are obtained with 0.5 and 0.6. This is a meaningful change with respect to the standard threshold model, motivated by the aggregation of information from multiple posts, which increases the class probability for the detected cyberbullying cases. Among the aggregation functions, the maximum concentrates the highest scores.
In fact, the best score of 0.8149 is obtained by AdaBoost using the maximum aggregation function, with th+ = 0.5 and th− = 0.9. Again, this model significantly improves performance over the best fixed, threshold, and dual models in Section 4.4.2, although there is no statistical difference with respect to the fixed model from our previous experiments.

Results and Discussion
All experiments were performed on two different social network datasets and evaluated under an early detection metric. This allowed for a proper evaluation from the early detection point of view, as the detection delay is taken into account. In addition, the experiments were performed on datasets of different natures, which is reflected in the outcomes of the analysis. They relied solely on textual information, disregarding platform-dependent information, which places the focus on the syntactic and semantic features extracted from posts.
Considering as a starting point the early detection models and features from our previous work [6], we have shown that Doc2Vec features significantly improve performance for all early detection models, with the threshold and dual models performing similarly. The best TaP scores achieved for the Instagram and Vine datasets were, respectively, 0.7585 and 0.7167, improving on the respective best baseline models in both cases.
We have also shown how the introduction of multiple instance learning in early detection models has an important positive effect on the Vine dataset results, motivated by the smaller post sizes and lower use of standard English language, while there is no significant improvement on Instagram. The threshold model is able to reach a score of 0.8149 with the AdaBoost model for the Vine dataset.
In particular, in the Doc2Vec features section (Section 4.4.2), we assess the performance improvement for all early detection models. We find that Doc2Vec provides an important performance enhancement for the threshold and dual models. As there is no statistically significant difference between them, the threshold model would be preferable due to its simpler configuration. Finally, in the MIL section (Section 4.4.3), we incorporate multiple instance learning into the fixed and threshold models. It does not provide better performance for the Instagram dataset, but it improves the TaP score for the Vine dataset. This is related to the reduced post sizes in the dataset and the lower use of English.

Conclusions and Future Work
In this work, we have explored the cyberbullying early detection problem on social networks from a generic perspective, focusing only on users' comments.
To summarise, the inclusion of Doc2Vec features improves the performance of all early detection models, in particular the threshold and dual models. Moreover, multiple instance learning has shown better results for the Vine dataset, which has smaller post sizes and a lower use of the English language.
In future work, we expect to continue exploring multiple instance learning by using more advanced instance-space MIL models, such as expectation-maximisation diverse density (EM-DD), and also considering bag-space and embedded-space models, such as the normalised set kernel (NSK-SVM) or multiple instance learning via embedded instance selection (MILES). In that sense, different combinations of these methods after grouping by post types and subject matter could be studied. Related to this, different kinds of filters could be applied in order to obtain specific characteristics of those interactions.
We also consider that the early detection problem can be applied in multiple and different environments, and hence, we expect to provide a generic approach to this problem with a suitable evaluation methodology and metrics.

Conflicts of Interest:
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: