Predicting Adolescents’ Educational Track from Chat Messages on Dutch Social Media

We aim to predict Flemish adolescents’ educational track based on their Dutch social media writing. We distinguish between the three main types of Belgian secondary education: General (theory-oriented), Vocational (practice-oriented), and Technical Secondary Education (hybrid). The best results are obtained with a Naive Bayes model, i.e. an F-score of 0.68 (std. dev. 0.05) in 10-fold cross-validation experiments on the training data and an F-score of 0.60 on unseen data. Many of the most informative features are character n-grams containing specific occurrences of chatspeak phenomena such as emoticons. While the detection of the most theory- and practice-oriented educational tracks seems to be a relatively easy task, the hybrid Technical level appears to be much harder to capture based on online writing style, as expected.


Introduction
While some social variables, such as gender and age, have often been studied in author profiling (see e.g. the overview paper by Reddy et al. (2016)), educational track remains largely unexplored in this respect. The goal of this paper is twofold: we aim to develop a model that accurately predicts adolescents' educational track based on their language use in social media writing, and to gain more insight into the linguistic characteristics of youngsters' educational background through inspection of the most informative features for this classification task. The paper is structured as follows: we start by discussing related research (Section 2). Next, we describe the corpus, as well as the three main types of Belgian secondary education, i.e. the three class labels in the classification experiments (Section 3). Finally, we discuss our methodology (Section 4) and present the results (Section 5).

Related Research
Related work on this topic is scarce; only some studies in education profiling can be found, and they examine the impact of tertiary (and not secondary) education, on text genres other than social media writing. Furthermore, Dutch is never the language of interest. Estival et al. (2007), for instance, approached tertiary education profiling as a binary classification task (none versus some tertiary education) for a corpus of English emails. They obtained promising results with an ensemble learner (Bagging algorithm) using character-based, lexical and structural text features while explicitly excluding function words. Pennebaker et al. (2014), however, stressed the importance of function words in a related task: they linked students' writing in college admission essays to their later performance in college. Obtaining higher or lower grades appeared to be associated with the use of certain function words, belonging to either 'categorical' or 'dynamic' writing styles. In previous work on language and social status, Pennebaker (2011) had already pointed out the importance of pronouns: he described a more frequent use of you- and we-words as more typical of high status, as well as a less frequent use of I-words.
When we expand the scope of previous research from profiling studies to other related linguistic fields, we again conclude that this specific topic is underresearched. There are many studies on the characteristics of (youngsters') computer-mediated communication (CMC) (see e.g. Varnhagen et al. (2010), Tagliamonte and Denis (2008) and many more) and even some on the interaction between CMC and education (see e.g. Vandekerckhove and Sandra (2016) for the impact of CMC on school writing). However, the impact of educational track on adolescents' online writing is not addressed. For this specific topic, we can, to our knowledge, only refer to our previous sociolinguistic work focusing on youngsters with distinct secondary education profiles, in which we have shown that teenagers in practice-oriented tracks tend to deviate more from formal standard writing on social media, by using more typographical chatspeak features (e.g. emoji), more non-standard lexemes (e.g. dialect words) and more non-standard abbreviations (Hilte et al., 2018a,b). While for all examined linguistic features, these differences were very consistent between the two 'poles' of the continuum between theory and practice, i.e. General and Vocational students, the Technical students did not always hold an intermediate position, but their chat messages showed a rather unpredictable linguistic pattern (Hilte et al., 2018a,b). We investigate in this paper whether these sociolinguistic results are confirmed in machine learning experiments.

Data Collection
Our corpus consists of Flemish adolescents' private chat messages, written in Dutch on the social media platforms Facebook Messenger and WhatsApp. The data were collected through school visits during which the students were informed about the research, and could voluntarily donate chat messages. We asked for the students' (and for minors, their parents') consent to store and analyze their anonymized texts.

Methodology
In this section, we describe the preprocessing of the data and the feature design (Sections 4.1 and 4.2, respectively) as well as the experimental setup (Section 4.3).

Preprocessing
Since we predict educational track on a participant level, we must ensure that we have sufficient data (and thus a fairly representative sample of online writing) for each participant. For this purpose, we removed the participants who donated fewer than 50 chat messages. Next, we divided the remaining corpus into a training set (70% of the participants) and a test set (15%). A second test set (15%) was put aside for future experiments. This division was random but stratified, i.e. every subset contained the same proportion of participants per educational track.
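The filtering and stratified division described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `stratified_split`, the toy participant records and the rounding of subset sizes are hypothetical and are not taken from the original experiments.

```python
import random
from collections import defaultdict

def stratified_split(participants, min_messages=50,
                     fractions=(0.70, 0.15), seed=0):
    """Split participant ids into train / test / second test set per track.

    participants: dict mapping id -> (track, number of donated messages).
    """
    by_track = defaultdict(list)
    for pid, (track, n_messages) in participants.items():
        if n_messages >= min_messages:        # drop sparse participants
            by_track[track].append(pid)

    rng = random.Random(seed)
    train, test, held_out = [], [], []
    for track in sorted(by_track):            # same proportions in every track
        ids = sorted(by_track[track])
        rng.shuffle(ids)
        n_train = round(fractions[0] * len(ids))
        n_test = round(fractions[1] * len(ids))
        train += ids[:n_train]
        test += ids[n_train:n_train + n_test]
        held_out += ids[n_train + n_test:]    # remainder: second test set
    return train, test, held_out

# Toy data (hypothetical): 20 participants per track, plus one sparse donor.
participants = {f"{track}-{i}": (track, 60)
                for track in ("General", "Technical", "Vocational")
                for i in range(20)}
participants["General-sparse"] = ("General", 10)

train, test, held_out = stratified_split(participants)
print(len(train), len(test), len(held_out))
```

Splitting within each track separately is what keeps the class proportions identical across the three subsets.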

Feature Design
The features used in the classification experiments consist of general textual features and features representing the frequency of typical chatspeak phenomena.
The general features include frequencies for token n-grams (uni-, bi- and trigrams) and character n-grams (bi-, tri- and tetragrams). In addition, average token and post length and vocabulary richness (type/token ratio) are taken into account as well. Finally, we use the dictionary-based computational tool LIWC (Pennebaker et al., 2001) in an adaptation for Dutch by Zijlstra et al. (2004) to count word frequencies for semantic and grammatical categories. While counts for individual words are already captured by the token unigrams, these counts per category can allow for broader generalizations for words which are semantically or functionally related. However, we note that the accuracy of this feature might not be optimal, as the social media texts are very noisy (and contain many non-standard elements, e.g. in terms of orthography or lexicon), whereas LIWC is based on standard Dutch word lists.
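A minimal sketch of how such general features could be computed is given below. Whitespace tokenization and the helper names (`char_ngrams`, `token_ngrams`, `general_features`) are assumptions made for illustration; the LIWC category counts are not reproduced here.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count all character n-grams of length n in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def token_ngrams(tokens, n):
    """Count all token n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def general_features(posts):
    """posts: list of chat messages (strings) written by one participant."""
    # Assumption: tokens are split on whitespace only.
    tokens = [tok for post in posts for tok in post.split()]
    return {
        "avg_token_length": sum(len(t) for t in tokens) / len(tokens),
        "avg_post_length": len(tokens) / len(posts),       # tokens per post
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }

posts = ["ik ga naar school", "ik ga mee"]                 # toy messages
feats = general_features(posts)
bigrams = token_ngrams(posts[0].split(), 2)
laugh = char_ngrams("haha", 2)
print(feats, bigrams, laugh)
```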
The set of chatspeak features contains counts for occurrences of several typographic phenomena. It includes the number of character repetitions (e.g. 'suuuuuper nice!!!') and combinations of question and exclamation marks (e.g. 'what?!'). The number of unconventionally capitalized tokens is added as well (alternating, inverse or all caps, e.g. 'AWESOME'). The final typographic features are emoticons and emoji (e.g. ':)', '<3'), the rendition of kisses and hugs (e.g. 'xoxoxo'), hashtags for topic indication (e.g. '#addicted') and 'mentions' for addressing a specific person in a group conversation (e.g. '@sarah'). We also add an onomatopoeic variable, i.e. the number of renditions of laughter (e.g. 'hahahahah'). Another typical element of chatspeak is the use of non-standard abbreviations and acronyms (e.g. 'brb' for 'be right back'). The final feature concerns language or register choice per token, in order to explicitly take into account the authors' use of words in a different language or linguistic variety than standard Dutch. We count the number of standard Dutch, English, and non-standard Dutch (e.g. dialect) lexemes. While the other chatspeak features are detected with regular expressions (typographic and onomatopoeic markers) or predefined lists (abbreviations), this lexical feature is extracted using a dictionary-based pipeline approach. For each token, we first checked if it was an actual word (and not e.g. an emoticon). Next, we checked if it occurred in a list of standard Dutch words and named entities. If not, we checked its presence in a standard English word list. Finally, if the token was absent again, it was placed in the 'non-standard Dutch' category. Figure 1 shows a sample of authentic chat messages from the corpus, illustrating the use of several chatspeak features.
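To make the two detection strategies concrete, the sketch below counts a few typographic and onomatopoeic markers with regular expressions and assigns a register category to a token through a dictionary lookup cascade. The patterns, the tiny word lists and all function names are hypothetical simplifications, not the expressions or lexicons used in the experiments.

```python
import re

# Illustrative patterns only; a real inventory would be far more extensive.
PATTERNS = {
    "char_repetition": re.compile(r"(\S)\1{2,}"),           # 3+ identical chars
    "mixed_punctuation": re.compile(r"\?+!+[?!]*|!+\?+[?!]*"),
    "emoticon": re.compile(r"[:;]-?[)(DPp]|<3"),
    "kisses": re.compile(r"\b(?:xo){2,}x?\b|\bx{2,}\b", re.IGNORECASE),
    "hashtag": re.compile(r"#\w+"),
    "mention": re.compile(r"@\w+"),
    "laughter": re.compile(r"\b(?:ha){2,}h?\b", re.IGNORECASE),
}

def chatspeak_counts(text):
    """Count matches of every pattern, plus all-caps tokens."""
    counts = {name: sum(1 for _ in pattern.finditer(text))
              for name, pattern in PATTERNS.items()}
    counts["all_caps"] = sum(1 for tok in text.split()
                             if len(tok) > 1 and tok.isalpha() and tok.isupper())
    return counts

def lexeme_category(token, dutch_words, english_words):
    """Dictionary cascade: word check, then Dutch, then English, else non-standard."""
    if not token.isalpha():
        return None                      # not an actual word (e.g. an emoticon)
    word = token.lower()
    if word in dutch_words:
        return "standard_dutch"
    if word in english_words:
        return "english"
    return "non_standard_dutch"

message = "suuuuuper nice!!! what?! hahahahah xoxoxo <3 :) #addicted @sarah AWESOME"
counts = chatspeak_counts(message)

dutch, english = {"ik", "school"}, {"nice"}   # toy word lists (assumption)
categories = [lexeme_category(t, dutch, english)
              for t in ("school", "nice", "wlh", ":)")]
print(counts, categories)
```

The order of the lookups matters: a token that appears in both word lists is counted as standard Dutch, which mirrors the order of the checks described above.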
For each participant, an individual feature vector was created containing the counts for all of these features. We proceeded with relative counts (to normalize for submission size) by dividing the absolute counts by the author's total number of tokens (e.g. for token unigrams or emoji) or n-grams (for n-gram frequencies). For initial dimensionality reduction, we applied a frequency cutoff, only taking into account features that are used at least 10 times in the corpus, by at least 5 different participants.
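The normalization and the frequency cutoff could look roughly like the sketch below, which handles token-based features only. The function names and toy counts are assumptions for illustration.

```python
from collections import Counter

def apply_cutoff(counts_per_author, min_total=10, min_authors=5):
    """Keep features used >= min_total times overall by >= min_authors authors."""
    total, n_authors = Counter(), Counter()
    for counts in counts_per_author.values():
        total.update(counts)                 # corpus-wide frequency
        n_authors.update(counts.keys())      # number of distinct users
    return {f for f in total
            if total[f] >= min_total and n_authors[f] >= min_authors}

def relative_vectors(counts_per_author, kept):
    """Divide absolute counts by each author's total number of tokens."""
    vectors = {}
    for author, counts in counts_per_author.items():
        n_tokens = sum(counts.values())
        vectors[author] = {f: counts[f] / n_tokens
                           for f in kept if counts[f] > 0}
    return vectors

# Toy corpus (hypothetical): six authors with identical base usage.
counts_per_author = {f"a{i}": Counter({"haha": 2, "ok": 1}) for i in range(6)}
counts_per_author["a0"]["wlh"] = 15          # frequent, but used by one author

kept = apply_cutoff(counts_per_author)
vectors = relative_vectors(counts_per_author, kept)
print(kept, vectors["a0"], vectors["a1"])
```

In this toy corpus 'wlh' is frequent enough overall but fails the participant threshold, while 'ok' is widespread but too rare, so only 'haha' survives.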

Experimental Setup
We compared different models to predict Flemish adolescents' educational track based on their social media messages. The classification algorithms we tested were: Support Vector Machines, Naive Bayes (Multinomial, Gaussian and Bernoulli), Decision Trees, Random Forest, and Linear Regression. For all classifiers, we used the Scikit-learn implementation (Pedregosa et al., 2011). For each model, we searched for the optimal parameter settings through a randomized cross-validation search on the training data. We searched for optimal values for classifier-bound parameters (e.g. kernel for SVM), as well as an optimal feature scaler (no scaling, MinMax scaling or binarization) and an optimal percentile for univariate (chi-square based) feature selection, chosen from a continuous distribution. We compared the models' performance in 10-fold cross-validation experiments on the training data.
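A randomized search of this kind can be sketched with Scikit-learn as follows, here for a Multinomial Naive Bayes classifier only. The synthetic count matrix, the order of the pipeline steps, the search ranges, the number of sampled settings and the scoring function are all assumptions made for the sake of a runnable example; they are not the settings of the original experiments.

```python
import numpy as np
from scipy.stats import uniform
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer, MinMaxScaler

# Synthetic stand-in for the participant-by-feature matrix (not the real data).
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30)                     # three tracks, 30 toy authors each
rates = rng.uniform(0.5, 3.0, size=(3, 40))      # class-specific feature rates
X = rng.poisson(rates[y]).astype(float)          # non-negative, as chi2 requires

pipeline = Pipeline([
    ("scale", "passthrough"),                    # placeholder, set by the search
    ("select", SelectPercentile(chi2)),
    ("clf", MultinomialNB()),
])

param_distributions = {
    "scale": ["passthrough", MinMaxScaler(), Binarizer()],
    "select__percentile": uniform(5, 95),        # continuous, between 5 and 100
    "clf__alpha": uniform(0.01, 1.0),            # smoothing parameter
}

search = RandomizedSearchCV(pipeline, param_distributions, n_iter=5, cv=10,
                            scoring="f1_weighted", random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```

Placing the scaler and the feature selection inside the pipeline means both are refit on the training folds only, so no information from the held-out fold leaks into the selection.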

Results
In Section 5.1, we discuss the best model resulting from the 10-fold cross-validation experiments on the training data and compare it to different baseline models. In addition, we inspect the most informative features for the task. In Section 5.2, we discuss additional experiments which provide further insight into the classification problem.

Model Performance and Feature Inspection
The best performing model in the CV setting on the training data is a Multinomial Naive Bayes classifier, with optimized parameters: the value for the smoothing parameter alpha is 0.98, and the model uses the best 12.50% of the features (according to chi-square tests). The features were binarized. The classification report (Table 2) indicates that the performance is good, with a value of 0.68 for (prevalence-weighted macro-average) precision, recall and F-score (std. dev. 0.05). While precision is very similar for the three educational levels, recall is good for General Education, but slightly worse for the Vocational and much worse for the Technical level. Consequently, the model seems to miss many Technical profiles, confusing them with the other educational tracks.
The confusion matrix (Table 3) shows that most (64%) misclassified Technical profiles were incorrectly labeled as the more theory-oriented General track, rather than as the more practice-oriented Vocational track (36%).
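The quantities discussed here (per-class precision and recall, the prevalence-weighted average, and the share of each wrong label among misclassified profiles) all follow from a confusion matrix. The sketch below uses made-up counts, chosen only so that the Technical errors split 64% versus 36%; they are not the values of Table 3.

```python
LABELS = ["General", "Technical", "Vocational"]

# Hypothetical confusion matrix: confusion[true_label][predicted_label].
confusion = {
    "General":    {"General": 40, "Technical": 5,  "Vocational": 5},
    "Technical":  {"General": 16, "Technical": 25, "Vocational": 9},
    "Vocational": {"General": 4,  "Technical": 6,  "Vocational": 40},
}

def per_class_scores(confusion):
    scores = {}
    for label in LABELS:
        tp = confusion[label][label]
        support = sum(confusion[label].values())               # true instances
        predicted = sum(confusion[t][label] for t in LABELS)   # predicted instances
        precision, recall = tp / predicted, tp / support
        f1 = 2 * precision * recall / (precision + recall)
        scores[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": support}
    return scores

def weighted_average(scores, metric):
    """Average a metric over classes, weighted by class prevalence."""
    total = sum(s["support"] for s in scores.values())
    return sum(s[metric] * s["support"] for s in scores.values()) / total

def error_shares(confusion, true_label):
    """Distribution of wrong labels among misclassified items of one class."""
    errors = {p: n for p, n in confusion[true_label].items() if p != true_label}
    total = sum(errors.values())
    return {p: n / total for p, n in errors.items()}

scores = per_class_scores(confusion)
shares = error_shares(confusion, "Technical")
print(scores["Technical"], weighted_average(scores, "f1"), shares)
```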
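As Table 5 shows, the BoW-model (bag-of-words) obtains almost identical scores in cross-validation: it yields an overall precision, recall and F-score of 0.67 (std. dev. 0.03). There is, however, a difference in how well both models generalize to unseen data. While the first model reaches an average F-score of 0.60 (see Table 4 for the detailed classification report), the BoW-model achieves a lower score of 0.55, and particularly underperforms in the detection of Technical profiles, with an F-score of 0.38 (vs 0.50 for the full model).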
In order to better understand the differences and similarities between both models, we compared their feature sets (after feature selection was applied) and inspected the 1000 most informative ones, using information gain as ranking criterion.
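Ranking features by information gain can be illustrated as below. The sketch treats every feature as a binary presence indicator per participant, which is an assumption in line with the binarized features of the best model; the feature names and labels are toy examples.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(feature_present, labels):
    """Entropy reduction obtained by splitting on a binary feature."""
    gain = entropy(labels)
    for value in (True, False):
        subset = [lab for present, lab in zip(feature_present, labels)
                  if present == value]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain

labels = ["General", "General", "Vocational", "Vocational"]
features = {
    "xd":   [False, False, True, True],    # perfectly separates the classes
    "haha": [True, False, True, False],    # carries no class information
}
gains = {name: information_gain(present, labels)
         for name, present in features.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
print(gains, ranking)
```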
While we expected that the most informative features for the BoW-model would be lexical and the ones for the full model stylistic, this analysis suggests that in both models, many of the most informative selected features are specific occurrences of chatspeak markers. For the BoW-model, which uses only token unigrams as features, many of the most informative tokens contain one or more chatspeak features (e.g. colloquial register, a spelling manipulation, an emoticon, character repetition, etc.). Some other informative tokens seem to be more content- than style-related, revealing topics such as hobbies, specific locations, friends and school. Strikingly, although the full model contains abstraction of chatspeak phenomena (e.g. total count for emoticons), specific occurrences of these genre markers are still most informative. Unlike token unigrams, character n-grams also capture such occurrences on a sub-token level (e.g. the n-gram 'sss' captures repetition of the letter 's' in different words). A clear advantage of this can be illustrated by the Arabic word 'wallah' (meaning 'I swear on God's name'), which is often used by our participants with Arabic roots, who spell it in many different ways. Because of these alternative spellings, 'wallah' does not appear among the most informative tokens in the BoW-model. However, for the full model, several related character n-grams (e.g. 'wlh', 'wll') do.
Next, we compared the full model to a stylistic model using only chatspeak features (both abstractions and specific occurrences), and no token or character n-grams. This stylistic model performs slightly worse on both the training set (F-score = 0.64, std. dev. 0.04) and unseen data (F-score = 0.59) (see Table 5). However, inspection of the most informative features in this feature set provides further insight into the education profiling task. Many of the most informative features are again specific occurrences of stylistic phenomena (e.g. specific emoticons, specific lexemes containing letter repetition). Some abstract representations of online writing style characteristics appear among the top-1000 features too (such as the total use of character repetition, of onomatopoeic laughter, acronyms, English words, mentions and hashtags, and emoticons), but much less prominently. These findings suggest that even in a purely stylistic model, abstract representation of certain style features is not informative enough for education profiling, and appears to be less important than the use of these features within specific tokens or contexts.
[...] adolescents, aged 13-16 (F-score = 0.69 in cross-validation, std. dev. 0.09; and 0.55 on unseen data). This might be due to the fact that the older teenagers have been together in the same peer networks and class groups for a longer time, and might write more similarly on social media. Furthermore, some of the younger students might actually still change educational track.

Conclusion
We conducted classification experiments to predict educational track for Flemish adolescents, based on their social media writing. These first results are promising and indicate that the task is doable. However, although the best model strongly outperforms a probabilistic baseline, its performance is similar to that of a simple BoW-model. This might give the impression that lexical features are still very important; however, inspection of the most informative features revealed that many of the most informative tokens contain stylistic features typical of the informal online genre. The most informative features for the full model suggest that abstraction of these stylistic chatspeak features (or at least, the current implementation) is still of lesser importance than specific occurrences.
While the distinction between General and Vocational high school students appears to be relatively easy to make, the detection of students in the intermediate Technical track is much harder. This could indicate that these students are truly a hybrid class with subsets of students that are simply not that different from their peers in more theory- or more practice-oriented tracks, respectively. In addition, related research shows that these students' online writing is rather unpredictable and does not follow a clear pattern (Hilte et al., 2018a,b).
In future work, we want to experiment with additional algorithms, such as ensemble methods, and with a post-level rather than a participant-level approach (in order to have more data samples at our disposal). We also want to improve the current feature design and particularly the abstract representation of style features, because as van der Goot et al. (2018) write, abstract features may increase generalizability to other corpora (and even genres and languages) in author profiling tasks, compared to lexical models. In addition, we want to further investigate the creation of different classifiers for different subgroups of participants (e.g. boys versus girls).
Finally, we stress that this profiling task is not only relevant in a Belgian context, since the educational tracks serving as class labels correspond to several countries' secondary education programs. Furthermore, the inclusion of stylistic features (i.e. chatspeak phenomena occurring in any language) adds to this generalizability. While specific lexemes or specific realizations of chatspeak markers may not always be relevant in other languages or corpora, the abstract stylistic features are more universal on social media. We argue that these models for education profiling, when further improved, could be used in different languages and applications. For instance, the addition of an educational component can increase existing profiling tools' performance, which can be important in different tasks (e.g. the detection of fake accounts on social media, and many more).

Supplementary Materials
Because of the decision of our university's ethical committee, in line with European regulations to ensure the adolescents' privacy, we cannot make the dataset publicly available. The code will be made available.

Figure 1: Example messages from the corpus.

Table 1: Distributions in the corpus.
troduced, the BoW-model obtains almost identical scores in cross-validation: it yields an overall precision, recall and F-score of 0.67 (std.dev.0.03).There is, however, a difference in how well both models generalize to unseen data.While

Table 5: Comparison of the different models and baselines.
on a sub-token level (e.g. the n-gram 'sss' captures repetition of the letter 's' in different words).We can illustrate a clear advantage by the Arabic word 'wallah' (meaning 'I swear on God's name'), which is often used by our participants with Ara-

Table 6: Classification report for binary task (in cross-validation).

Table 8: Comparison of the models for separate groups.