Supervised learning and resampling techniques on DISC personality classification using Twitter information in Bahasa Indonesia

Purpose – Gathering knowledge regarding personality traits has long been of interest to academics and researchers in the fields of psychology and computer science. Analyzing profile data from personal social media accounts reduces data collection time, as this method does not require users to fill in any questionnaires. A pure natural language processing (NLP) approach can give decent results, and its reliability can be improved by combining it with machine learning (as shown by previous studies). Design/methodology/approach – In this study, cleaning the dataset and extracting relevant potential features (as assessed by psychological experts) are essential, as Indonesians tend to mix formal words, non-formal words, slang and abbreviations when writing social media posts. For this article, raw data were derived from a predefined dominance, influence, stability and conscientious (DISC) quiz website, returning 316,967 tweets from 1,244 Twitter accounts (filtered to include only personal and Indonesian-language accounts). Using a combination of NLP techniques and machine learning, the authors aim to develop a better approach and a more robust model, especially for the Indonesian language. Findings – The authors find that employing a SMOTETomek re-sampling technique and hyperparameter tuning boosts the model's performance on formalized datasets by 57% (as measured through the F1-score). Originality/value – The process of cleaning the dataset and extracting relevant potential features, as assessed by psychological experts, is essential because Indonesian people tend to mix formal words, non-formal words, slang words and abbreviations when writing tweets. Organic data derived from a predefined DISC quiz website resulted in 1,244 records of Twitter accounts and 316,967 tweets.


Introduction
Personality is what distinguishes individual humans from each other and defines their tendencies in their reactions and actions. Although academics and researchers have long attempted to gather knowledge about personality, it remains an evergreen area of research today. Analysis of individuals' personal social media accounts offers a promising approach, as this method does not require users to complete any questionnaires, thereby reducing the time needed and increasing credibility. Social media usage is increasing every day, and thus a huge amount of textual and visual data is uploaded to the Internet daily [1]. For such tasks, Twitter and Facebook are two of the most popular social media platforms, as they provide accessible application programming interfaces (APIs) that might be used in conjunction with external testing applications for corpus and data collection [2]. The Indonesian language, also known as Bahasa Indonesia, is an Austronesian language that is the official language of Indonesia and one of the official languages of ASEAN; according to 2021 Statista data, it is currently the 11th most spoken language in the world [3]. Twitter has a tremendous number of tweets from users, especially in Indonesia, and this is beneficial for personality profiling; it has been shown that data volume correlates positively with profiling accuracy [4].
Companies have increasingly prioritized the selection of new prospective workers based on their personalities, as they perceive particular attitudes and characters as indicative of good work performance. No company wants to risk potential losses from employees' misconduct and bad behavior [5]. In Indonesia, industrial production measures the output of businesses in the industrial sector (including in manufacturing, mining and utilities). Several models are popularly used for personality assessment in industrial societies, including the Big Five (OCEAN) personality model, the Myers-Briggs Type Indicator (MBTI) and the Keirsey Temperament Sorter. This study, however, uses the dominance, influence, stability and conscientious (DISC) assessment framework, as it explicitly concentrates on behavioral preferences and thus is more applicable, explanatory and comprehensible than the other models mentioned above [6,7]. First proposed by William Moulton Marston, the DISC model divides individuals' feelings and behaviors into four different dimensions: dominance, influence, stability and conscientious [8]. Kim Yun-Yong et al. investigated office workers' DISC behavior style and its effect on organizational commitment, job satisfaction and job performance, finding that persons with "dominant" personalities tend to have better job performance than individuals characterized as "steady" [9]. An investigation by Fariha Tabasum et al. found a significant positive relationship between the personality of a salesperson and consumers' perceptions. They also showed that, where a salesperson has an attractive personality, the sales of specific products and services increase [10]. Likewise, Joy Eberechukwu Agodi, Emmanuel Onyedikachi Ahaiwe and Aniekan Eyo Awah found a strong positive relationship between sales performance and personality traits. Successful salespersons, they noted, were empathetic, assertive and ambitious [11].
This study is a continuation of preliminary research conducted by Utami et al. that used a pure natural language processing (NLP) approach for profile analysis [12]. Adi et al. used three machine learning techniques (stochastic gradient descent (SGD), gradient boosting and stacking) to conduct personality recognition using Indonesian-language Twitter posts, finding that SGD and super learner are better than XGBoost in this case [13]. Machine learning is a growing branch of artificial intelligence that learns from data patterns to make decisions without human intervention. Machines can be trained to recognize and assess individuals' personalities [19]. Similarly, by employing support vector machine and linear regression with an LFM-1b dataset, user demographics might be identified based on music listening information [14]. Another study also used datasets from Twitter, Facebook and YouTube to recognize individuals' personalities using the decision tree and support vector machine approaches [15]. In such cases, as mentioned by Gu et al. [16], it is necessary to use NLP to clean the dataset and extract relevant potential features (as assessed by psychological experts), as Indonesians tend to mix formal words, non-formal words, slang and abbreviations when writing social media posts, with formal words dominating their compositions [17]. We aim to develop a better approach, combining NLP techniques and machine learning to form a more robust model. As Tadesse et al. mentioned in their research, using social network features for personality prediction can return better results than using only linguistic features [18].

Related studies
Empirical studies of job satisfaction, organizational commitment and job performance have been conducted for many years. For example, a survey conducted in D City between January 28 and May 30, 2010, which collected data from 315 office workers and analyzed it using SPSS/WIN 17.0, found that personality has a significant influence on organizational commitment, job satisfaction and job performance [9]. A study conducted by Fariha Tabasum et al. employed random sampling through SPSS software to conduct correlation and reliability analysis of questionnaire data collected from 172 respondents, finding that customer perception and sales are influenced by salespersons' personality traits, particularly their agreeableness. These findings were recommended to help managers develop deeper insights regarding their sales strategies, thereby enabling them to develop optimal approaches [10].
Joy Eberechukwu Agodi, Emmanuel Onyedikachi Ahaiwe and Aniekan Eyo Awah found that a strong and positive relationship exists between empathy, assertiveness and ambitiousness and sales performance. They thus underlined the need to improve the integrity, trust, capability and confidence of salespersons by setting specific targets rather than comparing individual salespersons or comparing individuals against the rest of the team [11].
A study by Utami et al. found that a pure NLP approach could be used to predict the personality traits of Twitter users, but needed to be enhanced using a better approach [12]. In four experiments on 139 users validated by psychological experts, the best accuracy rate (37.41%) was returned using a keyword vocabulary that was neither stemmed nor weighted. Several different machine learning methods have been used by researchers for prediction. For example, one study employed advanced classifiers such as XGBoost and ensemble for prediction, finding that ensemble has high accuracy (82.59%) for real-time Twitter datasets [19]. Research on personality recognition on Twitter in the Indonesian language reported a significant improvement, achieving a 1.0 ROC AUC score with SGD and super learner [4]. Tommy Tandera et al. experimented on traditional machine learning algorithms such as Naive Bayes, SVM, logistic regression, gradient boosting and LDA, using three features (LIWC, SPLICE and SNA). They showed that the SVM algorithm had the highest average accuracy in manually gathered datasets, but the results did not differ much from other algorithms [20].

Materials and methods
This study was conducted with the guidance of two psychological experts as well as previous successful works on related issues. Both experts hold master's degrees in industrial psychology and were involved mostly in maintaining psychological standards while developing the instrument and defining the features. The most challenging part of the study was the textual feature preparation, which we handled by using NLP techniques (with an Indonesian-language NLP toolkit, which is relatively limited compared to English-language ones) to clean and preprocess the data. An overview of this study's methodologies is presented in Figure 1 and described in Section 3 below.

3.1 Ground truth data collection
An Internet-based DISC instrument, developed based on industrial psychology experts' direction, was used to collect ground truth data. This remains the gold standard for data collection, as it is the closest to what experts usually do during interviews. Explicit requests for visitors' Twitter accounts and disclaimers on the academic usage of collected data were given. This quiz was based on the assessment instruments used by industrial psychology experts during live interviews in their regular business process, using a different format while still maintaining the instrument's assessment ability and credibility [8,21]. The aforementioned DISC instrument is included as supplementary material and accessible using the provided link (see Appendix).

Feature assessment
During the data collection process, experts were also consulted regarding the potential features collected from the literature reviews and the state-of-the-art classification techniques. Eight features were identified as potentially useful for DISC profiling analysis [16]. Some were approved by experts, who also added other potential features; ultimately, nine features were identified (see Table 1).

A dump of the DISC website provided the records of 3,132 people who accessed and answered the quiz. Using the Twitter username entered by the participants, we gathered corresponding posts from the previous three months. Unfortunately, not all of the records contained a Twitter username, and not all of the recorded Twitter accounts were valid for data collection, either because they were non-existent or protected. We further filtered accounts to remove bots, non-personal accounts and non-Indonesian speaking accounts, thereby reducing noise. Ultimately, data were collected from 1,244 Twitter accounts, producing a corpus of 316,967 tweets, all of which were analyzed.

Data preprocessing and feature extraction
Because Twitter users often write in non-standard forms of Indonesian, the collected tweets were not ready for immediate classification. As such, data cleaning and preprocessing were first necessary. For this, we employed InaNLP (the Indonesian language NLP toolkit) to formalize non-standard language, including expanding abbreviations and shortened words to their original forms [22], and examined the impact of this formalization on model performance. After the data were cleaned, we extracted features from the dataset based on the information collected in raw textual form.
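As a rough illustration of the formalization step, the sketch below replaces a few common Indonesian abbreviations and slang forms with their standard equivalents using a small hand-written lookup table. It does not use InaNLP and is not the toolkit's API; the dictionary `FORMAL_FORMS` and the function `formalize` are hypothetical names introduced only for this example, and the lowercasing and punctuation handling are assumptions.

```python
import re

# Hypothetical lookup table for illustration only; a real toolkit
# relies on a far larger lexicon and context-aware rules.
FORMAL_FORMS = {
    "yg": "yang",
    "dgn": "dengan",
    "tdk": "tidak",
    "gak": "tidak",
    "udah": "sudah",
}

def formalize(tweet: str) -> str:
    """Lowercase a tweet, keep alphabetic tokens, and map known informal forms."""
    tokens = re.findall(r"[a-z]+", tweet.lower())
    return " ".join(FORMAL_FORMS.get(tok, tok) for tok in tokens)

print(formalize("Aku udah makan yg enak"))
```

A dictionary lookup of this kind cannot resolve ambiguous abbreviations, which is one reason a dedicated toolkit is preferable in practice.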

Distribution check and pair plot
The collected quiz results contained the following distribution of classes: 'S': 499, 'C': 359, 'I': 229, 'D': 157. Where classes are unevenly distributed, the model usually performs poorly when used to classify a more general condition. To overcome this issue, oversampling (duplicating samples from the minority class) or undersampling (deleting samples from the majority class) techniques are often used to adjust the class distribution of a dataset. The pair plot distribution of some initial features is shown in Figure 2.
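To make the imbalance concrete, the following minimal sketch uses the class counts reported above and applies plain random undersampling and random oversampling to a list of labels. The function names and the fixed seed are assumptions made for this illustration; they are not taken from the study's code.

```python
import random
from collections import Counter

# Class counts reported for the quiz results.
counts = {"S": 499, "C": 359, "I": 229, "D": 157}
labels = [c for c, n in counts.items() for _ in range(n)]

def group_by_class(labels):
    """Map each class label to the list of sample indices carrying it."""
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    return groups

def random_undersample(labels, seed=0):
    """Keep only as many samples per class as the rarest class has."""
    rng = random.Random(seed)
    groups = group_by_class(labels)
    target = min(len(ix) for ix in groups.values())
    return sorted(i for ix in groups.values() for i in rng.sample(ix, target))

def random_oversample(labels, seed=0):
    """Duplicate samples at random until every class matches the largest class."""
    rng = random.Random(seed)
    groups = group_by_class(labels)
    target = max(len(ix) for ix in groups.values())
    kept = []
    for ix in groups.values():
        kept.extend(ix)
        kept.extend(rng.choices(ix, k=target - len(ix)))
    return sorted(kept)

imbalance_ratio = max(counts.values()) / min(counts.values())
under = Counter(labels[i] for i in random_undersample(labels))
over = Counter(labels[i] for i in random_oversample(labels))
print(round(imbalance_ratio, 2), dict(under), dict(over))
```

Undersampling here discards 616 of the 1,244 samples, while oversampling adds 752 duplicated ones, which illustrates the trade-off between losing information and repeating it.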

Results and discussion
4.1 Model fitting using default parameter and initial data distribution
Perfect balance rarely occurs in class distribution, and thus using the data directly for model building will not produce an accurate and robust model. Based on the best results identified by previous studies, we handpicked several base classifier models to be used. As expected, the initial data distribution performed poorly, as seen from the SVC (support vector classifier) example in Table 2 and Figure 3.

4.2 Model hyper-parameter tuning on resampled data
Using default parameters and the initial data distribution gave us benchmarks for improvement, and this was realized by using hyperparameter tuning and resampling on the data. Several resampling methods have been found to perform well with imbalanced datasets. Based on related studies, we handpicked several of the best performing methods: random undersampling, SMOTE and SMOTETomek [23][24][25][26][27]. To automate the search for the best hyperparameters and resampling technique, a grid search approach was used. In the grid search approach, hyperparameter tuning is performed in order to determine the optimal values for a given model; here, it is implemented using GridSearchCV from scikit-learn [28]. A comparison of hyperparameter tuning and data resampling performance, as shown from the GridSearchCV results, is provided in Table 3. The random undersampling technique involves randomly selecting examples from the majority class (in this case, the S and C classes) to delete from the training dataset, thereby reducing each class to the same number. As no new information is introduced, any underlying issues with absolute rarity are not addressed [27]. Meanwhile, the synthetic minority oversampling technique (SMOTE) oversamples the data by introducing new, non-replicated data to the minority classes (in this case, the I and D classes) from the five nearest neighbors [27]. The SMOTE preprocessing algorithm is considered the de facto standard in the framework of learning from imbalanced data [29].
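The exhaustive search that a grid search automates can be sketched in a few lines of plain Python. This is not scikit-learn's GridSearchCV and not the study's configuration: the grid values, the `resampler` names and in particular `toy_score` are hypothetical placeholders, where `toy_score` stands in for a cross-validated F1-score and returns arbitrary numbers, not results from the study.

```python
import math
from itertools import product

# Hypothetical search space: one resampling choice plus one SVC-style parameter.
param_grid = {
    "resampler": ["none", "random_under", "smote", "smote_tomek"],
    "C": [0.1, 1.0, 10.0],
}

def toy_score(params):
    """Arbitrary placeholder for a cross-validated score; NOT study results."""
    bonus = {"none": 0.0, "random_under": 0.1, "smote": 0.2, "smote_tomek": 0.3}
    return bonus[params["resampler"]] - abs(math.log10(params["C"]))

def grid_search(param_grid, score_fn):
    """Score every combination in the grid and return the best one."""
    names = list(param_grid)
    best_params, best_score, n_tried = None, float("-inf"), 0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        n_tried += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_tried

best_params, best_score, n_tried = grid_search(param_grid, toy_score)
print(n_tried, best_params)
```

The cost grows multiplicatively with each added parameter (here 4 x 3 = 12 candidates), and each candidate would be evaluated once per cross-validation fold in a real run.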
SMOTETomek is a good way to avoid the disadvantages of using either the SMOTE or the Tomek Link technique alone. It is applied here using the imbalanced-learn library and combines a SMOTE function for oversampling with a Tomek Link function for undersampling [29]. The algorithm chains SMOTE and Tomek Link to form a pipeline [25].
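The two stages can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the imbalanced-learn implementation: it oversamples a single minority class, uses Euclidean distance, and removes both members of every Tomek link (a pair of mutual nearest neighbours with different labels). All function names are hypothetical.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (needs >= 2 points)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic

def tomek_links(points, labels):
    """Return index pairs that are mutual nearest neighbours with different labels."""
    def nearest(i):
        return min((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
    links = []
    for i in range(len(points)):
        j = nearest(i)
        if i < j and nearest(j) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links

def smote_tomek(points, labels, minority_label, n_new, k=5, seed=0):
    """Oversample one minority class with SMOTE, then drop all Tomek links."""
    minority = [p for p, lab in zip(points, labels) if lab == minority_label]
    synthetic = smote(minority, n_new, k, seed)
    all_points = list(points) + synthetic
    all_labels = list(labels) + [minority_label] * len(synthetic)
    drop = {i for pair in tomek_links(all_points, all_labels) for i in pair}
    keep = [i for i in range(len(all_points)) if i not in drop]
    return [all_points[i] for i in keep], [all_labels[i] for i in keep]

points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (0.5, 0.5), (0.55, 0.5)]
labels = ["S", "S", "D", "D", "S", "D"]
print(tomek_links(points, labels))
```

In the toy data, the only Tomek link is the pair of differently labelled points that sit next to each other near the class boundary; removing such pairs is what cleans up the overlap that interpolation can introduce.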

Best model evaluation
After identifying the best performing scenario using SVC and the SMOTETomek resampling technique, we performed five-fold cross-validation using the best hyperparameter values. Figure 4 shows how the class distribution transformed after resampling. The results show that the formalized dataset performed slightly better, with a 0.009 difference in the F1-score. Details are presented in Table 5.
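The evaluation procedure can be illustrated with a minimal sketch of stratified five-fold splitting and a multi-class F1-score. The text does not state how the F1-score was averaged across the four classes, so macro averaging is an assumption here, as are the function names and the round-robin fold assignment.

```python
def stratified_folds(labels, k=5):
    """Assign sample indices to k folds so each fold keeps the class proportions."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # round-robin within each class
    return folds

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1-scores (macro averaging is an assumption)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Labels rebuilt from the reported class counts; predictions below are toy values.
all_labels = ["S"] * 499 + ["C"] * 359 + ["I"] * 229 + ["D"] * 157
folds = stratified_folds(all_labels, k=5)
print([len(f) for f in folds])
print(macro_f1(["D", "I", "S", "C", "S", "C"], ["D", "S", "S", "C", "S", "I"]))
```

When resampling is combined with cross-validation, it is generally advisable to resample only the training folds so that synthetic or duplicated samples never leak into the fold used for scoring.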

Ethics and privacy
In this study, users' privacy and general ethics were serious concerns, especially during the process of collecting and analyzing data from social media accounts. Boyd and Crawford (2012) write "... it is problematic for researchers to justify their actions as ethical simply because the data are accessible. ... The process of evaluating the research ethics cannot be ignored simply because the data are seemingly public" [30]. Although open discussions using Twitter differ from protected or private posts on Facebook, our main concern was to not cross the line between public and private posts. We always protected anonymity by replacing usernames with specific codes. Usernames are never published or used in the screening, classification and any manual assessment processes. As our concern deals with human resource development, we are aware that job applicants' social media usernames are commonly collected today and thus informed consent for profiling analysis was necessary for the process. It is also worth mentioning that we included disclaimers regarding the educational purposes of data collection and usage on the DISC test website.

Conclusion and future works
In this study, we explored the possibility of conducting DISC analysis using social media posts written in Bahasa Indonesia, the official language of Indonesia. Data collection sought to comply with the gold standard of profiling analysis conducted conventionally by experts using a DISC analysis instrument, which was used to collect textual and numerical information that is considered connected to a person's personality. A combination of NLP and a statistical approach, complemented by a machine learning algorithm, proved effective in improving model performance. We balanced the dataset using several resampling techniques, with SMOTETomek returning the best performance. A hyperparameter-tuned support vector classifier outperformed several supervised and ensemble learning algorithms, with an F1-score of 56.43%. This affirms that automatic personality classification from social media information in Bahasa Indonesia is feasible and merits more in-depth research. According to human resource experts, observation analysis of other social media platforms (such as Facebook, Instagram and LinkedIn) is also commonly practiced during the employee selection process. A combination of textual and visual information derived from those platforms might be able to provide more comprehensive and better classification results. Text mining of new prospective workers' resumes may also give comparable insight, and we aim to achieve such insight soon in future works.

Figure 1. Overview of the methodologies

Table 2. Evaluation table of SVC default parameter and initial data

Table 3.

Table 4. Evaluation