Benchmarking of Multi-Class Algorithms for Classifying Documents Related to Stunting

Abstract: Stunting is a condition in which children experience impaired growth and development, caused by malnutrition, repeated infections, and inadequate psychosocial stimulation. It often remains unrecognized due to a lack of awareness in the community. Therefore, the first step towards developing a solution for stunting is to understand the level of awareness and the sentiment of the community towards issues related to stunting. As online media are widely used in everyday life, they offer significant potential towards providing such an understanding. However, exploiting this potential requires extensive identification of documents containing discussions of stunting among lay people, to accurately gauge the awareness and sentiments of the community towards stunting. This task is a multi-class classification problem. We perform a benchmark study, using data from the Indonesian context, to comparatively evaluate the performances of four algorithms, i.e., logistic regression, naive Bayes, random forest, and support vector machine (SVM), and three extracted features, namely term occurrence, term presence, and term frequency-inverse document frequency (TF-IDF). The SVM method coupled with TF-IDF produced the highest accuracy value of 0.98, with a standard deviation of 0.03, due to its capability to automatically model the interaction between features.


Introduction
Stunting is a condition that affects children and is characterized by impaired growth and development, leading to their height being much lower than that of a healthy child of their age. It is caused by malnutrition, repeated infections, and inadequate psychosocial stimulation. Formally, a child is identified as being affected by stunting when their height is more than two standard deviations below the World Health Organization (WHO) Child Growth Standards median [1]. Stunting is estimated to affect a significant number of children worldwide, and its effects on affected individuals and communities are serious and long-term.
According to WHO data, Indonesia ranks fifth highest among all countries in the number of people affected by stunting. Consequently, President Joko Widodo (Jokowi) of the Republic of Indonesia stated that reducing the incidence of stunting in the country is a key concern, and instructed the ministers and his staff at a press conference to develop an integrated action plan to reduce stunting in Indonesia [2]. Nevertheless, to reduce stunting rates in Indonesia, the development of an integrated action plan must be supported by concrete steps to understand and address the root causes of stunting in the country. A significant factor that enables the prevalence of stunting is the lack of awareness of issues related to stunting, its causes, and its dangers. Stunting commonly advances unrecognized, particularly in communities where short stature is considered normal [3]. Therefore, there is a need to understand the level of awareness and appreciation of the public regarding the causes and dangers of stunting. Such an understanding can enable stunting campaigns to be better targeted and more effectively reduce stunting rates in Indonesia.
In this study, we contribute to the development of such an understanding of community awareness and sentiment towards stunting issues, with a focus on the Indonesian context, by exploring the mining of public discussions and articles regarding stunting in Indonesia. As social media and other online media, including online news and crowd-sourced content, are used widely in everyday life, they offer significant potential for mining information about the level of awareness and the sentiment of the society towards stunting in Indonesia.
However, data on social and other online media can pertain to various subjects, and it can be challenging to retrieve data that is specifically related to stunting. Specifically, a document, e.g., a post or an article, might be unrelated to stunting, might contain information about stunting, or might contain a discussion of stunting among lay people. In order to correctly gauge the level of awareness and the sentiment of the community towards stunting, we need to extensively identify documents containing discussions related to stunting among lay people. Since there are three classes of documents as mentioned above, the task of automatically detecting a document containing a discussion of stunting among lay people is a multi-class classification task.
In this study, we perform a comparative evaluation of the performances of several algorithms, to provide a benchmark of algorithms for multi-class classification, and specifically for Indonesian language documents related to stunting issues. Such a benchmark can enable researchers and policy makers to choose the best-suited approach for extensively retrieving relevant discussions to study community awareness and sentiment towards stunting.
The rest of this paper is organized as follows. Section 2 describes prior studies on multi-class classification tasks for textual domain data, while the methodology adopted in our study is described in detail in Section 3. Section 4 discusses the experimental setup, and the experimental results and analysis. Finally, Section 5 concludes the paper.

Related Works
A multi-class classification task aims to assign a class label to each input example. It is similar to a binary classification task, but the assigned classes are drawn from K classes, where K is greater than two. Formally, it can be defined as follows [4]: given a training data set of the form (x_i, y_i), where x_i ∈ R^n is the i-th example and y_i ∈ {1, 2, . . . , K} is the class label of the i-th example, a multi-class classification model is a function H: R^n → {1, 2, . . . , K} that predicts a single label for a new unseen example. Obtaining a good prediction for multi-class classification is challenging, since the classification model has to distinguish between a large number of classes [5]. Therefore, many studies have been conducted on multi-class classification, and they commonly adopt one of three approaches: (1) apply some kind of transformation over the training data, (2) design a better learning algorithm by modifying or adjusting the hyper-parameters of existing algorithms, or (3) combine both approaches.
A strategy that is widely adopted in text classification is to use different feature extraction techniques, such as term occurrence [6][7][8][9], term presence [8], and term frequency-inverse document frequency (TF-IDF) [8][9][10][11][12]. The examples of extracted features shown in Equations (2)-(4) consist of 8 words, i.e., (w_1, w_2, . . . , w_8). Each element of the term occurrence feature shows how many times the corresponding word occurs in a document; the example in Equation (2) shows that the document contains w_1 twice, w_3 three times, w_4 four times, w_6 once, and w_7 five times. Each element of the term presence feature instead shows whether the corresponding word appears in the document at all; for example, Equation (3) means that the document contains the words w_1, w_3, w_4, w_6, and w_7. Finally, in the TF-IDF feature, each value shows the weight of how important the corresponding word is in a document; for example, in Equation (4), the importance weight of w_7 in the related document is 0.07.
Moreover, several studies on multi-class classification have been conducted in various fields, including those related to health. Research related to multi-class classification is mostly performed using classical machine learning methods, such as that of Yestigsen and Pratt [13], which uses the support vector machine (SVM) method to perform multi-class classification using a dataset in the form of a MEDLINE document from the OHSUMED dataset. In their study, the best classification results were obtained using a hybrid approach, with values of precision = 0.78, recall = 0.46, and F-measure = 0.60 [13].
The random forest method has also been used in some studies. For example, Bouazizi and Ohtsuki [14] used a Twitter dataset to obtain a classification accuracy of 60.2% for 7 different classes and 81.3% for binary classification. Further, in studies conducted by Ximale, Hinde, and Stone on multi-class classification of webpages on training courses, using the naive Bayes method, the highest F-measure obtained was 97% [15]. Another method that has been adopted for multi-class classification is logistic regression. For example, a study by Kurt et al. [16] used logistic regression to predict the presence of coronary artery disease (CAD) based on several factors, i.e., age, sex, family history of CAD, smoking status, diabetes mellitus, systemic hypertension, hypercholesterolemia, and body mass index (BMI). The outcomes of these studies suggest that the performances of the learning algorithms vary, and depend on the combination of various feature extraction techniques and learning algorithms. To identify the best performing algorithm requires a trial and error approach combining several feature extraction techniques and learning algorithms.
Therefore, in this study we perform an experimental benchmarking of the performance of four different algorithms that are widely implemented in multi-class classification problems, specifically in the textual domain, based on the classification of data with respect to stunting issues in Indonesia. In addition to the four algorithms, we also evaluate three types of textual features, to obtain three different representations of the training data.

Methodology
This research adopts a multi-stage process, including data collection and selection, pre-processing, feature extraction, training, testing, and performance evaluation. We incorporate three types of extracted features, i.e., term occurrence (TO), term presence (TP), and term frequency-inverse document frequency (TF-IDF), and four learning algorithms, i.e., logistic regression, naive Bayes, random forest, and support vector machine (SVM), in the training process. Figure 1 provides an overview of the research methodology.

Data Collection and Selection
The dataset is collected by exploring news portal websites, government websites, and health websites, and by retrieving Twitter data via the Twitter API (Application Programming Interface). The data collected is in Indonesian and English. Specifically, data related to information about stunting is collected from news portal websites, government websites, and health websites. Data related to discussions among lay people is collected by crawling Twitter posts (also called tweets) from the Twitter account @cegahstunting for data in Indonesian and @1000days for data in English, while data unrelated to stunting is collected by crawling tweets from health, social, and political sources. These three sources were selected since they often contain words or terms that are also mentioned in discussions or information about stunting; it is therefore expected that the same words/terms within different contexts will be differentiated into different classes. Details of the data obtained are provided in Table 1. Subsequently, we perform a round of random selection over the collected data to obtain the same number of documents for each class (stunting information, lay discussion on stunting, and unrelated to stunting) in our dataset. Hence, the data selection stage aims to create a balanced dataset.


Pre-Processing
The pre-processing stage consists of three sequential processes, as illustrated in Figure 2.


Tokenization
Tokenization is the process of decomposing a document into smaller units such as words, phrases, symbols, and other meaningful elements to map the words in the document [17]. These units are commonly referred to as tokens.

Stop words removal
The stop words removal process filters out common words in documents, usually in the form of articles, prepositions, pronouns, or other words that do not add any significant meaning to the documents [10].


Stemming
Stemming provides a mapping of different morphological variants of words to their base/common form (stem) [18]. Since we incorporate documents in both Indonesian and English, we adopt different stemming algorithms for the two languages: the Porter stemmer for English documents, since it is efficient and simple [19], and the Sastrawi stemmer for Indonesian documents. The Sastrawi stemmer is a modified stemming algorithm derived from previously developed stemmers [20][21][22].
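The three pre-processing steps can be sketched as follows. This is a minimal illustration: the stop-word list and the suffix-stripping rules below are toy placeholders rather than the resources used in the study (where the Porter and Sastrawi stemmers perform the stemming step), so only the shape of the pipeline is meant to be representative.

```python
import re

# Toy stop-word list; the study uses full stop-word lists for each language.
STOP_WORDS = {"the", "is", "of", "and", "a", "in"}

def tokenize(document: str) -> list[str]:
    """Decompose a document into lowercase word tokens."""
    return re.findall(r"[a-z]+", document.lower())

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Filter out common words that add little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]

def toy_stem(token: str) -> str:
    """Toy suffix-stripping stemmer standing in for Porter/Sastrawi."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

doc = "Stunting is a condition affecting the growth of children"
tokens = remove_stop_words(tokenize(doc))
stems = [toy_stem(t) for t in tokens]
print(stems)  # ['stunt', 'condition', 'affect', 'growth', 'children']
```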

Feature Extraction
Three features are incorporated in this study, i.e., term occurrence (TO), term presence (TP), and term frequency-inverse document frequency (TF-IDF). TO, also known as raw term frequency, is a vocabulary vector, each of whose elements represents the number of occurrences of the corresponding word in a document. Many studies also refer to the TO feature as Bag-of-Words (BoW). TP is a simple form of TO in which each element of the vector is either 0 or 1, depending on whether the corresponding word occurs in the document or not.
TF-IDF is a numerical statistic that is intended to reflect the importance of a word to a document in a collection or corpus [23], so that the relevance of a word to all documents in the corpus can be known. The weight of a term can be calculated as follows [24]:

idf_i = log(N / n_i),    w_ij = tf_ij × idf_i

where
tf_ij: term frequency weight of the index term k_i in the document d_j, derived from f_ij
f_ij: frequency of occurrence of the index term k_i in the document d_j
idf_i: inverse document frequency of the term k_i
N: total number of documents in the corpus
n_i: number of documents containing the term k_i
w_ij: TF-IDF weight of the term k_i in the document d_j
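As a worked illustration of this weighting, the following sketch computes w_ij = tf_ij × log(N / n_i) over a tiny invented corpus, using the raw count as the term-frequency weight (one common choice; the exact tf weighting in the study may differ):

```python
import math

# Three toy documents, invented for illustration only.
corpus = [
    ["stunting", "child", "growth"],
    ["stunting", "nutrition"],
    ["politics", "news"],
]
N = len(corpus)

def tf_idf(term: str, document: list[str]) -> float:
    tf = document.count(term)                      # f_{i,j}: raw frequency
    n_i = sum(1 for d in corpus if term in d)      # documents containing term
    return tf * math.log(N / n_i) if n_i else 0.0  # w_{i,j} = tf * idf

# "stunting" appears in 2 of the 3 documents, so its idf is log(3/2)
print(round(tf_idf("stunting", corpus[0]), 3))  # 0.405
```

A term that occurs in every document gets idf = log(1) = 0, so its weight vanishes, which is exactly the "down-weight ubiquitous terms" behaviour described above.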

Training Process
In the training process, we implemented four different learning algorithms as mentioned in Figure 1, i.e., logistic regression, naive Bayes, random forest, and SVM. Figure 3 illustrates both the training and testing processes, and the following sub-sections explain each algorithm in detail.

Training Process Using Logistic Regression
Logistic regression is a machine learning method based on the sigmoid curve. It is suited to binary classification, but can be adapted to multi-class classification by using the one-vs.-rest scheme. The sigmoid equation used in logistic regression is:

f(x) = 1 / (1 + e^(-x))

where
f(x): sigmoid function of x
e: Euler's number (≈ 2.71828)
x: input value
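A minimal sketch of the sigmoid function and the one-vs.-rest decision rule follows; the per-class scores are hypothetical stand-ins for the outputs of fitted linear models (w·x + b), not values from the study:

```python
import math

def sigmoid(x: float) -> float:
    """f(x) = 1 / (1 + e^{-x}), mapping any real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def one_vs_rest_predict(scores: dict[str, float]) -> str:
    """One-vs.-rest: pick the class whose binary classifier is most confident."""
    return max(scores, key=lambda c: sigmoid(scores[c]))

print(sigmoid(0.0))  # 0.5: a score of zero is maximally uncertain
print(one_vs_rest_predict({"A": 1.2, "B": -0.3, "C": 0.4}))  # A
```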

Training Process Using Naïve Bayes
The naive Bayes model determines the class of a data sample by summing up the frequency of combinations of the given dataset values to calculate a set of probabilities [25]. In classifying a data sample, the naive Bayes classifier assumes that the presence or absence of a particular feature of a class is independent of the presence or absence of other features [26]. The naive Bayes classifier predicts the class c* that maximizes the posterior probability:

c* = argmax_c P(c) ∏_{i=1}^{V} P(w_i | c)

where x⃗ = (w_1, . . . , w_V) is a document with unknown class. Two types of naive Bayes are implemented in this study, i.e., multinomial naive Bayes and Gaussian naive Bayes. Gaussian naive Bayes was chosen for the TF-IDF feature because TF-IDF values fall into the continuous data category, and a typical assumption is that the continuous values associated with each class are distributed according to a normal (or Gaussian) distribution. Meanwhile, multinomial naive Bayes was selected for the term occurrence and term presence features because the vectors of these two features represent the frequency and appearance of words, which can be associated with a multinomial distribution. The prior probability P(c) can be estimated by dividing the number of documents belonging to class c by the total number of documents. The likelihood probability for multinomial naive Bayes can be calculated as follows:

P(w_i | c) = count(w_i, c) / count(c)

where
w_i: the i-th word in a document
i: index of the word, i = 1, 2, . . . , V
V: size of the vocabulary
P(w_i | c): the probability of the occurrence of the i-th word given that the class is c
count(w_i, c): the number of occurrences of the term w_i in training documents from class c, including multiple occurrences of a term in a document
count(c): the total number of terms belonging to class c

Furthermore, the difference between Gaussian naive Bayes and multinomial naive Bayes lies in the calculation of the likelihood value, as follows:

P(x_i | c) = (1 / √(2π σ²_{c,i})) exp(−(x_i − μ_{c,i})² / (2σ²_{c,i}))

where
μ_{c,i}: the mean of the feature values of term w_i associated with class c
σ²_{c,i}: the variance of the feature values of term w_i associated with class c
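The two likelihood estimates can be illustrated as follows; the counts and feature values in the example are invented, and smoothing (commonly applied in practice) is omitted for brevity:

```python
import math

def multinomial_likelihood(count_wi_c: int, count_c: int) -> float:
    """P(w_i | c) = count(w_i, c) / count(c), the multinomial estimate."""
    return count_wi_c / count_c

def gaussian_likelihood(x: float, mu: float, var: float) -> float:
    """Normal density with per-class mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy multinomial example: the term occurs 3 times among the 12 terms of class c
print(multinomial_likelihood(3, 12))  # 0.25

# Toy Gaussian example: a TF-IDF value equal to the class mean is most likely
print(round(gaussian_likelihood(0.0, 0.0, 1.0), 4))  # 0.3989
```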

Training Process Using Random Forest
The random forest method uses a large number of decision trees that operate as an ensemble. Each tree casts a vote for the predicted class based on a random process, and the class with the most votes becomes the prediction of the model [27]. While simple in concept, the random forest method can be effective because the individual trees are not strongly correlated and therefore cancel out each other's errors, potentially outperforming other individual models. To produce desirable outcomes, the random forest method requires two conditions: (1) there must be an actual signal in the features, so that the model is built using features that are better than random guessing, and (2) the predictions of the trees must have a low correlation with each other.
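The voting step described above can be sketched as follows; the per-tree predictions are invented for illustration:

```python
from collections import Counter

# Each tree in the forest casts one vote; the majority class wins.
def majority_vote(tree_predictions: list[str]) -> str:
    return Counter(tree_predictions).most_common(1)[0][0]

votes = ["B", "A", "B", "C", "B"]  # hypothetical predictions from five trees
print(majority_vote(votes))  # B, with three of five votes
```

The de-correlation condition matters here: if the trees' errors were highly correlated, the wrong votes would pile onto the same class instead of being spread out and outvoted.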

Training Process Using SVM
SVM is a method for computing the best hyperplane that separates two distinct classes [28]. SVM was originally designed for linear classification, but can be used for non-linear classification problems by applying the kernel trick. In this study, we implemented a non-linear SVM with the RBF (Radial Basis Function) kernel, since it can generally be used for all types of data. In addition, we also employed several other kernel functions, i.e., polynomial of degree 3, linear, and sigmoid. We chose the polynomial and linear kernel functions since they are less time-consuming than the RBF kernel, making it worthwhile to check whether they can produce accuracy values equal or close to those of the RBF kernel. The sigmoid kernel function originates from the neural networks field: an SVM with a sigmoid kernel is equivalent to a two-layer perceptron, which performs well in practice.
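A minimal sketch of the RBF kernel, K(x, y) = exp(−γ‖x − y‖²); the γ value chosen here is illustrative, not the one tuned in the study:

```python
import math

def rbf_kernel(x: list[float], y: list[float], gamma: float = 0.5) -> float:
    """K(x, y) = exp(-gamma * ||x - y||^2): similarity decays with distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # 1.0: identical points
print(round(rbf_kernel([0.0, 0.0], [1.0, 1.0]), 3))  # 0.368: exp(-1)
```

Because the kernel depends only on the distance between points, the SVM can separate classes whose boundary is non-linear in the original feature space.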

Testing Process
We implemented 10-fold cross validation to divide the dataset into training and testing subsets; hence the proportion of data splitting was 90% training data and 10% testing data. Since we had 900 data samples, each fold consisted of 90 data samples, with 30 data samples from the stunting information class, 30 from the lay discussions on stunting class, and 30 from the unrelated to stunting class. As mentioned in Table 1, we employed a balanced dataset, since a classification model built on an imbalanced dataset has minimal sensitivity to minority classes, which can lead to poor classifier performance. We ran both the training and testing processes 10 times (for 10-fold cross validation), where each iteration was carried out by running the training process on 9 folds and the testing process on the remaining fold. The iterations were carried out so that each fold was tested exactly once.
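The fold arithmetic can be checked with a short sketch; note that this simple splitter ignores the per-class stratification (30 samples per class per fold) used in the study:

```python
# 900 samples and 10 folds give folds of 90; each fold is the test set once,
# yielding the 90%/10% train/test split described above.
def k_fold_indices(n_samples: int, k: int) -> list[list[int]]:
    fold_size = n_samples // k
    return [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]

folds = k_fold_indices(900, 10)
for test_fold in range(10):
    test_idx = folds[test_fold]
    train_idx = [i for f in range(10) if f != test_fold for i in folds[f]]
    assert len(test_idx) == 90 and len(train_idx) == 810  # 10% / 90%
print(len(folds))  # 10
```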

Performance Evaluation
The performance of each learning algorithm was evaluated using the accuracy value, since we used a balanced dataset, i.e., 900 data samples across the three classes, with 300 data samples for each class. In addition, there was no difference in the level of importance between classes, so evaluation metrics such as precision, recall, or F1-measure were not required. As we had 3 classes in our multi-class classification task, i.e., the stunting information class (A), the lay discussion on stunting class (B), and the unrelated to stunting class (C), we can compute the confusion matrix as shown in Figure 3.
The green cells in the figure represent the true positives for each class, i.e., correctly classified samples. For example, TP_A denotes the number of true positives for class A, i.e., the number of testing data samples that actually belong to class A and are also predicted as class A by the algorithm. By contrast, the red cells represent misclassified samples. For example, A_PB represents the number of samples with true class A that are predicted as class B by the algorithm. The sum of A_PB and A_PC is the number of false negatives for class A. The sum of the correctly classified and misclassified samples equals the size of the dataset. Finally, the accuracy value is the ratio between the number of correctly classified samples (all green cells) and the total size of the dataset (all cells). Since we implemented 10-fold cross validation, as previously explained, we computed the mean and standard deviation of the accuracy obtained in the experiment for each fold.
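The accuracy computation from the 3×3 confusion matrix can be sketched as follows; the cell counts are invented and do not come from the study's results:

```python
# Rows are true classes A, B, C; columns are predicted classes A, B, C.
# The diagonal holds the correctly classified samples (the "green cells").
confusion = [
    [28, 1, 1],   # true class A: 28 correct, 1 predicted B, 1 predicted C
    [2, 27, 1],   # true class B
    [0, 2, 28],   # true class C
]
correct = sum(confusion[i][i] for i in range(3))   # sum of the diagonal
total = sum(sum(row) for row in confusion)         # all cells
accuracy = correct / total
print(round(accuracy, 3))  # 0.922 for this invented fold of 90 samples
```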

Results and Analysis
In this section we provide details of the experimental setup, as well as the experimental results and analysis.

Experimental Setup
As explained previously, the research data consisted of 900 data samples for three classes. The setup was implemented in the Python programming language and run in Google Colaboratory. We performed parameter tuning for each algorithm as follows:

1. Logistic regression: regularization parameter (values ranging from 0.01 to 100).
2. Random forest: maximum number of features (auto, sqrt, and 0.2) and number of trees to build before taking the maximum voting or averages of predictions (10, 50, and 100).
3. SVM: kernel function (RBF, linear, polynomial degree 3, and sigmoid).

In this study, parameters for naive Bayes were not tuned; instead, two different models were used, as described in Section 3.4.2, and the justification for selecting the kernel functions to be tuned has been given in Section 3.4.4. Furthermore, we tuned the regularization parameter of the logistic regression method to improve generalization performance, namely the performance when classifying new data. We employed the three options for the maximum number of features of the random forest offered by the Scikit-learn library in Python, whereas the number of trees to build before taking the maximum voting or averages of predictions can be set to any value, subject to processor capability, since a higher number of trees gives better performance but makes the code slower.

Experimental Results and Analysis
We report the accuracy values for each parameter value and each feature. All result tables show the mean accuracy over the ten folds and the corresponding standard deviation. For example, the value in the first row of the first column of Table 2, namely 0.60 ± 0.27, means that the experiment using the logistic regression method with a regularization parameter of 0.01 and the term occurrence feature has a mean accuracy value of 60% with a standard deviation of 27%.
According to Table 2, the accuracy value increases with the regularization value; the best accuracy for all three features is therefore obtained at a regularization value of 100 (marked in yellow). The value of the regularization parameter equals 1/λ, where λ controls the complexity of the built model. If the λ value is large, the model built is very simple, which causes underfitting. Since the regularization parameter is inversely proportional to λ, increasing the regularization value increases the accuracy value.

According to Table 3, the best accuracy for all features was obtained with the following parameters: (i) a maximum number of features of 0.2, and (ii) 50 trees. Setting the maximum number of features to 0.2 increases the number of features employed compared to sqrt, which increases the accuracy value since at each node there is a higher number of options to consider. The second parameter, i.e., the number of trees to build before taking the maximum voting or averages of predictions, can be determined by trial and error; in this study, its best value is 50.

The parameter tuned for SVM is the kernel function. As mentioned in Table 4, we observed four different kernel functions. The best kernel function for TF-IDF is RBF, while the best kernel function for TO and TP is the linear function. There is no parameter tuning for naive Bayes, since the choice of likelihood distribution must be in accordance with the employed features. Based on the results in Tables 2-4, a summary of the best accuracy for each feature is provided in Table 5. These values were computed for all pairs of learning algorithms and employed features.
According to Table 5, the best accuracy value is 0.98, with a standard deviation of 0.03, obtained with SVM as the learning algorithm and TF-IDF as the employed feature. TF-IDF conceptually assigns a lower weight to a term that occurs in almost every document; therefore, TF-IDF represents a term's relative importance within a document. As this study aims to classify documents containing stunting information (class A), documents containing lay discussions about stunting (class B), and documents unrelated to stunting (class C), the distribution of words with low TF-IDF in class A documents, for example, is very different from the distribution of the same words in documents belonging to class B or C. Hence, the TF-IDF feature can be considered a representative feature to pair with SVM and the RBF kernel, which is capable of handling non-linear relationships between class labels and attributes [29]. SVM outperforms algorithms such as naive Bayes, logistic regression, and random forest because it generalizes well on larger documents and datasets, and can automatically model the interaction between features [30].
The implementation of naïve Bayes for the TO and TP features uses the multinomial variant of naïve Bayes. The results indicate that it provides better accuracy than the implementation for TF-IDF with Gaussian naïve Bayes. This is perhaps because, in classification tasks, whether a word occurs or not seems to be more important than the frequency of occurrence of the word, which renders the multinomial distribution better suited for the task than the Gaussian distribution.

Conclusions
In this study, we performed a comparative evaluation of the performances of several algorithms in order to provide a benchmark for algorithms for multi-class classification, and specifically for documents related to stunting issues in the Indonesian context. The results indicate that the SVM method coupled with TF-IDF was able to fairly extensively identify documents containing information and discussions among lay people related to stunting, producing the highest accuracy value of 0.98, with a standard deviation of 0.03, due to its capability to automatically model the interaction between features. The results can enable researchers and policy makers to choose among various approaches to extensively retrieve relevant discussions to study community awareness and sentiment towards stunting.