Can Multinomial Logistic Regression Predict Research Group Using Text Input?

When submitting proposals in SISINTA, students often submit them to a less relevant or incorrect research group. There are 13 research groups to choose from. We propose a text classification method to help students find the best research group based on the title and/or abstract. The stages in this study are data collection, data preprocessing, classification using Logistic Regression, and evaluation of the results. We test three research group classification scenarios based on 1) title only, 2) abstract only, and 3) title and abstract. Based on the experiments, classification using title-only input is the best overall. This scenario achieves the best results, with accuracy, precision, recall, and f1-score of 63.68%, 64.91%, 63.68%, and 63.46%, respectively. This result is sufficient to help students find the best research group based on the title text. In addition, lecturers can comment more elaborately since the proposals are relevant to the research group’s scope.

classification, such as sentiment analysis [4]. Finally, we evaluate how well the LR predicts the research group based on the text input.

II. Method
The stages of the research methodology are shown in Figure 1. At the data collection stage, we collected raw data from DEEI's SISINTA database by dumping the SQL data into a Microsoft Excel file. No personal information such as students, supervisors, grades, or logs was included in the export. The main content we retrieved was text information relevant to theses and final projects. The data, obtained from 16 April 2016 to 4 October 2022, contained 2164 samples, and the SISINTA administrator confirmed that these data are accurate. Each sample has three fields: the title and abstract, which serve as independent variables, and the research group class, which is the target. The thirteen research groups and their class distributions are shown in Table 1. This table shows an imbalanced distribution of research groups, a challenge that our proposed method tackles with a resampling technique.
Text preprocessing is carried out to ensure the text data is 'clean' and the algorithm can learn from it [5]. It involves stages that make text information more structured [6]: text cleaning, removing missing values, removing duplicate rows, tokenization, stopword removal, and stemming.
Text cleaning consists of four steps. First, tag removal removes the HTML tags contained in the document [7]. Much of the text data contains HTML tags, which often happens when students copy and paste text from a document processor into the SISINTA input form. We use regular expression (regex) filtering to remove HTML tags and keep the informative text. Say inputText = "<h1> Hello </h1>". With regex = re.compile(r'<[^>]+>'), the call regex.sub('', inputText) returns the text Hello, with surrounding spaces that the trimming step removes. Second, case folding converts capital letters to lowercase. This prevents the computer from interpreting the same word as having different meanings [8]. For instance, in Python, "Case".casefold() returns "case". The third step, text trimming, removes white space at the beginning and end of the text [9]. In Python, this is achieved with the strip() function, which removes spaces from both ends. The last step removes punctuation, special characters, double white space, and numbers [10]. We again apply regex for this purpose, extending the pattern with the additional special characters to be removed.
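The four cleaning steps above can be sketched as a single function. This is a minimal illustration using only the Python standard library, not the exact code used in the study: the name clean_text and the patterns NON_LETTER_RE and SPACE_RE are assumptions introduced here, and only the tag pattern is taken from the text.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")          # HTML tags (pattern quoted in the text)
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # ASSUMPTION: drop everything except a-z and white space
SPACE_RE = re.compile(r"\s+")            # runs of white space


def clean_text(raw: str) -> str:
    """Apply tag removal, case folding, character filtering, and trimming."""
    text = TAG_RE.sub(" ", raw)          # 1) tag removal
    text = text.casefold()               # 2) case folding
    text = NON_LETTER_RE.sub(" ", text)  # 4) punctuation, digits, special characters
    text = SPACE_RE.sub(" ", text)       #    collapse double white space
    return text.strip()                  # 3) trim both ends


cleaned = clean_text("<h1> Hello, World 2022! </h1>")
print(cleaned)
```

Trimming is applied last in this sketch because replacing punctuation with spaces can reintroduce white space at the ends of the string.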
The second stage of text preprocessing is to remove missing values. This step handles missing data by removing columns or rows whose data is unavailable or NaN (Not a Number). The purpose of this deletion is to reduce data bias [11]. The third stage of text preprocessing is to remove duplicate or redundant samples [12], which minimizes the overfitting effect caused by duplicates [13].
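As a rough sketch of these two stages, the snippet below drops rows whose text field is missing and then drops repeated texts, keeping the first occurrence. The paper does not state which tool performed this step, so the plain-Python helpers is_missing and drop_missing_and_duplicates, and the sample rows, are hypothetical.

```python
import math


def is_missing(value) -> bool:
    """Treat None, NaN, and empty strings as missing."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


def drop_missing_and_duplicates(rows, field):
    """Keep the first row for each distinct, non-missing value of `field`."""
    seen = set()
    kept = []
    for row in rows:
        value = row.get(field)
        if is_missing(value) or value in seen:
            continue
        seen.add(value)
        kept.append(row)
    return kept


# Hypothetical sample rows, for illustration only.
rows = [
    {"title": "sistem deteksi wajah", "group": "A"},
    {"title": None, "group": "B"},
    {"title": "sistem deteksi wajah", "group": "A"},
    {"title": float("nan"), "group": "C"},
    {"title": "klasifikasi teks", "group": "B"},
]
cleaned_rows = drop_missing_and_duplicates(rows, "title")
```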
The fourth stage of text preprocessing is tokenization. We use the Natural Language Toolkit (NLTK) for this step, specifically the nltk.tokenize package. The goal is to break sentences down into words or tokens [14]. In this study, tokenization splits the title and abstract into word fragments to identify the words and their separators. Hence, tokenization helps extract meaning from text.
The fifth stage of text preprocessing is stopword removal, or text filtering. We use the stopwords corpus from nltk.corpus to filter out stop words such as 'diperlukan', 'hendaknya', and 'tapi'.
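Tokenization and stopword removal can be illustrated with a small standard-library stand-in. This sketch does not call NLTK: the regex tokenizer and the three-word stopword set (only the words quoted above) are simplifying assumptions, whereas the study relies on nltk.tokenize and the NLTK stopword corpus.

```python
import re

# ASSUMPTION: tiny illustrative stopword set containing only the words quoted in the text.
STOPWORDS = {"diperlukan", "hendaknya", "tapi"}
TOKEN_RE = re.compile(r"[a-z]+")  # ASSUMPTION: a token is a run of ASCII letters


def tokenize(text):
    """Split cleaned text into lowercase word tokens."""
    return TOKEN_RE.findall(text.casefold())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword set."""
    return [token for token in tokens if token not in stopwords]


tokens = tokenize("sistem ini diperlukan tapi belum tersedia")
filtered = remove_stopwords(tokens)
print(filtered)
```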
The final text preprocessing stage is stemming [15]. Stemming strips affixes from words: prefixes, suffixes, infixes, and combinations of prefixes and suffixes [16]. It can also reduce inflected words to their basic form. The stemming process can be done using Sastrawi, a stemmer library for the Indonesian language. This process aims to prevent the computer from interpreting words constructed from the same root word as having different meanings [17]. For instance, stemming the word "kecepatan" produces "cepat".
Once the text data is clean and ready, term weighting converts it into numeric form [18]. We apply the Term Frequency-Inverse Document Frequency (TF-IDF) method in this study. TF-IDF assigns each word a weight that quantitatively measures how strong the relationship between the word and the document is [19]. When a word appears more frequently in a document, its weight increases proportionally. In contrast, the weight decreases if the word appears in many documents [20]. We use the scikit-learn class sklearn.feature_extraction.text.TfidfVectorizer for this purpose.
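To make the weighting behaviour concrete, the following sketch computes TF-IDF by hand on three toy documents. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the L2 normalization that scikit-learn documents as TfidfVectorizer defaults; the function tfidf and the toy documents are hypothetical and are not the study's code or data.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    # ASSUMPTION: smoothed IDF, as in scikit-learn's documented default.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    weighted = []
    for tokens in docs:
        term_freq = Counter(tokens)
        raw = {t: count * idf[t] for t, count in term_freq.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weighted.append({t: w / norm for t, w in raw.items()})  # L2-normalize
    return weighted


# Toy documents, for illustration only.
docs = [
    ["klasifikasi", "teks", "judul"],
    ["klasifikasi", "citra", "wajah"],
    ["klasifikasi", "game", "edukasi"],
]
weights = tfidf(docs)
```

In this toy corpus, "klasifikasi" occurs in every document and therefore receives a lower weight than "teks", which occurs in only one.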
Before the resampling stage, the dataset is distributed unevenly between research groups. Although preprocessing significantly reduces the number of samples in each research group, the distribution remains imbalanced, as seen in Figure 2. An imbalanced dataset can bias the classifier so that it performs best only when predicting the dominant classes [21]. Therefore, we applied a resampling method, the Synthetic Minority Oversampling Technique (SMOTE). SMOTE iteratively generates artificial samples based on neighboring original samples. This phase continues until all classes have the same number of samples, 194 each. This study used Multinomial Logistic Regression (MLR) because there are 13 research group classes. Before modeling, we split the dataset into 70% training and 30% test sets. The training set was then used to train and optimize the MLR via the Grid Search Cross-Validation (GSCV) method. This tuning method aims to find the combination of model parameters that produces the most effective predictions [22]. The GSCV method exhaustively constructs and evaluates the MLR model using all parameter value combinations in Table 2 in a cross-validated setting (we use 10-fold). It thereby gives insight into how different parameter combinations affect classification performance. Then, we refitted the MLR using the parameters that produced the highest classification performance.
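The interpolation idea behind SMOTE, as described in the paragraph above, can be sketched in a few lines. This is a simplified standard-library illustration, not the implementation used in the study: the function smote_like, its parameters, and the toy minority class are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like(samples, target_size, k=5, seed=0):
    """Grow one minority class to `target_size` by interpolating between neighbors.

    Requires at least two original samples.
    """
    rng = random.Random(seed)
    result = [list(s) for s in samples]
    while len(result) < target_size:
        base = rng.choice(samples)
        # k nearest original neighbors of the chosen base sample
        neighbors = sorted(
            (s for s in samples if s is not base),
            key=lambda s: squared_distance(base, s),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # random position on the segment base -> neighbor
        result.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return result


# Toy minority class with three 2-D samples, for illustration only.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
balanced = smote_like(minority, target_size=6, k=2)
```

Every synthetic point lies on a line segment between two original samples of the same class, so the oversampled class stays inside the region spanned by its original samples.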
Since two types of input are relevant to the research group, the title and the abstract, we ran three scenarios of MLR prediction based on 1) the title, 2) the abstract, and 3) a combination of the title and abstract. The goal is to identify which classifier performs best. The GSCV method is applied within each scenario, producing 12 model candidates per scenario. In total, there are 36 candidates for the research group prediction model.
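The candidate counts above follow from enumerating a parameter grid once per scenario. The sketch below reproduces that arithmetic under a loudly stated assumption: Table 2 is not reproduced here, so the grid of three penalty types and four C values is hypothetical (only C=0.1 and C=5 and the penalty names appear in the text), chosen so that each scenario yields 12 candidates.

```python
from itertools import product

PENALTIES = ["none", "l1", "l2"]   # penalty types named in the results discussion
C_VALUES = [0.1, 0.5, 1.0, 5.0]    # ASSUMPTION: hypothetical grid; only 0.1 and 5 are quoted
SCENARIOS = ["title", "abstract", "title+abstract"]

# One candidate per (penalty, C) combination within a scenario.
candidates_per_scenario = [
    {"penalty": penalty, "C": c} for penalty, c in product(PENALTIES, C_VALUES)
]

# The same grid is searched independently in every scenario.
all_candidates = [
    (scenario, params)
    for scenario in SCENARIOS
    for params in candidates_per_scenario
]

print(len(candidates_per_scenario), len(all_candidates))
```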
In the evaluation stage, the best model from each scenario was tested using the 30% test data. The metrics used were accuracy, precision, recall, and f1-score. The goal was to measure how effective the MLR was in terms of classification performance or correctness level [23]. From there, we can choose which MLR is best suited for SISINTA.
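The four metrics can be computed for a multiclass problem as sketched below. The paper does not state how per-class scores were averaged; this sketch assumes support-weighted averaging, which is consistent with the reported recall being equal to the reported accuracy, since weighted recall always equals accuracy. The function weighted_metrics and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1 (ASSUMED averaging)."""
    total = len(y_true)
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / total
    precision = recall = f1 = 0.0
    for label, count in support.items():
        true_pos = sum(t == label and p == label for t, p in pairs)
        predicted = sum(p == label for p in y_pred)
        p_c = true_pos / predicted if predicted else 0.0
        r_c = true_pos / count
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = count / total  # class share of the test set
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, for illustration only.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "B", "C", "C"]
scores = weighted_metrics(y_true, y_pred)
```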

III. Results and Discussion
The retrieved 2164 rows of raw text data were structured into three columns: title, abstract, and research group. Figure 3 shows the raw state of the dataset. Tag removal, case folding, text trimming, and removal of punctuation marks, special characters, double spaces, and numbers are carried out at the cleaning stage. The results of this stage can be seen in Figure 4. The next step is to remove the missing values. There are four rows with missing values in the title column and 896 rows with missing values in the abstract column, as shown in Figure 5. Furthermore, we identified one duplicate in the title column but none in the abstract column. As a result of text preprocessing, the dataset shrinks, but the distribution of research group classes remains imbalanced (see Figure 2).
The tokenization stage separates text into tokens or words [24]. Figure 6 and Figure 7 show examples of the tokenization results in the title and abstract columns. The stopword removal stage removes words or tokens that appear frequently and carry no critical meaning in the text [25]. The results of the stopword removal process in the title and abstract columns can be seen in Figure 8 and Figure 9. The stemming stage removes all affixes from words, such as suffixes, infixes, prefixes, and combinations of prefixes and suffixes [26]. The results of the stemming process in the title and abstract columns can be seen in Figure 10 and Figure 11.
We applied the default configuration of SMOTE when generating synthetic samples (n_neighbors = 5). After resampling with SMOTE, each research group has 194 samples. In total, there are 2522 samples ready for model training.
In the title scenario, the Grid Search Cross-Validation (GSCV) method selected C=0.1 with the 'none' penalty as the best-scoring parameter configuration for the MLR. Figure 13 compares the candidates' performances (dots) across regularization parameter values (x-axis) and penalty types (colored lines). This graph shows that the MLR performs best when the C value is high, regardless of the penalty type. The MLR shown by the green line is suspected of overfitting, because the other MLRs (orange and blue lines) underperform when C is lowest. This means that regularization is essential for the MLR to generalize. From Figure 13, the L2-type regularization (orange line) should be the best, since it performs better than the L1-type even at a low C value. As C increases, the MLR using the L2-type stays above the MLR using the L1-type. Therefore, the MLR in this scenario was refitted using Penalty=L2 with C=5 as the most suitable configuration. In the abstract scenario, the results for the parameter combinations can be seen in Figure 14. Our analysis of this second scenario is similar to the first one; the resulting scores differ only slightly. Based on this graph, the MLR using the abstract as input is refitted with Penalty=L2 and C=5. GSCV results for the third scenario can be seen in Figure 15. Our analysis of this third scenario is similar to the former two; the resulting scores differ only slightly. Based on this graph, the MLR using the title and abstract as input is refitted with Penalty=L2 and C=5. Across the three scenarios using GSCV, the choice of input made no significant difference, and the performances were nearly identical. However, we tested each model using the test data to examine more closely how the three MLR models perform. We measured each scenario's performance metrics; the results can be seen in Table 3.
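For reference when reading the regularization results above, the MLR and its penalized training objective can be written as follows. This is the standard textbook form, given under the assumption of a scikit-learn-style parameterization in which C is the inverse of the regularization strength; the symbols W, b, x_i, y_i, n, and R are notation introduced here and do not appear in the paper.

```latex
% Softmax model over the 13 research group classes
p(y = k \mid x) = \frac{\exp\left(w_k^{\top} x + b_k\right)}
                       {\sum_{j=1}^{13} \exp\left(w_j^{\top} x + b_j\right)}

% Penalized objective (ASSUMED parameterization: C multiplies the data term)
\min_{W,\, b} \; R(W) + C \sum_{i=1}^{n} -\log p(y_i \mid x_i),
\qquad
R(W) =
\begin{cases}
  \tfrac{1}{2} \lVert W \rVert_F^2 & \text{L2 penalty} \\
  \lVert W \rVert_1               & \text{L1 penalty} \\
  0                               & \text{no penalty}
\end{cases}
```

Under this form, a small C makes the penalty term dominate (strong regularization), while a large C lets the data term dominate, so the L1 and L2 models approach the unpenalized model as C grows.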
The evaluation results show that the title scenario is the best scenario. Although its margin over the other two scenarios is not significant, it is more efficient, since the input size for the MLR is much smaller when only the title is used. This is a way to reduce the curse of dimensionality in research group classification, and it requires less computation power. In addition, words (except stopwords) are less likely to be repeated in a title than in an abstract. Hence, we argue that the title is a more concise input for classification.
We also note that the overall metrics are below 70%. We identified the causes: typographical errors (typos) within the title or abstract, words that are run together, and the lack of a validation process to check for these errors. Examples of errors contained in the dataset can be seen in Figure 16. The highlighted words are only a few found in a brief observation, and they are not core or root words that correlate strongly with a research group. However, the classification model will lose some accuracy when a mistyped word is one that contributes to a particular research group. One solution is to apply a policy in SISINTA under which a proposal with typos in the title or abstract is not forwarded to the research group for comments. The check can be done either manually or automatically. Alternatively, additional text preprocessing can identify these typos and decide whether to correct or remove them.
Figure 16. Writing errors in the dataset
In addition, topics overlap considerably between research group classes, for instance between the research groups "Game Technology and Machine Learning" and "Knowledge Engineering and Data Science". Both research groups contain research with the keywords "machine learning", "data mining", "classification", etc. These two research groups share too many terms, and only a few keywords, such as "game" and "text", distinguish them. To overcome the problem of shared words by looking at linked words, we can use n-grams, which decompose a text into chunks of n consecutive units (words, in this case), so that linked words can be parsed. However, using the n-gram feature significantly enlarges the dimension. Hence, more complex algorithms such as Deep Learning may be a better fit for the task.
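The n-gram idea mentioned above, and the growth in dimensionality that comes with it, can be illustrated with word bigrams. This is a minimal sketch on a hypothetical tokenized title; the function word_ngrams is introduced here for illustration and is not part of the study.

```python
def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical tokenized title, for illustration only.
tokens = ["klasifikasi", "teks", "dengan", "machine", "learning"]

unigrams = word_ngrams(tokens, 1)
bigrams = word_ngrams(tokens, 2)

# Combining unigrams and bigrams keeps linked words such as "machine learning"
# as a single feature, at the cost of a larger vocabulary.
vocabulary = set(unigrams) | set(bigrams)
print(len(unigrams), len(vocabulary))
```

Even for this five-word title, adding bigrams nearly doubles the number of features, which shows how quickly the dimension grows on a full corpus.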
Finally, our proposed method is applicable in other departments as long as students' research is stored digitally and organized by research group (in a web-based information system and its database). Based on our findings, a future implementation may only need to structure the data into a title column and a research group column. Additional text preprocessing to identify and replace typos is also essential to ensure the dataset's quality for the learning algorithm. Other learning algorithms may be used depending on the target classes and the size of the dataset. Parameter tuning should be performed using GSCV with more combinations, since the target classes of another dataset will differ from those in our research. The remaining stages of research group recommendation are repeatable as is.
When SISINTA implements research group recommendation based on user input, the initial procedure of a thesis or final project proposal can be done in seconds. This can also help lecturers in the research group provide more elaborate and comprehensive comments on the proposals within their scope of knowledge. Any revisions required for a proposal will then be relevant and constructive, steering the research in the right direction. Overall, this automation can make SISINTA an intelligent information system for educational purposes. The approach is applicable not only in DEEI but also in other departments, as long as suitable platforms and data are available.

IV. Conclusion
This research showed that the Multinomial Logistic Regression (MLR) algorithm can be applied successfully to predict the research group based on text input, either the thesis title or abstract. The stages we followed in the text mining technique were straightforward, and MLR performed adequately in classifying 13 research groups. The best scenario in this study was the MLR with the title as the input variable. Using title data for model training is considered adequate and efficient, because words are rarely repeated within a thesis title, except for stopwords. With performance just above 63% across all metrics, we argue that the MLR model with title text input is the best choice due to its small dimensionality. However, performance remained below 70% because research groups share similar keywords and because the dataset contains typos. These typos either become noise or must be reduced to their core word. Therefore, additional text preprocessing should address these typos.

Declarations
Author contribution