Improving the Accuracy of Text Classification Using a Stemming Method: A Case of Informal Indonesian Conversation

As social beings, humans always interact with one another using either verbal or non-verbal language. Language is an arbitrary sound-symbol system used by members of a community to cooperate, interact, and identify themselves. Indonesian is classified into two categories, namely formal and non-formal. The former meets the grammatical standard prescribed by the linguistic rules of the language, while the latter tends to deviate from it. In daily communication, however, non-formal language is used more intensively because it is more practical and easier to understand. This tendency causes problems in linguistic computation, because most computational tools assume formal language that already has standardized rules. This research aims to develop a dynamic Indonesian closed corpus for airline ticket reservation, named "Incorbiz", to be used as a stemming tool for formal and non-formal Indonesian. Text processing, text normalization, and automatic data updating were proposed in this research. The research also compared two stemming techniques, "Sastrawi" and "Incorbiz", on a 30-sample dataset. The algorithm used for classification is the Support Vector Machine (SVM). The data used to develop "Incorbiz" were taken from conversations between customer service staff and consumers in airline ticket reservations. The results showed that "Incorbiz" achieved higher accuracy than "Sastrawi", at 0.89 versus 0.67.


Introduction
As social beings, humans always interact with one another. These interactions are carried out in verbal or non-verbal language. Language is an arbitrary sound-symbol system used by members of a community to cooperate, interact, and identify themselves [1]. This definition implies that language has a special character that forms the identity of a country and the domain of a dialog topic. In verbal communication, people use sentences composed of words or series of words to express a complete meaning. By manner of use, Indonesian is classified into two categories, namely formal and non-formal. Formal Indonesian is used in formal situations, while non-formal Indonesian is widely used in casual situations such as social media conversations [2]. In formal language, people use "standardized" language as prescribed by the linguistic rules of the language. In casual conversations, social media, and informal discussions, however, people often tend to communicate using non-formal language [3].
In business conversation, likewise, the language mostly used is non-formal. "Bahasa slang" is a term used to refer to non-formal language [4]. Its simplicity makes people tend to use it in their communication: communication in non-formal language is brief, yet easily understood. As mentioned above, non-formal language tends to deviate from grammatical rules. Examples of such deviations are found in suffixation and abbreviated words. In non-formal Indonesian correspondence, words are shortened by omitting the vowels; for example, "saya" (I) is abbreviated as "sy", "tolong" (help) as "tlg", "bayar" (pay) as "byr", and so on. Suffixes that are not part of Indonesian grammar, such as "in", turn formal words into non-formal ones [5], for example "bayarin" (pay), "tolongin" (help), "cetakin" (print), and the like. In Indonesian grammar, "in" belongs to the infixes, which consist of "-er-", "-el-", "-em-", and "-in-" [6]. Besides suffixes and abbreviations, non-formal language is also formed using loanwords, for example "cancel", "booking", "issued", and so on. Other non-formal Indonesian words are deformed from the original word, for example "sendiri" (alone) into "jomblo", "santai" (relax) into "santuy", and "lambat" (slow) into "lambreta".
In linguistic computation, non-formal language causes problems in data pre-processing, which mostly consists of tokenizing, removing, stemming, and normalizing. Among these steps, stemming is a key factor in preparing the data for analysis. Stemming is the process of removing affixes to obtain root words, for example "berlari" into "lari" (run), "menulis" into "tulis" (write), "memakan" into "makan" (eat), and so on [7]. The sample words mentioned above are formal words that pose no problem in stemming. The problem lies in non-formal words, which deviate from standard Indonesian, as in the sentence "sistem lambreta bingit" (the system is very slow). "Lambreta" and "bingit" are not Indonesian words, so they cause problems in the stemming process.
Some research on Indonesian text processing has been done using the existing Indonesian stemmer "Sastrawi". The "Sastrawi" algorithm has also been improved using dynamic affixes to process non-formal Indonesian. However, these studies had limitations on non-formal Indonesian expressions formed as abbreviated and deformed words.
This research is the first step of a three-phase project concerning empathetic chatbots for airline ticket reservations. The stages of the project are 1) Indonesian closed corpus development; 2) emotion and intent classification development; and 3) chatbot development. The purpose of the first step was to develop an Indonesian closed corpus that can be used as a stemmer for formal and non-formal Indonesian. A corpus is also an important part of language synthesis systems, as it affects the quality of synthesized speech [8].

Related Work
Indonesia is a country of about 17,000 islands, 1,300 ethnicities, and 742 languages. This condition requires a unifying means, one of which is the national language, Bahasa Indonesia (the Indonesian language). In the context of computational linguistics, however, resources for Indonesian language processing are lacking, including in the area of Part-of-Speech (POS) tagging [9]. A POS tagger assigns a tag indicating the class of each word, which can then be used in NLP tasks such as text classification, text summarization, information retrieval, and so on [10].
To address this lack of resources for Indonesian language processing, research on an Indonesian POS (Part-of-Speech) tagger was carried out. Kwary developed an educational corpus platform divided into four disciplines according to user groups, i.e. health, life, physics, and social science. This corpus collects about 5 million words obtained from several Indonesian university journals, namely Airlangga, Diponegoro, Lampung, and Udayana, with 10 to 30 articles downloaded per journal. The corpus is claimed to be a freely accessible Indonesian academic corpus, with teachers and experts of the Indonesian language as its target users [11].
Research related to Indonesian language processing also includes topic summarization. Jiwanggi and Adriani conducted research to summarize Twitter data in Indonesian using "The Phrase Reinforcement" algorithm, which summarizes a group of tweets discussing similar topics via a semi-abstractive approach. The data were collected from Twitter over recording windows of about 1 to 4.5 hours and comprised more than 100,000 tweets. The summaries were assessed by humans for readability, grammaticality, and informativeness; 37 undergraduate and graduate students evaluated them. The evaluation showed that more than 60% of the answers were positive on readability, grammaticality, and informativeness [12].
The lack of Indonesian resources in linguistic computing has led researchers to translate Indonesian into English, since more libraries and packages are available for English. Gunawan, Mulyono, and Budiharto researched arithmetic word problems in question answering. Their approach was to translate Indonesian into English and then process it with the Natural Language Toolkit (NLTK), using the Google Translate Application Programming Interface (API) for translation. Although it achieved good answer accuracy, between 80% and 100%, the process was slow because the language translation took up to 1.12 minutes [13].
The main problem in developing a corpus, however, is that text normalization and verification cannot be done automatically by computer. Text normalization was done manually with human assistance, and the normalized text was then verified using the Indonesian online dictionary "Kamus Besar Bahasa Indonesia (KBBI)".
Text normalization of non-formal Indonesian was carried out by Sebastian and Nugraha, who developed a dataset to normalize abbreviated words. Abbreviated words are one of the main problems in text normalization: ambiguity is the main obstacle, so they cannot be processed optimally. Their research used crowdsourcing to build the dataset, since only humans can expand abbreviated words into their normal forms. The result showed an accuracy of 90.85%; however, a problem remained for abbreviated words with more than one meaning, where unique keywords were needed to determine the most accurate meaning [14].
Micro text is a term for communication in the digital era, defined as the expression of an idea in a short text. It applies to short message services, which are limited to a certain number of characters. This limitation makes users abbreviate sentences or words so as not to exceed the character limit. Like the research by Sebastian and Nugraha, research on text normalization was also done by Gunawan, Saniyah, and Hizriadi, using a dictionary approach combined with the Longest Common Subsequence (LCS). The results showed that their approach could solve the abbreviation problem, although it was limited to pre-defined abbreviations and acronyms [15].
The previous research mentioned above shows that data pre-processing is an important factor affecting data quality. In text analysis, text normalization is the part of data pre-processing addressed by previous researchers, and the dictionary approach improved data quality effectively. Out-of-vocabulary words, however, remain the main issue to be solved in the dictionary approach.

Method
Raw data in this research were collected from conversations between customer service staff and consumers in airline ticket reservations through WhatsApp messenger. This method collects the vocabulary widely used in airline ticket reservations, which is important because domain vocabulary is an important element of language learning [16]. The data were exported and converted into Comma-Separated Values (CSV) file format to facilitate processing with Python [17,18,19,20]. The number of sentences collected in this process was 25,586. The step-by-step data pre-processing is shown in Figure 1.
After data conversion, the next process is case folding, which converts the letters in text documents to uppercase or lowercase. In this research, case folding converted all letters to lowercase using the lower() function of Python. The next process was tokenizing, which splits sentences into tokens by detecting spaces. This process produced about 74,477 tokens, which were then verified against Indonesian rules. According to the pre-processing steps in Figure 1, verification was done in a process named "Filtering", which consists of two sub-processes, i.e. stop word removal and deletion.
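The case-folding and tokenizing steps can be sketched in a few lines of Python; the sentence below is a made-up example, not taken from the research dataset.

```python
# Hypothetical input sentence (not from the actual WhatsApp dataset).
sentence = "Tolong BOOKING tiket ke Surabaya"

# Case folding: convert every letter to lowercase with str.lower().
folded = sentence.lower()

# Tokenizing: split the sentence into tokens by whitespace.
tokens = folded.split()

print(tokens)  # ['tolong', 'booking', 'tiket', 'ke', 'surabaya']
```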
Stop word removal eliminates grammatically uninformative Indonesian words such as "di" (in), "pada" (on), "dari" (of), "yang" (that), and so on. Besides these grammatical categories, many tokens are not Indonesian words at all, for example booking codes, passenger names, dates of flight, flight routes, and so on. The result of the filtering process was 1,009 words suitable for the Indonesian dictionary.
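As a rough sketch of the filtering step, the snippet below removes a small illustrative subset of Indonesian stop words and drops tokens containing digits (such as booking codes); the actual research used a full stop word list and manual verification.

```python
# Illustrative stop-word subset; the real list is much larger.
stop_words = {"di", "pada", "dari", "yang"}

def keep(token):
    # Drop stop words and tokens containing digits (e.g. booking codes).
    return token not in stop_words and not any(ch.isdigit() for ch in token)

tokens = ["tiket", "yang", "di", "pesan", "ab12cd"]
filtered = [t for t in tokens if keep(t)]
print(filtered)  # ['tiket', 'pesan']
```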
Stemming is the next process after filtering, obtaining the root word of each token. The stemming method used in developing the corpus was "Sastrawi", with additional manual verification. In information retrieval, stemming has two main functions: the first is to improve the ability of the retrieval system to select the appropriate documents, and the second is to reduce the size of the vocabulary by mapping variants onto their root words [21]. Inserting data into the corpus table was done in parallel once stemming was complete.
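The idea behind stemming can be illustrated with a naive prefix-stripping function; this is only a toy approximation of what a full stemmer such as "Sastrawi" does, and the root list and prefix set here are illustrative assumptions.

```python
# Toy prefix-stripping stemmer; a real stemmer such as "Sastrawi" applies
# full Indonesian morphological rules (prefixes, suffixes, sound changes).
PREFIXES = ("ber", "mem", "men", "me")

def naive_stem(word, roots):
    if word in roots:
        return word
    for prefix in PREFIXES:
        if word.startswith(prefix) and word[len(prefix):] in roots:
            return word[len(prefix):]
    return word  # non-formal words like "lambreta" pass through unchanged

roots = {"lari", "makan", "tulis"}
print(naive_stem("berlari", roots))   # -> lari
print(naive_stem("memakan", roots))   # -> makan
print(naive_stem("lambreta", roots))  # -> lambreta (out of vocabulary)
```

The last call shows exactly the problem described earlier: a rule-based stemmer leaves deformed non-formal words untouched, which is what the corpus lookup is meant to fix.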
The final process in developing the corpus was normalization, which required careful attention and involved three people from the linguistics discipline. Normalization was done using several resources on Indonesian rules related to abbreviations and loanwords. The process collected 5,120 word variants consisting of abbreviations, loanwords, and affixed forms. Because this corpus was relatively small, additional Indonesian word collections were needed: 29,605 words were added from an Indonesian Natural Language Processing (NLP) source, bringing the total corpus to 30,614 words.

Results and Discussions
The main result of this research is a dynamic corpus, the Indonesian Closed Corpus for Business (Incorbiz). The research also tested the stemming results of "Sastrawi" and "Incorbiz" using the Support Vector Machine algorithm to measure their respective accuracy. Sample data from "Incorbiz" are shown in Table 1. The "business domain" field in the corpus table indicates that "Incorbiz" could be used for multiple businesses, although it is currently dedicated to the airline ticket reservation business. "Incorbiz" is also equipped with word variants stored in the "set of word" field; these variants make the stemming process easier through string matching. To anticipate out-of-vocabulary words, "Incorbiz" was designed to update its data either automatically or with human assistance. Human assistance is needed to process new data in the form of abbreviations or loanwords.
As an Indonesian stemmer, "Incorbiz" works by using Structured Query Language (SQL) to find a root word, matching the token string against the stored word variants. There are three possibilities in this method: 1) both the word variant and the root word are found; 2) the word variant is not found but the root word is found; and 3) neither is found. In the second case, "Incorbiz" performs an auto-update; in the third case, human assistance is needed. This research also compared the stemming results of "Sastrawi" and "Incorbiz". Classification used a Support Vector Machine with a linear kernel to measure classification accuracy. Sample sentences are shown in Table 2, and the stemming results of "Sastrawi" and "Incorbiz" are compared in Table 3 and Table 4, respectively. Based on these tables, there are several differences between the stemming results of "Sastrawi" and "Incorbiz", namely in words using the suffix "in", abbreviations, and loanwords. The normalization applied in "Incorbiz" changed loanwords into Indonesian, for example "booking" into "pesan" and "issued" into "cetak". Abbreviations were normalized with human assistance, because only a human can know their meaning. The stemming output of "Sastrawi" and "Incorbiz" forms a model whose accuracy was tested using the Support Vector Machine. Using Python, this research analyzed the level of accuracy with a classification report and a confusion matrix. The data used were 30 sentences with two labels, i.e. book and issued, split into training and testing data at a ratio of 70%:30%. The kernel used in the Support Vector Machine classification was "linear", which can classify text with large feature sets (ref).
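The three-way lookup described above could be sketched with an in-memory SQLite table as follows; the table layout, column names, and sample rows are assumptions for illustration, since the actual "Incorbiz" schema is not reproduced here.

```python
import sqlite3

# Assumed corpus table: each row maps a word variant to its root word.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (variant TEXT PRIMARY KEY, root TEXT)")
conn.executemany(
    "INSERT INTO corpus VALUES (?, ?)",
    [("bayarin", "bayar"), ("tlg", "tolong"), ("booking", "pesan")],
)

def incorbiz_stem(token):
    # Case 1: the variant is found -> return its root word.
    row = conn.execute(
        "SELECT root FROM corpus WHERE variant = ?", (token,)
    ).fetchone()
    if row:
        return row[0]
    # Case 2: the token matches a known root -> auto-update by storing
    # the token itself as a new variant of that root.
    row = conn.execute(
        "SELECT root FROM corpus WHERE root = ?", (token,)
    ).fetchone()
    if row:
        conn.execute("INSERT INTO corpus VALUES (?, ?)", (token, token))
        return row[0]
    # Case 3: neither found -> flag for human assistance.
    return None

print(incorbiz_stem("bayarin"))   # -> bayar   (case 1)
print(incorbiz_stem("tolong"))    # -> tolong  (case 2, auto-update)
print(incorbiz_stem("lambreta"))  # -> None    (case 3, needs a human)
```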
The accuracy levels based on "Sastrawi" and "Incorbiz" stemming are shown in Table 5. The detailed classification reports, consisting of accuracy, precision, and recall, are shown in Figure 2 and Figure 3, and the confusion matrices of the classification results are shown in Figure 4 and Figure 5, respectively. Table 5 shows a significant difference in accuracy between stemming with "Sastrawi" and with "Incorbiz", indicating that "Incorbiz" was better than "Sastrawi" at normalizing words, especially loanwords and abbreviations. The accuracy increase of 0.22 shows that data pre-processing and normalization have an impact on accuracy [22]. To examine the accuracy in detail, the confusion matrix can be used to recalculate it from its components.

The components of the confusion matrix are True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). A True Positive is an actual positive predicted as positive, and a True Negative is the equivalent for negatives; a False Positive is an actual negative incorrectly predicted as positive, and a False Negative is the reverse [23]. Accuracy is related to the error rate of the classifier model: accuracy is the ratio of True Positives and True Negatives to the total number of cases, while the error rate is the ratio of False Positives and False Negatives to the total. The formulas are defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

Error rate = 1 - Accuracy    (2)

As mentioned, the dataset was split into training and testing data at a ratio of 70%:30%. The dataset has 30 rows, so the number of test sentences is 9. The confusion matrix in Figure 4, showing the result with "Sastrawi" stemming, has 6 True Positives and True Negatives combined, hence 3 errors, giving an error rate of 3/9 = 0.33; this matches Equation (2), 1 - 0.67 = 0.33. In contrast, Figure 5 shows that stemming with "Incorbiz" increased accuracy, with 8 True Positives and True Negatives combined, so the error rate is 1/9 = 0.11, or 1 - 0.89 = 0.11 by Equation (2). The small error rate shows that the level of accuracy is high.
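The reported figures can be reproduced from Equations (1) and (2); since only the combined TP + TN counts are given, the sketch below works with the number of correct predictions out of the total test sentences.

```python
def accuracy(correct, total):
    # Equation (1): (TP + TN) / (TP + TN + FP + FN),
    # where correct = TP + TN and total = all test cases.
    return correct / total

def error_rate(correct, total):
    # Equation (2): error rate = 1 - accuracy.
    return 1 - accuracy(correct, total)

# "Sastrawi": 6 of 9 test sentences correct; "Incorbiz": 8 of 9.
print(round(accuracy(6, 9), 2), round(error_rate(6, 9), 2))  # 0.67 0.33
print(round(accuracy(8, 9), 2), round(error_rate(8, 9), 2))  # 0.89 0.11
```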

Conclusions
This research aimed to develop a dynamic Indonesian closed corpus named "Incorbiz", which can be used as a stemmer for formal and non-formal Indonesian. The "Incorbiz" collection contains 30,614 root words; however, the number of word variants is still limited to 5,120. This research also compared two stemming techniques, "Sastrawi" and "Incorbiz", on a 30-sample dataset. The result showed that stemming with "Incorbiz" achieved higher accuracy than "Sastrawi", at 0.89 versus 0.67. Future work will address how to automatically add word variants, including loanwords and abbreviations, based on their root words; new algorithms and techniques will be needed for this.

Abbreviations
Incorbiz: Indonesian Closed Corpus for Business; SVM: Support Vector Machine; NLP: Natural Language Processing.