NAÏVE BAYES FOR THESIS LABELING

Thesis preparation in the Department of Informatics, Universitas Ahmad Dahlan, is divided into two areas of interest, namely Intelligent Systems and Software and Data Engineering. Existing thesis title data is only used as an archive and has never been processed or classified to determine the trend of thesis topics based on student interest each year. The stages of this study are data collection, division of the data into two parts (training data and test data), manual labeling of the training data, text preprocessing, and classification using Naive Bayes. The results show that the thesis title trend from 2013 to 2018 is in the field of Intelligent Systems and Software. Testing with a Confusion Matrix and K-Fold Cross Validation with k = 10 gives an accuracy of 94.60%, a precision of 97.30%, and a recall of 85.70%.


INTRODUCTION
Education is the wise, hopeful and respectful cultivation of learning, undertaken in the belief that all should have the chance to share in life. Education can be understood as learning that can take place anywhere. Every human being must experience education, and its role is very important today. Education is also useful for building character from an early age. Basic education in Indonesia takes 12 years to complete, covering Primary School, Junior High School and Senior High School [1].
Education taken after high school includes diploma, undergraduate, professional, master's, doctoral and specialist programs. Such education is organized by higher education institutions. According to data from the Ministry of Education and Culture, in 2017 there were 4,504 university units in Indonesia. Of the tertiary institutions registered with the Ministry, 3,136 are private universities and 122 are state universities; the rest are universities under ministries or state institutions with service systems and religious colleges.
Universitas Ahmad Dahlan is one of the private colleges in Indonesia. It was founded on November 18, 1960. Universitas Ahmad Dahlan develops fields of expertise or study programs in technology as well as in the social sciences and humanities. It has 11 faculties and 36 undergraduate programs. One of the steps that students must complete to obtain a bachelor's degree is writing a research report or thesis as a final project. The thesis is a scientific paper written by students to fulfill some of the requirements for completing a bachelor's degree. The scientific work can take the form of a research report, such as library research, development research, field research, or laboratory research.
Thesis preparation in the Department of Informatics at Universitas Ahmad Dahlan is divided into two areas of interest, namely Intelligent Systems and Software and Data Engineering. The subjects in each area of interest also differ: some elective courses belong to the Intelligent Systems family, and some belong to the Software and Data Engineering family. Based on Table 1 (Academic Interest), each area of interest has subspecializations consisting of different elective courses. Mapping research titles to research topics is still done manually in Microsoft Excel, so it is not very precise. It is therefore necessary to group (classify) research titles automatically using text mining. The trend of topics in the theses of Department of Informatics students at Universitas Ahmad Dahlan can then be seen for every year. Furthermore, the results can be used as material for curriculum evaluation in the Informatics Engineering Study Program.
The rules on the curriculum are contained in Law No. 12 of 2012, Article 35, paragraph 2, which states that the Higher Education Curriculum is developed by each College with reference to the National Standards of Higher Education for each study program, and includes the development of intelligence, character, and skills [2].
The curriculum is a set of plans and rules related to the achievement of learning targets, study materials, the learning process and the values used to guide the implementation of study programs. Curriculum arrangement and planning consist of several stages, including the needs analysis stage, the development stage, the implementation stage, the evaluation stage and the follow-up stage carried out by the study program [3]. Curriculum enhancement is carried out to produce graduates who meet the target learning outcomes set by the study program.
Curriculum development and institutional-scale research roadmap planning require knowledge of the trend of student thesis topics at both the university level and the study program level. According to interviews with the Coordinator of Research Methodology and Thesis of the Department of Informatics, the collected data has never been processed or classified to determine the trend of thesis topics based on the area of interest that students take each year. Classification is a categorization process carried out on a set of documents, and it is very important for making documents easy to search. Classifying documents by their titles is a form of text classification, so it can be done using the text mining method [4]. Text mining is an important stage in the analysis of Big Data that is unstructured, such as large amounts of text data (Xiang, Schwartz, Gerdes Jr, & Uysal, 2015). Text mining is the part of data mining that is used to analyze and process text data that is semi-structured or unstructured, in contrast to data mining in general, which is used to analyze data that is categorical, continuous, or ordinal [6]. Several algorithms and techniques are commonly used in text mining, including the C4.5 algorithm, the Naive Bayes algorithm, Cosine Similarity, TF-IDF weighting, Support Vector Machine (SVM), and K-Nearest Neighbor (KNN).
Research related to text mining has been done before. One study used TF-IDF weighting to build a web application for information retrieval on the site detik.com [7]. Another found that combining the Naive Bayes classification algorithm with the Chi-Square method in text classification gives excellent performance [8].
Another study used the Naive Bayes algorithm to analyze product reviews based on specific aspects of the product. Such reviews can number in the thousands and contain varied opinions. The research has three phases: data processing using POS, feature selection using Chi-square, and classification using Naïve Bayes [9].
Based on this existing research and on the problems in the Universitas Ahmad Dahlan Informatics Engineering study program related to thesis titles and areas of interest, this study applies the Naive Bayes algorithm to classify the thesis titles of Informatics Engineering students and produce a trend of thesis titles for each year, so that the results are expected to be more accurate.

A. Text Mining
Text mining was developed in the 1980s and has become more effective with the increasing use of computers. It can find relationships and patterns that are not otherwise visible, for example measuring a person's level of happiness from their writing on Twitter [10].
Text mining is a form of text analysis carried out automatically by a computer. Its aim is to extract useful information from a collection of documents. The method works by finding the words that represent the contents of each document, then analyzing the relationships between documents using statistical calculations to determine groupings, classifications and association patterns [11].
The steps taken at the text mining stage are data cleaning and text preprocessing. Data cleaning aims to eliminate noise in the data or documents and is useful for filtering out invalid data [12]. The stages of text preprocessing are shown in Figure 1 and are as follows:

a. Tokenizing
The tokenizing process is done by cutting sentences into constituent words. In the tokenizing process, punctuation is also removed [13].

b. Filtering
The filtering process removes unnecessary words from the output of the tokenizing process. This stage uses a stopword removal algorithm. Stopwords are words that carry little or no meaning on their own [13].

c. Stemming
The stemming process looks for the root of each word that results from filtering. At this stage, the various forms of a word are reduced to the same representation. Stemming can also be understood as the process of removing affixes [14].

d. Tagging
The tagging process finds the initial root form of each word and then categorizes each word in the text by its grammatical function, such as noun, pronoun, adjective, verb, adverb, preposition, determiner, or conjunction. Part-of-speech tagging is important because several Natural Language Processing tasks, such as sentiment analysis, question answering, and word sense disambiguation, need this differentiation to overcome word ambiguity [15].

e. Analyzing
The analyzing process determines how closely the words in the existing documents are related. This stage uses the Term Frequency (TF) calculation and the Naive Bayes Algorithm [16].
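The first two preprocessing stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list `STOPWORDS`, the function names, and the example title are hypothetical and do not come from the study. Stemming and tagging are left out because they depend on a language-specific stemmer and tagger.

```python
import re

# Hypothetical stopword list, for illustration only; a real system would
# load a full list for the language of the thesis titles.
STOPWORDS = {"a", "an", "the", "of", "for", "in", "on", "and", "using", "with"}


def tokenize(title):
    """Tokenizing: lowercase the title and cut it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", title.lower())


def filter_tokens(tokens, stopwords=STOPWORDS):
    """Filtering: remove the tokens that appear in the stopword list."""
    return [token for token in tokens if token not in stopwords]


def preprocess(title):
    """Run tokenizing followed by filtering (stemming is intentionally omitted)."""
    return filter_tokens(tokenize(title))


example = "Classification of Thesis Titles Using the Naive Bayes Algorithm."
print(preprocess(example))
```

The output of `preprocess` is a list of tokens that can be passed on to the weighting stage.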

B. Term Frequency
The Term Frequency method is a way to give weight to the relationship of a word (term) to a document. The Term Frequency (TF) value is obtained based on the number of occurrences of a word in a particular document. For example, if a word appears 3 times in a document, then the TF value is three [7].
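As a small illustration of this weighting, the sketch below counts word occurrences with Python's `collections.Counter`. The token list is a made-up example, and `term_frequency` is a hypothetical helper name.

```python
from collections import Counter


def term_frequency(tokens):
    """Return a mapping from each word to its number of occurrences in one document."""
    return Counter(tokens)


# Made-up token list in which the word "system" appears three times.
tokens = ["system", "expert", "system", "diagnosis", "system"]
tf = term_frequency(tokens)
print(tf["system"])  # 3
```

A word that does not appear in the document simply has a TF value of 0.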

C. Naïve Bayes Algorithm
Naive Bayes is a classification algorithm based on probability and statistical calculations, put forward by the British scientist Thomas Bayes. The method predicts future probabilities based on previous experience [17]. An advantage of Naive Bayes is that it needs only a small amount of training data to estimate the parameters required for classification [18]. In this algorithm, each document is represented by the attributes a1, a2, a3, ..., an, where a1 is the first word, a2 is the second word and so on, while V is the set of categories. During classification, the algorithm looks for the category with the highest probability among all categories for the document being tested (Vmap). The Vmap equation is as follows:

$$V_{MAP} = \underset{v_j \in V}{\arg\max}\; P(v_j) \prod_{i} P(a_i \mid v_j) \qquad (1)$$

Vmap is the category (target function value) for which the probability calculated by Naive Bayes is highest. The values P(vj) and P(ai|vj) are based on the frequency with which words occur. The set of these probability values corresponds to the hypothesis to be learned, which is then used to classify new data. P(vj) is calculated during training with the following formula:

$$P(v_j) = \frac{|\text{doc}\ j|}{|\text{training}|} \qquad (2)$$

where |doc j| is the number of documents (thesis titles) in category j in the training data, and |training| is the number of documents (thesis titles) in the sample used for training. The word probability P(ai|vj) of each word ai for each category is also calculated during training.

$$P(a_i \mid v_j) = \frac{n_i + 1}{n + |\text{vocabulary}|} \qquad (3)$$

where ni is the number of occurrences of the word ai in the documents of category vj, n is the total number of words in the documents of category vj, and |vocabulary| is the number of distinct words in the training examples.
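The training and classification steps can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the study. It assumes the prior P(vj) is the share of training documents in category vj and the word probability is the add-one smoothed estimate (ni + 1) / (n + |vocabulary|). Two choices are assumptions of this sketch: it sums logarithms instead of multiplying probabilities (which selects the same category while avoiding numerical underflow), and it skips words that never appeared in training. The toy training titles and the labels "IS" and "SDE" are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """docs is a list of (tokens, label) pairs; returns priors, word counts, vocabulary."""
    doc_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocabulary = {word for tokens, _ in docs for word in tokens}
    # P(vj) = |doc j| / |training|
    priors = {label: count / len(docs) for label, count in doc_counts.items()}
    return priors, word_counts, vocabulary


def word_prob(word, label, word_counts, vocabulary):
    """P(ai|vj) = (ni + 1) / (n + |vocabulary|)."""
    n_i = word_counts[label][word]
    n = sum(word_counts[label].values())
    return (n_i + 1) / (n + len(vocabulary))


def classify(tokens, priors, word_counts, vocabulary):
    """Return the category with the highest log-probability score."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for word in tokens:
            if word in vocabulary:  # assumption: words unseen in training are ignored
                score += math.log(word_prob(word, label, word_counts, vocabulary))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented, already preprocessed training titles.
training = [
    (["neural", "network", "prediction"], "IS"),
    (["expert", "system", "diagnosis"], "IS"),
    (["web", "information", "system"], "SDE"),
    (["mobile", "application", "database"], "SDE"),
]
priors, word_counts, vocabulary = train(training)
print(classify(["neural", "diagnosis"], priors, word_counts, vocabulary))
print(classify(["web", "database"], priors, word_counts, vocabulary))
```

With this toy data the first title is assigned to "IS" and the second to "SDE", because each contains only words seen in that category.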

D. Confusion Matrix
Among the mechanisms that can be used to measure the validity of classification results, the ones most often used are the accuracy, precision and recall values [19]. Based on Figure 2 it can be explained that:
a. TP, namely True Positive, is the number of positive data classified correctly by the system.
b. TN, namely True Negative, is the number of negative data classified correctly by the system.
c. FN, namely False Negative, is the number of positive data classified incorrectly by the system as negative.
d. FP, namely False Positive, is the number of negative data classified incorrectly by the system as positive.
Accuracy is the proportion of correct predictions. It is calculated with the following equation:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \qquad (4)$$

Precision (P) measures how many of the documents found are relevant. It is calculated with the following equation:

$$P = \frac{TP}{TP + FP} \qquad (5)$$

Meanwhile, recall (R) measures how many of the relevant documents are recovered, as shown in the following equation:

$$R = \frac{TP}{TP + FN} \qquad (6)$$
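The three measures can be computed directly from the four confusion matrix counts. The sketch below assumes the standard definitions: accuracy as (TP + TN) over all data, precision as TP / (TP + FP), and recall as TP / (TP + FN). The counts are hypothetical and are not the results of this study.

```python
def accuracy(tp, tn, fp, fn):
    """Percentage of all data that the system classified correctly."""
    return 100 * (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Fraction of data predicted as positive that is actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actually positive data that the system found."""
    return tp / (tp + fn)


# Hypothetical counts, for illustration only.
tp, tn, fp, fn = 45, 45, 5, 5
print(accuracy(tp, tn, fp, fn), precision(tp, fp), recall(tp, fn))
```

Precision and recall are returned here as fractions between 0 and 1; multiplying by 100 gives percentages.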

E. K-Fold Cross Validation
K-fold cross validation is a method used to determine the average success rate of a system by repeating the test over randomly arranged data. This method can be used when the amount of data is limited. The test aims to determine the accuracy of the Naive Bayes method applied to the classification of thesis titles when it is tested with different training data and testing data [19].
Cross validation is similar to the repeated random subsampling method, but the sampling is done in such a way that no two test sets overlap. Figure 3 (K-fold Cross Validation) shows that the k-fold cross validation process divides the data randomly into k disjoint subsets; in each iteration the model is trained on the training set and then applied to the validation set [21].
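The splitting procedure can be sketched without any library as follows. The function names, the fixed `seed`, and the `evaluate` callback are hypothetical; the sketch only shows how k non-overlapping test sets are formed and how a score is averaged over the folds.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random division of the data
    folds = [indices[i::k] for i in range(k)]  # k subsets that do not overlap
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def mean_score(n_samples, k, evaluate, seed=0):
    """Average the score returned by evaluate(train, test) over all k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


# With 1290 items and k = 10, each fold tests on 129 items and trains on 1161.
splits = list(k_fold_indices(1290, 10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

In practice `evaluate` would train the classifier on the training indices and return a measure such as accuracy on the test indices.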

RESULT AND DISCUSSION
The data used in this research are 1290 thesis titles from the Universitas Ahmad Dahlan Informatics Engineering study program from 2013 to 2018, which are classified into two areas of interest, namely Intelligent Systems (IS/SC) and Software and Data Engineering (SDE/Relata). In the first stage, the data are divided into training data and testing data: 500 titles are labeled manually and used as training data, and the remaining 790 titles are used as test data, predicted periodically year by year. After the prediction is complete, the 790 titles are added to the training data.
The test data is then preprocessed, starting with the tokenizing stage, which cuts each sentence into words, followed by the filtering stage, which removes words that are considered unimportant. Next, the stemming process converts each word into its root word. Weighting is then carried out using term frequency, and once the weights are obtained, classification is carried out using the Naive Bayes Algorithm.
The classification results are then tested using the Confusion Matrix and K-Fold Cross Validation by dividing the 1290 thesis titles into 10 parts (k = 10). The test is used to determine the accuracy, precision and recall in this study. Accuracy is calculated with equation (4) from the results of each iteration, obtained from the implementation in Jupyter Notebook using the Scikit-learn library, and then averaged over the 10 iterations. Precision is calculated with equation (5). Based on the test of the thesis title classification system using 1290 data and 10 iterations with the Confusion Matrix and K-Fold Cross Validation, the accuracy value is 94.65%, the precision value is 97.33%, and the recall value is 85.73%.

CONCLUSION
Based on the results of this research, it can be concluded that a thesis title classification system has been created that provides useful information on the number of students who take thesis titles in each field of interest each year. The system can help the thesis coordinator classify thesis titles automatically, and its results can be used as material for evaluating the study program with respect to the fields of interest in the Department of Informatics, Universitas Ahmad Dahlan.
The trend of taking the thesis from 2013 to 2018 in Department of Informatics Universitas Ahmad Dahlan in the field of relative interest. Evaluation with K-Fold Cross Validation using k = 10 on 1290 data gives an accuracy of 94.65%, a precision of 97.33% and a recall of 85.73%. The Naive Bayes method in this study therefore classifies well, with high accuracy, and is suitable for predicting the class of thesis titles every year.
This research still has some shortcomings, and further development is expected. One suggestion for further development concerns the upload process, which is still done manually using files with the xlsx extension. It is hoped that future development will retrieve the data every semester from the existing thesis title database.
Word weighting still uses Term Frequency (TF), which only counts how often words occur. Further research is needed on applying TF-IDF in the thesis title classification system. Further research should also consider classifying the thesis titles of the Universitas Ahmad Dahlan Informatics Engineering study program based on the scientific sub-group.