A Data-driven Approach to the Automatic Classification of Korean Poetry

Automatic classification of text is an increasingly important area of research, with applications in virtual assistants and recommender systems. Among the different types of literary works, poems are among the most difficult to classify automatically because of their prolific use of metaphor and their short length. In this research, we propose a data-driven approach to automatically classify Korean poems. We use three different methods for finding keywords with which to train the classifiers. Our results show that the proposed approach can produce better classification accuracy than using a predefined list of keywords created by a human expert.

Index Terms — automatic classification, data-driven, poem, text mining, Korean text, keyword extraction


I. INTRODUCTION
Classifying text by topic is useful for finding the information that a reader wants. For example, readers will be able to find poems that they want to read easily if the poems are classified by topic. However, poetry is a unique literary genre in which the subject is described in an abstract and indirect manner. Therefore, it is difficult to figure out the topic of a poem without reading the entire poem. If a poem can be automatically classified by its topic, this will help to organize a database of poetry systematically and improve its accessibility to readers.
In this study, we explore the possibility of automatically classifying Korean poems by topic through keywords extracted from a database of poems. Classifying Korean text is a much more challenging task than classifying English text because the Korean grammar system allows many possible forms of word endings in a sentence. We approach this problem by using a morpheme-analyzing tool to separate sentences into keywords. By analyzing the occurrences of these keywords in the poems, we are able to classify Korean poems into categories based on their topics.
The rest of this paper is organized as follows. Section II describes related work on the classification of English text and provides a brief description of the basic Korean grammar system. Section III describes our data-driven approach to generating and selecting the keywords used for classification. Section IV presents the automatic classification results of our approach, and Section V gives the conclusion and future work.

II. RELATED WORK

A. Classification of English Text
Various techniques and tools have been developed to classify English text [1,2]. To analyze English text, the text has to be separated into paragraphs and sentences [3]. A large corpus is usually required by these methods, and such corpus-based text analysis methods have been used to analyze blogs [4,5], songs [4], presidential speeches [4], and the reactions of users to brands [6]. However, such methods are difficult to apply to poems because of the much smaller corpus size.

B. Classification of Korean Text
Unlike English text, Korean text should be separated into morphemes, not words. Morphemes are the smallest units of words that carry meaning. Generally, one independent word, such as a single noun or verb, is a morpheme. Whereas English text is mostly separated into morphemes by spaces, Korean has a unit of words called an 'eojeol', which is delimited by spaces, and each eojeol is separated again into morphemes. For example, "나는" is the Korean translation of "I am", where '나' means 'I' and '는' corresponds to 'am'. "I am" has two morphemes, 'I' and 'am', separated by a space; "나는" also has two morphemes, but there is no space between them. Therefore, Korean text should be analyzed by morpheme rather than by space (as opposed to English). In this study, we used the Korean morpheme analyzer tool 'KoNLPy' [7] to separate sentences into morphemes.
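The eojeol-versus-morpheme distinction can be sketched in a few lines of Python. Note that KoNLPy's real analyzers (e.g. `konlpy.tag.Okt`) require a Java runtime; the tiny lookup table below is a hand-made stand-in used purely for illustration, not actual KoNLPy output.

```python
# Illustration of the eojeol/morpheme distinction described above.
# Eojeol are the space-delimited units of a Korean sentence; each eojeol
# splits again into morphemes. A real system would call KoNLPy
# (e.g. konlpy.tag.Okt().morphs(...)); the lookup table here is a toy
# stand-in for illustration only.

toy_morpheme_table = {
    "나는": ["나", "는"],    # "I" + topic particle
    "간다": ["가", "ㄴ다"],  # "go" + sentence ending
}

def split_eojeol(sentence):
    """Eojeol are simply obtained by splitting on whitespace."""
    return sentence.split()

def split_morphemes(sentence, table):
    """Split each eojeol again into morphemes using the toy table."""
    morphemes = []
    for eojeol in split_eojeol(sentence):
        morphemes.extend(table.get(eojeol, [eojeol]))
    return morphemes

print(split_eojeol("나는 간다"))                         # 2 eojeol
print(split_morphemes("나는 간다", toy_morpheme_table))  # 4 morphemes
```

Note that the eojeol split is trivial, while the morpheme split requires linguistic knowledge; this is exactly the gap a morpheme analyzer such as KoNLPy fills.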

III. METHODOLOGY
In this section, we describe three approaches to extracting the morphemes, or keywords, that we can use to build our classifier. The poems that we use come from two books of collected poems by Kim Yongtaek, "Maybe the stars take away your sadness" vol. 1 and vol. 2 [8,9]. These poems were organized into four categories, namely, "Nature", "Abstraction", "Artificial", and "Human".
The three approaches for finding keywords are: selected keywords, most frequent keywords, and 'should not appear' keywords (SNA keywords). These keywords are all noun, verb, and adjective morphemes. We take only these three word classes because they carry the meaning of a sentence.

A. Approach 1 - Selected Keywords by Human User
In this first approach, we build a list of keywords that are the most common for poems in each category. For example, in the "Nature" category, we would expect to find words like "나무" (which means "tree") or "눈" (which means "snow"). We then use these keywords to build a classifier that classifies poems into categories based on the frequency of occurrence of these selected keywords.
Selected keywords are commonly used morphemes that are the most representative words for each category. We generated these lists of selected keywords ourselves, choosing 40 keywords for each of the four categories. Table I shows a sample of 10 of the selected keywords in each category.
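A minimal sketch of such a keyword-count classifier follows; the keyword lists and the poem are tiny illustrative stand-ins for the actual 40-keyword lists of Table I, and the function name is our own.

```python
# Sketch: classify a poem (given as a list of morphemes) by counting
# occurrences of each category's selected keywords and choosing the
# category with the highest count. The keyword lists below are toy
# stand-ins for the 40-keyword lists described in the text.

selected_keywords = {
    "Nature": ["나무", "눈", "강"],  # tree, snow, river
    "Human":  ["어머니", "친구"],    # mother, friend
}

def classify_by_keywords(morphemes, keyword_lists):
    scores = {category: sum(morphemes.count(kw) for kw in keywords)
              for category, keywords in keyword_lists.items()}
    return max(scores, key=scores.get)

poem = ["나무", "눈", "나무", "친구"]  # morphemes of a toy "poem"
print(classify_by_keywords(poem, selected_keywords))  # → Nature
```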

B. Approach 2 - Most Frequent Keywords
Our second approach attempts to generate the most significant keywords purely from data; hence it is a data-driven approach. We examine the frequency of occurrence of morphemes in a set of poems called the training set, and select the most frequent morphemes as our keywords.
To perform 10-fold cross-validation, we split all poems into ten sets and take nine of them as the training set. This training set is used to build a list of the most frequent keywords for each category, which in turn is used to build a classifier that is then tested on the remaining set. We repeat the process by taking a different nine sets to form the training set, leaving one set out each time. The process is shown in Fig. 1.
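The fold rotation described above can be sketched as follows in plain Python. The paper does not specify how poems are assigned to the ten sets, so the interleaved assignment here is an assumption for illustration.

```python
# Sketch of 10-fold cross-validation: partition the items into ten
# folds, then rotate, using nine folds for training and one for testing.
# The interleaved fold assignment below is an assumption; any disjoint
# partition works the same way.

def ten_fold_indices(n_items, n_folds=10):
    """Yield (train_indices, test_indices), leaving one fold out each time."""
    folds = [list(range(i, n_items, n_folds)) for i in range(n_folds)]
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

# With 154 poems, each poem lands in exactly one test fold:
splits = list(ten_fold_indices(154))
print(len(splits))  # → 10
```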

C. Approach 3 - Should Not Appear (SNA) Keywords
Our last approach augments our list of keywords with a list of "should not appear" (SNA) keywords. The SNA keywords are morphemes that should not appear in a certain category. The appearance of SNA keywords in a poem means that the probability of the poem belonging to that category is low.
To find the list of SNA keywords, we use the same method as for the most frequent keywords to find the frequency of occurrence of each morpheme. We then choose morphemes that do not appear in a category but appear in other categories. The weights for SNA keywords are calculated from the counts of the morphemes appearing in the other categories and are given negative values.
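The SNA selection just described can be sketched as below. The counts are toy values, and since the exact normalization of the negative weights is not fully specified in the text, simply negating the raw count is an illustrative assumption.

```python
# Sketch: find SNA keywords for one category. A morpheme qualifies if it
# never occurs in the target category but does occur elsewhere; its
# weight is the (negated) count of its appearances in other categories.
# The counts below are toy values, and negating the raw count is an
# illustrative choice, not the paper's exact formula.

from collections import Counter

def sna_keywords(target_counts, other_counts):
    """Return {morpheme: negative weight} for morphemes absent from the target."""
    return {m: -count for m, count in other_counts.items()
            if target_counts[m] == 0}

nature_counts = Counter({"나무": 5, "강": 3})                # tree, river
other_counts = Counter({"나무": 2, "도토리": 4, "벌레": 1})  # tree, acorn, bug
print(sna_keywords(nature_counts, other_counts))  # acorn and bug qualify
```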

D. Training Classifiers
After we have obtained the lists of keywords, we feed them to the classifiers for training. We use two of the most popular classifiers, decision trees and SVMs, in our experiments. The input data for training these classifiers consist of four floating-point numbers, one per category, each being the sum over that category's keywords of the product of a keyword's frequency of appearance and its weight. To study how the accuracy of the results depends on the parameters of the classifier (such as the maximum depth and minimum number of leaf samples for decision trees), we repeat our experiments with various values of the classifier parameters.
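The feature construction can be sketched as follows; the keyword weights and the poem are toy stand-ins (assumptions), and the category order is fixed alphabetically only so that the vector layout is deterministic.

```python
# Sketch: build the 4-number feature vector fed to the classifiers.
# Each entry is, for one category, the sum over that category's keywords
# of (keyword frequency in the poem) * (keyword weight). The weights
# below are toy values; in the paper they come from the training set.

def poem_features(morphemes, weights_by_category):
    features = []
    for category in sorted(weights_by_category):  # fixed, deterministic order
        weights = weights_by_category[category]
        score = sum(morphemes.count(kw) * w for kw, w in weights.items())
        features.append(score)
    return features

weights_by_category = {
    "Nature":      {"나무": 0.5, "강": 0.3},
    "Human":       {"어머니": 0.6},
    "Abstraction": {"사랑": 0.4},
    "Artificial":  {"기차": 0.7},
}
poem = ["나무", "나무", "사랑"]
print(poem_features(poem, weights_by_category))  # → [0.4, 0.0, 0.0, 1.0]
```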

IV. EXPERIMENTAL RESULTS

A. Dataset
From two books of collected poems by Kim Yongtaek, "Maybe the stars take away your sadness" vol. 1 and vol. 2, we obtained 154 poems. These poems are divided into four categories: "Nature", "Abstraction", "Artificial", and "Human". Table II shows the distribution of poems in each category.

B. Implementation
We used a PC running Windows 10 on an Intel Core i5 processor with 4 GB of RAM for the experiments. All our programs are written in Python and developed in IDLE.

C. Keywords
Table III shows the top ten most frequent keywords and their weights extracted from the "Nature" category in approach 2. We noticed that in this category, the top two keywords are "하" (which means "to do") and "가" (which means "to go"), which are also in the top ten most frequent keywords of the other three categories, with similar weights. The reason is that "하" and "가" are common action words that would appear in any poem. However, this is not a problem, as these words affect the weights in each category in a similar way and are thus ineffective in influencing the classification result. Table IV shows the top ten SNA keywords, also for the "Nature" category. We observe that some of these keywords, such as "도토리" (which means "acorn") and "벌레" (which means "bug"), could actually appear in other poems in the "Nature" category. They are excluded in this case only because they did not appear in our training dataset. This effect is mitigated by the fact that the appearance of these SNA keywords reduces the overall weight only by a small amount, instead of forcing the poem into a different category. We build classifiers using the Scikit-learn toolkit [10]. We use two of the most popular classifiers, decision trees and SVMs (with linear, RBF, and polynomial kernels), in our experiments.
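As a sketch of the Scikit-learn side, the following trains a depth-limited decision tree on 4-dimensional feature vectors. The random features and labels are stand-ins for the real poem data, and the parameter values are illustrative.

```python
# Sketch: training a decision tree on the 4-dimensional feature vectors
# with scikit-learn. X and y are random stand-ins for the real poem
# feature vectors and topic labels; the parameter values are
# illustrative, not the paper's tuned values.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(154, 4))     # 154 poems, one score per category
y = rng.integers(0, 4, size=154)  # four topic labels, encoded 0..3

clf = DecisionTreeClassifier(max_depth=10, min_samples_leaf=15,
                             random_state=0)
clf.fit(X, y)
predictions = clf.predict(X)
print(predictions.shape)  # → (154,)
```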

D. Classification
Figs. 2, 3, and 4 show the decision trees obtained with each of the three approaches.

E. Classification Accuracy
Tables V and VI show the accuracy of the decision tree classifier used in our experiments. By repeating the experiments with different values of the classifier parameters, we found that for the decision tree the optimum values are 10 for the maximum depth and 15 for the minimum number of leaf samples. From Table VIII, we can see that for the SVM classifiers, the approach using the most frequent keywords achieves a higher accuracy than the approach using selected keywords, and the approach using SNA keywords achieves a higher accuracy still. However, for decision trees, the accuracy of the SNA-keyword approach is actually worse. We examine the confusion matrices of the decision trees to determine why.
Tables IX, X, and XI show the confusion matrices of the decision trees for the three approaches. From these tables, we can see that the classification accuracy for poems in the "Artificial" category is extremely poor. This may be due to the small number of poems in the "Artificial" category, which makes it harder to find meaningful most frequent keywords and should-not-appear keywords.
Also, in Tables IX, X, and XI, for the "Nature" category, we observe that the number of correct classifications decreases from the approach using selected keywords to the approach using should-not-appear keywords. As shown in Tables III and IV, the most frequent keywords list seems reasonable, but the SNA keywords list includes some words that are relevant to nature, such as acorn, bug, and river. This may be the reason for the low accuracy. Such invalid keywords may appear because of the small number of poems.

V. CONCLUSION AND FUTURE WORK
In this research, we tried three different approaches for finding keywords and weights for the classification of Korean poetry. With SVMs, the data-driven approaches achieve better accuracy than using user-selected keywords. However, with decision trees, the selected-keywords approach achieves better accuracy than the data-driven approaches. This may be due to a lack of data in the training set and to the number of keywords we chose (only the top 40 keywords per category). Thus, in future work, we will try to use all words from the poems rather than just the top 40. We also need more poems to train the classifiers, and we will try other classifiers, such as Bayesian networks, which handle datasets with uncertainty more effectively.

Figure 1. Finding the most frequent keywords

The most frequent keywords are morphemes that appear frequently in poems of the corresponding category. As different keywords appear a different number of times in a poem, we calculate the weight for each keyword by normalizing its count by the number of morphemes in the poem and the number of poems in the category. Equation (1) shows how we normalize the keyword counts. The list of most frequent keywords is the list of the 40 morphemes with the highest counts.
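Equation (1) itself did not survive extraction, but the surrounding sentence fully describes the normalization. One reading consistent with that description is sketched below; the symbols are our own labels, not necessarily the paper's notation.

```latex
w_{k,c} \;=\; \frac{1}{N_c} \sum_{p \in c} \frac{n_{k,p}}{M_p} \qquad (1)
```

Here $n_{k,p}$ is the number of occurrences of keyword $k$ in poem $p$, $M_p$ is the number of morphemes in poem $p$, and $N_c$ is the number of poems in category $c$.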

Figure 4. Decision tree by should not appear keywords

TABLE I. 10 SELECTED KEYWORDS

TABLE II. ORIGINAL DISTRIBUTION OF CATEGORIES

TABLE III. TOP 10 MOST FREQUENT KEYWORDS AND WEIGHTS OF "NATURE"

TABLE IV. TOP 10 SHOULD NOT APPEAR KEYWORDS AND WEIGHTS OF "NATURE"

TABLE V. DECISION TREE ACCURACY BY MAXIMUM DEPTH

Table VII shows the accuracy of the Support Vector Machine (SVM) classifier with the Radial Basis Function (RBF) kernel used in our experiments. By repeating the experiments with different parameter values, we found that the optimum value of gamma is 0.2.

TABLE VII. RBF SVM ACCURACY BY GAMMA

Table VIII shows the classification accuracy of the classifiers in the different approaches.

TABLE VIII. AVERAGE ACCURACY OF EACH APPROACH BY CLASSIFIERS

TABLE IX. CONFUSION MATRIX FOR DECISION TREE FOR THE APPROACH USING SELECTED KEYWORDS

TABLE X. CONFUSION MATRIX FOR DECISION TREE FOR THE APPROACH USING MOST FREQUENT KEYWORDS

TABLE XI. CONFUSION MATRIX FOR DECISION TREE FOR THE APPROACH USING SHOULD NOT APPEAR KEYWORDS