Javanese Document Image Recognition Using Multiclass Support Vector Machine

Some ancient documents in Indonesia are written in the Javanese script. Those documents contain knowledge of the history and culture of Indonesia, especially of Java. However, only a few people understand the Javanese script. Thus, an automated system is needed to translate documents written in the Javanese script. In this study, the researchers use a classification method to recognize the Javanese script written in documents. The method used is the Multiclass Support Vector Machine (SVM) with the One Against One (OAO) strategy. The researchers use seven variations of Javanese script from different documents. There are 31 classes and 182 images for training and testing data. The result shows good performance in the evaluation. The recognition system successfully resolves the problem of color variation in the dataset. The accuracy of the study is 81.3%.


I. INTRODUCTION
The Javanese script is often used in ancient Javanese documents that record Javanese and Nusantara (Indonesian) culture [1]. However, only a few people understand the Javanese script [2], [3]. An automated system is therefore needed to translate books written in the Javanese script. Such a system will help people to understand these documents and to study the knowledge they contain.
The automatic translation of the Javanese script consists of four main stages. The first stage is segmentation to obtain the Region of Interest (ROI) images, one for each character, or letter, of the Javanese script. The second stage is feature extraction from each ROI. In the third stage, each character is recognized as a letter of the alphabet using a classification method. The last stage combines the letters into meaningful words. The Javanese script uses the Scriptio Continua (continuous writing) model [4], that is, a script that does not use spaces or other punctuation.
Moreover, the alphabet of the Javanese script consists of 20 main characters, which are syllabic. The Javanese script also has Sandhangan characters, Pasangan characters, Murda characters, Swara characters, punctuation marks, and numbers [5], [6], [7]. There are several studies on the Javanese script and on other ancient scripts from different cultures and countries. The topics of this research include segmentation, feature extraction, recognition or classification, and transliteration. Reference [3] proposed the Hidden Markov Model (HMM) to recognize the characters of the Javanese script. The study performed well, with an accuracy of 85.7%. In 2017, Ref. [4] improved the method, but the accuracy remained at 85.7%. Reference [2] proposed a new method for recognizing the Javanese script that uses a deep learning technique to classify the dataset. The dataset for that study consisted of handwritten Javanese characters, and the reported accuracy was 94.57%.
The challenge in building an automatic translation system for the Javanese script is to achieve high recognition accuracy while handling the variety of scripts in the dataset. The variations come from the shape of the letters and from the color of the letters and background. In this study, the researchers propose a classification method to recognize the Javanese script written in documents. The method is the multiclass Support Vector Machine (SVM) with the One Against One (OAO) strategy. This method has been applied to ancient scripts by Ref. [8], which used English and Bengali script documents as the dataset.

II. LITERATURE REVIEW
A. Characteristics of the Javanese Script
Javanese culture is one of the popular cultures that show the identity of Nusantara, or Indonesia. Many intellectual properties about the culture are written in ancient books using the Javanese script. The contents of ancient Javanese books include linguistics, myths, religion, philosophy, laws and norms, folklore, kingdoms, and histories.
Cite this article as: Y. Sugianela and N. Suciati, "Javanese Document Image Recognition Using Multiclass Support Vector Machine", CommIT (Communication & Information Technology) Journal 13(1), 25-30, 2019.
The Javanese script consists of 20 main characters known as Ha-Na-Ca-Ra-Ka. The name Ha-Na-Ca-Ra-Ka comes from the first five letters of the Javanese script. Figure 1 lists the main characters of the Javanese script.
The Javanese script also includes Sandhangan letters, which change the pronunciation of the main character. One or more Sandhangan letters can be attached to a main character. For example, the main character Ha combined with Sandhangan Wulu becomes Hi, and the main character Ka combined with Sandhangan Pepet and Cecek becomes Keng. The example is shown in Fig. 2. Table I gives examples of Sandhangan letters. In this study, the researchers only discuss the main and Sandhangan letters of the Javanese script. Seven variations of Javanese script from different documents are used. One of the datasets is a printed document in the Javanese script, Bloemlezing Uit Javaansche Werken (Proza), published in 1942. Although the title is written in Dutch, the content of the document is written in the Javanese language. Figure 3 is one of the documents used in this research. The other six datasets are images of the Javanese script from websites. Examples of the dataset are shown in Table II. Because of the good quality of the printed text, the distance between the lines is consistent, but there are some overlaps between characters. From Table II, it can be seen that the shape and color of the letters vary from source to source.

B. Support Vector Machine (SVM)
SVM was developed by Boser, Guyon, and Vapnik, and was first presented in 1992 at the Annual Workshop on Computational Learning Theory [9]. The concept of SVM combines computational theories that had existed for many years, such as the margin hyperplane [10]. The basic idea of SVM is a linear classifier, but in later development SVM was extended to non-linear problems through the kernel trick. SVM is a machine learning method based on Structural Risk Minimization (SRM). SVM works by finding the best hyperplane in the input space. A hyperplane in a vector space of dimension d is an affine subspace of dimension d − 1 that separates the vector space into two parts, each corresponding to a different class [11].
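The kernel trick mentioned above can be illustrated with the Radial Basis Function (RBF) kernel, the kernel adopted in this study. The following is a minimal sketch rather than the authors' implementation: the function name `rbf_kernel` and the value of `gamma` are assumptions chosen for illustration, since the paper does not report its kernel parameters.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Return K(x, y) = exp(-gamma * ||x - y||^2).

    gamma is an assumed illustrative value; the paper does not state it.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Similar feature vectors give values near 1, distant ones give values near 0.
print(rbf_kernel([0.1, 0.2], [0.1, 0.2]))
print(rbf_kernel([0.1, 0.2], [3.0, 4.0]))
```

Because the kernel depends only on the distance between two feature vectors, a linear separator in the implicit feature space corresponds to a non-linear boundary in the input space.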
In general, real-world problems are rarely linearly separable; most are non-linear. To handle this, SVM is modified using kernel functions. In this study, the researchers use the Radial Basis Function (RBF) kernel [12]. As mentioned in the Burges tutorial [13], SVM has been used for pattern recognition tasks such as isolated handwritten digit recognition [10] and text recognition [14]. In pattern recognition, the input data are classified into many classes, so multiclass SVM classification is needed. One popular multiclass strategy is OAO, which builds an SVM for each pair of classes and is therefore also known as the pairwise strategy [15]. This method resembles a knockout system in football tournaments. For a problem with k classes, k(k − 1)/2 SVMs are trained, each separating the data of one class from the data of another class [16].
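The OAO strategy described above can be sketched as pairwise voting. In this illustrative example the pairwise decision is a hypothetical stand-in (a nearest-centroid rule on a single number) rather than a trained binary SVM, and the names `count_pairwise_classifiers`, `oao_predict`, and `toy_decide` are assumptions. Majority voting over the pairs is one common way to combine the pairwise outputs; the paper does not state which combination rule it uses.

```python
from collections import Counter
from itertools import combinations


def count_pairwise_classifiers(k):
    """Number of binary classifiers needed by OAO for k classes."""
    return k * (k - 1) // 2


def oao_predict(sample, classes, pairwise_decide):
    """Let every pair of classes vote and return the class with the most votes.

    pairwise_decide(sample, a, b) must return either a or b; in a real system
    it would be a binary SVM trained on the data of classes a and b only.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        votes[pairwise_decide(sample, a, b)] += 1
    return votes.most_common(1)[0][0]


# Hypothetical one-dimensional centroids, used only to make the sketch runnable.
CENTROIDS = {"Ha": 0.0, "Na": 5.0, "Ca": 10.0}


def toy_decide(sample, a, b):
    """Stand-in for a binary SVM: pick the class whose centroid is closer."""
    return a if abs(sample - CENTROIDS[a]) <= abs(sample - CENTROIDS[b]) else b


print(count_pairwise_classifiers(31))
print(oao_predict(4.2, list(CENTROIDS), toy_decide))
```

With the 31 classes used in this study, the formula k(k − 1)/2 gives 465 pairwise classifiers.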

III. RESEARCH METHOD
In this study, the researchers use a multiclass SVM with the OAO strategy to recognize documents written in the Javanese script. Figure 4 shows the steps of the proposed method. First, the ROIs are obtained from the document image. Second, features are extracted from each ROI image using the Histogram of Oriented Gradient (HOG). Third, the feature vector from the HOG process is passed to the multiclass SVM in the classification step to obtain the label of each ROI. Each label consists of a name and a kind (main or Sandhangan).

IV. RESULTS AND DISCUSSION
The dataset for this research comes from the printed document and from images from websites. The researchers select the main characters and Sandhangan as ROIs from seven Javanese script documents. Because the ROIs vary widely in size, the researchers normalize each ROI to 64×64 pixels. If the height and width of the original ROI differ and the width is less than the height, padding is added on the left and right sides. Conversely, padding is added above and below when the height is less than the width. Thus, the height and width become equal. The color of the padding is set to match the background color of the ROI image. Examples of the padding step are shown in Figs. 6b and 6e. Figure 5 details the number of images used for each letter. The dataset contains 182 images of Javanese script letters in total, and the number of images differs from letter to letter.
The next step is extracting features using the HOG method. The inputs to this stage are grayscale ROI images. HOG is a method for extracting discriminative features. The feature is a histogram descriptor based on edges and orientations and is applied in object recognition. The method is widely used in the recognition of faces and animals and in the detection of vehicle images [17]. HOG is also used to extract features in the multiclass classification of Batik [18]. Reference [19] shows experimentally that the HOG method applied to Javanese script images can achieve good recognition accuracy.
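The size normalization described above (padding to a square, then scaling to 64×64 pixels) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the image is a list of rows of grayscale values, the background value is supplied by the caller, the padding is split as evenly as possible between the two sides, and the scaling uses nearest-neighbor sampling. The paper specifies none of these details, and the function names are hypothetical.

```python
def pad_to_square(image, background):
    """Pad a grayscale image (list of rows) with the background value until
    its height and width are equal."""
    height, width = len(image), len(image[0])
    if width < height:
        # Narrow ROI: add columns on the left and right.
        left = (height - width) // 2
        right = height - width - left
        return [[background] * left + list(row) + [background] * right
                for row in image]
    if height < width:
        # Wide ROI: add rows above and below.
        top = (width - height) // 2
        bottom = width - height - top
        rows_above = [[background] * width for _ in range(top)]
        rows_below = [[background] * width for _ in range(bottom)]
        return rows_above + [list(row) for row in image] + rows_below
    return [list(row) for row in image]


def resize_nearest(image, size=64):
    """Scale an image to size x size using nearest-neighbor sampling
    (an assumed interpolation choice)."""
    src_h, src_w = len(image), len(image[0])
    return [[image[r * src_h // size][c * src_w // size] for c in range(size)]
            for r in range(size)]


# A tall 3x1 dark stroke on a white (255) background.
roi = [[0], [0], [0]]
square = pad_to_square(roi, background=255)
normalized = resize_nearest(square, size=64)
print(square)
print(len(normalized), len(normalized[0]))
```

Padding before scaling keeps the aspect ratio of each letter, so narrow Sandhangan marks are not stretched to fill the square.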

There are several steps in the HOG method. It starts by calculating the gradient of each pixel. After that, the ROI image is divided into four cells (32 × 32 pixels each) to calculate the histogram of each cell. Figures 6c and 6f are examples of gradient illustrations in the HOG method.
Every pixel in a cell casts a vote for a histogram channel based on its orientation. The researchers use nine orientation bins in the voting that sets the histogram value contribution. The size of each vote is based on Eqs. (1)-(2) [19], in which ν is the contribution to the histogram value, µ is the gradient magnitude at the pixel, c is the center angle of the bin, θ is the gradient orientation angle at the pixel, w = 180/B is the bin width, and B is the number of histogram bins. The histogram values of a cell are grouped with those of the other cells and normalized to reduce the effect of differences in object brightness. The feature vector obtained in this step has 4 cells × 9 bins = 36 elements. Figure 7 is an example of the HOG result for Fig. 6a.
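The HOG steps above can be sketched in plain Python. This is a minimal illustration under several assumptions that the text does not confirm: gradients use central differences clamped at the image border, each vote is split between the two nearest bins by linear interpolation (a common HOG formulation, assumed here to correspond to the voting equations), bin centers lie at (j + 0.5)·w, and the four cell histograms are normalized together with the L2 norm. The function name `hog_features` is hypothetical.

```python
import math


def hog_features(image, cells_per_side=2, bins=9):
    """Return a cells_per_side^2 * bins feature vector for a square grayscale
    image whose side is divisible by cells_per_side (e.g. 64 -> four 32x32 cells)."""
    size = len(image)
    cell = size // cells_per_side
    w = 180.0 / bins  # bin width in degrees
    histograms = [[0.0] * bins for _ in range(cells_per_side ** 2)]

    for r in range(size):
        for c in range(size):
            # Central differences, clamped at the border (assumed scheme).
            gx = image[r][min(c + 1, size - 1)] - image[r][max(c - 1, 0)]
            gy = image[min(r + 1, size - 1)][c] - image[max(r - 1, 0)][c]
            mu = math.hypot(gx, gy)  # gradient magnitude
            if mu == 0.0:
                continue
            theta = math.degrees(math.atan2(gy, gx)) % 180.0  # unsigned angle
            # Split the vote between the two nearest bin centers:
            # bin j gets mu * (c_{j+1} - theta) / w, bin j+1 gets mu * (theta - c_j) / w.
            position = theta / w - 0.5
            j = math.floor(position)
            fraction = position - j
            hist = histograms[(r // cell) * cells_per_side + (c // cell)]
            hist[j % bins] += mu * (1.0 - fraction)
            hist[(j + 1) % bins] += mu * fraction

    # Group the cell histograms and normalize them together (assumed L2 norm).
    features = [value for hist in histograms for value in hist]
    norm = math.sqrt(sum(value * value for value in features)) + 1e-12
    return [value / norm for value in features]


# A 64x64 test image with a single vertical edge in the middle.
edge = [[0 if c < 32 else 255 for c in range(64)] for _ in range(64)]
print(len(hog_features(edge)))
```

With four cells and nine bins, the sketch produces the 36-element vector described above.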
The next step in this research is classification using multiclass SVM with the OAO strategy. The inputs for this stage are the 36-element feature vector and the label of each letter. The researchers use the whole dataset as both training and testing data. There are 31 classes and 182 images in total.
To evaluate the performance of the proposed method, the researchers measure its accuracy, which is defined in Eq. (3).
In the confusion matrix, True Positive (TP) is the number of class A samples correctly classified as class A. True Negative (TN) is the number of Non-A samples correctly classified as Non-A. False Positive (FP) is the number of Non-A samples classified as class A. False Negative (FN) is the number of class A samples classified as Non-A. Figure 8 is an illustration of the confusion matrix [20].
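For reference, the standard confusion-matrix definition of accuracy can be written with the four counts above. It is assumed here, rather than confirmed by the text, that this is the form intended by the accuracy equation:

```latex
\begin{equation}
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\end{equation}
```

If accuracy is read as the fraction of correctly labeled images, the reported 81.3% on 182 images would correspond to about 148 correct labels, since 148/182 ≈ 0.813.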
In this research, each class of letters has a different amount of training data. Some classes have more than five training samples, such as Ha, Na, Ca, Ra, Ka, and Sandhangan Taling, Tarung, and Cecek. However, other classes have fewer than four training samples, such as Sandhangan Adeg-adeg, Titik, and Koma. This does not affect the classification results, because ROIs from classes with only one training sample are still classified correctly. The result shows good performance in the evaluation. However, there are some mistakes in the model classification. The main factor in misclassification is similarity in letter shape, which produces similar HOG values. Figure 9 shows the detailed results for each letter. The recognition system successfully resolves the problem of color variation in the dataset. Most of the mistakes come from letters with an italic shape, as shown in Table III.
The researchers also compare the performance of multiclass SVM with other popular classification methods: Random Forest (RF), K-Nearest Neighbor (KNN), and Artificial Neural Network (ANN).

V. CONCLUSION
The researchers use multiclass SVM with the OAO strategy to recognize the Javanese script, with 31 classes and 182 images for training and testing data. The result shows good performance in the evaluation, with an accuracy of 81.3%. This is the best result compared with the other popular classification methods (RF, KNN, and ANN). For further research, a segmentation method is needed to translate Javanese script documents automatically. Evaluation on other kinds of Javanese characters, such as Pasangan, Swara, Angka, and Murda, is also required. Moreover, the recognition system successfully resolves the problem of color variation in the datasets, but many mistakes remain for letters with an italic shape, which may be addressed through the feature extraction method. This kind of study can also be carried out on printed Sundanese and Balinese scripts because they have similar characteristics. Improvements are also needed to recognize ancient handwritten scripts.