Support Vector Machines Study on English Isolated-Word-Error Classification and Regression



INTRODUCTION
Support Vector Machines (SVM) are a useful technique for data classification (Vapnik, 1998; Chang and Lin, 2011) and regression (Vapnik, 1998; Smola and Scholkopf, 2003). SVM requires each data instance to be represented as a vector of real numbers, and the data should be scaled before SVM is applied (Hsu et al., 2010). In supervised learning problems, feature selection, whereby a subset of the features available from the data is selected for application of a learning algorithm (Blum and Langley, 1997; Weston et al., 2000), is important for generalization performance and running time requirements (Weston et al., 2000; Dasgupta et al., 2007). Selecting an optimal set of features is in general difficult, both theoretically and empirically (Kohavi and John, 1997).
In machine learning, satisfactory feature selection is typically achieved through cross-validation, grid search and automatic scripts (Hsu et al., 2010; Chang and Lin, 2011), which eliminate the irrelevant features or variables (Alain, 2003).
The tasks of error detection and correction must be clearly defined (Chang and Lin, 2011). Efficient techniques such as n-gram analysis and lookup tables have been devised for detecting strings that do not appear in a given word list, dictionary or lexicon (Kukich, 1992). Many existing spelling correctors use techniques such as rule-based correction and minimum edit distance (Kukich, 1992). In this study, we propose Support Vector Classification (SVC) to detect whether a word is correct or wrong, and Support Vector Regression (SVR) to predict the possible correct words as replacements for a wrong word.
The rest of this study is organized as follows: the next section gives a brief survey of the theory of Support Vector Machines (SVM), followed by the methodology, the experimental results and discussion and, finally, the conclusion.

THEORY ON SUPPORT VECTOR MACHINES
Basically, SVM is based on the Structural Risk Minimization (SRM) principle from statistical learning theory (Vapnik, 1998). SVM is a set of related supervised machine learning methods used for classification. In contrast to the empirical risk minimization principle, which is used by neural networks to minimize the error on the training data, SRM minimizes a bound on the testing error, thus allowing SVM to generalize better than conventional neural networks (Gunn, 1997). Apart from the problems of poor generalization and overfitting in neural networks, SVM also addresses the problems of training and testing efficiency and of parameter optimization that are frequently encountered in neural networks (Cristiani and Taylor, 2000). Support vectors in SVM can be categorized into two types: the training samples that lie exactly on the margin of the separating hyperplane and the training samples that lie beyond their corresponding margins, whereby the latter are regarded as misclassified samples (Zhan and Shen, 2005). At the moment, SVM usage generally covers data classification and regression only (Chang and Lin, 2011; Drucker et al., 1996).
Consider training data consisting of m labeled vectors $\{x_i, y_i\}$ for $i = 1, 2, \ldots, m$, where $x_i \in \mathbb{R}^n$ represents an n-dimensional input vector and $y_i \in \{-1, +1\}$ represents the class label. These training patterns are linearly separable if a vector w (the orientation of a discriminating plane) and a scalar b (the offset of the discriminating plane from the origin) can be defined so that the inequalities in Eq. (1) and (2) are satisfied (Nagi et al., 2008). A hyperplane which divides the data is to be determined. The procedure involves determining w and b so that, for all values of $y_i$, Eq. (1) and (2) can be combined into the single inequality of Eq. (3). If an optimal hyperplane satisfying Eq. (3) exists, then the two classes are said to be linearly separable and Eq. (3) can be rewritten as Eq. (4) (Nagi et al., 2008). If the data are not linearly separable, a slack variable $\xi_i$ for $i = 1, 2, \ldots, m$ is introduced with $\xi_i \geq 0$, such that Eq. (3) is represented as Eq. (5) (Nagi et al., 2008). The generalized optimal separating hyperplane is then obtained using the condition in Eq. (6) (Nagi et al., 2008). The first term in Eq. (6) controls the learning capacity, while the second term controls the number of misclassified points. The parameter C, which is selected by the user (Nagi et al., 2008), is viewed as a regularization parameter that characterizes the willingness to accept possible misclassifications in linearly nonseparable datasets. The classification function is given by Eq. (7) (Nagi et al., 2008), where the $\alpha_i$ are the nonnegative Lagrange multipliers associated with the support vectors. The parameters $\alpha_i$ and b are determined by SVM's learning algorithm (Aida, 2009) during the classification process. The parameters $y_i$ and $K(x_i, x_j)$ are the label or class type and the kernel function, respectively. The vector $x_i$ is the training or test data, while $x_j$ is the input data for prediction. On the other hand, regression uses a separate regression function g(x). For our case study, we have chosen the Radial Basis Function (RBF) as the kernel, as it has fewer numerical difficulties (Vapnik, 1998) and has the property defined by $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$. The kernel bandwidth parameter γ and the penalty parameter C for the RBF kernel are pre-determined by v-fold cross-validation or grid search (Hsu et al., 2010).
A recent result (Keerthi and Lin, 2003) shows that if RBF is used with model selection, then there is no need to consider the linear kernel.
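For reference, the relations cited in this section as Eq. (1) to (3) and Eq. (5) to (7) can be sketched in their standard textbook soft-margin form. This is an assumption: the display equations are not reproduced in the text, so the forms below are the usual ones for the quantities the text defines (w, b, $\xi_i$, C, $\alpha_i$, $y_i$, K, γ), and they may differ in notation from those of Nagi et al. (2008). Eq. (4) and the regression function are not reconstructed here.

```latex
\begin{align}
  w \cdot x_i + b &\geq +1 \quad \text{for } y_i = +1 \tag{1} \\
  w \cdot x_i + b &\leq -1 \quad \text{for } y_i = -1 \tag{2} \\
  y_i \,(w \cdot x_i + b) &\geq 1, \quad i = 1, 2, \ldots, m \tag{3} \\
  y_i \,(w \cdot x_i + b) &\geq 1 - \xi_i, \quad \xi_i \geq 0 \tag{5} \\
  \min_{w,\, b,\, \xi} \;& \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{m} \xi_i \tag{6} \\
  f(x_j) &= \operatorname{sgn}\!\Big( \sum_{i=1}^{m} \alpha_i \, y_i \, K(x_i, x_j) + b \Big) \tag{7}
\end{align}
% RBF kernel used in this study:
% K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)
```

In this form the first term of Eq. (6) is the margin term and the second is the penalty on the slack variables, which matches the roles described for the two terms above.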

METHODOLOGY ON SVM CLASSIFICATION AND REGRESSION MODEL
We define a word literally as a word that can be correctly read or identified without any ambiguity; a non-word is defined as the opposite. The scope of our study falls under the isolated-word error category (Kukich, 1992), specifically machine-printed English words only. In this study, we propose a model that represents a word so as to meet the SVM classification requirements, with a maximum of eleven attributes or features: the number of consonant or non-vowel (nv) and vowel (v) characters in the word, the total letter length (len), the word weight (w), the weight of the first (w1) and last letter (wn), the position-vowel factor (pvf), the sum of position-consonant-vowel factor (pcf), the position factor (pf), the real value (RV) and, lastly, the type of the first letter (typ) of the word. In order to classify these words correctly, we propose five rules. Rule 1: The number of non-vowels plus the number of vowels in a word equals the length of the word. To illustrate our proposed model, consider the word final. With the help of Rules 1, 2 and 3, it can be shown that nv = 3, v = 2, len = 5, w = 42, w1 = 6, wn = 12, pvf = 22, pcf = 108, pf = 130, typ = 0 and RV = 0.553132238. We used LibSVM (Hsu et al., 2010; Chang and Lin, 2011), an SVM learning tool package, for training and testing on the normalized words. The five rules play an important mathematical role in the classification and regression of each word.
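The worked example for the word final can be reproduced with a short sketch. The letter weights (a = 1 to z = 26, apostrophe = 0) and the type of the first letter follow the rules stated in this study. The formulas for pvf, pcf and pf are assumptions inferred from the reported values (sum of 1-based position times letter weight over vowels, over non-vowels, and over all letters). RV is omitted because its formula is not given. The function names are hypothetical.

```python
VOWELS = set("aeiou")


def letter_weight(ch):
    """Rule 2: a/A = 1, ..., z/Z = 26; apostrophe and other symbols = 0."""
    ch = ch.lower()
    return ord(ch) - ord("a") + 1 if "a" <= ch <= "z" else 0


def word_features(word):
    """Return the integer attributes of a non-empty word (RV is not computed)."""
    if not word:
        raise ValueError("word must be non-empty")
    weights = [letter_weight(ch) for ch in word]
    is_vowel = [ch.lower() in VOWELS for ch in word]
    v = sum(is_vowel)
    # Assumed definitions: position (starting at 1) times letter weight,
    # summed separately over vowels (pvf) and non-vowels (pcf).
    pvf = sum(p * w for p, (w, vow) in enumerate(zip(weights, is_vowel), 1) if vow)
    pcf = sum(p * w for p, (w, vow) in enumerate(zip(weights, is_vowel), 1) if not vow)
    return {
        "nv": len(word) - v,          # Rule 1: nv + v = len
        "v": v,
        "len": len(word),
        "w": sum(weights),            # word weight = sum of letter weights
        "w1": weights[0],
        "wn": weights[-1],
        "pvf": pvf,
        "pcf": pcf,
        "pf": pvf + pcf,
        "typ": 1 if is_vowel[0] else 0,  # Rule 3
    }


print(word_features("final"))
```

Under these assumed definitions the sketch yields the integer attributes listed above for final; whether the same definitions hold for every word in the data base cannot be confirmed from the text.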
Word classification: We propose a two-step procedure as follows:
Step 1: Select the cross-validation parameter v = 10 and the RBF kernel function. By choosing v = 10, LibSVM package (Chang and Lin, 2011)
Wrong word regression: Again, we propose a two-step procedure:
Step 1: Choose the RBF kernel. Train on the training data, setting the parameters C and γ to the values obtained through the cross-validation process.
Step 2: Test the same unknown data to obtain the mean squared error and squared correlation coefficient, and hence decide the possible correct words.
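The parameter selection in Step 1 can be sketched as a v-fold cross-validation grid search over (C, γ). This is a minimal illustration, not the LibSVM script itself: the exponent ranges are assumed to mirror the defaults of LibSVM's grid script (C from 2^-5 to 2^15, γ from 2^-15 to 2^3, in steps of 2 in the exponent), and `cv_accuracy` is a hypothetical callback that would train and evaluate an RBF-kernel SVM on the given folds.

```python
import random


def log2_grid(lo, hi, step):
    """Return [2**lo, 2**(lo + step), ...] with exponents not exceeding hi."""
    return [2.0 ** e for e in range(lo, hi + 1, step)]


def kfold_indices(n_samples, v=10, seed=0):
    """Shuffle the sample indices and deal them into v disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[k::v] for k in range(v)]


def grid_search(n_samples, cv_accuracy, v=10):
    """Return (best_accuracy, C, gamma) over the assumed power-of-two grid.

    cv_accuracy(C, gamma, folds) is a hypothetical user-supplied function
    that returns the mean cross-validation accuracy for one parameter pair.
    """
    folds = kfold_indices(n_samples, v)
    best = None
    for c in log2_grid(-5, 15, 2):
        for gamma in log2_grid(-15, 3, 2):
            acc = cv_accuracy(c, gamma, folds)
            if best is None or acc > best[0]:
                best = (acc, c, gamma)
    return best
```

Both parameter pairs reported later from the automatic script, (0.03125, 0.0078125) and (0.5, 0.0078125), are powers of two that lie on this assumed grid.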

EXPERIMENTAL RESULTS AND DISCUSSION
Word classification: We found that SVM can be trained on words with and without spelling errors, and thus has the potential to adapt to the unique pattern of each different word. This unique pattern characterizes the personal preference of a person for certain words.
That is, a person tends to be consistent when choosing words, while another person may prefer a different set of words during conversation, texting or writing. We define these preferences as the personalized dictionary of a particular person; this allowed us to create a data base with a smaller number of words for SVM training. To find the extent of our proposed classification model, we carried out two different studies in which all our data (training and testing) were normalized to the range (0, 1) and each word appears only once in our data base.
For the first study, we took three drafts of three different technical reports by a student, namely Report 1, Report 2 and Report 3 (Vivekan, 2010); normally such reports have to be checked, corrected and commented on by the respective supervisor before they may be submitted to the University. We found that these reports contain very few spelling errors made by the student himself. The pattern of words used across the three reports is distinctive for the student, meaning that he tends to use particular words in his writing. Since the total number of spelling errors in the reports is small, we created 418 spelling errors (26.2%) ourselves out of the total of 1595 words. We divided the data in a 70:30 ratio, whereby 478 words (30%) were used for testing, with v = 10. In this first study we obtained a testing accuracy of 77% with the best values of (C, γ) = (1, 1), found by an iteration technique, using seven attributes. In the second study, we increased the number of attributes and training words. Using an automatic script (Hsu et al., 2010) (Fig. 1), we obtained the best parameters (C, γ) = (0.03125, 0.0078125), giving a higher test accuracy of 78.27% (Table 1). From the classification model file, we observed that the parameter ρ = 1.00380 and that the number of support vectors is 695. Lastly, Kukich (1992) mentioned that there is no optimal number of attributes that could be considered the most appropriate to use.
Fig. 1: Accuracy obtained using an automatic script
In order to improve the accuracy, we tried rescaling our data from the (0, 1) range to (-1, 1), as proposed by Hsu et al. (2010). We observed no change in the test accuracy for either data sample, meaning that our normalization of all the words to the (0, 1) range is acceptable. For future work on improving accuracy, one possibility that we could look into is word pronunciation. Presently, our proposed model does not include pronunciation as one of the attributes; a dedicated technique would have to be devised so that words can be 'logically' pronounced for the purpose of SVM processing.
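The two scalings compared above can be sketched as a column-wise min-max transformation. This is an illustrative assumption about how the normalization might be done (the exact procedure used in this study is not stated); the function names and the toy attribute rows are hypothetical, and mapping a constant column to the lower bound is a choice made only for this sketch.

```python
def fit_min_max(rows):
    """Return the (min, max) of every attribute column in the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def scale(rows, ranges, lower=0.0, upper=1.0):
    """Linearly map each attribute from its (min, max) to (lower, upper)."""
    scaled = []
    for row in rows:
        out = []
        for x, (lo, hi) in zip(row, ranges):
            if hi == lo:
                out.append(lower)  # constant column: sketch-only convention
            else:
                out.append(lower + (upper - lower) * (x - lo) / (hi - lo))
        scaled.append(out)
    return scaled


# Hypothetical rows of two attributes (for example len and pf) for three words.
train = [[3, 42], [5, 130], [4, 86]]
ranges = fit_min_max(train)
print(scale(train, ranges, 0.0, 1.0))
print(scale(train, ranges, -1.0, 1.0))
```

The ranges should be fitted on the training data only and then reused for the test data, so that both sets are scaled consistently, as recommended by Hsu et al. (2010).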
Comparing the work of Lyon and Yaeger (1996), Al-Jawfi (2009) and Khawaja et al. (2006) on the classification of hand-printed English, Arabic and Chinese characters, respectively, using neural networks, we found that the selected attributes were mainly based on the character size or on pixel counts obtained from the character images. Rani et al. (2011) used Gabor filters to obtain the concentration of energies in various directions for identifying printed English numerals at word level in Punjabi documents, and chose 5-fold cross-validation. Our proposed SVM model uses a maximum of eleven different behavioral features to represent an English word, with 10-fold Cross-Validation (CV).
Referring to Table 2, we observed that our study produced the highest accuracy, suggesting that SVM can be an effective tool for classifying isolated-word errors in texts. Table 3 shows the comparison with Frank and Asuncion (2010) on related work on British English and Telugu vowels; again, the accuracy values obtained are reasonably acceptable given the constraints in obtaining a large number of attributes.
During the classification process, one of the files produced was a model file, from which we obtained the parameters b = -rho = 1.00380 and γ = gamma = 0.0078125. The summation in Eq. (7) is partly shown in Table 4 for a set of prediction data $x_{j1}$ = 0.148260212, $x_{j2}$ = 0.211119459, $x_{j3}$ = 0.189854345, $x_{j4}$ = 0.444444444, $x_{j5}$ = 0.5, $x_{j6}$ = 0.4375, $x_{j7}$ = 0.363636364, $x_{j8}$ = 0.48, $x_{j9}$ = 0.56, $x_{j10}$ = 0 and $x_{j11}$ = 0.161035843. We define the parameter m in Eq. (7) to be the number of support vectors, that is, m = 695. The RBF kernel calculates the exponential of the negative squared Euclidean distance, scaled by γ, between each support vector and the unknown vector that we want to predict. Accordingly, it can be shown that for i = 695, $f(x_{695})$ = -0.029704383. Hence, the decision function can be calculated with the offset parameter b = 1.00435. Graphically, f(x) is a point mapped on the positive side of the hyperplane in the feature space, meaning that this test data has been classified into the +1 class. Rechecking with our data base shows that the test or predicted data used in the calculation above was the word 'maintain'. The word was originally labeled as a Label 1 word, which means it was a correctly printed English word. Table 5 shows the classification results for three other test words.
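The evaluation of the decision function described above can be sketched as follows. The prediction vector and γ are the values quoted in the text; the two support vectors, their coefficients and the offset b are hypothetical placeholders, since the 695 support vectors of the actual model file are not reproduced here.

```python
import math


def rbf_kernel(xi, xj, gamma):
    """K(xi, xj) = exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * squared_distance)


def decision_value(support_vectors, coefs, x, gamma, b):
    """Sum of coef_i * K(sv_i, x) plus b, where coef_i stands for alpha_i * y_i."""
    total = sum(c * rbf_kernel(sv, x, gamma) for sv, c in zip(support_vectors, coefs))
    return total + b


def classify(value):
    """Map the sign of the decision value to the class labels +1 / -1."""
    return 1 if value > 0 else -1


# Prediction vector and gamma as quoted in the text.
x_j = [0.148260212, 0.211119459, 0.189854345, 0.444444444, 0.5, 0.4375,
       0.363636364, 0.48, 0.56, 0.0, 0.161035843]
gamma = 0.0078125

# Hypothetical toy model: two support vectors, coefficients and offset.
toy_svs = [x_j, [0.0] * 11]
toy_coefs = [0.5, -0.5]
toy_b = 0.0

value = decision_value(toy_svs, toy_coefs, x_j, gamma, toy_b)
print(value, classify(value))
```

With a real model file, the coefficients and the offset would be read from the file, and the sum would run over all support vectors rather than over the two toy entries.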
Wrong word regression: Table 6 shows the regression results on our test data for different combinations of C and γ; the best results came from the values of C and γ obtained by the iteration technique (Hsu et al., 2010). It was observed that when (C, γ) = (0.5, 0.0078125), we get the third smallest total number of support vectors, which provides a shorter processing time for regression, the largest distance of the regression hyperplane from the origin, the smallest Mean Squared Error (MSE) and the best Squared Correlation Coefficient (SCC). Statistically, MSE measures the average of the squares of the "errors", where the error is the amount by which the value implied by the model differs from the test data. The difference occurs because of randomness or because the model does not account for information that could produce a more accurate estimate (Wikipedia, 2012). On the other hand, the SCC is a statistical measure of how well the regression line approximates the real data points. An SCC of 1.0 indicates that the regression line perfectly fits the data (Wikipedia, 2012). We observed that the SCC values obtained are very low, even though the MSE values can be considered good (Table 6). Thus, our regression model appears to face some difficulties in predicting a correct word.
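The two regression measures discussed above can be computed as in the following sketch. The SCC is written here as the square of the Pearson correlation between target and predicted values, which is the usual definition; the function names and the toy data are hypothetical and are not taken from Table 6.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def squared_correlation(y_true, y_pred):
    """Square of the Pearson correlation between targets and predictions."""
    n = len(y_true)
    sum_t, sum_p = sum(y_true), sum(y_pred)
    sum_tt = sum(t * t for t in y_true)
    sum_pp = sum(p * p for p in y_pred)
    sum_tp = sum(t * p for t, p in zip(y_true, y_pred))
    numerator = (n * sum_tp - sum_t * sum_p) ** 2
    denominator = (n * sum_pp - sum_p ** 2) * (n * sum_tt - sum_t ** 2)
    return numerator / denominator


targets = [1, 2, 3, 4]
doubled = [2, 4, 6, 8]  # perfectly correlated with the targets, but biased

print(mean_squared_error(targets, targets), squared_correlation(targets, targets))
print(mean_squared_error(targets, doubled), squared_correlation(targets, doubled))
```

The toy data show that the two measures capture different things: predictions can be perfectly correlated with the targets and still have a large MSE, and conversely a small MSE does not guarantee a high SCC when the targets vary little.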
On the other hand, the regression function g(x) for each of the words (Table 7) indicates that the regression model or kernel obtained while training on the training data was a good model, in that g(x) is always greater than 0.9. What we want from our study on regression is that, if someone types a wrong word or words with label -1, such as hazadous, hazarduos or azardous, the error(s) can be corrected either by the user or by a word corrector. Thus, in order to correct the error(s), we calculated the Margin of Error (ME), which equals ±1.96 times the sample standard error (Wikipedia, 2012) for a 95% confidence level, with a sample size of 451. We propose using only the three most important of the eleven parameters initially considered for our regression model, namely NRV, Npvf and Npcf. We found that the word hazarduos can be corrected and replaced by the word hazardous, but hazadous and azardous cannot. These findings are based on the calculated NRV, Npvf and Npcf values shown in Table 7. Hence, the statistical method that we have adopted here has not been able to produce the expected results, even though the margin of error of each of the three parameters is less than 1.4%. The contents of Table 7 therefore need further analysis.
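The margin-of-error calculation described above can be sketched as 1.96 times the standard error of a sample. The sample values below are hypothetical, and the `within_margin` check is only an assumed way of comparing a wrong word's attribute with that of a candidate replacement; the exact decision rule used in this study is not stated.

```python
import math
import statistics


def margin_of_error(sample, z=1.96):
    """z times the standard error of the mean (z = 1.96 for 95% confidence)."""
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    return z * standard_error


def within_margin(candidate, reference, margin):
    """Hypothetical rule: accept if the two values differ by at most the margin."""
    return abs(candidate - reference) <= margin


sample = [1, 2, 3, 4, 5]  # hypothetical attribute values
me = margin_of_error(sample)
print(me, within_margin(3.5, 3.0, me))
```

In the study the same calculation would be applied to each of the three normalized attributes (NRV, Npvf and Npcf) over the 451 samples.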

CONCLUSION
In this study, we have demonstrated the ability of SVM to classify, and to perform regression on, certain isolated-word errors in English words. One finding was that SVC can be trained as a word classification system, but for regression our SVR outputs need further fine tuning.

Rule 2: The weight of a word is the sum of the weights of its characters. Each character has a specific weight: (A, a) = 1, (B, b) = 2, (C, c) = 3, (D, d) = 4, and so on until (Z, z) = 26, with (') = 0.
Rule 3: The type of the first character of the word is set to 0 if the first character is a non-vowel and to 1 if it is a vowel.
Rule 4: The first letter of a word does not necessarily carry the highest probability in determining the word classification.
Rule 5: If the weight of the first letter of the first word differs from that of the first letter of the second word, and so on for consecutive letters, then the two words are different.

Table 1: Test results
No. of attributes used

Table 4: Partial calculation for kernel function K(x_i, x_j) for i = 1 and 2
Iteration i

Table 7: Regression function vs SVR parameters