A Framework for Grammatical Error Detection and Correction System for Punjabi Language Using Stochastic Approach

INTRODUCTION: In this modern era of the internet and technology, natural language processing has emerged as one of the major research areas in computer science. A grammatical error detection and correction system assists in detecting and correcting syntactic errors present in written text. OBJECTIVES: In this research article, the authors investigate the applicability of a stochastic approach for the development of a grammatical error detection and correction system for the Punjabi language. METHOD: The authors used a corpus-based stochastic approach to develop the system. The corpus used was taken from the Indian Languages Corpora Initiative. RESULTS: On testing, the developed system shows a precision of 82.5%, recall of 89%, and F-measure of 85%. The proposed system outperforms the existing rule-based system, which shows a precision of 76.79%, recall of 87.08%, and F-measure of 81.61%. CONCLUSION: The authors conclude that, for syntax analysis, a stochastic approach can perform better than a rule-based approach.


Introduction
Today, millions of people around the world use the Punjabi language for speaking and writing. In fact, non-native Punjabi speakers currently outnumber native speakers, and their numbers will keep increasing in the future. Non-native Punjabi speakers usually make errors in written text, and these errors are of various types according to their complexity. A practical grammatical error correction (GEC) system to correct errors in Punjabi text promises to benefit millions of Punjabi language learners around the world. GEC also has commercial potential: there is great scope for many practical applications, such as proofreading tools that help non-native speakers identify and correct their writing errors without human intervention, or educational software for automated language learning and assessment. One popular existing professional GEC system is Grammarly (used for the detection and correction of spelling and grammar errors in the English language). The general architecture of a GEC system is shown in Figure 1.

Figure 1. General representation of GEC
As shown in Figure 1, the user gives input text in the form of paragraphs or sentences. The GEC system checks the grammatical or syntactical correctness of the input text as per the grammar rules of the language in which the input text is written. If the input text is found to be correct, then no errors or suggestions are reported; otherwise, if the text is found to be grammatically incorrect, the system provides suggestions to rectify the errors.
Punjabi is the official language of the Indian state of Punjab. There are approximately 125 million Punjabi speakers in India. Other than India, Punjabi is also spoken by a number of migrated people residing in Canada, the USA, Australia, the UK, etc. The Punjabi language is also used in Pakistan, in written as well as spoken form. Different scripts are used to write Punjabi in India and Pakistan: the script used in India is Gurmukhi, while the script used in Pakistan is Shahmukhi. The Center for Technical Development of Punjabi Language has already developed a software called Sangam (Gurpreet Singh Lehal & Saini, 2014) that can convert Gurmukhi to Shahmukhi and vice versa. There are many organizations working on the technical development of the Punjabi language. The main organizations working in this field are the Center for Technical Development of Punjabi Language (Punjabi University, Patiala), C-DAC Mohali, and Thapar Institute of Engineering and Technology (TIET), Patiala. Besides these, researchers from TDIL (Technical Development of Indian Languages) and IIIT (International Institute of Information Technology) Hyderabad are also working on the technical development of the Punjabi language. Some of the Punjabi language processing resources developed by these organizations include a Punjabi spell checker (Dhanju, Lehal, Saini, & Kaur, 2015), a Punjabi grammar checker (Gill, n.d.), a Punjabi POS tagger (Adamson, 2009), a Punjabi morphological analyzer (Gill, 2007), Gurmukhi to Shahmukhi machine translation (Gurpreet Singh Lehal, 2009), Hindi to Punjabi machine translation (Goyal & Lehal, 2009), a Punjabi to Hindi machine translation system (Josan & Lehal, 2008), a Punjabi optical character recognition system (G.S. Lehal & Singh, 2002), Punjabi summarization (Gupta & Singh, 2012), etc.

Existing Work
As discussed in Section 1, various researchers and organizations are working on the development of natural language processing resources, but a lot of work on the development of GEC is still pending. After reviewing the literature by different authors, it is observed that mainly rule-based, classifier-based, and statistics-based methods are used for GEC system development. Some of the observations from the reviewed literature are discussed in the following sections.

Rule Based Approach
This is the oldest method used for the development of GEC. In the beginning, simple pattern matching and string replacement techniques were used to implement the rule-based approach. Later on, syntactic parsing using part-of-speech tagging, tree parsing, and hand-crafted rules was used (Heidorn, Jensen, Miller, Byrd, & Chodorow, 1982). The first grammar checking tools, such as the Unix Writer's Workbench (MacDonald et al., 1982) or EPISTLE and CRITIQUE (Heidorn et al., 1982), used hand-crafted rules and pattern matching techniques. The most widely used grammar checker nowadays, from the Microsoft Word text editor (Heidorn, 2000), relies mostly on a rule-based approach. Another recent example is LanguageTool (Miłkowski, 2010), which was initially developed by Naber (2003). One of the disadvantages of rule-based systems is that most errors are complex, and rule-based systems fail to rectify such errors. Further, it is not feasible to construct an exhaustive set of rules to rectify all possible types of grammatical errors. Therefore, in the development of most GEC systems now, instead of employing only a rule-based mechanism, a stochastic or hybrid approach is preferred.

Classifier Based Approach
Now, because of the easy availability of annotated corpora, various machine learning classifiers have been developed to correct incorrect sentences ((Han, Hall, Chodorow, & Leacock, 2008), (Rozovskaya, Tech, & Roth, 2016)). In this approach, GEC is simulated as a classification problem with multiple classifiers, in which an incorrect candidate sentence may have multiple possible correct solutions. This approach was used by (Han et al., 2008), who trained a maximum entropy classifier to detect article errors and achieved an accuracy of 88%. Further, (Tetreault & Chodorow, 2008) used maximum entropy models to correct errors for 34 common English prepositions in learner text. A commonly used method in this approach is to build multiple classifiers, one for each error type, and cascade them into a pipeline. A combination of rule-based and classifier models to build GEC systems (that can solve multiple error types) was tried by [23]. The disadvantage of the classifier approach is that it can be applied to solve only those errors which are independent of each other; it is unable to solve dependent errors. The problem of dependent errors was addressed by developing a system of multiple classifiers for a sentence containing dependent errors [27]. In addition, (Dahlmeier & Ng, 2012) developed a beam-search decoder for correcting interacting errors.

Statistics Based Approach
Statistical approaches have been tried by many researchers for the development of GEC. The main reason for using this approach was the availability of digital data on the internet for training. Researchers used this digital text to train their systems. Most statistical approaches are probability based, in which various types of probabilities (e.g. transition, emission, n-gram, etc.) are calculated from sequences of POS (part-of-speech) tags. The POS sequence of the input text is evaluated against these probabilities, and if it falls below some threshold value, the input sentence is considered incorrect; otherwise it is considered correct. The larger the annotated corpus, the higher the accuracy of the system. Further, the annotated corpus should be versatile, i.e. it should cover as many different domains as possible. This approach also has some pitfalls: due to its statistical nature, it sometimes produces unpredictable results, and it becomes difficult for the user to interpret them. Its main advantage is that it can be applied to any natural language without knowledge of the syntax of that language. One of the first researchers to use this approach was Atwell, Eric Steven. A statistical machine translation (SMT) system was used by (Brockett et al., 2006) to correct a set of 14 countable/uncountable noun errors made by learners of the English language. Experiments show that their SMT system was generally able to beat the standard Microsoft Word 2003 grammar checker, although it produced a relatively higher rate of erroneous corrections. Further, (Mizumoto, Komachi, Masaaki, Ntt, & Matsumoto, 2011) used this SMT-based approach to develop a Japanese language error detection system. The effect of training corpus size on various types of grammatical errors in English was studied by (Mizumoto, Hayashibe, Komachi, Nagata, & Matsumoto, 2012), who concluded that a phrase-based SMT system is effective at correcting errors that can be identified from local context, but less effective at correcting errors that need long-range contextual information. A POS-factored SMT system was trained by (Yuan & Kingdom, 2013) to correct five types of grammatical errors (articles, prepositions, noun number, verb form, and subject-verb agreement). A combination of a rule-based system and a phrase-based SMT system was proposed by (Felice, 2014). A hybrid approach combining MT and classifier models was developed by (Susanto, 2014). Another experiment to develop GEC was done by (Grundkiewicz, 2014), employing word-level Levenshtein distance between source and target as a translation model feature. Further in this field, the effect of F-score tuning on precision was studied by (Kunchukuttan, Chaudhury, & Bhattacharyya, 2014), who concluded that this will reduce the performance of the GEC. More recently, (Napoles & Callison-Burch, 2018) proposed a lightweight approach to GEC called Specialized Machine translation for Error Correction (SMEC), which represents a single model that handles morphological changes, spelling corrections, and phrasal substitutions. Further, (Hermet, Edward, & Désilets, 2009) handled the task of detecting preposition errors by generating a round-trip translation via French, and their model identified 66.4% of errors. An all-errors task using round-trip translations obtained from the Google Translate API via eight different pivot languages was attempted by (Nitin, Tetreault, & Chodorow, 2012).

Proposed Architecture
In this research, the authors experimented with a stochastic approach to develop a GEC system for the Punjabi language. The work was completed in two phases.
In the first phase, stochastic probabilities are calculated using the ILCI annotated corpus, and in the second phase, the grammar of the input sentence is checked and corrected using the stochastic probabilities calculated in the first phase. The first phase consists of a single module (tag sequence probability calculation), while the second phase uses three modules (preprocessing, pattern-matching-based error detection, and grammatical error correction). Figure 3 shows the architecture of the proposed GEC system. As shown in Figure 3, there are basically two components of the GEC system.
In the first component, the probabilities of unique tag sequences are calculated, and in the second component, using these unique tag probabilities, errors are detected in the input text. After detection of an error, the input text is rectified as per grammar agreement rules. Further details of the various components of the proposed architecture are explained in Sections 4.1 to 4.4.

Annotated corpus used to calculate stochastic probabilities
As discussed above, the task of the first phase is to calculate the probabilities of tag sequences. In order to calculate these tag sequence probabilities, an annotated corpus of the Punjabi language is required. The Punjabi annotated corpus used for this task was taken from the Indian Languages Corpora Initiative (ILCI). This corpus includes data from various domains like sports news, agriculture, entertainment, tourism, and health. A total of 2,64,474 (i.e. 264,474) sentences were taken to calculate the tag sequence probabilities. Further details of the annotated corpus are shown in Table 1.

Phase 1 (Tag sequence probability calculation)
This is the first phase of the research work, and it is completed in three steps (word splitting, tag sequence extraction, and unique tag sequence probability calculation). In word splitting, the annotated training corpus (ILCI) is split into a list of tokens (individual words along with their tags), and these tokens are stored in an array. Thereafter, each individual token is processed to extract its tag. After extracting the tags from the tokens, the extracted tags are arranged to form tagged patterns. After generating these tag sequences, bigrams of each tag sequence are generated. After generating bigrams from all the tag sequences, the probability of each unique bigram tag sequence is calculated using the following formula for the bigram probability of a tag_i and tag_j pair:

P(tag_j | tag_i) = Count(tag_i, tag_j) / Count(tag_i)
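As a minimal sketch of this phase (not the authors' implementation), the Python fragment below computes the bigram tag probabilities, assuming the corpus has already been reduced to per-sentence tag sequences; the tag names and toy data are invented for illustration.

```python
from collections import Counter

def bigram_tag_probabilities(tag_sequences):
    """Compute P(tag_j | tag_i) from a list of per-sentence tag sequences."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for tags in tag_sequences:
        unigram_counts.update(tags)
        for t_i, t_j in zip(tags, tags[1:]):
            bigram_counts[(t_i, t_j)] += 1
    # P(tag_j | tag_i) = Count(tag_i, tag_j) / Count(tag_i)
    return {(t_i, t_j): n / unigram_counts[t_i]
            for (t_i, t_j), n in bigram_counts.items()}

# Toy example with made-up tag sequences (not ILCI data):
probs = bigram_tag_probabilities([["NN", "JJ", "VB"], ["NN", "VB"], ["PNP", "VB"]])
print(probs[("NN", "VB")])  # 1 occurrence of (NN, VB) / 2 occurrences of NN = 0.5
```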

Some sample entries of bigram probabilities are shown in Table 2.

Phase 2 (Grammatical error detection and correction)
In this phase, the input sentence entered by the user is checked against the bigram probabilities calculated in Phase 1. Three modules are used in this phase. The first module is preprocessing, in which the input text is split at the sentence level, then at the phrase level, and finally at the word level, until individual words are obtained as final tokens. If the input text is in the form of a paragraph, the system first splits the paragraph into sentences and then splits these sentences into tokens. In order to split the input text into tokens, special symbols are used as identifiers in addition to the tab and space characters. These special symbols include punctuation marks like the comma (,), colon (:), question mark (?), semicolon (;), and exclamation mark (!). After splitting, labeling is done. In labeling, each individual token separated in the tokenization step is assigned a label as per morphology rules, i.e. each token is tagged with its appropriate morphology-based POS tag from the tagger dictionary. If a token is not present in the tagger dictionary, it is labeled as "Unknown". The second module used in this phase is pattern-matching-based error detection. In this module, various errors related to agreement mismatches in an input sentence are detected. The agreement errors detected by the system include subject-verb, object-verb, modifier-noun, and adverb-verb agreement errors. To identify these types of errors, the probability of the tag pattern of the input text is calculated.
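As a rough illustration of these two modules (a sketch, not the authors' implementation), the fragment below tokenizes a sentence on whitespace and the punctuation identifiers listed above, labels each token from a tagger dictionary, and flags adjacent tag pairs whose bigram probability falls below a threshold. The dictionary contents, the tag names, and the threshold value are assumptions; the actual system uses morphology-based tagging rather than a plain word-to-tag lookup.

```python
import re

def detect_suspect_pairs(sentence, tagger_dict, bigram_probs, threshold=0.01):
    """Tokenize, label each token with its POS tag, and flag adjacent
    tag pairs whose bigram probability falls below the threshold."""
    # Split on whitespace plus the punctuation identifiers named above.
    tokens = [t for t in re.split(r"[\s,:;?!]+", sentence) if t]
    # Labeling: words absent from the tagger dictionary get "Unknown".
    tags = [tagger_dict.get(tok, "Unknown") for tok in tokens]
    flagged = []
    for k in range(len(tags) - 1):
        if bigram_probs.get((tags[k], tags[k + 1]), 0.0) < threshold:
            flagged.append((tokens[k], tokens[k + 1], tags[k], tags[k + 1]))
    return flagged
```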

Algorithm Used:
The algorithm scans the input sentence from left to right and identifies the agreement between a noun and an adjective, a noun and a determiner, or a noun and a number. If there is a mismatch in the agreement, then an error message is displayed. Subject-verb agreement is checked according to the number, gender, and person values in the tags of the subject and the verb. The subject of the input sentence is identified from the tagged sentence as the token carrying an NN or PNP tag. Similarly, modifier-noun and noun-adjective agreement errors are identified.
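Because the algorithm is described above only in prose, the following sketch restates it under stated assumptions: each token is taken to carry number, gender, and person attributes decoded from its tag, and the modifier tags JJ, DT, and CD are stand-ins for the actual Punjabi tagset.

```python
def check_agreement(tagged_tokens):
    """Left-to-right scan for modifier-noun and subject-verb mismatches.

    tagged_tokens: list of dicts such as
        {"word": ..., "tag": "NN", "number": "sg", "gender": "m", "person": 3}
    The attribute fields are assumed to be decoded from tags like NNMSD.
    """
    errors = []
    subject = None
    for i, tok in enumerate(tagged_tokens):
        if tok["tag"] in ("NN", "PNP"):
            subject = tok  # the noun/pronoun acts as the subject
            prev = tagged_tokens[i - 1] if i > 0 else None
            # Modifier-noun agreement: adjective/determiner/number before a noun.
            if prev and prev["tag"] in ("JJ", "DT", "CD"):
                if (prev["number"], prev["gender"]) != (tok["number"], tok["gender"]):
                    errors.append(
                        f"modifier-noun mismatch: {prev['word']} / {tok['word']}")
        elif tok["tag"].startswith("VB") and subject is not None:
            # Subject-verb agreement on number, gender, and person.
            for attr in ("number", "gender", "person"):
                if tok[attr] != subject[attr]:
                    errors.append(f"subject-verb mismatch on {attr}: "
                                  f"{subject['word']} / {tok['word']}")
    return errors
```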

Correction of error
After the detection of an error, the last step of this second phase is the correction of the detected error. This is the most crucial step and needs an additional database, i.e. the morphological database (morph). Correction is done on the basis of the mismatched component of the tag. This is explained by the following example.

After applying grammatical information (POS tagging), the error detection system reports the following error: from the POS tags, it is clear that CDPD is plural while NNMSD is singular. Therefore, to correct this error, the word ਮੁੰਡਾ needs to be converted into its plural form.
Here the role of the morph database comes into play. According to morph, the plural form of ਮੁੰਡਾ is ਮੁੰਡੇ; hence, to make the sentence grammatically correct, the word ਮੁੰਡਾ should be replaced with the word ਮੁੰਡੇ.
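A minimal sketch of this replacement step, assuming the morph database can be queried as a lookup from a word and a target attribute to its inflected form (the table below holds only the single example from the text):

```python
# Hypothetical morph lookup: (word, target number) -> inflected form.
MORPH = {("ਮੁੰਡਾ", "plural"): "ਮੁੰਡੇ"}

def inflect_number(word, target_number):
    """Return the word inflected to the target number, or unchanged if unknown."""
    return MORPH.get((word, target_number), word)

# CDPD (plural determiner) vs NNMSD (singular noun) mismatch:
# the noun is converted to plural to satisfy the agreement.
print(inflect_number("ਮੁੰਡਾ", "plural"))  # -> ਮੁੰਡੇ
```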

Types of Error Covered
When we talk about grammatical mistakes in written text, there may be countless errors in the text. Thus, it is very difficult to develop a single GEC system that could detect and correct all possible errors present in written text. In this research, the authors covered five types of errors. These errors, with suitable examples, are shown in Table 3.

Evaluation Metrics
To measure the performance of the developed system, three standard metrics are used: precision, recall, and F-measure. These metrics are defined as follows. Let FCE = number of flagged correct grammar errors, FWE = number of flagged wrong grammar errors, and NFE = number of non-flagged grammar errors. Precision is the percentage of flagged results that are relevant:

Precision = FCE / (FCE + FWE)

Recall is the percentage of the total relevant errors that are correctly flagged by the algorithm:

Recall = FCE / (FCE + NFE)

F-measure denotes the overall accuracy of the system and is calculated as the harmonic mean of precision and recall:

F-measure = (2 × Precision × Recall) / (Precision + Recall)
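With FCE, FWE, and NFE as defined above, these metrics reduce to a few lines of code; the snippet below is a direct transcription of the formulas, using the reported precision and recall values as a sanity check.

```python
def precision(fce, fwe):
    return fce / (fce + fwe)

def recall(fce, nfe):
    return fce / (fce + nfe)

def f_measure(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

# Check: with the reported precision (0.82) and recall (0.89),
# the F-measure comes out to the reported 0.85.
print(round(f_measure(0.82, 0.89), 2))  # 0.85
```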

Test Result and Discussion
The developed GEC system was manually tested on 600 sentences, using a mixture of correct and incorrect Punjabi sentences, and the outputs of the test runs were recorded manually. To test the system, 410 grammatically correct and 190 grammatically incorrect Punjabi sentences were taken. Of the 410 correct sentences, 210 were taken from reliable internet sources, i.e. e-papers, and 200 were taken from the standard Punjabi corpus available at ILCI. To perform the testing, the mixture of correct and incorrect sentences was distributed into four sets containing 150 sentences each. These four sets are labeled test_set1, test_set2, test_set3, and test_set4. The complete details of the corpus used for testing are shown in Table 4. The output of the system was manually evaluated by a linguist.

Comparison with existing Punjabi grammar checker
The rule-based grammar checker for the Punjabi language (Gill, 2008) identifies grammatical errors in Punjabi text such as modifier-noun agreement, subject-verb agreement, noun-adjective agreement, the order of modifiers of a noun in a noun phrase, the order of verbs in a verb phrase, and the like. To detect the errors, the system passes through a few phases: initially, preprocessing is performed on the input text, consisting of tokenization, morphological analysis, rule-based part-of-speech tagging, and chunking; finally, the grammatical error checking rules are applied. Grammatical errors internal to phrases and sentences are identified and corrections suggested. The evaluation of this grammar checker shows a precision of 76.79%, recall of 87.08%, and F-measure of 81.61%. The researchers stated that the system generated some false alarms for complex and compound sentences.

Conclusion and future scope
In this research article, the authors developed a statistics-based Punjabi grammar checker that uses pattern matching along with n-gram probabilities for the detection of errors, and class agreement rules for the correction of errors. On testing the system on a dataset of 600 sentences, the system shows a precision of 0.82, recall of 0.89, and F-measure of 0.85. The test data used contains 410 correct sentences and 190 incorrect sentences. The incorrect sentences were manually generated by incorporating the types of errors for which this system has been designed. The grammar checker mainly checks four types of errors: errors related to subject-verb agreement in terms of number and gender, modifier-noun agreement in terms of number and gender, the use of KE after the oblique case, and the order of modifiers. In the future, this system can be extended to long Punjabi sentences, such as compound and complex sentences, and to other types of errors, like order in the verb phrase, errors related to contractions, long-range dependencies, etc.

Figure 2. Various variants of statistical techniques used for grammar checking

Figure 3. Proposed architecture of GEC

Figure 4. Test results of the proposed GEC


Table 1. Details of the annotated corpus used for calculating tag sequence probabilities

Table 4. Details of the corpus used for testing the proposed GEC

The developed system is tested on the data mentioned in Table 4, and the analysis of the results obtained is shown in Table 5 and Figure 4. It is clear from Table 5 that the developed system shows an average precision of 0.82, an average recall of 0.89, and an average F-measure of 0.85.