Research on business English grammar detection system based on LSTM model

: In order to solve the problems that the current English grammar correction algorithms are not e ﬀ ective, the error correction ability is limited, and the error correction accuracy needs to be improved, this study proposes an automatic grammar correction method for business English writing based on two-way long short-term memory (LSTM) and N-gram. First, this study considers article and preposition errors as a special sequence labeling task, and proposes a Grammar error checking (GEC) method for sequence labeling based on bidirectional LSTM. During training, english as a second language (ESL) corpus and supplementary corpus are used to label speci ﬁ c articles or prepositions. Second, for noun simple-plural errors, verb form errors, and subject-verb inconsistency errors, a large number of news corpora are used to count the frequency of N-gram, and a GEC method based on ESL and news corpora N-gram voting strategy is proposed. Experimental results show that the overall F 1 value of the method designed in this study on the GEC data of CoNLL2013 is 33.87%, which is higher than the F 1 value of UIUC. The F 1 value of article error correction is 38.05%, and the F 1 value of preposition error correction is 28.89%. It is proved that this method can e ﬀ ectively improve the accuracy of grammar error correction and solve the gradient explosion problem of traditional error correction model, which is of great signi ﬁ cance to further strengthen the practicality of automatic grammar error correction technology.


Introduction
Grammar error checking (GEC) is a very important and long-standing problem in the field of computer-aided linguistics, and it is also an important problem that must be overcome in natural language processing.It uses the dependency structure and coherence between the components to verify, repair, and correct the grammar.For second language acquisition, students not only need to quickly identify and correct the wrong use of vocabulary, but also to ensure that the grammatical structure is correct.Only by conforming to the rules of language structure can our wording and expression be more precise and persuasive.The use of GEC AIDS can effectively improve the language skills of english as a second language (ESL) learners.This tool can provide them with grammar, spelling corrections, and suggestions, in order to strengthen their listening, speaking, reading, writing, and communication skills, so that grammar is no longer a barrier in ESL learning.
In recent years, GEC has become a research hotspot in the field of natural language processing.After Ng et al. opened the CoNLL-2013 and CoNLL-2014 shared tasks, along with the continuous improvement of machine learning algorithms and models, the research related to grammatical error correction in the English department has been greatly promoted.Wang et al. proposed a labeling model based on recurrent neural networks to address the impact of grammatical errors on sequence information extraction in English writing corpora, which was verified by experiments and achieved a high accuracy [1].Chen et al. used multi-layer forward feedback neural network to build a language model, complete English grammar correction, and designed the overall architecture of the system, which achieved high accuracy and fast error correction speed [2].Zhou and Liu described in their study that they modify and adjust for grammatical errors by applying algorithms similar to human language processing neural networks [3], which opens a new way for us to solve such problems more effectively in the future.Wang and Gu developed a multilayer perceptron-based algorithm for automatically correcting English writing grammar [4].The algorithm takes the feature set as the input of the multi-layer perceptron, and completes grammar correction by setting penalty parameters and deviation parameters, and improves the timeliness and recall rate of English grammar correction.Solyman et al. proposed a GEC model based on seq2seq Transformer [5], which uses expectation maximization routing algorithm to dynamically synthesize cross-layer information in Arabic GEC, and introduces bidirectional regularization terms using Kullback-Leibler divergence to improve the accuracy of syntax error correction.
Although the accuracy and recall rate of the above models have been significantly improved, there is still a long way to go before intelligent grammar error correction is fully realized.At the same time, the error correction task requires supervised corpus for model training, which often leads to insufficient corpus for text error correction in the training process [6].N-gram model solves this problem well, and can use a large number of unsupervised corpus to train statistical language models.N-gram is the simplest and most commonly used model, but the model does not consider the semantic information of the context when replacing the characters inside the text using the puzzle set, which often results in the situation that although the adjacent words are correct, they are not logical in the whole sentence, resulting in a low score [7].With the gradual rise of machine learning, more and more scholars apply deep learning methods to error correction tasks [8].The artificial neural network of long short-term memory (LSTM) solves the problem of gradient disappearance and long-term dependence of information by setting input gate, output gate, forgetting gate, etc.The effect is more prominent in processing sequence problems.LSTM artificial neural network model can learn the deep information of text, strengthen the understanding of text semantics, deepen the learning of text structure, and have a better help to obtain the recommended modification options [9].
Therefore, this study proposes an automatic grammar correction method for business English writing based on two-way LSTM and N-gram.Aiming at the errors of articles and prepositions in business English writing, this study treats them as sequential labeling problems and uses bidirectional LSTM to train the model.In this study, an automatic English grammar correction method based on N-gram voting strategy is proposed for noun singular and plural errors, verb form errors, and subject-verb inconsistency errors.
The main contributions of this study are as follows: (1) In the N-gram method of this study, a large number of news corpus are used to perform N-gram frequency statistics, so as to identify and correct subject-predicate errors of nouns and verbs; (2) In the bidirectional LSTM model, manual error generation is carried out on the corpus without grammatical errors to balance the differences between the corpus and supplement the corpus for model training.The innovations of this study are as follows: the traditional GEC method uses N-gram or rule-based method to correct syntax errors, and only uses context information with fixed window size for correction, which is not enough.Moreover, when the window size becomes larger, it is difficult to train the model.In this study, the bidirectional LSTM network model can learn the long-term dependent information that determines the use of prepositions or articles, and can avoid the problems such as gradient disappearance that may occur in traditional recurrent neural networks.
The automatic detection of business English grammar can bring better development opportunities for artificial intelligence, enhance its knowledge and skills of natural language analysis, and improve its semantic search ability.In addition, the results of this study have some enlightening significance for foreign language teaching and native language teaching.Through this research, our results can help students to have a deeper understanding of the meaning of English words, reduce the incidence of grammar errors, and enhance their confidence and motivation in learning, and finally achieve a multiplier effect in learning with half the effort.

Related work 2.1 Word segmentation technology
After dividing a paragraph into single sentences, we also need to extract the words from a sentence, which is the smallest unit of language study.Therefore, word segmentation is a very critical step, which is related to the problem analysis of the whole sentence, and is an important step in text processing.
English word segmentation is relatively simple compared with Chinese, because English itself has more obvious characteristics, there are spaces or other symbols between the words.However, English word segmentation is also difficult, and there are many ambiguous scenes.For example, abbreviations of words: it's and couldn't, it's can be used as a word or split into it is, couldn't can be split into could not or as a separate word.Similarly, the representation of the number can also affect the segmentation, for example, 222333 can be written as 222,333 or 222.333, and the use of different punctuation marks to represent the number causes some ambiguity.There are also certain words, such as New York and Los Angeles, which should be treated as a complete word in word segmentation, and should not be separated to facilitate part of speech tagging [10,11].
At present, the commonly used word segmentation methods mainly include the following types: rule model, which defines a series of word segmentation rules combined with regular expressions to divide words in sentences.The other is the word segmentation technology based on statistical probability model.Based on the corpus and some commonly used language models such as N-gram and Hidden Markov model, the segmentation results are compared and analyzed to select the most reasonable way of word segmentation [12].

Part-of-speech tagging
Part-of-speech tagging is a common technique in natural language processing.Part-of-speech tagging is to label parts of speech for each word unit on the basis of word segmentation, such as verbs, nouns, adjectives, adverbs, and so on.The grammatical correctness of a sentence is closely related to the part-of-speech of a word.Some grammatical errors are caused by the wrong tense of a word.The biggest difficulty in part-ofspeech tagging is to solve various ambiguity problems.Words and parts of speech are not all one-to-one correspondences, and some words have different parts of speech in different scenes.For example, here are two examples: "I bought a book yesterday.""I have booked a seat at the restaurant." The book in the first sentence is the noun "book" and the verb "book" in the second sentence.It is not reasonable to distinguish the parts of speech only according to morphology, and the general processing method will label the parts of speech according to the context.The above example is only a typical example of ambiguity in part-of-speech tagging.Brown corpus is one of the most famous corpora in natural language processing.According to the statistics of part-of-speech tagging results, there are 4,100 words with multiple tagging and 35,340 words with only one tagging, which shows that ambiguity in English part-of-speech tagging is very common.The important task of part-of-speech tagging is to deal with these ambiguous scenarios.
There are two types of part-of-speech tagging: regular model and statistical model.Most of the early adopters were rule models, in which linguists defined the part-of-speech candidate set and formulated a series of part-of-speech tagging rules according to their understanding and familiarity with the language [13].This method relies heavily on the linguist's knowledge of the language, and English is a very complex language, and thousands of rules may be formulated for some large corpora.The work is cumbersome, consumes a lot of manpower and material resources, and after too many rules, it is difficult to ensure correctness and consistency, and there may be consistency and inconsistency.The rule model has to do special processing for each language, and the work is extremely tedious.Therefore, the annotation method of the rule model is rarely used at present, at most as an auxiliary method.
Compared with the rule model, statistical method is much simpler.The development of statistical model is closely related to the construction and improvement of corpus [14].According to the corpus and related algorithms, a tagger is first trained and learned, and then annotated.Although there is also the problem of manual initial labeling of data, the labeling here is much easier than making rules, and inconsistency before and after labeling is allowed because probability is used.Commonly used methods include Viterbi algorithm, CLAWS algorithm, and improved VOLSUNGA parts-of-speech tagging algorithm [15].They all use Markov hypothesis principle, and apply dynamic programming technology to reduce the complexity of the algorithm and improve the speed of annotation.On the whole, the statistical model has lower requirements on the original labeled data, and the labeling process is simple, which has greater advantages than the regular model.

Syntax error correction algorithm
Syntax analysis is a common technique in natural language processing, and open source tools can perform syntax analysis and build corresponding syntax trees.Generally speaking, if the construction of a sentence's grammar tree fails, then the sentence should exist.Finally, according to the voting importance results, the grammar corrected results were obtained.Of course, syntactic analysis also has limitations and shortcomings [16].The current syntactic analysis ability is limited, and the analysis ability of complex clauses is limited.For some special sentences, it is necessary to add parsing rules.One of the most important problems with syntaxbased language is to establish a complete grammatical system for the language, which requires a lot of work by language experts.At present, the theories of grammar system construction by language experts are not consistent, and there are great controversies on some difficult points.This also leads to different results in the syntactic analysis of some sentences, and there is ambiguity.
The rule-based approach is to build the rule base first, and then match the input word sequence according to the rules.If the sentence has a syntax error, it will not match the rules in the rule base.Some rule bases are also built based on the parts of speech sequence of words, but the basic principle is no different from the word sequence principle.The word check feature of Microsoft Word uses the rule base method and can check multiple languages.And its error detection function is more powerful, for common word spelling errors, phrase errors and conjunction errors can be recognized, and can give suggestions for modification.Rule based syntax error correction is very convenient to implement, search and match according to the rules on the line, from the perspective of system implementation is simple and fast.In CoNLL2014, many papers mentioned the use of rules for grammar error correction, which is more used as an auxiliary function [17].The biggest limitation of rule grammar correction is that rules are limited, grammar errors are complicated, English grammar errors are difficult to count, and with the expansion of the rule base, the cost of adding rules is getting higher and higher.It becomes more difficult to conclude new rules, there may be conflicts between rules, and the proliferation of rules cannot be stopped.

LSTM model
When tagging sequence data, the tagging information of the current word generally depends on the context information.Traditional sequence labeling methods rely on statistical or fusion methods, while recurrent neural networks can extract sequence information well by establishing sequence relationships of hidden layers.By setting threshold units and cells, LSTM can effectively avoid the problems such as gradient disappearance and gradient explosion that may occur in the training of traditional recurrent neural networks [18].
Compared with one-way LSTM model, which only accumulates the information before the current moment, two-way LSTM can accumulate the context information of the current moment, so that the model can synthesize the context information for sequence annotation.Bidirectional LSTM processes data from both directions by using two separate hidden layers, both of which are then fed forward to the same output layer.The whole network structure is shown in Figure 1.
Through time T to 1 and time 1 to T, the bidirectional LSTM network uses the formulas (1)-( 3) and the input vector sequence , and network output layer, respectively.
where W represents the weight matrix; b is the bias vector; and f is the hidden layer function, specifi- cally 3 GEC method based on N-gram and bidirectional LSTM

N-gram search service and knowledge base
The syntax error correction strategy is based on N-gram frequency statistics, so it is necessary to establish Ngram search service first.The N-gram sources used were all news for 2020 from about 12,000 news sites, with statistics shown in Table 1.In order to improve its query efficiency, the open source search engine is used to establish inverted index and provide search services.
Both articles and prepositions have limited variations and are in closed sets.Build a finite confusion set for articles and prepositions.The article confusion set contains three cases: the, a/an, and null.Null indicates that no article is used.The preposition confusion set contains the common 17 prepositions: on, about, into, with, as, at, by, or, from, in, of, over, to, among, between, under, and within.Unlike articles and prepositions, nouns and verbs are open sets.Therefore, the variation tables are established for noun errors, verb forms, and subjectverb inconsistency errors, respectively.

Moving windows and N-gram voting strategies
The GEC method for confounding set as open set is based on moving window and N-gram voting strategy.
The moving window is defined as follows: where ω i is the ith word in the sentence, k represents the window size, and j is the distance between the first word in the window and ω i .The selection of the window size k and the value range of j directly affect the effect of GEC.For different error types, different k and j values are selected.
The N-gram voting strategy simulates a real-life voting mechanism, with an N-gram fragment containing a syntactically incorrect candidate representing a candidate who may have the right to vote.Due to the limited corpus, the frequency of N-gram fragments may be very sparse.This policy sets a minimum effective frequency.Only when the queried frequency is higher than the minimum effective frequency, the N-gram fragment has the voting right.
In real life, the importance of votes cast by different people in different fields is different, and the importance of votes cast by experts in fields is higher than that of ordinary people.This strategy uses Ngram fragment length to simulate the degree of expertise of the domain expert, and the longer the N-gram, the higher the importance of the vote.Finally, according to the voting importance results, the grammar corrected results were obtained.The specific algorithm is as follows: Input This algorithm is based on the corpus, because on the one hand, the corpus size is limited and it is impossible to contain all fragments, and on the other hand, there are noise data in the corpus.Therefore, the minimum effective frequency Fset is set.Only when the frequency of the queried N-gram fragment is greater than this frequency, can it be shown that the corpus contains relevant corpus and this N-gram fragment has voting rights.According to the experimental comparison, Fset is set to 100.
The frequency of the N-gram segment with voting rights represents the probability of changing to the corresponding result.Suppose that the frequency of "have an apple" is 2, the frequency of "have the apple" is 1, and the frequency of "have apple" is 1.Then, according to the corpus, the probability of changing to "have an apple" is greater than "have the apple" and "have apple."The voting strategy is to select one as the voting object from "have an apple," "have the apple," and "have apple."This strategy selects the one with high probability as the voting object.

Annotation correction strategy based on bidirectional LSTM
Considering article and preposition errors as a special sequence labeling task, this study proposes a GEC method for sequence labeling based on LSTM.First, the training corpus with part-of-speech tagging is preprocessed, and the article part of speech is replaced by a special mark "ART" and the prepositional part of speech is replaced by a special mark "TO," and the above marks are exchanged with the positions of the articles or prepositions.Then, the model is trained using LSTM.Finally, the trained model is used to test the data.The labeling model based on LSTM is shown in Figure 2.
The model first converts the word into a word vector as input to the model, entering the word vector at the corresponding position in the sequence at each moment.During the training process, word vectors are updated as parameters.The model uses the word vector as input to the LSTM unit and, at each moment, outputs the corresponding annotation vector.The tagging set is the union of all parts of speech sets and all confounding sets of prepositions or articles.The dimension of the annotation vector is consistent with the total number of annotation sets, and the annotation with the highest selection probability is marked by SoftMax.
The labeling model relies on back propagation through time algorithm and uses stochastic gradient descent for supervised training.
In this study, the grammatical errors with fixed confusion sets, namely, articles and prepositions, are marked and corrected.Before labeling, the article or preposition in the sequence is represented by a uniform identifier.When marking, the uniform identifier is marked with specific prepositions or articles to realize the correction of article or preposition errors in grammar.

Verb and subject-verb inconsistency error correction
The verb error correction module mainly corrects the verb form misuse and subject-verb inconsistency.This module relies on verb form conjugation table, verb single and plural conjugation table and N-gram voting strategy.The specific correction process is as follows: (1) The parts of speech sequence is obtained by marking the parts of speech of the words in the sentence.For verb form errors, the words labeled as VB, VBD, VBG, and VBN are extracted as error candidates.For subject-verb inconsistency errors, the words labeled VBP and VBZ are extracted as candidates for errors.(2) The correction candidate set of the error candidate is obtained according to the error candidate and verb form change table/verb single and plural change table .(3) For the correction candidate, the collection of N-grams fragments is obtained using a moving window with a size of 3-5.Use the N-gram voting strategy to get the correct candidate with the highest number of votes and replace it in the original sentence.
The GEC method system architecture based on LSTM and N-gram is shown in Figure 3.

Experimental data
The experimental data came from the GEC evaluation task of CoNLL2013, and the statistical results are shown in Table 2.The CoNLL2013 corpus does not have correct parts-of-speech tagging, and the CoNLL2013 training corpus is smaller than PIGAI parts-of-speech tagging corpus and Brown corpus.Therefore, when the PIGAI parts-of-speech tagging corpus, Brown corpus, and labeled CoNLL corpus were used to expand the LSTM training corpus tagging, the Stanford tagging tool was used for parts-of-speech tagging.Among them, CoNLL2013 and PIGAI corpus are used as ESL corpus, and Brown corpus is used as supplementary news corpus to participate in the training of the model.Research on business English grammar detection system based on LSTM model  9 The GEC evaluation task data of CoNLL2013 marked a variety of error types, but the evaluation task mainly targeted at the five error types with a relatively high proportion: article error, preposition error, noun error, subject-verb agreement, and verb form error.

Evaluation method
The evaluation standard of CoNLL2013 is F 1 , which is defined as follows: where P and R represent the accuracy rate and recall rate, respectively, defined as follows: where N correct refers to the correct number of syntax errors that the system modifies, N predicted refers to the number of syntax errors that the system modifies, and N target refers to the number of errors in the corpus itself.

Experimental results and analysis
The experimental results of the GEC method based on LSTM and N-gram on the GEC evaluation data of CoNLL2013 are shown in Tables 3-5, and are compared with the corp-based methods of GEC and correction of English articles and UIUC methods.As shown in Table 3, for the correction of article errors, the F 1 value of the proposed method is 5% higher than that of the UIUC method and 5% higher than that of the Corpus GEC method.For the correction of preposition errors, F 1 value of the proposed method is 21% higher than that of UIUC method and 13% higher than Corpus GEC method, indicating that the LSTM-based GEC method is effective for article and preposition syntax error correction tasks.This is because the word vector contains the main rich context information, and using LSTM better learns the long-term dependency information that determines the use of articles or prepositions, so the results are better.
As shown in Table 4, when only N-gram + vote voting strategy is used to correct noun and verb errors, there is still a certain gap between F 1 value and UIUC method.This is because the news corpus on which the Ngram + vote strategy is based is different from the article that needs to be corrected, and a large number of correct sentences are changed into incorrect ones.The table of noun and verb conjugations cannot cover all the conjugations of nouns and verbs, which leads to certain limitations in the correction of nouns and verbs.
As shown in Table 5, for the correction of all five types of errors, the proposed method is superior to the UIUC method.The total F 1 value in the GEC data of CoNLL in 2013 is 33.87%, exceeding the total F 1 value of UIUC by 31.20%.The experimental results show that the automatic correction method based on LSTM and N-gram is effective.
In order to verify the accuracy of the automatic grammar correction system based on bidirectional LSTM and N-gram, the RNN based Sep2Sep model, LSTM-based Sep2Sep model, CNN-based Sep2Sep model, nested attention neural model, and deep context model are further introduced for comparative analysis.The syntax error correction effects of the six error correction models are shown in Table 6.Research on business English grammar detection system based on LSTM model  11 (2) Service access module The service access module consists of five parts: error correction service, information feedback mechanism, user operation and management, text processing and management, and system upgrade and update.The error correction service accepts error correction requests from users, uses specific algorithms and language processing techniques to correct syntax errors, and then sends the corrected version back.An information feedback mechanism that accepts the user's comments on the content of the web page and applies filters to select the most suitable solutions from the suggestions returned by the server side, and then pushes them to the user.User operation and management include identity authentication, category differentiation, and permission verification.During identity authentication, you need to check whether the user name and key are correct.Implement the role management mechanism according to the user's change suggestion action, only senior members and VIP customers are allowed to enjoy this function control.Text management allows you to store, modify, and export suggested texts in batches.The recommended processing times are recorded and a counter is set in the system update.Once accumulated to a predefined upper limit, it causes the syntax repair algorithm to be retrained to improve the self-learning ability of the whole system.
(3) Feedback filter module This functional component mainly undertakes the implementation of feedback filtering and its service, including three key steps: corpus processing, model training, and feedback classification.After using the collected language samples for lexical analysis and grammatical decomposition, the expressiveness of the language model is optimized by controlling the sentence structure.Finally, the trained model is evaluated to obtain its performance data.The Thrift service interface provides feedback and filters out proposals that do not meet the requirements, and returns the accepted or rejected responses to the user.

Conclusion
To sum up, the automatic grammar correction system built in this study based on two-way LSTM and N-gram is feasible and effective, which can effectively improve the accuracy of English writing grammar error detection and enhance the effect of English grammar error correction.Experimental verification results show that the F 1 value, accuracy rate, and recall rate of the proposed method are higher than that of the traditional UIUC method and Corpus GEC method, and the grammar error correction of articles and prepositions is better.At the same time, compared with other models, the accuracy, recall rate, and F 1 value of this model are significantly improved.However, the effectiveness of the method proposed in this study is still to be improved in correcting the errors of nouns and verbs.In the next step, we will further improve the rationality of data construction, so that the constructed error samples are more consistent with the actual grammatical errors made by business English students.In addition, we will improve the structure of multi-task learning of linguistic features to further improve the GED task detection effect.

Table 1 :
N-gram details

Table 2 :
GEC evaluation task data statistics of CoNLL2013

Table 3 :
Results of article and preposition errors correction

Table 4 :
Results of noun and verb errors correction

Table 5 :
Results of correcting all types of errors

Table 6 :
Analysis of model syntax error correction effect