A Levenshtein distance-based method for word segmentation in corpus augmentation of geoscience texts

ABSTRACT For geoscience texts, rich domain corpora have become the basis for improving model performance in word segmentation. However, the lack of annotated domain-specific corpora has become a major obstacle to professional information mining in the geoscience fields. In this paper, we propose a corpus augmentation method based on Levenshtein distance. Using this technique, a geoscience dictionary of 20,137 words was constructed by crawling the keywords of published papers in the China National Knowledge Infrastructure (CNKI). The dictionary was then used as the main source of synonyms to enrich the geoscience corpus according to the Levenshtein distance between words. Finally, a Chinese word segmentation model combining BERT, a bidirectional gated recurrent unit network (Bi-GRU), and conditional random fields (CRF) was implemented. A geoscience corpus composed of complex, long, domain-specific vocabulary was selected to test the proposed word segmentation framework. CNN-LSTM, Bi-LSTM-CRF, and Bi-GRU-CRF models were also selected to evaluate the effect of the Levenshtein data augmentation technique. Experimental results show that the proposed method achieves a significant performance improvement of more than 10%. It has great potential for natural language processing tasks such as named entity recognition and relation extraction.


Introduction
With the development of Internet technology and the continuous enrichment of Internet resources, a great deal of text data has accumulated in the field of geoscience. In many geoscience texts, valuable rules and information can be mined by combining data analysis with geoscience knowledge, which enriches our understanding by comparing and linking related works (Zhu and Yang 2019). However, extracting useful information automatically and quickly from geoscience texts remains a great challenge (Qiu et al. 2018). In particular, the lack of delimiters between Chinese characters makes word segmentation and information extraction from Chinese geoscience documents more difficult. On the other hand, geoscience vocabulary has more complicated word-formation characteristics and distribution rules than general vocabulary. Therefore, Chinese word segmentation in geoscience is a key problem to be resolved in information mining from geoscience texts (Qiu et al. 2019).
Based on previous research, word segmentation methods for Chinese are mainly categorized into three types. The first category is dictionary-based word segmentation, also called mechanical word segmentation. In this method, the sentences to be segmented are matched against dictionaries prepared in advance according to fixed rules. If a word in the dictionary is successfully matched, the corresponding consecutive characters in the sentence are separated as one word (Qiu et al. 2019). According to the matching direction used when scanning strings, strings can be divided using the forward maximum matching algorithm (Lei et al. 2014), the backward maximum matching algorithm (Zhang et al. 2006), the bidirectional scanning algorithm (Gai et al. 2014), or the N-shortest-path algorithm (Ke et al. 2019). Zhao et al. (2018) designed a dictionary structure that improved the recognition rate of unknown words through a dictionary-loading function and obtained the best segmentation result using the bidirectional matching algorithm, thereby also improving ambiguous word recognition. However, these methods rely heavily on dictionaries and cannot handle new word detection well; moreover, Chinese is a complex and diverse language with a huge vocabulary, and its usage is flexible and changeable, resulting in a high cost for constructing domain dictionaries (Qiu et al. 2019).
The second category is statistical word segmentation. Such methods use segmented text as training data and judge whether adjacent characters form a word according to the frequency with which they appear together in the data (Huang, Sun, and Wang 2017). Generally speaking, the more often adjacent characters appear together in the corpus, the higher the probability that they form a word. Statistical word segmentation methods include the hidden Markov model, the maximum entropy model, and the conditional random field model (Lafferty, McCallum, and Fernando 2001). For example, Qiu et al. (2019) employed a conditional random field to capture dependencies between adjacent tags, which achieved good results in clinical named entity recognition. However, statistics-based domain word segmentation methods rely heavily on feature engineering, and the scarcity of domain-annotated corpora makes them difficult to implement.
The third category relies on neural networks, treating segmentation as a sequence labelling task and making full use of the powerful feature representation capabilities of neural networks (Vinotheni and LakshmanaPandian 2021; Ma et al. 2021). Owing to the excellent performance of recurrent neural networks and long short-term memory (LSTM) networks in sequence tagging tasks, various neural networks have been widely applied to Chinese word segmentation (Zhao et al. 2018). Many studies have explored the use of neural networks to automatically learn better representations (Al-Ayyoub et al. 2018; Qiu et al. 2020). For example, the max-margin tensor neural network was proposed to model the relationship between labels and characters (Huang, Sun, and Wang 2017). A gated recursive neural network (GRNN) was also exploited to model the combination of characters for word segmentation, and four different LSTM architectures were presented to test its effectiveness. Wang and Xu (2017) proposed a convolutional neural network combined with word embedding for Chinese word segmentation. Although different word segmentation methods are constantly being proposed, the performance of existing methods decreases significantly in the geoscience domain due to the limited availability of labelled corpora (Zhang et al. 2016).
Compared with the expensive, time-consuming, and complex task of text annotation, data augmentation is the main means of improving the accuracy of a word segmentation model (Haralabopoulos et al. 2021). Various text data augmentation techniques have been proposed for different natural language processing tasks (Wei and Zou 2019; Feng et al. 2021). Random insertion and deletion of arbitrary words is a common method. For example, Wei et al. (2021) proposed a curriculum data augmentation method that improves the performance of triplet networks by up to 3% on average by gradually increasing noise at each stage. Ding et al. (2020) first inserted the tag corresponding to a named entity into the sentences used to train the model, and then used the model to label unlabelled sequences for data augmentation. The second data augmentation method is the back-translation technique. For example, Sennrich, Haddow, and Birch (2015) used a monolingual corpus to construct pseudo-parallel sentence pairs for data augmentation, applying neural machine translation models trained on a small-scale corpus to translate the monolingual corpus. The pseudo-parallel sentence pairs and the real parallel sentence pairs were then combined for model training, which increased the Bilingual Evaluation Understudy (BLEU) score of the machine translation task by an average of 3%. Differing from Sennrich's idea, Fadaee, Bisazza, and Monz (2017) trained a language model on a large monolingual corpus and then used it to find positions in a sentence where high-frequency words could be replaced by low-frequency words, thereby increasing the frequency of low-frequency words in the training corpus; this improved translation quality by up to 2.9 BLEU points over the baseline and up to 3.2 BLEU points over back-translation. The third commonly used technique is lexical substitution or synonym replacement, which generates a new corpus by word replacement so as to expand the training set. Kobayashi (2018) used large-scale corpora to train a bidirectional language model and then randomly replaced words with words predicted by that model for data augmentation. Phreeraphattanakarn and Kijsirikul (2021) performed lexical substitution based on text similarity to augment text data specifically for the Thai language and effectively improved the performance of text classification. Pasunuru et al. (2021) took the title of a summary as the query, sent it to a BM25 search engine, and retrieved documents related to the subject words for data augmentation. These studies show that data augmentation has achieved good performance in different natural language tasks. Vector position shift and neural-based generation methods have also been proposed to reduce class imbalance and maximize training data for limited datasets (Rizos et al. 2019). Zhang, Yu, and Zhang (2020) mixed sample sequences and tags for data augmentation, and the F1 value of entity recognition and event extraction tasks increased by 2.27% to 3.75%.
In general, data augmentation based on random term substitution, insertion, or replacement of words sharing the same label lacks constraints and cannot fully guarantee the semantic correctness of the augmented sentences. Data augmentation based on substitution words generated by a neural language model, or on the direct use of synonyms, is not suitable for specific fields such as medicine and geoscience: annotated domain data are scarce, and many domain words have no corresponding synonyms. Therefore, designing data augmentation methods for geoscience text data mining remains an urgent problem.
The purpose of this paper is to illustrate a way to build a dictionary of geoscience-related words and to provide a data augmentation technique based on Levenshtein distance. Based on the enhanced corpus, a word segmentation framework integrating Bidirectional Encoder Representations from Transformers (BERT), a Bidirectional Gated Recurrent Unit (Bi-GRU), and Conditional Random Fields (CRF) is designed to learn a sequence model for Chinese word segmentation. The remainder of this paper is organized as follows. Section 2 describes the data and materials. Section 3 introduces the methods in detail together with the experiments; results and discussion follow in Section 4, and the conclusion is given in Section 5.

Data and materials
The data used in this study include domain corpus data of geoscience and glossary data used for domain corpus augmentation. The geoscience corpus came from the website of the National Geological Archives of the China Geological Survey 1 (NGAC) and mainly consisted of geological survey reports of mineral resources (30,000 words in total) (Zhang et al. 2018). The China National Knowledge Infrastructure (CNKI) is the largest full-text database of Chinese journals in the world, covering natural science, engineering technology, agriculture, philosophy, medicine, humanities and social sciences, and other fields, so it is feasible and reasonable to obtain glossaries from CNKI. A web crawler was developed to extract the keywords of geoscience papers, from which we constructed a CNKI geoscience word dictionary (CGWD). To maintain complementarity between glossaries, we eliminated words already present in the general segmentation corpora (PKU and MSR) and finally obtained a dictionary of 20,137 words related to the geoscience field. The CGWD effectively makes up for the lack of geoscience domain vocabulary.

Basic idea
The more training data a machine learning algorithm has, the more effective it is. Even if the quality of the training data is low, as long as the model can obtain useful features from the original corpus, the algorithm can perform well. In practical applications, however, most tasks have only a small amount of annotated data, especially in geoscience domains. Therefore, it is important to increase the diversity of training samples and enhance the robustness of the model through data augmentation (Bengio and Senecal 2008; Collobert et al. 2011), and synonym replacement is the main technique of corpus enhancement. Constructing a domain vocabulary and finding synonyms thus becomes the key step of corpus enhancement (Kalinic and Krisp 2019). In this paper, we first use Levenshtein-based data augmentation to enrich the geoscience corpus. Next, we solve the sequence tagging problem by combining the BERT language model with a Bi-GRU-CRF neural network model. The overall framework is shown in Figure 1, and the following subsections explain these components in detail.

Calculation of Levenshtein distance
Levenshtein distance, also called edit distance, was proposed by Levenshtein in 1966. It is a string metric for measuring the difference between two words. The Levenshtein distance between two words is the minimum number of single-character operations (insertions, deletions, or substitutions) required to change one word into the other. By definition, the fewer editing operations needed, the higher the similarity between the two words, which can be used to determine whether two words are synonyms. The Levenshtein distance between two strings $a$ and $b$ can be calculated by the following equation:

$$
\operatorname{lev}(a,b)=
\begin{cases}
|a| & \text{if } |b| = 0,\\
|b| & \text{if } |a| = 0,\\
\operatorname{lev}(\operatorname{tail}(a),\operatorname{tail}(b)) & \text{if } a[0] = b[0],\\
1+\min
\begin{cases}
\operatorname{lev}(\operatorname{tail}(a),b)\\
\operatorname{lev}(a,\operatorname{tail}(b))\\
\operatorname{lev}(\operatorname{tail}(a),\operatorname{tail}(b))
\end{cases} & \text{otherwise,}
\end{cases}
$$

where $\operatorname{lev}(a,b)$ stands for the Levenshtein distance between strings $a$ and $b$; $|a|$ and $|b|$ stand for the lengths of $a$ and $b$, respectively; $\operatorname{tail}(a)$ stands for the string $a$ with its first character removed, $\operatorname{tail}(b)$ is defined analogously, and $a[0]$ denotes the first character of $a$. This formula was implemented as a recursive function in Python.
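The recursion above translates directly into Python. Below is a minimal sketch of such a recursive implementation; the memoization decorator and the example terms are our additions for practicality, not details given in the paper.

from functools import lru_cache

@lru_cache(maxsize=None)  # memoize subproblems; plain recursion is exponential
def levenshtein(a: str, b: str) -> int:
    """Recursive Levenshtein distance, following the equation above."""
    if len(b) == 0:
        return len(a)
    if len(a) == 0:
        return len(b)
    if a[0] == b[0]:
        return levenshtein(a[1:], b[1:])
    return 1 + min(
        levenshtein(a[1:], b),       # deletion
        levenshtein(a, b[1:]),       # insertion
        levenshtein(a[1:], b[1:]),   # substitution
    )

print(levenshtein("granite", "granodiorite"))  # -> 5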

Training set augmentation based on Levenshtein distance algorithm
The number of corpus samples in a specific field is usually limited, which strongly affects the training accuracy of the model. Therefore, we apply data augmentation to the domain-specific corpus. Figure 2 shows the main steps of data augmentation based on Levenshtein distance. The first step is glossary data preparation, as described in Section 2; these glossaries are the source of replacement vocabulary when the corpus is augmented. In the following steps, the training set of word-segmented sentences is enhanced by synonym replacement, and two words are judged to be synonyms by the following condition:

$$\operatorname{lev}(a, b) < \theta,$$

where $a$ and $b$ are the (translated) words being compared and $\theta$ is the threshold value.
The threshold value is determined according to its effect on model training. Although Levenshtein distance can quantify the similarity between two words, Chinese words are a special case. Because Chinese words are densely semantic, even a single-character difference between two words can change their meanings entirely, so the Levenshtein distance computed directly between two Chinese words does not reliably reflect their similarity. We therefore propose to translate Chinese words into English first and then calculate the Levenshtein distance between the translations, which effectively mitigates the problem of semantic differences between Chinese words.
In the program, if a word in the current sentence exists in PKU or MSR, its synonyms are obtained from the Chinese WordNet, 3 a comprehensive knowledge base of Chinese word sense distinctions and lexical-semantic relationships. If it does not exist, we translate the word into English and then calculate the similarity between the term in the sentence and each term in the CGWD vocabulary using the Levenshtein distance algorithm. If the distance is less than the given threshold, the term in the current sentence can be replaced by the term from the CGWD. The data augmentation algorithm for the geoscience domain is illustrated in Algorithm 1, and a code sketch follows it. The purpose of each iteration is to create a new sentence from a manually processed sentence; the newly generated sentence is then added to our domain-specific corpus for subsequent training.

Algorithm 1 Levenshtein data augmentation for geoscience domain
Input: a sentence with segmented words
Output: a set of sentences with segmented words
1. For word in sentence:
2.   If word in domain-generic corpus (PKU, MSR):
3.     Calculate the synonyms of word through WordNet
4.     If synonyms of word not None:
5.       Synonyms of word can replace word
6.       Get the top three candidate synonyms of word
7.   Else if word not in domain-generic corpus:
8.     Translate word into English
9.     For d_word in CGWD and translate d_word into English:
10.      Calculate the Levenshtein distance between word and d_word
11.      If the distance is less than the threshold:
12.        d_word can replace the word
13.        Get the candidate d_word
14. Candidate synonyms and candidate d_word randomly replace their original words
15. Form new sentence
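A compact Python rendering of Algorithm 1 follows, reusing the levenshtein function sketched earlier. The helper names get_wordnet_synonyms and translate_to_english are hypothetical stand-ins for the WordNet lookup and translation services, and the 50% replacement probability is our assumption; the paper does not specify these details.

import random

def augment_sentence(sentence, generic_vocab, cgwd, threshold):
    """One augmentation pass over a word-segmented sentence (Algorithm 1)."""
    new_sentence = []
    for word in sentence:                     # sentence: list of segmented words
        candidates = []
        if word in generic_vocab:             # steps 2-6
            candidates = get_wordnet_synonyms(word)[:3]  # hypothetical helper
        else:                                 # steps 7-13
            w_en = translate_to_english(word)            # hypothetical helper
            for d_word in cgwd:
                if levenshtein(w_en, translate_to_english(d_word)) < threshold:
                    candidates.append(d_word)
        # step 14: randomly replace the original word (probability is our choice)
        if candidates and random.random() < 0.5:
            new_sentence.append(random.choice(candidates))
        else:
            new_sentence.append(word)
    return new_sentence                       # step 15: the new sentence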

Converting word segmentation into sequence labelling
Xue (2003) first proposed performing word segmentation by training a machine-learning classifier to label each character, and constructed a sequence labelling word segmentation system on the basis of the maximum entropy model. The method achieved remarkable results on the Bakeoff-2003 datasets (Xue and Converse 2002; Xue 2003). In the present study, Chinese word segmentation is likewise treated as a character-level sequence labelling problem. The BMES tagging scheme is broadly accepted by annotators: every character in each word is marked as one of {B, M, E, S}, representing the beginning, middle, or end of a word, or a single-character word, respectively. For instance, a sentence with $n$ characters can be expressed as

$$\text{sentence} = C_1 C_2 \cdots C_n,$$

where $C_i$ stands for a character in the sentence and $n$ is the total number of characters. The sentence can then be labelled as

$$\text{labelled sentence} = s_1 s_2 \cdots s_n,$$

where $s_i$ is the segmentation label of $C_i$ with a value from {B, M, E, S}. For example, for the 26-character sentence '铁克里克断隆带上主要分布前震旦系,泥盆系和石炭系地层', the corresponding label sequence is 'BMMEBMESBEBEBMMESBMESBMEBE', and the linguistically meaningful segmentation is '铁克里克/断隆带/上/主要/分布/前震旦系/,/泥盆系/和/石炭系/地层/'. The four-label scheme can effectively improve segmentation precision and computational efficiency, which is a remarkable feature in natural language processing (Yao and Huang 2016). Therefore, the main goal of Chinese word segmentation is to find, for the characters $C$ of a given sentence, the label sequence

$$\hat{s} = \operatorname*{arg\,max}_{s} \; p(s \mid C),$$

where $p(s \mid C)$ represents the probability of the label sequence $s$ (over {B, M, E, S}) given the characters $C$, and $\operatorname{arg\,max}$ outputs the label sequence $s$ that maximizes $p$.
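For illustration, converting a segmented sentence into BMES labels takes only a few lines of Python (a sketch of the standard conversion, not code from the paper):

def words_to_bmes(words):
    """Convert a list of segmented words into a BMES label string."""
    labels = []
    for w in words:
        if len(w) == 1:
            labels.append("S")
        else:
            labels.append("B" + "M" * (len(w) - 2) + "E")
    return "".join(labels)

print(words_to_bmes(["铁克里克", "断隆带", "上"]))  # -> BMMEBMES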

Word segmentation model for Chinese text of geoscience
The word segmentation model framework proposed in this article is shown in Figure 3. The model consists of three modules. First, the labelled corpus passes through the pre-trained BERT model to obtain the corresponding character vectors. These vectors are then passed to the Bi-GRU module, which produces a score for each candidate segmentation tag of each character. Although the Bi-GRU learns context information, its per-character outputs are independent of each other. The final CRF layer therefore adds constraints on the Bi-GRU output by learning a transition matrix, preventing tag sequences that violate the labelling rules; the decoded result gives the segmentation tag of each character.
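The following PyTorch sketch shows one way to assemble such a stack. It is a minimal illustration under stated assumptions, not the authors' code: it assumes the Hugging Face transformers package and the pytorch-crf package (torchcrf), and the layer sizes echo the hyperparameters reported in the Model parameters section.

import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF  # from the pytorch-crf package

class BertBiGruCrf(nn.Module):
    def __init__(self, num_tags=4, hidden=128):  # 4 tags: B, M, E, S
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.gru = nn.GRU(self.bert.config.hidden_size, hidden,
                          batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_tags)   # per-character tag scores
        self.crf = CRF(num_tags, batch_first=True)  # learns the transition matrix A

    def forward(self, input_ids, attention_mask, tags=None):
        x = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        x, _ = self.gru(x)
        emissions = self.fc(x)
        mask = attention_mask.bool()
        if tags is not None:  # training: negative CRF log-likelihood as the loss
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # inference: best tag paths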

BERT embedding
BERT (Devlin et al. 2019) is a pre-training model composed of bidirectional transformer encoders (Vaswani et al. 2017). Upstream language preprocessing has always been a hot issue for Chinese word segmentation, and BERT, as an advanced pre-trained language model, can produce high-quality embedding vectors, which benefits the downstream character classification task. Figure 4 presents the pre-training architecture of the BERT model. The input is a series of tokens, which are first embedded to form vectors and then encoded by the transformer model. The corresponding output is a vector sequence of dimension H, in which each vector corresponds to the input token with the same index.
Compared with the traditional transformer encoder, the input representation of BERT's transformer encoder additionally includes segment embeddings, which better represent semantic information. BERT adopts two pre-training tasks: masked language modelling (masked LM) and next sentence prediction (NSP). During pre-training, the goal is to minimize the combined loss function of the two tasks, where the masked LM captures contextual information and NSP infers the relationship between sentences.
Because of the large-scale unlabelled data and deep structure used in pre-training, the model can capture rich semantic patterns and complex language phenomena from plain text and thus understand language well (Peters et al. 2018; Goldberg 2019). After fine-tuning, the performance of many natural language processing tasks can be improved using BERT. For the multi-label classification problem, the probability distributions over the labels in a label set S are calculated, and the multi-label outputs are determined by a threshold. In our proposed framework, BERT is used to generate contextual character vectors so as to improve the performance of the sequence labelling task (Souza, Nogueira, and Lotufo 2019).
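A minimal sketch of extracting such contextual character vectors with the Hugging Face transformers library (our illustration; the model checkpoint name is an assumption):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
inputs = tokenizer("前震旦系地层", return_tensors="pt")  # 6 characters + [CLS] and [SEP]
with torch.no_grad():
    vectors = model(**inputs).last_hidden_state
print(vectors.shape)  # torch.Size([1, 8, 768]): one H-dimensional vector per token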

Bidirectional GRU
Long short-term memory (LSTM) networks (Graves 2013) are an improved version of recurrent neural networks (RNNs) that deal with the vanishing gradient problem in RNN training (Bengio, Simard, and Frasconi 1994). LSTM networks are formed by LSTM units, each consisting of an input gate, a forget gate, an output gate, and a cell state. The input gate adds new information to the cell state, the forget gate filters useless information from the previous cell state, and the output gate controls the output of the cell state. Formally, the LSTM unit performs the following operations at time step $t$:

$$
\begin{aligned}
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i),\\
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f),\\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o),\\
\tilde{c}_t &= \tanh(W_c \cdot [h_{t-1}, x_t] + b_c),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,\\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}
$$

where $x_t$, $c_{t-1}$, and $h_{t-1}$ are the inputs of the LSTM, $W$ is a set of weight matrices, and $b$ is a set of bias vectors. The $\sigma$ operation denotes the sigmoid function; $i$, $f$, $o$, and $c$ are the input gate, forget gate, output gate, and cell vector, respectively. The gated recurrent unit (GRU) is a smaller variant of the LSTM (Cho et al. 2014) in which the forget and input gates are consolidated into a single update gate. The GRU also merges the cell state with the hidden state, among other simplifications. Compared with the LSTM, the GRU trains faster and has a simpler structure, and it is therefore popular. The GRU performs the following operations at time step $t$:

$$
\begin{aligned}
z_t &= \sigma(W_z \cdot [h_{t-1}, x_t]),\\
r_t &= \sigma(W_r \cdot [h_{t-1}, x_t]),\\
\tilde{h}_t &= \tanh(W \cdot [r_t \odot h_{t-1}, x_t]),\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,
\end{aligned}
$$

where $h_t$ is the output of the GRU used by subsequent layers. The hidden state $h_t$ of the GRU depends only on the previous hidden state $h_{t-1}$ and ignores the next hidden state $h_{t+1}$; that is, the GRU can only use the preceding text, not the following text. However, the following text, read in the backward direction, is also useful in Chinese word segmentation. An effective way to solve this problem is the bidirectional GRU (Bi-GRU) model, whose structure is shown in Figure 5. The model learns the sequence information before and after each position, producing two hidden states that together capture the contextual relationship.
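The concatenation of forward and backward states doubles the feature dimension, as this small PyTorch check illustrates (shapes only; the sizes are illustrative):

import torch
import torch.nn as nn

gru = nn.GRU(input_size=128, hidden_size=128,
             batch_first=True, bidirectional=True)
x = torch.randn(2, 26, 128)  # 2 sentences, 26 characters, 128-dim embeddings
out, _ = gru(x)
print(out.shape)  # torch.Size([2, 26, 256]): forward and backward states concatenated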

CRF
For Chinese word segmentation, the relationship between adjacent labels must be taken into account. For instance, a B (Begin) label must be followed by an E (End) or M (Middle) label, and cannot be followed by an S (Single) label. Conditional random fields (CRFs) use a single exponential model to represent the joint probability of an entire label sequence, and can therefore effectively solve the label bias problem and ensure the validity of predicted labels (Haruechaiyasak, Kongyoung, and Dailey 2008). Thus, we use a CRF to model the label sequence jointly rather than independently.
In the training data, assume a sentence with $n$ characters is expressed as $C = C_1 C_2 \cdots C_n$ and that $s = s_1 s_2 \cdots s_n$ is the corresponding segmentation label sequence, $s_i \in \{B, M, E, S\}$; the label sequence predicted through the CRF layer is $\hat{s} = \hat{s}_1 \hat{s}_2 \cdots \hat{s}_n$. Assume that the output matrix of the Bi-GRU is $P$, where $P_{t,\hat{s}_t}$ represents the non-normalized probability of mapping character $C_t$ to $\hat{s}_t$. $A$ is an important parameter of the CRF called the transition score matrix; it can be set manually or learned by the model, and $A_{i,j}$ ($i, j \in \{B, M, E, S\}$) represents the score of transitioning from tag $i$ to tag $j$. With characters $C$ and labels $\hat{s}$, the predicted score is

$$\operatorname{score}(C, \hat{s}) = \sum_{t=1}^{n} A_{\hat{s}_{t-1}, \hat{s}_t} + \sum_{t=1}^{n} P_{t, \hat{s}_t}.$$

The probability $p(s \mid C)$ introduced above can then be formulated as

$$p(s \mid C) = \frac{\exp(\operatorname{score}(C, s))}{\sum_{\tilde{s} \in S_C} \exp(\operatorname{score}(C, \tilde{s}))},$$

where $S_C$ represents the set of all possible label sequences of sentence $C$ and $\operatorname{score}(C, \tilde{s})$ is the score of generating the label sequence $\tilde{s}$ for the given characters $C$.

Therefore, when training the neural network, we only need to maximize the likelihood $p(s \mid C)$. Using the log-likelihood, the loss function is

$$\mathcal{L} = -\sum_{i} \log p\left(s^{(i)} \mid C^{(i)}\right),$$

where $C^{(i)}$ and $s^{(i)}$ are the characters and corresponding segmentation labels of the $i$-th sentence in the training set.

Model evaluation
To select the best Chinese word segmentation model, we use three evaluation indicators. Precision indicates the percentage of correctly segmented words among all predicted words. Recall indicates the percentage of correctly segmented words among all words in the gold-standard segmentation. Finally, the F1-score evaluates the model's overall performance:

$$F1 = \frac{2 \times P \times R}{P + R},$$

where $P$ is precision and $R$ is recall.
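Word-level scoring can be sketched by comparing the character spans of predicted and gold words; the implementation below is our illustration, as the paper does not specify its scoring script.

def evaluate(pred_words, gold_words):
    """Span-exact precision, recall, and F1 for one sentence pair."""
    def spans(words):
        out, pos = set(), 0
        for w in words:
            out.add((pos, pos + len(w)))  # (start, end) character offsets
            pos += len(w)
        return out
    p_spans, g_spans = spans(pred_words), spans(gold_words)
    correct = len(p_spans & g_spans)
    precision = correct / len(p_spans)
    recall = correct / len(g_spans)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1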

Model parameters
The BERT model was used to extract features from the training datasets. The word embedding vectors, with a dimension of 128, were then taken as input to the Bi-GRU layer; the number of Bi-GRU hidden units equals the dimension of the word embedding vectors. A dropout layer partially filters the output of the Bi-GRU layer to prevent over-fitting. The hyperparameter settings of our method are shown in Table 2.

Performance of word segmentation based on Levenshtein distance-based augmentation
To validate the Levenshtein distance-based method for word segmentation with corpus augmentation of geoscience texts, the corpus of geological survey reports of mineral resources was used for model training and evaluation. A 10-fold cross-validation procedure was employed to evaluate the performance of word segmentation with Levenshtein distance-based augmentation. Table 3 lists the result of each of the 10 folds with the BERT-Bi-GRU-CRF model; the last two columns show the rates of in-vocabulary (IV) and out-of-vocabulary (OOV) words in the test datasets. Compared with the results before augmentation, there is an improvement of 10.06%.
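The fold loop can be scripted with scikit-learn as below; load_corpus, augment, and train_and_score are hypothetical stand-ins for corpus loading, Algorithm 1, and model training plus F1 scoring.

import numpy as np
from sklearn.model_selection import KFold

sentences = np.array(load_corpus(), dtype=object)  # hypothetical loader
kf = KFold(n_splits=10, shuffle=True, random_state=42)
scores = []
for train_idx, test_idx in kf.split(sentences):
    train_set = augment(sentences[train_idx])       # Levenshtein-based augmentation
    scores.append(train_and_score(train_set, sentences[test_idx]))
print(f"mean F1 over 10 folds: {np.mean(scores):.4f}")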

Comparisons of different augmentation methods
For natural language processing, many different augmentation methods are used in different thematic areas. Synonym replacement and random insertion were selected for comparison. Table 4 shows the experimental results of different data augmentation strategies for word segmentation with the BERT-Bi-GRU-CRF model. The improvement from data augmentation with Levenshtein distance is obvious, whereas data augmentation with synonym replacement and random insertion did not improve much; instead, there is a small drop in precision, recall, and F1-score. A likely reason is that many domain words in the dataset, such as 'Heishilazi iron deposit' and 'lead-zinc deposit', have no corresponding synonyms to replace. Although synonym replacement has been shown to bring clear improvements in many text mining tasks (Wei and Zou 2019), our experiments show that general-domain resources do not necessarily help the word segmentation of domain-specific corpora such as geoscience texts.

Effects on other deep-learning based word segmentation models
To validate the effectiveness of the Levenshtein distance-based augmentation technique, different widely used deep-learning models were evaluated on the same dataset. Table 5 shows the experimental results of different models with and without Levenshtein distance data augmentation. The performance of all models is greatly improved after using the augmentation. Taking the F1-score as an example, Bi-LSTM, Bi-GRU, Bi-LSTM-CRF, and CNN-LSTM are improved by 10.97%, 10.92%, 13.13%, and 12.20%, respectively.

Effects of open-source Chinese word segmentation tools on geological dataset
Seven widely used, mainstream open-source Chinese word segmentation tools, PyNlpir, 4 Jieba, 5 StanfordCorenlp, 6 Thulac, 7 Hanlp, 8 Pkuseg 9 and Ltp, 10 were run and evaluated on our geoscience datasets. As illustrated in Table 6, Ltp performed best, with an F1-score of 83.3%, as well as the highest precision (86.1%) and recall (83.8%); PyNlpir had the lowest performance (67.0%). However, since these open-source tools are trained on domain-generic corpora, their results are not satisfactory when applied to specific fields such as geoscience. Therefore, training new segmenters on geoscience texts is needed.
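As a quick illustration, an off-the-shelf segmenter such as Jieba can be applied in two lines; on domain terms the generic dictionary tends to over-split, which is the weakness Table 6 reflects (the exact output depends on the Jieba version and dictionary):

import jieba

sentence = "铁克里克断隆带上主要分布前震旦系,泥盆系和石炭系地层"
print("/".join(jieba.cut(sentence)))
# Generic dictionaries typically fragment domain terms such as 铁克里克断隆带.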

Effect of threshold setting on experimental results
Data augmentation with Levenshtein distance is affected by the setting of the threshold. To further explore performance under different thresholds, comparative experiments were performed using the BERT-Bi-GRU-CRF model. The experimental results are shown in Figure 6. When the threshold equals 0, the result can be regarded as the experimental result without the Levenshtein data augmentation technique. In general, when the threshold is less than 13, the performance of the model gradually improves, after which it begins to show a significant downward trend. When the threshold exceeds 50, the word segmentation result is even lower than without the Levenshtein data augmentation technique. The likely reason is that such a loose restriction admits candidate words that are not actually similar to the domain vocabulary.
Generally, the performance of the model with the Levenshtein edit-distance augmentation technique can be affected significantly by the threshold value. Implementing the technique therefore requires quantitatively testing and determining the threshold; an appropriate value can effectively improve the performance of Chinese word segmentation in the geoscience domain.
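A threshold sweep of this kind is straightforward to script (augment_corpus and train_and_score are hypothetical helpers, as in the earlier sketches):

for threshold in [0, 5, 10, 13, 20, 30, 50]:
    augmented = augment_corpus(train_sentences, threshold)  # Algorithm 1 at this threshold
    f1 = train_and_score(augmented, test_sentences)
    print(f"threshold={threshold:>3}  F1={f1:.4f}")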

Potential applications of Levenshtein distance-based technique on the other fields
The data augmentation strategy based on the Levenshtein distance algorithm proposed in this paper mainly targets specific fields and achieves good results in geoscience text segmentation. The method directly processes domain vocabulary and data and uses the Levenshtein distance technique to enhance the geoscience corpus, obtaining enough data to improve the model's robustness without redesigning the word segmentation model itself. Therefore, it can be readily transplanted to other fields. With the arrival of the big data era, a great deal of text data has accumulated in various research fields and languages. Chinese is written as continuous character sequences without explicit word delimiters, so the efficiency of word segmentation is low. The proposed method can therefore effectively improve the performance of Chinese word segmentation and can also benefit data augmentation in other languages, helping to solve the corpus scarcity problem faced by statistical and deep learning-based models.

Conclusions
In this paper, we first proposed a Levenshtein distance-based method for corpus augmentation in word segmentation of geoscience texts. This method effectively enriches the geoscience corpus and is therefore well suited to augmenting the training set of a word segmentation model. Based on the augmented corpus, we then designed a word segmentation framework that exploits domain knowledge and the annotated corpus to segment domain-specific text. The framework includes three parts: the collection of domain glossaries, data augmentation based on Levenshtein distance, and segmentation of the domain corpus with the BERT-Bi-GRU-CRF model. A geoscience corpus composed of complex, long, domain-specific vocabulary was selected to test the proposed framework. Experimental results showed that the proposed framework achieves a significant performance improvement of 10% over previous domain word segmentation methods. The framework with the Levenshtein data augmentation technique segments geoscience terms with high quality (maximum F1 value of 0.9884) and can easily be applied to other fields. The implementation of the Levenshtein data augmentation strategy is automatic, and the designed framework can readily be migrated to other specific domains such as medicine, as well as adapted to natural language processing tasks like named entity recognition and relation extraction. Although synonym replacement has been shown to bring clear improvements in many text mining tasks (Wei and Zou 2019), our study indicates that general synonym substitution is not suitable for specific fields; the main factor limiting word segmentation in such fields is the lack of professional vocabulary. In view of this, we constructed a CNKI geoscience word dictionary containing 20,137 words, which greatly benefits Chinese word segmentation in the geoscience corpus.