BERTCWS: unsupervised multi-granular Chinese word segmentation based on a BERT method for the geoscience domain

ABSTRACT Unlike alphabet-based languages such as English, the Chinese language has no explicit word boundaries. Segmentation, particularly for the Chinese language, is a fundamental step towards Chinese text processing, information retrieval, and knowledge discovery. In the geoscience domain, most existing Chinese word segmentation tools/models require a prespecified dictionary and a large relevant training corpus, and their segmentation accuracy drops significantly when they process out-of-domain texts. To address this issue, a purely unsupervised and generic two-stage architecture (named BERTCWS) for domain-specific Chinese word segmentation is proposed. We first design an incidence matrix termed the 'character combination tightness' to calculate the closeness between characters. Then, BERTCWS recognizes geoscience terms based on a Bidirectional Encoder Representations from Transformers (BERT)-based segmenter, and multi-granular segmentation is generated by setting different thresholds. Finally, a discriminator is constructed to validate the correctness of the segmented words. Our numerical study demonstrates that BERTCWS can identify both general-domain terms and geoscience-domain terms. Additionally, multi-granular segmentation can offer a set of potential geoscience terms of various lengths.


Introduction
The rich geological information contained in electronic natural-language texts can facilitate diverse research, ranging from studies on geological named entity recognition (GNER) (Qiu et al. 2019; Qiu et al. 2019b) and automatic information extraction (Holden et al. 2019; Wang et al. 2022) to enquiries leading to automated knowledge graph development and knowledge discovery (Ma et al. 2010; Wang, Ma, and Chen 2018; Wang et al. 2018b). From geological reports, geologists can discover and understand geoscience events: what they were, what happened, and what relationships the entities involved (e.g. rocks) had with other strata.
In this article, we focus on one critical challenge in Chinese natural language processing (NLP); in particular, we tackle the problem of geoscience domain word segmentation, which aims to segment a set of Chinese characters (e.g. regional geological unstructured texts) into a sequence of representative and meaningful words and phrases.
Unlike alphabet-based languages such as English and other Western languages, for which a set of successful NLP approaches and algorithms have been proposed and used (Deng et al. 2016; Liang et al. 2019; Yuan et al. 2020), character-based languages such as Chinese and many East Asian languages do not have words separated by spaces. Domain-specific word segmentation and effective new word detection are still of fundamental importance for processing these languages (Huang, Du, and Chen 2015; Shu et al. 2017; Qiu et al. 2018). In the geoscience domain especially, Chinese word segmentation (CWS) is an area of considerable research interest because of the unique characteristics of Chinese text structures. Words in Chinese generally consist of a set of characters that are not joined together by any delimiter. For instance, the meaningful word 'ophiolite' consists of three characters in Chinese: 'snake', 'green', and 'rock'. If these characters are taken individually rather than in combination, they may not convey a complete and representative meaning, which can lead to misinterpretation. Effective and high-performance CWS can correctly convert geological texts into a combination of grammatically and semantically meaningful words, and these words can then be used to extract and understand the content of geological reports. Therefore, accurate segmentation is an important step in Chinese geological report processing; without it, the significant ambiguities present in the texts are difficult to resolve correctly. Table 1 and Table 2 illustrate the CWS problem with an example sentence from one of our experimental datasets, a passage taken from a published geological report. The original text has no word boundaries at all. The alignments between manually annotated words and their English counterparts clearly demonstrate the succinctness of the Chinese language.
For example, the four characters in 查干楚鲁 mean 'search', 'do', 'Chu', and 'Lu' respectively, whereas together they form a toponym. If the four characters were separated, they would no longer represent the intended meaning.
Existing approaches for Chinese word segmentation fall into three categories: dictionary-based methods, statistics-based methods, and their hybrids (Huang, Du, and Chen 2015; Qiu et al. 2018). Dictionary-based methods use a corpus as a predefined dictionary to recognize character sequences and segment them. Most of these approaches depend on maximum matching strategies and achieve high performance when the majority of the input text is covered by the dictionary. However, for domain-specific texts such as geoscience texts, they are limited owing to the lack of appropriate dictionaries (Zeng et al. 2011). Mainstream statistics-based methods, in contrast, treat word segmentation as a sequence tagging problem and develop learning algorithms to train on and predict the segmentation sequences (Zhao et al. 2020). Traditional dictionary-based approaches cannot identify words that are not covered in the predefined dictionary, whereas statistics-based approaches can learn from a large training corpus and acquire knowledge for segmenting out-of-vocabulary (OOV) words. More attention has therefore been focused on statistics-based methods, especially machine-learning and deep-learning approaches (Wang, Zong, and Su 2012; Zhang, Liu, and Fu 2018; Ma, Ganchev, and Weiss 2018; Yuan et al. 2020). In the geoscience domain, Huang, Du, and Chen (2015) developed a statistical framework (referred to as GeoSegmenter) based on conditional random fields (CRF) for domain-specific CWS. Their model first segments general words using a generic segmenter and then identifies geoscience words with a trained model that converts the initial division into the goal division. Qiu et al. (2018) used words and their frequencies as the input of a bidirectional long short-term memory (LSTM) network and automatically generated a set of training datasets for the task of geoscience Chinese word segmentation.
Their model is a weakly supervised approach and achieves good performance. However, such supervised and weakly supervised algorithms are limited by the scarcity of large training datasets in the geoscience domain. Additionally, manual annotation requires expert time, is expensive, and lacks expansibility and interoperability.
In this article, we focus on incorporating the bidirectional encoder representations from transformers (BERT) model into a Chinese word segmentation and new word detection system (referred to as BERTCWS) for the geoscience domain. We propose a purely unsupervised approach that pipelines segmentation and new word discovery to simultaneously segment sequences of Chinese texts, both domain-general and domain-specific. We discover new words and phrases without requiring a collection of dictionaries or a large training corpus. Our proposed approach is based on an incidence matrix termed the 'character combination tightness', which represents the degree of closeness between adjacent characters in natural-language texts. Experimental results on geoscience reports demonstrate that BERTCWS can effectively segment both general words and geoscience words. In addition, the presented approach can obtain multi-granular segmentation, which is considerably more efficient in new word detection than single-granular segmentation. To the best of our knowledge, this is the first study to segment Chinese texts from unstructured geoscience reports using a purely unsupervised approach.
The remainder of this paper is organized as follows. Section 2 presents the detailed methodology and its basic components. Section 3 describes the experimental results. Section 4 presents the discussion. Section 5 concludes this article and introduces future work.

Methodology
BERTCWS is mainly composed of two major stages: the segmenter and the discriminator. The segmenter aims to offer multi-granular Chinese word segmentation. After obtaining the initial segmented terms, the discriminator attempts to evaluate how to form a more effective term from these initial segmentation units. It is important to highlight that this part facilitates word segmentation and new term detection. The overall workflow of the BERTCWS is shown in Figure 1.

Background: BERT
The basic framework of BERT is shown in Figure 2. It is a multi-layer bidirectional transformer encoder that learns representations by jointly conditioning on both the left and right contexts in all layers (Devlin et al. 2019). The initial pre-trained representation is conditioned on masked language models using BooksCorpus (a total of 800 M words) (Zhu et al. 2015) and English Wikipedia (a total of 2,500 M words); a multilingual model is also provided, evaluated on XNLI samples (a total of 112,500 tagged pairs covering 15 languages) (Conneau et al. 2018).
The encoder computation can be written as

h^(0) = [c_0, c_1, …, c_m, c_(m+1)] W_e + W_p
h^(l) = Transformer_block(h^(l−1)), l = 1, …, L
y = Classifier(h^(L) W_o + b_o)

where c_i represents the i-th token, W_e represents the weight matrix of the embedding layer, and W_p represents the positional encoding. Two special tokens are added to the architecture: [START] (represented by c_0) and [END] (represented by c_(m+1)). L represents the number of transformer-block layers, which mainly contain self-attention and fully connected sub-layers. W_o and b_o represent the output weight matrix and bias, respectively. The final classifier can be a CRF or a softmax layer.
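As a concrete sketch of this forward pass, the toy numpy code below uses random weights and a crude averaging stand-in for the transformer block; the dimensions, the attention stand-in, and the weights are our own illustrative assumptions, not BERT's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, L = 100, 16, 2            # toy vocabulary size, hidden size, layer count (assumptions)

W_e = rng.normal(size=(V, d))   # token embedding matrix
W_p = rng.normal(size=(50, d))  # positional encodings, one row per position
W_o = rng.normal(size=(d, V))   # output projection of the classifier head
b_o = np.zeros(V)

def transformer_block(h):
    """Crude stand-in for self-attention + feed-forward; real BERT is far richer."""
    attn = h @ h.T / np.sqrt(h.shape[1])
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ h

def encode(token_ids):
    # h^(0) = W_e[c] + W_p, then L transformer blocks, then a linear head
    h = W_e[token_ids] + W_p[: len(token_ids)]
    for _ in range(L):
        h = transformer_block(h)
    return h @ W_o + b_o        # per-token logits over the vocabulary

logits = encode([2, 17, 41, 5])  # a [START], c_1, c_2, [END] style sequence
print(logits.shape)              # (4, 100): one score vector per token
```

The softmax-over-logits (or a CRF layer) mentioned in the text would then be applied on top of these per-token scores.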

Segmenter
To a non-Chinese reader, the Chinese language might appear to be a series of random combinations of 'characters'. However, the contexts of characters are related: a sentence is not a random combination of 'words' but rather a combination of 'routines', and we therefore explore the 'routines' of the language. The first 'routine' is the combination of adjacent characters, which is what we understand as a word. The process of new word discovery is to judge, according to the corpus, whether a given fragment is really a word; so-called word formation means that the fragment is relatively independent and cannot be segmented further. One solution to this problem is to reverse it: determine which fragments cannot be words. When the compactness of a fragment is greater than a certain degree, the fragment may form a word; conversely, if the compactness of a fragment is lower than a certain degree, the fragment cannot form a word, and we can segment it in the original corpus. A grammatically correct sentence is modelled as the concatenation of a set of words together with their corresponding frequencies drawn from a probabilistic dictionary (Ge, Pratt, and Smyth 1999; Bussemaker, Li, and Siggia 2000; Deng et al. 2016). A sentence is composed of a sequence of characters, the basic units of a language, but it is read and understood as a list of higher-order units, including words, phrases, idioms, and a variety of regular expressions. In our research, these higher-order units are all broadly regarded as 'word' units. Let C = {c_1, c_2, …, c_p} represent the set of fundamental 'characters' of a language. For English, C is an alphabet of 26 letters, whereas thousands of distinct characters may appear in Chinese texts (the Zhonghua Zihai dictionary contains 87,019 Chinese characters, of which about 3,000 are commonly used) (Deng et al. 2016).
A word w is formed by combining elements of C in a certain order, such as w = c_(i1) c_(i2) … c_(il). Let D denote the dictionary for the texts.
BERTCWS treats each given sentence S = {w_1, w_2, …, w_n} as a combination of words drawn randomly from D together with the character combination tightness θ_i for word w_i. Using θ = (θ_1, θ_2, …, θ_N) as the closeness between adjacent characters, the probability of producing a K-segmented sentence under this model is

P(S | D, θ) = ∏_(k=1)^K θ_(w_k).

In unsupervised word segmentation (or text analysis), the main challenge is to discover and detect new (unknown) words that do not appear in dictionary D. A stepwise approach named 'word grammar' was proposed to address this problem: it first builds an initial dictionary of single-character words and then estimates the word-related frequencies θ with the predefined dictionary; newly detected words are added to the dictionary iteratively (Olivier 1968). However, this approach is limited by its computational cost and does not work well in practice. It was later modified and improved by applying maximum-likelihood estimation and used for Chinese text analysis. TopWORDS, proposed by Deng et al. (2016), also used a more complicated Markov dictionary model for the task of Chinese text analysis. All of these approaches (e.g. 'word grammar' and TopWORDS) detect unknown words by a 'bottom-up' heuristic and have been successfully used for English texts, whereas they are difficult (too expensive) to apply to Chinese texts because both the given dictionary D and the character set C are rather large. The main challenge is to estimate and calculate θ without other information.
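Under this probabilistic-dictionary view, the likelihood of one candidate segmentation is simply the product of the tightness scores of its words. A minimal sketch, where the θ values are invented for illustration:

```python
# Hypothetical tightness/frequency estimates for dictionary entries (assumed values).
# 蛇绿岩 = 'ophiolite'; its single characters are 蛇 'snake', 绿 'green', 岩 'rock'.
theta = {"蛇绿岩": 0.6, "蛇": 0.1, "绿": 0.1, "岩": 0.2}

def segmentation_prob(words):
    """P(S | D, theta): product of theta over the words of one segmentation."""
    p = 1.0
    for w in words:
        p *= theta.get(w, 1e-9)  # unseen words get a tiny floor probability
    return p

whole = segmentation_prob(["蛇绿岩"])          # keep 'ophiolite' as one word
split = segmentation_prob(["蛇", "绿", "岩"])  # split into single characters
print(whole > split)  # True: the model prefers keeping the whole term
```

Maximizing this product over all candidate segmentations is what drives the model towards keeping tightly bound character runs together.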
In our design, we estimate the character combination tightness θ_i for word w_i using the BERT model. Given a sentence S = {w_1, w_2, …, w_n}, the BERTCWS result can be naturally translated into a segmentation of the sentence by considering the degree of character combination tightness. The Chinese word segmentation problem is thereby reduced to estimating the closeness between adjacent characters.
To measure the degree of character combination tightness, we construct an n × n correlation matrix F that captures the correlation between any two characters in a sentence. The construction of the correlation matrix F considerably impacts the word-segmentation results. A strong link with a larger value w_ij in F indicates that character i and character j are highly correlated and therefore more likely to form a word. We build these correlations between characters from mutual information, forming the correlation matrix. Once the correlation matrix is obtained, its values describe the closeness between any two characters: if the closeness of two characters exceeds a set threshold, we consider that the two characters should form one word; otherwise, they should be segmented from each other.
In this study, for unsupervised Chinese word segmentation, the correlation matrix F is constructed from the BERT model. We use H(x) to represent the output of sequence x after the BERT encoder, and H(x)_i the encoding vector of the i-th token. In addition, x∖{x_i} represents the sequence after replacing the i-th token with [mask], and x∖{x_i, x_j} the sequence after replacing both the i-th and j-th tokens with [mask]. Let f(x_i, x_j) indicate the dependence of the i-th token on the j-th token, i.e. the 'influence' of the j-th token on the i-th token. We define it as

f(x_i, x_j) = d( H(x∖{x_i})_i , H(x∖{x_i, x_j})_i )

where d(·, ·) denotes the distance between two vectors. In our work, we use the Euclidean distance, d(u, v) = ‖u − v‖_2. An illustrative example is shown in Figure 3, where the distance between the character 'of' and the character 'association' represents their degree of dependence.
In the masked language modelling (MLM) setting (Wu et al. 2020), H(x∖{x_i})_i and H(x∖{x_i, x_j})_i are both used to predict token x_i. Following the intuition of 'more masks, less accurate prediction', we can assume that H(x∖{x_i})_i predicts x_i more accurately than H(x∖{x_i, x_j})_i. Thus, by additionally removing the information of x_j, the distance between the two representations quantifies the 'influence' of x_j on x_i.
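The masked-distance computation can be sketched as follows. Since loading a real BERT checkpoint is out of scope here, a deterministic toy encoder stands in for H(·), so the code illustrates only the mechanics of double masking and distance taking, not realistic influence scores.

```python
import numpy as np

MASK = "[mask]"

def char_vec(t):
    """Deterministic toy embedding for one token (purely illustrative)."""
    s = sum(map(ord, t))
    return np.array([s % 97, s % 89], dtype=float)

def toy_encode(tokens):
    """Stand-in for the BERT encoder H(.): each position gets its own vector
    plus a small contribution from every other position, so masking a
    neighbour measurably changes the representation."""
    base = [char_vec(t) for t in tokens]
    out = []
    for i, v in enumerate(base):
        ctx = sum((base[j] for j in range(len(base)) if j != i), np.zeros(2))
        out.append(v + 0.1 * ctx)
    return np.array(out)

def influence(tokens, i, j):
    """f(x_i, x_j): distance between H(x\\{x_i})_i and H(x\\{x_i, x_j})_i."""
    one = list(tokens); one[i] = MASK
    two = list(tokens); two[i] = MASK; two[j] = MASK
    return float(np.linalg.norm(toy_encode(one)[i] - toy_encode(two)[i]))

tokens = list("蛇绿岩中")
print(influence(tokens, 0, 1))  # influence of token 1 on predicting token 0
```

With a real model, `toy_encode` would be replaced by a forward pass through a masked-language-model encoder; the double-masking and distance logic stays the same.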
After constructing the correlation matrix, we can use it to segment Chinese texts in an unsupervised manner. We set a threshold λ (λ ≥ 1): two adjacent tokens with a correlation less than λ are divided, and tokens with a correlation greater than or equal to λ are regarded as belonging to the same word (a word may contain multiple characters). The threshold is selected via an objective function and controls multi-granular word segmentation. However, not all values of λ yield good segmentation results; the selection of an appropriate threshold is discussed after the description of the discriminator.
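Given tightness scores for each adjacent character pair, the thresholding step itself is a single left-to-right pass. The scores below are invented for illustration; only the merge/cut rule is from the method.

```python
def segment(chars, tightness, lam):
    """Merge adjacent characters whose tightness >= lam into one word.
    tightness[k] scores the bond between chars[k] and chars[k + 1]."""
    words, current = [], chars[0]
    for k, score in enumerate(tightness):
        if score >= lam:
            current += chars[k + 1]  # bond strong enough: extend the word
        else:
            words.append(current)    # bond too weak: cut here
            current = chars[k + 1]
    words.append(current)
    return words

chars = list("蛇绿岩发育")            # 'ophiolite' + 'developed'
scores = [8.0, 7.5, 1.2, 6.0]        # assumed tightness values
print(segment(chars, scores, 6))     # ['蛇绿岩', '发育']
print(segment(chars, scores, 9))     # finer granularity: all single characters
```

Sweeping `lam` over several values is exactly what produces the multi-granular segmentations discussed below.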

Discriminator
As presented previously, the main challenge of unsupervised Chinese word segmentation is identifying unknown words. After the segmenter, the given input sentences yield an initial segmented result. However, for unsupervised word segmentation, this result may contain many incorrect segmentations. To address this problem, the discriminator is designed to estimate how likely a string (a combination of characters) is to be a correct word or term. The goal of the discriminator is to remove incorrect segmentation results so that they are not recognized as terms; in addition, the discriminator is applied to choose high-quality segmentation results.
First, the segmentation results are counted directly; then the low-frequency words are filtered out based on these statistics, and the remaining words are used as the word bank. The word bank is then fed back into the segmenter for repeated segmentation, which improves the accuracy of segmentation.

Word segmentation and new word discovery based on the segmenter and discriminator
Most Chinese word segmentation approaches and tools produce only one granularity. Our proposed approach, BERTCWS, can offer multi-granular results by adjusting the threshold λ, which allows the 'best' granularity to be chosen. New words can then be discovered in a straightforward manner based on the segmenter and discriminator. First, the segmenter provides an initial word-segmentation result; after removing and filtering low-frequency words based on statistical information, the remaining words in the segmentation results are considered a set of candidate terms. A candidate string is then accepted as a correctly segmented word if it occurs at least a certain threshold n_min times (set to 5 in our experiments) in the whole corpus.
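The frequency-based acceptance step amounts to counting segmented units over the corpus and keeping the ones that clear n_min. A minimal sketch with an invented toy corpus:

```python
from collections import Counter

def accept_candidates(segmentations, n_min=5):
    """Keep a segmented string as a word only if it occurs at least n_min
    times across the whole corpus (n_min = 5 in our experiments).
    `segmentations` is a list of per-document word lists."""
    counts = Counter(w for seg in segmentations for w in seg)
    return {w for w, c in counts.items() if c >= n_min}

# Six documents agree on one segmentation; one noisy document disagrees.
segmentations = [["蛇绿岩", "发育"]] * 6 + [["蛇绿", "岩发育"]]
print(accept_candidates(segmentations))  # the one-off mistakes are filtered out
```

The surviving set is what the text calls the word bank; feeding it back into the segmenter gives the iterative refinement described above.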

Experiments
In this section, we conduct a series of experiments to fine-tune the multi-granular parameter of BERTCWS and to evaluate the performance of the proposed unsupervised approach on the task of Chinese word segmentation in the geoscience domain. The first experiment selects the best parameter for character combination tightness and presents the main results. The second experiment evaluates the model's ability to recognize new words. The third experiment uses the selected parameter for comparison with other models.

Experimental setup
Datasets: Three experimental datasets containing a domain-general corpus and a domain-geoscience corpus were used to investigate the proposed methodology. The domain-general corpus was collected from the SIGHAN Chinese word segmentation Bakeoff (Emerson 2005), which includes documents on a variety of topics and provides popular benchmark corpora for various Chinese NLP tasks. We selected the simplified Chinese test corpora constructed by Microsoft Research (MSR) and Peking University (PKU). The PKU and MSR test sets have 5.6 M/12.1 M segments and 180K/705K distinct words, respectively. Following the same criteria used to generate the domain-general benchmark corpora, we first collected a total of 43 geoscience reports and then annotated them into a gold test set with five human annotators who have sufficient background knowledge of both geoscience and Chinese word segmentation. This process resulted in 7.8 M segments with 220K distinct words; this dataset is named the GEO corpus.
Performance measures: Five performance metrics were selected to evaluate the segmenters. Precision (P) denotes the percentage of correctly segmented words among all words produced by the segmenter; recall (R) denotes the percentage of the annotators' true words that are correctly segmented. The F1-score is the harmonic mean of precision and recall, F1-score = 2PR/(P + R). To evaluate the ability of the unsupervised method to recognize new words, the recall of OOV words (R_OOV) and the recall of in-vocabulary words (R_IV) are also used; R_OOV and R_IV denote the percentage of out-of-vocabulary and in-vocabulary words/terms, respectively, that are correctly segmented. R_OOV indicates the ability of a model to generalize or extend to a new domain.
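These metrics can be sketched as follows. For brevity this is a type-level approximation comparing word sets rather than the span-level SIGHAN scoring, and the gold/predicted words and the training vocabulary are invented for illustration.

```python
def prf(gold, predicted):
    """Type-level precision, recall, and F1 between gold and predicted words."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = gold_set & pred_set
    p = len(correct) / len(pred_set)
    r = len(correct) / len(gold_set)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def r_oov(gold, predicted, vocab):
    """Recall restricted to gold words absent from the training vocabulary."""
    oov = {w for w in gold if w not in vocab}
    return len(oov & set(predicted)) / len(oov) if oov else 0.0

gold = ["蛇绿岩", "发育", "地层"]
pred = ["蛇绿岩", "发育", "地", "层"]          # '地层' was wrongly split
p, r, f1 = prf(gold, pred)
print(round(f1, 2))                          # 0.57
print(r_oov(gold, pred, {"发育", "地层"}))    # 1.0: the OOV word 蛇绿岩 was found
```

R_IV is computed the same way as `r_oov` with the membership test inverted.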
Test method: A 10-fold cross-validation approach is applied to test and calculate the average results of 10 independent runs.
We tested BERTCWS against a set of notable CWS tools/systems: Jieba (Sun 2012), THULAC (Sun et al. 2016), PKUSEG (Luo et al. 2019), and ICTCLAS (Zhang et al. 2003). Jieba segments words based on a constructed dictionary and applies a hidden Markov model to recognize new words. THULAC, PKUSEG, and ICTCLAS are trained on large datasets and are typical supervised models. All models were trained in the following environment: CPU: 2 × Intel Xeon E5-2620 v2 @ 2.10 GHz; GPU: 2 × NVIDIA Tesla K20; memory: 96 GB; operating system: Ubuntu 14.04 64-bit.

Overall performance
As stated in Section 2.2, the multi-granular parameter λ controls the granularity of the segmentation results to prevent over-segmentation or under-segmentation. Over-segmentation causes the model to further split words that are already correctly segmented, whereas under-segmentation causes the model to leave unsplit content that should be segmented further. A value of the multi-granular parameter that is too large or too small may therefore lead to over-segmentation or under-segmentation. To fine-tune this parameter, we conducted a set of experiments with λ ranging from 2 to 10 with a step size of 2 (we also experimented with λ = 1). The obtained results are reported in Table 3 in terms of precision, recall, and F1-score.
As indicated in Table 3, the proposed approach achieved its highest F1-scores on the MSR dataset across all settings. For example, when the parameter was set to 6 on MSR, the proposed method attained 64% precision, 70% recall, and a 67% F1-score. However, the overall performance does not exceed 70%, which provides empirical evidence that purely unsupervised Chinese word segmentation is indeed difficult, both in the general domain and in the geoscience domain.
In particular, in the geoscience domain, the proposed method obtained its highest values with λ = 6. Beyond λ = 6, the average performance (precision, recall, and F1-score) of the proposed method starts to decrease to an unsatisfactory level. This is likely because a larger λ causes the model to over-segment: correctly segmented words and phrases in the input are split further, so sentences can no longer be broken up precisely.
As we have three test sets to evaluate and investigate the proposed BERTCWS model, we conducted a set of experiments to evaluate the effectiveness of a variety of testing strategies. Specifically, seven different strategies are experimented with to test BERTCWS; their performance based on PKU, MSR, and GEO test sets are presented in Table 4.
The first three strategies, S1 to S3, use a single dataset (PKU, MSR, or GEO) to evaluate BERTCWS. In these cases, BERTCWS obtained fair performance, achieving F1-scores of 0.66, 0.67, and 0.62, respectively. The results suggest that the proposed method is effective in segmenting Chinese texts on the PKU, MSR, and GEO datasets. Strategies S4 to S6 test the performance of BERTCWS on pairwise combinations of the datasets. In strategy S7, all three test sets are combined.
Two important observations can be drawn from these seven strategies (S1 to S7). First, adding more test data does not help the performance of the proposed model; for instance, the results of S4 to S6 are even worse than those of S1 to S3, which used single test datasets. This indicates the difficulty of domain-geoscience Chinese word segmentation. Second, geoscience texts are quite different from general texts, and these differences directly degrade word segmentation performance on the hybrid corpus. The result is not surprising: domain-geoscience texts often contain longer words, such as 'Biotite plagioclase gneiss' (composed of seven characters in Chinese). Consequently, when all corpora are mixed together and a single parameter is used, the hybrid corpora confuse rather than help the model during evaluation, because one threshold can correctly segment either the long words or the short words, but not both.
As described previously, BERTCWS can identify both short and long new terms based on multi-granular segmentation. Sample results on the GEO test set, with lengths ranging from 4 to 12 characters, segmented by BERTCWS(Seg+Dis), are presented in Table 5. By adjusting the threshold, BERTCWS can discover new words such as 'Diopside bearing biotite clinopyrite gneiss' and 'Feldspar gneiss in cordierite biotite'. This is not surprising: BERTCWS is a learning model with the ability to learn character combinations of different lengths. In the dataset, 70% of the words are unigrams, and the remaining 14%, 11%, 3%, and 2% are bigrams, 3-grams, 4-grams, and longer n-grams, respectively. The detection rate of the annotated words of different lengths is illustrated in Figure 4. The detection rate of one-character words is the highest on all three datasets, and the detection rates of the other word lengths are all above 0.5, showing acceptable performance in detecting multi-character words. In future work, we will investigate how to further enhance the detection of multi-character words.

Capability of new word identification
With the most effective parameter configuration, further experiments were conducted comparing BERTCWS with four off-the-shelf CWS tools. The R_OOV and R_IV performance on the proposed GEO dataset is presented in Table 6. Note that BERTCWS(Seg) means that we keep only the segmenter, whereas BERTCWS(Seg+Dis) means that we keep both stages, the segmenter and the discriminator, for Chinese word segmentation. Because our proposed method is purely unsupervised, it has no training vocabulary against which test words can appear, so we do not report R_IV for our approach.
As indicated by Table 6, in terms of R_OOV and R_IV, identifying new words is difficult. A comparison of the OOV rates indicates the domain-specific characteristics of the task. The dictionary-based approach (Jieba) and the supervised approaches (ICTCLAS, PKUSEG, and THULAC) achieve high performance in terms of R_IV; however, their recall rates on OOV terms drop quickly when the domain changes. By contrast, our proposed method performs best among all models on the task of new word recognition, reaching R_OOV values of 66.3% and 71.7% for BERTCWS(Seg) and BERTCWS(Seg+Dis), respectively. These results highlight BERTCWS's capability for domain adaptation. This is not surprising, because its competitors are trained on a large number of domain-general terms and lose the capability of recognizing geoscience terms.
These results and findings indicate that BERTCWS is an effective approach to tackling the domain-specific word segmentation issue, and that both general terms and geoscience terms can be identified.

Comparison with other models
We also compared our proposed approach with other advanced Chinese word segmentation approaches (shown in Table 7). MM is the maximum matching approach, which depends on a predefined dictionary and segments the input texts into words using the longest-match strategy (Qiu et al. 2018). TopWORDS is an unsupervised, top-down word segmentation approach that requires no given dictionary; it relies on a statistical model to discover new words (Deng et al. 2016). GeoSegmenter is a statistical architecture for domain-specific Chinese word segmentation based on CRF (Huang, Du, and Chen 2015). DICND first detects Chinese word boundaries based on edge likelihood and then segments the terms by calculating the values in the word vectors (Liang et al. 2019).
As indicated in Table 8, DGeoSegmenter obtained the best results (an F1-score of 86.6%), and the second-best results were achieved by GeoSegmenter (an F1-score of 85.9%). As expected, the supervised approaches are trained on large amounts of labelled data and can learn various patterns and rules of word formation from the training set. By contrast, our proposed approach depends solely on the BERT model, a pretrained model provided by Google. The model can be easily obtained, and segmentation results of different granularities can be produced by adjusting the threshold. More importantly, it is a purely unsupervised approach and is nevertheless effective in domain-specific Chinese word segmentation. Table 9 shows the results of segmenting geological reports with the method proposed in this paper. The results show that the geological reports are effectively segmented into specialized words, toponyms and institutions, rocks, lithologies, and so on. The Jieba model, which relies on lexicon matching because it is not trained on the geological corpus, incorrectly segments the geological terms 'Island arc' and 'Hanshan deer farm'. ICTCLAS segments better than Jieba, especially for 'Maohi Gadaba' and 'Alatandabadobanzhi'.

Error analysis
We also analysed the kinds of cases in geological reports that the proposed method cannot handle (see Table 10): (1) segmentation results are inconsistent across contexts; for example, 'metallogenic zone' is sometimes split correctly and sometimes not, so the segmentation is not robust globally, and context constraints need to be added to ensure consistent splitting; (2) words whose features do not differ significantly from their context cannot be split, such as 'non-ferrous precious metals'. In future work, we will try to add global constraints to improve the accuracy of word segmentation.

Discussion
As described previously, word and term discovery is a critical challenge in Chinese text processing; its goal is to recognize unknown and unregistered Chinese words/terms in the texts of interest. The discovery of new words/terms is often intertwined with Chinese word segmentation, which further compounds the difficulty. Most existing approaches/tools for Chinese word segmentation rely on a comprehensive predefined dictionary or a sufficient training corpus (which usually must be manually segmented and labelled). However, the performance of these methods drops dramatically once most of an input text is not covered by the given dictionary or the target texts differ considerably from the training corpus (e.g. when a model trained on a domain-general corpus is used to segment domain-specific texts). Hence, it is necessary to develop an unsupervised model. Our proposed approach uses the pretrained BERT model, which is trained on massive multilingual text, and constructs an incidence matrix to identify new words/terms in Chinese texts; it is thus a domain-independent method for new word discovery in Chinese text processing. Note that recognizing new words without relying on a set of known/registered words increases the flexibility of the model in detecting new words. As indicated in Figure 5, when only a small amount of labelled data is available, our proposed unsupervised multi-granular method outperforms the state-of-the-art supervised models at word segmentation. For new word/term detection, the proposed method achieved significantly better performance than all baseline models owing to its multi-granular segmentation ability.
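The multi-granular behaviour referred to above can be sketched as follows: given a closeness score for each adjacent character pair (the role played by the incidence matrix), raising or lowering the cut threshold yields finer or coarser words, and a maximal-word-length cap bounds how far characters may be chained. The scores below are invented for illustration and are not values from the paper; the example term is the seven-character Chinese rendering of 'biotite plagioclase gneiss' discussed later.

```python
def segment_by_threshold(chars, tightness, threshold, max_len=6):
    """Cut between chars[i] and chars[i+1] when tightness[i] falls below
    the threshold, or when the current word would exceed max_len characters."""
    words, current = [], chars[0]
    for i in range(len(chars) - 1):
        if tightness[i] < threshold or len(current) >= max_len:
            words.append(current)
            current = chars[i + 1]
        else:
            current += chars[i + 1]
    words.append(current)
    return words


chars = list("黑云斜长片麻岩")               # 'biotite plagioclase gneiss'
scores = [0.9, 0.55, 0.8, 0.85, 0.9, 0.95]  # hypothetical pair scores
# Low threshold: coarse granularity keeps the full seven-character term
# (the cap is raised to 7 so the long geoscience term is permitted).
print(segment_by_threshold(chars, scores, 0.4, max_len=7))
# Higher threshold: finer granularity cuts at the weakest pair.
print(segment_by_threshold(chars, scores, 0.7, max_len=7))
```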
The maximal word length influences the results of Chinese word segmentation and new word detection on the GEO test set (as demonstrated in Figure 5). As seen from Figure 5, BERTCWS(Seg+Dis) can adapt to various word lengths and outperforms the other approaches at the corresponding segmentation settings, indicating that it is more adaptive to geoscience-domain segmentation. For the geoscience domain, BERTCWS(Seg+Dis) attained the highest F1-score when the maximal word length was set to 6. This is because words in the geosciences are generally long, such as 'biotite plagioclase gneiss' (seven characters in its Chinese form). In our opinion, the incidence matrix is one of the most important factors in our approach: in theory, given an appropriate incidence matrix, we can obtain the correct segmentation. Therefore, the key goal is to construct an appropriate and practical formula for computing the incidence matrix. Based on the theory of n-gram statistics, transition probabilities and word frequency statistics are natural methods for calculating the matrix.
Other measures, such as pointwise mutual information and entropy, can also provide solutions. However, the pretrained language model BERT can predict the probability of a character given its context; computing the incidence matrix from the BERT model therefore better captures the tightness of character combinations within sentences.
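As a concrete point of comparison, the n-gram baseline mentioned above can be sketched as a pointwise-mutual-information score over adjacent character pairs computed from raw corpus counts; BERTCWS instead derives the closeness scores from BERT's contextual character-probability predictions. This toy version only illustrates the shape of the computation, not the paper's formula.

```python
import math
from collections import Counter


def pmi_tightness(corpus):
    """Return PMI(a, b) for every adjacent character pair in the corpus.
    Higher PMI means the two characters co-occur more often than chance,
    i.e. they are more likely to belong to the same word."""
    unigrams, bigrams = Counter(), Counter()
    total = 0
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
        total += len(sentence)
    n_pairs = sum(bigrams.values())
    tightness = {}
    for pair, count in bigrams.items():
        p_ab = count / n_pairs
        p_a = unigrams[pair[0]] / total
        p_b = unigrams[pair[1]] / total
        tightness[pair] = math.log(p_ab / (p_a * p_b))
    return tightness


# Tiny toy corpus: 'ab' occurs far more often than 'ba', so its PMI is higher.
t = pmi_tightness(["abab", "ab"])
print(t["ab"] > t["ba"])  # True
```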

Conclusions and further work
Segmenting given sentences into meaningful words is the classic Chinese word segmentation problem and is a key step for further processing of Chinese texts, such as information retrieval, text mining, and knowledge discovery. In the geoscience domain, existing Chinese word segmentation tools/models cannot be applied directly, as they often require a set of predefined dictionaries and/or large domain-relevant training datasets. To tackle this problem, we proposed an unsupervised Chinese word segmentation approach named BERTCWS. In this framework, a segmenter first identifies both general and domain-specific terms; the obtained results are then fed into a discriminator to validate the correctness of the segmented words. Experimental results on three datasets (two domain-general datasets and one domain-geoscience dataset) demonstrate the effectiveness of the proposed approach. More importantly, the proposed method can be extended to other domains and provides multi-granular segmentation results.
Although BERTCWS demonstrates improvement over other Chinese word segmentation tools/models, the accuracy of domain-specific new word recognition still has scope for improvement. In future work, we plan to construct and train pretrained models for the geoscience domain to further improve segmentation performance.