Leveraging Rich Linguistic Features for Cross-domain Chinese Segmentation

This paper describes the system we use for the Chinese segmentation task in the 3rd CIPS-SIGHAN bakeoff. We adopt a character sequence labeling method for segmentation, and in order to improve segmentation accuracy across multiple domains, we present a CRF-based Chinese segmentation system integrating supervised, unsupervised, and lexical features. We first preliminarily segment the target data using a CRF model trained over the three types of features mentioned above; from this result, new words are detected and absorbed into the lexicon. To generalize across different domains, we then perform a second segmentation pass with the updated lexicon. OOV recognition is further improved with refined post-processing. All the features we use share a unified feature template trained by CRF. Our system achieves a competitive F score of 0.9730 in this bakeoff.


Introduction
Words are the fundamental units in natural language understanding. Since people do not retain boundary information between words in practical use, Chinese Word Segmentation (CWS) is the very first step in Chinese information processing. A considerable amount of research has shown that character sequence labeling is a simple but effective formulation of the Chinese word segmentation task (Xue and others, 2003; Peng et al., 2004; Low et al., 2005; Zhao et al., 2006a), among which sequence labeling based on CRF (Lafferty et al., 2001) is widely used with attractive performance. However, most existing segmentation systems rely heavily on the data the model was trained on. Segmentation performance tends to drop significantly when the test data differs greatly from the training data in phraseology and vocabulary. Exploiting multi-domain corpora for model learning could solve this problem directly, but labeling corpora manually is expensive, so it is unrealistic to label corpora at scale.
So far there are two ways to improve the performance of cross-domain word segmentation systems. The first, proposed in (Zhao and Kit, 2007; Zhao and Kit, 2008; Zhao and Kit, 2011), is a unified framework that integrates supervised and unsupervised segmentation, taking full advantage of unsupervised segmentation to discover new words from untagged corpora while retaining the ability of supervised segmentation to recognize known words. This generalizes the segmentation system to some extent. The second way is to build a multi-layer segmentation system: the first layer is a set of distinct word segmentation subsystems, each of which might perform outstandingly on a specific domain, and the second layer combines the outputs of these subsystems, determining the most probable segmentation boundaries on the test dataset. Using this method, Gao and Vogel (2010) achieved top performance in three of the four test domains in Bakeoff-2010 (Zhao and Liu, 2010). In this paper we follow the first method to improve cross-domain segmentation performance, while also adding some of the effective features used in the second. The handling of OOV words is improved by adding a lexical feature and new word discovery.
In Section 2, we describe the features adopted in our system. Section 3 presents how we discover new words from preliminary segmentation results and how we expand the lexicon to update the lexical feature before segmenting the test data again to improve segmentation performance.

The experimental results on the Bakeoff datasets, compared with the best official results, are provided in Section 4. Section 5 concludes the paper.

Table 1: Tag sequences for words of different lengths
Word length   Tag sequence for a word
1             S
2             B E
3             B B2 E
4             B B2 B3 E
5             B B2 B3 M E
6 or more     B B2 B3 M ... M E

System Description
We formulate the Chinese word segmentation task as a sequence labeling problem and use CRF to train the segmentation model. Our implementation of the CRF-based CWS system uses the CRF++ package by Taku Kudo. We regard "，", "。", "？", "！", "；" as sentence boundaries, and both the training and test corpora are split at these boundaries. Zhao et al. (2006b) showed that CRF segmentation trained with the 6-tag set outperforms other tag sets, so we adopt the 6-tag set (B, B2, B3, M, E, S) to label the characters in words. Table 1 explains how to label the characters in words of different lengths. We follow the six n-gram character features used in (Zhao et al., 2006b; Zhao and Kit, 2008), in which C represents a character and the subscripts -1, 0 and 1 denote the previous, current and next character. The same six n-gram feature template is also applied to the other features in our system.
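The mapping from word length to tag sequence under this 6-tag scheme can be sketched as follows (a minimal illustration; the function names are ours, not part of any released system):

```python
def tags_for_word(length: int) -> list[str]:
    """Tag sequence for a word under the 6-tag set (B, B2, B3, M, E, S)."""
    if length == 1:
        return ["S"]
    tags = ["B", "B2", "B3"][:length - 1]      # up to the first three characters
    tags += ["M"] * (length - 1 - len(tags))   # middle characters beyond the third
    return tags + ["E"]                        # last character

def label_sentence(words):
    """Flatten a segmented sentence into (character, tag) training pairs."""
    return [(ch, tag)
            for w in words
            for ch, tag in zip(w, tags_for_word(len(w)))]
```

For example, a two-character word receives B E, and a five-character word receives B B2 B3 M E.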

Character Type Features
We simply classify all characters by their Unicode code points into 5 classes: Chinese character (C), English character (E), number (N), punctuation (P) and others (O). We denote the character type feature as CTF and define its feature template accordingly.

Conditional Entropy Feature
Gao and Vogel (2010) improved segmentation performance on the Bakeoff-2010 (Zhao and Liu, 2010) dataset by using a conditional entropy feature. The forward conditional entropy for a specific character C, denoted H_f(C), is the entropy over the characters that may appear in the position following C throughout the corpora, while the backward conditional entropy, denoted H_b(C), is the entropy over the characters that may appear in the position preceding C. We can mix unlabeled corpora from multiple domains to calculate the forward and backward conditional entropy, which makes this feature more domain adaptive. Forward and backward conditional entropy can be computed efficiently with the aid of statistical bigram matrices.
Continuous conditional entropy values are mapped into discrete values by the method proposed by Gao and Vogel (2010). The template is similar to the character feature template, and the forward conditional entropy template mirrors the backward one.
The forward conditional entropy feature templates are defined analogously to the character n-gram templates.
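As a sketch (not the authors' exact implementation), the forward and backward conditional entropies can be computed from bigram counts over unsegmented text:

```python
import math
from collections import Counter, defaultdict

def conditional_entropies(corpus_lines):
    """Forward/backward conditional entropy per character, from bigram counts."""
    fwd = defaultdict(Counter)   # fwd[c][d] = count of character d following c
    bwd = defaultdict(Counter)   # bwd[c][d] = count of character d preceding c
    for line in corpus_lines:
        for a, b in zip(line, line[1:]):
            fwd[a][b] += 1
            bwd[b][a] += 1

    def entropy(counter):
        total = sum(counter.values())
        return -sum((n / total) * math.log2(n / total) for n in counter.values())

    h_f = {c: entropy(cnt) for c, cnt in fwd.items()}  # H_f(C)
    h_b = {c: entropy(cnt) for c, cnt in bwd.items()}  # H_b(C)
    return h_f, h_b
```

A character followed by many different characters with similar frequency gets high H_f, suggesting a likely word boundary after it.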

Lexical Feature
Appropriate use of lexical features has been shown to improve segmentation, so we adopt the definition of lexical features from (Gao and Vogel, 2010). The feature L_begin(C) represents the maximum length of a lexicon word beginning with character C, found by forward maximum matching from C in the current sentence, and L_end(C) represents the maximum length of a lexicon word ending with character C, found by backward maximum matching from C. In forward and backward maximum matching we only consider words of length 2 or greater, and the feature value is 0 where matching fails. When a feature value would be 6 or greater, we cap it at 6. We hope to increase performance by using a large-scale cross-domain lexicon. Six feature templates are defined for L_begin(C); the six feature templates of L_end(C) can be inferred analogously.
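A minimal sketch of computing L_begin and L_end for each character position, under the assumptions stated above (matches of length at least 2, value 0 on failure, values capped at 6):

```python
def lexical_features(sentence, lexicon, cap=6):
    """L_begin[i]: length of the longest lexicon word (>= 2 chars) starting at i.
    L_end[i]: length of the longest lexicon word ending at i. Capped at `cap`."""
    n = len(sentence)
    longest = max((len(w) for w in lexicon), default=0)
    l_begin, l_end = [0] * n, [0] * n
    for i in range(n):
        for k in range(min(longest, n - i), 1, -1):       # longest match first
            if sentence[i:i + k] in lexicon:
                l_begin[i] = min(k, cap)
                break
        for k in range(min(longest, i + 1), 1, -1):
            if sentence[i - k + 1:i + 1] in lexicon:
                l_end[i] = min(k, cap)
                break
    return l_begin, l_end
```

Using a set for the lexicon makes each membership test O(1), so the scan is linear in sentence length times the maximum word length.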

Accessor variety feature
Accessor variety (AV), proposed by Feng et al. (2004), can be used to measure how likely a substring is to be a Chinese word. Zhao and Kit (2007) observed that this method agrees with the method proposed by Harris (1970) for finding morphemes in an unfamiliar language. The experiments of Zhao and Kit (2008) showed that the AV feature improves the performance of a CRF segmentation model on the Bakeoff-2003, Bakeoff-2005 and Bakeoff-2006 datasets (Sproat and Emerson, 2003; Emerson, 2005; Levow, 2006), and achieved the best performance on the closed test in Bakeoff-2008 (Chen and Jin, 2008). Therefore we employ the AV feature in this paper, and we further improve performance by making better use of it. For a substring s, the AV feature is defined as AV(s) = min{L_av(s), R_av(s)}, in which L_av(s) and R_av(s) are the numbers of distinct characters appearing before and after s respectively, with the marks at the beginning and end of a sentence also counted as accessors.
Our use of AV is similar to (Zhao and Kit, 2008; Yang et al., 2011): we consider the AV values of substrings of length at most 5 in a sentence and design several feature templates accordingly. The AV value of a substring s is discretized into a value t, which serves as the feature value. The difference between our method and theirs is that for a substring s, we mark the feature value of s only on the first character of s, not on every character of s. This representation follows the lexical feature described in Section 2.3, because we believe labeling this way better highlights boundary information between words. Table 2 shows the differences in detail. For instance, consider all substrings of 4 characters: the substring "在我心中 (in my heart)" has AV feature value t = 1, so we update the feature on its first character accordingly.
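A sketch of how AV values might be collected from unsegmented text; the discretization t = floor(log2(AV)) used here is an illustrative assumption, not necessarily the paper's exact formula:

```python
import math
from collections import defaultdict

def accessor_variety(corpus_lines, max_len=5):
    """AV(s) = min(#distinct left accessors, #distinct right accessors)
    for every substring s of length <= max_len."""
    left, right = defaultdict(set), defaultdict(set)
    for line in corpus_lines:
        padded = "^" + line + "$"          # boundary marks count as accessors
        for i in range(1, len(padded) - 1):
            for k in range(1, max_len + 1):
                if i + k > len(padded) - 1:
                    break
                s = padded[i:i + k]
                left[s].add(padded[i - 1])
                right[s].add(padded[i + k])
    return {s: min(len(left[s]), len(right[s])) for s in left}

def av_feature(av_value):
    """Assumed discretization for illustration: t = floor(log2(AV))."""
    return int(math.log2(av_value)) if av_value >= 1 else 0
```

In our variant, av_feature(AV(s)) would then be attached only to the first character of s rather than to every character it covers.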
To demonstrate the effectiveness of our improved AV feature, we followed the experimental settings of (Zhao and Kit, 2008; Yang et al., 2011) and ran experiments on the Bakeoff-2005 dataset (Emerson, 2005) and the simplified Chinese dataset of Bakeoff-2010 (Zhao and Liu, 2010). OldAV stands for their AV feature, while ours is named NewAV. The six n-gram character features and the character type feature described in Section 2.1 were used in every experiment. The evaluation metric is the F score, F = 2RP/(R + P), where R is recall and P is precision. The AV statistics for Bakeoff-2005 were created by combining the corresponding training and test datasets with segmentation marks removed; for Bakeoff-2010, the training, unlabeled and test corpora were combined in the same way. The experimental results in Table 3 indicate that our improvement to the AV feature is effective, as its performance is better than the old methods. These results were not post-processed, so as to make the segmentation performance easy to compare.

Post-processing
Post-processing aims at handling segmentation errors in English words, Arabic numeral strings and URLs. Such character sequences should be treated as single segmentation units, but our system may split them incorrectly. Table 4 shows an example of an incorrectly segmented URL: raw is the original sentence, result is the output of segmentation, and final is the result after post-processing. To fix this kind of error, we must ensure that when the gaps are removed from a segmented sentence, it corresponds exactly to the original character sequence. Here is the procedure we used to restore URL segmentation errors. First, we store the original sentence in a string; then we save the segmented result in a list, where each element is a word, indexed from 0.
1. Use a regular expression to find the start and end positions of the URL in the original sentence. For http://t.cn/aBPxzO, the start and end indices are 4 and 22 respectively.
2. Accumulate word lengths in the word list from left to right; this tells us that the URL starts at index 2 and ends at index 3 in the word list.
3. Combine the 2nd and 3rd words in the word list into one word.
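The three steps above can be sketched as follows. This is a simplified illustration: the URL pattern and the exact split of the segmenter's output are assumptions, not the system's actual rules:

```python
import re

# Assumed pattern for illustration: an http(s) URL up to whitespace or a CJK character.
URL_RE = re.compile(r"https?://[^\s\u4e00-\u9fff]+")

def restore_urls(raw, words):
    """Re-merge words that a URL was wrongly split across.
    raw: original sentence; words: segmented words whose concatenation equals raw."""
    for m in URL_RE.finditer(raw):
        start, end = m.span()                  # step 1: character span of the URL
        offset, first, last = 0, None, None
        for idx, w in enumerate(words):        # step 2: map char span -> word indices
            if offset < end and offset + len(w) > start:
                first = idx if first is None else first
                last = idx
            offset += len(w)
        if first is not None and last > first: # step 3: merge the covered words
            words = words[:first] + ["".join(words[first:last + 1])] + words[last + 1:]
    return words
```

English words and numeral strings would follow the same span-mapping logic with a different regular expression.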
English words and Arabic numeral strings can be handled in the same way.

Improving the Segmentation Performance of New Words
The segmentation system described in Section 2 is not very stable on new words. A new word may be segmented correctly in one context while another context leads to a mistake. For example, the word "涅维拉济莫夫 (Nevyrazimov)" in the context "文官涅维拉济莫夫在起草一封贺信 (the clerk Nevyrazimov is drafting a congratulatory letter)" is segmented correctly, but in the sentence "于是涅维拉济莫夫开始绞尽脑汁 (hence Nevyrazimov began to rack his brain)" it is segmented wrongly.

In Table 5, Webdict is used to calculate the lexical feature for both training and testing in the Open setting; in the Refined setting, Webdict is used for training only, and the method described in Section 3 is then applied for performance improvement. All the experiments in this section use the post-processing described in Section 2.5. We tested our system on the Bakeoff-2005 and Bakeoff-2010 datasets, with F score as the main metric. From the Refined results in Tables 6 and 7, we observe that our new word detection strategy improves R_OOV over all the Open systems in general, and the Refined model gives more balanced F scores across all the datasets.
The two Bakeoff datasets show that our Open and Refined feature combinations are effective. Since this Bakeoff provides no training corpus, an open data test is required, so we used the Open and Refined feature combinations of Table 5. To make the model more cross-domain adaptive, we used a large amount of unlabeled corpora to extract the conditional entropy feature and the AV feature: a web crawler was used to collect a total of 1.5 GB of corpora in five domains, including finance, literature, news, microblog and novel. The data we used is described as follows:

• PKU-Raw: PKU-Corpus without segmentation boundaries.
• Web-Corpus: the combination of all unlabeled corpora collected by the web crawler.
Finally, we used PKU-Corpus as training data, extracted the conditional entropy feature from Entropy-Corpus and the AV feature from AV-Corpus, and trained the CRF word segmentation model together with the character feature and the character type feature. Our results on this bakeoff are shown in Table 8, achieving a competitive F score of 0.9730. From this table, we can see that the Refined feature combination outperforms Open, which further confirms that new word detection is critical for cross-domain Chinese segmentation.

Conclusion
In this paper we attempted to implement a word segmentation system able to handle the cross-domain situation. We combined supervised and unsupervised global features and improved OOV recognition by adding a cross-domain lexical feature. Discovering new words from the target test set and then recomputing the lexical feature to refine the segmentation results makes the model more domain adaptive.

Our system still has deficiencies that could be improved in three respects. First, we used only one kind of unsupervised feature, and other unsupervised features or feature combinations might achieve better performance. Second, we coined all the features into one set of templates mainly for its simplicity in practice; however, more fitting templates might exist for different features. Finally, our rule-based method for discovering new words could be replaced by automatic discovery.

Table 3: Comparison experiment on the AV feature; n-gram features and the character type feature were used in each experiment.

Table 4: Example of URL post-processing.
raw    点击网址http://t.cn/aBPxzO
result 点击 网址 http://t.cn/aBPxzO
final  点击 网址 http://t.cn/aBPxzO

The new words detected in this way form a word list Lexicon_test. If Lexicon_test contains two words in an inclusion relation, we keep only the longer one. Combining Lexicon_train and Lexicon_test yields a new word list named Lexicon_new. This new word list is used to calculate the lexical feature of the test corpora and update the segmentation result.
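The lexicon update can be sketched as below (an illustration of the inclusion-filtering and merge step; function names are ours):

```python
def merge_lexicons(lexicon_train, lexicon_test):
    """Drop any Lexicon_test word contained in a longer Lexicon_test word,
    then union with Lexicon_train to form Lexicon_new."""
    kept = {w for w in lexicon_test
            if not any(w != v and w in v for v in lexicon_test)}
    return set(lexicon_train) | kept
```

The quadratic containment check is acceptable here because the detected new-word list is small relative to the full lexicon.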
To solve this sort of problem, we find new words by rules, add them to the lexicon, recompute the lexical features of the test corpora, and finally segment the test corpora again. Let the lexicon used for extracting lexical features when training the segmentation model be Lexicon_train, and let PKU_bigram be the bigram statistics counted, without smoothing, on the segmented People's Daily corpora of 1998 and 2000. For the preliminary segmentation result, we deem a word w a new word if it meets the following conditions:
1. w has length between 2 and 6, or w has length greater than 6 and is a foreign name (the separator dot • appears in w);
2. w does not exist in Lexicon_train;
3. w is not a Chinese name;
4. w is not the concatenation of w_{-1} and w_0 for any (w_{-1}, w_0) ∈ PKU_bigram.
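A sketch of these rules (helper names are illustrative; the `is_chinese_name` predicate is assumed to be supplied elsewhere):

```python
def is_new_word(w, lexicon_train, pku_bigram, is_chinese_name):
    """Decide whether word w from the preliminary segmentation is a new word.
    pku_bigram: set of (w_prev, w_cur) pairs from segmented People's Daily text."""
    # Condition 1: length 2..6, or longer foreign name containing the dot •
    length_ok = 2 <= len(w) <= 6 or (len(w) > 6 and "•" in w)
    # Conditions 2 and 3: unseen in the training lexicon, not a Chinese name
    if not length_ok or w in lexicon_train or is_chinese_name(w):
        return False
    # Condition 4: reject w if it is the concatenation of any known bigram
    for i in range(1, len(w)):
        if (w[:i], w[i:]) in pku_bigram:
            return False
    return True
```

Words passing all four conditions would be added to Lexicon_test for the second segmentation pass.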

Table 5: Feature combinations. CF represents the 6 n-gram character features, CTF the character type feature, EF the conditional entropy feature and AV the accessor variety feature.

Experiment
To evaluate the performance of our method, we considered four kinds of feature combination, shown in Table 5, where Closed means the closed test and Open means the open test, in which we used a cross-domain lexicon, Webdict. Refined means that we added the new word processing proposed in Section 3 on top of Open. Refined requires corpora for the bigram statistics and a lexicon for training; because of the limited scale of labeled data, we had barely sufficient simplified Chinese training data and lexicon, so we did not run Refined on the AS and CityU datasets of Bakeoff-2005.

Table 6: Test results on the Bakeoff-2005 dataset.

Table 7 shows the test results on the Bakeoff-2010 simplified Chinese dataset. When computing the conditional entropy feature and the AV feature, we combined all the simplified Chinese Bakeoff-2010 corpora, with segmentation boundaries removed, to create the statistical feature values. "Best closed" and "Best open" show the best results on the official closed and open tests. Our closed test result on test set A differs greatly from "Best closed", yet the results are closer to "Best closed" on the other test sets. The performance of Closed improves considerably over the baseline. In addition, our method exceeds "Best open" on datasets C and D in the open test, and is slightly poorer than the best on datasets A and B, though the differences are not significant.

Table 7: Test results on the Bakeoff-2010 dataset.

Table 8: Results on the Bakeoff-2014 dataset.