ABSTRACT
Automatic segmentation and overlapping bigrams are the most common methods for overcoming the lack of explicit word boundaries in Chinese text. Past studies have compared their effectiveness, but findings have been equivocal and site search has been little studied. We compare representatives of the two approaches using a 465,000-page crawl and test queries applicable to the university context. In total, 503 pairs of result sets were judged by 56 Chinese students.
Although there are differences on certain queries, we find no overall advantage to either method. To understand the merits of each approach, we analyze cases where they performed differently. Our analysis enumerates situations which favour segmentation, and those which favour bigrams. We observe that further improvements in segmentation accuracy will not improve retrieval effectiveness.
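To make the two indexing approaches concrete, the following minimal Python sketch tokenizes a short string both ways. It is an illustration only, not the systems compared in the study: `overlapping_bigrams` emits every pair of adjacent characters, while `forward_max_match` is a simple dictionary-based stand-in for automatic segmentation. Both function names, the toy lexicon, and the `max_len` parameter are assumptions introduced here.

```python
def overlapping_bigrams(text: str) -> list[str]:
    """Return every pair of adjacent characters (overlapping bigrams)."""
    if len(text) < 2:
        # A single character (or empty string) has no bigram; keep it as-is.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def forward_max_match(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Greedy forward maximum matching against a lexicon.

    Hypothetical stand-in for an automatic segmenter: at each position,
    take the longest lexicon entry that matches, falling back to a
    single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon (assumed for illustration only).
LEXICON = {"中文", "信息", "检索"}
QUERY = "中文信息检索"  # "Chinese information retrieval"

print(overlapping_bigrams(QUERY))         # ['中文', '文信', '信息', '息检', '检索']
print(forward_max_match(QUERY, LEXICON))  # ['中文', '信息', '检索']
```

The bigram output contains the three dictionary words plus two cross-boundary pairs (文信, 息检) that are not words. This is the basic trade-off between the approaches: bigrams need no lexicon but index spurious terms, whereas segmentation depends on the lexicon and the matching strategy.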
Index Terms
- Text segmentation and Chinese site search