short-paper

Text segmentation and Chinese site search

Authors:
Liyuan Zhou (NICTA & ANU), David Hawking (Microsoft & ANU), Paul Thomas (CSIRO & ANU)

ADCS '15: Proceedings of the 20th Australasian Document Computing Symposium, December 2015, Article No. 11, Pages 1–4. https://doi.org/10.1145/2838931.2838940

Published: 08 December 2015

ABSTRACT

Automatic segmentation and overlapping bigrams are the most common methods for overcoming the lack of explicit word boundaries in Chinese text. Past studies have compared their effectiveness, but findings have been equivocal and site search has been little studied. We compare representatives of the two approaches using a 465,000-page crawl and test queries applicable to the university context. A total of 503 pairs of result sets were judged by 56 Chinese students.

Although there are differences on certain queries, we find no overall advantage to either method. To understand the merits of each approach, we analyze cases where they performed differently. Our analysis enumerates situations which favour segmentation, and those which favour bigrams. We observe that further improvements in segmentation accuracy will not improve retrieval effectiveness.
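
As an illustration of the two indexing approaches compared above, the following Python sketch tokenizes the same string with overlapping bigrams and with a simple dictionary-based segmenter. It is not the authors' implementation: the forward maximum matching segmenter, the function names, and the toy lexicon are assumptions chosen for brevity, and the abstract does not state which segmenter was evaluated.

```python
def overlapping_bigrams(text: str) -> list[str]:
    """Return every pair of adjacent characters in the text.

    A one-character string is returned as a single token.
    """
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def forward_max_match(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Greedy longest-match segmentation against a lexicon (assumed method).

    At each position, take the longest lexicon entry of at most max_len
    characters; fall back to a single character when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon, for illustration only.
TOY_LEXICON = {"北京", "大学", "北京大学", "图书馆"}

if __name__ == "__main__":
    phrase = "北京大学图书馆"  # "Peking University Library"
    print(overlapping_bigrams(phrase))
    print(forward_max_match(phrase, TOY_LEXICON))
```

With this toy lexicon the segmenter emits only 北京大学 and 图书馆, so the shorter word 大学 is not an index term for this phrase. The bigram tokenizer does emit 大学, but it also emits cross-boundary pairs such as 学图 that correspond to no word. Differences of this kind are one plausible reason the two approaches can rank results differently on individual queries.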


Index Terms

1. Information systems
  1. Information retrieval
    1. Information retrieval query processing


Published in
ADCS '15: Proceedings of the 20th Australasian Document Computing Symposium
December 2015
72 pages
ISBN: 9781450340403
DOI: 10.1145/2838931
Editors: Laurence A. F. Park (Western Sydney University), Sarvnaz Karimi (CSIRO)
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Chinese IR
segmentation
site search
Qualifiers
- short-paper
- Research
- Refereed limited
Acceptance Rates
ADCS '15 Paper Acceptance Rate: 5 of 14 submissions, 36%
Overall Acceptance Rate: 30 of 57 submissions, 53%