DOI: 10.1145/2838931.2838940
short-paper

Text segmentation and Chinese site search

Published: 8 December 2015

ABSTRACT

Automatic segmentation and overlapping bigrams are the most common methods for overcoming the lack of explicit word boundaries in Chinese text. Past studies have compared their effectiveness, but findings have been equivocal and site search has been little studied. We compare representatives of the two approaches using a 465,000-page crawl and test queries applicable to the university context. In total, 56 Chinese students judged 503 pairs of result sets.

Although there are differences on certain queries, we find no overall advantage to either method. To understand the merits of each approach, we analyze cases where they performed differently. Our analysis enumerates situations which favour segmentation, and those which favour bigrams. We observe that further improvements in segmentation accuracy will not improve retrieval effectiveness.
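As a rough illustration of the two indexing approaches the abstract compares, the sketch below tokenizes one string both ways. The segmenter shown is a simple greedy forward maximum matching routine, used here only as a stand-in; the abstract does not say which segmenter the study used. The function names, the toy lexicon, and the sample query are all hypothetical.

```python
def overlapping_bigrams(text: str) -> list[str]:
    """Return every pair of adjacent characters; a lone character is kept as-is."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def forward_max_match(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Greedy left-to-right segmentation: take the longest lexicon entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no lexicon entry matches.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon and query, chosen only for illustration.
LEXICON = {"北京", "大学", "北京大学", "图书馆"}
QUERY = "北京大学图书馆"

bigram_terms = overlapping_bigrams(QUERY)
word_terms = forward_max_match(QUERY, LEXICON)

print(bigram_terms)
print(word_terms)
```

The bigram index needs no lexicon but emits terms that straddle word boundaries (such as 京大 and 学图), whereas the segmented index depends entirely on lexicon coverage and on the matching strategy.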


    • Published in

      ADCS '15: Proceedings of the 20th Australasian Document Computing Symposium
      December 2015
      72 pages
      ISBN: 9781450340403
      DOI: 10.1145/2838931

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Qualifiers

      • short-paper
      • Research
      • Refereed limited

      Acceptance Rates

      ADCS '15 Paper Acceptance Rate: 5 of 14 submissions, 36%
      Overall Acceptance Rate: 30 of 57 submissions, 53%