ABSTRACT
Inventories of manually compiled dictionaries usually serve as a source for word senses. However, they often include many rare senses while missing corpus- or domain-specific senses. We present a clustering algorithm called CBC (Clustering By Committee) that automatically discovers word senses from text. It first discovers a set of tight clusters, called committees, that are well scattered in the similarity space. The centroid of the members of a committee is used as the feature vector of the cluster. We then assign each word to its most similar clusters. After assigning a word to a cluster, we remove from the word's feature vector the features it shares with that cluster. This allows CBC to discover the less frequent senses of a word and to avoid discovering duplicate senses. Each cluster that a word belongs to represents one of its senses. We also present an evaluation methodology for automatically measuring the precision and recall of discovered senses.
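The assignment step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. The abstract does not give the similarity measure, the feature weighting, or a stopping criterion, so the cosine similarity, the raw-count features, the `threshold` parameter, and all names below are assumptions made for illustration. Committee discovery is skipped and the two committees are written by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts (assumed measure)."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(members):
    """Average the feature vectors of a committee's members."""
    total = {}
    for vec in members:
        for f, w in vec.items():
            total[f] = total.get(f, 0.0) + w
    return {f: w / len(members) for f, w in total.items()}


def assign_senses(word_vec, clusters, threshold=0.5):
    """Assign a word to its most similar clusters, one at a time.

    After each assignment, the features shared with the chosen cluster are
    removed from the word's vector, so later comparisons only see what is left.
    The threshold is a hypothetical stopping rule, not taken from the source.
    """
    remaining = dict(word_vec)
    candidates = dict(clusters)
    senses = []
    while remaining and candidates:
        best = max(candidates, key=lambda name: cosine(remaining, candidates[name]))
        if cosine(remaining, candidates[best]) < threshold:
            break
        senses.append(best)
        for f in candidates[best]:  # drop the overlapping features
            remaining.pop(f, None)
        del candidates[best]
    return senses


# Hand-made toy data: two committees and one ambiguous word.
clusters = {
    "flora": centroid([{"grow": 1, "water": 1},
                       {"grow": 1, "water": 1, "bloom": 1}]),
    "factory": centroid([{"build": 1, "operate": 1},
                         {"build": 1, "close": 1}]),
}
plant = {"grow": 3, "water": 2, "build": 1, "operate": 1}

print(assign_senses(plant, clusters))
```

In this toy data the full vector for `plant` is dominated by its "flora" features, so its similarity to the "factory" cluster starts below the threshold. Once the features shared with "flora" are removed, the remaining features match "factory" well, which is how feature removal lets a less frequent sense surface.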