Methods for automatic term recognition in domain-specific text collections: A survey

Astrakhantsev, N. A.; Fedorenko, D. G.; Turdakov, D. Yu.

doi:10.1134/S036176881506002X

Published: 15 November 2015

Programming and Computer Software, Volume 41, pages 336–349 (2015)

Abstract

Applications that process domain-specific text often rely on glossaries and ontologies, and the main step in constructing such resources is term recognition. This paper surveys existing definitions of the term and its linguistic features, formulates the task of term recognition, and analyzes currently available methods for automatic term recognition: methods for collecting candidates, methods based on statistics and on the contexts of term occurrences, methods using topic models, and methods based on external resources (such as text collections from other domains, ontologies, and Wikipedia). The paper also gives an overview of standard methodologies and datasets for experimental research.
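
To make two of the method families named above concrete, the following is a minimal sketch, not the survey's own algorithm. It assumes that candidates are word n-grams free of stop words and that a candidate is scored by the ratio of its relative frequency in the domain text to its smoothed relative frequency in a reference text from another domain. The names `STOP_WORDS`, `collect_candidates`, and `rank_terms`, the tiny stop-word list, and the add-one smoothing are all illustrative assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real systems use larger lists
# or part-of-speech patterns to filter candidates).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "or",
              "to", "is", "are", "with", "by"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def collect_candidates(tokens, max_len=3):
    """Yield word n-grams (1 <= n <= max_len) that contain no stop words."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if not any(word in STOP_WORDS for word in gram):
                yield " ".join(gram)


def rank_terms(domain_text, reference_text, max_len=3):
    """Rank candidates by domain relative frequency over reference relative frequency."""
    domain = Counter(collect_candidates(tokenize(domain_text), max_len))
    reference = Counter(collect_candidates(tokenize(reference_text), max_len))
    domain_total = sum(domain.values())
    reference_total = sum(reference.values())

    scores = {}
    for candidate, freq in domain.items():
        rel_domain = freq / domain_total
        # Add-one smoothing keeps the ratio finite for candidates that never
        # occur in the reference text.
        rel_reference = (reference[candidate] + 1) / (reference_total + 1)
        scores[candidate] = rel_domain / rel_reference

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    domain_text = "topic model topic model inference the cat"
    reference_text = "the cat sat on the mat the cat"
    for candidate, score in rank_terms(domain_text, reference_text):
        print(f"{score:.3f}  {candidate}")
```

In this toy input, candidates that are frequent in the domain text but absent from the reference text score higher than words common to both. Methods based on contexts, topic models, ontologies, or Wikipedia would replace or extend the scoring step.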

Author information

Authors and Affiliations

Institute for System Programming, Russian Academy of Sciences, ul. Solzhenitsyna 25, Moscow, 109004, Russia
N. A. Astrakhantsev, D. G. Fedorenko & D. Yu. Turdakov

Corresponding author

Correspondence to N. A. Astrakhantsev.

Additional information

Original Russian Text © N.A. Astrakhantsev, D.G. Fedorenko, D.Yu. Turdakov, 2015, published in Programmirovanie, 2015, Vol. 41, No. 6.

About this article

Cite this article

Astrakhantsev, N.A., Fedorenko, D.G. & Turdakov, D.Y. Methods for automatic term recognition in domain-specific text collections: A survey. Program Comput Soft 41, 336–349 (2015). https://doi.org/10.1134/S036176881506002X

Received: 06 April 2015
Published: 15 November 2015
Issue Date: November 2015
DOI: https://doi.org/10.1134/S036176881506002X
