
Terminology spectrum analysis of natural-language chemical documents: term-like phrases retrieval routine

Abstract

Background

This study seeks to develop, test and assess a methodology for the automatic extraction of a complete set of ‘term-like phrases’ and for the creation of a terminology spectrum from a collection of natural-language PDF documents in the field of chemistry. ‘Term-like phrases’ are defined as one or more consecutive words and/or alphanumeric strings, with spelling unchanged from the source, which convey a specific scientific meaning. A terminology spectrum for a natural-language document is an indexed list of tagged entities, including recognized general scientific concepts, terms linked to existing thesauri, names of chemical substances/reactions and term-like phrases. The retrieval routine is based on n-gram textual analysis with the sequential execution of various ‘accept and reject’ rules, taking into account morphological and structural information.

Results

The retrieval process was assessed quantitatively with precision (P), recall (R) and F1-measure, calculated by professional chemists on a limited set of documents drawn from the full collection of text abstracts of five EuropaCat events; the results confirm the effectiveness of the developed approach. The term-like phrase parsing efficiency is quantified by precision (P = 0.53), recall (R = 0.71) and F1-measure (F1 = 0.61) values.

Conclusion

The paper suggests using such terminology spectra to perform various types of textual analysis across document collections. A terminology spectrum of this sort may be successfully employed for text information retrieval, for reference database development, for analyzing research trends in subject fields and for assessing the similarity between documents.


Background

The current situation in chemistry, as in any other field of natural science, is characterized by a substantial growth of texts in natural languages (research papers, conference proceedings, patents, etc.), which remain the most important sources of scientific knowledge and experimental data, of information about modern research trends, and of the terminology used in the subject areas of science. This greatly increases the value of powerful information systems such as Scopus®, SciFinder® and Reaxys®, which are capable of handling large text document databases, especially those fitted with advanced text information retrieval capabilities. In fact, both the efficiency and the productivity of modern scientific research in chemistry depend strongly on the quality and completeness of its information support, which is oriented primarily toward advanced and flexible reference search and the discovery and analysis of text information, so as to afford the most relevant answers to user questions (substances, reactions, relevant patents or journal articles). The main ideas and developments in information retrieval methods coupled with techniques of full-text analysis are now well described and examined [1].

In conventional information systems, the majority of text information retrieval and discovery methods are based on specific sets of pre-defined document metadata, e.g. keywords or indexes of terms characterizing the text content. User queries are converted, via the index, into information requests expressed as Boolean combinations of terms, bringing into play the vector space model and term weights. Probabilistic approaches may also be employed to take into account features such as term distribution, co-occurrence information and relationships derived from information retrieval thesauri (IRT). Any such index has to be produced and updated mainly manually by trained experts, but the possibilities of automated index development are now attracting closer attention.

It is assumed that the structural foundation of any scientific text is its terminology basis, which may in principle be represented by an advanced IRT. However, limitations inherent in conventional IRTs lead to difficulties in applying them in practical text analysis procedures. Typically, such thesauri are made manually in a very labor-intensive process and are often constructed to reflect general terminology only. Thesaurus terms originally represent a formally written description of scientific conceptions and definitions, which may not exactly match the real usage and spelling found in scientific texts. Moreover, a thesaurus developed for one type of text may be less efficient, or not applicable at all, when used with another. A good example is the IUPAC “Gold Book” [2], a compendium of chemical nomenclature, terminology, units and definition recommendations. The terminology drafted by IUPAC experts spans a wide range of chemistry but does not describe any field in detail and represents only the well-established upper level of scientific terminology. In summary, IRT-based text analysis alone cannot cope with the variability of scientific texts written in natural languages, because the accuracy of matching thesaurus terms to real text phrases leaves much to be desired.

It should also be noted that the language of science evolves faster than general natural language, especially in chemistry and molecular biology. Thus, the analysis of the terminology basis of a subject text collection should be done automatically, using both primitive extraction and sophisticated knowledge-based parsing. Only automated data analysis can process and reveal the variety of term-like word combinations in the constantly changing world of scientific publications. Automated parsing and analysis of document collections, or of isolated documents, for term-like phrases can also help to discover the various contexts in which the same scientific terminology is used in different publications, or even in different parts of the same publication.

There is nothing new in the idea of automated term retrieval. Typically, the terminology analysis of text content is focused on the recognition of chemical entities and on automatic keyphrase extraction, aimed at providing a limited set of keywords that might characterize and classify the document as a whole. Two main strategies are usually applied: machine learning, and the use of various dictionaries with automated selection rules (heuristics) coupled with calculated features [3], such as TF-IDF [4, 5]. Keyphrase retrieval procedures therefore typically involve the following stages: initial text preprocessing; selecting a keyphrase candidate; applying rules to each candidate; compiling a list of keyphrases [6]. A few existing systems have been analyzed in terms of the precision (P), recall (R) and F1-score attainable on existing keyphrase extraction datasets. For such well-known systems as Wingnus, Sztergak and KP-Miner these values are reported as P = 0.34÷0.40, R = 0.11÷0.14, F1 = 0.17÷0.20 [6]. Open-Source Chemistry Analysis Routines (OSCAR4) [7] and the ChemicalTagger NLP tool [8] may also be mentioned as tools for the recognition of named chemical entities and for parsing and tagging the language of text publications in chemistry.

However, the above-mentioned keyphrase extraction approaches have some inherent shortcomings, owing to the significant number of cases where a limited set of automatically selected top-ranked keyphrases does not properly describe the document in detail (e.g., a paper may contain the description of a specific catalyst preparation procedure that is not the main subject of the paper). It may also be seen from the aforementioned values of P, R and F1 that in many cases the extracted keyphrases do not match the keyphrases selected by experts to an adequate degree. Exact matching of keyphrases is a rather rare event, partially due to the difficulty of taking into account nearly similar phrases, for instance semantically similar ones. On the other hand, even though the widely used n-gram analysis can build a full spectrum of the token sequences present in a text, it may also produce a great deal of noise, making the sequences difficult to use. Some attempts have been made to take into account the semantic similarity of n-grams and to differentiate between rubbish and plausible keyphrase candidates [9, 10].

The problem of automatic recognition of scientific terms in natural-language texts has been explored over the last decades [11]. It has been shown that taking linguistic information into account may improve term extraction efficiency. Information about the grammatical structure of multiword scientific terms, their textual variants and the context of their usage may be represented as a set of lexico-syntactic patterns. For instance, P, R and F-measure values of 73.1, 53.6 and 61.8 %, respectively, were obtained for term extraction from scientific texts (in Russian only) on computer science and physics [12].

A ‘terminology spectrum’ of a natural-language publication may be defined as an indexed list of tagged token sequences with calculated weights, such as recognized general scientific notions, terms linked to existing thesauri, names of chemical entities and ‘term-like phrases’. Term-like phrases are not exactly keyphrases or terms in the usual sense (as published in thesauri). They are defined here as one or more consecutive tokens (words and/or alphanumeric string combinations) which convey a specific scientific meaning, with spelling and context unchanged from the real text document. For instance, a term-like phrase may look similar to a specific generally used term but with different spelling or word order, reflecting the usage of the term in a different context in a natural-language environment. Consequently, term-like phrases may describe the real text content and the essence of the processes that the research deals with, which makes their analysis extremely useful. This sort of terminology spectrum of a natural-language publication may be considered a kind of knowledge representation of a text and may be successfully employed in various information retrieval strategies, text analysis and reference systems [13].

The present work aims to develop and test a methodology for the automated retrieval of a full terminology spectrum from natural-language chemical text collections in PDF format, with term-like phrase selection as the central part of the procedure. The retrieval routine is based on n-gram text analysis with the sequential execution of a complex of ‘accept’ and ‘reject’ rules, taking into account morphological and structural information. The term ‘n-gram’ denotes here a text string, i.e. a sequence of n consecutive words or tokens present in a text. The numerical assessment of the efficiency of the automated term-like phrase retrieval process performed in the paper is based on comparing automatically extracted term-like phrases with those manually selected by experts.

Methods

Text collection used for experiments

Chemical catalysis is a foundation of the chemical industry and represents a very complex field of scientific and technological research. It includes chemistry, various subject fields of physics, chemical engineering, materials science and much more. One of the most representative research conferences in catalysis is the «European Congress on Catalysis—EuropaCat», which has been chosen as a source of scientific texts covering a wide range of research themes. A set of abstracts of the EuropaCat conferences of 2013, 2011, 2009, 2007 and 2005 (about 6000 documents across all five Congress events) has been used for the textual analysis in the present study. All abstracts are in PDF format.

General description of terminology spectrum retrieval process

The developed system of terminology spectrum analysis consists of the following sequentially running procedures or steps, as depicted in Fig. 1.

Fig. 1

General scheme of the terminology spectrum building process with term-like phrases retrieval

The server side of the terminology spectrum analysis system runs on the Java SE 6 platform, and the client is a PHP web application for viewing texts and the results of terminology analysis. To store all the data collected in the terminology retrieval process, the cross-platform document-oriented database MongoDB is used [14]. The choice in favor of MongoDB was motivated by the need to process nested n-gram structures up to level 7.

The main stages and analytic methods involved in the process are discussed in the following sections.

Text materials conversion with PdfTextStream library [15]

Scientific texts are mainly published in PDF format, which typically contains no information about document structure and is therefore not suitable for immediate text analysis. Thus, a document first has to be preprocessed by converting the PDF file into text format and analyzing its structure (highlighting titles, authors, headings, references, etc.), with the aim of making the text suitable for further content information retrieval (see Fig. 2). The following steps are used (stages 1–2 in Fig. 1) to perform this kind of PDF transformation (for a detailed example see Additional file 1):

Fig. 2

An example of pdf-to-text transformation

  1. Isolation of text blocks that have the same formatting (e.g. bold, underline, etc.);

  2. Removing empty blocks and merging blocks located on the same text row;

  3. Analyzing the document structure by classifying each block as containing information about the publication title, the headings, the authors, the organizations, the e-mails, the references or the content. To perform this analysis, a set of special taggers has been developed which are executed sequentially to analyze and tag each text block. The taggers utilize features such as the positions of the first and last rows of a text block, text formatting, the position of a block on a page, etc. All developed taggers have been adjusted to handle each conference event individually.

  4. Text block filtration to remove unclassified text blocks, for instance those situated before the publication title, because such blocks typically contain useless and already known information about a conference or journal.

  5. Unification of special symbols (such as variants of dash, hyphen and quote characters), removal of space characters placed before brackets in the notation of crystal indexes, etc. Regular expressions are used.
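The symbol-unification step (5) can be sketched as below. The character maps and the crystal-index pattern are illustrative assumptions; the actual regular expressions used by the system are not published.

```python
import re

# Illustrative character maps; the real sets handled by the system's
# regular expressions may be larger (these are assumptions).
DASHES = dict.fromkeys("\u2010\u2011\u2012\u2013\u2014", "-")  # hyphen/dash variants
QUOTES = dict.fromkeys("\u2018\u2019\u201c\u201d", '"')        # quote variants

def unify_symbols(text):
    """Unify dash/quote variants and remove the space before a bracketed
    crystal index, e.g. 'TiO2 (101)' -> 'TiO2(101)'."""
    text = text.translate(str.maketrans({**DASHES, **QUOTES}))
    return re.sub(r"(\w)\s+\((\d{3})\)", r"\1(\2)", text)
```

For example, `unify_symbols("anatase \u2013 TiO2 (101)")` yields `'anatase - TiO2(101)'`.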

Text preprocessing

The text preprocessing stage (#3 in Fig. 1) transforms a text document obtained from stages 1–2 into a unified structured format with markup. During this stage the text is split into individual words and sentences (tokenization), followed by a morphological analysis that includes: highlighting objects such as formulas and chemical entities, removing unnecessary words and meaningless combinations of symbols, and recognizing general English words and tokens with special meaning (units, stable isotopes, acronyms, etc.). The result of this stage is a fully marked-up structured text to be stored in the database. The following steps are involved in the text preprocessing stage.

Tokenization

A tokenizer from the OSCAR4 library is used for splitting a text into words, phrases and other meaningful elements. The tokenizer has been adapted for better handling of chemical texts.

The present study established that the original OSCAR4 tokenizer has some shortcomings with respect to our needs. The first is the splitting of tokens at the hyphen “-”, which often leads to mistakes in recognizing compound terms. To overcome this issue, the parts of the source code responsible for splitting tokens at hyphens were commented out (see Additional file 2). The next problem resolved is that some complex tokens representing various chemical compositions are treated by the tokenizer as sequences of tokens (see Fig. 3). In such cases it is necessary to combine these isolated tokens into an integral one. The modified tokenizing procedure merges tandem tokens separated by either the “/” or the “:” character, provided that they are marked with the OSCAR4 tag «CM» or incorporate a chemical element symbol. In addition, tokens of the form “number %” situated at the beginning of such a phrase describing a chemical composition are merged into the integral token too (see Fig. 3).

Fig. 3

An example of the tokenization process. Frames outline the results of modified OSCAR4 tokenizer, additional outer frames isolate tokens describing a chemical composition (possessing the tag “COMP”)

An example of the work of the modified tokenizer is shown in Fig. 3. Blue frames hold the tokens identified by the modified OSCAR4 tokenizer. Additional red frames outline tokens which are combined into integral ones. Such tokens are marked with the dedicated tag «COMP». This tag is used by the accept rule «ChemUnigramRule» to identify one-word n-grams describing chemical compositions.

Then the position of each token in the text is determined. Splitting the series of tokens into sentences finalizes the tokenization process; this is realized with the help of the WordToSentenceAnnotator routine of the Stanford CoreNLP library [16, 17].
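The token-merging behaviour described above can be roughly sketched as follows. The `(text, tag)` token representation, the toy element list and the tag names are assumptions for illustration; the real system modifies OSCAR4 internals rather than post-processing token lists.

```python
import re

# Toy subset of element symbols; the real check would cover the periodic table.
ELEMENTS = {"H", "He", "C", "O", "Co", "Ni", "Pt", "Cu", "Al", "Si"}

def is_chemical(token):
    text, tag = token
    return tag == "CM" or any(sym in text for sym in ELEMENTS)

def merge_tandem(tokens):
    """Merge neighbouring tokens separated by '/' or ':' when both look
    chemical, tagging the result 'COMP' (chemical composition)."""
    out = list(tokens)
    i = 0
    while i + 2 < len(out):
        sep = out[i + 1][0]
        if sep in ("/", ":") and is_chemical(out[i]) and is_chemical(out[i + 2]):
            out[i:i + 3] = [(out[i][0] + sep + out[i + 2][0], "COMP")]
        else:
            i += 1
    return out

def merge_percent_prefix(tokens):
    """Merge a leading 'number %' token into a following composition token."""
    out = list(tokens)
    i = 0
    while i + 1 < len(out):
        if re.fullmatch(r"\d+(\.\d+)?\s?%", out[i][0]) and out[i + 1][1] == "COMP":
            out[i:i + 2] = [(out[i][0] + out[i + 1][0], "COMP")]
        else:
            i += 1
    return out
```

With these sketches, `[("2.7%", "CD"), ("CO", "CM"), ("/", "SYM"), ("H2O", "CM"), ("flow", "NN")]` collapses into a single `("2.7%CO/H2O", "COMP")` token followed by `("flow", "NN")`.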

Morphological analysis and labeling tokens with their POS tags

Morphological analysis (the Stanford CoreNLP library [18] is used) maps each word to a set of part-of-speech tags (the Penn Treebank Tag Set [19] as used by Stanford CoreNLP). Typical tags used in this research are: «NN» («NNS»)—nouns; «VB»—verb; «JJ»—adjective; «CD»—cardinal number, etc. For full information about the POS tags used by the terminology spectrum building procedure see Table 4.

Lemmatization

Lemmatization is the process of grouping together different inflected word forms so they can be treated as a single item. In the present work, however, lemmatization is only used to replace nouns in the plural form with their lemmas. Preliminary experiments demonstrated that further lemmatization is not helpful and leads to a significant loss of meaningful information (for example, «reforming process» yields the lemmas «reform» and «process», losing the name of a very important modern industrial chemical process in refining).
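A minimal sketch of this plural-only lemmatization is shown below. The paper relies on Stanford CoreNLP lemmas; the simple suffix rules here are only a rough stand-in for illustration.

```python
def singularize(word):
    """Very rough plural-to-singular rules (a stand-in for CoreNLP lemmas)."""
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"
    if word.endswith(("ses", "xes", "ches", "shes")):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def lemmatize_plural_nouns(tagged):
    """Replace only plural nouns (Penn tags NNS/NNPS) with their lemma;
    all other tokens, e.g. 'reforming', are left untouched."""
    return [(singularize(w), pos) if pos in ("NNS", "NNPS") else (w, pos)
            for w, pos in tagged]
```

For instance, `("catalysts", "NNS")` becomes `("catalyst", "NNS")` while `("reforming", "VBG")` is preserved, so phrases like «reforming process» keep their meaning.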

Recognition of names of chemical entities

Meta-information about the names of chemical entities is very important in various term-like phrase retrieval strategies. The open-source OSCAR4 (Open Source Chemistry Analysis Routines) [7, 20] software package is applied for the selection and semantic annotation of chemical entities across a text. Among the variety of tags and attributes utilized by the OSCAR4 routine, only the following are used in the present study:

  1. CM—chemical term (chemical name, formula or acronym);

  2. RN—reaction (for example, «epoxidation», «dehydrogenation», «hydrolysis», etc.);

  3. ONT—ontology term (for example, «glass», «adsorption», «cation», etc.).

When a token is part of a recognized chemical entity, the token gets the same OSCAR4 tag as the whole entity.

Recognition of tokens with special meaning

A significant part of the text preprocessing stage is the selection of individual tokens that are general English words and the recognition of various meaningful text strings, namely: general scientific terms (actually recognized at the final «terminology spectrum building» stage but described here for convenience); tokens denoting chemical elements, stable isotopes and measurement units; and tokens which cannot be part of any term in any way. This work is performed using the specially developed dictionaries described in detail in Table 1.

Table 1 Developed/modified dictionaries used for recognition of general English words, general chemical science terms and tokens with special meaning

Some extra explanation needs to be given of the general English dictionary, the stop-list dictionary and the procedure for recognizing general scientific terms.

More than 560 words either found in scientific terminology (for instance: “acid”, “alcohol”, “aldehyde”, “alloy”, “aniline”, etc.) or occurring in composite terms (for example, “abundant” may be part of the term “most abundant reactive intermediates”) were excluded from the original version of the Corncob Lowercase Dictionary.

The IUPAC GoldBook Compendium [21] on chemical terminology (the only well-known and time-proven dictionary) is used as the source of general chemistry terms. To find the best way of matching an n-gram to a scientific term from the Compendium, a number of experiments were performed, which resulted in the following criteria:

  1. An n-gram is considered a general scientific term if all n-gram tokens are words of a certain IUPAC Goldbook term, regardless of their order;

  2. If (n − 1) of the n-gram tokens coincide with (n − 1) words of an IUPAC Goldbook term and the remaining word occurs among other terms in the dictionary, then the n-gram is considered a general scientific term too.

Some examples may be given. The n-gram “RADIAL CONCENTRATION GRADIENT” is a general scientific term because the phrase “concentration gradient” is in the Compendium and the word “radial” is part of the term “radial development”. The n-gram “CONTENT CATALYTIC ACTIVITY” is a general term because the term “catalytic activity content” is present in the Compendium and differs from the n-gram only in word order. The n-gram “TOLUENE ADSORPTION CAPACITY” is not considered a general term: although two of its words coincide with the term “adsorption capacity”, the remaining word “TOLUENE” is special and is not found in the Compendium. The n-gram “COBALT ACETATE DECOMPOSITION” is not considered a general term either, as only the term “decomposition” can be found.
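The two matching criteria and the examples above can be sketched as follows, using a toy term list in place of the full Gold Book Compendium (the real dictionary contains thousands of entries).

```python
# Toy stand-in for the IUPAC Gold Book term list (illustrative assumption).
GOLDBOOK = [
    "concentration gradient",
    "radial development",
    "catalytic activity content",
    "adsorption capacity",
    "decomposition",
]
GOLDBOOK_WORDS = {w for term in GOLDBOOK for w in term.split()}

def is_general_term(ngram):
    tokens = set(ngram.lower().split())
    for term in GOLDBOOK:
        words = set(term.split())
        # criterion 1: every n-gram token is a word of one term (any order)
        if tokens <= words:
            return True
        # criterion 2: all but one token match, and the leftover word
        # occurs somewhere else in the dictionary
        if len(tokens & words) == len(tokens) - 1 and (tokens - words) <= GOLDBOOK_WORDS:
            return True
    return False
```

On this toy dictionary the function reproduces the four decisions discussed in the text: the first two example n-grams are accepted as general terms, the last two are not.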

The final comment concerns the stop-list dictionary, which at first glance may look like a set of arbitrary words. It is, in fact, based on a series of observations of term-like phrases wrongly identified by an earlier version of the terminology analysis system.

Strict filtering

The last, but not least, step in the text preprocessing stage is strict filtering, developed to remove unnecessary words and meaningless combinations of symbols. If at least one of an n-gram’s tokens is labeled with the strict filtering tag (“rubbish”: “true”), then that n-gram is not considered a term-like phrase. At this stage, the procedure looks for certain character sequences described by the filtering rules (Table 2) and not exempted by the list of exceptions (Table 3): runs of digits, special symbols, measurement units, symbols of chemical elements, brackets and so on. Custom regular expressions and the standard dictionaries described in Table 1 are used for this procedure. A general scheme of strict filtering parsing is illustrated in Fig. 4.

Table 2 Rules for strict filtering procedure
Table 3 Exceptions for strict filtering procedure
Fig. 4

General scheme of strict filtering tagging

The following examples illustrate the decision-making process of defining a token as “valid” or “rubbish” (Fig. 5).

Fig. 5

Examples of strict filtering tagging
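The strict-filtering decision can be sketched as below; only a small illustrative subset of the rules of Table 2 and the exceptions of Table 3 is reproduced here, so the patterns and tag names are assumptions.

```python
import re

# Illustrative subset of the filtering rules (cf. Table 2).
RUBBISH_PATTERNS = [
    re.compile(r"^\d+$"),                      # bare runs of digits
    re.compile(r"^\W+$"),                      # only special symbols/brackets
    re.compile(r"^\d+(\.\d+)?(K|nm|mL|wt)$"),  # number fused with a unit
]
# Illustrative exception (cf. Table 3): composition tokens are exempt.
EXCEPTIONS = {"comp"}

def is_rubbish(token, exception_tag=None):
    """Tag a token rubbish=true if any filtering rule matches and no
    exception applies; n-grams containing such tokens are discarded."""
    if exception_tag in EXCEPTIONS:
        return False
    return any(p.match(token) for p in RUBBISH_PATTERNS)
```

For example, `"450"` and `"350K"` are flagged as rubbish, `"benzene"` is not, and a composition token carrying the `comp` exception is kept regardless of its shape.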

Summary of preprocessing stage

The final result of the text preprocessing stage is a marked-up and structured text with tagged tokens. These tags are then used by the various rules for term-like phrase selection. As there is no need for all the tags from OSCAR4 and the Penn Treebank Tag Set, only a few of them are used in the term-like phrase retrieval procedure. A consolidated list of all the tags that may be assigned to tokens at different steps of the text preprocessing stage is given in Table 4.

Table 4 The consolidated list of all tags assigned to tokens at different steps of the text preprocessing stage

As an illustration of tag assignment, the following example may be given. Figure 6 shows an example sentence in which a few tokens have been tagged. For instance, the token «2.7 %CO/10.0 %H2O/He» carries the tags (pos = “CD”; lemma = “2.7 %CO/10.0 %H2O/He”; oscar = “CM”; rubbish = “false”; exception = “comp”). Every token has at least two tags—«pos» (holding the part-of-speech information) and «lemma» (corresponding to the lemma of the token). In addition, some tokens related to chemistry (indicating chemical substances, formulas, reactions, etc.) have the tag «oscar», taking the values “CM” or “ONT”. Last but not least is the tag «rubbish» (“true” or “false”), marking tokens to which strict filtering is to be applied.

Fig. 6

An illustration of tags assignment to different tokens

N-grams spectrum retrieval procedure

As defined earlier in our study, the term «n-gram of length n» connotes a sequence or string of n consecutive tokens situated within the same sentence, with useless tokens omitted (at the moment, only definite/indefinite articles). The n-gram set is obtained by moving a window of n tokens through an entire sentence, token by token. This process is repeated over all sentences for a set of texts: \(T = \left\{ {T_{1} , T_{2} , \ldots , T_{m} } \right\}\).

For a set of texts, each n-gram may be characterized by its textual frequency of occurrence \(f_{T} \left( {T_{i} } \right)\)—the total number of occurrences of the n-gram within a text \(T_{i}\)—and by its absolute frequency of occurrence \(f_{A} = \mathop \sum \limits_{i} f_{T} \left( {T_{i} } \right)\)—the total number of occurrences of the n-gram over all texts. As a result, each n-gram may be described by a vector \({\mathbf{F}}\left( T \right) = \left\{ {f_{T} \left( {T_{1} } \right), f_{T} \left( {T_{2} } \right), \ldots , f_{T} \left( {T_{m} } \right)} \right\}\) over the set of texts, enabling us to develop additional procedures for n-gram filtering and text information analysis.

The full n-gram data set is redundant, which creates difficulties for analysis. For specific purposes, different filtration procedures are applied. For instance, threshold filtering based on the values of \({ \hbox{max} } f_{A} = { \hbox{max} } \mathop \sum \nolimits_{i} f_{T} \left( {T_{i} } \right)\) and \({ \hbox{max} } f_{T} \left( {T_{i} } \right)\) may be used.
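The moving-window n-gram retrieval and the two frequency measures can be sketched as follows; the data layout (each text as a list of token lists) is an assumption for illustration.

```python
from collections import Counter

ARTICLES = {"a", "an", "the"}  # the only 'useless tokens' omitted at present

def ngrams(sentence, n):
    """All n-grams of one sentence: a window of n tokens moved token by
    token, with articles dropped first."""
    kept = [t for t in sentence if t.lower() not in ARTICLES]
    return [tuple(kept[i:i + n]) for i in range(len(kept) - n + 1)]

def frequency_vectors(texts, n):
    """For each n-gram, the vector F(T) = {f_T(T_1), ..., f_T(T_m)} of
    per-text frequencies; the absolute frequency f_A is its sum."""
    per_text = [Counter(g for sent in text for g in ngrams(sent, n))
                for text in texts]
    vocab = set().union(*per_text)
    return {g: [c[g] for c in per_text] for g in vocab}
```

For two toy texts, the bigram («zeolite», «catalyst») occurring once in each receives the vector [1, 1] and thus f_A = 2.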

Module of terminology spectrum building

The final stage of the analysis is to distinguish, among the multitude of n-grams, the term-like phrases, general chemistry scientific terms, names of chemical entities and useless n-grams. The calculation of the textual and absolute frequencies of term occurrence completes the terminology spectrum building.

To select term-like n-grams, sets of accept and reject rules are applied. They are all based on the token tags assigned at previous steps and on the developed dictionaries (Table 1). The intention of each set of rules is to determine, by analyzing its structure, whether an n-gram of a given length is a term-like phrase or not. All rules are applied consecutively. If an n-gram conforms to an accept or reject rule in the rule sequence, the procedure stops, declaring the n-gram either a term-like phrase, possibly with a special meaning (e.g. a general chemistry scientific term or a chemical entity), or a non-term-like phrase. If no rule is applicable, the n-gram is considered a term-like phrase as well. There are a few general rules that can be used for the analysis of n-grams of any length. There are also tailored sets of rules for 1-grams (Table 5), 2-grams (Table 6) and longer (n > 2)-grams (Table 7).

Table 5 Sequence of accept and reject rules for unigrams (1-grams)
Table 6 Sequence of reject and accept rules for bigrams (2-grams)
Table 7 Sequence of reject and accept rules for n-grams (n ≥ 3)
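The consecutive first-match-wins application of the rules can be sketched as below. The two rules shown are illustrative stand-ins (only «ChemUnigramRule» is named in the text); the actual rule sets are those of Tables 5, 6 and 7.

```python
from collections import namedtuple

Token = namedtuple("Token", "text tag")

def chem_unigram_rule(ngram):
    """Accept a one-token n-gram tagged COMP (cf. «ChemUnigramRule»)."""
    if len(ngram) == 1 and ngram[0].tag == "COMP":
        return "accept"

def stop_word_rule(ngram):
    """Reject any n-gram containing a stop-list token (toy stop list)."""
    if any(t.text in {"using", "very", "obtained"} for t in ngram):
        return "reject"

RULES = [chem_unigram_rule, stop_word_rule]

def classify(ngram):
    """Apply the rules consecutively; the first verdict wins, and an
    n-gram matching no rule is accepted as a term-like phrase."""
    for rule in RULES:
        verdict = rule(ngram)
        if verdict:
            return verdict
    return "accept"
```

Under this sketch a COMP unigram is accepted immediately, an n-gram containing a stop word is rejected, and any n-gram that triggers no rule falls through to the default «accept».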

The following examples illustrate the decision-making process of whether an n-gram may be considered a term-like phrase or not (Fig. 7).

Fig. 7

An illustration of term-like phrases retrieval procedure with POS based accept rules

The next step in the terminology analysis stage is the tagging of term-like phrases to describe their roles as entities having a special meaning. At the moment the following tags exist: «term-like phrase», «general chemistry term» and «chemical entity». The final step is an additional filtration procedure aimed at reducing the number of term-like phrases by removing short term-like phrases which are parts of longer n-grams. The criterion for applying the filter is equality of the absolute frequencies of occurrence of the short and the long n-grams.
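This final length-based filtration can be sketched as follows, under the assumption that phrases are represented as word tuples with known absolute frequencies of occurrence.

```python
def filter_subsumed(phrases, f_abs):
    """Drop a short term-like phrase when it occurs inside a longer one and
    both have equal absolute frequency, i.e. the short phrase never
    appears on its own."""
    def contained(short, long_):
        n, m = len(short), len(long_)
        return n < m and any(long_[i:i + n] == short for i in range(m - n + 1))

    return [p for p in phrases
            if not any(contained(p, q) and f_abs[p] == f_abs[q] for q in phrases)]
```

For example, if («selective»,) and («selective», «oxidation») both occur 3 times, the unigram is dropped; if («oxidation»,) occurs 5 times on its own, it survives.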

Results and discussion

An example of automatic term-like phrase retrieval is shown in Fig. 8, with some term-like and filtered-off n-grams highlighted. For the filtered-off n-grams, the reject rules used are given as well. For the detailed results of the terminology analysis of one preselected Congress abstract, see Additional file 1.

Fig. 8

An example of terminology analysis results (with some term-like and filtered-off n-grams highlighted)

To understand the overall performance of the term-like phrase retrieval routine, the full set of text abstracts belonging to the 5 EuropaCat events was processed. The obtained data were statistically analyzed (see Table 8). It may be seen that the term-like phrase retrieval procedure reduces the total number of available n-grams to 1÷3 %, depending on the n-gram length n.

Table 8 Consolidated table of experimental results on terminology analysis of EuropaCat abstracts set

Table 8 demonstrates that the maximum absolute number of term-like n-grams corresponds to n = 2 (bigrams), which is in good accordance with the well-known average term length in scientific texts. On the other hand, term indexes are often limited to n-gram lengths n = 1, 2, 3. The limit n = 3 looks good enough for the general science vocabulary (see the NGS value in Table 8—the number of general scientific terms found), but it is not sufficient for a specialized thesaurus (e.g. for catalysis). The numbers of term-like n-grams with the COMP tag are also large for various n, including n > 3. Summarizing, long-length term retrieval is a distinctive feature of the suggested approach.

It is also seen from Table 8 that nearly half of the total number of 1-grams have the OSCAR tag “CM”. It should also be noted that if a plausible term-like phrase contains just one token with an OSCAR tag, the whole phrase is considered by the system to have the same tag. This may explain the close values (in percentages) for phrases of different lengths.

To assess the overall effectiveness of the term-like phrase retrieval procedure, it seems necessary to answer quantitatively the question of what precision and recall values can be achieved. To do so, a preliminary study comparing automatically and manually selected term-like phrases was performed with the help of two professional chemists, who picked out the term-like phrases from a limited set of arbitrarily selected documents. For a phrase to be included in the list of term-like phrases, consensus between the two experts was required. It should be noted that the experts were not required to follow the same moving-window procedure over a sentence that is used for n-gram isolation. Moreover, the experts took into account information contained in simple grammatical structures typical of scientific texts, such as enumerations. This leads to additional differences between the expert-selected and the automatically selected sets of term-like phrases (for an example see Fig. 9).

Fig. 9

An example of terminology analysis results (with some automatically retrieved and expert selected term-like phrases)

The data obtained through expert terminological analysis were compared with the automatically retrieved terms, and the precision (P), recall (R) and F-measure values were calculated. In this paper, precision [22] indicates the fraction of automatically retrieved term-like phrases which coincide with the expert-selected ones. Recall is the fraction of the expert-selected term-like phrases that are retrieved by the system.

$$\begin{aligned} P &= \frac{\text{Number of coincidences}}{\text{Number of term-like phrases retrieved by the system}};\quad R = \frac{\text{Number of coincidences}}{\text{Number of term-like phrases retrieved by the experts}} \\ \text{Number of coincidences} &= \left| \left\{ \text{term-like phrases retrieved by experts} \right\} \cap \left\{ \text{term-like phrases retrieved by the system} \right\} \right| \end{aligned}$$

Both precision and recall may therefore be used as measures of the relevance and efficiency of the term-like phrase retrieval process. In simple terms, high precision means that most of the phrases selected by the system are genuine term-like phrases (few erroneous selections), while high recall means that most of the term-like phrases present in the text are actually selected.

These two measures (P and R) are often combined into a single value known as the F1-measure [23], which provides an overall characteristic of system performance. The F1-measure is the harmonic mean of P and R; it reaches 1 at best and 0 at worst:

$$F_{1} = 2PR/\left( {P + R} \right)$$
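Taken together, the evaluation described by these formulas can be sketched as below. This is an illustrative Python sketch, not the authors' implementation; the phrase strings are invented, and exact string matching between phrases is assumed:

```python
# Precision, recall and F1 computed from the overlap between
# expert-selected and system-retrieved phrase sets, as defined above.

def evaluate(expert_phrases, system_phrases):
    """Return (precision, recall, F1) for two collections of phrases."""
    expert = set(expert_phrases)
    system = set(system_phrases)
    coincidences = len(expert & system)          # exact-match overlap
    precision = coincidences / len(system) if system else 0.0
    recall = coincidences / len(expert) if expert else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: 2 coincidences out of 4 system phrases
# and 3 expert phrases give P = 0.5, R ≈ 0.667, F1 ≈ 0.571.
expert = {"supported catalyst", "methane oxidation", "active site"}
system = {"supported catalyst", "active site", "reaction rate", "bulk phase"}
p, r, f1 = evaluate(expert, system)
```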

The numbers of expert-selected and automatically retrieved term-like phrases, the numbers of coincidences, and the calculated P, R and F1 values are presented in Table 9. For the detailed results of terminology analysis for one preselected text, see Additional file 1.

Table 9 Precision, Recall and F-measure estimated from the data obtained for 5 arbitrarily selected texts

It may therefore be concluded that term-like phrase retrieval efficiency can be further improved by taking into account typical grammatical structures used in scientific texts [12, 24], as well as the numeric values of both textual and absolute frequencies of n-gram occurrence.

It is also seen that the first version of the terminology analysis system delivers reasonably high precision and recall in the term-like phrase retrieval process. Some comparison can be made with the values P = 0.34–0.40, R = 0.11–0.14, F1 = 0.17–0.20 reported [6] for such well-known keyphrase retrieval systems as Wingnus, Sztergak and KP-Miner, although this comparison should be treated with caution because the systems pursue different goals (term-like phrase vs. keyphrase retrieval).

Conclusions

As mentioned in the introduction, scientific publications are still the most important sources of scientific knowledge, and new methods for retrieving meaningful information from natural language documents are particularly welcome today. The structural foundation of any such publication is formed by widely accepted terms and term-like phrases that convey the facts and shades of meaning of the document's content.

The present study aims to develop, test and assess a methodology for automated extraction of a full terminology spectrum from natural language chemical PDF documents, retrieving as many term-like phrases as possible. Term-like phrases are defined as one or more consecutive words and/or alphanumeric string combinations which convey a specific scientific meaning, with spelling and context unchanged from the real text. The terminology spectrum of a natural language publication is defined as an indexed list of tagged entities: recognized general science notions, terms linked to existing thesauri, names of chemical substances/reactions and term-like phrases. The retrieval routine is based on n-gram text analysis with sequential application of complex accept and reject rules. The main distinctive feature of the suggested approach is that it picks out all parsable term-like phrases rather than selecting a limited set of keyphrases meeting predefined criteria. The next step is to build an extensive term index of a text collection. The developed approach neither takes semantic similarity into account nor differentiates between similar term-like phrases (appropriate metrics may be employed for this at later stages). The approach, which comprises a number of sequentially running procedures, shows good results in terminology spectrum retrieval as compared with well-known keyphrase retrieval systems [6]. The term-like phrase parsing efficiency is quantified with precision (P = 0.53), recall (R = 0.71) and F1-measure (F1 = 0.61) values calculated from a limited set of documents manually processed by professional chemical scientists.

Terminology spectrum retrieval may be used to perform various types of text analysis across document collections. We believe that this sort of terminology spectrum may be successfully employed for text information retrieval and for reference database development. For example, it may be used to develop thesauri, to analyze research trends in subject fields by registering changes in terminology, to derive inference rules in order to understand particular text content, to look for similarity between documents by comparing their terminology spectra within an appropriate vector space, and to develop methods for automatically mapping documents onto reference database fields.
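One way the vector-space comparison of documents could look is sketched below. The paper does not prescribe a specific metric; cosine similarity over phrase frequency vectors is used here as a common choice, and the spectra are invented for illustration:

```python
# Hypothetical illustration: comparing two documents by their terminology
# spectra ({phrase: frequency} dictionaries) via cosine similarity.

import math

def cosine_similarity(spectrum_a, spectrum_b):
    """Cosine similarity of two sparse {phrase: frequency} vectors."""
    shared = set(spectrum_a) & set(spectrum_b)
    dot = sum(spectrum_a[p] * spectrum_b[p] for p in shared)
    norm_a = math.sqrt(sum(v * v for v in spectrum_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spectrum_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc1 = {"supported catalyst": 3, "methane oxidation": 2}
doc2 = {"supported catalyst": 1, "reaction rate": 4}
similarity = cosine_similarity(doc1, doc2)
```

Because a terminology spectrum is an indexed list of tagged phrases, any such vector-space metric applies directly once the spectra are reduced to phrase-frequency vectors.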

For instance, if a set \(T = \left\{ {T_{1} , T_{2} , \ldots , T_{m} } \right\}\) contains a collection of texts from different time periods (in our research, several different events of the EuropaCat conference series were used), the analysis of textual and absolute frequencies of occurrence allows one to follow the “life cycle” of each term-like phrase at the quantitative level (increasing or decreasing term usage, and so on). This makes it possible to reveal research trends and new concepts in the subject field by registering changes in terminology usage in the most rapidly developing areas of research. Moreover, similar dynamics of change over time for different terms often indicates an associative linkage between them (e.g. between a new process and a newly developed catalyst or methodology). Indicator words or phrases such as “for the first time”, “unique”, “distinctive feature” and so on may also be used to detect novelties such as new recipes or catalyst compositions for the process under study.
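Such a life-cycle analysis could be sketched as follows. This is an illustrative sketch, not the authors' implementation; the phrases and period data are hypothetical, and the relative (textual) frequency is used as the tracked quantity:

```python
# Tracking the "life cycle" of a term-like phrase: its relative
# frequency across a sequence of text collections T_1..T_m from
# different time periods.

from collections import Counter

def phrase_frequency_series(collections, phrase):
    """Relative frequency of `phrase` in each period's collection.

    Each period is a list of documents; each document is a list of
    already-extracted term-like phrases.
    """
    series = []
    for texts in collections:
        counts = Counter(p for phrases in texts for p in phrases)
        total = sum(counts.values())
        series.append(counts[phrase] / total if total else 0.0)
    return series

# Hypothetical data for two periods: the share of "methanol"
# grows from 1/3 to 3/4, suggesting rising interest in the topic.
periods = [
    [["zeolite", "methanol"], ["zeolite"]],
    [["methanol", "methanol"], ["zeolite", "methanol"]],
]
trend = phrase_frequency_series(periods, "methanol")
```

A rising or falling series of this kind is exactly the quantitative signal described above for detecting research trends and associative linkage between terms.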

The usage of terminology spectra for information retrieval will be the subject of our subsequent publications.

References

  1. Salton G (1991) Developments in automatic text retrieval. Science 253:974–980.

  2. http://goldbook.iupac.org/

  3. Hussey R, Williams S, Mitchell R (2012) Automatic keyphrase extraction: a comparison of methods. In: eKNOW 2012: the fourth international conference on information, process, and knowledge management, pp 18–23

  4. Eltyeb S, Salim N (2014) Chemical named entities recognition: a review on approaches and applications. J Cheminform 6(17):1–12


  5. Gurulingappa H et al (2013) Challenges in mining the literature for chemical information. RSC Adv 3(37):16194–16211


  6. Kim SN, Medelyan O, Kan M-Y, Baldwin T (2013) Automatic keyphrase extraction from scientific articles. Lang Resour Eval 47:723–742


  7. Jessop DM et al (2011) OSCAR4: a flexible architecture for chemical text-mining. J Cheminform 3(1):41


  8. Hawizy L et al (2011) ChemicalTagger: a tool for semantic text-mining in chemistry. J Cheminform 3(1):17


  9. Kim SN, Kan M-Y (2009) Re-examining automatic keyphrase extraction approaches in scientific articles. In: Proceedings of the workshop on multiword expressions: identification, interpretation, disambiguation and applications. Association for Computational Linguistics, Suntec, Singapore, pp 9–16

  10. Zesch T, Gurevych I (2009) Approximate matching for evaluating keyphrase extraction. In: International conference on recent advances in natural language processing, RANLP

  11. Castellvi M, Bagot R, Palatresi J (2001) Automatic term detection: a review of current systems. In: Bourigault D, Jacquemin C, L’Homme M-C (eds) Recent advances in computational terminology. John Benjamins, Amsterdam, pp 53–87


  12. Bolshakova EI, Efremova NE (2015) A heuristic strategy for extracting terms from scientific texts. In: Analysis of images, social networks and texts (AIST 2015). Springer International Publishing, Berlin, pp 297–307


  13. Salton G, Buckley C (1991) Global text matching for information retrieval. Science 253:1012–1015


  14. Chodorow K, Dirolf M (2010) MongoDB: The definitive guide (1st ed). O’Reilly Media, CA. ISBN 978-1-4493-8156-1

  15. PDF Text Extraction for Java & .NET. Snowtide. http://snowtide.com/

  16. Stanford CoreNLP—A Suite of Core NLP Tools. http://nlp.stanford.edu/software/corenlp.shtml

  17. Manning CD, Surdeanu M, Bauer J, Finkel J, Bethard SJ, and McClosky D (2014) The Stanford CoreNLP Natural Language Processing Toolkit. In: Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pp 55–60

  18. Toutanova K, Klein D, Manning C, Singer Y (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of HLT-NAACL, pp 252–259

  19. Taylor A et al (2003) The Penn Treebank: an overview. In: Abeillé A (ed) Treebanks, vol 20. Springer Netherlands, Dordrecht, pp 5–22


  20. Batchelor CR and Corbett PT (2007) Semantic enrichment of journal articles using chemical named entity recognition. In: Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. Association for Computational Linguistics, Prague, Czech Republic, pp 45–48

  21. Bolshakova E, Efremova N, Noskov A (2010) LSPL-patterns as a tool for information extraction from natural language texts. In: Markov K, Ryazanov V, Velychko V, Aslanyan L (eds) New trends in classification and data mining. ITHEA, Sofia, pp 110–118

  22. https://en.wikipedia.org/wiki/Precision_and_recall

  23. https://en.wikipedia.org/wiki/F1_score

  24. Gusev VD, Salomatina NV, Kuzmin AO, Parmon VN (2012) An express analysis of the term vocabulary of a subject area: the dynamics of change over time. Autom Doc Math Linguist 46(1):1–7



Authors’ contributions

BA contributed to software development and architecture. AK conceived of the project and the tasks to be solved. AK and LI designed and performed the experiments, tested the applications and offered feedback as chemical experts. NS and VG were responsible for the n-gram analysis algorithm and scientific feedback. VP conceived and coordinated the study. All authors contributed to the scientific and methodological progress of this project. All authors read and approved the final manuscript.

Acknowledgements

Financial assistance provided by the Russian Academy of Sciences (Project No. V.46.4.4) is gratefully acknowledged.

Competing interests

The authors declare that they have no competing interests.

Author information


Corresponding author

Correspondence to Andrey O. Kuzmin.

Additional files

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

Alperin, B.L., Kuzmin, A.O., Ilina, L.Y. et al. Terminology spectrum analysis of natural-language chemical documents: term-like phrases retrieval routine. J Cheminform 8, 22 (2016). https://doi.org/10.1186/s13321-016-0136-4

