Automatic extraction of materials and properties from superconductors scientific literature

The automatic extraction of materials and related properties from the scientific literature is gaining attention in data-driven materials science (Materials Informatics). In this paper, we discuss Grobid-superconductors, our solution for automatically extracting superconductor material names and their respective properties from text. Built as a Grobid module, it combines machine learning and heuristic approaches in a multi-step architecture that accepts raw text or PDF documents as input. Using Grobid-superconductors, we built SuperCon2, a database of 40324 materials and properties records from 37770 papers. The material (or sample) information is represented by name, chemical formula, and material class, and is characterized by shape, doping, substitution variables for components, and substrate as adjoined information. The properties include the superconducting critical temperature Tc and, when available, the applied pressure and the Tc measurement method.


Introduction
In recent years, with the creation of computational databases, such as the Materials Project (MP) [1] and the Open Quantum Materials Database (OQMD) [2], and then experimental data repositories such as NIMS MDR (http://mdr.nims.go.jp) [3], focus has been steadily shifting towards a data-driven design of materials, which is often called Materials Informatics (MI). Such an approach is expected to accelerate the exploration of functional materials because it is not limited to the intuition or experience of a small number of exceptionally gifted researchers. In this new paradigm, the efficient use of data to guide experiments and material property prediction through the use of machine learning methods takes center stage. For example, data-driven methods have been used to search for and design magneto-caloric materials [4][5][6], photo-catalysts for hydrogen splitting [7], thermoelectrics [8], and superconductors [9]. In such a data-driven search, one of the most important keys lies in the availability of the data, which should at least consist of the compositions of materials and their physical properties. In the specific case of superconductivity, most of the data-driven works [9][10][11] rely on a single database: SuperCon (http://supercon.nims.go.jp).
SuperCon is a structured database of superconductor materials and properties; it was developed at the National Institute for Materials Science (NIMS) in Japan. At the time of writing this paper, SuperCon contained about 33000 inorganic and 600 organic materials and is the "de-facto" standard in data-driven research on superconductor materials (about 4400 articles contain the mention "SuperCon database" in Google Scholar). However, the SuperCon harvesting process is currently fully manual and done "from scratch": curators have to read printed matter such as PDF documents and enter the information into the system. The throughput is therefore directly proportional to the number of available human curators. Considering the cost of database construction, it is necessary to consider an assisted or alternative system that improves throughput while ensuring data quality equivalent to that of manual extraction.
As a solution, we are developing a hybrid data extraction method from scientific literature that combines automation using text data mining and manual curation. The automated system extracts and formats potential data and proposes them to the curator as "pre-cooked" structured data by (1) highlighting the relevant entities on the original document and (2) pre-filling the extracted information in a tabular format. In building the automated part of this hybrid system (training, evaluation) we used SuperMat, an annotated linked dataset from scholarly documents in superconductor research, which we recently constructed [12].
In this work, we present Grobid-superconductors: a system that automatically extracts structured information related to superconductor materials and properties from scientific literature. The tool is a specialised module of Grobid (Generation of Bibliographic Data) [13], a machine learning library designed to parse and structure scientific documents. Grobid provides an open-source platform for building specialised modules including astronomical entities recognition [14], dictionaries [15], software mentions [16], and physical measurements extraction [17]. Grobid provides several built-in features including access to PDF document layout information, citation resolution, bibliographic information consolidation through biblio-glutton [18] (a fast open-source reference matching service for CrossRef data), and a diverse set of machine learning (ML) architectures from a fast linear Conditional Random Field (CRF) to the latest state-of-the-art deep learning implementations.
Using Grobid-superconductors and other sub-tools, we established a pipeline to process a large number of documents and obtain an automated database of superconductor materials and properties. We processed 37770 papers from ArXiv (https://arxiv.org) and obtained a database of 40324 records. This new database, named SuperCon 2, can become an automated staging area for SuperCon when bridged by a curation interface. During the project, there was also an opportunity to focus on properties that have recently become of greater interest and that are underrepresented in SuperCon. For example, the "pressure" applied to obtain superconductivity (about 20 records in SuperCon) has gained attention because it can radically change the physical structure of a material. In addition, the "method" used to measure the superconducting transition temperature Tc (about 600 records in SuperCon) can be used to semantically distinguish multiple Tc values obtained from the same material or sample (e.g., to distinguish calculated and experimental values of Tc).

Grobid-superconductors
We developed Grobid-superconductors as a Grobid module following principles (multi-step, sentence-based, full-text-based) discussed in a previous preliminary study [19]. Grobid has several advantages: 1) it can be integrated with pdfalto (https://github.com/kermitt2/pdfalto), a specialised tool for converting PDF to XML, which mitigates extraction issues such as the resolution of embedded fonts, invalid character encoding, and the reconstruction of the correct reading order; 2) it allows access to PDF document layout information for both machine learning and document decoration (e.g., coordinates in the PDF document); and 3) it provides access to a set of high-quality, pre-trained machine learning models for structuring documents. Grobid-superconductors is structured as a three-step process illustrated in Figure 1 and described in Sections 2.1, 2.2, and 2.3.
Abstract versus full-text At the time of writing this paper, we are aware of related works that utilise text from abstracts as training data for machine learning. The main reason for using abstracts is that they are usually freely available as text [20] and contain condensed information [21,22]. Accurately parsing the full text presents more challenges, but these are mitigated by Grobid, and the full text contains a broader range of information, including the sample preparation process, negative results (e.g., absence of superconductivity for certain samples), and background information (e.g., reports on other materials from referenced works). Thus, Grobid-superconductors is built to support full-text documents.
Paragraphs versus sentences Another question related to natural language processing (NLP) is whether to use sentence-based or paragraph-based text. While paragraphs can be extracted as part of the layout of PDF documents, obtaining sentences adds an additional step in which text is processed with a sentence segmenter. However, sentences are almost always shorter by definition, and in deep learning, this has advantages. In training and prediction, sentences will likely be shorter than the "max sequence length" limitation (e.g., 512 tokens for transformers). During training, sentences also use less memory and allow us to train models with a larger "batch size", which has been shown to improve efficiency and obtain better results [23].
We chose to use sentence-based text in Grobid-superconductors after performing smaller-scale preliminary experiments on our two task types. For the Named Entity Recognition (NER) task, we trained and evaluated a sequence labelling model for each version (paragraph-based and sentence-based) on four annotated documents (3/1 document partition for training/evaluation) from SuperMat [12]. As indicated in Table 1, the F1-score increased by 17.94 percentage points when the sentence-based text was used.
For the Entity Linking (EL) task, we want to maximise precision. In our previous work [24] we noticed that restricting links to entities within the same sentence (rather than the same paragraph) yields higher precision (68.7% versus 57%) at the expense of lower recall (6.5% versus 10.7%) and F1-score (11.87% versus 18.01%). Therefore, for both tasks we found evidence that a sentence-based dataset is more beneficial than a paragraph-based one.

Document structuring and pre-processing
In the first step of our process, the PDF document is converted into an internal model based on a list of text statements, tokens, and features. The input document is processed using the original Grobid models, where we apply customised processes for the document header and content. We select a subset of bibliographic information from the header (title, authors, DOI, publisher, journal, and year of publication) and consolidate it via Grobid to match the publisher's quality (even when processing the "preprint version" of the publication). The extraction of superconductor entities is applied to the content, and only to relevant text items: title, abstract, text content from body or annexes, and text content from figure and table captions (Figure 2).
We use the reference markers (also called reference callouts) collected from the text as features for improving the segmentation of paragraphs into sentences: a candidate sentence boundary is discarded if it falls within the boundaries of a reference marker. For example, a sentence containing a reference in the form "Foppiano et. al." may otherwise be mistakenly segmented in the middle, at the token "et.".
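The boundary check described above can be sketched as follows. This is a minimal illustration, not the Grobid implementation: the regular expression is a naive stand-in for a real sentence segmenter, and the function name `split_sentences` and the span representation are assumptions made for this example.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive

# Naive boundary candidate (assumption): '.', '!' or '?' followed by whitespace.
BOUNDARY = re.compile(r"[.!?]\s+")


def split_sentences(text: str, reference_spans: List[Span]) -> List[str]:
    """Split text into sentences, ignoring boundaries inside reference markers."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        punct = match.start()  # offset of the punctuation character
        if any(s <= punct < e for s, e in reference_spans):
            continue  # the boundary falls inside a reference marker: discard it
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "As shown by Foppiano et. al. the method works. It is fast."
marker = "Foppiano et. al."
begin = text.index(marker)
spans = [(begin, begin + len(marker))]

naive = split_sentences(text, [])       # no reference information
guarded = split_sentences(text, spans)  # boundaries inside the marker are skipped
```

Without the reference span, the naive splitter breaks the text at "et." and "al."; with it, only the two real sentences remain.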

Named Entity Recognition
In the second step, the Named Entity Recognition (NER) task is performed on the previously extracted text.

Overview
As illustrated in Figure 3, the "superconductor parser" extracts the main superconductor-related information by aggregating the resulting entities from two ML models. The Superconductors ML model was developed based on the SuperMat schema [12]. The Quantity ML model was developed in a separate Grobid module for measurement extraction [17], and its output is limited to temperatures and pressures. Overlapping entities are merged, exact duplicates are removed, and the largest entities (in terms of string length) are preserved. The resulting entities are summarised in Table 2.
Entities of type <material>, which may contain mixed heterogeneous information, are passed in cascade to the "Material parser", which aggregates ML and other tools. First, the entity is passed through a Material ML model to segment and identify its content (Table 3). Then, different processes are applied, depending on which information is available. These processes include the following:
• Formulas are decomposed into a structured composition. We identify each element-stoichiometry pair (e.g., "O": 7.0) using mat2chem [20] and Pymatgen [25]; if only the material name is available, we look up its formula (e.g., hydrogen to H).
• Using heuristics, we classify the formula by assigning multiple classes as they are understood by superconductor researchers, for example cuprates, oxides, alloys, etc.
Finally, after all entities are extracted, the post-processing aggregates different mentions of the same material at the document level using the parsed formulas. For example, a formula with partial substitutions such as La2Fe1-xO7 (x = 0.1, 0.2) will be aggregated with materials like La2Fe0.9O7 appearing in other sections of the same document.
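To make the aggregation of substituted formulas concrete, the following sketch resolves a substitution variable and compares the resulting compositions. It is a simplified regex-based stand-in, not the mat2chem or Pymatgen API used by the system; the function names and the supported amount syntax (plain numbers and "1-x"/"1+x" forms, no parentheses) are assumptions for illustration.

```python
import re
from typing import Dict, List, Optional

# One element symbol followed by an optional amount such as "2", "0.9" or "1-x".
TOKEN = re.compile(r"([A-Z][a-z]?)([0-9.]*(?:[-+][a-z])?)")
AMOUNT = re.compile(r"([0-9.]+)(?:([-+])([a-z]))?")


def parse_amount(raw: str, variables: Dict[str, float]) -> float:
    """Evaluate amounts of the form '', '7', '0.9', '1-x' or '1+x'."""
    if raw == "":
        return 1.0  # an element without an explicit amount counts once
    m = AMOUNT.fullmatch(raw)
    if m is None:
        raise ValueError(f"unsupported amount: {raw!r}")
    value = float(m.group(1))
    if m.group(2):
        if m.group(3) not in variables:
            raise ValueError(f"no value given for variable {m.group(3)!r}")
        delta = variables[m.group(3)]
        value = value - delta if m.group(2) == "-" else value + delta
    return round(value, 6)  # avoid floating-point noise when comparing


def parse_formula(formula: str, variables: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Decompose a flat formula into element -> stoichiometry pairs."""
    variables = variables or {}
    composition: Dict[str, float] = {}
    position = 0
    for m in TOKEN.finditer(formula):
        if m.start() != position:
            raise ValueError(f"cannot parse {formula!r} at offset {position}")
        amount = parse_amount(m.group(2), variables)
        composition[m.group(1)] = composition.get(m.group(1), 0.0) + amount
        position = m.end()
    if position != len(formula):
        raise ValueError(f"cannot parse {formula!r} at offset {position}")
    return composition


def expand(formula: str, variable: str, values: List[float]) -> List[Dict[str, float]]:
    """Resolve one substitution variable into concrete compositions."""
    return [parse_formula(formula, {variable: v}) for v in values]


expanded = expand("La2Fe1-xO7", "x", [0.1, 0.2])
mention = parse_formula("La2Fe0.9O7")
same_material = mention in expanded  # the two mentions can be aggregated
```

Comparing parsed compositions rather than raw strings is what allows "La2Fe1-xO7 (x = 0.1)" and "La2Fe0.9O7" to be treated as the same material within a document.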

Machine Learning study
In this section we discuss the novel ML models we have trained for extracting specialised entities: the Superconductor ML model and the Material ML model (Figure 3). SuperMat [12], our training dataset, contains 164 papers as of the time of writing and is composed of annotated full-text and layout features from PDF documents.
For both ML models we trained and evaluated the following four architectures/implementations: linear CRF (CRF), bidirectional LSTM with CRF [26] (BidLSTM CRF), bidirectional LSTM with CRF and features [26] (the same as BidLSTM CRF with an additional input channel for features; BidLSTM CRF FEATURES), and SciBERT [27] using a CRF as the activation layer (Scibert).
The ML models are interfaced by Grobid, which uses the Wapiti [28] implementation for linear CRF, and DeLFT (Deep Learning For Text) [29] for deep learning models. The architectures CRF and BidLSTM CRF FEATURES make use of the orthogonal features we have summarised in Table A5.

Superconductor ML model
Holdout set The holdout set evaluation consists in using a fixed part of a dataset for validation. The selection must be performed to reproduce the same distribution of entities of the original dataset. We assembled the holdout set by manually selecting 32 documents (24%) from SuperMat, making sure they had a similar ratio of examples, entities and unique entities with the remaining 76% (132 documents) which was used as training set (Figure 4a). Maintaining the same rate for entity type distribution between the two sets was more challenging: on average, we obtained about 15-18% of labels of each type in the holdout set (Figure 4b), except for the <material> label (23%).
We defined the "out-of-domain" ratio as the fraction of unique entities in the holdout set that do not appear in the training set. The holdout set "out-of-domain" ratio was on average around 72%, which challenges the model's generalisation (out of every 100 unique entities in the holdout set, 72 were never seen during training). Most of the labels had an "out-of-domain" ratio above 50% (Figure 5); <material>, the most important label, had the highest ratio (82%), while <me_method> and <pressure> had the lowest (25% and 33%). The low ratio of <me_method> can be explained by its low entity variability (11.44%).
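The two ratios used in this section can be sketched as below. This is an illustrative reading of the definitions in the text, assuming that entities are compared by exact string match; the function names and the toy entity lists are made up for the example.

```python
from typing import Iterable


def out_of_domain_ratio(holdout_entities: Iterable[str], training_entities: Iterable[str]) -> float:
    """Share of unique holdout entities that never occur in the training set."""
    holdout_unique = set(holdout_entities)
    if not holdout_unique:
        return 0.0
    unseen = holdout_unique - set(training_entities)
    return len(unseen) / len(holdout_unique)


def label_variability(entities: Iterable[str]) -> float:
    """Ratio between unique entities and total entities for one label."""
    entities = list(entities)
    if not entities:
        return 0.0
    return len(set(entities)) / len(entities)


# Toy data (hypothetical, not taken from SuperMat).
training = ["MgB2", "FeSe", "YBCO"]
holdout = ["MgB2", "LaH10", "H3S", "LaH10"]

ood = out_of_domain_ratio(holdout, training)  # 2 of 3 unique entities are unseen
variability = label_variability(holdout)      # 3 unique entities out of 4 mentions
```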
Positive sampling We trained the model with positive sampling by removing the examples without entities (negative examples, Figure 4a). This approach provided an improvement of 2% in both precision and recall as compared to the result without sampling when testing against the holdout set. Additional experiments with active and random sampling [16] with ratios of negative examples of 0.1, 0.25, 0.5 and 1.0 did not provide stable evidence suggesting scoring improvements when testing against the holdout set.
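As a minimal sketch of positive sampling, the filter below drops every training example that contains no entity. The example representation (a list of token/label pairs with "O" marking tokens outside any entity) is an assumption for illustration and not necessarily the format used by Grobid or DeLFT.

```python
from typing import List, Tuple

Example = List[Tuple[str, str]]  # (token, label) pairs; "O" = outside any entity (assumed)


def positive_sampling(examples: List[Example]) -> List[Example]:
    """Keep only the examples that contain at least one labelled entity."""
    return [ex for ex in examples if any(label != "O" for _, label in ex)]


examples = [
    [("MgB2", "material"), ("superconducts", "O"), ("at", "O"), ("39", "tcValue"), ("K", "tcValue")],
    [("We", "O"), ("thank", "O"), ("the", "O"), ("reviewers", "O")],  # negative example
]
positives = positive_sampling(examples)
```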
Evaluation The best results were obtained by Scibert with an F1 of 77.03% and a recall of around 80.69% (Table 4). The features did not provide any improvements with RNN models: BidLSTM CRF and BidLSTM CRF FEATURES resulted in the same F1 score. This result comes as a surprise because features such as superscript/subscript were expected to be determinants for recognising material sequences.
The <pressure> label had the lowest performance scores in all architectures. We believe that 274 training examples are not a sufficiently large number, considering that pressure expressions can be context-dependent because they can refer to different types of pressure (e.g., annealing pressure). The label with the highest score was <material>, with F1 values of 80.77% and 78.06% for Scibert and BidLSTM CRF, respectively. In addition, <material> had the highest "out-of-domain" ratio in the holdout set (greater than 75%, Figure 5) and the highest "label variability" (the ratio between unique entities and total entities, about 42%), which suggests that the model correctly recognises materials that have not been "seen" during training. On the other hand, the <me_method> label, which has lower "label variability" (around 11%) and a low "out-of-domain" ratio, had an F1 score of 66.56% with Scibert and 65.92% with BidLSTM CRF. For <tc>, the CRF outperformed the other architectures (F1 score of 83.96%), especially Scibert (78.35%). This outcome can be explained by the extremely low variability (12.69%) of entities labelled as <tc>.
Scibert shows good generalisation capacity for unseen examples or examples appearing in different contexts. For example, in Figure 6a, only Scibert correctly extracts "above 100K", while CRF misses it completely and BidLSTM CRF misses "above". In the training data, "above 100K" is not present, but "below 100K" and "100K" are present, and several other entities contain the token "above"; this suggests that Scibert has learned that the token "above" is relevant to the temperature. In a second example (Figure 6b), only Scibert correctly extracts "W-C nanowire", which is not present in the SuperMat training data. Unfortunately, we cannot check whether "above 100K" or "W-C nanowire" are also present in the dataset used for the pre-training of SciBERT by its authors [27] because the data are not available.
Material ML model
To train the Material ML model we created a special dataset with an additional layer of labels (Table 3), which included the material information represented by entities annotated as <material> in the SuperMat documents.
Holdout set For this model we created an independent holdout set because the manual annotation work is performed on smaller chunks of text and requires less effort than annotating sentences, as we did when we developed SuperMat. We used material data extracted from a dataset of 500 documents (500-papers) from three publishers: American Institute of Physics (AIP), American Physical Society (APS) and Institute of Physics (IOP) [24]. The resulting holdout set has an average coverage greater than 25% (Figure 7) and an average "out-of-domain" ratio of 83.93% (Figure 8).
Evaluation Scibert obtained the best results, with F1 at 84.15% (Table 5). The inclusion of features in the BidLSTM CRF architecture improved results by less than 1 percentage point (from 83.13% to 83.76%). The label <fabrication> did not perform well with any architecture, most likely because it is too generic (Table 3) and its content is too heterogeneous. Another label, <substrate>, has only one-third of the training examples of <fabrication> but obtained results that were three times higher with Scibert, suggesting that <fabrication> should be split into separate and more homogeneous labels.

Entity Linking
Entity linking (EL) links materials and their corresponding properties.
We use a rule-based algorithm, but there are other approaches such as the use of dependency parsing [30][31][32][33]. We did not use these because it was difficult to find a suitable dependency parser for scientific texts, and complementary methods based on complex rule sets were needed to compensate for the poor performance of the parser.
In our algorithm, pairs of entities are linked focusing on three types of link:
• material-tcValue: The link between a material and its corresponding Tc.
• tcValue-pressure: The link between Tc and its related critical pressure.
• me_method-tcValue: The link between Tc and its corresponding measurement method.
Entities of type <tcValue> are pre-processed through a classifier that establishes whether or not they are temperatures related to superconductivity. This is used to exclude other temperatures (e.g., annealing, transition, Curie) which might be incorrectly extracted by the previous step. This rule-based classifier combines the extracted entities of Tc expressions (label <tc>) with a set of predefined standard terms. If a temperature is not considered a Tc, it is excluded from the list of possible linking candidates.
Two scenarios are considered. If the sentence contains only two entities to be linked, they are linked automatically; otherwise, further rules are applied. If the word "respectively" appears in the sentence, we apply "order-linking". For example, consider the following sentence: "P-or Ba-122 and Co-doped Ba-122 have lower Tc's of about 30 K and 24 K, respectively, which makes helium free operation questionable."
It contains the word "respectively", and by applying "order-linking", P-or Ba-122 is assigned to 30 K and Co-doped Ba-122 to 24 K.
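A minimal sketch of "order-linking" is given below, assuming that both entity lists are already in order of appearance in the sentence. Returning no links when the two lists differ in length is an assumption of this sketch; the text does not specify how that case is handled.

```python
from typing import List, Tuple


def order_linking(sentence: str, materials: List[str], tc_values: List[str]) -> List[Tuple[str, str]]:
    """Pair the i-th material with the i-th Tc when 'respectively' is present."""
    if "respectively" not in sentence.lower():
        return []  # order-linking does not apply
    if len(materials) != len(tc_values):
        return []  # assumption: ambiguous enumerations are left unlinked
    return list(zip(materials, tc_values))


sentence = ("P-or Ba-122 and Co-doped Ba-122 have lower Tc's of about 30 K and 24 K, "
            "respectively, which makes helium free operation questionable.")
links = order_linking(sentence, ["P-or Ba-122", "Co-doped Ba-122"], ["30 K", "24 K"])
```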
If the word "respectively" does not appear in the sentence, we apply "distance-linking", which defines the distance d between two entities as the number of characters between their centroids. Entities surrounded by parentheses are expanded to the whole parenthesis, and their centroid is updated. As an example, in the sentence "We tested two materials MgB2 (Tc = 39 K) and FeSe (Tc = 16 K).", 39 K is closer to FeSe (d = 10) than to MgB2 (d = 11). In this example, however, both temperature entities are expanded to their containing parentheses (e.g., "39 K" to "(Tc = 39 K)"). The centre of the entity "39 K" is thereby shifted toward the left, from the initial value of 38 to 35, and the distance from MgB2 is reduced from d = 11 to d = 8. As a result, the MgB2 entity is correctly linked to "39 K".
The distance calculation is also adjusted with "penalties": the calculated distance is doubled when certain keywords or punctuation marks (",", ".", ";", "and", "but", "while", "whereas", "which", "although") appear between two entities, because they represent a logical separation of predicates [34]. In the above example, the distance between 39 K and FeSe would be doubled (d = 20) and the link would not be made.
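The distance computation, the parenthesis expansion, and the penalty can be sketched as follows on the example sentence. This is an illustration under stated assumptions rather than the system's code: offsets are 0-based with exclusive ends, centroids use integer division, and the penalty is applied once regardless of how many separators occur. With this convention the absolute centroid positions differ by a constant offset from the values quoted in the text, but the distances are the same.

```python
import re
from typing import Tuple

Span = Tuple[int, int]  # 0-based character offsets, end exclusive

PENALTY_PUNCTUATION = {",", ".", ";"}
PENALTY_WORDS = {"and", "but", "while", "whereas", "which", "although"}


def span_of(text: str, mention: str) -> Span:
    """Locate the first occurrence of a mention (helper for this example)."""
    start = text.index(mention)
    return (start, start + len(mention))


def centroid(span: Span) -> int:
    return (span[0] + span[1]) // 2


def expand_to_parenthesis(text: str, span: Span) -> Span:
    """Grow a span to the directly enclosing '(...)', if there is one."""
    opening = text.rfind("(", 0, span[0])
    closing = text.find(")", span[1])
    if opening == -1 or closing == -1:
        return span
    # The pair must really enclose the span: no ')' before it, no '(' after it.
    if ")" in text[opening:span[0]] or "(" in text[span[1]:closing]:
        return span
    return (opening, closing + 1)


def has_separator(text: str, a: Span, b: Span) -> bool:
    """True if a penalised keyword or punctuation mark lies between the spans."""
    left, right = sorted([a, b])
    between = text[left[1]:right[0]]
    if any(ch in PENALTY_PUNCTUATION for ch in between):
        return True
    return any(w in PENALTY_WORDS for w in re.findall(r"[a-z]+", between.lower()))


def distance(text: str, a: Span, b: Span, penalise: bool = True) -> int:
    d = abs(centroid(a) - centroid(b))
    if penalise and has_separator(text, a, b):
        d *= 2  # assumption: doubled once, however many separators occur
    return d


text = "We tested two materials MgB2 (Tc = 39 K) and FeSe (Tc = 16 K)."
mgb2, fese, t39 = span_of(text, "MgB2"), span_of(text, "FeSe"), span_of(text, "39 K")

raw_mgb2 = distance(text, mgb2, t39, penalise=False)  # 11
raw_fese = distance(text, fese, t39, penalise=False)  # 10
penalised_fese = distance(text, fese, t39)            # 20, "and" lies in between
t39_expanded = expand_to_parenthesis(text, t39)       # span of "(Tc = 39 K)"
expanded_mgb2 = distance(text, mgb2, t39_expanded)    # 8
```

Either adjustment alone is enough to link "39 K" to MgB2 in this sentence: the expansion brings MgB2 closer, and the penalty pushes FeSe further away.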
This rule-based linking was evaluated, separately for each link type, using the linked entities from SuperMat [12] (Table 6). The F1 score for material-tcValue was about 80%, with a precision of 88.40%. The tcValue-pressure F1 score was 3% lower than that of material-tcValue, with much less data available (support of 118 compared with 726).

End to end evaluation
End-to-end evaluation (E2EE) measures the capacity of the system from the PDF documents to the final linked results. We limited the scope of the E2EE to the triplet 'material-Tc-pressure' which, at the moment, is the backbone upon which the database is built. We performed the E2EE on the "500-papers" dataset, where we manually examined the resulting database as follows: 1) we marked invalid records and 2) we identified the cause of failure from a predefined set of five error types (Figure 9):
• From table: the text is wrongly extracted from a table. Although table content is ignored, the error rate from the Grobid library is still relevant due to the lack of training data.
• Extraction: entities are not recognised, wrongly recognised, or partially recognised.
• Quantity extraction: quantity entities (pressure, temperature) are not correctly extracted. We measured this error separately to identify the failures that could be shared with the Quantity ML model.
• Tc classification: the temperature is wrongly classified as superconducting Tc.
• Linking: given that the initial steps were performed correctly, the resulting entities are not linked correctly.
The E2EE scores are summarised in Table 7. Recall is omitted because it is less relevant and difficult to calculate manually. The precision score (micro average) was 72.60% for all the subsections, although the precision for figure captions (59.28%) and unknown subsections (57.14%) was clearly lower than that of the other subsections (> 70%). The 'unknown' subsections indicate that the structure of the extracted text was not well identified by Grobid but the text was nevertheless aggregated. The overall score increases to 73% when excluding unknown subsections, 75.24% when excluding figure captions, and 79.14% when excluding both. Excluding these two subsections will not substantially reduce the amount of text, because together they account for less than 20% of the total number of subsections.
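For readers unfamiliar with micro averaging, the sketch below shows how a micro-averaged precision changes when subsections are excluded. The counts are hypothetical placeholders chosen for illustration and are not the values of Table 7; the function name is also an assumption.

```python
from typing import Dict, Iterable, Tuple

# subsection -> (valid records, extracted records); hypothetical counts
Counts = Dict[str, Tuple[int, int]]


def micro_precision(counts: Counts, exclude: Iterable[str] = ()) -> float:
    """Pool valid and extracted records over the kept subsections, then divide."""
    excluded = set(exclude)
    kept = [pair for name, pair in counts.items() if name not in excluded]
    valid = sum(v for v, _ in kept)
    extracted = sum(e for _, e in kept)
    return valid / extracted if extracted else 0.0


counts = {
    "abstract": (80, 100),
    "body": (150, 200),
    "figure caption": (30, 50),
    "unknown": (20, 40),
}
overall = micro_precision(counts)
without_both = micro_precision(counts, exclude=["figure caption", "unknown"])
```

Because the micro average pools records before dividing, removing low-precision subsections raises the overall score in proportion to how many records they contributed.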
The error types are summarised in Figure 10. The most common failures originate from Tc classification (40%), Linking (32%), and Extraction (20%). The most common Tc classification failures are incorrect recognitions of 1) relative values of Tc (e.g., 1 K higher than material X); 2) values indicating the transition temperature width (∆Tc); 3) temperature values that are not Tc, for example, material synthesis temperatures (T) or other critical transition temperatures that are not superconducting (e.g., TCurie); and 4) values of temperature at which there is no superconductivity (e.g., "at 70 K there is no superconductivity"). "Linking" errors mainly occur when the text compares relative values of Tc using materials as the basis for comparison (e.g., "The Tc = 38 K is similar to the one of MgB2"). Finally, "Extraction" issues mainly originate from 1) implicit mentions of the main material when it is tested with different "substrate" combinations, and 2) mismatches between <material> and <class> which, by definition, overlap.

SuperCon 2
We created SuperCon 2 by processing 37770 research papers belonging to the category cond-mat.supr-con in ArXiv. Currently SuperCon 2 contains 40324 records, including 2052 triplets with applied pressure (material-Tc-pressure) and 3602 records with an explicit measurement method (material-Tc-measurement method). The schema of SuperCon 2 is summarised with examples in Table 8.
The data is processed and ingested through the asynchronous Map-Reduce approach [35]. The "extraction task" (Map) processes the PDF documents by accessing Grobid-superconductors via its REST API and stores their processed representation together with the original PDF document. Then, the "aggregation task" (Reduce) reduces the document information into a synthesised tabular format. We store the processed document representation in JSON format. The processed documents are kept separately and used for displaying the enhanced PDF document (Figure 11). The pipeline uses a persistence layer for storage and reporting (logger).
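The two tasks can be sketched as below. This is only a schematic illustration: the extraction step is replaced by a local stub (`fake_extract`) instead of a call to the Grobid-superconductors REST API, and the JSON layout, file names, and DOIs are hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical processed documents, keyed by file name (stand-in for the service).
FAKE_RESPONSES = {
    "a.pdf": {"doi": "10.0000/a", "records": [{"material": "MgB2", "tc": "39 K"}]},
    "b.pdf": {"doi": "10.0000/b", "records": [{"material": "FeSe", "tc": "16 K"},
                                              {"material": "FeSe", "tc": "8 K"}]},
}


def fake_extract(pdf_name: str) -> str:
    """Stub for the extraction service: returns one document as a JSON string."""
    return json.dumps(FAKE_RESPONSES[pdf_name])


def map_task(pdf_names: List[str], extract: Callable[[str], str]) -> List[str]:
    """Extraction task (Map): process documents concurrently, keep the JSON."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(extract, pdf_names))  # results keep the input order


def reduce_task(documents_json: List[str]) -> List[Dict[str, str]]:
    """Aggregation task (Reduce): flatten the documents into tabular rows."""
    rows = []
    for raw in documents_json:
        document = json.loads(raw)
        for record in document["records"]:
            rows.append({"doi": document["doi"], **record})
    return rows


stored = map_task(["a.pdf", "b.pdf"], fake_extract)
table = reduce_task(stored)
```

Keeping the per-document JSON separate from the reduced table mirrors the description above: the former serves the enriched PDF view, the latter the tabular search.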
We built a visualisation interface to exploit the extracted information. Users can search in the synthesised tabular data, access the PDF document enriched with the extracted information (Figure 11), and export locally in CSV, TSV and Microsoft Excel formats.

Conclusion
In this work, we present our solution for automatically building a database of materials and properties from scientific literature. Our contribution is composed of: 1) Grobid-superconductors, a specialised open-source system that processes PDF documents combining ML and rule-based methods to extract and link relevant information in superconductors research; 2) a pipeline allowing large-scale document processing; and 3) a visualisation interface for rapid data exploration, which includes PDF document information enrichment.
We made SuperCon 2, a database with 40324 records of superconductor materials and properties, including the applied pressure and the Tc measurement method. SuperCon 2 is available in text format at https://github.com/lfoppiano/supercon.
In the future, we plan to improve our tools by 1) extracting more properties, such as crystal structure type, space group type, and lattice structure; 2) training supervised models for the "Linking step"; and 3) extending the interface to support data correction toward efficient curation. We confirmed the good generalisation ability of the Scibert architecture for the entity extraction task. Although we hope to obtain better results using BERT models pre-trained on materials science text, such as MatSciBERT [36], the gain might be minimal for relatively larger models [37].

Data and code availability
Grobid-superconductors is available on GitHub at https://github.com/lfoppiano/grobid-superconductors and the code is released under the Apache 2.0 license. SuperCon 2 is available in text format at https://github.com/lfoppiano/supercon.
Figure 10. Error type distribution in the E2EE of the 500-papers dataset.
Figure 11. Example of a superconductors research PDF document [40] enriched with extracted annotations. Materials information (class, formula) and properties (Tc) are summarised in the information box when the user clicks on the highlighted annotated entity in the text.

Hash, Timestamp
Hash calculated on the binary content of the original PDF document and the timestamp when the document was processed.