DigiMOF: A Database of Metal–Organic Framework Synthesis Information Generated via Text Mining

The vastness of materials space, particularly that which is concerned with metal–organic frameworks (MOFs), creates the critical problem of performing efficient identification of promising materials for specific applications. Although high-throughput computational approaches, including the use of machine learning, have been useful in rapid screening and rational design of MOFs, they tend to neglect descriptors related to their synthesis. One way to improve the efficiency of MOF discovery is to data-mine published MOF papers to extract the materials informatics knowledge contained within journal articles. Here, by adapting the chemistry-aware natural language processing tool, ChemDataExtractor (CDE), we generated an open-source database of MOFs focused on their synthetic properties: the DigiMOF database. Using the CDE web scraping package alongside the Cambridge Structural Database (CSD) MOF subset, we automatically downloaded 43,281 unique MOF journal articles, extracted 15,501 unique MOF materials, and text-mined over 52,680 associated properties including the synthesis method, solvent, organic linker, metal precursor, and topology. Additionally, we developed an alternative data extraction technique to obtain and transform the chemical names assigned to each CSD entry in order to determine linker types for each structure in the CSD MOF subset. This data enabled us to match MOFs to a list of known linkers provided by Tokyo Chemical Industry UK Ltd. (TCI) and analyze the cost of these important chemicals. This centralized, structured database reveals the MOF synthetic data embedded within thousands of MOF publications and contains further topology, metal type, accessible surface area, largest cavity diameter, pore limiting diameter, open metal sites, and density calculations for all 3D MOFs in the CSD MOF subset. 
The DigiMOF database and associated software are publicly available for other researchers to rapidly search for MOFs with specific properties, conduct further analysis of alternative MOF production pathways, and create additional parsers to search for additional desirable properties.


Article Retrieval
Article retrieval is achieved by using DOIs to automatically download articles from journal websites. Two methods may be used to retrieve article DOIs when assembling a corpus of MOF articles using CDE. In the first method, a web scraping script developed in the most recent version of CDE sends a search query to Elsevier and the Royal Society of Chemistry to extract the DOIs, which are then used to download each article as an HTML file. We also developed a second method, which retrieves MOF reference codes and their associated DOIs from the Cambridge Structural Database (CSD) using the CSD Python API. Both methods produce a CSV file that stores the DOIs of the articles to be downloaded. Here, we used the CSD Python API, as using search queries was found to significantly increase the time required for web scraping. We wrote a Python script that calls the Selenium webdriver to navigate to the article webpage and the PyAutoGUI library to save the articles as HTML files. After the web scraping script is started, a window appears in which the user must sign into their DOI account via their institution's website to gain access to the publications. The scraper then automatically copies and pastes article DOIs from the CSV file in which they are stored and downloads the corresponding articles. The web scraping script can download approximately three articles per minute. We recommend that researchers use high-performance computing clusters to assemble corpora that contain thousands of articles, to avoid a bottleneck in the pipeline.
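The DOI handling described above can be sketched as follows. This is a minimal illustration and not the DigiMOF scraping script: the CSV column name "doi", the helper names, and the placeholder DOIs in the usage note are assumptions, and the browser automation step (Selenium and PyAutoGUI) is deliberately left out. The sketch reads DOIs from CSV text, removes duplicates, builds resolver URLs, and estimates the sequential download time at the quoted rate of about three articles per minute.

```python
import csv
import io

ARTICLES_PER_MINUTE = 3  # approximate rate quoted in the text


def load_dois(csv_text, column="doi"):
    """Return unique, non-empty DOIs from CSV text in first-seen order.

    The column name "doi" is an assumption about the CSV layout.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    dois = []
    for row in reader:
        doi = (row.get(column) or "").strip()
        key = doi.lower()  # DOIs are case-insensitive
        if doi and key not in seen:
            seen.add(key)
            dois.append(doi)
    return dois


def doi_to_url(doi):
    """Build the doi.org resolver URL for a DOI."""
    return "https://doi.org/" + doi


def estimated_hours(n_articles, rate_per_minute=ARTICLES_PER_MINUTE):
    """Estimate sequential download time in hours at a fixed rate."""
    return n_articles / rate_per_minute / 60
```

For example, load_dois("refcode,doi\nAAAAAA,10.0000/example.1\n") returns a one-element list containing the placeholder DOI. At the quoted rate, estimated_hours(43281) gives about 240 hours of sequential downloading for the 43,281 articles reported in the abstract, which illustrates why parallel resources are recommended for large corpora.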

Database Overview and Performance
As the data mined for the DigiMOF database consisted of qualitative text-text relationships, in contrast to the text-numerical records of previously conducted text mining studies, there were considerably more linguistic and syntactic variations in the reporting of the properties of interest than in previous projects. Note that the example record shown for this database is a real record; in this instance, the parser associated no topology or solvent with the MOF compound. This is typical, as it is rare for all five properties to be found in one compound record. We have attempted to represent records from previous projects as faithfully as possible, but the exact format of the scraped records is not always available in the source material.
The machine-learning-assisted version of CDE used for the Néel and Curie temperature database achieved a precision of 82% on its test set, but this is expected to converge to 66% over time as the algorithm is trained on broader datasets 2. Park et al. 3 reported an accuracy of 79%, but recall and F-score were not reported in their work. Luo et al. 4 also reported an accuracy of 78.9%, which they referred to as consistency; this value was obtained by matching the manually extracted records in the SynMOF-M database with the automatically extracted records in the SynMOF-A database.
It is also important to note that when a sentence contained multiple compound names associated with other properties, our parsers could only identify the properties correctly if each MOF compound name was directly preceded or followed by its property, without another MOF compound name separating the two. Some sentences, however, list multiple MOF names first and their corresponding properties second, and this resulted in the erroneous association of the last MOF compound name with the first property name.
Finally, a filter for MOF names was created using a regular expression to ensure that only MOF compound names were extracted into the database, which further limited the entries and increased precision. The performance of each individual parser was also manually assessed on 50 random journal articles. For each property in the database (synthesis routes, topologies, solvents, linkers, and metal precursors), both the precision and the recall were calculated, as shown in Table S.3 below. To calculate the precision, each property was first manually extracted from all 50 papers. Each value extracted by the parsers was then assigned a value of "1" if the match was correct and a value of "0" if the match was incorrect. The total number of correct extractions was then divided by the total number of values extracted by the parsers (incorrect extractions being false positives) to obtain the precision. To calculate the recall, the total number of correct extractions was divided by the total number of correct values that the parsers could have extracted from the papers (missed values being false negatives).
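The precision and recall calculation described above can be written out as a short sketch. The function names and the example labels are illustrative assumptions rather than part of the DigiMOF software. Each parser-extracted value carries a label of 1 (correct) or 0 (incorrect), and n_reference is the number of correct values found by manual extraction. The F-score helper uses the standard harmonic mean of precision and recall.

```python
def precision_recall(labels, n_reference):
    """Compute precision and recall from 1/0 labels.

    labels: one entry per parser-extracted value, 1 if correct, 0 if not.
    n_reference: number of correct values found by manual extraction.
    """
    if any(label not in (0, 1) for label in labels):
        raise ValueError("labels must contain only 0 or 1")
    correct = sum(labels)
    # Precision: correct extractions over everything the parser extracted.
    precision = correct / len(labels) if labels else 0.0
    # Recall: correct extractions over everything that could have been found.
    recall = correct / n_reference if n_reference else 0.0
    return precision, recall


def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

As a made-up example, four extracted values labelled [1, 1, 0, 1] against five manually identified values give a precision of 0.75 and a recall of 0.6.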

Data Transformation and Visualization
Following the data extraction process, the data was converted from JSON format to a Microsoft Excel-compatible .csv file. While the filter was able to exclude many non-MOF names, it missed some, such as "[Co2]", "Cd-", and "Cu()", which were therefore removed using Excel's find and replace function. Additionally, the transfer of the data to Excel format introduced special characters such as "Â", "â€", and "âˆž", and these were also removed. Furthermore, data that were obviously not linkers or metal precursors, such as "KOH" and "NbO", were deleted from the database, and frequent misidentifications were noted so that they could be added to exclusion lists. During this transformation process, synonyms such as "DMF", "N,N-dimethylformamide", and "dimethylformamide" were also combined to ensure that data entries were only counted once. After the data was transformed, it was combined with the data extracted from the CSD using Excel's Power Query. The two sets of records were matched on the article download number, which corresponded to the row number in the CSD.
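The cleaning steps above were carried out in Excel; the following sketch shows how the same steps could be scripted. It is an illustration under stated assumptions, not the procedure used for DigiMOF: the record field names ("compound", "solvent", "linker", "metal_precursor") and the helper names are hypothetical, and the lists contain only the examples named in the text.

```python
MOJIBAKE = ("Â", "â€", "âˆž")                 # stray sequences named in the text
NON_MOF_NAMES = {"[Co2]", "Cd-", "Cu()"}      # names the regex filter missed
EXCLUDED_VALUES = {"KOH", "NbO"}              # not linkers or metal precursors
SYNONYMS = {                                  # lowercased variant -> canonical
    "dmf": "DMF",
    "n,ndimethylformamide": "DMF",
    "n,n-dimethylformamide": "DMF",
    "dimethylformamide": "DMF",
}


def strip_mojibake(text):
    """Remove the known stray character sequences and surrounding spaces."""
    for seq in MOJIBAKE:
        text = text.replace(seq, "")
    return text.strip()


def normalise_solvent(name):
    """Map known synonyms onto one canonical solvent name."""
    cleaned = strip_mojibake(name)
    return SYNONYMS.get(cleaned.lower(), cleaned)


def clean_record(record):
    """Return a cleaned copy of a record, or None if it is not a MOF entry."""
    compound = strip_mojibake(record.get("compound", ""))
    if not compound or compound in NON_MOF_NAMES:
        return None
    cleaned = {"compound": compound}
    for field in ("linker", "metal_precursor"):
        value = record.get(field)
        if value is not None:
            value = strip_mojibake(value)
            if not value or value in EXCLUDED_VALUES:
                value = None  # drop obvious misidentifications
        cleaned[field] = value
    solvent = record.get("solvent")
    cleaned["solvent"] = normalise_solvent(solvent) if solvent else None
    return cleaned
```

Keeping the exclusion and synonym lists as plain data structures means that newly observed misidentifications can be appended without changing the cleaning logic.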

Building Blocks and Topology
Further analysis was performed to compare the most common MOF building blocks and available topologies. For the structures that most commonly reported topology and metal cluster in the experimental manuscripts, the metal precursors were all metal nitrates, primarily hydrated nitrates of transition metals. The linker types are primarily organic compounds that bond to metal clusters at each end of a straight chain. In Figure S.4, 'bipy' refers exclusively to 2,2'-bipyridine. We also investigated the ratio of the largest cavity diameter (LCD) to the pore limiting diameter (PLD) of the building blocks with respect to the topology. Lastly, Figure S.6 shows a comparison of the linker length against the LCD/PLD ratio for the pcu topology. At shorter lengths, the ratio reaches a higher maximum value of 5.5, as well as a higher median value. The ratio continues to decrease as the linker length increases.
Given that the topology is the same across all of these structures, it is unlikely that this change in ratio can be attributed to a structure being restricted to a single pore shape as opposed to a variety of mesopores and micropores. Whilst the pore sizes may vary, we would expect to see more uniform pore shapes for matching topologies. However, given that pcu is one of the most basic nets, and that the topological assignment was performed using the Single Node algorithm, it is possible that these frameworks do adopt different structures and therefore display some variety of micropores and mesopores.
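The comparison of linker length against the LCD/PLD ratio can be sketched as a simple binning calculation. This is an assumed illustration of how such a summary could be computed, not the analysis code behind the figure: the function name, the tuple layout, and the bin width are hypothetical. LCD and PLD must share the same length unit, and the function returns the maximum and median ratio for each linker-length bin.

```python
from collections import defaultdict
from statistics import median


def ratio_summary(records, bin_width):
    """Summarise LCD/PLD by linker-length bin.

    records: iterable of (linker_length, lcd, pld) tuples.
    Returns {bin_start: (max_ratio, median_ratio)}, ordered by bin_start.
    """
    bins = defaultdict(list)
    for length, lcd, pld in records:
        if pld <= 0:
            continue  # ratio is undefined without a positive PLD
        bin_start = int(length // bin_width) * bin_width
        bins[bin_start].append(lcd / pld)
    return {
        start: (max(ratios), median(ratios))
        for start, ratios in sorted(bins.items())
    }
```

Applied to the pcu structures only, a table of this kind would show whether both the maximum and the median ratio fall as the linker length increases.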

Parsed Articles
The test set of 50 journal articles can be found below.