Reflections on one million compounds in the open quantum materials database (OQMD)

Density functional theory (DFT) has been widely applied in modern materials discovery, and many materials databases, including the open quantum materials database (OQMD), contain large collections of calculated DFT properties of experimentally known crystal structures and hypothetical predicted compounds. Since the inception of the OQMD in late 2010, over one million compounds have been calculated and stored in the database, which is used by researchers worldwide to advance materials studies. The growth of the OQMD depends on project-based high-throughput DFT calculations, including structure-based projects, property-based projects, and most recently, machine-learning-based projects. Another major goal of the OQMD is to ensure the openness of its materials data to the public, and the OQMD developers are working with other materials databases to reach a universal querying protocol in support of the FAIR data principles.


History and motivation
For much of the history of materials science, the generation of phase-based materials thermodynamic data has been slow and tedious, relying on delicate calorimetric measurements of carefully prepared single-phase materials samples. Decades of early effort resulted in compilations of thermodynamic quantities (e.g. JANAF [1], Kubaschewski [2], etc) containing entries for ∼1000-2000 phases (including solids, liquids, and gases). Similarly, the exploration of phase equilibria in complex multi-component systems required such time, cost, and precision, resulting in similar compilations of phase diagrams (ASM [3], ACerS [4], and Pauling File [5]). Under these early conditions, the task of materials discovery (i.e. finding a novel single-phase structure to satisfy some performance metric) was difficult, relying extensively on expert intuition or gleaned correlations in the known materials space (e.g. Pettifor/Zunger maps [6,7] or the Miedema model [8]).
Density functional theory (DFT) was a critical enabling technology for accelerating modern materials discovery. By generating ab initio predictions of total energy for arbitrary arrangements of atoms, a self-consistent dataset of thermodynamic quantities and phase stability could be produced for entirely unknown systems. From early calculations of individual molecules and simple crystalline solids, improving DFT software performance and computing performance eventually enabled many complex structures to be readily simulated in a given study.
The prospect of calculating the DFT properties of every known crystal structure became possible, and several teams took on this task, including the Materials Project [9], AFLOW [10], and our open quantum materials database (OQMD) [11]. These databases each contain predicted thermodynamic properties of tens of thousands of experimentally observed structures from the inorganic crystal structure database (ICSD) [12], dwarfing the earlier experimental compilations and becoming foundations for materials discovery efforts.
among many others. In fact, 83% of all compounds in the ICSD share a prototype with another compound, and 27% of all compounds share a prototype with ⩾50 other compounds [18]. Reason #2 is that we can use a prototype to arrive at a stable or metastable crystal structure for hypothetical compounds. This is done by taking a known compound of that prototype, substituting in elements of the hypothetical composition, and using DFT to relax unit cell parameters and atomic coordinates along the symmetry directions.
It is natural to start with the most common prototypes as the blueprint for generating hypothetical compounds. In the OQMD, we used several common prototypes to conduct 'exhaustive' high-throughput DFT of hypothetical compounds, i.e. calculate nearly all possible compounds by substitution of elements from the periodic table. The prototypes completed include binaries B1 (NaCl), B2 (CsCl), B3 (zincblende), B_h (WC), C15 (MgZn2), D0_3 (BiF3), D0_19 (Ni3Sn), D0_22 (Al3Ti), L1_0 (CuAu), L1_1 (CuPt), L1_2 (Cu3Au); and ternaries C1_b (Half Heusler) and L2_1 (Full Heusler). This has amounted to 393 879 DFT calculations (54 030 binary and 339 849 ternary), and has been quite fruitful: 1973 of these hypothetical compounds (396 binary and 1577 ternary) are on the convex hull of stability and do not have an ICSD polymorph that is close in energy. On the other hand, most of the hypothetical compounds are above the convex hull and therefore much less likely to be synthesizable (see figure 2). DFT calculations of unstable compounds are still useful to have for boosting training sets of machine learning (ML) models as well as general understanding. However, while we can continue these exhaustive high-throughput DFT calculations using other common prototypes, we cannot do this for all 10 203 prototypes we have from ICSD. If we wanted to do exhaustive DFT for all prototypes up to five components using 76 elements in the periodic table (Z ⩽ 83 excluding noble gases, Tc, and Pm), then we would have to do trillions of DFT calculations.
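To give a feel for the scale of this estimate, a rough upper-bound count can be sketched as follows. The split of prototypes by number of components is a hypothetical placeholder (the text reports only the total of 10 203 prototypes), and the function name is illustrative. The count treats every ordered assignment of distinct elements to a prototype's sites as a separate calculation, which overcounts prototypes whose sites are symmetry-equivalent.

```python
from math import perm

N_ELEMENTS = 76  # from the text: Z <= 83 excluding noble gases, Tc, and Pm


def ordered_decorations(n_components: int, n_elements: int = N_ELEMENTS) -> int:
    """Ways to assign distinct elements to n_components inequivalent sites."""
    return perm(n_elements, n_components)


# HYPOTHETICAL split of prototypes by number of components (illustration only;
# the source gives just the total number of ICSD prototypes).
hypothetical_prototype_counts = {2: 2000, 3: 4000, 4: 3000, 5: 1000}

total = sum(
    n_protos * ordered_decorations(k)
    for k, n_protos in hypothetical_prototype_counts.items()
)

for k in range(2, 6):
    print(f"{k} components: {ordered_decorations(k):,} decorations per prototype")
print(f"total under the assumed split: {total:.2e} calculations")
```

Under these assumptions the quinary prototypes dominate the total: each one alone admits more than two billion ordered decorations, so even a modest number of five-component prototypes pushes the count into the trillions.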
We are interested in computationally more efficient methods to cheaply target hypothetical compounds that are on or near the convex hull. For example, we can take advantage of several already-developed methods that efficiently recommend compounds based on chemical similarity to other, known compounds. One such method, executed by Marques et al, recommends compounds starting from known compounds based on how often the relevant elemental substitution, e.g. Fe → Co, occurs in the ICSD between compounds of the same prototype. They used this approach to discover a staggering 18 479 new compounds on the convex hull, a major expansion which we have incorporated into the OQMD [19,20]. Other methods based on chemical trends include one developed by Hautier et al, which performs common ion substitutions to generate new ionic compounds [21], as well as a method developed by Fischer et al which exploits correlations between prototypes that occur across phase diagrams [22].

Machine learning accelerated materials discovery
Machine learning (ML) algorithms are designed to automatically extract new knowledge out of data. We can use ML to learn more about materials and to create models that can accelerate materials discovery. Among the first efforts to apply ML in materials informatics, Meredig et al [23] used a data-driven approach to learn the rules of chemistry from DFT calculations and make predictions for new compounds without requiring structural information. Their ML model predicted around 4500 unknown ternary compounds by screening 1.6 million compositions, which greatly increased the computational efficiency. To enable the fast development of ML-based models in materials discovery, Ward et al created a general approach and the associated software, the Materials Agnostic Platform for Informatics and Exploration (Magpie) [24], to compute attributes for new materials, which include both composition-dependent attributes of elemental properties and crystal structure-dependent attributes derived from Voronoi tessellation of crystal structures [25]. Magpie has been widely used to create materials ML models, with applications including the discovery of new compounds for photovoltaic applications [24], the design of bulk metallic glasses (BMG) [26], and the prediction of ternary mixed-anion semiconductors with tunable band gaps [27].
Recent advances in deep learning have also increased the predictive power of materials informatics. The crystal graph convolutional neural network (CGCNN) framework developed by Xie and Grossman [28] represents a material as a crystal graph that encodes both atomic information and bonding interactions between atoms, and can be trained to predict various materials properties through a convolutional neural network. An improved variant of the CGCNN model called iCGCNN was developed by Park and Wolverton [29], which augmented the original model by incorporating the Voronoi tessellation and explicit three-body correlations of neighboring constituent atoms in the crystal graphs. The iCGCNN was demonstrated to have a mean absolute error of just 30.5 meV atom⁻¹ for formation energies when trained on ∼200 000 OQMD compounds and tested on ∼230 000 OQMD compounds. The iCGCNN has already been used to discover hundreds of compounds in the ThCr2Si2 and AMM'Q3 families, many of which are highly chemically dissimilar to known ICSD compounds and have unique properties [29,30].
In addition, we are confronting the challenge of finding new compounds in currently unexplored regions of composition and structure space. For example, we have found new stable compounds in a previously unknown family of 'double' half Heusler compounds [31]. To find more such unexpected compounds, we can explore ML methods designed to make predictions for any arbitrary crystal structure, i.e. not limited to known prototypes. ML methods can be used to conduct 'materials forecasting', where we take a network of compounds on the convex hull and their tie-lines and extrapolate the network forward in time to find new compounds [32]. Another largely untapped but valuable region of composition space consists of the compositions that can be formed by taking compounds where one element sits on multiple sublattices and placing different elements on those sublattices. For example, Pal et al [33] have studied the AMM'Q3 Cmcm structure, in which the Q element (S, Se, or Te) occupies two distinct sublattices; thus, mixed chalcogenide compounds could be formed by occupying these two sites with different anions. The number of possible compositions in this space is combinatorially explosive, and ML methods may be necessary to efficiently sample them. Also, as countless solid solution compounds are known to exist on the sublattices of parent prototypes, one could search for potential solid solutions by screening through ordered versions of these compounds.

Statistical summary of current OQMD
In the middle of 2021, the total number of converged compounds in the OQMD surpassed one million, with 1 022 603 compounds in the most current public database at oqmd.org. To commemorate the one million material milestone, we created a charge-balanced compound, KMgAlWS6, to acknowledge the authors of the original OQMD paper [11] by translating their last names into elements: sulfur (S, James Saal), potassium (K, Scott Kirklin), aluminum (Al, Muratahan Aykol), magnesium (Mg, Bryce Meredig) and tungsten (W, Chris Wolverton).
Among all converged compounds in the OQMD, 37 624 are from the ICSD, and we are continuously adding and calculating compounds reported in experiments, including 521 compounds from the Powder Diffraction File that we recently solved using OQMD prototypes [18]. Within the OQMD, the convex hull method [34] is used to determine the thermodynamic stability of a compound: a convex hull distance of zero means the compound is stable, while a positive distance means it is unstable. It is common for multiple 'polymorphs' of a stable composition to be nearly identical in structure and energy. In such cases, we choose only one stable compound out of all polymorphs within 5 meV atom⁻¹ of the convex hull. If a polymorph is from the ICSD, that polymorph is chosen; otherwise, the polymorph with the most prevalent prototype is chosen. Following these criteria, we currently find that 49% of the ICSD compounds are stable and 87% are within 100 meV atom⁻¹ of the convex hull; the histogram of stability is shown in figure 3(a). Most of the unstable ICSD compounds are high temperature/pressure polymorphs, while others become unstable due to the introduction of more stable hypothetical compounds either at the same composition or as their competing phases. These hypothetical compounds in the OQMD are created by prototype decorations (details are discussed in section 2.1) and to date, they have increased the total number of stable compounds by ∼200% and the number of nearly-stable compounds (i.e. convex hull distance within 100 meV atom⁻¹) by ∼670% compared to the ICSD compounds (see figure 3(b)), bringing unprecedented possibilities in materials design and discovery. The majority of entries in the OQMD are ternary compounds, followed by quaternary and binary ones (see figure 3(c)), which reflects our strategy of starting from the simplest structures and gradually moving to more complex ones.
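As a minimal illustration of the convex hull criterion used in this section, the following sketch computes hull distances for a binary A-B system. The formation energies are invented for illustration, the function names are hypothetical, and this is not the qmpy implementation; multicomponent systems require a higher-dimensional hull construction.

```python
def lower_hull(points):
    """Lower convex hull of (x_B, formation energy) points, left to right."""
    # Keep only the lowest-energy entry at each composition.
    best = {}
    for x, e in points:
        best[x] = min(e, best.get(x, e))
    hull = []
    for p in sorted(best.items()):
        # Drop the last vertex while it lies on or above the line hull[-2] -> p.
        while len(hull) >= 2:
            (x1, e1), (x2, e2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - e1) - (e2 - e1) * (p[0] - x1)
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull


def hull_energy(hull, x):
    """Linearly interpolate the hull energy at composition x."""
    for (x1, e1), (x2, e2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            return e1 + (e2 - e1) * (x - x1) / (x2 - x1)
    raise ValueError("composition outside the hull range")


def hull_distance(points, x, energy):
    """Energy above the hull: zero means stable, positive means unstable."""
    return energy - hull_energy(lower_hull(points), x)


# Invented formation energies (eV/atom) for a hypothetical A-B system.
entries = [(0.0, 0.0), (0.25, -0.10), (0.5, -0.40), (0.75, -0.15), (1.0, 0.0)]
for x, e in entries:
    print(f"x_B = {x:.2f}: hull distance = {hull_distance(entries, x, e):.3f} eV/atom")
```

In this toy system only the elements and the compound at x_B = 0.5 lie on the hull; the other two entries sit above the tie-lines and would decompose into their neighbors on the hull.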
The number of binary compounds is smaller than that of ternary or quaternary ones owing to their more limited degrees of freedom. For the same reason, our high-throughput approach has discovered 2118 new stable binary compounds, while 19 322 ternary and 14 375 quaternary hypothetical compounds are predicted to be stable (see figure 3(d)). With the improvement in computing power, we are now able to calculate more complex multi-component compounds, and 2130 new stable quinary compounds have already been predicted in our ongoing calculations. Most of the hypothetical stable compounds in the OQMD have fewer than 20 atoms in their primitive cell, as shown in figure 3(e), but we are continuously expanding our database with compounds having more than 20 atoms. The band gaps of stable compounds in the OQMD are also summarized in figure 3(f). Our current research mainly focuses on conducting materials and semiconductors, and we have expanded the phase space of materials with band gaps of 0, (0, 1] and (1, 2] eV by ∼360%, ∼220% and ∼170%, respectively. As can be seen in figures 3(g) and (h), among all stable ICSD compounds, the three most frequent space groups are Pnma (62), P2_1/c (14) and I4/mmm (139), while the three most frequent space groups of hypothetical compounds are P-62m (189), Fm-3m (225), and C2/m (12). This deviation is mainly caused by the specific research focuses, and several major projects performed using the OQMD platform are summarized in the following section 3.2.

Major projects
The growth of the OQMD mostly depends on the project-based high-throughput DFT studies performed by members of the Wolverton Research Group. We herein summarize six projects categorized by two major materials discovery approaches: structure-based projects (perovskites and mixed-anions) and property-based projects (thermoelectrics, solar fuels, batteries, and high-strength alloys).

Perovskites
Perovskite oxides have been at the center of numerous computational and experimental studies for a variety of applications, exhibiting properties suitable for technologies ranging from gas sensors [35] to solid oxide fuel cells [36,37]. As such, a significant portion of the computational efforts through which the OQMD has grown over the years has been dedicated to populating the phase space with this structure type across compositions, considering not only several distortions of the ideal cubic phase but also mixing on both the A and B sites.
For simple ABO3 perovskites, with over 5000 calculations, Emery et al computed all possible permutations of metallic elements on both A and B sites in the cubic Pm-3m structure, as well as (for cases where the energy of the cubic phase was within 500 meV atom⁻¹ of the convex hull) in its most common distortions (orthorhombic Pnma, rhombohedral R3m and tetragonal P4/mmm), finding over 300 stable compounds [17]. For A2BB'O6 and AA'B2O6 mixed perovskites, He and Wolverton performed a wide-ranging study comprising over 35 000 calculations [38]. To extensively cover the more commonly reported B site ordering, 10 different distortions of A2BB'O6 perovskites were considered (P2_1/c, C2/m, P1, P-1, I4/m, I4/mmm, R3, R-3, R3m, Fm-3m), with permutations of Ca, Sr, Ba, and La (and, for the most common P2_1/c structure, Zn, Cd, Hg, and Pb) on the A site, and 50 metal elements on the B sites, totaling around 10 000 compositions, for over 500 of which a stable perovskite structure was identified. The more rarely observed A site ordering was then also investigated with over 3000 calculations of AA'B2O6 perovskites considering 2 different distortions (Cmmm and P4/mmm) and combining rare earth, alkaline earth and transition metals on the A sites and again 50 metal elements on the B site. In order to increase the completeness of the convex hull, and thus the reliability of stability results, the studies have also included a number of competing phases by decorating the most common non-perovskite structures observed in the ICSD for generic ABO3, A2BB'O6 and AA'B2O6 compositions.

Mixed anions
Materials with two or more types of anions (mixed-anion or heteroanionic materials) bring new opportunities in materials design, which could lead to new properties that are not easily accessible in single-anion (homoanionic) materials [39]. For example, functional mixed-anion compounds have various energy-related applications including thermoelectrics [40], battery cathodes [41], and hydrogen evolution photocatalysts [42]. However, mixed-anion materials known to date only account for a small portion of the total number of combinatorial possibilities with periodic table elements, raising the question of whether the scarcity of known mixed-anion materials is due to their being underexplored or to their simply being relatively uncommon. High-throughput DFT studies therefore have the potential to unveil the mystery of mixed-anion materials.
Various types of mixed-anion materials have been studied using the platform of the OQMD. Shen et al [43] performed DFT calculations on 1188 ternary oxypnictides and found 42 that are experimentally unknown but predicted to be thermodynamically stable, substantially expanding the number of stable oxypnictides. Amsler et al [27] utilized an ML approach combined with a high-throughput method to investigate ternary mixed-anion materials and predicted 21 new stable X4Y2Z compounds exhibiting band gaps suitable for energy applications. Quaternary oxychalcogenides are a well-known material family with promising applications including thermoelectrics, transparent conductors and solid-state electrolytes for Li/Na ion batteries. He et al [44] screened over 5000 oxychalcogenide compounds and found 129 hitherto unreported stable compounds, most of which are semiconductors. Given the large deficiency in the exploration of mixed-anion compounds, the OQMD is continuously working on the high-throughput calculations of these materials, hoping to provide more insights for both experimental and theoretical research in mixed-anion materials.

Thermoelectrics
Thermoelectric (TE) materials have been widely studied in the past decade for their capacity to directly convert heat into electricity, which has significant promise in energy-related applications. Thermoelectric performance (conversion efficiency) is characterized by the figure of merit $ZT = S^2 \sigma T / (\kappa_e + \kappa_l)$, where $S$ is the thermopower, $\sigma$ is the electrical conductivity, $T$ is the temperature and $\kappa_e$ and $\kappa_l$ are the electronic and lattice thermal conductivities, respectively. To increase the $ZT$ value of a material, we need to maximize its power factor (PF) $S^2 \sigma$ and minimize its thermal conductivity ($\kappa_e$ and $\kappa_l$) simultaneously.
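The figure of merit defined above is straightforward to evaluate numerically. The sketch below uses invented, order-of-magnitude input values in SI units (they do not describe any specific compound) to show how lowering the lattice thermal conductivity raises ZT at a fixed power factor; the function name is illustrative.

```python
def figure_of_merit(seebeck, sigma, temperature, kappa_e, kappa_l):
    """ZT = S^2 * sigma * T / (kappa_e + kappa_l), all quantities in SI units.

    seebeck: thermopower S in V/K
    sigma: electrical conductivity in S/m
    temperature: absolute temperature in K
    kappa_e, kappa_l: electronic and lattice thermal conductivity in W/(m K)
    """
    power_factor = seebeck**2 * sigma  # W/(m K^2)
    return power_factor * temperature / (kappa_e + kappa_l)


# Invented inputs for illustration only.
zt_reference = figure_of_merit(200e-6, 1.0e5, 600.0, kappa_e=0.8, kappa_l=1.2)
zt_low_kappa = figure_of_merit(200e-6, 1.0e5, 600.0, kappa_e=0.8, kappa_l=0.6)

print(f"ZT with kappa_l = 1.2 W/(m K): {zt_reference:.2f}")
print(f"ZT with kappa_l = 0.6 W/(m K): {zt_low_kappa:.2f}")
```

Because the electronic term is tied to the electrical conductivity, the lattice term is usually the quantity that can be reduced without degrading the power factor, which is why the screening studies in this section target low lattice thermal conductivity.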
The OQMD provides a means to perform high-throughput studies to search for hitherto unknown materials exhibiting low lattice thermal conductivities, which may be potential thermoelectric materials. Through high-throughput screening, He et al [45] discovered a new class of stable full Heusler compounds with ten valence electrons that have ultralow lattice thermal conductivity close to the theoretical minimum. Further investigation showed that the strong anharmonic rattling of the heavy noble metals in these materials accounts for the low lattice thermal conductivity and makes them excellent 'phonon-glass electron-crystal' materials. In another example, Pal et al [46] predicted 628 thermodynamically stable quaternary chalcogenides AMM'Q3, and showed many of these compounds to be potential TE materials by validating the presence of low lattice thermal conductivity using the Peierls-Boltzmann transport equation. The high-throughput DFT approach can also be used to search for new multicomponent bulk-nanostructured TE materials. In the work of Doak et al [47], a screening-and-sorting procedure identified 130 candidates for high-performance thermoelectrics from a search space of 29 700 five-element systems, greatly increasing the success rate in exploring new TE materials. Using a similar approach, Kocevski and Wolverton [48] screened for nanostructured two-phase Heusler TEs in the OQMD and predicted 29 pairs that have not been considered previously for TE applications.

Solar fuel materials
Solar thermochemical water splitting (STCH) offers a renewable alternative to fossil fuels by enabling the production of hydrogen through a carbon-free process that can theoretically harness the energy of the entire solar spectrum. STCH is a two-step cycle consisting of a reduction step, in which an oxide compound loses oxygen upon heating, and a water-splitting step, in which H2O is exposed to the reduced material, which re-oxidizes, producing hydrogen [49]. Numerous efforts are being directed to the search for metal oxides capable of increasing the currently limited reported efficiencies, seeking compounds capable of reduction at suitable temperatures with fast kinetics while maintaining their structural stability through temperature change and oxygen loss.
In this context, high-throughput DFT and materials databases are powerful tools to expedite materials search by allowing for screening based on properties of interest on a large scale. Over the last few years, multiple such screenings have been performed utilizing the OQMD. Focusing on structure types with reported attractive STCH properties such as the ability to withstand oxygen nonstoichiometry and ease of oxygen diffusion, the OQMD has been populated with thousands of decorations of such prototypes. For the hundreds of new stable compounds thereby identified, the enthalpy change in the reduction step was then quantified by computing the oxygen vacancy formation energy, and utilized as a criterion to further screen the compounds. The screening was based on thermodynamic analysis [50], which showed the range of reduction enthalpies that allows for both steps of the STCH reactions to be favorable. Applying this methodology to perovskite, pyrochlore and spinel structure types, hundreds of new candidates [17,51,52] have been identified in recent years, with two mixed perovskites being successfully experimentally synthesized and cycled and displaying favorable redox thermodynamics and high fuel production [53,54].

Batteries
To meet the increasing energy storage demands that accompany the expansion of the renewable energy and electric vehicle markets, extensive effort is being directed to increasing battery performance on multiple fronts. Leveraging the OQMD platform, numerous studies have been conducted both to accelerate the discovery of new anode, cathode, electrolyte and coating materials for traditional lithium-ion batteries (LIBs) and to aid in the development of batteries exploiting alternative chemistries.
To identify new anode materials with superior performance to the traditionally employed graphitic carbon, Kirklin and Wolverton, in one of the earliest applications of the OQMD, computed the lithiation reactions of transition metal silicides, stannides and phosphides and assessed their potential based on gravimetric and volumetric capacity, cell potential and volume expansion [15]. Examining Li-, Na- and Mg-based anodes, Snydacker et al addressed the degradation in cell performance brought about by their reactivity with the electrolyte by surveying the database for thermodynamically stable materials to serve as coatings, i.e. materials exhibiting chemical equilibrium with each anode type, then ranking the candidates according to their electronic insulation [55]. Addressing cathode degradation instead, Aykol et al screened the OQMD for various cathode coatings tailored to different battery chemistries, identifying the top 30 candidates for physical barrier, HF-barrier and HF-scavenger coatings by selecting thermodynamically and electrochemically stable materials and applying multi-objective optimization of filters relative to each functionality [56].
In the search for new cathode materials, Bhattacharya and Wolverton predicted several new candidates by surveying spinel oxides of the 3d transition metal series, computing their thermodynamic stability, cation site preference, and delithiation voltage [57]. Focusing on the Li2MO3 family, Kim et al identified a number of Li2MO3-Li2MO3 active/inactive electrode pairs, again computing their thermodynamic stability and delithiation voltage [58]. In a subsequent study, the authors then explored the possibility of embedding lithiated cobalt spinel in Li2MO3-based cathodes, searching the Li-Co-Mn-Ni-O space, and identified LiCo0.1875Ni0.8125O2 as a promising candidate [59].
In a more open-ended approach, Aykol et al explored the Li3X3Y2O12 garnet-type structure, identifying several stable compositions, then classified their potential applications as either anodes, cathodes, or electrolytes based on their computed electrochemical window [60]. Going

High-strength alloys
Strong structural materials have various technological and constructional applications in nearly all aspects of human life. Age hardening, one way to increase the yield strength and hardness of pure metals and alloys, is realized by forming precipitates from the supersaturated solid solution through quenching and aging. Many known age-hardenable alloy systems have been discovered and optimized by extensive experimental trial and error. However, it is not feasible to exhaustively search for new precipitate strengtheners using brute force experimentation due to the extremely large number of possible compositional combinations.
Kirklin et al [16] conducted a high-throughput computational study on ∼200 000 compounds to search for effective strengthening precipitates for three major alloy/precipitate systems, including FCC metals (with L1_2 X3Y precipitates), HCP metals (with D0_19 X3Y precipitates), and BCC metals (with L2_1 X2YZ precipitates). By using the OQMD, the thermodynamic stability of each compound was evaluated and the precipitate compounds were screened to be either in stable equilibrium with, or likely to form metastable precipitates in, the host matrix. Kirklin et al demonstrated the effectiveness of their approach by recovering most of the widely known precipitate strengthening systems through their screening metrics. In addition, they predicted 34 L1_2 precipitate compounds, 29 D0_19 precipitate compounds and 50 L2_1 precipitate compounds as promising precipitate strengtheners in several common host matrices (e.g. Al, Fe, Mg, Ni, Co).

Data accessibility
The OQMD comprises a SQL database, an application programming interface (API), and a web interface created using the open-source Django web framework. The function of the API in the OQMD is to populate the SQL database with new data generated via DFT and to transfer relevant data from the database to the web interface upon the users' request. The API for the OQMD, called qmpy, is under active development at https://github.com/wolverton-research-group/qmpy. Users may download the entire OQMD database from OQMD.org and, if desired, host it on their local machines. A locally hosted OQMD database can be accessed and modified using qmpy following the standard Django API protocols. Alternatively, users are able to interact with the database via the HTML-based web interface at OQMD.org to browse materials, create phase diagrams, and analyze the data generated from DFT calculations. The OQMD also supports RESTful querying of the database via OQMD.org, which enables data retrieval over basic HTTP connections in JSON format without requiring a web browser on the user side to parse the responses. The OQMD RESTful queries are to be made in accordance with the Open Databases Integration for Materials Design (OPTIMADE) protocols [62]. The OPTIMADE team has designed a universal API to make materials databases accessible and interoperable. The adherence to OPTIMADE supports the creation of an open and universal RESTful querying system for all materials databases. Currently, the OQMD.org interface provides data including material crystal structures, bandgap energy, formation energy, thermodynamic stability, electronic density of states, and phase diagrams, among others.
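A RESTful query of the kind described above might be assembled as in the following sketch. The base URL is an assumption and should be checked against OQMD.org; the `structures` endpoint, the `filter` and `page_limit` parameters, and the `elements` and `nelements` properties come from the OPTIMADE specification. The helper name is hypothetical, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import urlencode

# ASSUMPTION: base URL of the OQMD OPTIMADE endpoint; verify at OQMD.org.
OQMD_OPTIMADE_BASE = "https://oqmd.org/optimade"


def structures_query_url(elements, nelements=None, page_limit=10,
                         base=OQMD_OPTIMADE_BASE):
    """Build an OPTIMADE /structures query URL for the given chemical system."""
    quoted = ", ".join(f'"{el}"' for el in elements)
    filter_expr = f"elements HAS ALL {quoted}"
    if nelements is not None:
        filter_expr += f" AND nelements={nelements}"
    # urlencode percent-encodes the quotes, commas, and spaces in the filter.
    query = urlencode({"filter": filter_expr, "page_limit": page_limit})
    return f"{base}/structures?{query}"


url = structures_query_url(["Al", "Fe"], nelements=2)
print(url)
```

The resulting URL can be fetched with any HTTP client, for example `urllib.request.urlopen`, and the JSON body parsed with the standard `json` module. Because the filter syntax is shared across OPTIMADE providers, the same helper should work against another compliant database by changing only the `base` argument.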
The qmpy API is regularly updated to use stable Python and Django versions in the interest of network safety and data robustness. As of the writing of this article, the current version of qmpy (v1.4) uses Python 3.7 and Django v2.2. Recent upgrades to qmpy are backward-incompatible with previous versions of the MySQL DB files due to the drastic changes in dependent packages.

Open access to the public
From the very beginning of the OQMD, a central guiding philosophy was to ensure open access to all public data. From the first release of the database in 2013, users have had the option to download the entirety of the database for use under a highly permissive license (CC-BY 4.0). We believed that there was significant value in performing science on the entirety of the database (rather than on individual entries or small samples) and that there would be great benefit to the community in giving everyone that capability. This vision is strongly aligned with the notion of FAIR data principles, introduced one year later in 2014. The current success and proliferation of data-driven materials informatics methods can be partially attributed to the availability of these large, uniform, self-consistent HT-DFT datasets.

External use of OQMD
The openness of the OQMD has facilitated its use in advancing materials research across the world. In the years since its public release, the database contents have been leveraged by a large number of researchers across five continents for a variety of applications. Thanks to the richness of the phase space covered by the over one million compounds included, several computational studies have referred to the OQMD convex hull to select stable phases for applications such as photovoltaics [63], hydrogen production [64], and fuels for nuclear reactors [65]. Screening based on properties other than stability has also been widely employed in materials search for technologies ranging from batteries [66] to magnetic compounds [67]. Multiple properties included in the database have also been utilized for studies based on structure types [68-71], and specific phase spaces [72-74].
On an even larger scale, OQMD data has found widespread employment in the training and validation of newly developed ML models. Focusing on formation energy prediction, Krajewski et al employed the compositional and structural features introduced by Ward et al [24,25] and tested several neural network architectures, identifying the best-performing ones and developing an open-source tool with a user interface [75]. Including more generic structural information, Jain et al trained a representation learning feed-forward neural network on the 20 most common structure types in the OQMD, exploiting exclusively atomic number and crystallographic symmetry information [76], and Jørgensen et al developed a message passing neural network based only on local information about bonding and symmetry [77]. Approaching the problem in a structure-agnostic fashion, Jha et al developed a deep neural network model relying only on compositional attributes [78], and, in a later work, employed residual learning [79] to tackle the decrease in accuracy observed with increasing depth of the network architecture. Targeting the bandgap as well as the formation energy, Goodall et al once again took a structure-agnostic approach, taking only stoichiometric data as input and reformulating it as a weighted graph between elements [80]. Min et al, on the other hand, employed a gradient boosting regressor model and an active learning process to identify new ferroelectric materials [81], and Stanley et al utilized a kernel ridge regressor to identify new mixed halide perovskites for photovoltaic applications [82]. Focusing on structural properties, Williams et al [83] and Zheng et al [84] both employed convolutional neural networks to predict the lattice parameters of, respectively, cubic perovskite oxides and full Heusler compounds.
In very recent works, Dan et al [85] and Zhao et al [86] both pursued the goal of designing new chemically valid hypothetical materials by employing generative adversarial networks, and Hu et al developed a global optimization-based algorithm to reconstruct crystal structures from atomic contact maps [87].

FAIR data in OQMD
The acronym FAIR stands for four aspects of data: Findable, Accessible, Interoperable, and Reusable. The OQMD is committed to adhering to the FAIR data principles. Three noteworthy efforts toward implementing the FAIR principles are in place, namely persistent identifiers (PIDs), OPTIMADE RESTful API data transfer, and structured data for web indexing. The Handle persistent identifier registry has assigned the prefix https://hdl.handle.net/20.500.12856/ to the OQMD. Currently, every material entry in the OQMD has a PID URL associated with it in support of the Accessibility and Reusability of the data. The PID of a material is a permanent reference at the handle.net server, redirecting to the OQMD page of that material regardless of URL changes at the oqmd.org domain. Persistence of the PID also enables a unified reference to the original OQMD page when the OQMD database is hosted locally and modified according to external researchers' preferences. The RESTful querying of the OQMD based on the OPTIMADE API specification contributes to the effort to create a unified materials query system. A universal RESTful query system for all materials databases supports the Findability, Accessibility, and Interoperability aspects of the FAIR principles. A structured version of the materials data, in accordance with the vocabularies provided by schema.org, is included in the OQMD web pages for better indexing of relevant data by search engines, which in turn aids the Findability and Accessibility of the data.
Both Material Entry pages and Material Composition pages in the OQMD contain structured data, and are indexed by most of the popular web search engines.

Figure 4. Percentage of all compounds on the OQMD convex hull that are sourced from the ICSD, plotted against the date of calculation. The OQMD started with ICSD compounds (100%), but ICSD compounds now make up just 32% of the OQMD convex hull. For this plot, all ICSD compounds are assumed to have been calculated prior to 2014, although many were calculated later than that.
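As a rough illustration of the two programmatic access routes described in this section, the following Python sketch only builds the relevant URLs and performs no network requests. The Handle prefix is the one quoted in the text. The OPTIMADE base URL, the PID suffix format, and the helper names `pid_url` and `optimade_structures_url` are assumptions made for this example and should be checked against the current OQMD documentation. The `structures` endpoint and the `filter` and `page_limit` query parameters come from the OPTIMADE specification.

```python
from urllib.parse import urlencode, urljoin

# Handle prefix assigned to the OQMD (quoted in the text).
PID_PREFIX = "https://hdl.handle.net/20.500.12856/"
# ASSUMPTION: base URL of the OQMD OPTIMADE endpoint; verify before use.
OPTIMADE_BASE = "https://oqmd.org/optimade/"


def pid_url(suffix: str) -> str:
    """Join the OQMD Handle prefix with an entry suffix (suffix format is assumed)."""
    return PID_PREFIX + suffix.lstrip("/")


def optimade_structures_url(filter_expr: str, page_limit: int = 10) -> str:
    """Build a GET URL for the OPTIMADE 'structures' endpoint with a filter."""
    query = urlencode({"filter": filter_expr, "page_limit": page_limit})
    return urljoin(OPTIMADE_BASE, "structures") + "?" + query


if __name__ == "__main__":
    # "example-entry" is a hypothetical suffix, not a real OQMD identifier.
    print(pid_url("example-entry"))
    # OPTIMADE filter: binary compounds containing both Al and Fe.
    print(optimade_structures_url('elements HAS ALL "Al","Fe" AND nelements=2'))
```

Because the filter string follows the OPTIMADE grammar rather than anything OQMD-specific, the same query should in principle be portable to other OPTIMADE-compliant databases by changing only the base URL.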

Challenges for the future of HT-DFT and OQMD
The OQMD started out as a DFT database of primarily experimental compounds from the ICSD. However, recent high-throughput DFT projects, including exhaustive compound searches with common prototypes, application-driven searches, and efficient ML-based searches, have transformed the OQMD into a database of both experimental and hypothetical compounds. Since we first began with only ICSD compounds, the number of compounds that are stable, i.e. on the convex hull, has more than doubled. In figure 4, we show how the OQMD convex hull has evolved over the years. Experimental compounds now make up just 32% of all compounds on the OQMD convex hull. This suggests that experimental databases like the ICSD may only be scratching the surface of what is possible for materials synthesis; there may be tens of thousands (or more) of compounds that have yet to be synthesized. In addition, thousands of ICSD compounds have not been incorporated in the OQMD (e.g. solid solutions or disordered structures), which could affect the number of newly predicted compounds. Going forward, we aim to explore ways to efficiently invest our computational resources to rapidly expand the OQMD to new materials. The reproducibility of high-throughput DFT data is also important to our exploration. A recent study [88] thoroughly compared the properties reported by three major HT-DFT databases, including the OQMD. It concluded that the reproducibility of the DFT data among the three databases is acceptable when compared with the differences between DFT and experiment.
However, a fair question to ask is how valuable the calculations of new materials will be as the database keeps growing beyond millions of compounds, especially with ML helping prioritize candidates that are more likely to be stable [89][90][91]. As the chemical spaces become more populated with stable compounds, finding new ones becomes more difficult due to the energetic competition with the existing convex hull. In addition, higher-order compounds face more competition as they compete against lower-order ones as well (e.g. a new ternary compound in the A-B-C chemistry competes not only with the other ternaries, but also with the binary compounds in the A-B, B-C and A-C chemistries). As a result, the formation energies of higher-order stable compounds are heavily skewed towards lower energies [89]. Overall, this observation is in conflict with the notion that many new materials are awaiting discovery in complex, high-order chemistries [89]. Nevertheless, we should look beyond the narrow definition of a 'material' that we have for structures in DFT databases and, for example, acknowledge that in higher-order systems, configurational entropy effects will become a dominant factor in stability assessment; these effects are missing from the purely enthalpic convex-hull construction in the current database.
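The energetic competition described above can be made concrete with a minimal sketch for a binary A-B system. The code builds the lower convex hull of formation energy versus composition and reports how far a candidate lies above or below it. All names (`lower_hull`, `hull_energy`, `energy_above_hull`) and all numbers are hypothetical and chosen only for illustration; this is not the OQMD implementation. It assumes formation energies are referenced to the elements, so both end-members sit at zero.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (atomic fraction of B, formation energy in eV/atom)


def lower_hull(points: List[Point]) -> List[Point]:
    """Return the lower convex hull from left to right (Andrew's monotone chain)."""
    hull: List[Point] = []
    for p in sorted(set(points)):
        # Drop the last vertex while it lies on or above the chord hull[-2] -> p.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull


def hull_energy(hull: List[Point], x: float) -> float:
    """Linearly interpolate the hull energy at composition x."""
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            return y1 + (y2 - y1) * (x - x1) / (x2 - x1)
    raise ValueError("composition lies outside the hull range")


def energy_above_hull(known: List[Point], candidate: Point) -> float:
    """Candidate energy minus hull energy; a negative value means a new stable phase."""
    return candidate[1] - hull_energy(lower_hull(known), candidate[0])


if __name__ == "__main__":
    # Hypothetical toy system: elements A and B plus one stable compound AB.
    known = [(0.0, 0.0), (0.5, -0.40), (1.0, 0.0)]
    # A3B candidate with a negative formation energy that is still unstable.
    print(energy_above_hull(known, (0.25, -0.15)))
    # A3B candidate low enough in energy to join the hull.
    print(energy_above_hull(known, (0.25, -0.25)))
```

In this toy case the first candidate has a negative formation energy but still sits above the tie-line between A and AB, so it would decompose into those two phases. For a ternary A-B-C candidate the same construction applies in two composition dimensions, with the binary phases on each edge contributing to the hull.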
In effect, this indicates that the next phase of high-throughput DFT databases, including the OQMD, should include more demanding calculations: those requiring larger, more complex structures, such as the modeling of entropy effects in disordered compounds (e.g. via workflows for special quasirandom structures [92]) or the modeling of point or line defects, surfaces, interfaces, and thermal or ionic transport, as well as the incorporation of computationally expensive functionals such as HSE. There are, in fact, numerous early examples of such efforts [93][94][95][96]. In other words, we anticipate that the utility of the OQMD and other HT-DFT databases in the near future will be enhanced by going beyond perfect crystal structures and starting to model more complex or 'realistic' aspects of materials in an HT fashion.
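To give a sense of scale for the configurational entropy effects mentioned above, the following sketch evaluates the ideal-mixing estimate S = -k_B Σ x_i ln x_i per mixing site and the corresponding free-energy term -TS. The ideal-mixing approximation and the function names are assumptions introduced here for illustration; they are not a method used in the OQMD, and real disordered compounds would need explicit modeling such as the special quasirandom structure workflows cited in the text.

```python
import math
from typing import Sequence

K_B_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def ideal_mixing_entropy(fractions: Sequence[float]) -> float:
    """Ideal configurational entropy per mixing site, in eV/K."""
    if abs(sum(fractions) - 1.0) > 1e-8:
        raise ValueError("site fractions must sum to 1")
    # Species with zero fraction contribute nothing (x ln x -> 0 as x -> 0).
    return -K_B_EV_PER_K * sum(x * math.log(x) for x in fractions if x > 0)


def entropy_stabilization(fractions: Sequence[float], temperature_k: float) -> float:
    """Free-energy contribution -T*S per mixing site, in eV (zero or negative)."""
    return -temperature_k * ideal_mixing_entropy(fractions)


if __name__ == "__main__":
    # Fully ordered site: no configurational entropy.
    print(entropy_stabilization([1.0], 1000.0))
    # Equimolar mixing of five species on one site at 1000 K.
    print(entropy_stabilization([0.2] * 5, 1000.0))
```

For equimolar mixing of N species the estimate reduces to -k_B T ln N, which is about -0.14 eV per mixing site for five species at 1000 K. A purely enthalpic, zero-temperature convex hull contains no such term.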

Conclusion
To celebrate the one million compounds milestone in the OQMD, we wrote this perspective to document the motivation for developing the OQMD, summarize the current status of the database, and raise some open-ended questions for the future of both high-throughput DFT calculations and materials databases. As an ever-growing materials database, the OQMD has been expanded for years through project-based HT-DFT studies and is now taking advantage of ML approaches to accelerate materials discovery. Since the birth of the OQMD, ensuring the openness of its materials data to the public has been a main objective of the database, and we are constantly working with other materials databases to reach a universal querying system in support of the FAIR principles. Researchers from all over the world have been intensively utilizing the OQMD to predict new materials for a variety of applications and to train state-of-the-art ML models for materials discovery. The OQMD has served, and will continue to serve, as a pioneering, fully open platform for advancing materials research for the whole community.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://OQMD.org.