Thin Solid Films

Data-driven approaches are becoming increasingly valuable for modern science, and they are making their way into industrial research and development (R&D). Supervised machine learning of statistical models can utilize databases of materials parameters to speed up the exploration of candidate materials for experimental synthesis and characterization. In this paper we introduce the HADB database, which contains properties of industrially relevant, chemically disordered hard-coating alloys, focusing on their thermodynamic, elastic and mechanical properties. We present the technical implementation of the database infrastructure, including support for browsing, querying, retrieval, and programmatic access through the OPTIMADE API, to make the data findable, accessible, interoperable, and reusable (FAIR). Finally, we demonstrate the usefulness of the database by training a graph-based machine learning (ML) model to predict elastic properties of hard-coating alloys. The ML model is shown to predict bulk and shear moduli for out-of-sample alloys with a mean absolute error of less than 6 GPa.


Introduction
Databases containing theoretically derived materials parameters, in combination with data-driven methods, provide an increasingly important foundation for many approaches in materials science. This development is often referred to as the ''fourth paradigm'' of materials design [1]. Increasing computational power and the rise of machine learning (ML) methodologies motivate the rapid growth of existing databases together with the development of new databases [2][3][4]. Some of the most well-known examples are the publicly available Materials Project [5], NOMAD [6], AFLOW [7], and Materials Cloud [8] databases, all of which contain hundreds of thousands of records derived from first-principles density functional theory (DFT) [9,10] calculations using high-throughput infrastructure [11][12][13]. The properties available in these databases range from total energies to piezoelectric constants and X-ray absorption spectra. For a list of free and commercial databases the reader is referred to Ref. [2]. A common use case for the data is to feed it into an ML workflow to train models that predict properties of materials that have not yet been observed and to discover materials with desired properties; see, e.g., Refs. [14,15].
However, this potential speed-up of materials exploration via ML is not yet fully utilized in predicting and analyzing properties of alloys for hard-coating applications. Recent studies have demonstrated the reliability of first-principles calculations for investigating thermodynamic, vibrational, and elastic properties of disordered hard-coating alloys. The thermodynamic driving force of spinodal decomposition in multicomponent quasi-ternary coatings, such as the (Ti-Al-X)N alloys with X = Cr, Y, Zr, Nb, Hf, Ta, has been predicted within the coherent potential approximation [16][17][18] or using the special quasirandom structure approach [19]. The thermal expansion of quaternary nitrides has been predicted by a quasi-harmonic Debye model with a piece-wise linear interpolation of the ab-initio elastic constants calculated at 0 K only for binary materials and 50-50 ternary alloys [20]. High-temperature thermodynamics and elastic constants of Ti$_{1-x}$Al$_x$N alloys have been investigated using different molecular dynamics approaches [21][22][23]. It has been underlined that machine-learned interatomic potentials can efficiently and accurately predict the elastic moduli of Ti$_{0.5}$Al$_{0.5}$N [24]. Further successes in using machine-learned interatomic potentials to investigate high-temperature properties of materials were the prediction of the Elinvar effect in pure bcc titanium [25] and the exploration of thermodynamic and elastic properties of the high-entropy alloy TiZrHfTa$_x$ [26]. Still, the complexity of the computational tasks has prevented high-throughput data generation for hard-coating alloys, and we are not aware of databases that contain information on their properties.
To advance the impact of materials informatics for this industrially relevant class of materials, we have developed the Hard-coating Alloys DataBase (HADB). In this paper, we describe how the database is designed and demonstrate the utility of the data it contains. We are interested in storing information that can be computed from first principles and that is related to materials performance in cutting applications, e.g., the elastic stiffness tensor and polycrystalline elastic properties (such as bulk and shear moduli). We first describe the technical aspects of the database infrastructure and then validate its usefulness: we test how well the machine learning approach developed in Ref. [27] matches a set of DFT reference calculations that are not part of the training set.

Technical details
There are several reasons to curate research data in a well-organized database. A database can track information about how the data was produced, and this metadata ensures the reproducibility of the data. Because the database can be fully reconstructed from the raw data, e.g., the output files of electronic structure calculations, we also have a clear chain of provenance going from raw data to the database. Storing the generated data in a database also guards against data loss and makes the data available in accordance with the FAIR principles, i.e., findable, accessible, interoperable, and reusable. While the database itself is the main ingredient, we still need easy and efficient ways for users to access and interact with it. Two of the most common ways to accomplish this are a web-based user interface, which allows users to explore the data and download files using an internet browser, and a REpresentational State Transfer Application Programming Interface (REST-API) [28]. The REST-API builds on top of the ubiquitous HTTP protocol and allows automated interaction with the database from programs, scripts, and directly from a command line interface [29]. Both of these ways of accessing the database are described in more detail below.

Data and database
The data in the database have been generated by performing DFT calculations and then processing the output files that the DFT code produces. These DFT calculations are managed using the high-throughput toolkit (httk) [12]. Both data creation and database management are handled by httk, which is an advantage over having to use multiple separate tools to manage the whole workflow. With httk it becomes easy to create the input files for the calculations, send files to and receive files from a computing cluster, execute complex workflows, monitor the state of calculations, and fix broken calculations, among other things. Currently httk has extensive support for the Vienna Ab Initio Simulation Package (VASP) [30][31][32]. Support for other computational codes can be added.
The DFT output files are processed by httk and transformed into a representation that is ready to be stored in an SQLite database [33]. In order to hide the intricacies of storing data in an SQLite database, httk provides many different data types that make the task easy. Data can be stored in and retrieved from the database in terms of these convenient data types. The data is automatically serialized and stored in one or more efficiently indexed SQLite tables, with no need for manual implementation. Fig. 1 shows a flowchart overview of the architecture around the database.
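To make the idea of serializing a result into an indexed SQLite table concrete, the following is a minimal sketch that uses only Python's standard `sqlite3` and `json` modules. It is not the httk API: the table name `elastic_result`, its columns, and the helper functions are hypothetical and chosen purely for illustration.

```python
import json
import sqlite3


def create_db(path=":memory:"):
    """Open a database and create a hypothetical, indexed results table.

    The schema is an illustrative assumption; httk defines its own tables.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS elastic_result ("
        " id INTEGER PRIMARY KEY,"
        " formula TEXT NOT NULL,"
        " bulk_modulus REAL,"
        " shear_modulus REAL,"
        " stiffness_json TEXT)"
    )
    # An index on the column used for lookups keeps queries efficient.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_elastic_formula"
        " ON elastic_result (formula)"
    )
    return conn


def store_result(conn, formula, bulk_modulus, shear_modulus, stiffness):
    """Serialize one result (stiffness as nested lists) and insert it."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT INTO elastic_result"
            " (formula, bulk_modulus, shear_modulus, stiffness_json)"
            " VALUES (?, ?, ?, ?)",
            (formula, bulk_modulus, shear_modulus, json.dumps(stiffness)),
        )
    return cursor.lastrowid


def load_results(conn, formula):
    """Retrieve and deserialize all results stored for a given formula."""
    rows = conn.execute(
        "SELECT bulk_modulus, shear_modulus, stiffness_json"
        " FROM elastic_result WHERE formula = ?",
        (formula,),
    ).fetchall()
    return [
        {"bulk_modulus": b, "shear_modulus": g, "stiffness": json.loads(s)}
        for b, g, s in rows
    ]
```

In this sketch the caller only deals with Python values (strings, floats, nested lists); the serialization to and from table rows happens inside the helpers, which mirrors the role the text ascribes to the httk data types.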

Example of web interface
A very common access mechanism for open databases is a web-based user interface provided via the website of the database. Some examples of publicly available materials databases with web-based user interfaces include AFLOW [7], Materials Project [5], Open Quantum Materials Database [34], and Open Materials Database [35]. The common goal of this kind of user interface is to make exploration of the database as simple as possible. The basic features that these websites typically include are: (i) search using keywords, (ii) an information page listing the properties of each entry in the database, (iii) visualization of structures, and (iv) download of calculation input/output files and the property data of one or more entries. Our web interface implements all of the basic features listed above, as well as some more advanced features:
• users can visualize the relationship between any two properties for a chosen set of entries as a 2D scatter plot (''Property map'' in Fig. 2),
• users can predict properties of materials using ML models.
The information listed for an entry varies based on the calculations that the entry derives from. The information that is usually available includes, among others, a description of the structure (formula, symmetry information, and visualization of the structure), computational details (e.g., number of k-points and energy cutoff for the basis set), basic information about the computational method used (e.g., capabilities and limitations), a description of the methodologies used (including references to relevant publications), simulation temperature, total energy, elastic constants, and mechanical properties (e.g., bulk modulus, shear modulus, and Poisson's ratio).
The web interface is built using httk, which has a Python-based template engine for constructing, on the fly, the HTML pages that the user sees. The basic structure of the website is defined by a set of template files, which contain special tags that can refer to Python variables. These Python variables hold the values needed to construct the information the user will see on the HTML page, for example data retrieved from the SQLite database for the particular system(s) the user is searching for. When the user navigates to an HTML page, httk converts the template file to an HTML page by replacing all tags with the values stored in the Python variables that the tags refer to. An example of the web interface can be seen in Fig. 2.
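The tag-replacement step described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the httk template engine: the `{{ name }}` tag syntax and the `render` function are assumptions made for this example, and the real template syntax may differ.

```python
import html
import re

# Hypothetical tag syntax "{{ name }}"; the actual httk syntax may differ.
TAG_PATTERN = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, variables):
    """Replace every tag in the template with the matching variable value.

    Values are HTML-escaped so that data from the database cannot inject
    markup into the generated page.
    """
    def substitute(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"template refers to undefined variable {name!r}")
        return html.escape(str(variables[name]))

    return TAG_PATTERN.sub(substitute, template)
```

A template such as `<h1>{{ formula }}</h1>` rendered with `{"formula": "TiN"}` would then yield `<h1>TiN</h1>`. Raising an error for undefined variables, rather than silently leaving the tag in place, is a design choice of this sketch.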

REST-API
Our database also implements a REST-API that is OPTIMADE [36,37] compliant. OPTIMADE is an open API that allows different materials databases to be queried in a standard way. Before OPTIMADE, databases that provided API access did so in a database-specific way, which meant that in order to search multiple databases one had to express the query differently for each database. For most relational databases, interaction with the underlying database is already governed by the Structured Query Language (SQL) standard. With OPTIMADE, queries for materials data now have a standard of their own as well.
In order to use the REST-API, a URL of a special format is constructed. This URL is then accessed in the browser or using command line tools. The OPTIMADE server processes the contents of the URL string and returns the search result in the form of a JSON document. The following are examples of valid OPTIMADE query URLs:
• <optimade_implementation_url>/v1/calculations?filter=_httk_total_energy>-900
• <optimade_implementation_url>/v1/structures?filter=nsites=8
The ''v1'' refers to the major version number of the OPTIMADE API the user wants to use. The ''calculations'' and ''structures'' keywords are endpoints defined in the OPTIMADE specification. These endpoints define the general category of information that we are searching for. The two URLs above also include a filter: the first query returns only calculations where the total energy is above a certain limit, and the second returns only structures that have 8 atomic sites. There are many more options in the OPTIMADE standard. For more information the reader is referred to the OPTIMADE website at Ref. [38] and the OPTIMADE publication [36].
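Query URLs of the form shown above can also be assembled and fetched from a script. The sketch below uses only the Python standard library; the function names are hypothetical, and the base URL must be supplied by the user (the placeholder `<optimade_implementation_url>` in the text is not filled in here). The filter string is percent-encoded, which is the safe way to pass characters such as `>` and `=` inside a query parameter.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def optimade_url(base_url, endpoint, filter_expr, version="v1"):
    """Build an OPTIMADE query URL with a percent-encoded filter."""
    base = base_url.rstrip("/")
    return f"{base}/{version}/{endpoint}?filter={quote(filter_expr)}"


def fetch_entries(url, timeout=30):
    """Fetch a query URL and return the list under the JSON "data" key.

    OPTIMADE responses are JSON documents whose matching entries are
    listed under "data"; this helper ignores pagination for brevity.
    """
    with urlopen(url, timeout=timeout) as response:
        document = json.load(response)
    return document.get("data", [])
```

For example, `optimade_url(base, "structures", "nsites=8")` produces the second query from the list above, with `=` inside the filter encoded as `%3D`; passing the result to `fetch_entries` would return the matching entries of the first response page.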

Viability of HADB - validation of predicting elastic moduli of alloys
One highly promising application of databases is to use the data to train ML models that predict properties, and then to use these models to identify materials with a desired combination of properties. In our previous work [27], we trained the crystal graph convolutional neural network (CGCNN) [39] using data for ordered crystals from Materials Project and showed that the model can predict the bulk modulus $B$ and shear modulus $G$ of disordered hard-coating alloys with good accuracy. The disordered alloys on which we performed the validation were ternary (and binary) nitrides $M_{1-x}M'_x$N, where $M, M' \in \{\mathrm{Al}, \mathrm{Ti}, \mathrm{Zr}, \mathrm{Hf}\}$ and the concentration $x = 0, 1/4, 1/2, 3/4, 1$. A common ML operation is to re-train models after more data has been added to the database in order to make the models more robust and accurate. In Ref. [27], only data from the Materials Project was used for the training set. The corresponding model in that work is referred to as ''Original'' in the following. In the present work, we re-train the ML model by extending the Original training set with the data used for validation in Ref. [27]. The model created in this way is referred to as ''Re-trained'' in the following. That is, the training set of the Re-trained model contains Materials Project data for ordered compounds and data on the disordered ternary nitrides $M_{1-x}M'_x$N defined above. To validate the ''Re-trained'' model, we calculate more first-principles data for disordered ternary nitrides with compositions not included in the calculations reported in Ref. [27]. The calculations were performed using the VASP DFT code, and the methodology and the computational details are the same as those reported in Ref. [27]. The calculated elastic properties, including the elastic constants $C_{ij}$, bulk modulus $B$, shear modulus $G$, Young's modulus $E$, and Poisson's ratio $\nu$, for the alloys calculated here are listed in Table 1. We notice that many of the alloys with initial B4 structures relaxed into the Bk structure (in terms of the Strukturbericht designation). Only the Al-rich B4 alloys remain in the B4 structure.
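As an illustration of how polycrystalline moduli relate to single-crystal elastic constants, the sketch below evaluates the standard Voigt, Reuss, and Hill averages for a cubic crystal, together with the isotropic relations for Young's modulus and Poisson's ratio. Both the restriction to cubic symmetry and the use of the Hill average are assumptions made for this example; the text does not state which averaging scheme was used, and the B4 and Bk structures are hexagonal and would require the general expressions. The function name is hypothetical.

```python
def cubic_polycrystalline_moduli(c11, c12, c44):
    """Voigt-Reuss-Hill averages for a cubic crystal (same units as input).

    Assumes cubic symmetry, so only C11, C12 and C44 are independent.
    """
    # For cubic symmetry the Voigt and Reuss bulk moduli coincide.
    bulk = (c11 + 2.0 * c12) / 3.0
    # Voigt (uniform strain) and Reuss (uniform stress) shear moduli.
    shear_voigt = (c11 - c12 + 3.0 * c44) / 5.0
    shear_reuss = 5.0 * (c11 - c12) * c44 / (4.0 * c44 + 3.0 * (c11 - c12))
    # Hill average: arithmetic mean of the two bounds.
    shear = 0.5 * (shear_voigt + shear_reuss)
    # Isotropic relations between B, G, E and Poisson's ratio.
    young = 9.0 * bulk * shear / (3.0 * bulk + shear)
    poisson = (3.0 * bulk - 2.0 * shear) / (2.0 * (3.0 * bulk + shear))
    return {"B": bulk, "G": shear, "E": young, "nu": poisson}
```

For an elastically isotropic input, i.e., $C_{11} - C_{12} = 2C_{44}$, the Voigt and Reuss shear moduli coincide and the function returns $G = C_{44}$, which is a convenient sanity check.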
The performance of the ''Original'' and ''Re-trained'' ML models is compared in Fig. 3 and in Table 1. Fig. 3 is a parity plot for the bulk and shear moduli showing how much the ML prediction deviates from the calculated DFT value. It is evident that the re-trained model has a noticeably higher accuracy for the test set than the original model. This higher accuracy is expected, since the training set of the re-trained model contains nitride alloys that are similar to the alloys in the test set. The thin black lines in Fig. 3 show that the ''corrections'' made by the re-trained model are consistent, meaning that data points that were underpredicted by the original model are shifted upwards and overpredicted points are shifted downwards. Table 2 gives a summary of the statistical errors of the results in Fig. 3 in terms of mean absolute error (MAE) and mean absolute relative error (MARE).

Fig. 3. Parity plot comparison of the calculated DFT and predicted machine learning CGCNN bulk ($B$) and shear ($G$) moduli. The blue symbols indicate values calculated with the original ML model of Ref. [27] and the green symbols are calculated using the re-trained ML model. The thin black lines show the differences between the predictions made by the Original and the Re-trained ML models.

Table 1
The calculated elastic properties of the alloys. $B$ is the bulk modulus, $G$ the shear modulus, $E$ Young's modulus, and $\nu$ Poisson's ratio. The SB column refers to the Strukturbericht designation.

[...] using data for ordered compounds (which is abundantly available from existing databases, e.g., Materials Project) yields a model that has decent extrapolation capabilities for disordered alloys. However, by extending the training set with just a limited amount of data on properties of disordered alloys, one greatly improves the accuracy of the model. Our results underline that as the HADB database grows, the quality of our machine learning models should improve further.
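For reference, the two error measures summarized in Table 2 can be computed as in the following sketch. The MAE is standard; for the MARE we assume that each absolute error is divided by the magnitude of the DFT reference value and that the result is expressed in percent, which matches the units given in Table 2 but is not spelled out in the text. The function names are hypothetical.

```python
def mean_absolute_error(reference, predicted):
    """MAE in the units of the inputs (e.g., GPa)."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - r) for r, p in zip(reference, predicted))
    return total / len(reference)


def mean_absolute_relative_error(reference, predicted):
    """MARE in percent, relative to the reference values (assumed convention)."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - r) / abs(r) for r, p in zip(reference, predicted))
    return 100.0 * total / len(reference)
```

Here `reference` would hold the DFT values and `predicted` the corresponding ML predictions for the same alloys, in the same order.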

Summary
We have developed HADB, the Hard-coating Alloys DataBase, a database of calculated properties of chemically disordered alloys of interest for hard-coating applications, a class of systems not commonly available through other materials databases. We have also implemented access via a browser-based web interface as well as a REST-API conforming to the OPTIMADE API standard. The HADB database is intended to accelerate the exploration and design of hard-coating materials. The usefulness of the database is illustrated by training a machine learning model that predicts the elastic properties of several hard-coating alloys that were not part of the database and that the machine learning model had not seen before.

Declaration of competing interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Igor Abrikosov reports financial support was provided by Sweden's Innovation Agency. Rickard Armiento reports financial support was provided by Vetenskapsrådet. Davide Sangiovanni reports financial support was provided by Vetenskapsrådet.

Fig. 1. A flowchart illustrating how data is stored in the database and the different ways in which users can interact with the data.

Fig. 2. Example of the database web interface.

Table 2
The mean absolute error MAE (in GPa) and mean absolute relative error MARE (in %) of the bulk modulus $B$ and the shear modulus $G$ of the ML predictions for the test set.