Machine learning-based prediction of phases in high-entropy alloys: A data article

A systematic framework for choosing the most determinant combination of predictor features and solving the multiclass phase classification problem associated with high-entropy alloys (HEAs) was recently proposed [1]. The data associated with that research paper, titled “Machine learning-based prediction of phases in high-entropy alloys”, are presented in this data article. This dataset is a systematic documentation and comprehensive survey of experimentally reported HEA microstructures. It contains experimentally observed microstructural phases and metallurgy-specific features as introduced and reported in peer-reviewed research articles. The dataset is provided with this article as a supplementary file. Because the dataset was collected from experimental peer-reviewed articles, these data can provide insights into the microstructural characteristics of HEAs, can be used to improve the optimization of HEA phases, and can play an important role in machine learning, materials informatics, and other fields.


Specifications
Subject: Materials Science
Specific subject area: High Entropy Alloys (HEA)
Type of data: Text file, tables and figures
How data were acquired: The dataset was gathered from experimental observations reported in peer-reviewed research articles. Each of the 1362 dataset entries is therefore an experimental observation as reported in the literature.
Data format: Raw, tabulated and plotted
Parameters for data collection: The following parameters were collected: (i) the identification of the multi-component alloy; (ii) the number of elements making up the alloy system; (iii) the valence electron concentration; (iv) the atomic size difference; (v) the difference in Pauling electronegativities; (vi) the enthalpy of mixing of the alloy system; (vii) the entropy of mixing of the multi-component alloy system; (viii) how the alloy was synthesised and its processing conditions (post-process working and heat treatment); (ix) the experimentally observed and reported phases; and (x) the bibliographical references for each entry.
Description of data collection: Secondary data (i.e. composition-specific features, alloy processing and post-processing parameters, and the resulting phases) were collected. Some typical empirical HEA design parameters were calculated using well-known approaches. Data were processed using Excel and R, a language and environment for statistical computing, for visualization and data analysis.

Value of the Data
• This dataset documents synthesis routes, processing conditions (post-process working and heat treatment), and the resulting microstructural observations, which can be valuable to researchers in Materials Science when designing experiments.
• This dataset specifically contains experimentally reported HEA microstructures, which provide enough observations to train and test machine learning and deep learning algorithms.
• Different machine learning and materials informatics methods can be applied to this dataset in order to extract insights and trends not immediately available from individual studies, thereby advancing the real-world applications of these alloys.

Data Description
The data presented in this article are related to the research articles [1,2]. The data are presented in the supplementary data file.

The dataset
This dataset is a systematic documentation and comprehensive survey of experimentally reported HEA microstructures. The dataset was constructed from microstructural observations reported in peer-reviewed experimental HEA research articles; it builds upon datasets previously published by Miracle et al. [3], Couzinié et al. [4], and Ye et al. [5]. The dataset presents metallurgy-specific features and the experimentally observed microstructural phases.
The dataset, provided with this article as supplementary material, has seventeen columns and 1422 entries.
• Columns 1 and 2 correspond to the identification of the dataset entry and the multi-component alloy system (Alloy_ID and Alloy, respectively). Composition-specific features can further be developed from the Alloy specification.
• Columns 3 to 27 correspond to the elemental compositions of the multi-component alloy, while Column 28 corresponds to the number of elements making up the multi-component alloy system (Num_of_Elem).
• Columns 29 to 37 correspond to some typical empirical HEA design parameters [6,7], such as the density estimate (Density_calc), the enthalpy of mixing of a multi-component alloy system (dH_mix), the entropy of mixing of a multi-component alloy system (dS_mix), the melting temperature estimate (T_m), the valence electron concentration (VEC), the atomic size difference (δ), and the difference in Pauling electronegativities (χ). While this article is limited to these typical parameters for succinctness, others can also be developed from the Alloy specifications [7-13].
• Columns 38 to 45 correspond to metallurgy-informed alloy processing and post-processing parameters indicating how the alloy was synthesised (Sythesis_Route); whether it was hot- or cold-worked (Hot-Cold_Working); whether it underwent homogenization, and at what temperature (Homogenization_Temp) and for how long (Homogenization_Time); whether it underwent annealing, and at what temperature (Annealing_Temp) and for how long (Annealing_Time); and whether it underwent quenching (Quenching).
Table 1 gives a summarised description of these features.
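As an illustration of how composition-specific features might be developed from the Alloy specification, the following is a minimal sketch that converts a flat formula string into atomic fractions. It assumes that alloys are written as element symbols followed by optional molar coefficients (e.g. "Al0.5CoCrFeNi"), with no parentheses or whitespace; the notation in the supplementary file may differ, and the function name `parse_alloy` is hypothetical.

```python
import re
from typing import Dict

# An element symbol (one uppercase letter, optional lowercase letter) followed
# by an optional numeric coefficient, e.g. "Al0.5" or "Co".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_alloy(formula: str) -> Dict[str, float]:
    """Return atomic fractions (summing to 1) for a flat alloy formula.

    Assumption: a missing coefficient means a molar ratio of 1.
    Parentheses and nested groups are not supported in this sketch.
    """
    amounts: Dict[str, float] = {}
    for symbol, coeff in _TOKEN.findall(formula):
        amounts[symbol] = amounts.get(symbol, 0.0) + (float(coeff) if coeff else 1.0)
    if not amounts:
        raise ValueError(f"no element symbols found in {formula!r}")
    total = sum(amounts.values())
    return {element: amount / total for element, amount in amounts.items()}


if __name__ == "__main__":
    fractions = parse_alloy("Al0.5CoCrFeNi")
    print(len(fractions), fractions)  # number of elements and their fractions
```

The number of keys in the returned dictionary corresponds to a Num_of_Elem-style feature, and the fractions can feed the empirical design parameters described in Table 1.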

Experimental Design and Methods
The dataset builds upon datasets previously published by Miracle et al. [3], Couzinié et al. [4], and Ye et al. [5]. It is constructed from HEA microstructural observations reported in experimental peer-reviewed research articles.
Selected HEAs reported in the literature through the end of August 2020 make up the dataset presented in this article. The as-constructed dataset has at least 1362 multicomponent alloys. After removing entries missing some data and eliminating glassy, amorphous, and hexagonal close-packed phases, the as-constructed dataset is reduced to 1362 multicomponent alloys. The supplementary materials accompanying this article contain the full list of alloys and references.
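The cleaning step described above (dropping incomplete entries and excluding glassy, amorphous, and hexagonal close-packed phases) can be sketched as follows. This is only an illustration under stated assumptions: the excluded-phase labels, the choice of required fields, and the function name are hypothetical and are not taken from the dataset or from the authors' Excel/R workflow.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]

# Hypothetical labels for the excluded phases; the strings used in the
# actual dataset may differ.
EXCLUDED_PHASES = {"glassy", "amorphous", "hcp"}

# Hypothetical choice of fields that must be present for an entry to be kept.
REQUIRED_FIELDS = ("Alloy", "Microstructure")


def drop_incomplete_and_excluded(records: List[Record]) -> List[Record]:
    """Keep entries with all required fields and a non-excluded phase."""
    kept: List[Record] = []
    for record in records:
        # Skip entries with a missing or empty required field.
        if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
            continue
        # Skip entries whose phase label is in the excluded set.
        if record["Microstructure"].strip().lower() in EXCLUDED_PHASES:
            continue
        kept.append(record)
    return kept
```

Applied to a list of row dictionaries (for example, rows read with `csv.DictReader`), the function returns only the rows that would remain after this kind of filtering.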
The equations used to estimate each of the selected empirical design parameters are summarised in Table 1. Fig. 1 shows the distribution of phase classes in the dataset (the atomic size difference, δ, as a function of VEC) after Refs. [7,14]. The visualization also shows that solid solution phases are more sensitive to the empirical parameters than intermetallic (IM) phases [1,2].

Table 1. Descriptions of the empirical and metallurgy-specific features. The References column cites ML-based studies that attempted to predict HEA phases therefrom.

| Symbol | Description of Feature | References |
| --- | --- | --- |
| Num_of_Elem | Number of elements in a multi-component alloy system. | |
| δ | A parameter describing the atomic size mismatch or difference in a multi-component alloy system, $\delta = \sqrt{\sum_{i} c_i \left(1 - r_i/\bar{r}\right)^2}$ (often expressed as a percentage), where $c_i$ and $r_i$ are the atomic percentage and atomic radius of the $i$th component, respectively, and $\bar{r} = \sum_{i} c_i r_i$ is the average atomic radius of the components of the alloy. | [7-13] |
| Δχ | Pauling electronegativity mismatch for a multi-component alloy system, $\Delta\chi = \sqrt{\sum_{i} c_i \left(\chi_i - \bar{\chi}\right)^2}$, where $c_i$ and $\chi_i$ are the atomic percentage and Pauling electronegativity of the $i$th component, respectively, and $\bar{\chi} = \sum_{i} c_i \chi_i$ is the mean electronegativity of the multi-component alloy system. | [7-13] |
| VEC | The valence electron concentration of a multi-component alloy system, calculated on the basis of the rule of mixtures, $\mathrm{VEC} = \sum_{i} c_i \,\mathrm{VEC}_i$, where $c_i$ and $\mathrm{VEC}_i$ are the atomic percentage and the valence electron concentration of the $i$th component, respectively. | [7-13] |
| ΔS_mix | The entropy of mixing of a multi-component alloy system, calculated as $\Delta S_{mix} = -R \sum_{i} c_i \ln c_i$, where $R$ is the universal gas constant and $c_i$ is the atomic percentage of the $i$th component. | |
| Synthesis_Route | A categorical feature indicating whether the alloy was synthesised via vacuum melting (AC), powder metallurgy (PM), or otherwise. | |
| Hot-Cold_Working | A categorical feature indicating whether the alloy has been subjected to a cold/hot working treatment (nan; CW = cold worked; HW = hot worked; HIP = hot-isostatically pressed). | |
| Homog_Temp | A feature indicating the temperature at which the alloy was subjected to a homogenization treatment (in °C). | |
| Homog_Time | A feature indicating the duration for which the alloy was subjected to a homogenization treatment (in minutes). | |
| Annealing_Temp | A feature indicating the temperature at which the alloy was subjected to an annealing treatment (in °C). | |
| Annealing_Time | A feature indicating the duration for which the alloy was subjected to an annealing treatment (in minutes). | |
| Quench_Proc | A categorical feature indicating whether the alloy has been subjected to a quenching heat treatment. | |
| Microstructure | Experimentally observed microstructure(s), namely BCC_SS (body-centered cubic solid solutions), FCC_SS (face-centered cubic single solid solutions), FCC + BCC_SS (dual-phase solid solutions), and Im (intermetallic, Laves, martensitic, and other phases, excluding glassy or amorphous and hexagonal close-packed, HCP). | |
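To make the empirical features in Table 1 concrete, the following is a minimal sketch of how δ, the electronegativity mismatch, VEC, and the entropy of mixing might be computed from atomic fractions. It uses the commonly cited definitions of these parameters, which are assumed here to match those used for the dataset. The function names are hypothetical, the inputs are assumed to be atomic fractions summing to 1 (not percentages), and the elemental property tables must be supplied by the user; the radii in the usage example are illustrative values only.

```python
import math
from typing import Dict

R_GAS = 8.314  # universal gas constant, J/(mol K)

Fractions = Dict[str, float]  # element symbol -> atomic fraction (sums to 1)


def rule_of_mixtures(fractions: Fractions, prop: Dict[str, float]) -> float:
    """Composition-weighted mean of an elemental property (e.g. VEC)."""
    return sum(c * prop[el] for el, c in fractions.items())


def entropy_of_mixing(fractions: Fractions) -> float:
    """Ideal configurational entropy, -R * sum(c_i ln c_i), in J/(mol K)."""
    return -R_GAS * sum(c * math.log(c) for c in fractions.values() if c > 0)


def size_mismatch(fractions: Fractions, radii: Dict[str, float]) -> float:
    """delta = sqrt(sum c_i (1 - r_i / r_bar)^2), as a fraction (not %)."""
    r_bar = rule_of_mixtures(fractions, radii)
    return math.sqrt(sum(c * (1 - radii[el] / r_bar) ** 2
                         for el, c in fractions.items()))


def electronegativity_mismatch(fractions: Fractions,
                               chi: Dict[str, float]) -> float:
    """sqrt(sum c_i (chi_i - chi_bar)^2) on the Pauling scale."""
    chi_bar = rule_of_mixtures(fractions, chi)
    return math.sqrt(sum(c * (chi[el] - chi_bar) ** 2
                         for el, c in fractions.items()))


if __name__ == "__main__":
    alloy = {"Co": 0.25, "Cr": 0.25, "Fe": 0.25, "Ni": 0.25}
    vec_table = {"Co": 9, "Cr": 6, "Fe": 8, "Ni": 10}
    # Illustrative metallic radii in angstroms; substitute a vetted table.
    radii = {"Co": 1.251, "Cr": 1.249, "Fe": 1.241, "Ni": 1.246}
    print("VEC   =", rule_of_mixtures(alloy, vec_table))
    print("dS    =", entropy_of_mixing(alloy))
    print("delta =", size_mismatch(alloy, radii))
```

Multiplying the result of `size_mismatch` by 100 gives δ as a percentage, the form in which it is often reported.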

Ethics Statement
None.

Data Availability
Dataset for High-Entropy Alloys Phases (Reference data) (Mendeley Data).

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.