Statistical data set for first-principles calculations of stacking fault energies in an AlNbTaTiV high entropy alloy

High-entropy alloys (HEAs), a new class of engineering alloy, are characterized by high concentrations of multiple main elements. These alloys have revealed a vast and largely unexplored compositional space that holds substantial promise for the discovery of new and interesting alloys and properties. In this data article, calculated data and applied inferential statistics are given for six structures related to the calculation of stacking fault energy in a refractory AlNbTaTiV BCC HEA. Global populations of 120 atomic permutations of a special quasirandom structure are calculated for four of the six structures, and a complete statistical inference analysis is performed. Partial sample distributions are created for two of the six structures, and the trends and statistical parameters of the unknown global populations are predicted. The dataset refers to the research article “Stacking fault energies on the {112} planes of an AlNbTaTiV BCC high-entropy alloy from first-principles calculations, analyzed with inferential statistics” by Strother and Hargather [1].


Value of the Data
• The data are useful because they demonstrate a method for generating a database of materials properties based on statistical inference, with predictive error bars.
• Researchers in materials science and engineering who work on computational alloy design or ICME alloy design, or who are interested in database generation, will benefit from these data.
• The data can be reused to predict different sets of error bars and confidence intervals by using a different percentage of the global population. The method presented here can be applied to any other system for alloy design.
• The method and data presented in this work demonstrate a technique for rapid development of properties of high-entropy alloys.

Data Description
Six types of structures are investigated in the present work in a BCC AlNbTaTiV high-entropy alloy. This HEA is selected for its equiatomic composition, its solid-solution BCC structure, the presence of non-BCC elements, and its non-magnetic elements. A defect-free structure is investigated first. The five faulted structures studied are listed in Table 1 below and are described in detail in the research article that accompanies this data article [1]. Global populations that include the ground state energies of full sets of 120 atomic permutations are calculated for the NT S, T, and T S 1 faults. Partial populations of 30 and 20 ground state energy calculations are calculated for the T S 2 and T S 3 faults, respectively.

Table 1. The BCC (112) structures with their names [1] and the number of atomic permutations calculated in the present work.

Name          Number of atomic permutations
Defect-free   120
NT S          120
T S 1         120
T S 2         30
T S 3         20
T             120

Raw data: Ground state energy calculations
Table 2 contains the raw data for the statistical analysis performed in the present work. It lists the 120 ground state energies, one per atomic permutation, for each of the Defect-free, NT S, T S 1, and T structures in the AlNbTaTiV system, together with 30 ground state energies for the T S 2 fault and 20 ground state energies for the T S 3 fault. Results are listed in no particular order for any of the structures.

Analyzed data: Data sampling
Using the raw data given in Table 2, five types of analyzed data are presented for each of the structures listed in Table 1. The data in the following figures are obtained by randomly selecting the ground state energies of n atomic permutations, without replacement, from a given structure's population. Specific values relating to inferential statistics, defined below as (a)-(e), are calculated from this random selection of the data in Table 2. For the defect-free (Fig. 1) and T S 1 (Fig. 4) structures, some of the panels appear in the original research article that accompanies this data article [1] and are not duplicated here.
(a) shows the sample mean, $\mu_s$, given in Eq. 3, as a function of the sample size, n, for an AlNbTaTiV SQS cell with no defects. As each new atomic permutation is randomly drawn, its ground state energy, $E_0$, is added to the sample and the mean is recalculated. Each dot on the figure represents the value of $\mu_s$, the cumulative sample mean of $E_0$, for each value of n.
(b) shows the standard deviation of the sample, $\sigma_s$, as given in Eq. 4, as n increases for a given structure. The ground state energy of each atomic permutation was randomly selected from the population without replacement, and the standard deviation was recalculated for each n as each new $E_0$ value was added to the sample set.

Table 2. Raw data for the statistical inference analysis performed in the present work. Ground state energies of the various atomic permutations for the Defect-free, NT S, T, T S 1, T S 2, and T S 3 structures in the AlNbTaTiV system, provided in eV/atom.

The statistical analysis for the defect-free AlNbTaTiV BCC cell is given in Fig. 1 below and in Fig. 4 of the research article accompanying this work [1]. Ground state energy calculations were performed for 120 atomic permutations of this cell. In [1], Fig. 4 gives (a) the sample mean, (b) the standard deviation, (c) the standard sampling error, and (d) the sampling distribution of the mean. Below, Fig. 1 shows the Q-Q plot of the distribution, which gives a visual indication of the sample's normality. The numerical values used to create this plot can be found in the Supplementary Material that accompanies this article.
The statistical analysis for the faulted NT S AlNbTaTiV BCC cell is shown in Fig. 2. Ground state energy calculations were performed for 120 atomic permutations of this cell. Fig. 2(a) shows the mean cell energy, $E_0$, as the sample size, n, increases from 3 to 120. Fig. 2(b) shows the standard deviation of the sample as the sample size increases from 3 to 120. Fig. 2(c) shows the standard sampling error as the sample size increases from 3 to 120, which corresponds to the expected error bar for a given sample size, n. Fig. 2(d) shows the sampling distribution of the mean for n = 20 when 1,000,000 sample sets were taken. Finally, Fig. 2(e) shows the Q-Q plot of the distribution, which gives a visual indication of the sample's normality. The numerical values used to create these plots can be found in the Supplementary Material that accompanies this article.
The statistical analysis for the faulted T AlNbTaTiV BCC cell is shown in Fig. 3. Ground state energy calculations were performed for 120 atomic permutations of this cell. Fig. 3(a) shows the mean cell energy, $E_0$, as the sample size, n, increases from 3 to 120. Fig. 3(b) shows the standard deviation of the sample as the sample size increases from 3 to 120. Fig. 3(c) shows the standard sampling error as the sample size increases from 3 to 120, which corresponds to the expected error bar for a given sample size, n. Fig. 3(d) shows the sampling distribution of the mean for n = 20 when 1,000,000 sample sets were taken. Finally, Fig. 3(e) shows the Q-Q plot of the distribution, which gives a visual indication of the sample's normality.
The statistical analysis for the faulted T S 1 AlNbTaTiV BCC cell is given in Fig. 4 below and in Fig. 5 of the research article accompanying this work [1]. Ground state energy calculations were performed for 120 atomic permutations of this cell. In [1], Fig. 5 gives (a) the sample mean, (b) the standard sampling error, and (c) the sampling distribution of the mean. Below, Fig. 4 gives (a) the standard deviation of the sample as the sample size increases from 3 to 120, and (b) the Q-Q plot of the distribution, which gives a visual indication of the sample's normality. The numerical values used to create these plots can be found in the Supplementary Material that accompanies this article.
The statistical analysis for the faulted T S 2 AlNbTaTiV BCC cell is shown in Fig. 5. Ground state energy calculations were performed for 30 atomic permutations of this cell and used to predict the properties of a global population set that would include 120 atomic permutations. Fig. 5(a) shows the mean cell energy, $E_0$, as the sample size, n, increases from 3 to 30. Fig. 5(b) shows the standard deviation of the sample as the sample size increases from 3 to 30. Fig. 5(c) shows the standard sampling error as the sample size increases from 3 to 30, which corresponds to the expected error bar for a given sample size, n. Fig. 5(d) shows the partial sampling distribution of the mean for n = 20, taken with replacement, when 1,000,000 sample sets were taken. Finally, Fig. 5(e) shows the Q-Q plot of the partial distribution, which gives a visual indication of the sample's normality. The numerical values used to create these plots can be found in the Supplementary Material that accompanies this article.
The statistical analysis for the faulted T S 3 AlNbTaTiV BCC cell is shown in Fig. 6. Ground state energy calculations were performed for 20 atomic permutations of this cell and used to predict the properties of a global population set that would include 120 atomic permutations. Fig. 6(a) shows the mean cell energy, $E_0$, as the sample size, n, increases from 3 to 20. Fig. 6(b) shows the standard deviation of the sample as the sample size increases from 3 to 20. Fig. 6(c) shows the standard sampling error as the sample size increases from 3 to 20, which corresponds to the expected error bar for a given sample size, n. Fig. 6(d) shows the partial sampling distribution of the mean for n = 20, taken with replacement, when 1,000,000 sample sets were taken. Finally, Fig. 6(e) shows the Q-Q plot of the partial distribution, which gives a visual indication of the sample's normality. The numerical values used to create these plots can be found in the Supplementary Material that accompanies this article.
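The cumulative quantities plotted in panels (a) and (b) of the figures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `running_statistics` and the array `fake_energies` are hypothetical, and the energies are synthetic stand-ins, not the values from Table 2.

```python
import numpy as np

def running_statistics(energies, n_min=3, seed=0):
    """Cumulative sample mean and standard deviation as energies are
    drawn one at a time, without replacement, from a population."""
    rng = np.random.default_rng(seed)
    # A random permutation is equivalent to drawing every value once
    # without replacement; the first n entries form the sample of size n.
    drawn = rng.permutation(np.asarray(energies, dtype=float))
    rows = []
    for n in range(n_min, drawn.size + 1):
        sample = drawn[:n]
        # ddof=1 puts n - 1 in the denominator of the sample variance.
        rows.append((n, sample.mean(), sample.std(ddof=1)))
    return rows

# Hypothetical energies in eV/atom; these are NOT the values from Table 2.
fake_energies = np.random.default_rng(1).normal(-7.0, 0.002, size=120)
rows = running_statistics(fake_energies)
for n, mean, std in rows[:3]:
    print(n, mean, std)
```

Each tuple in `rows` corresponds to one dot in a panel (a) or (b) plot. Once n reaches the population size, the running mean coincides with the mean of the full set.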

Skewness and kurtosis
As a quantitative check that the sampling distributions are normally distributed, and therefore that the n values selected from Table 2 are sufficiently large for the plots presented in Figs. 1-6, it is common to look at the central moments of the distribution and compare them to those of a normal distribution. These moments are a measure of how the data are distributed around the mean. Table 3 shows these parameters for the full sampling distributions of the mean with n = 20 for the defect-free, NT S, T, and T S 1 structures. T S 2 and T S 3 are not included because full population sets were not calculated. As can be seen in the table, the skewness and kurtosis values are all close to the values expected of a normal distribution, 0 and 3, respectively. Values within 0.1-0.2 of the ideal values are common for most normally distributed data [2].

Density functional theory calculations
The Vienna Ab initio Simulation Package (VASP) was used for all DFT calculations [3,4]. Projector augmented wave (PAW) pseudopotentials with the generalized gradient approximation (GGA) exchange-correlation functional of Perdew, Burke, and Ernzerhof were used for all calculations [5-8]. All cells were converged to at least 0.1 meV/atom during relaxation. The cell shape and volume were constrained to prevent de-shearing of the faults during relaxation. Since the stacking fault structure differs from a defect-free structure by a shear displacement, the cell will de-shear during a full relaxation so as to produce a defect-free structure and minimize the energy. In some cases full relaxation was used; these cases are indicated in the text. A gamma-centered k-mesh and reciprocal space projectors were used for all calculations [9]. Each k-mesh was at least 6 × 6 × 6. The fast Fourier transform mesh was set to contain all reciprocal vectors up to twice the largest wave vector of the basis set. This mesh prevents wrap-around errors during the fast Fourier transform [9]. A plane wave cut-off energy of 380 eV, 1.3 times the largest ENMAX value, was used for all calculations. The electronic minimization algorithm was set to the blocked Davidson iterative scheme [9]. The partial occupancy of orbitals was smeared with the first-order Methfessel-Paxton method for relaxations and with the tetrahedron method with Blöchl corrections for accurate total energy calculations [9].
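As a rough illustration, the settings described above might map onto VASP INCAR tags as in the following Python sketch. The tag names are standard VASP tags, but the mapping itself is an assumption rather than the authors' actual input files. In particular, the `IBRION`, `NSW`, and `SIGMA` values and the helper `render_incar` are hypothetical choices made only to complete the example.

```python
# Hypothetical mapping of the described settings onto VASP INCAR tags.
COMMON = {
    "PREC": "Accurate",  # FFT mesh large enough to avoid wrap-around errors
    "ENCUT": 380,        # plane wave cut-off energy in eV
    "ALGO": "Normal",    # blocked Davidson iteration scheme
    "LREAL": ".FALSE.",  # projectors evaluated in reciprocal space
}

# Relaxation with fixed cell shape and volume (ISIF = 2 moves ions only).
RELAX = {
    **COMMON,
    "IBRION": 2,   # assumption: conjugate-gradient ionic relaxation
    "NSW": 100,    # assumption: maximum number of ionic steps
    "ISIF": 2,
    "ISMEAR": 1,   # first-order Methfessel-Paxton smearing
    "SIGMA": 0.2,  # assumption: smearing width in eV
}

# Static run for accurate total energies.
STATIC = {
    **COMMON,
    "NSW": 0,
    "ISMEAR": -5,  # tetrahedron method with Bloechl corrections
}

def render_incar(tags):
    """Render a tag dictionary as INCAR-style 'KEY = value' lines."""
    return "\n".join(f"{key} = {value}" for key, value in tags.items()) + "\n"

print(render_incar(RELAX))
print(render_incar(STATIC))
```

Keeping the shared tags in one dictionary makes it harder for the relaxation and static runs to drift apart in cut-off energy or precision.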

Special quasi-random structures for stacking faults
Special quasi-random structures (SQS) are required for HEA calculations in DFT. These structures approximate a random solid solution with a given number of elements and their concentrations for a specific structural cell. An SQS cell must be generated for each structure, such as the different stacking fault cells listed in Table 1 .

Creation of special quasi-random structures
The Monte Carlo SQS (mcsqs) generation code from the Alloy Theoretic Automated Toolkit (ATAT) is used for all SQS cells generated in the present work [10]. mcsqs uses Monte Carlo simulations to search for an SQS that best represents a perfectly random structure [10]. The code randomly labels the atoms in a given cell structure and calculates the correlation parameters. It then compares the results to the theoretical correlation parameters for a perfectly random structure [10]. Based on this comparison, the code identifies the best SQS found during the simulation using an objective function [10].
The input cell structure for mcsqs is generally a unit cell that defines the structure parameters of the overall crystal [10] . The stacking fault cells need a unit cell that represents the stacking fault and the surrounding matrix. The simplest method of producing a unit cell for a stacking fault is to create a one atom wide stacking cell along the plane normal. Each atom plus the periodic boundaries fully represent a complete plane, the stacking plane in this case. Their relative positions along the cell axis represent the stacking sequence. For details on the faulted cells, see Fig. 1 and Fig. 2 in the research article that accompanies this data article [1] . During the SQS generation process, this unit cell is expanded parallel to the stacking plane to form the full supercell. Fig. 7 shows a standard set of input and output cells from an SQS generation. Fig. 7 (a) shows a unit cell defining the structure and location of atoms. Fig. 7 (b) shows a full SQS cell with increased size and explicitly labeled atomic species.
The final inputs for the mcsqs code are the correlation parameters to consider [10] . These parameters contain both the order and distance. The order refers to the number of atoms that are considered at a time. For instance, a second order correlation indicates how likely two atoms of specific species are to be near each other in the structure. The distance refers to the number of nearest neighbor shells that correlations are considered across. These settings were specified to consider at least second and third order correlations over at least three nearest neighbor shells for all SQS generations.
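The idea behind the search can be illustrated with a toy model. The sketch below is not mcsqs and does not reproduce its objective function: it assumes a one-dimensional ring of sites, considers only second order (pair) correlations over three neighbor shells, and uses a greedy random-swap search. All function names are hypothetical.

```python
import itertools
import random

def pair_fractions(labels, shell):
    """Fraction of each unordered species pair among sites `shell` apart on a ring."""
    n = len(labels)
    counts = {}
    for i in range(n):
        pair = tuple(sorted((labels[i], labels[(i + shell) % n])))
        counts[pair] = counts.get(pair, 0) + 1
    return {pair: count / n for pair, count in counts.items()}

def objective(labels, species, shells):
    """Squared deviation of the pair fractions from an ideal random equiatomic alloy."""
    species = sorted(species)
    k = len(species)
    total = 0.0
    for shell in shells:
        found = pair_fractions(labels, shell)
        for a, b in itertools.combinations_with_replacement(species, 2):
            target = 1 / k**2 if a == b else 2 / k**2
            total += (found.get((a, b), 0.0) - target) ** 2
    return total

def toy_sqs_search(species, sites_per_species, shells=(1, 2, 3), steps=2000, seed=0):
    """Greedy random-swap search: keep a swap only if the objective does not increase."""
    rng = random.Random(seed)
    labels = [s for s in species for _ in range(sites_per_species)]
    rng.shuffle(labels)
    best = objective(labels, species, shells)
    for _ in range(steps):
        i, j = rng.sample(range(len(labels)), 2)
        labels[i], labels[j] = labels[j], labels[i]
        trial = objective(labels, species, shells)
        if trial <= best:
            best = trial
        else:
            labels[i], labels[j] = labels[j], labels[i]  # reject: undo the swap
    return labels, best

SPECIES = ["Al", "Nb", "Ta", "Ti", "V"]
labels, score = toy_sqs_search(SPECIES, sites_per_species=4)
print(labels, score)
```

Swapping two sites keeps the composition fixed, so every candidate remains equiatomic. A real SQS search works on the three-dimensional cell and includes higher order clusters.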

Averaging of special quasirandom structure calculations
There is an error associated with using an SQS to represent a perfectly random solid solution [10]. A metric is therefore needed for the quality of an SQS cell, defined by how well it approximates a random structure. Such a metric can be developed by looking at the method used to generate an SQS. During generation, the cell is created to closely approximate the correlation parameters of a random alloy [10]. Fig. 7(b) shows an example SQS cell with the atomic types marked by colors. A perfect SQS would be invariant with respect to atomic ordering. In other words, all atoms labeled "green" could be relabeled "purple" and all "purple" atoms relabeled "green" without affecting the overall structure or its calculated ground-state energy. The quality of an SQS cell can be measured by calculating the energy associated with every permutation of atomic ordering and calculating the standard deviation of the energies [11]. This method provides a metric for how far the calculated ground state energy of the SQS can be expected to be from the average.
Calculations averaged over different permutations of atomic assignments provide two useful insights into the system: an increase in the accuracy of the calculated energy, and an insight into the sensitivity of the structure to variations in assignment. The first result has the benefit of increasing the fidelity of the calculations as a whole. The usefulness of the second result is more nuanced. For pure structure calculations, the standard deviation, or relative sensitivity, is simply a measure of the SQS quality [11]. For structures with a defect, this sensitivity is a superposition of both the SQS quality and the sensitivity of the defect to its surrounding atomic species [11]. This superposition makes it impossible to determine the percentage of the standard deviation that is attributed to each effect. Therefore, even if the SQS itself is perfect, there will still be variations in the calculated energy due to the presence of a defect [11].
The calculations that required SQS cells were run with all possible permutations of atomic assignment. For an HEA of n equiatomic elements, there are n! possible permutations. A five-element HEA has n! = 120 permutations. For each structure, n! identical calculations were set up and each was assigned a unique cell from the list of possible atomic permutations. Afterwards, the n! energy values were collected, and the arithmetic mean and standard deviation were calculated from the set. Finally, the average energy value was used for further calculations, such as calculating SFE, and the standard deviation was stored to indicate the variation and approximate accuracy of the calculation.
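The enumeration of atomic permutations can be sketched as follows. The ten-site `cell` below is a hypothetical stand-in for an SQS cell, and `relabelled_cells` is an illustrative helper, not part of any code used in the present work.

```python
from itertools import permutations

SPECIES = ("Al", "Nb", "Ta", "Ti", "V")

def relabelled_cells(site_species):
    """Yield every relabelling of a cell obtained by permuting the species names."""
    for perm in permutations(SPECIES):
        mapping = dict(zip(SPECIES, perm))
        yield [mapping[s] for s in site_species]

# Hypothetical ten-site equiatomic cell; the site order stands for fixed positions.
cell = ["Al", "Nb", "Ta", "Ti", "V", "V", "Ti", "Ta", "Nb", "Al"]
cells = list(relabelled_cells(cell))
print(len(cells))  # 5! = 120 relabelled cells
```

The atomic positions never change; only the species assigned to each group of sites is permuted, so each of the 120 cells would be submitted as a separate calculation.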

Inferential statistics
The purpose of inferential statistics is to make inferences about a complete data set by examining a partial data set [2] . The primary usage of inferential statistics is an estimation of the variance of a sample parameter when compared to the known global parameter.
Inferential statistics are performed by sampling from a global population of N independent values, $x_i$, with i being an index. In the present work, the $x_i$ are the ground state energy values given in Table 2 for each of the structures. Each sample contains an independent and random selection of n values from the global population. A total of m sample sets are then taken and used to produce a relative frequency distribution for a statistical parameter, such as the mean or standard deviation. In this sampling distribution, each sample is used to produce one value. For example, a sampling distribution of the means is produced by creating the frequency diagram from the m sample means. This sampling procedure produces an effect on the data described by the central limit theorem [2]. The central limit theorem states that as the sample size increases, the sampling distribution of the mean approaches a normal distribution. The sampling distribution has the same mean as the original population and a variance of $\sigma^2/n$, where $\sigma^2$ is the population variance and n is the sample size [2]. Effectively, as n increases, the sample means cluster more tightly around the population mean, and their frequency distribution takes the shape of a normal distribution with a variance that shrinks as 1/n. The properties of a normal distribution can be attributed to the sampling distribution regardless of the shape of the global distribution [2].
There is an important requirement for this statistical analysis: the value of n must be sufficiently large. The main reason is that the average sample must reasonably represent the original population. A single value can vary drastically from the mean and contains no information on the population variance. However, as more values are added, the sample mean and variance become estimators for the total population mean and variance. As n becomes large enough, the central limit theorem ensures that the sampling distribution becomes normal. Therefore, a normally distributed sampling distribution can be used as evidence that the value of n was sufficient.
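The central limit theorem behavior described above can be checked numerically. The following sketch assumes a hypothetical, deliberately non-normal population of 120 values (exponentially distributed, not taken from Table 2) and samples with replacement. The function name `sampling_distribution_of_mean` is illustrative.

```python
import numpy as np

def sampling_distribution_of_mean(population, n, m, seed=0):
    """Return m sample means, each from n values drawn with replacement."""
    rng = np.random.default_rng(seed)
    values = np.asarray(population, dtype=float)
    samples = rng.choice(values, size=(m, n), replace=True)
    return samples.mean(axis=1)

# Hypothetical skewed population; NOT the energies from Table 2.
population = np.random.default_rng(42).exponential(scale=1.0, size=120)
means = sampling_distribution_of_mean(population, n=20, m=100_000)

# The mean of the sample means should approach the population mean,
# and their variance should approach the population variance over n.
print(means.mean(), population.mean())
print(means.var(), population.var() / 20)
```

A histogram of `means` would be the analogue of the panel (d) plots, and it should look close to normal even though the population itself is strongly skewed.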
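In this data article, the sampling distribution of the mean is used to estimate the global mean and the expected variance of the sample means about the global mean. For SFE calculations in HEA systems, the desired cell ground state energy value is taken to be the mean over all possible atomic permutations. The important parameters of the global population are the population size, the mean, and the variance. The population mean, $\mu_p$, is calculated in Eq. 1:

$$\mu_p = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{1}$$

where N and $x_i$ are as described above. The population variance, $\sigma_p^2$, is calculated as shown in Eq. 2 [2]:

$$\sigma_p^2 = \frac{1}{N} \sum_{i=1}^{N} \left(x_i - \mu_p\right)^2 \tag{2}$$

A sample of n permutations is selected from the calculated cell energies, $x_i$, given in Table 2. The sample mean, $\mu_s$, and the sample standard deviation, $\sigma_s$, are:

$$\mu_s = \frac{1}{n} \sum_{i=1}^{n} x_i \tag{3}$$

$$\sigma_s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \mu_s\right)^2} \tag{4}$$

Note that there is an n − 1 in the denominator instead of the familiar n [2]. The n − 1 in the denominator makes the sample variance, $s^2 = \sigma_s^2$, an unbiased estimator due to the degrees of freedom [2]. Substituting $s^2$ for $\sigma_p^2$ in Eq. 8 yields the following relation for the variance of the sample means:

$$\sigma_m^2 \approx \frac{s^2}{n}$$

for large n [2].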
The variance of the sample means, $\sigma_m^2$, is important because it measures how far the mean of a sample of n values, $\mu_s$, is expected to deviate from $\mu_p$ [2]. Essentially, $\sigma_m^2$ is a predicted truncation error, where the truncation error is the difference between the mean of the total population and the mean of the sample. A more intuitive form of this error is the standard sampling error (SSE), which is the square root of the variance and is given as:

$$\sigma_m = \sqrt{\sigma_m^2} \approx \frac{s}{\sqrt{n}}$$

with $\sigma_m$ being the standard sampling error [2]. This is also called the standard error of the mean and is more intuitive because it has the same units as the x values [2]. Since the sampling distribution is a normal distribution, the normal sigma probabilities apply [2]. As such, approximately 95% of all sample means will lie within twice the SSE of the population mean, or 2-sigma [2].

Sampling without replacement
In Section 2.3, sampling with replacement from a global population was discussed. This means that each value selected was truly random and independent of previous selections. In DFT calculations, sampling from Table 2 without replacement is desired. Sampling without replacement means that each value is randomly selected and then withheld from the population for all subsequent selections. Sampling without replacement from a finite population leads to a negative correlation between the values already selected and the next value to be selected [2]. If an especially large value is selected and removed from the available dataset, the next value is more likely to be smaller, and a negative correlation results. This correlation affects the variance of the sampling distribution, and a correction factor is required.
The new equation for calculating the $\sigma_m^2$ value, when $\sigma_p^2$ is known, is given as:

$$\sigma_m^2 = \frac{\sigma_p^2}{n} \, \frac{N - n}{N - 1}$$

The variance of the sampling distribution when $\sigma_p^2$ is not known is written as:

$$\sigma_m^2 \approx \frac{s^2}{n} \, \frac{N - n}{N}$$

These equations are for sampling without replacement [2]. All other variables remain the same. This method of sampling and the resulting equations have an immediate benefit: the value of $\sigma_m^2$ goes to zero as n goes to N when sampling without replacement. Another benefit of sampling without replacement is a consistent number of calculations. These statistics will be used to calculate the estimated statistical parameters from a sample of the total calculations. For a sampling procedure with sample size n, only n atomic permutations will be randomly selected, ensuring a consistent number of calculations.
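A minimal sketch of the correction for sampling without replacement is given below. It assumes the standard finite population correction factors, (N − n)/(N − 1) when the population variance is known and (N − n)/N when the sample variance (with n − 1 in its denominator) is used as an estimate. The function names are hypothetical.

```python
import math

def sampling_variance(variance, n, N, population_known=True):
    """Variance of the sample mean when n of N values are drawn without replacement.

    `variance` is the population variance if `population_known` is True,
    otherwise the sample variance computed with n - 1 in the denominator.
    """
    if not 1 <= n <= N:
        raise ValueError("n must satisfy 1 <= n <= N")
    if population_known:
        return variance / n * (N - n) / (N - 1)
    return variance / n * (N - n) / N

def standard_sampling_error(variance, n, N, population_known=True):
    """Square root of the corrected variance of the sample mean."""
    return math.sqrt(sampling_variance(variance, n, N, population_known))

# Hypothetical sample variance of 4.0e-6 (eV/atom)^2 from n = 20 of N = 120 permutations.
sse = standard_sampling_error(4.0e-6, n=20, N=120, population_known=False)
print(sse, 2 * sse)  # standard sampling error and the 2-sigma error bar
```

Both forms return zero when n equals N, which reflects the fact that a sample containing the whole population reproduces the population mean exactly.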

Skewness and kurtosis
Skewness and kurtosis are measures of how the data from a particular set are distributed around the mean. The third and fourth standardized central moments are commonly called skewness and kurtosis, respectively. The skewness represents the symmetry of the tails of the distribution and is zero for a perfectly normal distribution. Eq. 14 shows how skewness is calculated:

$$\mathrm{Skewness}_z = \frac{1}{N} \sum_{i=1}^{N} \left(\frac{x_i - \mu_z}{\sigma_z}\right)^3 \tag{14}$$

where the subscript z represents the sample population (z = s) or the global population (z = p). N is the total number of data points being considered in the calculation.
Kurtosis is defined from the fourth central moment of the population distributed around the mean. The kurtosis represents the relative weight of the tails versus the center of the distribution and equals three for a normal distribution. Eq. 15 shows how kurtosis was calculated in the present work:

$$\mathrm{Kurtosis}_z = \frac{1}{N} \sum_{i=1}^{N} \left(\frac{x_i - \mu_z}{\sigma_z}\right)^4 \tag{15}$$

where the subscript z represents the sample population (z = s) or the global population (z = p). N is the total number of data points being considered in the calculation.
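The two moments can be evaluated with a short routine such as the one below. It is a sketch that assumes the standardized-moment definitions, with the standard deviation computed using N in the denominator; whether the original analysis used N or N − 1 there is not stated, so this choice is an assumption. The function name is illustrative.

```python
import numpy as np

def skewness_kurtosis(values):
    """Third and fourth standardized central moments of a data set."""
    x = np.asarray(values, dtype=float)
    deviations = x - x.mean()
    # Assumption: standard deviation with N (not N - 1) in the denominator.
    sigma = np.sqrt(np.mean(deviations**2))
    skewness = np.mean(deviations**3) / sigma**3
    kurtosis = np.mean(deviations**4) / sigma**4
    return skewness, kurtosis

# Synthetic normal data should give values near 0 and 3.
normal_draws = np.random.default_rng(0).normal(size=200_000)
print(skewness_kurtosis(normal_draws))
```

Applied to a sampling distribution of the mean, values close to 0 and 3 would support treating that distribution as normal.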

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.