How big is Big Data?

Big data has ushered in a new wave of predictive power using machine-learning models. In this work, we assess what {\it big} means in the context of typical materials-science machine-learning problems. This concerns not only data volume, but also data quality and veracity, as well as infrastructure issues. With selected examples, we ask (i) how models generalize to similar datasets, (ii) how high-quality datasets can be gathered from heterogeneous sources, (iii) how the feature set and complexity of a model can affect its expressivity, and (iv) what infrastructure requirements are needed to create larger datasets and train models on them. In sum, we find that big data poses unique challenges along very different dimensions, which should serve to motivate further work.


Introduction
Big Data is a term that not only governs social media and online stores, but has entered most modern research fields. As such, it also concerns materials science. While our research has always been largely based on data, through the analysis and interpretation of measured and computed results, in recent times, more aspects have come into play. For instance, there is the practical reason that more and more funding bodies require data to be kept for a certain period of time. On the scientific side, the idea of sharing data is enjoying popularity, thus avoiding that the same investigation is carried out multiple times. Most important is the use of data in machine learning (ML), or more generally, artificial intelligence (AI).
Artificial intelligence is arguably the most rapidly emerging topic in various research domains, with increasing impact in materials science as well. The success of AI approaches, however, strongly depends on the amount and quality of the underlying training data. In this context, the first obvious question is how large a dataset needs to be in order to provide sufficient information for the research problem to be solved. Are millions of scientific results, as available in international databases, a gold mine? Or do these collections still not provide enough significant data for a specific question? Or can we even learn from small datasets? Are large, multipurpose databases sufficient for training models, or to what extent are dedicated datasets needed for this task? Can errors be controlled when using data from different sources? Can we learn from experimental and theoretical data together, or are even different theoretical or experimental data by themselves too heterogeneous to produce reasonable predictions?
What is useful for a particular learning task may depend heavily on the research question, the quantity to be learned, and the methodology employed. This concerns data quality, data interoperability (especially when data from different sources are brought together), data veracity, and data volume (the last two are part of the "Four V's of Big Data" 1). In this work, we ask the question: "What does Big mean in the realm of materials science data?". There are many issues related to this, demonstrated by a few examples: First, some methods are more data hungry than others. While, for instance, symbolic regressors may lead to reasonable results already for a small dataset 2, neural networks (NNs) may require many more data points 3 for training to be stable. Second, even with what can be considered big datasets, training models that generalize to similar datasets is not trivial. Third, data quality may have a significant impact. While high-quality data may give one a clear picture from the very beginning, many more noisy data may need to be accumulated to obtain robust results, to avoid wrong conclusions, and/or to allow for physical interpretation. Finally, what does Big Data mean for data infrastructure? What are the requirements for processing and storing big datasets, and how compute-intensive is training ML models on them?
In this work, we address such questions and evaluate the performance of different AI models and tools in terms of data volume, diversity, and quality, and provide a quantitative analysis. As the overall topic is very wide and diverse, we aim to draw the reader's attention to it and initiate discussions rather than to provide detailed solutions to all of the related issues. The chosen examples should serve this purpose. In the first one, we ask whether a model obtained from a message-passing neural network (MPNN) can be transferred to a similar dataset. Second, similarity metrics and sorting are applied to understand data quality in terms of the convergence behavior of density-functional theory (DFT) data. Third, the effect of an expanded feature set on the expressivity of cluster expansion is demonstrated. Fourth, the complexity of different model classes is explored in terms of their performance. Finally, infrastructure requirements for creating large materials datasets and for training models with many parameters are quantitatively analyzed.

Transferability of models
Materials Project (MP) 4 and AFLOW 5 are popular data collections in computational materials science. Both are also often used for ML purposes. Most of the calculations are carried out with DFT using the same code (VASP, projector augmented wave method) and exchange-correlation functionals (mostly PBE). Since both databases contain a significant number of materials common to both, i.e., sharing the same composition and spacegroup, we ask the question whether a model trained on one dataset would perform well when making inferences on the other. We choose the formation energy as our target, since it is a rather well-behaved property, relying only on the total energies of the compounds and their constituents.
The AFLOW dataset is chosen as described in Ref. 6 and filtered to include only calculations performed with the PBE functional that contain both a bandgap and a formation energy. The MP dataset is retrieved from Ref. 7 (dataset snapshot denoted as MP-crystals-2018.6.1). After filtering, the two datasets comprise 61,898 and 60,289 materials, respectively. We proceed by assigning unique identifiers to the structures in the AFLOW and MP datasets. They consist of the chemical formula concatenated with the space group of the crystal, similar to Ref. 8, to define which structures are common to both databases and which are unique to either of them. This results in 50,652 unique composition-spacegroup pairs in the AFLOW dataset and 54,438 in the MP dataset, of which 8,591 pairs are common to both datasets.
To avoid leakage of information from the training set to the test set, we ensure that the same common structures are present in the training-test splits of both datasets. This means that a model trained on AFLOW data will not be used for inference on MP test data for the same structure it was trained on, and vice versa. For instance, all structures with composition Mg$_2$F$_4$ and spacegroup 136 get the label Mg2F4_136. If there are several of these structures, e.g., with different lattice parameters, all of them get assigned to the same split, i.e., training, validation, or test. We train MPNNs with edge updates, as described in Ref. 9, on the two datasets. Previously, we have shown that the hyperparameters that define the architecture of the MPNN transfer well to different prediction tasks 6. Following this, we perform no further hyperparameter optimization.
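As an illustration, the following minimal Python sketch shows how such composition-spacegroup labels can be built and how whole groups can be assigned to a single split. The function and variable names are our own and not part of the actual pipeline.

```python
# Minimal sketch of a leakage-free, group-based split; all names here are
# illustrative, not those of the actual workflow.
import random
from collections import defaultdict

def structure_label(formula: str, spacegroup: int) -> str:
    """Unique identifier: chemical formula concatenated with the space group."""
    return f"{formula}_{spacegroup}"

def grouped_split(entries, f_train=0.8, f_val=0.1, seed=0):
    """Assign ALL structures sharing a label to the same split, so that no
    composition-spacegroup pair leaks between training, validation, and test."""
    groups = defaultdict(list)
    for entry in entries:
        groups[structure_label(entry["formula"], entry["spacegroup"])].append(entry)
    labels = sorted(groups)
    random.Random(seed).shuffle(labels)
    n_train, n_val = int(f_train * len(labels)), int(f_val * len(labels))
    parts = {"train": labels[:n_train],
             "val": labels[n_train:n_train + n_val],
             "test": labels[n_train + n_val:]}
    return {name: [e for lbl in lbls for e in groups[lbl]]
            for name, lbls in parts.items()}

# Both Mg2F4_136 entries (different lattice parameters) land in the same split.
entries = [{"formula": "Mg2F4", "spacegroup": 136, "a": 4.6},
           {"formula": "Mg2F4", "spacegroup": 136, "a": 4.7},
           {"formula": "BaTiO3", "spacegroup": 221, "a": 4.0}]
splits = grouped_split(entries)
```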
Figure 1 shows the performance of the two models on the two datasets. While the predictions are very good when training and test data are from the same database (mean absolute errors, MAEs, of 30 meV and 23 meV for AFLOW and MP, respectively), the errors in the prediction of the other database are an order of magnitude larger. Notably, there is also a clear trend of too-large (too-small) predictions in the upper-right (lower-left) panel of the matrix. This points to a systematic offset between the predicted energies of the two datasets. We can partly trace back the discrepancies to the relatively small overlap of the spacegroup-composition pairs in the two databases, which is about 16%. Further analysis, indeed, reveals that the MP data contain many more materials with lower formation energies than the AFLOW data, as is evident from the top panel of Fig. 2. Differences may, however, also arise from computational details, such as Brillouin-zone (BZ) sampling, basis-set cutoff, convergence criteria, etc. The distribution of the shared structures (bottom panel of Fig. 2) indicates that the latter have a smaller impact overall, i.e., do not lead to a systematic offset. However, we see that there are many more entries of the same structures, e.g., with different lattice parameters, in the AFLOW dataset. For example, Ti$_2$O$_4$ (spacegroup 136) and BaTiO$_3$ (spacegroup 221) occur 127 and 115 times, respectively, while only one of each compound is found in the MP data.
Even for what we might consider big datasets in materials science, and using a rather simple ML target, this example shows that the models we train are only valid for the data they were trained on and struggle to generalize. In our specific example, the AFLOW and MP datasets turn out to sample the underlying material space differently, since the MP materials appear biased towards lower formation energies. We conclude that these two databases are not "big" enough in the sense that they are not diverse enough to enable predictions across a wider range of materials. Training on the combined dataset, the predictions become less biased, i.e., the errors are more symmetric with respect to the diagonal. However, the MAE is higher overall compared to the individual models. This indicates that differences in computational settings may also play a role and may need to be considered as input features.
Revealing data quality by similarity measures

Veracity, another of the "Four V's of Big Data", poses a challenge to ML applications by introducing a fundamental level of noise that cannot be overcome with more complex models. An example from computational materials science is sets of calculations with different levels of accuracy 10, determined by the chosen approximation, such as the exchange-correlation (xc) functional of DFT, or with different levels of numerical precision, determined by input parameters, such as basis-set size or k-grid. Related to this is a practical problem: If multiple calculations from different sources (with smaller or larger deviations in the results for a particular physical property) are available, which value should be trusted? This challenge can be met either by applying corrections to the data to bring them onto the same level of accuracy/precision 11 or by filtering them for consistent subsets. The former option requires additional ML models 2, trained on dedicated, high-precision datasets. However, generating the required data for training these models is costly, and making predictions for materials with large unit cells may require that the models are trained on such systems as well. Filtering the data by their numerical precision, on the other hand, can be applied to existing datasets, but one may not find enough data for a particular application because the number of calculations with exactly the same numerical settings is typically small. The number of calculations that can be used together can be increased, however, if we can understand, and quantify, the level of convergence of the data with respect to the computational parameters. Understanding and visualizing the convergence behavior of computed properties can be achieved by introducing similarity metrics, as recently demonstrated with the example of fingerprints of the electronic density of states (DOS) 12,13. Given a large enough set of calculations for a single material, the relationships between the parameters used in the calculations and the similarities of the results can be revealed.
To illustrate this, we make use of data from the NOMAD data collection 14,15. These calculations were carried out with the DFT code FHI-aims 16 as part of a systematic study of the impact of computational parameters on DFT results 17. For our analysis, we use the calculations of hexagonal boron nitride (h-BN), specifically, the DOS of ground-state calculations obtained with different basis-set sizes and k-point samplings at the experimentally determined equilibrium volume. To compare the results of these calculations, we compute the spectral fingerprints 18 of the DOS in the energy range between −10 and 10 eV around the Fermi energy and calculate the corresponding similarity matrix. The rows and columns are sorted by increasing numerical settings, separately for the two xc functionals LDA (matrix indices ≤ 71) and PBE (indices > 71). This matrix is depicted in Fig. 3. The bottom panel shows the number of k-points ($N_\mathrm{kpt}$) and basis functions ($N_\mathrm{b}$). For each k-point mesh (plateau in the k-mesh), the basis set is increased in the same way. Additionally, the data are sorted by a set of numerical settings, called "light", "tight", and "really tight" in FHI-aims. Finally, consecutive calculations with otherwise identical settings vary by the relativistic approximation employed for core electrons, i.e., "ZORA" and "atomic ZORA" 16. Sorting the matrix in this way, we see patterns appearing in the figure, which we discuss below.
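The following sketch conveys the idea behind such a fingerprint-based similarity matrix; the actual spectral fingerprints of Refs. 12, 13, and 18 are more elaborate, and all function names here are our own.

```python
# Minimal sketch of a fingerprint-based similarity analysis; the spectral
# fingerprints of Refs. 12, 13, and 18 are more elaborate, and all names
# here are our own.
import numpy as np

def dos_fingerprint(energies, dos, e_min=-10.0, e_max=10.0, n_e=256, n_rho=64):
    """Rasterize a DOS (energies ascending, relative to the Fermi energy)
    on [e_min, e_max] into a binary fill pattern."""
    grid = np.linspace(e_min, e_max, n_e)
    rho = np.interp(grid, energies, dos)
    levels = np.linspace(0.0, rho.max(), n_rho)
    # cell (i, j) is filled if the DOS at grid[j] reaches levels[i]
    return (rho[None, :] >= levels[:, None]).astype(np.uint8)

def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two binary fingerprints."""
    a, b = fp_a.astype(bool).ravel(), fp_b.astype(bool).ravel()
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def similarity_matrix(fingerprints):
    """Pairwise similarities; pre-sort the list by xc functional, k-mesh,
    and basis-set size to obtain the block patterns discussed in the text."""
    n = len(fingerprints)
    return np.array([[tanimoto(fingerprints[i], fingerprints[j])
                      for j in range(n)] for i in range(n)])
```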
Focusing first on the convergence of the LDA data (indices i ≤ 71), we observe a clear block structure, where the calculations of the first set, i.e., those with the lowest $N_\mathrm{kpt}$ (index i ≤ 23), are most dissimilar to all others, and also to each other. However, they are pairwise similar, indicating that the relativistic approximation plays a minor role in the convergence of the DOS. Sets of calculations with medium (24 ≤ i ≤ 47) and high (48 ≤ i ≤ 71) numbers of k-points generally show higher similarity among themselves. Noticeably, calculations with a low (medium and high) number of basis functions $N_\mathrm{b}$ are similar among different BZ samplings, but less similar to calculations with a higher (lower) $N_\mathrm{b}$. For PBE calculations, the convergence behavior is different: Already calculations with low $N_\mathrm{kpt}$ reach medium similarity to calculations with high $N_\mathrm{kpt}$. Contrary to the LDA data, PBE calculations with low and medium $N_\mathrm{kpt}$ do not reach high similarity to PBE calculations with the highest settings, indicating that PBE calculations require a higher k-point density for this material. Further comparing the similarity of LDA and PBE convergence, i.e., the off-diagonal blocks of the matrix (indices i > 71 on the x-axis and i ≤ 71 on the y-axis), we find a pattern corresponding to calculations with the same number of basis functions. It shows that calculations with different xc functionals behave more similarly to each other if the same number of basis functions is used.
Our example illustrates the veracity of materials data arising from the large combinatorial space of approximations and numerical parameters employed in DFT codes. While LDA and PBE, both being semi-local functionals, are generally expected to give similar results in terms of the electronic structure, we show here that the convergence behavior with the number of k-points and basis functions is surprisingly different. The visualization of DOS similarity matrices allows one to understand and quantify differences between DFT data computed with different computational settings. It also enables the creation of rules to group data in order to gather "homogeneous" subsets within a dataset.

Striving for expressivity in cluster expansion
In this example, we investigate the role of the feature set in modeling the bandgap of a family of perovskite materials by means of the cluster-expansion (CE) technique. CE is a multi-scale method based on a lattice Hamiltonian 19, which is used to model and predict properties of alloys at different compositions and temperatures. In CE, an alloy configuration is represented by a vector $\mathbf{s}$ of occupation variables $s_i$. For a binary alloy, for example, the occupation variables may take the values $s_i = -1$ or $1$, indicating the presence of one or the other species at crystal site $i$. By defining the cluster functions $\Gamma_c(\mathbf{s}) = \prod_{i \in c} s_i$, where $c$ is a subset of crystal sites, or cluster, it can be demonstrated 19 that any configuration-dependent property can be formally expanded in terms of the symmetry-averaged cluster functions $X_c(\mathbf{s}) = \langle \Gamma_c(\mathbf{s}) \rangle$, named cluster correlations, where $\langle \cdot \rangle$ represents the symmetry average over all clusters symmetrically equivalent to $c$. For modeling the bandgap, this reads

$$E_g(\mathbf{s}) = \sum_{c=1}^{p} J_c\, X_c(\mathbf{s}), \qquad (1)$$

where the sum runs over a set containing $p$ symmetrically inequivalent clusters. In what follows, we call this family of models "standard CE". Cluster expansion can be viewed as an ML problem 20,21, where the cluster correlations $X_c$ are input features describing the structure of the crystal. This view can be leveraged by considering basis expansions based on these features, leading to nonlinear CE models 21 defined as

$$E_g(\mathbf{s}) = \sum_{m=1}^{q} a_m\, h_m(\mathbf{X}), \qquad (2)$$

with the (possibly nonlinear) functions $h_m(\mathbf{X}): \mathbb{R}^p \to \mathbb{R}$ and $\mathbf{X}$ a $p$-vector with components $X_c$. The sum runs over a set containing $q$ functions $h_m$. For this example, we choose nonlinear polynomial features: Every function $h_m(\mathbf{X})$ is a monomial up to degree $d$, e.g., $h_j(\mathbf{X}) = X_1^2 X_2$. The total number of features in this case is given by $\sum_{k=1}^{d} \binom{p+k-1}{k}$.

In this section, we want to explore the expressiveness of the feature spaces in standard and nonlinear CE. Assuming that the intrinsic error in the data to be modeled is small, does our feature space allow us to approximate the ground truth? We use a dataset of 235 oxide perovskites with composition BaSnO$_{3-x}$, with $x < 1$ being the number of oxygen vacancies per formula unit. BaSnO$_3$ has significant potential for use in photocatalytic and optoelectronic applications. Its electronic structure exhibits a complex dependence on the concentration and configuration of oxygen vacancies. The bandgap varies strongly even among structures with identical concentration and comparable formation energies. Thus, modeling the bandgap is challenging for linear regression models. For this learning task, the clusters account for specific structural patterns of O vacancies.
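As a concrete illustration, the polynomial feature construction and a sparse fit (via orthogonal matching pursuit, used in the analysis below) can be sketched with standard scikit-learn calls. The cluster correlations and bandgaps here are random placeholders, not the actual BaSnO$_{3-x}$ data.

```python
# Sketch of the nonlinear CE feature construction and a sparse OMP fit;
# X and y are random placeholders for the actual cluster correlations and
# bandgaps of the 235 BaSnO3-x structures.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(235, 27))  # placeholder cluster correlations
y = rng.uniform(0.0, 3.0, size=235)         # placeholder bandgaps (eV)

# All monomials up to degree d in the 27 correlations (no constant term):
# 405 features for d = 2 and 4059 for d = 3, as quoted in the text.
H = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)
print(H.shape)  # (235, 4059)

# Nonlinear CE model with a fixed number of selected features, as in Fig. 4.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=150).fit(H, y)
rmse = np.sqrt(np.mean((omp.predict(H) - y) ** 2))
```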
As shown in Fig. 4, standard CE models obtained by orthogonal matching pursuit (OMP) from a pool of p = 27 clusters yield fit root-mean-squared errors (RMSEs) larger than 250 meV (blue line, top-left panel). The best fit is obtained, as expected, when using all 27 features (marked by a vertical dashed line). For the pool of p = 27 clusters, we explore polynomials of degree d = 2 and d = 3, containing 405 and 4059 features, respectively, obtaining models with up to 200 features using OMP. These are shown by the orange and red lines in the top-left panel of Fig. 4. In addition to the expected monotonic reduction of the error, two notable findings are obtained: First, with 27 features, the nonlinear feature space yields better models than the standard CE with the same number of 27 clusters. This is remarkable because the nonlinear features are derived solely from the 27 cluster correlations used in the standard CE model. Related to this point, we also observe that for the same number of features, degree 3 always gives a smaller error than degree 2. Second, both nonlinear CE models reach a point where no further improvement is obtained upon increasing the number of features. This is observed as a plateau above ∼150 features for d = 3 and above ∼180 features for d = 2. The plateau for d = 2 is reached at a higher level of the RMSE than for d = 3. Thus, under the assumption that the intrinsic error is small, increasing the complexity of the feature space allows the CE to better approach the underlying ground truth. In practice, increasing the complexity of the feature space allows CE to reach the same accuracy with sparser models than is obtainable with less complex feature spaces. This is important for model-selection methods that favor sparsity, such as LASSO. The top-right panel of Fig. 4 shows the quality of the predictions for the whole dataset using the best possible model for standard CE (blue dots) and the best possible model for d = 3 nonlinear CE, based on the same 27 cluster correlations. The improvement of the nonlinear modeling is remarkable, especially for the difficult case of metals (the set of materials with $E_g = 0$).

Now, the question arises whether the beneficial effect of nonlinear features could also be obtained by adding more cluster correlations, that is, by accounting for more structural patterns of vacancies. For this, we have created a pool of 176 clusters. The result of standard CE is shown in the lower-left panel of Fig. 4 (blue line). As before, we see a plateau setting in at about 110 clusters with an RMSE of ∼125 meV, and no further improvement can be achieved upon adding more clusters. The vertical dashed line indicates the model with the largest possible number of clusters in this pool. Again, the second-order nonlinear CE based on this same pool allows for obtaining better fits at all numbers of features. A comparison of the quality of the fits for standard and nonlinear CE models with 150 clusters, respectively, is shown in the lower-right panel of the figure. Similarly to what was obtained for the small pool of 27 clusters, the nonlinear CE yields significantly better predictions.
From this example, we see that the feature set can have a strong impact on the expressiveness of the CE model. Including nonlinear terms enables the CE to better fit the underlying ground truth even with a small dataset size. This benefit appears irrespective of the number of cluster correlations. This improvement motivates us to seek a better definition of model complexity, which is done in the following section.

Model complexity
Can knowing the complexity of a model tell us something about model performance on a given dataset? The term model complexity is not well defined in the literature. Loosely speaking, it describes the capacity of a model to learn an underlying probability distribution. We start this section by defining what complexity means for several model classes. We then evaluate how complexity determines model performance for two example datasets. Finally, we discuss the search for a model-independent scalar that represents model complexity.

Complexity for neural networks, random forests, and SISSO models
Here, we discuss how complexity can be quantified for three different types of models, i.e., neural networks (NNs), the sure-independence-screening sparsifying operator (SISSO) 22, and random forests (RFs). For NNs, the total number of trainable parameters is often used as a measure of the model's complexity. That said, different NN parameters can have very different impact. The output of node $j$ at layer $k+1$ in a feed-forward network is

$$x_j^{(k+1)} = \sigma\left( \sum_i W_{ij}\, x_i^{(k)} + b_j \right). \qquad (3)$$

This node receives inputs $x_i^{(k)}$ from layer $k$, which are multiplied by weight parameters $W_{ij}$ and summed with a bias parameter $b_j$ before being fed into a nonlinear activation function, $\sigma$. The bias parameters thus play a role quite different from that of the weight parameters. A better description of complexity for NNs is therefore a two-dimensional vector, containing the number of weights and the number of biases. The benefit of this definition is that it is simple to compute. The drawback is that it is not a unique definition and does not differentiate between parameters deeper in the network, which are more expressive because they transform nonlinear inputs with nonlinear functions 23.
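A minimal sketch of this two-component complexity measure for a fully connected network, given only its layer sizes (the function name is our own):

```python
# Sketch of the (weights, biases) complexity vector of a fully connected
# feed-forward network, given its layer sizes; the function name is ours.
def nn_complexity(layer_sizes):
    """Return (number of weights, number of biases)."""
    weights = sum(n_in * n_out
                  for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights, biases

# Example: 27 inputs, two hidden layers of 64 nodes, one output node.
print(nn_complexity([27, 64, 64, 1]))  # (5888, 129)
```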
The SISSO model first combines primary features into generated features using a set of mathematical operations (e.g., +, −, exp(), $\sqrt{\ }$, ..., and combinations thereof) 22,24. The number of features that are selected by the algorithm is called the dimension of the model. We can interpret this in terms of a (symbolic) tree that describes a generated feature, where each split in the tree is an operation and the leaves are the primary features 24. The larger the depth of the tree, the more operations there are in the generated-feature space. The maximum tree depth is called the rung of the model. In essence, the number of selected generated features (the model dimension) defines how many coefficients the model has. One can also include a single constant bias term in SISSO. With a larger rung value, the number of possible generated features explodes combinatorially. Therefore, we can define a descriptor for the complexity of SISSO as a two-dimensional vector, comprising the rung and the dimension of the model (counting the bias term, if present, as an extra dimension).
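The combinatorial growth of the generated-feature space with the rung can be illustrated with a toy feature generator; this is a simplified stand-in for the actual SISSO implementation 22,24, with a deliberately tiny operator set.

```python
# Toy stand-in for SISSO-style feature generation: each rung applies the
# operator set once more, so the candidate space explodes combinatorially.
# The operator set here is deliberately tiny and purely illustrative.
import itertools
import math

UNARY = {"exp": math.exp, "sqrt": lambda v: math.sqrt(abs(v))}
BINARY = {"+": lambda a, b: a + b, "-": lambda a, b: a - b}

def next_rung(features):
    """features: dict mapping a symbolic expression to its value."""
    new = dict(features)
    for name, value in features.items():
        for op, fn in UNARY.items():
            new[f"{op}({name})"] = fn(value)
    for (n1, v1), (n2, v2) in itertools.combinations(features.items(), 2):
        for op, fn in BINARY.items():
            new[f"({n1}{op}{n2})"] = fn(v1, v2)
    return new

rung1 = next_rung({"x1": 0.5, "x2": 1.2})
rung2 = next_rung(rung1)
print(len(rung1), len(rung2))  # 8, 80: rapid growth with the rung
```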
How can we define complexity for RFs? RFs are piecewise constant functions. For a regressor, each leaf node in each decision tree learns a constant bias term. For each split in each decision tree, a value of the variable to be split on must be learned. Since the trees are binary, the number of leaf nodes is one greater than the number of splits. Therefore, the number of leaf nodes is an integer that tells us the number of trainable parameters in the model and describes the complexity of the model.
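For a fitted scikit-learn random forest, this leaf count is directly accessible; a short sketch with synthetic data (the attributes `tree_.n_leaves` and `tree_.node_count` are part of the scikit-learn API):

```python
# Leaf count of a fitted scikit-learn random forest as its complexity;
# the data here are synthetic.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

n_leaves = sum(est.tree_.n_leaves for est in rf.estimators_)
n_splits = sum(est.tree_.node_count - est.tree_.n_leaves
               for est in rf.estimators_)
# For binary trees, leaves = splits + 1, summed over all trees:
assert n_leaves == n_splits + rf.n_estimators
print(n_leaves)
```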
From each of these three model types, we have presented a measure of model complexity, each described by a different number or vector. For NNs, it is a two-dimensional vector of the total numbers of weights and biases; for SISSO, it is a two-dimensional vector reporting the rung and the dimension of the model; for RFs, it is the number of leaves in the model. The definitions are not unique but offer an idea of how well a model can learn to approximate a wide variety of functions.

Performance as a function of complexity
Can model performance be predicted from the complexity of the model? Two examples are examined here using the models from the previous section. In the first example, an NN and a three-dimensional SISSO model are fit to the DFT data used in Ref. 17, using 80% of the data for training and 20% for testing. The SISSO model outperforms the best-performing NN architecture, with a test root-mean-squared logarithmic error of 0.231 compared to 1.367, despite having fewer parameters. The NN complexity can be scaled by changing the number of nodes and layers in the model. An NN with the same number of linear, nonlinear, and constant parameters as the SISSO model still performs an order of magnitude worse than the SISSO model. This may not come as a surprise, since NNs are considered to be data hungry 25. From this example, however, we can deduce that the number of constant, linear, and nonlinear parameters in the model is not enough information to predict model performance. Rather, the model class is the determining factor here.
In the second example, an NN and an RF are trained to predict bandgaps on the AFLOW dataset of Ref. 6. Features describing the elemental composition of the materials, as described in Ref. 26, are fed into both models. For the same parameter count, one can argue that an NN with the ReLU activation function is more complex than the RF, since the RF model is piecewise constant while the NN is piecewise linear. However, both NNs and RFs are universal approximators, meaning that with enough splits/nodes, they can approximate any probability distribution. On the AFLOW bandgap dataset, after cross-validation, the RF outperforms the NN, with test MAEs of 469 meV and 515 meV, respectively. This trend holds when we restrict the total numbers of parameters of both models to be similar. Once again, model class, not model complexity, is the decisive factor for model performance.
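A sketch of such a cross-validated model-class comparison, using synthetic placeholder data instead of the AFLOW composition features and bandgaps, and illustrative (not the original) hyperparameters:

```python
# Sketch of the cross-validated model-class comparison; synthetic data
# replace the AFLOW composition features and bandgaps, and the
# hyperparameters are illustrative, not those used in the text.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=1000, n_features=32, noise=10.0,
                       random_state=0)

models = {
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    "NN": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=1000,
                       random_state=0),
}
for name, model in models.items():
    mae = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_absolute_error").mean()
    print(f"{name}: cross-validated MAE = {mae:.1f}")
```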
We can conclude from both examples that the number of trainable parameters alone tells us very little about model performance. Rather, we find that for a given dataset, the performance depends on the model class.

Searching for a definition of model complexity
Can model complexity be defined with a scalar? In the previous subsections, we saw that the total parameter count is a poor metric for NNs. A similar argument can be made for polynomial regressors, where some parameters allow for linear terms to be used by the model and others allow for nonlinear terms. One could therefore be motivated to count the coefficients of linear and nonlinear terms and place them on uneven footing, since the nonlinear coefficients give the model a different type of flexibility than the linear coefficients. The question, though, is how to weight these terms to come up with a more general metric for model complexity. The simplest combination would be a weighted sum,

$$C(h) = aA + bL + cN, \qquad (4)$$

where C is the complexity of the model h, A is the number of additive parameters, L is the number of multiplicative coefficients of linear terms in the features, and N is the number of coefficients of nonlinear combinations. If we set the coefficients of the complexity metric, a, b, and c, to unity, we recover the total trainable-parameter count. Assigning values to a, b, and c is, however, not an easy task. The proposed complexity metric will mean different things for different models. As discussed, RFs have additive constant parameters that need to be learned, but no coefficients of linear or nonlinear combinations of the features. So, with 10,000 additive constant parameters, an RF can learn to approximate many different distributions with a low mean-squared error. A generalized linear model, however, is only able to approximate constant functions with constant additive parameters. Therefore, at the same value of complexity, as defined in Equation 4, the RF has a much higher capacity to learn a wide variety of distributions than the generalized linear model.
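Equation 4 as a function, together with the degenerate case just discussed: an RF with only additive parameters and a generalized linear model can receive the same scalar complexity despite very different capacities.

```python
# The weighted-sum complexity of Equation 4; a = b = c = 1 recovers the
# total trainable-parameter count.
def complexity(A: int, L: int, N: int, a=1.0, b=1.0, c=1.0) -> float:
    """C(h) = a*A + b*L + c*N (A: additive, L: linear, N: nonlinear terms)."""
    return a * A + b * L + c * N

print(complexity(A=10_000, L=0, N=0))  # RF with 10,000 leaves -> 10000.0
print(complexity(A=1, L=9_999, N=0))   # linear model, same scalar -> 10000.0
```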
These examples demonstrate that the total parameter count is not sufficiently descriptive of a model's ability to learn to approximate a wide variety of functions across model classes. We offer an alternative metric for model complexity as a weighted sum of constant, linear, and nonlinear terms. We conclude, however, that the meaning of complexity is model dependent. This motivates us to speak of complexity only within a single model class. More work is needed to explore these topics more carefully from an information-theoretic point of view.
Infrastructure requirements

File and storage requirements for dataset calculation
In order to create large curated datasets of DFT calculations, sophisticated high-throughput workflows have become an indispensable tool. Here, we would like to provide an estimate of what this means in terms of infrastructure requirements. In Fig. 5, an example of such a workflow is depicted, leaving out workflow managers 27, validity constraints 28, and the like for the sake of simplicity. Here, an Atomic Simulation Environment (ASE) 29 database is employed to store input crystal structures that the user desires to simulate with exciting, an all-electron code based on the linearized augmented planewave method 30. In this specific workflow, each structure is relaxed, i.e., the geometry is optimized by running multiple single-point calculations, via the open-source Python API excitingtools 31. The bandstructure is calculated afterwards for the equilibrium geometry. This rather simple workflow creates 41 output files and roughly 30 MB of data for each crystal structure. In addition, for each structure, the calculation must be converged with respect to BZ sampling and basis-set size. Typical usage would see at least three different k-point grids and three different basis-set sizes. For 30,000 crystal structures, which is not too big a number for ML tasks (see, e.g., Section 2), and nine different settings per structure, we perform 270,000 relaxations and bandstructure calculations. This amounts to 11,070,000 files or, in terms of storage, roughly 8.1 TB of data.
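The quoted numbers follow from a simple back-of-the-envelope calculation:

```python
# Back-of-the-envelope check of the file and storage estimates quoted above.
n_structures = 30_000
n_settings = 9              # 3 k-point grids x 3 basis-set sizes
files_per_run = 41
mb_per_run = 30

runs = n_structures * n_settings         # 270,000 relaxation + bands runs
n_files = runs * files_per_run           # 11,070,000 files
storage_tb = runs * mb_per_run / 1e6     # ~8.1 TB
print(runs, n_files, storage_tb)
```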
Such workflows are typically executed on supercomputers, which impose limits on users concerning disk space and numbers of files. Storage limits are, e.g., on the order of a few hundred GB in backed-up directories, or a few TB in scratch directories. There are also limits on the number of files that can be stored (often at most 1 million). Both limits often mean that users need to compress their data periodically and transfer their files to a storage system outside of the supercomputer network. In this example, the data are transferred to a NOMAD Oasis 32,33. NOMAD Oasis is the same software as in the public NOMAD data infrastructure 34, including, among other tools, parsing, normalizing, electronic notebooks, and the NOMAD AI toolkit 35. The software can be installed locally by any research group and can be configured to their needs. Calculated data, as described above, are stored in an interoperable format 36,37.
In summary, calculating big data presents its own unique infrastructure challenges. The workflow example shown here quantitatively demonstrates that performing high-throughput DFT calculations requires sophisticated hardware/software solutions to navigate HPC file and storage restrictions. The NOMAD Oasis is one such solution to this problem, which also allows for making data ready for ML purposes.
Fig. 5 Workflow for high-throughput geometry optimization, followed by a bandstructure calculation. Crystal structures are pulled from a NOMAD Oasis instance and stored in an ASE database (DB). The structures are read by the Python API excitingtools, and a geometry optimization is carried out, consisting of multiple single-point ground-state calculations with exciting. For the resulting relaxed geometry, a bandstructure calculation is performed. All output files are uploaded to a NOMAD Oasis instance.

Computing requirements for training large models
State of the art in ML is to employ meta-learning approaches, or neural-architecture searches (NAS), to find the best NN model architecture 38. Typically, 500 to 10,000 architectures are selected by an algorithm (e.g., random search or reinforcement learning) and trained, and then the best ones are selected by a user-defined reward function 39. This approach has been applied to a wide variety of fields, from object detection 40 to audio-signal processing 41, and, more recently, in computational materials science 6. It can be formulated mathematically as

$$h^* = \underset{h \in H}{\arg\max}\; R(h, D), \qquad (5)$$

where the reward function R is maximized by searching for the NN architecture h in a user-specified search space H over some given dataset D. The dataset is defined as a collection of independently and identically distributed data points, i.e., $D = \{x_1, x_2, \ldots, x_n\}$, where the data points $x_i$ are sampled from some underlying probability distribution $P(x)$. A simple search space may consist of the number of layers (e.g., 1-5) in a deep neural network (DNN) and the number of nodes per layer (e.g., one of [16, 32, 64]). The NAS needs to train each candidate architecture enough to be able to compare them against one another. Note that there are some methods in the literature (e.g., differentiable search 42, shared weights 40, early stopping 41, etc.) to reduce the computational burden of NAS, but none are guaranteed to return the correct ranking of architectures in terms of the reward function.

For an MPNN model, several architecture parameters determine the number of model parameters. Here, we discuss two such parameters, the number of message-passing steps and the latent-space size 9. There are also other parameters, such as the readout function and the number of layers in each graph convolution, which we do not discuss here. Figure 6 shows how the performance of the MPNN architecture described in Ref. 9 and trained on AFLOW bandgaps depends on these two parameters. Here, the NAS reward function is the negative of the validation RMSE. Each data point in the figure represents an individual MPNN architecture. The z-axis represents this architecture's best validation RMSE across different initial learning rates, learning-decay rates, and readout functions. We see that the validation RMSE as a function of the two architecture parameters is non-convex. This non-convexity is what makes the optimization of the architecture slow, and it is why black-box optimization methods such as reinforcement learning, random search, or evolutionary search are necessary 43. Whether the NAS reward function is non-convex depends, of course, on the dataset and the search space.

Let us proceed to analyze the computing requirements for a concrete NAS example. The SchNet graph neural network has ca. 85,000 trainable parameters for a latent size of 64 and three message-passing steps (when trained on a dataset with 81 different atomic elements) 44. On a single NVIDIA V100 GPU, it takes about two hours to train for 2,000,000 steps on the AFLOW bandgap dataset (batch size 32). A further breakdown of the time required per training step shows that the model takes 0.0010 seconds on average to batch the data and 0.0023 seconds to perform a gradient-update step. For the SchNet-derived NAS search space, approximately 2,000,000 steps are needed to get an idea of whether a candidate architecture is promising or not. This means that training 2,000 architectures requires approximately 40,000 GPU hours for this dataset, or in other words, more than 4.5 years on a single V100 GPU. Converting this number to dollars, current Amazon Web Services pricing plans fluctuate around 3 USD per NVIDIA V100 GPU hour 45. Thus, a budget of roughly 120,000 USD is needed to perform this NAS. If we used a larger model than SchNet, such as PaiNN 46 (600,000 vs. 85,000 parameters), the training time and the resulting GPU and dollar budgets would increase significantly. Moreover, for datasets larger than AFLOW (about 60,000 data points, see Section 2), training would also slow down dramatically.
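A minimal random-search NAS loop over the simple layers/nodes search space mentioned above might look as follows; `train_and_validate` is a placeholder for training a candidate for roughly 2,000,000 steps and returning the reward (here, the negative validation RMSE).

```python
# Sketch of random-search NAS; the expensive call is train_and_validate,
# which stands in for a full ~2,000,000-step training run per candidate.
import random

SEARCH_SPACE = {"n_layers": [1, 2, 3, 4, 5], "n_nodes": [16, 32, 64]}

def sample_architecture(rng: random.Random) -> dict:
    return {key: rng.choice(values) for key, values in SEARCH_SPACE.items()}

def random_search(train_and_validate, n_candidates=2000, seed=0):
    rng = random.Random(seed)
    candidates = [sample_architecture(rng) for _ in range(n_candidates)]
    rewards = [train_and_validate(h) for h in candidates]  # dominates the cost
    best = max(range(n_candidates), key=lambda i: rewards[i])
    return candidates[best], rewards[best]
```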
We can also compare CPU versus GPU training for this example. On a single CPU core, training the SchNet model for 2,000,000 steps takes 10.5 hours, with, on average, approximately 0.0012 seconds per training step to batch the data and 0.0177 seconds to take a gradient step and update the parameters. We conclude that CPU training is nearly an order of magnitude slower.
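Converting the per-step timings quoted above into total training times reproduces the numbers in the text:

```python
# Per-step timings (seconds) from the text, converted to total hours.
steps = 2_000_000
gpu_h = steps * (0.0010 + 0.0023) / 3600   # batching + gradient step: ~1.8 h
cpu_h = steps * (0.0012 + 0.0177) / 3600   # ~10.5 h
print(f"GPU: {gpu_h:.1f} h, CPU: {cpu_h:.1f} h "
      f"(gradient step alone: {0.0177 / 0.0023:.1f}x slower on CPU)")
```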
As NAS becomes more ubiquitous in materials science, as it has in other fields, new infrastructure challenges will have to be addressed. Finding the optimal model is often a non-convex task over a large multidimensional search space, which requires black-box optimization methods. Even for models with a relatively small number of parameters, an MPNN NAS can require a huge computational, and therefore monetary, budget to be run on medium-sized to large datasets. As datasets and the base models being trained grow larger, so too will the infrastructure requirements.

Discussion and conclusions
If we define a dataset as "big" because it has a large number of data points, we may wonder why ML models still fail to generalize. For this aspect, data diversity, i.e., how well the underlying physical space is sampled, is critical. For example, the AFLOW and MP datasets, despite their size, do not appear to contain enough data to describe the wide variety of materials. In other words, even larger and, in particular, more diverse datasets are required to build robust and transferable models. This has been shown by combining these two datasets.
Efforts to create big datasets will require advanced technical and software solutions. For example, computing only a fraction of the above-mentioned datasets already causes file and storage issues. In addition, as datasets grow, so does the number of model parameters. MPNNs and other variants of graph neural networks have become quite common in materials science. Finding the best MPNN model requires a neural-architecture search over an often non-convex search space. GPUs offer an order-of-magnitude speedup in training, but using enough GPUs to perform a single NAS over a large dataset requires a significant monetary budget.
Databases also pose challenges related to data veracity. For instance, data points computed for the same material may differ in the input settings. Sorting or clustering data using similarity metrics can help users better understand the quality of the data. It also enables ML practitioners to filter data into homogeneous subsets that contain less variation in the target properties. Multi-fidelity modeling 11, where heterogeneous data can be used by the model, is not discussed here; it would be an alternative option for dealing with data veracity.
Big data presents an opportunity to use complex models. Complexity is shown to be well defined within the confines of a single model type; a general definition for comparative purposes is, however, lacking. The total trainable-parameter count, or the numbers of constant, linear, or nonlinear terms, can be less important than the model class when predicting performance. Increasing complexity, however, as shown with the example of using nonlinear features in cluster expansion, can aid significantly in describing intricate physics. More research is needed to better define model complexity in a data- and model-independent manner.
In this work, we attempt to shed light on some aspects of Big Data in materials science. There are certainly many more aspects to be explored. We hope that the issues we illustrate here will motivate further research along these lines.

Fig. 1 Predicted versus calculated formation energies for AFLOW and Materials Project (MP) data. Top-left: model trained and tested on AFLOW data; top-right: model trained on AFLOW and tested on MP; bottom-left: model trained on MP and tested on AFLOW; bottom-right: model trained and tested on MP.

Fig. 2 Top: Distribution of formation energies in the two datasets, containing 50,652 (AFLOW) and 54,438 (MP) unique composition-spacegroup pairs. Bottom: Distribution of formation energies for the 8,591 composition-spacegroup pairs that are present in both datasets.

Fig. 3 DOS similarity matrices for h-BN obtained with different basis-set sizes and k-meshes, and two different exchange-correlation (xc) functionals. The data are sorted by the latter, where low indices (≤ 71) correspond to the local-density approximation (LDA) and high indices (> 71) to the generalized-gradient functional PBE. The color code indicates the similarity coefficient. The bottom panel shows the number of k-points (blue) and the number of basis functions (orange).

Fig. 4 Left: Fitting errors for standard (blue) and nonlinear CE models (orange, red) based on a pool of 27 clusters (top) and 176 clusters (bottom). Right: Predicted versus target bandgaps for the standard CE with 27 clusters and the degree-3 nonlinear CE model with 150 features (top), and for standard and nonlinear CE models with 150 clusters/features (bottom).

Fig. 6 Effect of two architecture parameters, the number of message-passing (MP) steps and the latent-space size, on the validation RMSE of a message-passing neural network with edge updates fitted to bandgaps of the AFLOW dataset. Other parameters, such as the learning rate and the size of the readout function, were optimized for each combination of MP steps and latent size.