Machine learning in physics: The pitfalls of poisoned training sets

Known for their ability to identify hidden patterns in data, artificial neural networks are among the most powerful machine learning tools. Most notably, neural networks have played a central role in identifying states of matter and phase transitions across condensed matter physics. To date, most studies have focused on systems where the different phases of matter and their phase transitions are known, and thus the performance of neural networks is well controlled. While neural networks present an exciting new tool to detect new phases of matter, here we demonstrate that when the training sets are poisoned (i.e., contain poor or mislabeled training data), it is easy for neural networks to make misleading predictions.

Convolutional neural networks (CNNs), in particular, are specialized neural networks for processing data with a grid-like topology. Familiar examples include time-series data, where samples are taken at regular intervals, and images (two-dimensional data sets). The primary difference between standard neural networks and convolutional neural networks lies in how the hidden layers are structured. In CNNs, a convolution is applied that divides the feature space into smaller sections, emphasizing local trends. Because of this, CNNs are ideally suited to study physical models on hypercubic lattices. Recently, it was demonstrated that CNNs can be applied to the detection of phase transitions in Edwards-Anderson Ising spin glasses on cubic lattices [24]. It was shown that the critical behavior of a spin glass with bimodal disorder can be inferred by training the model using data with Gaussian interactions between the spins. The use of CNNs also results in a reduced numerical effort, which means one could potentially access the larger system sizes often needed to overcome corrections to scaling in numerical studies. As such, pairing specialized hardware to simulate Ising systems [25][26][27] with machine learning techniques might one day elucidate properties of spin glasses and related systems. However, as we show in this work, the use of poor input data can result in erroneous or even unphysical results. This (here inadvertent) poisoning of the training set is well known in computer science, where small amounts of bad data can strongly affect the accuracy of neural network systems. For example, Steinhardt et al. [28] demonstrated that even small amounts of bad data can result in a sizable drop in classification accuracy. References [29][30][31] furthermore demonstrate that data poisoning can have a strong effect in machine learning. Reference [32] focuses on adversarial manipulations [33,34] of simulational and experimental data in condensed matter physics applications. In particular, they show that changing individual variables (e.g., a pixel in a data set) can generate misleading predictions. This suggests that results from machine learning algorithms depend sensitively on the quality of the training input.
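To make this setup concrete, the following is a minimal sketch of a single-convolutional-layer binary classifier for L x L x L overlap configurations, written with TensorFlow/Keras as referenced later in this work. It is not the architecture of Tab. III; the filter count, kernel size, dense-layer width, and optimizer settings are illustrative assumptions.

import tensorflow as tf

def build_cnn(L, n_filters=16, kernel_size=2):
    """Single-convolution classifier for L x L x L configurational overlaps."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(L, L, L, 1)),        # one channel: q_i = +/-1
        tf.keras.layers.Conv3D(n_filters, kernel_size,     # local (grid-like) features
                               activation="relu"),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),    # classification probability
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model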
In this work, we demonstrate that the use of poorly-thermalized Monte Carlo data or simply mislabeled data can result in erroneous estimates of the critical temperatures of Ising spin-glass systems. As such, we focus less on adversarial cases and more on accidental cases of poor data preparation. We train a CNN with data from a Gaussian Ising spin glass in three space dimensions and then use data generated for a bimodal Ising spin glass to predict the transition temperature of the same model system, albeit with different disorder. In addition, going beyond the work presented in Ref. [32], we introduce an analysis pipeline that allows for the precise determination of the critical temperature. While good data results in a relatively accurate prediction, the use of poorly-thermalized or mislabeled data produces misleading results. This should serve as a cautionary tale when using machine learning techniques for physics applications.
The paper is structured as follows. In Sec. II we introduce the model used in the study, as well as simulation parameters for both training and prediction data. In addition, we outline the implementation of the CNN as well as the approach used to extract the thermodynamic critical temperature, followed by results and concluding remarks.

II. MODEL AND NUMERICAL DETAILS
To illustrate the effects of poisoned training sets we study the three-dimensional Edwards-Anderson Ising spin glass [35][36][37][38][39] with a neural network implemented in TensorFlow [40]. The model is described by the Hamiltonian

H = -\sum_{<i,j>} J_ij s_i s_j,   (1)

where each J_ij is a random variable drawn from a given symmetric probability distribution (either bimodal or Gaussian), s_i = ±1 represent Ising spins, and the sum is over nearest neighbors on a cubic lattice with N sites. Because spin glasses do not exhibit spatial order below the spin-glass transition, we measure the site-dependent spin overlap

q_i = s_i^α s_i^β,   (2)

between replicas α and β. In the overlap space, the system is reminiscent of an Ising ferromagnet, i.e., approaches for ferromagnetic systems introduced in Refs. [6,7] can be used. For low temperatures, q = (1/N) \sum_i q_i → 1, whereas for T → ∞, q → 0. For an infinite system, q abruptly drops to zero at the critical temperature T_c. Therefore, the overlap space is well suited to detect the existence of a phase transition in a disordered system, even beyond spin glasses.
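As a minimal illustration of Eq. (2), assuming two replica spin configurations stored as NumPy arrays of shape (L, L, L) with entries ±1:

import numpy as np

def overlap(s_alpha, s_beta):
    """Site-dependent overlap q_i = s_i^alpha * s_i^beta and its average q."""
    q_site = s_alpha * s_beta      # configurational overlap, one value per site
    return q_site, q_site.mean()   # q -> 1 deep in the spin-glass phase, q -> 0 for T -> infinity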

A. Data generation
We use parallel tempering Monte Carlo [41] to generate configurational overlaps. Details about the parameters used in the Monte Carlo simulations are listed in Tab. I for the training data with Gaussian disorder. The parameters for the prediction data with bimodal disorder are listed in Tab. II.

TABLE I: Parameters for the training samples with Gaussian disorder. L is the linear size of a system with N = L^3 spins, N_sa is the number of samples, N_sw is the number of Monte Carlo sweeps for each of the replicas for a single sample, T_min and T_max are the lowest and highest temperatures simulated, N_T is the number of temperatures used in the parallel tempering Monte Carlo method for each system size L, and N_con is the number of configurational overlaps for a given temperature in each instance.

We use the same number of instances as in Ref. [42], with 100 configurational overlaps at each temperature for each instance. Because the transition temperature with Gaussian disorder is T_c ≈ 0.95 [42][43][44], following Refs. [6,8,45], for the training data we label the configurational overlaps from temperatures above 0.95 as "1" and those from temperatures below 0.95 as "0." The parameters for the architecture of the convolutional neural network are listed in Tab. III. We inherit the single-layer structure from Ref. [8]. All parameters are determined using separate validation sample sets, which are also generated from Monte Carlo simulations. Note that we use between 4000 and 10000 disorder instances for the bimodal prediction data, which corresponds to approximately 1/3 of the numerical effort needed when estimating the phase transition directly via a finite-size scaling analysis of Monte Carlo data, as done, for example, in Ref. [42]. As such, pairing high-quality Monte Carlo simulations with machine learning techniques can result in large computational cost savings.
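The labeling step described above can be sketched as follows; the array names and the threshold constant are assumptions of this illustration, not code from the original study.

import numpy as np

T_C_GAUSSIAN = 0.95   # transition temperature of the Gaussian model [42][43][44]

def label_overlaps(temperatures):
    """Label configurational overlaps: 1 above T_c, 0 below (training labels)."""
    return (np.asarray(temperatures) > T_C_GAUSSIAN).astype(int)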

C. Data analysis
Because the configurational overlaps [Eq. (2)] contain information about the phases, we expect different phases to exhibit distinct overlap patterns on these grid-like data. Therefore, in the region of a specific phase, it is reasonable to expect that the classification probability for the CNN to identify the phase correctly should be larger than 50%. As such, it can be expected that when the classification probability is 0.5, the system is at the system-size-dependent critical temperature. A thermodynamic estimate can then be obtained via the finite-size scaling method presented below.
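As an illustration of the p = 0.5 criterion, a size-dependent pseudo-critical temperature can be read off by interpolating the measured classification probabilities; this sketch assumes the probabilities increase monotonically with temperature on the simulated grid.

import numpy as np

def crossing_temperature(T, p, p_star=0.5):
    """Temperature at which the classification probability crosses p_star.

    T, p: 1D arrays for one system size, sorted by increasing temperature,
    with p rising from ~0 (spin-glass phase) to ~1 (paramagnetic phase).
    """
    return np.interp(p_star, p, T)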
Let us define the classification probability as a function of temperature and system size, p(T, L), which can be used as a dimensionless quantity to describe the critical behavior. From the scaling hypothesis, we expect p(T, L) to have the following behavior in the vicinity of the critical temperature T_c:

p(T, L) = \tilde{F}[L^{1/ν_ml} (T − T_c)],   (3)

where p(T, L) is averaged over disorder realizations. Note that the critical exponent ν_ml is different from the one calculated using physical quantities. Due to the limited system sizes that we have studied, finite-size scaling must be used to reliably calculate the critical parameters in the thermodynamic limit. Assuming that we are close enough to the critical temperature T_c, the scaling function \tilde{F} in Eq. (3) can be expanded as a third-order polynomial in x = L^{1/ν_ml} (T − T_c),

p(T, L) ≈ a_0 + a_1 x + a_2 x^2 + a_3 x^3.   (4)
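A compact way to implement the collapse onto the third-order polynomial of Eq. (4) is to fit a single cubic to the data of all system sizes for trial values of T_c and ν_ml; the data layout (flattened arrays of equal length) is an assumption of this sketch.

import numpy as np

def cubic_collapse(T, L, p, T_c, nu_ml):
    """Fit all (T, L, p) points to one cubic in x = L^(1/nu_ml) * (T - T_c)."""
    x = np.asarray(L) ** (1.0 / nu_ml) * (np.asarray(T) - T_c)
    coeffs = np.polyfit(x, p, 3)                      # a_3, a_2, a_1, a_0
    residual = np.sum((np.polyval(coeffs, x) - p) ** 2)
    return coeffs, residual                           # small residual = good collapse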
First, we evaluate ν_ml by noting that, to leading order in x, the derivative of p(T, L) in Eq. (4) with respect to temperature has the form

dp(T, L)/dT ≈ a_1 L^{1/ν_ml}.   (5)

Therefore, the extremum of dp(T, L)/dT scales as

max_T [dp(T, L)/dT] ∝ L^{1/ν_ml}.   (6)

A linear fit on a double-logarithmic scale then produces the value of ν_ml (the slope of the straight line), which is subsequently used to estimate T_c. To do so, we turn back to Eq. (4), where we note that, viewing L^{1/ν_ml} as the independent variable, the coefficient of its linear term is proportional to (T − T_c), which changes sign at T = T_c. Alternatively, we can vary T_c until the data for all system sizes collapse onto a common third-order polynomial curve. This works because the scaling function \tilde{F}, as a function of L^{1/ν_ml} (T − T_c), is universal. The error bars can be computed using the bootstrap method.
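The ν_ml estimate of Eqs. (5) and (6) amounts to a straight-line fit of log(max_T dp/dT) versus log L; the finite-difference derivative and the dictionary data layout below are assumptions of this sketch.

import numpy as np

def estimate_nu_ml(T_grid, p_of_T):
    """p_of_T[L]: classification probabilities on the temperature grid T_grid[L]."""
    sizes, peaks = [], []
    for L in sorted(p_of_T):
        dp_dT = np.gradient(p_of_T[L], T_grid[L])    # numerical derivative dp/dT
        sizes.append(L)
        peaks.append(np.max(np.abs(dp_dT)))          # extremum of dp/dT, Eq. (6)
    slope, _ = np.polyfit(np.log(sizes), np.log(peaks), 1)
    return 1.0 / slope                               # fitted slope = 1/nu_ml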
III. RESULTS USING DATA WITHOUT POISONING
Figure 1 shows results from the CNN trained with well-prepared (thermalized) data from a Gaussian disorder distribution, predicting the phase transition of data from a bimodal disorder distribution. Figure 1(a) shows the prediction probabilities for different linear system sizes L as a function of temperature T. The curves cross the p = 0.5 line in the region of the transition temperature for the bimodal Ising spin glass. Figures 1(b) and 1(c) show the estimates of the exponent ν_ml and the critical temperature T_c, respectively, using the methods developed in Sec. II C. The critical temperature T_c = 1.122(6) is in good agreement with previous estimates (see, for example, Ref. [42]). Finally, in Fig. 1(d), the data points are plotted as a function of the reduced variable x = L^{1/ν_ml} (T − T_c) using the estimated values of the critical parameters. The universality of the scaling curve underlines the accuracy of the estimates.

IV. RESULTS USING POISONED TRAINING SETS
Although we have shown that the prediction from the convolutional neural network can be precise, we still need to test how poisoned data sets impact the final prediction. First, we randomly mix the classification labels of the training samples with a probability of 1%, i.e., with a training set of 100 samples, this means only one mislabeled sample on average. Then we train the network and use the same samples in the prediction stage. In contrast to Fig. 1, Fig. 2 shows no clear sign of a phase transition. This means that mislabeling even a very small portion of the training data can strongly affect the outcome. Given the hierarchical structure of CNNs, errors can easily be amplified during propagation [46,47], which is a possible explanation of the observed behavior. Finally, we test the effects of poorly-prepared training data: in this case, the training data are not properly thermalized. Figure 3 shows the prediction results using data with only 50% of the Monte Carlo sweeps needed for thermalization of the Gaussian training samples. Although 50% might seem extreme at first sight, it is important to emphasize that thermalization times (as well as time-to-solution) are typically distributed according to fat-tailed distributions [48]. In general, users perform at least a factor of 2 of additional thermalization to ensure that most instances are in thermal equilibrium. As in the case where the labels were mixed, a transition cannot be clearly identified. This is a strong indication that the training data need to be carefully prepared.
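For concreteness, the label-mixing step described at the beginning of this section can be sketched as follows; interpreting "mixing with probability 1%" as an independent flip of each binary training label is an assumption of this illustration.

import numpy as np

def poison_labels(labels, flip_prob=0.01, seed=0):
    """Flip each binary label independently with probability flip_prob."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    flip = rng.random(labels.size) < flip_prob
    return np.where(flip, 1 - labels, labels)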
We have also studied the effects of poorly-thermalized prediction data paired with well-thermalized training data (not shown). In this case, the impacts on the prediction probabilities are small but not negligible.

V. DISCUSSION
We have studied the effects of poisoned data sets when training CNNs to detect phase transitions in physical systems. Our results show that good training sets are a necessary requirement for good predictions. Small perturbations in the training set can lead to misleading results.
We do note, however, that we might not have selected the best parameters for the CNN. Using cross-validation or bootstrapping might allow for a better tuning of the parameters and thus improve the quality of the predictions. Furthermore, due to the large number of predictors, overfitting is possible. This, however, can be alleviated by the introduction of penalty terms. Finally, the use of other activation functions and optimizers can also impact the results. This, together with the sensitivity towards the quality of the training data that we find in this work, suggests that machine learning techniques should be used with caution in physics applications. Garbage in, garbage out . . .