“Noisy beets”: impact of phenotyping errors on genomic predictions for binary traits in Beta vulgaris

Background Noise (errors) in scientific data is endemic and may have a detrimental effect on statistical analyses and experimental results. The effects of noisy data have been assessed in genome-wide association studies for case-control experiments in human medicine. Little is known, however, on the impact of noisy data on genomic predictions, a widely used statistical application in plant and animal breeding. Results In this study, the sensitivity to noise in the data of five classification methods (K-nearest neighbours—KNN, random forest—RF, ridge logistic regression—LR, and support vector machines with linear or radial basis function kernels) was investigated. A sugar beet population of 123 plants phenotyped for a binary trait and genotyped for 192 SNP (single nucleotide polymorphism) markers was used. Labels (0/1 phenotype) were randomly sampled to generate noise. From the base scenario without errors in the labels, increasing proportions of noisy labels—up to 50 %—were generated and introduced in the data. Conclusions Local classification methods—KNN and RF—showed higher tolerance to noisy labels compared to methods that leverage global data properties—LR and the two SVM models. In particular, KNN outperformed all other classifiers with AUC (area under the ROC curve) higher than 0.95 up to 20 % noisy labels. The runner-up method, RF, had an AUC of 0.941 with 20 % noise. Electronic supplementary material The online version of this article (doi:10.1186/s13007-016-0136-4) contains supplementary material, which is available to authorized users.


Background
Errors in the data collected for scientific experiments or-especially-for routine industrial applications are referred to as noise in the data, and may arise for several reasons (e.g. instrument errors, human errors, environmental noise, inherent randomness in the physical process, corruption of data etc. [1]). Noisy data are a long known problem in statistics (e.g. [2]). In spite of efforts to clean the data and produce good quality datasets [3], a certain amount of noise is bound to persist in the data: this needs to be dealt with, and the impact on results assessed (e.g. [4][5][6]). In binary classification problems, noisy data typically take the form of mislabeled observations or flipped labels [6]. For instance, the carrier status of recessive mutations (e.g. [7]) may inadvertently be inverted in some individuals; the same could happen in the case of resistance/susceptibility to diseases (e.g. rhizomania in sugar beet [8], where a proportion of resistant plants could be mislabeled as susceptible, and viceversa).
These are examples of possible phenotypic errors in binomial traits.
In the field of genomics, the effect of mislabeled observations on the statistical power of genome-wide association studies (GWAS) has been recognized in case-control studies in human medicine [9,10]. Buyske et al. found that a 39-fold larger sample size is required to maintain the same power of analysis in case-control studies with 5 % misclassifications. In animal genetics, a known issue are pedigree errors and their effect on the accuracy of estimated breeding values [11]: for instance, in a pig population with 20 % errors in the pedigree, the average genetic gain showed a reduction in the range 3.2-12.4 % for a number of traits. Noisy data are bound to have a detrimental effect also on whole-genome predictions, which are increasingly used for a variety of phenotypes in plant and animal breeding [12]. Additionally, the current trend in precision agriculture is bringing about novel high-throughput phenotyping systems to measure a vast amount of data in an automatic and continuous way [13], which may well harbor a certain proportion of errors. Automatically generated non-curated datasets are prone to contain errors (e.g. [14,15]). There are currently no studies that address the issue of noisy data in genomic predictions, neither in humans, nor in plants and animals.
In this paper, the impact of random noise on the accuracy of genomic predictions for binary traits is investigated. Starting from a population of sugar beet with known binomial phenotypes, increasing proportions of noisy labels were randomly generated, and the performance of different classification methods was measured.

Plant phenotypes and genotypes
In total, 123 sugar beet (B. vulgaris) plants were available, 99 with high-and 24 with low-root vigor. Plants were originated from 18 selected sugar beet lines (15 with high-and 3 with low-root vigor). Root vigor is linked to nutrient uptake and plant productivity, [16] and, in selected sugar beet populations, has been usually treated as a binary trait [17,18]. Classification of plants into high-or low-root vigor was based upon phenotypic measurement of root elongation on 11-day-old seedlings: root elongation was on average 12.9 and 2.6 mm/day in high and low root vigor plants, respectively. The clearly bimodal distribution can be seen in Biscarini et al. 2015 [18].
All plants were genotyped for 192 SNP markers with the high-throughput marker array QuantStudio 12K Flex system coupled with Taqman OpenArray technology. The average per-sample and per-marker call-rate were 0.984 and 0.969. Only one SNP had a per-marker call-rate ≤85 % and was removed. There were in total 738 missing genotypes (3.14 %). After imputation ( [19]) data were edited for minor allele frequency (MAF): 16 SNPs with MAF ≤2.5 % were discarded. After editing, 175 SNPs evenly distributed across the nine chromosomes of the sugar beet genome were left for the analysis.
The study was conducted in accordance with the existing national and international guidelines and legislation.

Classification models
Based on SNP genotypes, the genomic classification of individual sugar beet plants into the two classes (highand low-root vigor) was carried out using the following five models: K-nearest neighbors (KNN) classifier The predicted class for plant x 0 was obtained by majority vote among the K closest neighbours. The neighbourhood was determined via Euclidean distances based on SNP genotypes where x i is the vector of SNP genotypes for plant i, and f b (x i ) is the prediction (high-/low-root vigor) from the classification tree built on the b th bootstrapped data sample. More details on random forest can be found in [23].

Ridge logistic regression (LR) classifier
The probability of having high-root vigor (P(Y = 1|X) = p(x)) was modeled as a linear combination of the SNP genotypes in a logistic regression model: where p(x i ) is the P(Y = 1|X) for individual i with vector of SNP genotypes x i ; SNP j is the effect of the j th marker; z ij is the genotype of individual i at locus j (0, 1 or 2 for AA, AB and BB genotypes). Since the number of markers in the model (175 SNPs) exceeds the number of observations (123 plants), an ℓ2-norm penalization (− 1 2 m j=1 SNP 2 j ) was applied to the likelihood function to be maximised [24]. Support vector machine with linear kernel (SVM-Lin) SVM-Lin maps the vector of SNP genotypes x ∈ R into a higher dimensional feature space φ(x) ∈ H and constructs a separating hyperplane-linear in Rto classify The mapping R � → H is performed by a linear kernel function K (x i , x i ′ ) = �x i , x i ′ � which defines an inner product of pairs of SNP genotype vectors in the space H. The intercept β 0 and the coefficients α i are obtained by maximizing the margin M, whose width is controlled by the hyperparameter C, optimized through cross-validation. Support vector machine with radial basis function kernel (SVM-Rbf ) As in SVM-Lin, observations are classified by the sign of Eq. 3 and the width of margin M; only, in SVM-Rbf the kernel function K is the radial basis function: The width of the margin M is again controlled by the hyperparameter C, while the positive constant γ controls the degree of non-linearity of the decision boundary.
For a full description of SVM with either linear or radial basis function kernel, see [25].
To test the impact of phenotyping errors on genomic predictions, an increasing fraction of the observations in the training set was randomly mislabelled: from 0 % (no mislabels) up to 50 % (theoretical maximum noise), through 12 intermediate steps (1, 2.5, 5, 7.5, 10, 12.5, 15, 17.5, 20, 25, 30, 40 %). At every step, the corresponding fraction of observations was randomly sampled from the original data and the labels were flipped (0 → 1; 1 → 0). For each proportion of mislabelled observations, the five classification models were tested with a 5-fold cross-validation scheme. 123 sugar beet plants were randomly split into 5 subsets of approximately the same size. In turn, the observations in one subset were set to missing and predicted using the model trained with the remaining four subsets, until all subsets were used once as validation set. A further nested 5-fold cross-validation run was applied for hyperparameter optimization. Labels predicted in the validation set were compared to the original (true) labels to measure the accuracy of classification. Each experiment (proportion of mislabelled observations per classification model) was repeated 100 times (×5-fold cross-validation = 500 replicates). Results were averaged to explore the variability of prediction and ensure numeric stability.
High root vigor (the majority class) was by convention considered positive and low root vigor (the minority class) negative. The accuracy of genomic predictions was measured as: (1) Total error rate (TER: ratio between the number of classification errors and the total number of predictions), (2) False positive rate (FPR: ratio between wrongly predicted positives and the total predicted positives), and (3) False negative rate (FNR: ratio between false negative predictions over all negative predictions). Additionally, the area under the receiver operating characteristic (ROC) curve (AUC) was also recorded to monitor FPR and FNR over all possible classification thresholds in [0,1] [26].

Software
All models were implemented using the Weka machine learning suite [27]. The open source statistical environment R [28] was used generate random noisy labels, to parse results and produce figures and tables.

Results
Error rates (TER, FPR, FNR) for the five classification models over all mislabeling proportions are reported in Table 1. In general, very low error rates were observed with no phenotyping errors in the data (base scenario). No errors overall and in both classes with KNN, LR and SVM-Lin, errors below 0.1 % with SVM-Rbf and around 1 % with RF.
The average AUC as a function of the proportion of mislabeled observations is a good indicator of the relative performance of the five classification models, and their robustness to noise in the data (Fig. 1). The performance of LR and SVM-Lin decreased approximately linearly with increasing proportions of mislabeled observations. KNN, RF and SVM-Rbf appeared to be more robust to noise in the data: AUC was 0.95 for KNN and RF, and larger than 0.90 for SVM-Rbf, up to 20 % mislabelled observations: only after 20 % phenotype errors their performance started deteriorating rapidly. With mislabeled observations approaching 50 %, AUC from all classification models quickly converged to 0.50 (absence of any predictive value).
With increasing noise in the data, not only did the average performance decrease, but also the genomic predictions were much more variable. Figure 2 shows the boxplots of the 500 (5-fold cross-validation, repeated 100 times) true positive (TPR = 1 − FNR) and true negative The low-to-high root vigor ratio was 0.195 in the original data. Mislabeled observations were then generated randomly, and this had an effect on the class ratio, which went up to 0.520 with 50 % noise. When increasing proportions of noise were introduced, data got progressively more balanced. The frequency of the minority class for each proportion of noisy labels is reported in Table 1.

Discussion
Classifying sugar beet plants into high-and low-root vigor using SNP genotypes was already shown to be very accurate [17,18]. This provides an excellent starting point, and ensures that observed classification errors are due to noise in the data and the chosen classification model, and not to intrinsic characteristic of the data that could privilege some method over the others.
In general, when noise increases, the rate of misclassifications also increases, together with the variability of genomic predictions, and the two classes gets progressively more balanced (which consequently casues TPR and TNR to get more similar). However, while the classification accuracy of LR and SVM-Lin decreased linearly with the rate of phenotyping errors, KNN, RF and SVM-Rbf were more robust to noise and showed a similar pattern in their AUC curve.
KNN and RF are semi-parametric statistical methods which are inherently "local" in their behaviour, and therefore tend to be robust to outliers in the data. Neighbourhoods (in KNN) and branches (in RF) use subsets of the data and rely on the prevalent labels in the subset to classify observations. It is unlikely that all-or most-mislabelled observations happen to be in one neighbourhood or branch. Therefore KNN and RF would give good performance up to the point when the subset is dominated by misalbeled observation. When the fraction of mislabeled observations is 20 % or higher, the amount of noise is such that probabilities revert, and it gets unlikely to have local subsets without-or with few-mislabelled observations, and also local methods begin to fail [29]. In SVM-Rbf, training observations which are far-in terms of Euclidean distance-from a given test observations    2 will in fact be very small [30]). This implies that the SVM-Rbf has a very local behavior, in the sense that only nearby training observations have an effect on the class label of a test observation, similarly to what happens with KNN and RF. This helps explain the similar performance of these three classification methods with increasing noise in the data.
On the other hand, LR and SVM-Lin work very well in the base scenario, when there are no mislabels. This is because in this classification problem the decision boundary is linear, and the two classes are linearly separable (see also phenotypic distribution in the Supplementary Figure SF1 in [18]). With noisy labels, though, LR and SVM-Lin tend to degrade faster than local methods because they build on general properties of the data.
Local classification methods proved to be robust to noise up to 20 % mislabeled observations in the dataset. At this proportion of errors, the average hyperparameters had the following values: for KNN, K = 4.49 (and no weight was used in most of the cases-40 %); for RF, B = 37.7; for SVM-Rbf, γ = 0.0921. These hyperparameters control the bias/variance trade-off and their optimization is much dependent on the specific training datasets (e.g. size of the data, number of parameters relative to observations). Therefore, the values of the hyperparameters estimated here are not directly applicable to other datasets, but can provide a guide for the space to be explored in similar problems.
Biscarini et al. [18] previously showed that it was possible to reduce the set of markers down to as few as 30 SNP, without losing accuracy of classification. The parsimonious classifier thus developed was here tested with noisy labels. Based on the proportion of variance explained, two subsets with the 50 (SNP50) and 30 (SNP30) most informative SNP loci were extracted and used to classify high-and low-root vigor sugar beets. The two best performing methods were applied: KNN and RF. Figure 3 shows the AUC for increasing proportions of noise in the data when using all 175 SNP or subsets with, respectively, 50 and 30 SNP. The accuracy of classification is practically unaffected by the number of SNP included in the model. The variability of predictions was also little affected: with fewer SNP predictions were only slightly less reliable (e.g. KNN for 5 and 7.5 % noise, see Additional file 1: Figure  ) . These results indicate that informative SNPs appear to be more relevant than sheer SNP density for the accuracy of genomic predictions (e.g. [31]). Robustness to noise is an aspect of genomic predictions which is currently overlooked, but may be desirable. To extract useful information from data, a classifier that is robust to noisy labels is needed to produce meaningful results even in the presence of noise. There may be interest in methods robust to noise. Manual phenotyping is known to may be prone to errors (e.g. in human medicine [32,33]). Novel high-throughput phenotyping platforms [34][35][36], by which very large amounts of data are automatically generated, may alleviate the problem, at least partially. However, automatically generated data are not double-checked for errors, and are therefore susceptible to contain a residual amount of phenotyping errors. This highlights on one hand the importance of accurate phenotyping for genomic predictions [37,38], on the other the need for prediction methods able to deal with noisy data.
Genomic classification for binary traits is highly relevant in plant breeding (e.g. resistance/susceptibility to diseases [39], which is often controlled by multiple loci e.g. [40]). In sugar beet, besides root vigor, other binomial characteristics of plants are important: for instance bolting tendency (i.e. premature flowering, negatively related to sugar yield [41]), for which a polygenic nature is increasingly evident [42], and genome-enabled predictions promise therefore to be a valuable technique for breeding.

Conclusions
Noise (errors) is pervasive in scientific data, potentially also in the field of genomics applied to plant breeding. A specific type of errors are misalbeled observations (wrongly assigned labels, flipped labels), which are relevant in the analysis of binary traits. The impact of noisy labels on the accuracy of genome-enabled predictions had not been investigated so far; this paper presented a first attempt at understanding what happens when binary phenotypes are incorrect, and how different classification methods respond to increasing proportions of noisy labels in the data. The results of this study indicate that local classification methods seem to be better suited to cope with noisy labels, with KNN outperforming all other classifiers. Overall, genomic predictions for binomial traits seem to be robust to small percentages of phenotyping errors, and the high variability between methods points at the possibility of selecting the best classifier for each problem, depending on the amount of noise and the nature of the decision boundary.

Availability of supporting data
SNP genotypes and high/low-root vigor status of the 123 sugar beet samples used in this study are currently not hosted in any open access repository, but are available upon request to the authors. Authors' contributions FB and NN and SM conceived the study and performed all statistical analyses. FB and NN wrote most of the paper. CB and PS contributed data for the analysis and information and insights on the binary trait used for illustration. All authors read and approved the final manuscript.

Additional file
Additional file 1: Figure S1. TPR/TNR variability with all SNP and with subsets of 30 or 50 SNP Distribution of TPR (red) and TNR (blue) in the validation set using KNN and RF with all 175 SNP and with subsets of 50 and 30 SNP. TPR and TNR as a function of mislabeled observations, from a 5-fold cross validation repeated 100 times. Results are presented per method.