An Approach to Identifying and Quantifying Bias in Biomedical Data

Data biases are a known impediment to the development of trustworthy machine learning models and their application to many biomedical problems. When biased data is suspected, the assumption that the labeled data is representative of the population must be relaxed and methods that exploit a typically representative unlabeled data must be developed. To mitigate the adverse effects of unrepresentative data, we consider a binary semi-supervised setting and focus on identifying whether the labeled data is biased and to what extent. We assume that the class-conditional distributions were generated by a family of component distributions represented at different proportions in labeled and unlabeled data. We also assume that the training data can be transformed to and subsequently modeled by a nested mixture of multivariate Gaussian distributions. We then develop a multi-sample expectation-maximization algorithm that learns all individual and shared parameters of the model from the combined data. Using these parameters, we develop a statistical test for the presence of the general form of bias in labeled data and estimate the level of this bias by computing the distance between corresponding class-conditional distributions in labeled and unlabeled data. We first study the new methods on synthetic data to understand their behavior and then apply them to real-world biomedical data to provide evidence that the bias estimation procedure is both possible and effective.


Introduction
The development and application of machine learning methods have become commonplace in biomedical sciences and have the potential to transform clinical care. 1,2 Many of those predictive modeling approaches take place in a binary semi-supervised setting; that is, where the prediction outcome is dichotomized and the available data for training and evaluation contains samples of labeled and unlabeled examples. One such scenario is the prediction of the effect of genomic variants as pathogenic or benign, where labeled data contains pathogenic (positive) and benign (negative) variants from databases such as ClinVar 3 and the unlabeled data is often a large reference set of observed variants such as gnomAD. 4 A traditional approach in semi-supervised learning is to assume that the labeled data is representative of unlabeled data, thus requiring little sophistication during model development, model selection, and performance evaluation. However, a distinguishing feature of real biomedical data is that the labeled examples may not be representative of the unlabeled data; that is, the labeled data may be biased. 5 Data biases can have adverse effects on the ability of models to be optimized for the unlabeled data at hand and can also lead to poor estimation of a classifier's performance on a reference distribution. 6 More generally, biased data presents an obstacle to the development of trustworthy methods that are necessary for the societal acceptance of machine learning-based predictive technologies. 7,8 Learning under sample selection bias is a well-known problem. 9 Early approaches relaxed the assumption of fully representative data by assuming the same class-conditional distributions in labeled and unlabeled data, thus reducing the problem of posterior estimation to estimation of class priors in unlabeled data. 10,11 Other approaches consider situations where at least one class-conditional distribution from which the labeled data is generated is representative of its unlabeled counterpart. [12][13][14][15] While such methods have advanced the treatment of sample selection bias, we are not aware of methods that can identify whether and to what extent labeled data differs from unlabeled data for a general form of bias.
The objective of this work is to develop a statistical test for identifying biased labeled data while simultaneously quantifying the level of bias. We assume that the real-world data can be transformed and subsequently modeled using nested mixtures of multivariate Gaussian distributions; that is, with both positive and negative samples being Gaussian mixtures themselves. We then model these class-conditional distributions in both labeled and unlabeled data by the shared underlying component distributions, but permit the proportions at which the data is sampled from those component distributions to differ between labeled and unlabeled data. We finally develop an expectation-maximization (EM) algorithm that learns both individual and shared parameters from the combined data which allows us to identify and quantify bias. Our experiments on synthetic and real-world data demonstrate the ability of this procedure to detect bias and provide useful information to data scientists in their workflows.

Problem Formulation
We consider the binary classification problem where input features x ∈ ℝ D are used to predict class label y ∈ = {−, +}, where + and − represent the positive and negative class, respectively. Let p(x, y) be the unknown joint distribution that governs how x appears in nature or in a target population of interest and its relationship with y. We refer to p(x, y) as the unbiased distribution, where we expect a classifier to perform optimally. Let f + (x) = p(x| y = +) and f − (x) = p(x|y = −) denote the positive and negative class-conditional distributions, respectively. Let f(x) = p(x) denote the marginal distribution over x and α = p(y = +) be the probability that a random point from p(x, y) is positive, the class prior for the positive class. It can be shown that f is a mixture distribution with components f + and f − and mixing proportions α and 1 − α, respectively; i.e., f(x) = αf + (x) + (1 − α)f − (x) . (1) Let L + and L − represent sets of positive and negative labeled examples, respectively and U represent a set of unlabeled examples, available for training. Though we observe examples drawn randomly from f(x) in U, unlike the standard classification setting, we might not observe labeled examples drawn randomly from f + (x) and f − (x). Instead L + and L − are drawn from potentially biased class-conditional distributions f + ′ (x) and f − ′ (x), respectively ( Fig. 1). We use the term bias here in a purely statistical sense; the labeled positives and negatives in the observed data are systemically different from those in the unlabeled data such that they cannot be interpreted to be drawn i.i.d. from the same distribution. In this work, we are interested in detecting and quantifying the extent to which the examples in L + and L − differ from the positives and negatives in U, without the knowledge of the class labels in U.

Assumptions
If f + ′ (x) and f − ′ (x) are arbitrarily different from f + (x) and f − (x), respectively, detecting and quantifying the bias is an intractable problem. Fortunately, for most practical settings the biased and unbiased distributions are related. In this work, we employ a (G)aussian (c)omponent-based "(m)ixing (b)ias" assumption (MB-GC), 16 relating the biased and unbiased distributions. Formally, we assume both f + (x) and f + ′ (x) can be expressed as mixtures with the same K + shared Gaussian component distributions, but with differing mixing proportions. f − (x) and f − ′ (x) are assumed to be related in the same manner with K − shared Gaussian components. Mathematically, where * is a placeholder for + or −; * = {1, 2, …, K*}; w * = w k * k ∈ K * and v * = v k * k ∈ K * are probability vectors; i.e., w k * , v k * ≥ 0, ∑ j ∈ K * w j * = 1 and ∑ j ∈ K * v j * = 1; and ϕ k * (x) = ϕ x; μ k * , Σ k * is the D-dimensional Gaussian density function with mean μ k * and covariance Σ k * . We use the shorthand μ * = μ k * k ∈ K * and Σ * = Σ k * k ∈ K * to group the parameters.
It is important to mention that a parametric approximation of the distributions becomes a universal nonparametric approximator as K + , K − → ∞. 17 However, picking a large number of components may lead to a complex model prone to overfitting and identifiability issues. We therefore restrict ourselves to a relatively small number of components, up to eight, in each class-conditional representation, as in the parametric paradigm.
Since Gaussian mixture models are effective up to a moderate number dimensions, for high-dimensional data, we employ the MB-GC assumption after dimensionality reduction.
Conceptually, we interpret the input feature x ∈ ℝ D as a low-dimensional representation of D r -dimensional raw features (D r > D) in such cases. It is conceivable that neither the raw features nor the dimensionality-reduced features appear exactly as Gaussian mixtures, especially with a small number of components. In spite of this limitation, we argue that the modern representation learning approaches 18,19 can be used to learn embeddings that do satisfy that property, potentially making our assumptions and methods even more effective.

Quantifying Bias
Although various distance measures can be used, 20 we quantify the bias between f + and f + ′ as the area under the ROC curve (AUC) of an optimal binary classifer, or a score function s: ℝ D ℝ, between them. Based on the probabilistic interpretation of AUC, 21 it is the probability that a randomly drawn example from f + achieves a higher score than a randomly drawn example from f + ′ , as per an optimal score function. Mathematically, for being the family of all real-valued score functions defined on ℝ D , where, correcting for ties, AUC s f + , f + ′ = p s X f + > s X f + ′ + 1 2 p s X f + = s X f + ′ ; X f + and X f + ′ are random variables distributed according to f + and f + ′ , respectively. Note that AUC is symmetric; i.e., AUC f + , f + ′ = AUC f + ′ , f + . It ranges from 0.5 to 1, with a higher value indicating a larger difference between the two distributions and consequently a larger bias. Typically, values between 0.5 and 0.6 are considered to be small enough that the distributions can be interpreted to be practically indistinguishable. A value of 1 corresponds to a perfect classifier; that is, a situation when the supports between f + ′ and f + are distinct.
Thus, in this work, a value of 0.5 indicates no bias and a value of 1 indicates maximum bias ( Fig. 2).
If samples from f + and f + ′ were available, AUC(f + , f + ′ ) could be estimated by first training a classifier to separate the samples, and then computing AUC in the standard manner as the area under the ROC curve. Though a sample from f + is not readily available, such a sample is procured using the approach presented in Methods. The bias between f − and f − ′ can be quantified as AUC(f − , f − ′ ) and estimated similarly.

Methods
In order to detect and quantify the bias, we derive an expectation-maximization (EM) algorithm from multi-sample Gaussian mixtures. Under the MB-GC assumptions each of L + , L − and U contain examples drawn i.i.d. from a Gaussian mixture. Formally, where the second equation for the distribution of U is obtained by combining MB-GC assumptions with Eq. 1 and * is a placeholder for + or −. Note that the resultant distribution is a mixture of K + + K − components. The combined data log-likelihood is given by To obtain the maximum likelihood estimates of the parameters, we derive the following update equations, under the EM framework.
where ˇ and ˆ are used to represent the current and updated parameters, respectively, during an EM iteration; α + = α and α − = 1 − α; ω k * (x; θ) is the probability that a given x ∈ U comes from ϕ k * ; similarly, ν k * (x; θ) is the probability that a given x ∈ L* comes from ϕ k * ; i.e., Starting with an initial value, as discussed in Section 3.3, the parameters in θ are iteratively updated using Eq. EM-update until convergence, when the relative change in the log-likelihood, ℒ θ ; L + , L − , U − ℒ θ; L + , L − , U /ℒ θ; L + , L − , U , is less than a small predefined threshold (δ) or until the number of iterations reaches a predefined maximum (I).

Estimating Bias
Once θ is estimated, we use the estimated value of w + to infer the distribution of the unbiased positives, f + , as per Eq. MB-GC. In order to estimate the bias in the labeled positive sample, we first subsample from U, to procure a set, L + , representing estimated f + .
To this end, we use the responsibility, r + (x; θ) = ∑ k ∈ K + ω k + (x; θ), giving the probability that a given x ∈ U is a positive. Precisely, ∀x ∈ U, if Bernoulli r + (x; θ) = 1 add x to L + , where r + (x; θ) is used as the success probability of the Bernoulli distribution. Once

Detecting Bias
We focus the subsequent presentation on bias detection in L + only; the detection of bias in L − can be approached similarly. Due to model misspecification and errors in the parameter and bias estimation, a bias higher than 0.5 is likely to be estimated, when, in fact, the data is unbiased. To mitigate this issue, we introduce a bias threshold, τ ∈ [0.5, 1], and interpret a dataset to contain bias only if its estimated bias is above τ. A higher value of τ would decrease the probability that an unbiased dataset is detected to have bias (type-1 error), e(τ).
However, it will also decrease the probability that a biased dataset is detected to have bias (power), q(τ). To achieve a low type-1 error and a high power, we determine an appropriate value of τ by controlling for type-1 error on synthetic datasets; see Synthetic Data.
Let S syn ub and S syn b be two families of unbiased and biased synthetic datasets, respectively, where each dataset is of the form (L + , L − , U) and bias is defined as per the current context. be the fraction of unbiased and biased synthetic datasets with estimated bias above τ, respectively. We define τ η = min τ e syn (τ) ≤ η as a suitable threshold for which type-1 error computed w.r.t S syn ub is η (typically, η ∈ [0, 0.1]); i.e., e syn (τ η ) = η. The power computed w.r.t. S syn b at τ η is q syn (τ η ). Using this framework, for any real-world dataset S = (L + , L − , U), we enable computing a p-value for bias detection as p-value(S) = e syn (Bias est (S)), the proportion of unbiased synthetic datasets estimated to have a bias above Bias est (S).
Note that estimates of type-1 error, power and p-value computed w.r.t. synthetic datasets are representative of their true values to the extent that they capture the diversity of the real-world datasets. In addition to explicitly diversifying the synthetic datasets to a feasible extent, we address this issue by also estimating type-1 error and power w.r.t. selected unbiased and biased real-world datasets, still using the synthetic data threshold; see Data and Results.

Implementation Details
Initialization-Parameter estimates of our algorithm are likely sensitive to the initial parameters; it is known to be the case for the standard EM algorithm (GMM) for a single Gaussian mixture sample. 22,23 Because we have access to labeled data, we leverage it for parameter initialization. However, in order to introduce more diversity to initialization across multiple restarts, we do not use parameters estimates on only labeled data as our initial parameters; e.g., by using parameters from GMM on each L*. Instead, we initialize parameters in the following steps. (1) Run GMM with K* components on L* to obtain initial estimates of v*, for * ∈ {+, −} and save the location parameter estimates u * = u k * k ∈ K * (2) Run k-means++ 24 on unlabeled data U with K + + K − centers. Sort the centers based on the minimum distance to any location in u + . Pick the top K + centers to initialize μ + and the remaining centers as μ − . (3) Compute the distance from unlabeled points x ∈ U to each of the K + + K − centers and assign them to the closest one. This gives an assignment for all points to a cluster which has already been assigned as positive or negative. (4) Use the assignments to compute α = ∑ k ∈ K + A k + |U| , w k * = A k * ∑ k ∈ K * A k * and indicate points assigned to the k-th positive (negative) cluster.
Model Selection-Parameter estimation with EM algorithms when the number of components is unknown is not trivial and many methods exist for model selection. 25,26 We employ the one-fold cross-validation-based information criterion (CVIC) 25 for model selection by running our EM optimization for various values of K + , K − and selecting the model that achieves the highest log-likelihood on a validation set.
Hyper-parameters-We assume K ≡ K + = K − for convenience in experimentation. We use the maximum number of iterations I = 2000 and the convergence threshold δ = 10 −8 for termination. We run the estimation on each dataset 20 times with different random seeds.

Synthetic Data
To find appropriate bias thresholds and evaluate our method, we generate synthetic Gaussian mixture datasets, following MB-GC assumptions, from known parameters. This allows us to control bias directly and evaluate performance for different levels of bias in the dataset.
Here f + and f − are both K-component Gaussian mixtures. Their parameters are determined by a given AUC(f + , f − ) range (e.g., [0.65, 0.7]) and mutual irreducibility parameters, support (σ = 0.01) and pairwise responsibility threshold (ρ = 0.9), governing the overlap between each pair of components. Let ϕ i and ϕ j be two of the 2K components and let Z i and Z j be samples of 1000 examples each, drawn from ϕ i and ϕ j , respectively. If more than σ fraction of points in Z i have ϕ i (·) ≥ ρ(ϕ i (·) + ϕ j (·)) and, similarly, more than σ fraction of points in Z j have ϕ j (·) ≥ ρ(ϕ i (·) + ϕ j (·)), then ϕ i and ϕ j are considered to be approximately mutually irreducible. 27

Biomedical Data
We selected 8 biomedical datasets from the the UCI Machine Learning Repository 28 to apply our methods. The following datasets were used, with a note that for each we give Datasets were constructed by assigning one class as positive and the remaining as negative for multi-class data or setting a threshold for regression data. For each problem, 100 unbiased datasets were generated by selecting a subset of labeled points uniformly. We generate 250 biased datasets for each biological dataset through Markov sampling. First a point x i is selected uniformly at random from the positive class. The same point is resampled with some probability p stay and a new point x j is selected with probability 1 − p stay .
The transition probability Pr(x j |x i ) is proportional to the inverse of the squared Euclidean distance between points ∥x i − x j ∥ 2 . Since the true bias cannot be measured directly, we use the probability of resampling p stay as a proxy for bias. Higher values of p stay correspond to higher bias in labeled data since the feature space will be less uniformly sampled (Fig. 3). In each case, 20% of points are held out as a validation set used for model selection. We reduce the dimensionality with PCA for datasets with more than 8 features.

Experiments Empirical Null Distribution and Bias Threshold
We use synthetic Gaussian mixture datasets to determine the bias threshold for a range of dimensions D ∈ {1, 2, 8, 16} and number of components K ∈ {2, 4, 8}. We consider bias in positive class, but the method for estimating bias in the negative class or both would follow the same process. We run the EM optimization on each unbiased dataset to estimate all unknown parameters, θ. We use the estimated parameters θ to compute the estimated bias for the positive class, AUC(f + , f + ′ ), where ⋅ indicates the parameters estimated by the optimization procedure and the distributions parameterized by them. The true bias AUC(f + , f + ′ ) for these datasets is exactly 0.5 since the distributions are identical (no bias), but because there is error in the estimation θ , AUC(f + , f + ′ ) ≥ 0.5. For each setting of dimension D and number of components K used to generate the datasets, we determine τ η (D, K) for η ∈ {0.05, 0.10} of datasets with AUC(f + , f + ′ ) ≥ τ η (D, K).

Model Selection
To apply the appropriate bias threshold τ η (D, K) to any data it is important to know the number of components that best represent the data and use the threshold found for that setting (dimension is known). However, the true or best value of K is not generally known for any dataset. We evaluate the effect of unknown K for finding the threshold τ η by running the optimization on unbiased datasets for K ∈ {2, 4, 8} on all datasets, regardless of which value was used to generate the data. For each dataset, we compute the estimated parameters log-likelihood on a validation set and choose the model that maximizes the value. The validation set is generated with the same parameters as the original dataset.

Bias Quantification and Detection
To evaluate our method in detecting and estimating bias, we run our EM optimization algorithm on synthetic and biological datasets with varying amount of bias and report the estimated bias. For synthetic data where the true bias is known, we evaluate power for each level of type-1 error, η ∈ {0.05, 0.10}. Ground truth biased datasets B are those where the true bias AUC(f + , f + ′ ) > 0.5, for K number of components. Predicted biased datasets B are those where AUC f + , f + ′ ≥ τ η (D, K) for K selected through model selection. Power is estimated as q(τ) = |B| |B| . Figure 4 illustrates the thresholds found for each dimension and number of components. When the number of components, K, is smaller, parameter estimation more reliably estimates the bias lower. As the number of dimensions and number of components increases, so does the complexity of the optimization problem and the estimated value of bias. These results suggest the utility of finding dimension-and component-specific thresholds, and the empirical null distribution for ascertaining bias.

Results and Discussion
Results on quantification of bias on synthetic (Fig. 5) and biomedical (Fig. 6) data show increasing estimated bias as true bias increases. Note that for biomedical datasets the true bias is unknown and p stay is not a direct measurement of bias; different data sets have different levels of compactness in their feature space. Since the sampling probability is proportional to the inverse distance between points, the bias is also dependent on the density of points. Bias will differ across datasets for the same value of p stay and estimated bias cannot be directly compared between datasets. However, for each datasets bias should increase as the sampling less uniform, i.e. p stay increases. In synthetic data, we see excellent power ( Fig. 7) for the type-1 error of 0.05 across all levels of bias, dimensionality D and the number of components (K) per class-conditional distribution. We also see for high-bias datasets (AUC(f + , f + ′ ) ≥ 0.9) on datasets with two components, that some datasets have a low estimated bias. Our investigation showed that to generate datasets with high bias and few components, the mixing proportions w k + or v k + must be very skewed, making the optimization difficult, sometimes unrealistically so. For one dimension, the average minimum value of the smallest w k + for datasets with AUC(f + , f + ′ ) ≥ 0.9 is 0.01, 0.07 for 0.8 ≤ AUC(f + , f + ′ ) < 0.9, and 0.19-0. 23 for AUC(f + , f + ′ ) < 0.8. Figure 7 shows the power for bias detection on synthetic datasets for type-1 error η ∈ {0.05, 0.10}. For each setting we see generally higher power in bias detection as the true bias increases. For higher type-1 error, the detection achieves a higher power. Again there is a drop in performance for K = 2 in high-bias datasets due to the challenging nature of these datasets.
For real datasets we also show that our estimation of α and negative bias is not generally affected by increasingly biased samples of the positive class (Fig. 6, middle and bottom rows, respectively). Our EM algorithm is still able to detect that the set of unbiased labels from the negative class are truly unbiased (a low value of AUC(f − , f − ′ )). The estimation for bias for negative class in UCI results is consistently better than the estimation of bias for unbiased positive samples because α is always less than or equal to 0.5. Higher estimated bias in negatives seems to be correlated with overestimation of the class prior α, particularly exemplified in the parkinsons dataset.

Conclusion
Despite a broad awareness that biased data may adversely impact the deployment of machine learning tools in biomedicine, there is a surprising dearth of methods built to ascertain the existence and the level of bias in available data. We set out to address this deficiency by developing and extensively evaluating a bias estimation method based on reasonable assumptions. We used synthetic and real-world biomedical data to show that technologies for bias detection and ultimately correction can be realistically implemented in future data processing pipelines.   Bias (AUC(f + , f + ′ )) thresholds found from parameter estimation on unbiased data sets.   Pac Symp Biocomput. Author manuscript; available in PMC 2023 January 01.