Trustworthy in silico cell labeling via ensemble-based image translation

Artificial intelligence (AI) image translation has been a valuable tool for processing image data in biological and medical research. To apply such a tool in mission-critical applications, including drug screening, toxicity study, and clinical diagnostics, it is essential to ensure that the AI prediction is trustworthy. Here, we demonstrate that an ensemble learning method can quantify the uncertainty of AI image translation. We tested the uncertainty evaluation using experimentally acquired images of mesenchymal stromal cells. We find that the ensemble method reports a prediction standard deviation that correlates with the prediction error, estimating the prediction uncertainty. We show that this uncertainty is in agreement with the prediction error and Pearson correlation coefficient. We further show that the ensemble method can detect out-of-distribution input images by reporting increased uncertainty. Altogether, these results suggest that the ensemble-estimated uncertainty can be a useful indicator for identifying erroneous AI image translations.


INTRODUCTION
Deep learning has demonstrated remarkable promise in enhancing and interpreting biological and medical data that are overwhelmingly complex for traditional parametric approaches (1). Specifically, artificial intelligence (AI) image translation has proved capable of labeling and enhancing image data, in which the deep neural networks (DNNs) decipher the high-dimensional and nonlinear relationship between the target and input images. Such a powerful approach has enabled various in silico analyses of microscopy data, including identification of different cell types in co-culture samples (2), noninvasive labeling of organelles (3), virtual histological staining (4), image resolution enhancement (5), AI-aided medical diagnosis (6), label-free cell sorting (7), and in situ phenotyping of live cells (8,9). These innovative experimental capabilities have paved avenues to addressing interesting open questions in life sciences and biophysical research.
Despite such innovation (2-8,10-19), AI image translation has not been routinely integrated in biology experiments, clinical use, or pharmaceutical applications. A major reason for this lack of traction is the unknown accuracy of AI predictions when the ground truth is absent. DNN models are black-box functions with multiple layers of nonlinearities (20), which make the evaluation of prediction confidence challenging (21-23). In the biomedical field, it is particularly important to ensure the accuracy of AI predictions, as any error could lead to catastrophic misinterpretation such as disease misdiagnosis or false results of drug toxicity or efficacy (24). The inability to know when to trust and when not to trust the AI prediction fundamentally hinders these mission-critical applications of DNNs (25,26).
In this work, we adapted the ensemble method to quantify the uncertainty of AI image-to-image translation. To do this, we developed a workflow that converts the standard deviation (Std) of AI predictions into uncertainty. Using actual microscopy data of mesenchymal stromal cells (MSCs), we showed that our ensemble method can effectively capture the AI translation accuracy across multiple molecular markers. More importantly, our method can detect mispredictions that arise from sample mishandling, imaging condition variation, and subtle cellular phenotypic changes, suggesting that the ensemble-based uncertainty evaluation can detect unexpected input (called out-of-distribution (OOD) data). In addition, we developed a FastEnsemble training framework that builds upon the recent findings of the local minimum connectivity in DNNs (44). This training strategy allows us to generate multiple independent ensemble models with a small computational overhead. Experimental results demonstrate that this training framework significantly accelerates training without compromising the quality of uncertainty estimation.

Microscopy image acquisition
Throughout this work, we used experimental microscopy data for testing the uncertainty evaluation method. By imposing tractable perturbations to these microscopy images, we studied how our uncertainty assessment identified AI misprediction.
This work mainly tested microscopy images of MSCs and prostate cancer cells. Specifically, human bone marrow-derived MSCs (ATCC, PCS-500-012) were cultured according to the manufacturer's instruction and standard protocols (45,46). In brief, once the MSCs were thawed, they were seeded into tissue culture flasks at a density of 5000 cells/cm² with the culture medium comprising DMEM (Gibco, 1 g/mL glucose, 500 mL), 10% fetal bovine serum (Gibco), and 1% penicillin/streptomycin (Gibco). The MSC culture medium was replaced every 48 h. Similarly to the MSCs, we cultured androgen-sensitive human prostate adenocarcinoma cells (Lymph Node Carcinoma of the Prostate (LNCaP)). LNCaPs were seeded at a density of 10,000 cells/cm² and cultured according to ATCC CRL-1740 protocols with medium comprising Gibco RPMI 1640, 10% fetal bovine serum (Gibco), and 1% penicillin/streptomycin (Gibco).
For immunofluorescence, the cells were first washed with PBS+/+; 4% paraformaldehyde (Thermo Fisher Scientific, 28908) in 1× PBS+/+ (Gibco) was subsequently used as the fixative. After ~10 min of incubation, the samples were washed with PBS+/+. To immunostain the sample, the cells were first blocked using a solution consisting of 2% donkey serum (Sigma-Aldrich, D9663-10ML) and 0.5% Triton X-100 (Sigma-Aldrich, T8787-50ML) for 30 min. Each sample was then washed with PBS+/+ twice, and then incubated with the primary staining solution (0.5% BSA, 0.25% Triton X-100, and the primary antibody). The slides were left in the staining solution for 30 min and then washed twice with 1× PBS. After washing, the secondary staining solution (including drops of NucBlue and the secondary antibody (MSC) or BODIPY (LNCaP)) was added for 30 min. Last, the samples were washed twice with PBS+/+ and added to 0.1% Tween 20 (Sigma-Aldrich, P9416-50ML) for long-term storage at 4°C. In addition, a fully prepared and pre-stained mouse kidney section slide was purchased from Invitrogen (F24630).

AI model training image datasets
All AI training datasets consist of paired phase-contrast and fluorescence images of either MSCs, LNCaPs, or kidney tissue section. The image data tested in this work can be mainly categorized into three groups: 1) baseline images that are the raw microscopy data, 2) perturbed images with artifacts that were introduced in a tractable fashion, and 3) OOD images with gradual distribution shifts.
The baseline training images contain pairs of phase-contrast and the corresponding immunofluorescence (IF) images of MSCs. The cells were immunofluorescently stained for a series of surface markers (i.e., CD105, CD29, CD44, CD90, and STRO-1) that are routinely used to define MSC characteristics (47). After image acquisition, quality control was performed where blurry or artifact-containing images were excluded.
To understand if our uncertainty evaluation can be applied in practical cell imaging tasks, we perturbed the baseline training images using Fiji ImageJ (48). We studied the following image perturbations: image impurities, overexposure, nonuniform illumination, and zoomed-in images, which mimicked the effect of using different or compromised microscope settings. Additionally, we investigated the effect of cell type mismatch. These OOD datasets were then used for testing the uncertainty assessment. The training and testing set parameters for each dataset are summarized in Table 1.
To further analyze the performance of our uncertainty assessment, we obtained two sets of training images that have gradual distribution shifts. The first dataset includes images of LNCaP cells that are treated with enzalutamide (Enza) (Selleck Chemicals S1250) for 48 h. Using the untreated sample (control), we trained an AI model to predict the fluorescence images of BODIPY (lipid droplets in LNCaP cells) from the phase-contrast images. This model was then applied to the dataset of treated LNCaP cells to evaluate how the drug-altered cell phenotype affects AI predictions. The second dataset comprises images of LNCaP cells with four different cell densities. In these images, the cells exhibit different morphological phenotypes as a result of proliferation. The images were divided into four subsets for evaluation purposes: 20% confluency, 50% confluency, 80% confluency, and 100% confluency. These datasets allow us to systematically study whether our approach is capable of flagging OOD data that have slight distribution shifts.

Comparison of uncertainty evaluation models
To understand how the performance of ensemble-based uncertainty evaluation compares to that of other existing tools, we conducted a systematic comparison of six common methods, three of which are ensemble based: naive ensemble, BatchEnsemble (39), Snapshot ensemble (38), MC-Dropout (35), SVI (36,37,49), and SGLD (33,34). The central features of the tested methods and corresponding parameters used in the comparative study are discussed below.

Naive ensemble
We trained six models independently with different random seeds. The prediction results were generated by a simple average. The total computational budget is 6B, where B is the budget to train one model from scratch.

FastEnsemble
We first trained a standard checkpoint with budget B, then used $\frac{k_2 + k_3}{k_1} B \times 5$ to obtain the remaining five models. In total, it cost

BatchEnsemble
BatchEnsemble is a method that reduces the computational and memory costs of performing ensemble calculations by optimizing the ensemble weight generation mechanism (39). We replicated the BatchEnsemble code from the official repository at https://github.com/google/edward2/blob/main/edward2/tensorflow/layers/convolutional.py#L560 and extended it to support the ConvTranspose2d layer. We matched the training budget of our method by increasing the training time proportionally.

MC-Dropout
The MC-Dropout method is a framework that utilizes dropout training in DNNs as approximate Bayesian inference in deep Gaussian processes (35). We used a dropout rate of p = 0.5. The computational budget is B.

SGLD
The SGLD method estimates the prediction uncertainty by adding noise to a standard stochastic gradient optimization algorithm (33,34). We first trained the model until convergence (i.e., burn-in phase). At this stage, we did not inject Gaussian noise. During the inference time, we then trained the model for one epoch after each sampling, where the learning rate was 1000× smaller than in the training stage. No preconditioning technique was applied. We noted that, although the training budget was only B, the inference budget was much higher than that of other methods.

Snapshot ensemble
The Snapshot ensemble algorithm leverages the cyclic learning rate scheduling in stochastic gradient descent to create multiple model snapshots with a single training process (38).
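As an illustration of cyclic learning rate scheduling, the sketch below uses the cosine-annealing schedule with warm restarts that is commonly associated with Snapshot ensembles. The schedule form, the function names, and the parameter values are assumptions for illustration; the source does not state which schedule or settings were used in the comparison.

```python
import math

def snapshot_lr(iteration, total_iters, n_cycles, lr_max):
    """Cosine-annealed learning rate that restarts at the start of each cycle.

    iteration is 0-indexed; all parameter names are hypothetical.
    """
    cycle_len = math.ceil(total_iters / n_cycles)
    pos = iteration % cycle_len  # position inside the current cycle
    return 0.5 * lr_max * (math.cos(math.pi * pos / cycle_len) + 1.0)

def is_snapshot_step(iteration, total_iters, n_cycles):
    """True on the last iteration of a cycle, where a model snapshot would be saved."""
    cycle_len = math.ceil(total_iters / n_cycles)
    return (iteration + 1) % cycle_len == 0

# Example: 600 iterations split into 6 cycles yields 6 snapshots.
snapshots = [t for t in range(600) if is_snapshot_step(t, 600, 6)]
```

Each restart raises the learning rate back to its maximum, which lets the optimizer leave the current minimum before converging to the next snapshot.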
Furthermore, to fairly compare the computational time of these methods, we set the number of epochs to be the same across all methods except for naive ensemble (which is almost five times longer). Although we cannot ensure the wall clock time in each epoch to be the same (SVI and BatchEnsemble tend to be slower due to more complex model architecture), the relative difference in running time is negligible.

Receiver operating characteristic quantification of prediction accuracy
The test set was held separate from the training set for all model testing. All OOD datasets were trained using the MSC-CD105 training set and tested using a separate test set with the corresponding OOD perturbation. For drug-altered phenotype tests, the model was trained using the control condition and tested on either the control or Enza test set. To measure model accuracy on cellular distribution shifts, all LNCaP density datasets were trained on the VSparse training set and tested on the corresponding density test set.
Biophysical Reports 3, 100133, December 13, 2023
To quantify the performance of uncertainty assessment for individual methods, we used them to evaluate the impact of cell type mismatch and image impurities on prediction accuracy. To do this, we manually labeled the local regions that contain mismatched cells (LNCaP cells) or impurities using bounding boxes (red boxes in Fig. 6 a). Here, the cell type mismatch dataset was created by artificially cutting images of LNCaP cells and superimposing them to images of MSCs. Examples of false positives (i.e., the prediction was accurate but flagged by our uncertainty algorithm) and false negatives (i.e., the prediction was inaccurate but not flagged by our uncertainty algorithm) are denoted by yellow arrows in Fig. 6 a.
We then defined the pixels inside bounding boxes as positive instances $S_1$. All pixels (both inside and outside bounding boxes) were subsequently ranked by the uncertainty values in descending order. The top-k highest uncertainty pixels were then used to define $S_2$ instances. We then have $\mathrm{TP@}k = |S_1 \cap S_2|$, $\mathrm{Precision@}k = \mathrm{TP@}k / k$, and $\mathrm{Recall@}k = \mathrm{TP@}k / |S_1|$. Here, Precision@k and Recall@k report the model performance, and TP is the number of true positives.
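The definitions of TP@k, Precision@k, and Recall@k can be sketched as follows. This is a minimal illustration on a toy 2 × 2 image; the function name and the input arrays are hypothetical, and ties in uncertainty are broken by pixel order, which is an assumption.

```python
import numpy as np

def precision_recall_at_k(uncertainty, positive_mask, k):
    """Return (TP@k, Precision@k, Recall@k) for a pixel-wise uncertainty map.

    positive_mask marks pixels inside the bounding boxes (the set S_1).
    """
    u = np.ravel(uncertainty).astype(float)
    pos = np.ravel(positive_mask).astype(bool)
    # S_2: indices of the k pixels with the highest uncertainty.
    top_k = np.argsort(-u, kind="stable")[:k]
    tp = int(pos[top_k].sum())  # |S_1 intersect S_2|
    return tp, tp / k, tp / int(pos.sum())

# Toy example: two labeled pixels, one of them ranked in the top 2.
uncertainty = np.array([[0.9, 0.1],
                        [0.8, 0.2]])
mask = np.array([[1, 0],
                 [0, 1]])
tp, precision, recall = precision_recall_at_k(uncertainty, mask, k=2)
```

In this toy case the two most uncertain pixels are 0.9 (labeled) and 0.8 (unlabeled), so one of the two labeled pixels is recovered.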
Using the analyzed result, we then generated a receiver operating characteristic (ROC) curve with TP@k as the y axis and FP@k as the x axis (Fig. 6 b and c) for each method. The ROC curve is a probability curve that reports the true-positive rate against the false-positive rate. To further quantify ROC curves, we computed the area under the curve (AUC) (50), in which higher AUC values indicate better model performance in distinguishing the positive and negative classes. The results are summarized in Table S1.
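A pixel-level ROC curve and its AUC can be obtained by sweeping k over the uncertainty ranking. The sketch below assumes that TP@k and FP@k are normalized by the numbers of positive and negative pixels to give rates, and that the area is computed with the trapezoidal rule; the function name and inputs are hypothetical.

```python
import numpy as np

def roc_from_uncertainty(uncertainty, positive_mask):
    """Return (fpr, tpr, auc) by sweeping k from 0 to the number of pixels."""
    u = np.ravel(uncertainty).astype(float)
    pos = np.ravel(positive_mask).astype(bool)
    order = np.argsort(-u, kind="stable")  # descending uncertainty
    hits = pos[order]
    tp = np.concatenate([[0], np.cumsum(hits)])   # TP@k for k = 0..N
    fp = np.concatenate([[0], np.cumsum(~hits)])  # FP@k for k = 0..N
    tpr = tp / pos.sum()
    fpr = fp / (~pos).sum()
    # Trapezoidal area under the (fpr, tpr) curve.
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
    return fpr, tpr, auc

# Toy example: uncertainty separates the labeled pixels perfectly.
u = np.array([0.9, 0.8, 0.2, 0.1])
labels = np.array([1, 1, 0, 0])
fpr, tpr, auc = roc_from_uncertainty(u, labels)
```

An AUC near 1 means that labeled (mispredicted) pixels are ranked above unlabeled ones, while a value near 0.5 corresponds to a ranking no better than chance.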
We found that the ROC curves of the naive ensemble and our FastEnsemble method exhibit a very similar trend, whereas we note that the naive ensemble is ~5× slower to train. The runner-up group is the Snapshot ensemble and BatchEnsemble, which are as fast as our FastEnsemble method. Additionally, we found that traditional approximated Bayesian inference methods (i.e., SVI, SGLD, and MC-Dropout) did not perform as well as other methods on the tested benchmark. Their suboptimal performance might be due to the approximation being too coarse to make compelling Bayesian inference, suggesting that a more precise Bayesian approximation is required for image translation applications.

FastEnsemble
Despite the success of naive ensemble approaches, the naive ensemble method requires independent training for individual models, which could be time consuming for image-to-image training tasks. We propose a simple but effective FastEnsemble method to reduce the training time while maintaining the prediction accuracy. Our approach builds upon the recent findings in mode connectivity of local minimum (44), in which different local minima in the neural network training objective were found to be connected by a "low-loss valley." Therefore, it is possible to traverse from one local minimum to another through a path with small training loss. Starting from the first local minimum $w_1$, we propose an algorithm to traverse to another local minimum $w_2$ through this low-loss valley to avoid retraining the model from scratch. Specifically, assuming $w_1, \ldots, w_m$ are the current models, to get $w_{m+1}$, we initialize the model from $w_m$ and solve the training objective in Eq. 1, where the second term promotes the diversity of the solutions. The characterization of our FastEnsemble method performance can be found in Table S1. A detailed description of the training algorithm can be found in Algorithm S1.

Ensemble methods for AI prediction uncertainty
The goal of uncertainty estimation is to measure the confidence of the AI model prediction. Previously, uncertainty estimation has been mainly discussed in the context of multi-label classification problems (51-54), where the output of the model, denoted by $f(x; w)$, is a single label. Here, we discuss the implementation of ensemble-based algorithms for uncertainty evaluation in image-to-image translation tasks. We employed the standard U-Net architecture (55) for our proposed image prediction method. Here, we chose the Unet-256 configuration, with channel multiplier (number of filters in the generator = 64) and batch normalization. Dropout is disabled except in the MC-Dropout comparison model. In the model, we assume $f(\cdot; w)$ denotes the neural network parameterized by $w$. Training a neural network is equivalent to finding the parameters to fit the observed data pairs, which can be written as $\arg\min_w \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i; w), y_i)$, where $\ell(\cdot, \cdot)$ is the loss function measuring the discrepancy between the ground truth output and the model's prediction and $\{(x_i, y_i)\}_{i=1}^{n}$ are the training data. We used stochastic gradient descent as the optimizer for solving the training objective. After training, the model translates each test image $x$ into $f(x; w)$. However, when $x$ is an OOD image, the model will still output $f(x; w)$ with suboptimal quality, which is why uncertainty estimation is important. For image-to-image translation tasks, we can obtain the uncertainty of individual pixel values from the K predictions of the K models. One advantage of a spatial uncertainty estimation is that it can flag regions in predicted images that may be OOD or highly uncertain compared to the rest of the image.
Our uncertainty evaluation model trains N = 6 independent CNN models from the same set of training data and runs each model over the test set, which outputs a single-channel, pixel-averaged prediction image (Fig. 1 a). Using the difference in pixel values from the generated images, we demonstrate an image map of the Std, with brighter values indicating a higher Std (Std map of Fig. 1 a). We found that the ensemble of six independent models is sufficient for generating a robust Std map (Fig. S1). By comparing the Std map to the error map (i.e., deviance between target and mean prediction), we observed a noticeable correlation in the pixel intensity distribution, suggesting that the prediction Std may capture the actual translation error.
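The pixel-averaged prediction, the Std map, and the error map can be computed from a stack of ensemble outputs as in the minimal sketch below. The function name and the synthetic images are hypothetical, the use of the population Std (rather than the sample Std) is an assumption, and the error map is taken here as the absolute difference between the target and the mean prediction.

```python
import numpy as np

def ensemble_maps(predictions, target=None):
    """Compute per-pixel ensemble statistics.

    predictions: array of shape (K, H, W), one predicted image per model.
    target: optional ground-truth image of shape (H, W).
    """
    preds = np.asarray(predictions, dtype=float)
    mean_pred = preds.mean(axis=0)  # pixel-averaged prediction
    std_map = preds.std(axis=0)     # population Std across the K models
    error_map = None
    if target is not None:
        error_map = np.abs(np.asarray(target, dtype=float) - mean_pred)
    return mean_pred, std_map, error_map

# Synthetic stand-in for six model outputs on one 8 x 8 field of view.
rng = np.random.default_rng(0)
target = rng.random((8, 8))
preds = target[None] + 0.05 * rng.standard_normal((6, 8, 8))
mean_pred, std_map, error_map = ensemble_maps(preds, target)
```

The Std map requires no ground truth, whereas the error map is only available when the target image exists, which is what motivates calibrating one against the other.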
Next, we quantified the relationship between Std and error to calibrate the uncertainty across the testing images. We first generated a scatter plot of pixel-level intensity to illustrate this Std-error correlation, where the prediction error diverges monotonically with increasing Std as shown in Fig. 1 c. This observed divergence of the Std-error relationship indicates that a greater Std value corresponds to a higher chance of observing larger actual errors. Since the diverging trend of the Std-error relationship cannot be described by a one-to-one function, it is difficult to directly visualize the correlation between error and Std by comparing their maps. Therefore, to further analyze this correlation, we performed equal-width binning of the absolute error and plotted the binned value as a function of Std (Fig. 1 d). We then fitted a quadratic polynomial to the mean bin value. Here, we used the quadratic form to describe the Std-error relationship because of its mathematical simplicity and the mildness of the data saturation. In the cases where the data exhibit a more extended saturation plateau, alternative functions, such as exponential saturation, double exponential, and logistic function, can be considered. We also found that the data points of the top 10% of error values are very sparse and can potentially reduce the reliability of fitting. We, therefore, excluded those data points during fitting to ensure that the best fitting curve accurately captures the overall Std-error relationship. The fitted quadratic function (red line in Fig. 1 d) was subsequently utilized to determine the pixel-wise uncertainty from the Std. This calibration procedure was repeated for all molecular markers and imaging conditions throughout all experiments.
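The calibration steps described above can be sketched as follows. This is one possible reading rather than the exact procedure: it assumes that pixels are grouped into equal-width bins along the Std axis, that the mean absolute error is taken per bin, and that the top 10% of error values are removed before binning. The function name, the number of bins, and the synthetic data are hypothetical.

```python
import numpy as np

def calibrate_std_to_uncertainty(std, abs_err, n_bins=20, drop_top_frac=0.10):
    """Fit a quadratic mapping from prediction Std to expected absolute error."""
    std = np.ravel(std).astype(float)
    abs_err = np.ravel(abs_err).astype(float)
    # Exclude the sparse top fraction of error values before fitting.
    keep = abs_err <= np.quantile(abs_err, 1.0 - drop_top_frac)
    std, abs_err = std[keep], abs_err[keep]
    # Equal-width bins along the Std axis.
    edges = np.linspace(std.min(), std.max(), n_bins + 1)
    idx = np.clip(np.digitize(std, edges) - 1, 0, n_bins - 1)
    centers, means = [], []
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():  # skip empty bins
            centers.append(0.5 * (edges[b] + edges[b + 1]))
            means.append(abs_err[in_bin].mean())
    # Quadratic fit to the mean absolute error of each bin.
    return np.poly1d(np.polyfit(centers, means, deg=2))

# Synthetic data in which the error scale grows with the Std.
rng = np.random.default_rng(0)
std_map = rng.uniform(0.0, 1.0, size=(100, 100))
error_map = np.abs(rng.standard_normal((100, 100))) * std_map
to_uncertainty = calibrate_std_to_uncertainty(std_map, error_map)
uncertainty_map = to_uncertainty(std_map)  # pixel-wise calibrated uncertainty
```

Once fitted on data with ground truth, the returned polynomial can be applied to the Std map of any new prediction, which is the step that makes the uncertainty available when no target image exists.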

Ensemble-based uncertainty correlates with AI prediction inaccuracy
To understand whether our uncertainty calculation provides a robust and consistent assessment of the AI prediction accuracy, we performed the uncertainty quantification for five different MSC markers and nucleus staining. For each marker, we averaged the pixel-level uncertainty and absolute error values over individual fields of view (FOVs) and plotted these results in Fig. 2 a. We found a positive correlation (Pearson correlation coefficient ~0.83) between our calculated uncertainty and absolute error across all six markers. In addition, the dataset that was contaminated with imaging artifacts (CD105-Impurities) exhibited a similar trend. This finding supports approximating the error-calibrated Std as uncertainty.
Next, we computed the pixel-pixel Pearson correlation coefficient, $r_s$, between the target and prediction, in which a higher $r_s$ value indicates a more accurate prediction. By comparing the $r_s$ with mean uncertainty for each FOV in Fig. 2 b, we observed a negative correlation between them for all tested markers. Since the definitions of Pearson correlation and uncertainty are strictly independent, we did not anticipate a universal trend across different markers. Our observed anticorrelation between $r_s$ and uncertainty further validated our uncertainty quantification approach. Also, such a finding suggests that our uncertainty assessment is effective for all tested markers. Therefore, an advantage of our uncertainty evaluation method is that it allows us to simultaneously assess the AI prediction performance and combine different markers into one image for an integrative assessment and visualization. An example five-marker composite image is shown in Fig. 2 c.
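The FOV-level quantities compared in Fig. 2 a and b can be computed as in the sketch below, which returns the mean uncertainty, the mean absolute error, and the pixel-pixel Pearson correlation coefficient for one FOV. The function names and the dictionary keys are hypothetical.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two arrays of equal size."""
    a = np.ravel(a).astype(float)
    b = np.ravel(b).astype(float)
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def fov_summary(target, prediction, uncertainty):
    """Summarize one field of view with three scalar values."""
    target = np.asarray(target, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    return {
        "mean_uncertainty": float(np.mean(uncertainty)),
        "mean_abs_error": float(np.mean(np.abs(target - prediction))),
        "pearson_r": pearson(target, prediction),
    }

# Toy FOV: the prediction is a scaled copy of the target, so r = 1.
target = np.array([[1.0, 2.0], [3.0, 4.0]])
summary = fov_summary(target, 2.0 * target, np.full((2, 2), 0.1))
```

Collecting these summaries over many FOVs and applying `pearson` to the lists of mean uncertainty and mean absolute error would give a correlation of the kind reported in the text.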
One application of our uncertainty evaluation is to identify noncellular artifacts in the microscopy data. Such impurities may corrupt the analysis statistics and cloud interpretation of AI predictions. The impurities commonly found in microscopy include precipitated crystals in the staining buffer, air bubbles due to pipetting errors, substrate scratches, and bacterial substances. In this work, we tested precipitated crystals as a demonstration. Specifically, we first used impurity-free data (baseline training set) for training the models, and then deployed the trained DNNs for translating phase-contrast images that contain impurities into fluorescent images. As shown in Fig. 2 d, the artifact (arrow) caused noticeable mispredictions and strong signal in the error map. At the same time, the pixels that are covered by the contaminant also exhibit uncertainty values that are significantly higher than those of other areas. This result suggests that our uncertainty evaluation can accurately flag the local contaminant that should be excluded from further analysis.
Both the image translation and our uncertainty quantification methods can be applied to a wide range of microscopy applications. We demonstrated this versatility by repeating our uncertainty evaluation using images of mouse kidney tissue sections (Fig. 3 a) in which the sample was immunostained for actin and nucleus. Following our calibration workflow, we calculated the prediction Std and converted it into uncertainty. Like the MSC markers, we found that the actin signal in the section sample exhibited a clear error-Std correlation (Fig. 3 b). We also found that the converted uncertainty moderately correlated with the absolute error with a Pearson correlation coefficient ~0.57 (Fig. 3 c). This demonstration suggests that our uncertainty assessment can be implemented in virtual histological staining and other similar applications.

Evaluation of OOD data uncertainty
The assessments of AI prediction accuracy typically rely on the direct comparison between prediction and target images. In many biological experiments, however, the ground truth images are strictly inaccessible and the prediction-target comparison is practically infeasible. In this case, although the DNN may still generate visually convincing results, these AI-predicted images could potentially deviate from the target. When applied in drug screening, toxicity studies, or clinical applications, such misleading results could have severe consequences (56). The results in the previous subsection suggested that our calibrated uncertainty should be able to detect OOD data. To understand if our quantification method can obtain a reliable metric for AI prediction accuracy without access to the ground truth, we conducted systematic tests using a series of perturbed microscopy images. Specifically, we tested six cases of OOD data, in which the testing datasets exhibit different degrees of visual differences from the actual training set: 1) image overexposure, 2) nonuniform illumination, 3) magnification mismatch, 4) inconsistent cell type, and cell morphology changes arising from 5) drug treatment and 6) cell expansion. These scenarios can be mainly categorized into two groups: imaging condition variations (1-3) and sample variations (4-6).
Applying these perturbations to models trained on the CD105 dataset, we showed that our ensemble method can identify test images that are drastically different from the training data by reporting increased uncertainty values. We selected the CD105 dataset due to the high signal-to-noise ratio of the target and accurate AI predictions. We generated the perturbed images by modifying duplicated test images (i.e., phase-contrast images) using Fiji ImageJ. As shown in Fig. 4 a-f, we found that all of the introduced image perturbations successfully led to AI prediction errors. We noticed that, although these errors may be detected by trained experts, such mispredictions are subtle and can be easily overlooked. For example, the imaging overexposure caused an overall blurry prediction with faint cell boundaries (Fig. 4 a). In the case of nonuniform illumination (Fig. 4 c), the overexposed upper right corner (yellow triangle) of the predicted image shows a cloudy and nonlocalized protein distribution. When the magnification is mismatched, we observed a patchy and fragmented CD105 distribution across the cell (Fig. 4 e), despite CD105 being a glycoprotein surface marker that should be uniformly expressed throughout the cytoplasm.
Importantly, we found that our uncertainty assessment successfully identified all these subtle errors. First, the uncertainty maps showed elevated intensities for all altered images (Fig. S2), in which the uncertainty assessment correctly highlighted the upper right corner in the nonuniform illumination case. By averaging the pixel-level uncertainty value over the FOV, our method reported a significant increase in the mean uncertainty compared to the unperturbed data for all tested OOD (bar charts in Fig. 4 b, d, and f). This finding suggests that the uncertainty mean can act as an indicator, labeling predicted images that should be further scrutinized or even excluded. We further studied whether our uncertainty calculation can detect prediction errors due to cell type mismatch. To do so, we applied an AI model that was trained using the MSC data to the images of prostate cancer cells (LNCaP), which should not express CD105. As anticipated, the AI prediction contains substantial errors, which were mostly detected by our uncertainty assessment (Fig. 4 g and h).
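One way in which the FOV-averaged uncertainty could serve as an indicator is sketched below. The decision rule (flagging a FOV whose mean uncertainty exceeds the baseline mean by a chosen number of baseline standard deviations) is purely an assumption for illustration and is not taken from this work; the function name, the `n_sigma` parameter, and the numbers are hypothetical.

```python
import numpy as np

def flag_ood(fov_mean_uncertainty, baseline_means, n_sigma=3.0):
    """Flag FOVs whose mean uncertainty is unusually high.

    baseline_means: mean uncertainty of unperturbed (in-distribution) FOVs.
    Returns a boolean array of flags and the threshold that was used.
    """
    baseline = np.asarray(baseline_means, dtype=float)
    threshold = baseline.mean() + n_sigma * baseline.std()
    flags = np.asarray(fov_mean_uncertainty, dtype=float) > threshold
    return flags, float(threshold)

# Hypothetical values: four baseline FOVs and two new FOVs to screen.
baseline = [0.10, 0.12, 0.11, 0.09]
flags, threshold = flag_ood([0.11, 0.30], baseline)
```

In practice the threshold would need to be chosen per marker and imaging condition, since the calibration itself is repeated for each of them.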

Uncertainty evaluation of data with gradual distribution shifts
An essential application of AI image translation is to provide real-time molecular-based characterizations of cells for pharmacological study. In this application, the AI model predicts the cell characteristics or expression levels of molecular markers that can either evolve during natural cell growth or be altered by drug treatments. Because of cellular dynamics, it is imperative to ensure that the cell behavior is faithfully reported by the AI prediction; however, it has remained difficult to verify the black-box predictions in those tasks (57). In this work, we studied the influence of drug treatment and cell confluency on the AI prediction and tested if our uncertainty estimation is able to detect prediction errors.
To study how the drug-altered cell morphology affects the AI prediction, we cultured prostate cancer cells (LNCaP) and treated them with Enza, which impairs cell growth and alters cell metabolism through androgen receptor inhibition. A specific effect of Enza treatment is the reduction of lipid droplets (i.e., lipogenesis) (58). Using the untreated samples, we first trained an AI model that translates phase-contrast images into fluorescent images of BODIPY that stains the lipid droplets (Fig. 5 a). We then applied this model to the test dataset of treated LNCaP cells to obtain BODIPY image predictions (Fig. 5 b). Compared to the control data, we confirmed that the LNCaP cell morphology was altered and the BODIPY signal was reduced (IF target) by the Enza treatment. We also found that the AI model that was trained only using the control data cannot fully capture the drug-induced reduction in BODIPY signal (prediction image in Fig. 5 b). Such a misprediction can be visualized by the increased intensity in the error map (error map in Fig. 5 b). At the same time, we found that the ensemble method reports an elevated level of uncertainty (uncertainty map in Fig. 5 b). By averaging the uncertainty value and absolute error over the FOV, Fig. 5 c and d show that our uncertainty quantification effectively detects the OOD.
To understand how the cell proliferation-induced sample alterations influence the AI prediction, we seeded the LNCaP cells at a relatively low density (20%) and acquired microscopy images until the cells reached 100% confluency (i.e., 100% area coverage). The distribution shift of this dataset mainly arises from phenotypic changes associated with cell density. Like the drug experiment, we trained a model using control data, which are phase-contrast and BODIPY images of 20%-confluency samples (VSparse training set in Table 1). We then applied this model to the datasets of 50% (Sparse), 80% (Dense), and 100% (VDense training set in Table 1) cell confluence (Fig. 5 e). We found that the ensemble-predicted uncertainty increases with increasing confluency (Fig. 5 f), consistent with the mean target-prediction error calculation (Fig. 5 g). This finding further confirms that the cell morphology and density changes directly affect the AI translation accuracy, which can be captured by the ensemble uncertainty calculation.

Ensemble acceleration preserves the uncertainty evaluation accuracy
Although the naive ensemble has been shown to provide an accurate estimate of prediction uncertainty, a major weakness of this method is its computational overhead for building independent models. Therefore, it is critical to investigate new approaches to develop uncertainty estimation methods with low-cost computation. To address this technical hurdle, we developed an acceleration algorithm, FastEnsemble, which searches for independent low-loss optimization paths starting from the initial model (see section "materials and methods"). We found that this directed optimization approach allows us to complete a new model training task with only 3%-5% additional training time. For example, generating an ensemble of six models using the naive ensemble approach requires a computational time of ~6× the single-model training time, whereas the FastEnsemble approach requires ~1.2× the single-model training time, quintupling the training speed.
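The training budgets quoted above can be checked with a short worked calculation. Here $B$ is the budget for training one model from scratch, and $\alpha$ is a symbol introduced only for this illustration to denote the fractional extra cost of each additional FastEnsemble model (3%-5% according to the text). The block assumes the amsmath package.

```latex
\begin{align*}
T_{\text{naive}} &= 6B, \\
T_{\text{Fast}} &= B + 5\,\alpha B, \qquad \alpha \in [0.03,\, 0.05]
  \;\Rightarrow\; 1.15B \le T_{\text{Fast}} \le 1.25B \approx 1.2B, \\
\frac{T_{\text{naive}}}{T_{\text{Fast}}} &\approx \frac{6B}{1.2B} = 5 .
\end{align*}
```

The five additional models therefore add roughly 15%-25% to the cost of the first model, which is consistent with the approximately fivefold speedup stated in the text.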
Furthermore, such an acceleration preserves the high prediction accuracy achieved by the naive ensemble. We empirically found that this approach can lead to a diverse set of models $w_1, \ldots, w_K$, which gives an uncertainty estimation akin to the naive ensemble method (Fig. 6). To quantitatively assess the performance of the FastEnsemble method, we used bounding boxes to manually label local image regions that comprise inaccurate predictions. These inaccurate prediction incidents were either artificially introduced (cell type mismatch) or identified using phase-contrast input images (impurity).
We used our uncertainty evaluation methods to classify incidents within these labeled image data. Further details of this analysis can be found in section "materials and methods." The uncertainty assessment accuracies of our FastEnsemble and naive ensemble methods were then compared with five previously published methods (i.e., MC-Dropout, SVI, SLGD, BatchEnsemble, and Snapshot ensemble). The main features and corresponding parameters of the tested methods are summarized in the "materials and methods" section. We then analyzed the corresponding ROC curves (see section "materials and methods"), as summarized in Fig. 6 and Table S1. We found a negligible performance difference between the FastEnsemble and naive ensemble methods, both of which outperformed the MC-Dropout (blue curve) and SVI (orange) methods. This result demonstrates that we can preserve the performance while reducing the computational load. Notably, our FastEnsemble is approximately five times faster than the naive ensemble, since each model training is initialized from the previous solution, substantially reducing the number of optimization iterations.
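As an illustration of how an ROC curve can be obtained from uncertainty values, the following minimal sketch sweeps a threshold over per-region uncertainty scores and compares the flagged regions against manual labels. It assumes one scalar score per region and a binary label (1 for a region annotated as an inaccurate prediction, 0 otherwise). The scoring and labeling protocol actually used is described in the "materials and methods" section and may differ, and all names here are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_curve(scores: Sequence[float],
              labels: Sequence[int]) -> Tuple[List[float], List[float]]:
    """Return (fpr, tpr) lists obtained by lowering a threshold over `scores`.

    A region is flagged when its score is >= the threshold; tied scores
    are grouped so that they contribute a single ROC point."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # Visit regions from the highest to the lowest uncertainty score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last member of a group of tied scores.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr: Sequence[float], tpr: Sequence[float]) -> float:
    """Area under the ROC curve computed with the trapezoidal rule."""
    return sum((fpr[j] - fpr[j - 1]) * (tpr[j] + tpr[j - 1]) / 2.0
               for j in range(1, len(fpr)))


if __name__ == "__main__":
    # Toy data: higher uncertainty should indicate an inaccurate region.
    toy_scores = [0.9, 0.8, 0.7, 0.6]
    toy_labels = [1, 0, 1, 0]
    print(auc(*roc_curve(toy_scores, toy_labels)))
```

Two methods with near-identical curves under this kind of analysis would have comparable sensitivity and specificity, which is the sense in which the FastEnsemble and naive ensemble are compared in the text.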

CONCLUSIONS AND DISCUSSION
In this work, we found that ensemble-based deep learning can effectively report the image translation uncertainty, which is correlated with the absolute error and Pearson correlation coefficient. We considered the problem of uncertainty estimation and calibration in image translation under various distribution shifts, in which batch effects (e.g., cell-cell phenotype variation, batch-to-batch inconsistency, imaging condition differences) are the main source of the prediction error associated with different imaging conditions and methods. We further showed that our uncertainty assessment method can be used to forecast translation error that is associated with varying imaging conditions and specimen alterations.
Compared to previous uncertainty studies that have mainly focused on classification tasks, our work demonstrated that ensemble algorithms can be employed to predict image-to-image translation errors, in which experimental microscopy data were directly used for all conducted tests. Furthermore, our FastEnsemble method, which provides a sensitivity and specificity akin to the naive ensemble and other ensemble-based methods (e.g., BatchEnsemble and Snapshot ensemble), presents a valuable alternative approach to reduce computational costs in approximating the posterior distribution. Since the ensemble method simply evaluates the statistics of the AI predictions, it can be straightforwardly integrated with various deep-learning models that have been utilized for labeling and processing microscopy data, improving the trustworthiness of these cutting-edge tools (2,3,8,9,59-62).
Overall, our model serves as a foundational step toward robust AI image translation for biomedical applications; however, practical implementation of our reported algorithms would require further development. For example, although the proposed algorithm focuses on pixel-level deviations, data-exclusion decisions would require tools to flag highly uncertain features in the microscopy data (63). In addition, although many studies have empirically demonstrated the feasibility of ensemble-based uncertainty evaluation, its underlying working mechanism is still not fully understood. Thus, establishing further theoretical understanding will enable the development of more reliable uncertainty assessment tools. Last, it is imperative to eventually improve the robustness of AI labeling, so that the AI model can still make accurate predictions when encountering OOD images. Within robust machine learning, which is a rapidly growing field, studies on distribution shift robustness are particularly relevant to microscopy applications. Over the past decade, promising advances have been made in developing algorithms to quantify the model robustness (64,65) and training frameworks to optimize the model stability (66-68).

FIGURE 1 Demonstration and characterization of ensemble method and uncertainty map. (a) Schematic of the ensemble method, showing independent training of N models that generate an average prediction and corresponding standard deviation (Std) between predictions. (b) Left to right: example of a phase-contrast input image, target immunofluorescence image (ground truth), the respective AI prediction, error (difference between prediction and target), and Std map estimated by our ensemble method. (c) Scatter plot illustrating the correlation between the pixel-level Std and error divergence (dashed lines). (d) Boxplot generated by binning the absolute error in Fig. 1 c, confirming the correlation between error and Std. Red line denotes a quadratic function fitted to the data, $1.318 \times 10^{-6}\,\mathrm{Std}^2 + 1.026 \times 10^{-3}\,\mathrm{Std} + 7.662 \times 10^{-4}$. The fitted function is subsequently used for converting Std into uncertainty. Error bars represent standard deviation of the absolute error.
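The ensemble statistics and the Std-to-uncertainty conversion described in this caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each prediction is a flattened list of pixel intensities, uses the population standard deviation across models (whether the population or sample form was used is not stated here), and takes the quadratic coefficients from the fit quoted in the caption. The intensity scale of Std follows the authors' image data and is not specified in this section. Function names are hypothetical.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple

# Coefficients of the quadratic fit quoted in the Fig. 1 d caption.
A, B, C = 1.318e-6, 1.026e-3, 7.662e-4


def ensemble_mean_std(
    predictions: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Per-pixel mean and Std across N model predictions.

    `predictions` holds one flattened image per ensemble member.
    Assumption: the population standard deviation is used."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same number of pixels")
    pixels = list(zip(*predictions))  # regroup values pixel by pixel
    mean = [fmean(px) for px in pixels]
    std = [pstdev(px) for px in pixels]
    return mean, std


def std_to_uncertainty(std: float) -> float:
    """Convert a pixel-level Std into an uncertainty value with the fitted quadratic."""
    return A * std ** 2 + B * std + C


if __name__ == "__main__":
    # Three hypothetical model outputs for a two-pixel image.
    preds = [[10.0, 20.0], [14.0, 20.0], [12.0, 20.0]]
    mean_map, std_map = ensemble_mean_std(preds)
    uncertainty_map = [std_to_uncertainty(s) for s in std_map]
    print(mean_map, std_map, uncertainty_map)
```

Pixels on which all models agree have zero Std and therefore map to the constant term of the fit, whereas pixels with diverging predictions receive a larger uncertainty.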

FIGURE 2 Ensemble-estimated uncertainty predicts image translation error across various MSC markers. (a) Positive correlation between uncertainty and absolute error that are averaged over the field of view (FOV) for all five tested MSC markers and nucleus staining. In addition, datasets that contain imaging artifacts (dark green, CD105-Impurities) also exhibit a similar correlation. (b) Uncertainty negatively correlates with the Pearson correlation coefficient ($r_s$) for all tested models. (c) Independently trained AI predictions and uncertainty assessments can be combined into an image composite reporting prediction performance for individual markers. (d) Ensemble-based uncertainty effectively detects noncellular (white arrow) impurities in the image.

FIGURE 3 Uncertainty evaluation in image translation of mouse kidney tissue sections. Ensemble-based uncertainty can also be used to assess the labeling performance in tissue section images. (a) Example images of a mouse kidney section. The sample is fluorescently stained for actin and counterstained for nucleus. In this test, we focused on predicting the actin signal distribution, in which the images from left to right are phase contrast, fluorescent target, AI prediction, target-prediction error map, and the uncertainty map generated using our ensemble method, respectively. (b) Boxplot generated by binning the pixel-level absolute error of the actin model, validating the correlation between error and Std. Red line denotes the Std-uncertainty conversion function. Error bars represent standard deviation of the absolute error. (c) Positive correlation between uncertainty and absolute error averaged over the FOV.

FIGURE 4 Ensemble-based uncertainty flags OOD predictions. The calculated uncertainty successfully identifies OOD image data generated by various types of perturbations, including overexposure (a and b), nonuniform illumination (c and d), wrong microscope magnification (e and f), and different cell type (g and h). A significant increase in uncertainty values (n = 15 FOVs) is observed in all perturbation cases. Note that the y axis limits of (b), (d), (f), and (h) were adjusted to highlight the differences between conditions. For all barplots, ***p < 0.0001, in which the p values were determined using two-sample Student's t-test. Error bars represent standard deviation.

FIGURE 5 Distribution shift-based uncertainty evaluation. Our uncertainty estimation is able to detect OOD datasets. (a) Example images of the control model, which shows LNCaP cells stained for BODIPY. (b) Control model applied to phase-contrast images of LNCaPs that have been treated with a chemo drug called enzalutamide (Enza). Treated samples show slight changes in morphology that lead to significantly higher uncertainty estimation (c) in predictions compared to the control group. (d) Mean absolute error (MAE) calculations verify the uncertainty estimation. (e) Cell culture expansion leads to confluency (cell density) differences over time, which can affect cell morphology and protein expression profiles. Uncertainty maps show an increase in uncertainty estimation with increasing cell confluency. (f and g) Visual assessment can be confirmed by numeric uncertainty and MAE comparison. Note that the y axis limits of (c) and (f) were adjusted to highlight the differences between conditions. For all barplots, **p < 0.001 and ***p < 0.0001, in which the p values were determined using two-sample Student's t-test. Error bars represent standard deviation.

FIGURE 6 Quantification of the prediction accuracy. (a) Bounding boxes highlighting OOD area in the test images. Image samples from the cell type mismatch dataset (top) and image artifact dataset (bottom). Common mispredictions include distorted cell contours (upper row) associated with cell type mismatches and speckles (lower row) associated with impurities in the experiment sample. The yellow arrows indicate representative false-positive (third column) and false-negative (fourth column) events. Three different FOVs are shown for the image artifact case (lower row), since we could not identify any FOV that contains multiple false-positive and false-negative incidents. (b and c) Comparison of different ensemble-based uncertainty evaluations and other algorithms. ROC curves showing the relationship between the true-positive rate and false-positive rate for identifying the cell type mismatches (b) and image artifacts (c). We found that naive ensemble and FastEnsemble methods perform similarly, suggesting that both methods have a comparable sensitivity and specificity to diagnose translation predictions.

TABLE 1
Train and test set parameters for image datasets