Uncertainty-Aware Body Composition Analysis with Deep Regression Ensembles on UK Biobank MRI

Along with rich health-related metadata, medical images have been acquired for over 40,000 male and female UK Biobank participants, aged 44-82, since 2014. Phenotypes derived from these images, such as measurements of body composition from MRI, can reveal new links between genetics, cardiovascular disease, and metabolic conditions. In this work, six measurements of body composition and adipose tissues were automatically estimated by image-based, deep regression with ResNet50 neural networks from neck-to-knee body MRI. Despite the potential for high speed and accuracy, these networks produce no output segmentations that could indicate the reliability of individual measurements. The presented experiments therefore examine uncertainty quantification with mean-variance regression and ensembling to estimate individual measurement errors and thereby identify potential outliers, anomalies, and other failure cases automatically. In 10-fold cross-validation on data of about 8,500 subjects, mean-variance regression and ensembling showed complementary benefits, reducing the mean absolute error across all predictions by 12%. Both improved the calibration of uncertainties and their ability to identify high prediction errors. With intra-class correlation coefficients (ICC) above 0.97, all targets except the liver fat content yielded relative measurement errors below 5%. Testing on another 1,000 subjects showed consistent performance, and the method was finally deployed for inference to 30,000 subjects with missing reference values. The results indicate that deep regression ensembles could ultimately provide automated, uncertainty-aware measurements of body composition for more than 120,000 UK Biobank neck-to-knee body MRI that are to be acquired within the coming years.


Introduction
UK Biobank studies more than half a million volunteers by collecting data on blood biochemistry, genetics, questionnaires on lifestyle, and medical records (Sudlow et al., 2015).
The relationship between obesity, type-2 diabetes, and nonalcoholic fatty liver disease is of particular interest due to their high prevalence and associated adverse health effects (Wilman et al., 2017; Linge et al., 2018). Depending on genetic and environmental factors, body fat can accumulate in organs, abdominal depots, and muscle infiltrations, all of which have specific effects on health outcomes. Ongoing work is therefore concerned with acquiring measurements of liver fat content (Wilman et al., 2017), muscle volumes, and adipose tissue depots (West et al., 2016; Linge et al., 2018) with manual and semi-automated techniques (Borga, 2018). Recent works also proposed fully-automated techniques with neural networks for segmentation, which have been applied to the heart (Bai et al., 2018), kidney (Langner et al., 2020a), pancreas (Bagur et al., 2020), and liver (Irving et al., 2017), but also the iliopsoas muscles (Fitzpatrick et al., 2020), spleen, adipose tissues, and more (Liu et al., 2021). Similar to the latter, neural networks have also been proposed for segmentation of adipose tissues in other studies involving computed tomography (CT) (Wang et al., 2017; Weston et al., 2019) and MRI (Langner et al., 2019a; Estrada et al., 2020; Küstner et al., 2020).
Apart from semantic segmentation, neural networks can also be trained for image-based regression, predicting numerical measurement values without any need for explicit delineations. In medical imaging, deep regression has gained attention for analyses of human age in MRI of the brain (Cole et al., 2018), volume measurements of the heart (Xue et al., 2017), and blood pressure, sex, and age in retinal fundus photographs (Poplin et al., 2018). On UK Biobank neck-to-knee body MRI, deep regression can quantify human age and liver fat, but also various measurements of body composition. For the latter, its accuracy can exceed the agreement between established gold standard techniques (Langner et al., 2020b).
This type of deep regression requires no ground truth segmentations and can measure abstract properties by training on numerical reference values from arbitrary sources. However, the lack of output segmentations poses a limitation, as the predicted numerical values give no indication of confidence or reliability. Previous work examined the underlying relevant image features with saliency analysis, but only provided interpretations on cohort level without attempting to estimate individual measurement errors.
Recent advances in the field of uncertainty quantification have the potential to address some of these concerns by providing an error estimate for each individual measurement (Ghahramani, 2015). High uncertainty could accordingly alert researchers or clinical operators to anomalies, outliers, or other failure cases of these systems (Kendall and Gal, 2017). Among various proposed methods, such as Bayesian inference with Markov chain Monte-Carlo techniques (Neal, 2012) and more computationally viable approximations that apply dropout at test time (Gal and Ghahramani, 2016), recent work reported superior behavior for deep ensembling strategies (Gustafsson et al., 2020b; Ovadia et al., 2019; Ashukha et al., 2020). These approaches provide predictive uncertainty by training multiple neural networks to each predict not only a point estimate but a probability distribution, with multiple network instances forming an ensemble (Lakshminarayanan et al., 2017). In related work, a similar approach was recently applied for age estimation from fetal brain MRI, reporting high accuracy and promising indications for abnormality detection (Shi et al., 2020).
The aim of this work is to develop an automated strategy for body composition analysis on UK Biobank neck-to-knee body MRI which provides not only measurements (Langner et al., 2020b) but also individual uncertainty estimates that can represent confidence intervals. As a key advantage, the deep regression approach can be trained without access to reference segmentation masks and instead learns to emulate the existing, numerical metadata. Six body composition measurements relating to adipose tissues with high relevance for cardiometabolic disease were predicted from two-dimensional representations of the MRI data. ResNet50 neural network instances (He et al., 2016) for image-based regression were trained to each predict the mean and variance of a Gaussian probability distribution over a given measurement value. Combined into ensembles, they provided estimates of predictive uncertainty (Lakshminarayanan et al., 2017). The main contribution consists of an extensive analysis of the independent effects of mean-variance regression and ensembling on overall accuracy and speed, but also on the calibration (Guo et al., 2017) of uncertainties and their ability to identify the worst predictions in sparsification (Ilg et al., 2018), both in cross-validation on about 8,500 subjects and testing on another 1,000 subjects. The proposed method was deployed for inference to obtain previously unavailable measurements from more than 30,000 images, including 1,000 repeat scans.

Materials and methods
The neck-to-knee body MRI of each subject was formatted into a two-dimensional image from which the proposed method estimates a numerical measurement value in image-based regression. This work examines least squares regression, which produces only the measurement value itself (Langner et al., 2020b,c), but also mean-variance regression (Nix and Weigend, 1994), in which both the mean value and the variance of a Gaussian probability distribution over one measurement of one subject are modeled. In ensembling, the predictions of several networks are furthermore aggregated (Lakshminarayanan et al., 2017). The uncertainty estimates thus obtained can help to identify outliers and potential failure cases automatically (Gustafsson et al., 2020a).

UK Biobank image data
UK Biobank has recruited more than half a million men and women by letter from the National Health Service in the United Kingdom, starting in 2006 (Sudlow et al., 2015). Examinations involve several visits to UK Biobank assessment centers, with imaging procedures launching in 2014 for a subgroup of 100,000 participants (Littlejohns et al., 2020). At the time of writing, medical imaging data from three different centers has been released for 40,264 men and women (52% female) aged 44-82 (mean 64) years with BMI 14-62 (mean 27) kg/m² and a majority of 94% with self-reported White-British ethnicity. For 1,209 of these, data from a repeat imaging visit with an offset of about two years has been released. All participants provided informed consent and both the UK Biobank examinations and the experiments in this work were approved by the responsible British and Swedish ethics committees.
Fig. 1: As input to the neural network, each MRI volume was represented as a color image of (256 × 256 × 3) pixels by forming channels from the projected water (red) and fat (green) signal and fat fraction slices (blue) from two axes each.

MRI data
The MRI protocol examined in this work is listed as UK Biobank field 20201 and covers the body from neck to knee in six separate imaging stations acquired in a scan time below ten minutes (West et al., 2016; Littlejohns et al., 2020). Volumetric, co-aligned images of water and fat signal were acquired with a two-point Dixon technique with TR = 6.69, TE = 2.39/4.77 ms and flip angle 10° on a Siemens Aera Magnetom 1.5 T device. The image resolution varies between stations, with a typical grid of (224 × 174 × 44) voxels of (2.232 × 2.232 × 4.5) mm (for more detail, see "Body MRI protocol parameters" in Littlejohns et al. (2020)).

Image formatting
For this work, the six MRI stations of each subject were first fused into a common voxel grid by trilinear interpolation to form a single volume of (224 × 174 × 370) voxels for each signal type. These volumes were then converted to two-dimensional representations by summing all values along two axes of view, yielding a coronal and a sagittal mean intensity projection, which were concatenated side by side. This was done separately for the water and the fat signal, with the resulting images individually normalized and downsampled to form two color channels of a single image of (256 × 256 × 2) pixels (Langner et al., 2020b). As a third image channel, a single coronal and a single sagittal fat fraction slice were extracted based on a body mask (Langner et al., 2020c). These fractions resulted from voxel-wise division of the fat signal by the sum of water and fat signal. Fig. 1 shows the result, a dual mean intensity projection with fat fraction slices, encoded in 8-bit for faster processing.
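The formatting steps described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the axis convention (x, y, z), and the use of central slices in place of the mask-based slice selection are all hypothetical, and station fusion and downsampling to 256 × 256 pixels are omitted.

```python
import numpy as np

def mean_intensity_projections(volume):
    """Average a (x, y, z) volume along two viewing axes and join the views."""
    coronal = volume.mean(axis=1)   # collapse the front-back axis -> (x, z)
    sagittal = volume.mean(axis=0)  # collapse the left-right axis -> (y, z)
    return np.concatenate([coronal, sagittal], axis=0)  # views side by side

def to_uint8(image):
    """Normalize one channel to its own value range and encode it in 8 bits."""
    lo, hi = float(image.min()), float(image.max())
    scaled = (image - lo) / max(hi - lo, 1e-8)
    return np.round(scaled * 255).astype(np.uint8)

def fat_fraction(water, fat, eps=1e-8):
    """Voxel-wise fat signal divided by the sum of water and fat signal."""
    return fat / (water + fat + eps)

def format_subject(water, fat):
    """Stack water projection, fat projection, and fat fraction slices."""
    ff = fat_fraction(water, fat)
    # Assumption: central slices stand in for the mask-based slice selection.
    ff_slices = np.concatenate(
        [ff[:, ff.shape[1] // 2, :], ff[ff.shape[0] // 2, :, :]], axis=0)
    channels = [
        to_uint8(mean_intensity_projections(water)),
        to_uint8(mean_intensity_projections(fat)),
        np.round(ff_slices * 255).astype(np.uint8),
    ]
    return np.stack(channels, axis=-1)

# Small random volumes stand in for the fused water and fat signal.
rng = np.random.default_rng(0)
water = rng.random((8, 6, 10))
fat = rng.random((8, 6, 10))
image = format_subject(water, fat)
print(image.shape, image.dtype)
```

Because each projection channel is normalized to its own range, summing and averaging along an axis yield the same image here.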

Ground truth
UK Biobank provides several body composition measurements from the same neck-to-knee body MRI data as used in this work, based on volumetric multi-atlas segmentations (West et al., 2016;Borga et al., 2015): Visceral Adipose Tissue (VAT), abdominal Subcutaneous Adipose Tissue (SAT), Total Adipose Tissue (TAT), Total Lean Tissue (TLT), and Total Thigh Muscle (TTM). Together with Liver Fat Fraction (LFF) values based on dedicated multi-echo liver MRI (Linge et al., 2018), these reference measurements form the ground truth data, or regression targets, for this work.

Data partitions
Among the 40,264 released images of the initial imaging visit, visual inspection identified 1,376 subjects with artifacts such as water-fat signal swaps, non-standard positioning and metal objects (Langner et al., 2020b). Three datasets were formed from the initial imaging visit from those subjects for whom any of the six reference measurements were available.
Dataset D_cv consists of 8,539 subjects without artifacts and was subdivided into a 10-fold cross-validation split which was retained for all experiments.
Dataset D_test contains another 1,107 subjects without artifacts and served as a test set, but notably lacks any values for two of the six regression targets for which no reference values have been released yet.
Dataset D_art was formed from those subjects with identified artifacts, yielding 330 subjects, to examine behavior on abnormal data.
Two additional datasets were formed from those subjects with no available reference measurements. Dataset D_infer comprises all remaining 29,234 subjects without artifacts from the initial imaging visit, to whom the prediction model was applied for inference. Finally, dataset D_revisit was formed for inference on the repeat imaging visit from 1,179 subjects with no image artifacts.
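A fixed 10-fold split of the 8,539 cross-validation subjects, retained across experiments, could be produced along the following lines. This is only a sketch: the seed, the helper names, and the use of a random permutation are assumptions, since the text does not state how subjects were assigned to folds.

```python
import numpy as np

def make_folds(n_subjects, n_folds=10, seed=0):
    """Shuffle subject indices once and cut them into near-equal folds."""
    rng = np.random.default_rng(seed)  # fixed seed keeps the split reproducible
    order = rng.permutation(n_subjects)
    return np.array_split(order, n_folds)

def train_val_indices(folds, k):
    """Use fold k for validation and all remaining folds for training."""
    val = folds[k]
    train = np.concatenate([f for i, f in enumerate(folds) if i != k])
    return train, val

folds = make_folds(8539)
train_idx, val_idx = train_val_indices(folds, k=0)
print([len(f) for f in folds])
```

With 8,539 subjects, nine folds hold 854 subjects and one holds 853.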

Model
A ResNet50 architecture (He et al., 2016) was configured to receive the two-dimensional image format as seen in Fig. 1 as input for a given subject and predict all six regression targets at once. No explicit segmentation was performed at any stage of this work. Each network was pre-trained on ImageNet and optimized with Adam (Kingma and Ba, 2014) at batch size 32 with online augmentation by random translations. After 5,000 iterations, the base learning rate of 0.0001 was reduced by a factor of 10 and training continued for another 1,000 iterations (Langner et al., 2020b). All experiments were conducted in PyTorch, using an Nvidia RTX 2080 Ti graphics card with 11 GB RAM.
Four distinct configurations were compared. As the first one, a least squares regression network predicted only these six output values, each corresponding to one measurement for a given subject, trained by optimizing the mean squared error criterion of equation 1:
$$\mathcal{L}_{MSE}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \left( y_n - \mu_\theta(x_n) \right)^2 \qquad (1)$$
In this formula, $\mu_\theta(x_n)$ represents the network prediction for the n-th of $N$ input samples $x_n$, with $y_n$ as the corresponding ground truth value.
As a second configuration, least squares ensembles were formed by combining ten such networks. Their predictions were averaged and the spread, or empirical variance, of their predictions used as uncertainty estimate (Ilg et al., 2018).
As the third configuration, mean-variance regression was performed by predicting two values, corresponding to the mean and variance of a Gaussian probability distribution over one measurement value for a given subject, optimized with a negative log-likelihood criterion (Nix and Weigend, 1994) as shown in equation 2:
$$-\log p_\theta(y_n \mid x_n) = \frac{\log \sigma^2_\theta(x_n)}{2} + \frac{\left( y_n - \mu_\theta(x_n) \right)^2}{2 \sigma^2_\theta(x_n)} + c \qquad (2)$$
Here, $p_\theta(y_n \mid x_n)$ is the probabilistic predictive distribution over one measurement value, modeled by the network outputs $\mu_\theta(x_n)$ and $\sigma^2_\theta(x_n)$, which represent the predicted mean and corresponding predicted variance for input sample $x_n$, respectively. The last term, $c$, is a constant that does not depend on $\theta$. This criterion expands the mean squared error of eq. 1 by a sample-specific, heteroscedastic variance and can likewise be averaged across multiple samples. This predicted variance directly serves as an estimate of uncertainty, with high values describing a wide normal distribution within which plausible values for the estimated measurement are assumed.
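The relationship between the two training criteria can be illustrated with a short NumPy sketch. The function names are hypothetical, and parameterizing the network output as a log-variance (to keep the variance positive) is a common choice assumed here rather than a detail confirmed by the text. The constant term is dropped, as it does not affect optimization.

```python
import numpy as np

def mse_loss(mean, target):
    """Mean squared error between predictions and reference values."""
    return np.mean((target - mean) ** 2)

def gaussian_nll(mean, log_var, target):
    """Gaussian negative log-likelihood, averaged over samples, constant dropped.

    Assumption: the network emits log-variance, so exp() recovers the variance.
    """
    var = np.exp(log_var)
    return np.mean(0.5 * log_var + 0.5 * (target - mean) ** 2 / var)

target = np.array([0.0, 1.0, 2.0])
mean = np.array([0.5, 1.0, 5.0])   # the last prediction is far off

# With unit variance everywhere, the criterion is half the squared error.
unit = gaussian_nll(mean, np.zeros(3), target)

# Assigning a larger variance to the poorly fit sample attenuates its loss.
attenuated = gaussian_nll(mean, np.log(np.array([1.0, 1.0, 9.0])), target)
print(unit, attenuated)
```

The second call shows the attenuation effect: a sample with a large residual contributes less to the loss when the network also predicts a large variance for it, at the price of the log-variance penalty.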
As the fourth and final configuration, mean-variance ensembles employ ten such network instances. Their predictions can likewise be aggregated to obtain estimates of predictive uncertainty (Lakshminarayanan et al., 2017).
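The two ensemble aggregation schemes can be sketched as follows. For mean-variance ensembles, the sketch treats the ensemble as a uniformly weighted mixture of Gaussians in the manner of Lakshminarayanan et al. (2017), so that the total variance splits into the average predicted variance plus the spread of the predicted means. That this exact formula matches the implementation used here is an assumption, and the function names are hypothetical.

```python
import numpy as np

def aggregate_least_squares(means):
    """means: (instances, subjects). Returns ensemble mean and empirical variance."""
    return means.mean(axis=0), means.var(axis=0)

def aggregate_mean_variance(means, variances):
    """Combine Gaussian predictions of all instances into one mean and variance."""
    mean = means.mean(axis=0)
    aleatoric = variances.mean(axis=0)  # average of the predicted variances
    epistemic = means.var(axis=0)       # disagreement between the instances
    return mean, aleatoric + epistemic

# Two instances predicting one target for one subject.
means = np.array([[1.0], [3.0]])
variances = np.array([[0.5], [1.5]])
ens_mean, ens_var = aggregate_mean_variance(means, variances)
ls_mean, ls_var = aggregate_least_squares(means)
print(ens_mean, ens_var, ls_var)
```

In this toy case the instances disagree by as much as they each expect to err, so both components contribute equally to the ensemble variance.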
In all ensembles, model diversity was increased by withholding one of ten evenly sized subsets of the training data from each instance, as if they had been obtained from a preceding cross-validation experiment. The target values were standardized (Langner et al., 2020b). When one or more of the six ground truth values for a given training sample were missing, their contribution to the loss term was dynamically set to zero, so that they would not affect the training process. In this way, it was possible to utilize samples with missing values and provide as much training data as possible. A PyTorch implementation for training and inference will be made publicly available.
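Standardizing the targets and zeroing the loss of missing values could look roughly like the sketch below. Encoding missing reference values as NaN, averaging over the valid entries only, and the function names are assumptions made for illustration.

```python
import numpy as np

def standardize(targets):
    """Scale each target column to zero mean and unit variance, ignoring NaN."""
    mean = np.nanmean(targets, axis=0)
    std = np.nanstd(targets, axis=0)
    return (targets - mean) / std, mean, std

def masked_gaussian_nll(mean, log_var, target):
    """Gaussian negative log-likelihood in which missing targets contribute zero."""
    valid = ~np.isnan(target)
    safe_target = np.where(valid, target, 0.0)  # placeholder, masked out below
    per_value = 0.5 * log_var + 0.5 * (safe_target - mean) ** 2 / np.exp(log_var)
    per_value = np.where(valid, per_value, 0.0)
    # Assumption: normalize by the number of available values.
    return per_value.sum() / max(int(valid.sum()), 1)

# One subject with two targets, the second reference value is missing.
target = np.array([[1.0, np.nan]])
log_var = np.zeros((1, 2))
loss_a = masked_gaussian_nll(np.array([[0.0, 5.0]]), log_var, target)
loss_b = masked_gaussian_nll(np.array([[0.0, -7.0]]), log_var, target)
print(loss_a, loss_b)
```

Both calls return the same loss, since the prediction for the missing target has no influence on the criterion.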

Evaluation
All configurations were evaluated in 10-fold cross-validation on dataset D_cv and also validated against artifact dataset D_art. The best configuration was eventually applied to test dataset D_test and deployed for inference on datasets D_infer and D_revisit.
The predicted measurements were compared to the reference values with the intraclass correlation coefficient (ICC) with a two-way random, single measures, absolute agreement definition (Koo and Li, 2016) and the coefficient of determination R². The mean absolute error (MAE) is also reported, together with the mean absolute percentage error (MAPE) as a relative error measurement. The latter is the absolute difference between prediction and reference divided by the reference. Additionally, aggregated saliency maps were generated to highlight relevant image areas (Selvaraju et al., 2017).
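The agreement metrics named above can be sketched in NumPy as follows. The ICC follows the standard two-way random, single measures, absolute agreement formula computed from ANOVA mean squares, with prediction and reference treated as two raters. Defining R² as one minus the ratio of residual to total sum of squares is an assumption, as the text does not specify the variant used; the function names are hypothetical.

```python
import numpy as np

def mae(pred, ref):
    return np.mean(np.abs(pred - ref))

def mape(pred, ref):
    """Absolute difference divided by the reference, averaged, in percent."""
    return 100.0 * np.mean(np.abs(pred - ref) / np.abs(ref))

def r_squared(pred, ref):
    ss_res = np.sum((ref - pred) ** 2)
    ss_tot = np.sum((ref - ref.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def icc_absolute_agreement(pred, ref):
    """Two-way random, single measures, absolute agreement ICC for two raters."""
    ratings = np.stack([pred, ref], axis=1)  # rows: subjects, columns: raters
    n, k = ratings.shape
    grand = ratings.mean()
    ss_rows = k * np.sum((ratings.mean(axis=1) - grand) ** 2)
    ss_cols = n * np.sum((ratings.mean(axis=0) - grand) ** 2)
    ss_err = np.sum((ratings - grand) ** 2) - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

ref = np.array([1.0, 2.0, 3.0, 4.0])
pred = ref + 1.0  # perfectly correlated, but with a constant offset
print(mae(pred, ref), mape(pred, ref), r_squared(pred, ref),
      icc_absolute_agreement(pred, ref))
```

The example illustrates why absolute agreement is the stricter criterion: a constant offset leaves the linear correlation at 1 but lowers the ICC.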
The estimated uncertainties were evaluated regarding sparsification (Ilg et al., 2018) and calibration (Guo et al., 2017). Sparsification examines whether the highest uncertainties coincide with the highest prediction errors. Ranking all measurements by their uncertainty and excluding one after another should accordingly yield consistent improvements in performance metrics such as the MAE. Calibration examines the magnitude of uncertainties and resulting under- or overconfidence of predictions. The uncertainty obtained for any given sample corresponds to the variance of a Gaussian probability distribution, modeling characteristic confidence intervals around the predicted mean. Higher uncertainty scales these intervals to be wider, enabling them to cover larger errors. Ideally calibrated uncertainties define confidence intervals that cover, on a set of samples, a percentage of errors that corresponds exactly to their specific confidence level.
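Both evaluation concepts can be illustrated with a brief sketch. The sparsification curve reports the MAE of the measurements that remain after excluding a given share of the most uncertain ones, and the coverage function counts how many errors fall within the central Gaussian interval for a chosen confidence level. The function names and the synthetic data are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def sparsification_curve(abs_errors, uncertainties, fractions):
    """MAE of the remaining samples after removing the most uncertain share."""
    order = np.argsort(-uncertainties)  # most uncertain first
    sorted_errors = abs_errors[order]
    n = len(sorted_errors)
    curve = []
    for fraction in fractions:          # each fraction must be below 1
        n_removed = int(round(fraction * n))
        curve.append(sorted_errors[n_removed:].mean())
    return np.array(curve)

def empirical_coverage(pred_mean, pred_var, target, confidence):
    """Share of targets inside the central Gaussian interval of given level."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * np.sqrt(pred_var)
    return np.mean(np.abs(target - pred_mean) <= half_width)

# Toy case in which uncertainty ranks the errors perfectly.
abs_errors = np.array([4.0, 1.0, 3.0, 2.0])
uncertainties = np.array([0.9, 0.1, 0.7, 0.3])
curve = sparsification_curve(abs_errors, uncertainties, [0.0, 0.25, 0.5, 0.75])
print(curve)

# Synthetic, ideally calibrated predictions: errors match the stated variance.
rng = np.random.default_rng(0)
pred_var = rng.uniform(0.5, 2.0, size=20000)
target = rng.normal(0.0, np.sqrt(pred_var))
coverage = empirical_coverage(np.zeros(20000), pred_var, target, 0.9)
print(coverage)
```

For ideally calibrated uncertainties, the coverage at a 90% confidence level lies close to 0.9, whereas overconfident uncertainties yield a lower value.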

Results
Both mean-variance regression and ensembling provided complementary benefits. Combining both yielded the best predictive performance, shown in Table 1 and Fig. 2, with additional detail provided in the supplementary material. On average, the predictions can account for 98% (R²) of the variability in reference values, with absolute agreement (ICC) above 0.97 on all targets. The metrics carry over to the test data largely unchanged. All targets are predicted with a relative error below 5%, except the liver fat fraction. This target also incurred the highest relative uncertainties and is examined further in the supplementary material, together with additional evaluation metrics and a comparison to alternative reference methods. The supplementary material also provides additional detail on the saliency analysis, which is compiled into Fig. 3.
Fig. 4 shows that even without utilizing the uncertainties, the mean-variance regression ensemble reduces the MAE by 12% when compared to the least-squares regression baseline. The uncertainties enable sparsification, identifying some of the worst predictions which can be excluded to reduce the prediction error even further. The scatter plots of Fig. 2 show predictions for one target in detail, together with color-coded uncertainty. Despite containing image artifacts, not all subjects of dataset D_art yield higher uncertainties than the normal material. Indeed, many of these subjects result in highly accurate predictions despite the artifacts, and high uncertainties tend to occur only in those cases with high prediction errors. On test dataset D_test, the uncertainty highlights an outlier case for VAT (see Fig. 2), SAT, and TTM. This one subject causes consistently flawed predictions and was found to suffer from an abnormal, atrophied right leg.
On datasets D_cv and D_test the predicted means exhibit a consistent, linear correlation with the predicted log uncertainties. Accordingly, large subjects with high volumes induce systematically higher uncertainty. Although these cases also generally incur higher prediction errors, this bias can be shown to not achieve optimal sparsification. On the normal material with hardly any outliers, this tendency is so strong that sparsifying simply by predicted mean is almost as effective as using the uncertainties. On dataset D_art, this bias is less pronounced, as those cases with artifacts that cause genuine prediction failures are correctly assigned much higher uncertainty.
The best calibration was also achieved by the mean-variance ensemble, which nonetheless often produced overconfident uncertainties. Post-processing with target-wise scaling factors can achieve a near perfect fit to the validation data, however, and also improves the overall calibration on the test set. The supplementary material explores both sparsification and calibration in more detail and also lists results for datasets D_infer and D_revisit, on which the proposed method inferred new measurements for over 30,000 images.
No difference in processing speed was observed between least squares and mean-variance regression. Image formatting required the bulk of processing time, but once cached, training one network only requires about 15 minutes, or 2.5 hours for an ensemble of ten instances. Ensemble predictions for about 60 subjects can be generated within one second, so that inference for all 30,000 required less than ten minutes.
Fig. 4: Sparsification (Ilg et al., 2018) shows how the overall performance can be improved by gradually excluding those subjects with the highest predicted measurement uncertainty. Each position along the x-axis represents a certain share of excluded, most uncertain measurements, whereas the y-axis shows the change in mean absolute error relative to baseline, averaged across all targets on dataset D_cv. Even without utilizing the uncertainty to exclude any subjects, the mean-variance ensemble achieves a reduction of the MAE by 12%. Further improvements in the MAE can be achieved by excluding increasingly large shares of those measurements with highest uncertainty.

Discussion
With relative measurement errors below 5%, all targets except the liver fat fraction can be predicted with higher accuracy than observed for the mutual agreement between the reference and alternative established methods, both in cross-validation and on the test data. For liver fat itself, the relative error of 22-26% is worse than the 15% seen between the reference used here and an alternative set of UK Biobank liver fat measurements. The two-point Dixon images inherently limit the prediction accuracy for this target, as the reference values were obtained from another imaging protocol that reconstructs fat fractions more faithfully (Wilman et al., 2017;Linge et al., 2018). The saliency analysis of Fig. 3 indicates that the networks nonetheless learned to correctly identify liver tissue and other target-specific regions. The inference on 30,000 subjects provides material for further medical study which is, however, beyond the scope of this work.
The estimated uncertainties identified many of the worst prediction errors. They correctly highlighted an outlier with abnormal physiology on the test data and enabled consistent reductions in the mean prediction error by excluding the least certain measurements. On the inference datasets, the highest uncertainties were furthermore found in several cases to coincide with previously undetected anomalies in positioning, but also with minor artifacts and pathologies that may have negatively affected prediction accuracy and should arguably have been excluded during the original quality controls. In practice, the acquired measurements can accordingly be supplied together with their uncertainty, which could serve both as an error estimate and as a means to identify potential anomalies and failure cases. The affected cases could then be manually examined and, if necessary, excluded from further analyses.
However, the results also show two noteworthy limitations of the proposed approach which arise from imperfect calibration and the observed bias for high measurement values to incur high uncertainties. The imperfect calibration is linked to uncertainties that often underestimate the true measurement error. This is a known effect related to overfitting on the training data (Guo et al., 2017;Laves et al., 2020). As shown in the supplementary material, it is possible to correct the calibration by calculating target-wise scaling factors on the validation results. Once obtained, these simple scaling factors also yield improved overall calibration on the test data.
The bias towards systematically higher uncertainty in higher measurement values is a more concerning pattern. This effect can make it hard to distinguish whether a measurement with high uncertainty should be excluded due to being flawed or whether it merely resulted from a large subject, many of whom may provide valuable insight in correlation studies. It is most pronounced in the normal material where no genuine failure cases are encountered. In contrast, the uncertainties for the one abnormal subject in the test set and for the flawed predictions on images with artifacts in dataset D_art are typically higher.
Conceptually, body weights above 150 kg and BMIs of up to 53 kg/m² as present in the training data represent physiological extremes that could be considered outliers in their own right. Arguably, the two-dimensional projections are also inherently less suitable to represent more voluminous bodies and many of the largest subjects furthermore show considerable variability in shape and extend beyond the field of view. Even then, the effect is gradual and large subjects incur higher uncertainty than warranted in terms of the prediction errors alone. Previous work on age estimation from fetal brain MRI reported similar effects (Shi et al., 2020), noting specifically that higher aleatoric uncertainty, corresponding to the variances returned by the network instances, correlated with higher gestational age of the fetal brain. In this work, the effect is present in both the aleatoric and epistemic uncertainty component as modeled by the empirical variance, even in least-squares regression ensembles.
On a technical level, the mean-variance configuration provided immediate benefits over least squares regression despite merely changing the loss function and requiring that both a mean and a variance be predicted. This could be explained by loss attenuation (Kendall and Gal, 2017;Ilg et al., 2018) weakening the impact of outliers among the ground truth values. Several mismatches between the image data and reference were identified where the predictions also incur high errors in spite of low uncertainty. Images with artifacts, in contrast, did not necessarily yield high uncertainties, as the method was in fact able to provide accurate predictions for many of them. In turn, this also means that subjects with artifacts will not generally be identified as out-of-distribution samples. Ensembling yielded an inherent benefit in prediction accuracy and also improved the calibration. The ten network instances were conveniently obtained from a cross-validation split, but sufficient ensemble diversity could potentially be induced by random weight initialization alone and similar benefits can be achieved with fewer instances as seen in ablation experiments of the supplementary material and related literature (Fort et al., 2019;Ovadia et al., 2019). Based on the results, even a single mean-variance instance would be viable in practical settings if model size and runtime are of chief concern. The calibration could be adjusted with scaling factors, although it would not benefit from the 12% reduction in MAE achieved by ensembling.
Several additional limitations apply on a methodological level. No independent, external test set was examined, so that no claim can be made about generalization of the trained networks to other studies. The validation and test cases used in this work are furthermore preselected for the intended measurements by virtue of having passed the quality controls of the reference methods. Similarly, certain phenotypes were systematically excluded from the experiments in this paper, such as subjects with knee implants or other severe pathologies. When applied to different imaging devices, protocols, or subject demographics, new training data in the range of several hundred samples would likely be required. In contrast, multi-atlas segmentations with manual corrections have been based on just above 30 annotated subjects (West et al., 2016), whereas neural networks for semantic segmentation typically report training data ranging from 90 to 220 subjects (Fitzpatrick et al., 2020;Bagur et al., 2020) on UK Biobank MRI.
When compared to neural networks for segmentation, the proposed approach accordingly requires more training samples and produces no output segmentation masks. In turn, it can be trained without access to reference segmentations in an end-to-end fashion that does not require the property of interest to be manually encoded in the input data during training. Previous work showed that it outperformed segmentation in estimating liver fat from the two-point Dixon images, possibly by using additional image information that is not easily accessible to human intuition (Langner et al., 2020c), and also accurately estimated other, more abstract properties (Langner et al., 2020b). Likewise, the uncertainty quantification as proposed here can provide error bounds for the measurement that is ultimately of interest for medical research, although approaches for voxelwise uncertainty from segmentation networks have also been proposed in the literature (Roy et al., 2019).
The concept of designing two-dimensional input formats resembles hand-crafted feature selection and it would be preferable to apply a regression technique directly to the volumetric MRI data. No claim is intended for the chosen representation to be optimal as input to the neural network. The MRI volumes could be sliced, projected, or aggregated in various ways, and any signal or phase component may contain valuable information. Despite the empirical success of the presented approach, further improvements may be possible, as the chosen format compresses the MRI data to just 0.5% of its original size and almost certainly results in a loss of information. However, a fully volumetric approach would likely require substantially increased processing time and GPU memory. The proposed approach, in contrast, can run on consumer-grade hardware and achieves relative errors as low as 1.6%, which may be hard to improve much further. Future work may adapt the presented approach to the dedicated liver MRI of UK Biobank, with potential for far more accurate liver fat predictions.
Future work may also explore how the bias between high measurements and high uncertainty can be corrected for and could explore alternative strategies which are known to produce substantially distinct estimates of uncertainty (Ståhl et al., 2020). However, it is unclear whether Monte-Carlo techniques that employ dropout at test time (Gal and Ghahramani, 2016) could reach sufficient predictive performance, whereas more faithful approximations of Bayesian inference with Markov chain Monte-Carlo (Neal, 2012) may not be computationally viable. Deep ensembles are often reported as one of the most successful strategies (Gustafsson et al., 2020b;Ovadia et al., 2019;Ashukha et al., 2020) and a suitable alternative will have to achieve better calibration and sparsification without sacrificing predictive accuracy or exceeding the computational limitations in order to be competitive.
In a large-scale study such as the UK Biobank the main strengths of the proposed approach can be exploited. Without any need for further guidance, corrections, or intervention, these values can be inferred for the entire imaged study population, both for existing and future imaging data. The resulting measurements can be obtained for further study and quality control months or years before full coverage has been achieved with the reference techniques. In practice, researchers may apply this system to obtain automated measurements for all upcoming 120,000 UK Biobank neck-to-knee body MRI scans yet to be released, and will be alerted to potential prediction failures by the predictive uncertainty. Future developments may also yield comparable systems that could ultimately be integrated into scanner software to provide fully automated analyses for specific imaging protocols.

Conclusion
In conclusion, both mean-variance regression and ensembling provided complementary benefits for the presented task. Without extensive architectural changes or prohibitive increases in computational cost they enabled fast and accurate measurements of body composition for the entire imaged UK Biobank cohort. The predicted uncertainty can, despite the specified limitations, give valuable insight into potential failure cases and will be made available together with the inferred measurements for further medical studies.

Supplementary Material
The following pages provide additional detail on predictive performance, sparsification, calibration, and inference with the proposed approach. Unless otherwise specified, all listed results were acquired with the configuration that combines both mean-variance regression and ensembling. The individual targets are furthermore examined in detail and compared to alternative UK Biobank reference values. A PyTorch implementation for preprocessing, training, and inference with mean-variance regression on the given image data is available online.
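As a rough illustration of the combined configuration, the following minimal sketch assumes the common formulation in which each network predicts a mean and a variance, is trained with a Gaussian negative log-likelihood, and the ensemble members are merged as a uniformly weighted Gaussian mixture. The function names `gaussian_nll` and `combine_ensemble` are hypothetical, and the sketch is not taken from the published implementation.

```python
import math
from statistics import fmean


def gaussian_nll(mean, variance, target):
    """Negative log-likelihood of `target` under N(mean, variance).

    Assumed loss for mean-variance regression; not confirmed by the source.
    """
    return 0.5 * (math.log(2.0 * math.pi * variance)
                  + (target - mean) ** 2 / variance)


def combine_ensemble(means, variances):
    """Merge member predictions as a uniformly weighted Gaussian mixture.

    The combined variance is the average predicted variance plus the
    spread of the member means around the combined mean.
    """
    mu = fmean(means)
    second_moment = fmean([v + m * m for m, v in zip(means, variances)])
    return mu, second_moment - mu * mu


# Two hypothetical ensemble members predicting a volume in litres.
mu, var = combine_ensemble([1.0, 3.0], [0.5, 0.5])
print(mu, var)
```

Under this assumption, disagreement between ensemble members increases the combined variance even when each member on its own is confident.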

Datasets and Predictive Performance
The effective number of samples in the three datasets used for evaluation is listed in Supplementary

Overall Calibration
All examined configurations are biased towards overconfidence, consistently underestimating the true prediction errors. The predicted uncertainty should accordingly be scaled up. Suitable target-wise scaling factors can be determined to reach a better calibration on the validation data after training (Guo et al., 2017; Laves et al., 2020). In this work a simple grid search was used, which resulted in the target-wise scaling factors and the areas under the calibration error curve (AUCE) (Gustafsson et al., 2020b) shown in Supplementary Table 4, with calibration plots, or reliability diagrams, shown in Supplementary Fig. 1. The same factors also achieve a considerable improvement when applied to the test data, indicating that the calibration of the proposed method could easily be corrected with this strategy for the normal material of the entire cohort.
Note: Earlier versions of this manuscript reported slightly worse calibration metrics due to flawed reversing of the standard scaling.
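The grid search for a scaling factor can be sketched as follows for a single target. This is a minimal sketch under stated assumptions: the uncertainties are treated as standard deviations of Gaussian predictive distributions, the AUCE is approximated on 99 evenly spaced confidence levels, and the grid range of 0.5 to 4.0 is arbitrary. The names `auce` and `fit_scale` are hypothetical and the discretization may differ from the one used for Supplementary Table 4.

```python
from statistics import NormalDist, fmean


def auce(errors, sigmas, levels=99):
    """Approximate area under the calibration error curve.

    For each nominal confidence level p, the fraction of errors that fall
    inside the central Gaussian interval is compared to p.
    """
    std_normal = NormalDist()
    total = 0.0
    for i in range(1, levels + 1):
        p = i / (levels + 1)
        z = std_normal.inv_cdf(0.5 + p / 2.0)  # interval half-width in sigmas
        covered = fmean([1.0 if abs(e) <= z * s else 0.0
                         for e, s in zip(errors, sigmas)])
        total += abs(covered - p)
    return total / levels


def fit_scale(errors, sigmas, grid=None):
    """Return the factor from `grid` that minimizes the AUCE."""
    if grid is None:
        grid = [0.5 + 0.05 * k for k in range(71)]  # assumed range 0.5 .. 4.0
    return min(grid, key=lambda f: auce(errors, [f * s for s in sigmas]))


# Synthetic, overconfident example: true error spread 2.0, predicted 1.0.
errors = NormalDist(0.0, 2.0).samples(2000, seed=0)
sigmas = [1.0] * len(errors)
factor = fit_scale(errors, sigmas)
print(factor, auce(errors, sigmas), auce(errors, [factor * s for s in sigmas]))
```

In this synthetic case the selected factor should lie close to 2, which mirrors the situation described above where overconfident uncertainties have to be scaled up.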

Detail on Individual Targets
The following pages list dedicated plots for the prediction, sparsification, and calibration of each target. For the test data only the mean-variance ensemble configuration is shown, which was determined to be the best performing approach in cross-validation. Each subsection also includes short discussions and comparisons to alternative reference measurements, which are primarily derived from two main sources. The first source contains body composition measurements obtained by dual-energy X-ray absorptiometry (DXA) as conducted by UK Biobank (Littlejohns et al., 2020). The second source contains additional measurements based on an independent machine learning analysis of the same neck-to-knee body MRI as used in this work, conducted by application 23889 and shared as return dataset 981.
Similar comparisons have been previously reported for a comparable least squares regression technique (Langner et al., 2020b). Some measurements may be highly correlated but yield low agreement due to a shift or scaling difference. Where specified, these alternative measurements were therefore mapped with linear regression to the target values as used in this work, so that agreement values can be reported. Additionally, Pearson's coefficient of correlation r is reported. For a fair comparison, the methods are evaluated on the same subjects.
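The effect of such a mapping can be illustrated with a minimal sketch that fits a least squares line from an alternative measurement to the target and reports Pearson's r. The helper names and the toy values are hypothetical, and the mean absolute error is used here only as a simple stand-in for the agreement metrics reported in this work.

```python
import math
from statistics import fmean


def fit_linear(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def pearson_r(x, y):
    """Pearson's coefficient of correlation between x and y."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def mean_abs_error(x, y):
    return fmean([abs(a - b) for a, b in zip(x, y)])


# Toy data: alternative values in mL, target values in L.
alt_ml = [1000.0, 2000.0, 3000.0, 4000.0]
alt_l = [v / 1000.0 for v in alt_ml]          # convert mL to L first
target = [2.27 * v + 0.83 for v in alt_l]     # perfectly linear toy target

slope, intercept = fit_linear(alt_l, target)
mapped = [slope * v + intercept for v in alt_l]
print(pearson_r(alt_l, target))               # unaffected by the mapping
print(mean_abs_error(alt_l, target), mean_abs_error(mapped, target))
```

The toy example shows why both values are reported: the correlation is unaffected by a shift or scaling, whereas the absolute agreement only becomes meaningful after the linear mapping.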
The sparsification plots also show oracle sparsification curves (Ilg et al., 2018), which describe a hypothetical optimum that would result from sparsifying with a ranking of uncertainties that corresponds exactly to a ranking of absolute prediction errors. This optimum typically cannot be reached in practice, as it would require imitating not only the desired measurements but also any inconsistencies and noise in the reference techniques themselves. The sparsification for the three evaluation datasets is shown separately, but it is worth noting that in most cases the samples with artifacts incurred the highest uncertainty. When applied to a dataset that included mixed normal material and artifacts, the latter would therefore typically be excluded first in the sparsification. The outlier with the largest prediction error in testing for VAT, SAT, and TAT is the same subject, found to suffer from an atrophied right leg.
Aggregated saliency maps were obtained by generating guided gradient-weighted class activation maps for 3,091 subjects and co-aligning them by image registration (Langner et al., 2019b). Each aggregated saliency map accordingly highlights which anatomical structures were predominantly considered by the network to make predictions for the specified target. For clarity, the visualizations show the aggregated saliency as a heatmap for each of the three input image channels side by side and are provided with and without the template subject anatomy as an overlay. The network weights used for this purpose are based on the mean-variance configuration with a single network trained for cross-validation in this work, in each case using the instance that did not contain the given image in its training set.
Visceral Adipose Tissue (VAT), extended notes: Supplementary Fig. 4 shows a close fit with few outliers in the normal material. In testing, a single subject with an atrophied right leg incurs a substantially overestimated measurement, which can be identified by high uncertainty.
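For reference, the sparsification and oracle curves discussed in this section can be sketched as follows. This is a minimal sketch that assumes the mean absolute error as the reported metric and removes samples in equally sized steps. The function name `sparsification_curve` and the toy values are hypothetical.

```python
from statistics import fmean


def sparsification_curve(abs_errors, ranking_scores, steps=10):
    """Mean absolute error of the samples that remain after removing
    increasing fractions of the samples with the highest ranking score."""
    n = len(abs_errors)
    order = sorted(range(n), key=lambda i: ranking_scores[i], reverse=True)
    curve = []
    for k in range(steps):
        n_removed = (k * n) // steps
        kept = order[n_removed:]
        curve.append(fmean([abs_errors[i] for i in kept]))
    return curve


# Toy example with one large error that also receives the highest uncertainty.
abs_errors = [0.1, 0.5, 0.2, 2.0]
uncertainties = [0.2, 0.3, 0.6, 1.5]

predicted = sparsification_curve(abs_errors, uncertainties, steps=4)
# The oracle ranks by the absolute errors themselves.
oracle = sparsification_curve(abs_errors, abs_errors, steps=4)
print(predicted)
print(oracle)
```

The oracle curve lies on or below the curve obtained from the predicted uncertainties, and the gap between both indicates how far the uncertainty ranking deviates from the ranking of the actual errors.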
Alternative reference methods: UK Biobank field 23289 contains measurements of VAT by DXA for 5,109 subjects. These values were first converted from mL to L and then mapped to the target with the following linear transformation parameters: (2.27x + 0.83 L). UK Biobank return 981 by application 23889 also offers VAT measurements for 9,127 subjects. These values were converted from mL to L, but did not require adjustment by linear regression.
Abdominal Subcutaneous Adipose Tissue (SAT), extended notes: The scatter plot for the test data in Supplementary Fig. 8 shows a single outlier with about 15 L of subcutaneous adipose tissue, for whom the prediction yields almost 20 L with high uncertainty. This subject was found to suffer from an abnormal, atrophied right leg and also incurs high measurement errors in TTM and VAT.
Alternative reference methods: UK Biobank return 981 by application 23889 also offers measurements of subcutaneous adipose tissue volume for 9,379 subjects. These values were converted from mL to L and then mapped to the target with the following linear transformation parameters: (0.98x + 0.46 L).
Total Adipose Tissue (TAT), extended notes: No test data was available for this target.

Alternative reference methods: UK Biobank field 23278 contains alternative measurements of total fat mass by DXA for 5,170 subjects. These values were first converted from mL to L and then mapped to the target with the following linear transformation parameters: (0.80x + 0.51 L).
Total Lean Tissue (TLT), extended notes: No test data was available for this target. Supplementary Fig. 16 shows a curious pattern for the cross-validation, where a subset of measurements is consistently overestimated by about 2 L. The reason for this mismatch is unclear. The affected subjects are not part of the same cross-validation split set, were imaged in different imaging centers, and share no other obvious confounding factors. However, alternative measurements of total lean tissue by DXA (total lean mass, field 23280) independently support these overestimations relative to the reference used in this work. Supplementary Fig. 19 shows a comparison where the reference is plotted against the DXA measurements. All cases that were overestimated by the proposed method by at least 2 L are color-coded and form a pattern similar to that observed in cross-validation.
Supplementary Figure 19. In some subjects (red), the proposed method overestimated total lean tissue (TLT) by at least 2 L. As shown on the right, the DXA scan shows a similar pattern and independently indicates higher values for these subjects.

Alternative reference methods: UK Biobank field 23280 contains additional measurements of total lean mass by DXA for 5,170 subjects. These values were first converted from mL to L and then mapped to the target with the following linear transformation parameters: (0.50x + 0.47 L). On a side note, UK Biobank field 23285 also contains DXA measurements of trunk lean mass, but these values reach lower agreement with the target than field 23280 and were not considered further.
Total Thigh Muscle (TTM), extended notes: Supplementary Fig. 21 shows a close fit with few outliers in the normal material. In testing, a single subject with an atrophied right leg incurs high uncertainty, together with a moderately overestimated measurement. Several other high-valued testing cases are slightly underestimated. Many of the cases with the highest uncertainty show severe fat infiltrations of the thigh muscle.

Alternative reference methods: UK Biobank field 23275 contains measurements of the lean mass of the legs by DXA for 5,170 subjects. These values describe more than just muscle volume, but may still be considered as a proxy. These values were first converted from mL to L and then mapped to the target with the following linear transformation parameters: (0.69x + 0.64 L). UK Biobank return 981 by application 23889 also offers thigh muscle volume measurements for 9,441 subjects. These values were first converted from mL to L and then mapped to the target with the following linear transformation parameters: (1.06x + 0.67 L).
Liver Fat Fraction (LFF), extended notes: The scatter plots of Supplementary Fig. 25 show that a small number of samples in the range of zero to five fat fraction points are severely overestimated, both in cross-validation and testing. Not all of these predictions incur high uncertainty.

Visual control of the affected subjects showed that the predictions by the proposed method often provided a better match to the neck-to-knee body MRI than achieved by the reference values. No obvious confounding factors such as artifacts or high liver iron content were observed. A similar effect was noted in previous work (Langner et al., 2020c) where a least squares regression technique was trained to emulate an alternative set of UK Biobank liver fat measurements, field 22402. As both of these reference fields are based on the dedicated liver MRI instead of the neck-to-knee body MRI used here, a possible explanation could be an unusually severe mismatch of both protocols for these subjects.
On average, LFF incurred by far the highest normalized uncertainties (calculated by dividing the predicted uncertainty by the predicted means) of all targets. Finally, it is worth noting that for this target superior results may be possible when using an input format that only shows a fat fraction slice of the upper body, as previously proposed (Langner et al., 2020c), although no rigorous comparison was attempted in the scope of this work. The technique could also be applied directly to the dedicated liver MRI.
Alternative reference methods: UK Biobank field 22402 contains alternative liver fat fraction values for 4,616 subjects, obtained by mostly manual analysis of dedicated liver MRI (Wilman et al., 2017). Relative to the target used in this work, one outlier subject is overestimated by 24 fat fraction points and no linear transformation was applied.
Aggregated saliency maps (Langner et al., 2019b) for Liver Fat Fraction (LFF) for 3,091 subjects, generated by a single mean-variance network. Each row shows the water, fat, and fat-fraction channels side by side, with the top row showing an overlay on the image data and the bottom row the saliency only.

Inference
The following histograms of Supplementary Fig. 29, 30, and 31 show the reference values in comparison to those measurements predicted for inference on the original imaging visit on dataset D_infer and the later repeat imaging visit D_revisit. All shown data passed the visual quality controls, but no further attempt was made to exclude outliers based on the predicted uncertainty for these plots.
Supplementary Figure 30. Reference and predicted Total Adipose Tissue (TAT) (left column) and Total Lean Tissue (TLT) (right column).