A big-data approach to understanding metabolic rate and response to obesity in laboratory mice

Maintaining a healthy body weight requires an exquisite balance between energy intake and energy expenditure. To understand the genetic and environmental factors that contribute to the regulation of body weight, an important first step is to establish the normal range of metabolic values and primary sources contributing to variability. Energy metabolism is measured by powerful and sensitive indirect calorimetry devices. Analysis of nearly 10,000 wild-type mice from two large-scale experiments revealed that the largest variation in energy expenditure is due to body composition, ambient temperature, and institutional site of experimentation. We also analyze variation in 2329 knockout strains and establish a reference for the magnitude of metabolic changes. Based on these findings, we provide suggestions for how best to design and conduct energy balance experiments in rodents. These recommendations will move us closer to the goal of a centralized physiological repository to foster transparency, rigor and reproducibility in metabolic physiology experimentation.


Introduction
Mice are an instructive tool for the study of human metabolism as they can mirror human physiology in their responses to age-related and diet-induced obesity, and their physiological compensations to resist weight loss . The study of genetic and environmental factors influencing energy balance using laboratory animals has been advanced by the introduction of indirect calorimetry systems (Even and Nadkarni, 2012). Indirect calorimeters measure metabolic rates using gas sensors to capture rates of change in O2 consumption and CO2 production within sealed cages. Physical activity is monitored by recording infrared beam breaks or using electromagnetic receivers. Food intake is measured using sensitive mass balances. Weight gain results when food intake outpaces metabolic rate. Inversely, when food intake falls below metabolic rate, weight loss ensues. Physiological constraints limit both unrestrained weight gain and weight loss. The ability to promote durable weight loss through increases in metabolic rate or decreases in food intake is a major therapeutic goal in the context of the modern obesogenic environment.
It is important to be able to compare the physiological data across papers studying metabolism. However, comparing results from published indirect calorimetry studies has been hampered by inconsistent application of analytical techniques. In studies of obesity, metabolic rates of mice with different body compositions are sometimes subjected to inappropriate normalization attempts (Arch et al., 2006;Katch, 1972;Kleiber, 1932;Tanner, 1949). Position papers decrying this situation have been routinely produced, and just as often ignored. A strongly worded commentary provoked a series of responses, including one from authors accused of incorrectly representing their metabolic analyses (Butler and Kozak, 2010;Tschöp et al., 2012). An additional result of the commentary was a new appreciation of how to approach the large amount of data resulting from indirect calorimetry experiments (Mina et al., 2018). Most groups have agreed that the optimal data treatment uses the ANCOVA, an ANOVA with body mass or body composition as a covariate (Arch et al., 2006;Kaiyala, 2014;Kaiyala et al., 2010;Kaiyala and Schwartz, 2011;Speakman et al., 2013;Tschöp et al., 2012). However, the implementation of these recommendations has been uneven (Fernández-Verdejo et al., 2019). To address this problem, we developed CalR, which facilitates investigators uploading and analyzing their calorimetry data by automating many of the routine steps of data curation, pre-defining statistical significance cutoffs, and allowing users to automatically perform appropriate statistical tests for interaction effects (Mina et al., 2018). CalR is the first step toward standardization of indirect calorimetry analysis across different equipment platforms. However, we realized that the results provided by CalR included measures of statistical significance, but were lacking critical physiological context.
To establish the normal range of metabolic rate under standardized experimental conditions, we analyzed 2 large independent datasets (see Methods). For our primary training model, we utilized a dataset from the Mouse Metabolic Phenotyping Centers (MMPCs) representing longitudinal data from 4 sites in the United States where groups of male C57BL/6J mice were followed for 12 weeks on either a standard low-fat diet (LFD) or an obesogenic high-fat diet (HFD). The results of our analysis from the MMPC experiment were applied to a larger secondary dataset from the International Mouse Phenotyping Consortium (IMPC), an ambitious large-scale project with metabolic data from more than 30,000 mice, seeking to determine the genetic contributions to mammalian physiology. Here we present the results of our analyses from these 2 datasets, and provide generalized recommendations to facilitate the creation of a centralized metabolic data repository.

Factors influencing metabolic rate in mice
To understand the effects of body mass on mouse metabolism, cohorts of genetically identical mice were shipped to 4 independent US MMPCs and assessed longitudinally over 12 weeks in indirect calorimeters while on LFD or HFD. Weekly body masses were similar among institutions for mice on LFD but diverged significantly among mice on HFD (Figs. 1A and 1B). Regression plots of 24 hr average energy expenditure (EE) versus total body mass showed distinct site-specific metabolic rates which could be attributed to differences in body mass for mice on LFD or HFD ( Fig. 1C and 1D). Mice at each site and overall had a positive association between mass and EE. This mass effect reflects the biological consequence of Newton's 2 nd law of motion, whereby movement of greater mass requires more energy, and by extension larger animals require more energy for both resting metabolism and to perform work under standard conditions (Kleiber, 1932;White and Seymour, 2005). For mice on LFD, animals at each institution formed distinct slopes and intercepts indicative of location-specific differences in housing temperature, experimental apparatus, microbiome, and other factors. For animals on 4 or 11 weeks of HFD we observed identical slopes with different intercepts (Fig.  1D). This suggests that the relationship of EE to mass among genetically identical mice is similar despite absolute differences in EE at different locations. We also found that institutional differences in metabolic rate of wildtype (WT) mice introduced the largest source of variation in metabolic measurements. 4 colony. The mass variation was not observed for mice on LFD, suggesting strong effects driving obesity which are not encoded by genetic variation in C57BL/6 mice (Fig. 1E). We sought to quantify the relative contribution of the likely sources for the variation in EE using a widely adopted method (Groemping, 2006). The R 2 of 80% reflects total explained variance and a good overall fit. This analysis reveals the largest source of variation in metabolic rate is the institutional site of experimentation. Other sources include body composition, activity, time of day (photoperiod), diet and acclimation (Fig. 1F).

Role of acclimation on indirect calorimetry data
Rodents engage in a behavioral response when challenged with a novel environment (Archer, 1975). While there are no firm guidelines for how long mice may need to acclimate to indirect calorimeters, age, strain and diet can impinge on acclimation time. Here we defined the acclimation period as the first 18 hr or the first full photoperiod in the calorimeter. Analyses of the pre-acclimation and post-acclimation dark photoperiods were performed for each site. The hourly mean values for EE, energy intake, and respiratory exchange ratio (RER) from all 4 sites are plotted vs time. The strong effects of entrainment to 12 hr light/dark photoperiods produce differences at all sites for all parameters measured. The effects of acclimation were highly variable, differentially affecting mice at each location. At one of the 4 sites, EE increased following acclimation (Supp Fig. 1A, 1B). Similarly, energy intake was increased at only one site in non-acclimated mice (Supp Fig. 1C, 1D). The RER value, representing substrate oxidation, was significantly altered in 3 of 4 sites (Supp Fig. 1E, 1F). In this study, locomotor activity was not significantly impacted by acclimation (Supp 1G, 1H). Overall, acclimation adds a small but unpredictable level of noise and variation to the measurements tested.

Role of Mass and Body Composition in Energy Expenditure
How HFD feeding contributes to obesity has been a topic of detailed investigation. Both male and female C57BL/6 mice typically gain weight on a diet high in fat (Johnston et al., 2007;West et al., 1992). The degree of weight gain however is highly variable and consequently there is considerable heterogeneity in the underlying mechanisms involving energy intake, metabolic rate, and absorption (Kohsaka et al., 2007;Lin et al., 2000;Mayer and Yannoni, 1956;Storlien et al., 1986;Yang et al., 2014). Here we compare the calorimeter-determined energy intake and EE responses of WT male mice to HFD feeding at 4 different locations. After 11 weeks on LFD or HFD, as before, the EE was dependent on body mass (slopes of kcal/hr vs gram body weight) for all mice at all sites. However, HFD did not significantly alter this relationship except at one center ( Fig. 2A). We fitted the LFD mouse data at each site to a model including body mass and activity to predict expected values for mice on HFD. The difference from the expected values, or residual values from this fit show an overall decrease in HFD mice relative to the predicted EE (Fig. 2B). We similarly examined how rates of energy intake changed with diet and found a surprising degree of variance among sites (Figs. 2C and 2D). Body weight is ultimately affected by long-term differences in energy balance (energy intake minus EE). When calculating the difference between calories consumed and calories expended, as before, we observed a large variation in responses to HFD, including positive and negative changes in energy balance (Figs. 2E and 2F). However, energy intake was the primary driver of energy balance. These findings serve to underscore the large site-specific effects that contribute to variability of metabolism. To evaluate the contribution of body composition to the variable metabolic rates observed between sites, we used lean mass values as the covariate for our ANCOVA analysis (Supp Fig. 2A-2C). In 3 of 4 sites HFD altered the dependency of EE on lean mass. At one site HFD altered the dependency of energy intake on lean mass. We also explicitly calculated the relative contribution of fat mass to EE using multiple linear regression and found an unexpectedly large range of fat mass contributions varying from +33% to -19% (Supp Fig. 2D). When we examined the time course of the calorimetry data for all the centers, they demonstrated the characteristic circadian patterns of EE, activity and RER in animals on LFD during the 3-day period; with peaks in the dark phase (Fig. 2G,2L and 2H). Moreover, on HFD, EE was increased, and RER was decreased. Total energy intake and cumulative intake were increased by HFD, yet energy balance was unaltered by diet reflecting a neutral energy balance and weight stability (Fig. 2K) while in the calorimeter. We also plotted change in energy balance vs weight change between the start and end of the first calorimetry experiment at week 0 (Supp Fig. 3A). The normal circadian patterns seen in these mice, as well as the positive linear relationship between their energy balance and mass balance (i.e. animals in positive energy balance tend to gain weight) speak to the quality of the acquired data.
Application to large-scale dataset. A role for location, temperature, mass and sex To understand the relevance of the initial analysis, we applied this approach to the IMPC dataset (Rozman et al., 2018). We similarly plotted the change in energy balance vs weight change between the start and end of the IMPC calorimetry experiments. The data appear to be of high quality as indicated by the expected positive linear relationship. Furthermore, mice in positive energy balance showed a higher RER consistent with carbohydrate oxidation, and mice with weight loss exhibit a lower RER consistent with oxidizing endogenous fat stores (Supp Fig. 3B).
The IMPC data demonstrate a pronounced mass effect, representing that bigger male and female animals expend more energy under standard housing conditions. The IMPC data include results of indirect calorimetry performed on 33,279 distinct animals including 9,545 WT C57BL/6N mice at 10-11 weeks of age. 69% of these mice were male. The mean body mass of female mice is lower than that for male mice (at 21.7 g and 27.3 g respectively; Fig. 3A). The slope of the relationship between EE and body mass in WT male mice is greater than in female mice. While they have similar EE rates at lower masses (18-23g), they diverge significantly at higher masses (Fig. 3B). Relationship of EE vs total body mass for male WT mice at each of the 10 IMPC sites (D). Reported ambient temperatures for the experiments performed at each site (E). Relationship of EE vs total body mass in WT males for temperature (F) and season (G). Relationship of EE vs ambulatory activity in WT mice (n=3,941 males) (H). The percent variation in EE explained by each of the listed factors (I). Percent contribution of fat mass to EE by site in WT males (J). n = 6,590 males and 2,955 females unless otherwise specified.

Variability in WT female mouse metabolic rate
There is no indication that metabolic rate of female mice is more variable than that of male mice; a long-held assumption in metabolic research, despite recent contradictory evidence (Prendergast et al., 2014). Studies on mice of both sexes were performed at 7 IMPC institutions; 3 institutions examined only male mice (Supp Figs. 4A and 4B). To determine the variability in metabolic rate, we fitted EE to the available data for each institution and both sexes. The R 2 value of this fit represents the quality of the fit, with larger numbers representing a better fit, with lower unexplained variability. The R 2 varied considerably by site, likely due to 6 sites reporting locomotor activity values and only 3 sites reporting precise ambient temperature data, while all sites reported body mass. Despite the great variation between sites, R 2 values were similar between males and females at 6 of the sites (+/-9% difference). The remaining site had a female R 2 of less than 0.001, a poor fit, due to a small number of mice with little variation in mass, the sole predictor in this model, and therefore not an instructive example. The Canadian site reported mass, temperature and activity, and despite having the third smallest sample size had the best fit with R 2 female = 0.507 and R 2 male = 0.468. Combining the data from both sexes at this site further improved the fit to 0.581 suggesting only minor differences in metabolic rate among female and male young, chow-fed WT mice. Clearly, the more precise the covariate information reported, the better the explanatory value of the model. As the sample sizes of the 2 sexes are unequal in these sites (2,929 females and 6,569 males) we also examined the distribution of variance in male and female mice with a modified quantile-quantile (Q-Q) plot. Here the slope of the blue and pink lines represents the theoretical standard deviation (SD) for the EE of each sex. Parallel lines indicate similar variability; while points not falling on the line represent variation from the normal distribution. The greater slope for male mice is indicative of greater variability in the data. Overall, this representation suggests that female mice have a qualitatively better fit and lower overall variation (Supp Fig. 4C).

Institutional variability
Due to the larger quantity of male mouse data available for subsequent analysis of control and experimental mice, we present results only for male mice, but we find qualitatively similar metabolic results for WT male and female mice in the IMPC dataset. The 10 sites represented in the IMPC data each reported between 237 -1368 male WT mice (Fig. 3C). We observed more than a 5-fold difference in metabolic rate slopes (0.28-1.56 x 10 -3 kcal/hr vs gram body mass) between sites despite experimental protocols on mice of similar ages and similar diets (Fig. 3D). To further understand these differences, we examined other factors which influence metabolic rate including mass, temperature, season, and locomotor activity. The ambient temperature for mouse experiments varied by site. A range of ambient temperatures were reported for experiments performed by the sites in Canada, Germany, and Oxfordshire, UK. However, most sites report one single nominal temperature (e.g. 21.0°C) which likely does not precisely reflect daily variation in room temperature, nor the actual cage temperature, and may be a major source of unexplained variation (Fig. 3E). WT mice housed at low temperatures had the highest EE, and conversely mice housed at warmer temperatures had the lowest EE ( Fig. 3F). We next examined the effect that time of year (i.e. season) might be having on metabolic rate. In the wild, mammals have seasonal differences in metabolic rate. There is evidence that despite the climatecontrolled atmosphere of a vivarium, laboratory mice could similarly detect changes in season (Drickamer, 1977). All sites contributing data (Canada, China, France, Germany, Japan, Korea, UK, and USA) are located in the Northern Hemisphere and experience seasonal differences (Fig. 3C, inset). Within these data, there were no significant differences in EE by season, suggesting that seasonal variation was not a significant contributor to experimental variability (Fig. 3G). Differences in locomotor activity can contribute to whole-body EE, but typically have a negligible effect at non-thermoneutral temperatures (Virtue et al., 2012). We find a strong positive correlation between ambulatory locomotor activity and EE in the 6 sites reporting movement data (Canada, Oxfordshire UK, Japan, France, Korea, and Cambridgeshire UK) (Fig. 3H). We examined the relative importance of these effects to predict rates of EE (Fig. 3I). As with the MMPC dataset, institution is by far the largest single predictor of variability in rates of EE. Unlike the MMPC dataset, more than 50% of the variation in the IMPC data was unaccounted for by examining only these reported variables, suggesting that our model is as yet incomplete. We also examined the contribution of fat mass to overall EE and found a highly variable contribution from +52% to -70% among sites (Fig. 3J). The variability is likely due to their low adiposity as these animals are on chow diets. Overall, our analysis of 9,545 WT mice identifies institution, body composition, housing temperature, locomotor activity, and sex as contributing variables to EE.
To test the applicability of our analysis to the IMPC dataset, we chose the data from the 3 sites that reported both body composition and ambient temperature to within 0.1°C: Toronto Centre for Phenogenomics (Canada), Helmhotlz Munchen (Germany), and MRC Harwell (Oxfordshire, UK). We examined the contribution of total body mass to EE. Within this subset of 2010 male and 1155 female mice we re-examined sex as a mass dependent covariate. We find no significant difference between the EE rates of male and female mice (Fig. 4A). Using total body mass as a covariate, we observe nearly identical slopes for EE vs mass at each institution (Fig. 4B). When temperature effects were examined at these sites, mice maintained at warmer temperatures had shallower slopes than mice maintained at colder temperatures (Fig. 4C). We find an improved R 2 of 67% for this dataset of 3,165 mice (Fig. 4D). This example accounts for only 3 sites, but now has precise ambient temperature, mass, sex and season for all animals. These data do not account for locomotor activity. This greater fit has improved predictive powers over sites without precise temperature recording.

Variability in KO phenotypes
After observing the large inter-site variation for WT mice, we next examined whether phenotypic differences due to genetic modulation could be observed reproducibly at multiple sites. In the IMPC dataset, there were 6 strains examined in at least 5 locations, Ap4e1 -/-, Dbn1 +/-, Dnase1l2 -/-, Nxn +/-, and Prkab1 -/-(AMPKb1 -/-), and Rnf10 -/-. We also compared results for the Cdkal1 -/gene from 2 IMPC sites with our results (Massachusetts, USA) (see Methods). We plotted EE vs body mass for male mice at each of the sites to examine phenotypic variability (Fig. 5A). This representation plots each individual mouse at each site but does not convey the difference from the local WT populations and thus is challenging to make direct comparisons. To provide context for the magnitude of these phenotypes, institution, mass, activity, and temperature were fit to a multiple linear regression model for WT male mice. We calculated the deviation from the model at each site and plotted these residual values (Fig. 5B). Four of these strains have not previously been published and thus the expected phenotype is not known. The only strain with consistent phenotypic findings was Cdkal1 -/-, where EE values similar to WT values were observed for mice at all 3 locations tested. Unidirectional phenotypes were observed for 3 strains. Significantly increased EE was observed Ap4e1 -/and Dbn1 +/mice at 3 of 9 sites and 3 of 7 sites, respectively. Nxn +/mice had significantly lower EE at 3 of 8 sites. Bidirectional changes were observed for the remaining 3 strains. Dnase1l2 -/mice were similar to WT controls at 6 sites, with 2 sites showing significantly altered EE, with one higher and one lower. Prkab1 -/mice were similar to controls at 5 sites, consistent with results from an independently generated strain of AMPKb1 deficiency which found no significant EE phenotype (Dzamko et al., 2010). Yet at two sites Prkab1 -/mice were statistically different, higher and lower than controls at 1 site each. Lastly, Rnf10 -/mice had the greatest variability. Variants in RNF10 are associated with obesity in a population of Pima Indians (Huang et al., 2014). For Rnf10 -/mice, EE was significantly increased at 2 sites, unchanged at 2 sites, and decreased at 1 site. Consistently the largest residual values were from sites that did not report precise temperature values or locomotor activity, pointing towards incomplete modeling from these locations. These results illustrate the large phenotypic variability among mice with identical genetic perturbations observed at different locations.

Model application: genes with known effect on body weight
Our analysis helps to quantify the magnitude by which site, temperature, mass, and sex contribute to metabolic rate in WT mice. However, genetic variations can also impact metabolic rate. One of the goals of the IMPC is to use mouse models to understand the contribution of genetic factors to phenotypic variation. The IMPC dataset contains the metabolic analysis of 2,336 gene KO strains. Using the multiple linear regression model fitted from WT male mice, we examined the phenotype of all KO strains. To find the strains of KO mice with the greatest impact on metabolic rate we fitted the effects of institution, mass, sex, and ambient temperature. After fitting the mean EE of each KO strain to the model, we plotted the unexplained metabolic effect, or EE residuals. The residuals are normally distributed around 0 for male mice (Fig. 6A) suggesting the model is appropriately fitted (Fig. 6B). The results of this analysis are included in Supp. Table 1. To validate this data treatment, we plotted the IMPC data for both EE vs total body mass (Fig. 6C) and residual EE values from the regression model vs total body mass (Fig. 6D). Using the IMPC data, we plotted the results for 9 genotypes previously shown to affect body weight in mice. These strains affect weight by different mechanisms, including by increasing food intake, limiting metabolic rate, and affecting energy absorption. Food intake drives obesity in strains deficient for melanocortin 2 receptor accessory protein 2, Mrap2 -/- (Asai et al., 2013), carboxypeptidase E, Cpe -/- (Alsters et al., 2015;Naggert et al., 1995), proprotein convertase 1, Pcsk1 +/- (O'Rahilly et al., 1995;Zhu et al., 2002), or growth differentiation factor 15, Gdf15 -/- (Tsai et al., 2013). Low metabolic rate drives obesity in mice lacking growth hormone, Gh -/- (Meyer et al., 2004). Relatedly, at standard room temperatures, uncoupling protein 1 deficient Ucp1 -/mice have reduced metabolic rate, but obesity is not observed until housing under thermoneutral conditions (Enerbäck et al., 1997). We also plotted the metabolic rate of strains which promote leanness due to high metabolic rate, Pparg +/-, Fgfr4 +/-, Fgfr4 -/-, and Acer1 -/-. Peroxisome proliferator activated receptor-γ, Pparg heterozygous mice have increased metabolic rate driven by increased locomotor activity, controlled by the central nervous system (Lu et al., 2011a;Lu et al., 2011b). Elevated metabolic rate in mice with fibroblast growth factor receptor 4, Fgfr4 knockout and antisense inhibition have elevated metabolic rate likely due to increased levels of FGF19 and FGF21 (Ge et al., 2014;Yu et al., 2013). Alkaline ceramidase 1, Acer1 deficient mice have a skin barrier defect and progressive hair loss, likely accounting for compensatory increases in metabolic rate (Liakath-Ali et al., 2016). The uniform phenotyping standards of the IMPC allow a comprehensive comparison of relative phenotypic magnitude. Institutional effects on EE can make the true magnitude of effects between control and experimental strains difficult to discern in regression plots of EE vs mass (Fig. 6C). Instead, when plotting against residual values, it is apparent that 5 of these genes have phenotypes within 1 SD of the predicted mean (Fig. 6D). While the modestly decreased EE may contribute to obesity in the Cpe -/-, Gh -/-, and Mrap2 -/mice, food intake and other factors are likely the primary drivers as reported. There was no suggestion of altered EE in Gdf15 -/-, similarly suggesting other mechanisms at work in the control of body weight. The -1.0 SD phenotype observed in the Pcks1 +/mouse likely has a large impact on EE to predispose to obesity (O'Rahilly et al., 1995;Zhu et al., 2002). The hypermetabolic 2 SD and 3 SD effects of Fgfr4 +/and Acer1 -/contribute to resistance to obesity, while the effects of PPARg heterozygosity are modest. The EE effects have been combined with other assays including skin barrier function or HFD challenge to explain these phenotypes. In concert with comprehensive phenotyping, the IMPC data have confirmed these previously published findings and demonstrated the robustness of the IMPC experiment.

Model application: unknown genes
We next examined the human genetic loci linked to increased risk for obesity and body weight related traits through genome wide association studies (GWAS). The IMPC strains with a gene in a mapped obesity locus (n = 42) with phenotypes +/-1 SD from the mean (n = 7) were plotted ( Fig. 6E and 6F). We found 4 strains with lower metabolic rate that might predict obesity susceptibility, Pax5 +/- (Melka et al., 2012), Chst8 -/- (Tachmazidou et al., 2017), Tfap2b +/- (Lindgren et al., 2009), and Pald1 -/- (Cotsapas et al., 2009). We also found 3 KO strains which might protect from obesity, Pepd -/- (Shungin et al., 2015), Klf12 +/- (Jiang et al., 2018), Figure 6. Genetic contribution to EE. The residual change in EE compared to site-appropriate controls for males of 2,336 strains (A). Distribution of residual changes in EE for all KO strains with the numbers of KO strains reaching 1SD (green), 2SD (orange), or 3SD (red). Regression plots of EE vs mass (C) or residual EE vs mass for mice from mouse strains with known metabolic phenotypes (D). Lines showing ranges of standard deviations in EE are overlayed. Similar plots for IMPC strains with genes identified by obesity related GWAS and having phenotypes of more than 1 standard deviation (E, F). n = 23,739 males. and Pacs1 +/- (Wheeler et al., 2013). Further characterization of these strains, including exposure to HFD, may reveal insights into the phenotypic effects driving common forms of human obesity.

Physiological context
The normal distribution of changes in EE which are not accounted for by site, mass, and temperature is relatively narrow, with 2/3 of all strains (1 SD) having modest differences of +/-0.036 kcal/hr. To place these values into a physiological context we compared the magnitude of examples of the effects of age, voluntary exercise, adrenergic activation, and temperature on metabolism in C57BL/6 mice (Fig. 7A). The effects of aging on basal metabolic rate, as reported by Hootkooper et al. (Houtkooper et al., 2011) were modest, with differences at 15, 55, and 94 weeks of age differing by less than 0.01 kcal/hr on average, all corresponding to less than 1 SD (Fig. 7B). In voluntary wheel running, O'Neal et al. (O'Neal et al., 2017) report a strong (3 SD) induction of EE after one week, which decreases to a 2 SD effect at 2 and 3 weeks of exercise (Fig. 7C). In mice housed at thermoneutrality, administration of the b3 adrenergic agonist CL316,243 produces a >3 SD effect over 3 hours (Fig. 7D). We also plotted metabolic rates at ambient temperatures varying from 6-30°C (Fig. 7E). Using 22°C as the comparator, increasing the temperature just 3 degrees produced a 2 SD effect. Similarly, decreasing temperature 4 degrees has a nearly 3 SD effect. Greater temperature changes produced larger effects on EE except for the transition from 28°C to 30°C which likely reflected a departure from true thermoneutrality. These plots help to provide a frame of reference for the magnitude of effects observed in both the MMPC and IMPC datasets. Changes in EE for physiological challenges including age, voluntary wheel running exercise, b3 adrenergic agonism, warm and cold temperature compared to relative genetic changes as seen in Fig. 6B (A). Regression plots for each intervention. Age (n = 5 per group) (B), exercise (n = 7) (C), b3 adrenergic agonist CL316,243 (n = 10) (D), and temperature (n = 9) (E).

Discussion
Small scale data, large-scale data and data sharing The past decade has seen a transformation in our understanding of mammalian metabolism. New genetic tools, increased commitment to large-scale experimentation, and greater sharing of data are welcome developments. Large-scale systemic experiments such as the IMPC provide an unprecedented view into the genetic pathways that control energy balance in mammals. Yet, despite information on thousands of mice being deposited annually into the IMPC, data from far greater numbers of experimental animals are produced at institutions worldwide. These smaller datasets remain siloed and unshared at great loss to the scientific community. A critical unmet need exists to create centralized data repositories on metabolism if we are to make sense of often conflicting information on the drivers of obesity. Differences in observed phenotypes may well be due to environmental variances including temperature and diet. Preclinical studies in model organisms should pave the way forward for greater understanding of the role of genetic variation, diet, exercise, macronutrient composition, age, and obesity on human metabolism. A metabolism data repository is the obvious next step given these advances. The necessary prerequisites to achieve this reality are: 1) standardization of experimentation and analysis, 2) standardization of data formats, and 3) consensus approaches to essential metadata collection. The results of experiments in mice vary by strain, age, diet, temperature, site, intervention protocols, and other factors, and there can be useful covariates in explaining variance among experiments. The absence of this information can turn an attempt at experimental replication into part of an ongoing replication crisis in biomedical research.

Essential data to report
The goal of the present study was to understand the critical variables that would be necessary to make in vivo studies of metabolic rate broadly comparable. The largest sources of variation observed in the small-scale pilot MMPC experiment were institution and mass, including body composition data. Locomotor activity, photoperiod, diet and acclimation were smaller contributors. In the IMPC dataset, institution, mass and temperature were the largest contributors to variation. The lack of precise temperature information somewhat reduced the utility of the IMPC data at some of the sites. We recommend that in future publications, the minimal necessary information to compare and interpret metabolic studies (Table  1) should be included.

Sources of Variation
At the outset of this project, we envisioned determining the range of metabolic rates observed in WT mice on a LFD or HFD. We predicted a high level of concordance between sites due to the experience, strong reputation for excellence, and know-how in metabolic phenotyping at the participating sites. The MMPC and IMPC experiments were both expressly designed to minimize variation between sites by standardizing protocols, and by studying age and diet-matched animals. Despite these safeguards, we saw a large variance in metabolic rate in each experiment (Figs. 1 and 3). For the MMPC, these small-scale studies (n = 6-8 mice per group) are typical and can be instructive. In this case, the response of these mice to HFD was highly variable with mice at one site gaining 3 times more fat mass than mice at another site. Despite the large differences in phenotype, these 4 MMPC sites produced high quality data on body mass, locomotor activity, photoperiod, diet, and acclimation which accounted for 80% of the variation. There are no right and wrong phenotypes, only the phenotypic differences in institutional responses, which we have quantified but have yet to fully elucidate. Here institutional differences provide the single greatest source of variation The IMPC data afford the largest evaluation of non-genetic and genetic sources of variation yet described. This dataset is both impressively large and yet still narrowly focused on a single age and genetic background, on a standard chow diet. Expanding the types of experiments, strains, and covariates beyond those studied by the MMPC and IMPC will be essential for greater understanding. Our analysis helps to quantify the relative contributions of well-known environmental variables including site, mass, temperature, activity, sex, and season. There was again a large institutional variability not accounted by these factors. This was further exemplified both by the comparison of 9,545 WT mice (Fig. 3B) and by comparing the same KO strains at up  (Fig. 5). We observe phenotypic differences for specific KO strains at some locations, but not at others. However, for most strains we observe good agreement. Understanding the drivers of this variation can help increase reproducibility. Pooling data from multiple sites can also help to reach consensus. The variation or differences among sites is one component of the reproducibility crisis (Drucker, 2016) and understanding institutional variability is a key to increasing reproducibility of metabolic data.
Our current understanding of the critical covariates is as follows.
Sex: We found similar variability of WT female mice compared to male mice in the IMPC experiment. Indeed, we were surprised at the small contribution of sex to metabolic rate. The differences between female and male mice were largely due to the smaller body mass in the former, rather than sex per se. These effects would be predicted to become more pronounced on HFD, as body weights of male and female mice would further diverge and the metabolic effects of altered estrogen and androgen levels may become more prominent.
Body mass: The single most consistent result across the 2 experimental datasets is the positive correlation between body mass and EE. With sufficient sample size and variation in mass range, this positive relationship can be observed and is one indication of a successful experiment (Tschöp et al., 2012).
Body composition: Typically, inclusion of body composition can improve the quality of regression plots, especially as fat mass accumulates with aging or obesity (Kaiyala et al., 2010). In the IMPC experiment, mice were maintained on standard chow diets and had small quantities of body fat. Using body composition data rather than whole body mass as a covariate can help to decrease the variation between groups. It has been proposed that lean mass + 20% fat mass might be used as the covariate for metabolic analysis (Even and Nadkarni, 2012), a calculation which has seen limited adoption (Coyne et al., 2019;Del Moral et al., 2016;Dorfman et al., 2017;Montgomery et al., 2013;Piattini et al., 2019;Schöttl et al., 2015;Seyfarth et al., 2015;Zhao et al., 2016). We find the contribution of fat mass to whole body EE to be highly variable and dependent on the particular features of the study group. In our study, lean mass + 20% fat mass was not an appropriate mass covariate, and we believe it is unlikely to be a generalizable solution. Furthermore, using lean mass as a covariate in studies of obese mice fails to capture the metabolic contribution imparted by the adipose tissue; an approach that should be treated cautiously (Tschöp et al., 2012) (Supp Fig. 2).
Temperature: Adding the bona fide experimental temperature improved our ability to account for the variation in metabolic rate at the 3 sites where this information was available. The powerful role of temperature to affect EE is well established (Abreu-Vieira et al., 2015;Mount and Willmott, 1967). It therefore stands that providing precise recorded temperatures would be an easy and effective way to improve the utility of indirect calorimetry results in the IMPC and elsewhere.
Locomotor activity: Most indirect calorimeters developed in the past 10 years will include the ability to record locomotor activity either by incorporating a calculated distance traveled by counting infrared beam breaks or by tracking a device implanted into the animal. A surprising lack of standardization makes direct comparisons between instruments challenging. However, locomotor activity adds a small but significant contribution to variation in EE and these values should be incorporated whenever possible.
Acclimation: Acclimation to the indirect calorimetry cages had only a modest effect in the MMPC dataset. Prior to acclimation, mice tended to eat more food than after acclimation. Consequently, for mice on a high carbohydrate low-fat diet, RER tended to be higher. An acclimation period was not included in a majority of IMPC sites; data were recorded for only 21 hours. The analysis of the MMPC experiments suggest that this had only a small effect on the results presented. However, the rapid adaptation time may be due to the young age of these animals and other experimental strains or conditions may greatly affect acclimation time.
Sample size: Relatively large numbers of subjects are required to find statistically significant differences between groups using ANCOVA, especially with small variance of body weight or metabolic parameters (Arch et al., 2006;Even and Nadkarni, 2012;Fernández-Verdejo et al., 2019;Tschöp et al., 2012). This point is well illustrated by the large effect of LFD and HFD on mice in the MMPC at UC Davis to promote a change in fat mass from 5 to 14 grams in 11 weeks. Yet, despite the obvious phenotypic differences, neither EE nor energy intake was significantly different by ANCOVA in these WT mice with 8 animals per group, suggesting that these studies were underpowered. Together, recording and applying these covariates can maximize experimental reproducibility.

Genetics
Our analysis of obesity-related GWAS results specifically focused on strains which may impact EE. Genetic variants identified with these methods are often outside of protein-coding sequences and may affect positive or negative gene regulation (Buniello et al., 2019). It is therefore of interest that our analysis predicts KO strains which both increase and decrease metabolic rate. One of the biggest advantages conferred by a repository of metabolic data will be the opportunity for researchers to investigate phenotypes of specific alleles or interventions. In the publication describing the metabolic data from the IMPC, the authors chose a ratio-based reporting system based on a residuals from multiple linear regression (Rozman et al., 2018). We have reported the residual as well as standard deviation from the mean. Both approaches are similar, but we find reporting the average mass and deviations of the KO strain can add a dimension of context to the results.
The strains we identified have not been well characterized with regards to their roles in metabolic pathways, suggesting follow-up studies are warranted. Strains with lower metabolic rate which might contribute to development of obesity include Chst8, Pax5, Pald1, and Tfap2b. Chst8, carbohydrate sulfotransferase 8 mediates sulfation of the carbohydrate structures of the sex hormone LH; deletion of Chst8 produces mice with elevated LH levels. Chst8 is also expressed in the hypothalamus and pituitary and is predicted to modify POMC, the precursor to several key hormones involved in the control of stress and body weight regulation (Baenziger, 2014). Pald1, phosphatase domain containing paladin 1, is a regulator of angiogenesis and vascular function. Female Pald1 -/exhibit an emphysema-like lung disorder (Egaña et al., 2017). However, cell culture studies have shown Pald1 to have a role in insulin signaling by negatively regulating insulin receptor expression and phosphorylation (Huang et al., 2009). Pax5, paired box 5, is a transcription factor controlling B cell development (Nutt et al., 1997). Tfap2b, transcription factor AP-2 beta, KO mice exhibit a form of patent ductus arteriosus (Satoda et al., 2000). However Tfap2b heterozygous mice are largely unaffected (Zhao et al., 2011).
Strains with increased metabolic rate include Pepd, Klf12, and Pacs1. Pepd, peptidase D (or prolidase) is an enzyme which cleaves dipeptides including collagen. Patients harboring mutations in PEPD have skin ulcers (Sheffield et al., 1977). Pepd KO mice reveal an important role for this protein in bone and hematopoetic development. These mice have defects including alterations in bone and immune development (Besio et al., 2015). The Pepd gene affects coat pigmentation, suggesting an interaction with melanocortin signaling pathways (Cota et al., 2008). Klf12, kruppel like factor 12 is a transcriptional repressor linked by GWAS to EE (Jiang et al., 2018). Klf12 is necessary for NK cell proliferation in mice (Lam et al., 2019). Pacs1, phosphofurin acidic cluster sorting protein 1, is a protein with a putative role in the localization of trans-Golgi network membrane proteins and has been shown to be important for vesicular trafficking of POMC into dense secretary granules in neuroendocrine cells, a pathway essential to its downstream cleavage into its active peptide hormones. Pacs1 may also affect the function of cilia, a structures that when altered are associated with development of obesity (Schermer et al., 2005).
In summary, these 7 genes may affect EE through roles in POMC or melanocortin signaling, skin barrier function, lung defects, hematopoiesis, and bone formation. Follow-up studies will help to clarify the role of these proteins in mouse biology. Greater understanding in mice would provide testable hypotheses for assignment of causal human genetic variants to clinical phenotypes.

Limitations
Several key limitations of the current study include the currently accepted methodology for analysis which requires reduction of hundreds of data measurements collected for each animal down to a single value for ANCOVA or multiple linear regression-based statistical analysis. The IMPC data provides one data point per metabolic variable (e.g. oxygen consumption or EE) per animal. Further research on time-series could potentially extract meaningful information from these discarded values. One possible source of variation between sites is the different models of indirect calorimetery systems used. This study was unable to directly compare the efficacy of different indirect calorimeter manufacturers due to the large institutional differences in EE even between sites using the same instrumentation. However, a direct comparison of results from mice with two different systems within the same room has recently been reported (Soto et al., 2019). The process of observing mice in an indirect calorimeter may affect their behavior and fail to accurately reflect normal food intake, locomotion, and EE. The energy balance calculated for the MMPC experiment is inconsistent with the body weight gain over the 11-week period on HFD. This finding emphasizes the difficulty in accurately determining energy balance over a few days and may be better captured by longer measurements. It is also possible that the calorimetry environment altered energy balance compared to home cage conditions. It can be challenging to accurately measure energy intake and systems designed to minimize spillage could also impact food intake by making it more difficult for mice to eat. Observing serial measurements of body weight can help to mitigate interpretation of experiments where data unexpectedly reflects weight loss (as in Fig. 1). Both indepth phenotyping and large-scale population-based phenotyping reveal the large effect size of institutional variation. Two possible but untested sources of variation in these datasets include changes in gut microbiota and/or epigenetic changes induced by different environmental triggers. Mice have specific microbial flora which change over time and affect energy balance (Turnbaugh et al., 2006). The microbiome is likely to be an institutionally local phenomenon which can affect metabolic rates; knowledge of microbial populations which correlate with EE may help to refine metabolic studies in the future. Attempts to rescue phenotypic variation by microbiome reconstitution may prove fruitful. Similarly, it may soon be possible to interrogate the set of epigenetic modifications to genes controlling metabolic rate and help to predict or modify metabolic rate accordingly.

Conclusions
There is a surprisingly large degree of phenotypic variation observed in mouse energy expenditure. The two experimental paradigms analyzed were specifically designed to create consistent, reproducible results. Contrary to our predictions, the site of experimentation plays an outsized role in determining the metabolic rate of mice. This variation was sufficiently large to prevent consistent phenotypic conclusions caused by the same genetic interventions. The experimental location variability strongly affected results in both WT and genetically modified strains. These findings suggest that identifying the as-yet unknown sources of experimental variability among sites should become a high priority until we have sufficient understanding to consistently reproduce experimental results at multiple institutions.

MMPC experimental description
Data from the NIH-funded MMPCs represent longitudinal measurements at 4 sites located at University of California, Davis (UC Davis), University of Massachusetts (UMass), Yale University (Yale) and Vanderbilt University (Vanderbilt). Indirect calorimetry measurements were recorded for each mouse for at least 4 days, allowing for analysis of pre-acclimation (the first 18 hr) and post-acclimation (18-96 hr). Unless otherwise noted, our analyses use the post-acclimation data. The 4 geographically distinct MMPC sites used calorimeters from 3 different manufacturers (Columbus Instruments at UC Davis, California and Yale, Connecticut; TSE Systems at UMass, Massachusetts; Sable Systems International at Vanderbilt, Tennessee). Genetically identical C57BL/6J male mice (n= 60, 6-7 weeks of age) originating from the same room in the Jackson Laboratory facility were shipped to each site. On arrival, mice were maintained on LFD for one week and then randomized into two groups for all further experiments. To control for the transition from group to single housing, all animals remained singly housed for the duration of the study. Indirect calorimetry was performed on all mice at each site (week 0) after which one group was given HFD for the next 12 weeks, while the other was maintained on LFD. All mice were returned to the calorimeters at 4 and 11 weeks post diet randomization. Mouse diets were purchased as a uniform lot from a singular production stream for all sites from Research Diets (LFD: D12450B, 10% energy derived from fat, 3.85 kcal/g; HFD: D12452, 60% energy derived from fat, 5.24 kcal/g) and were delivered to each site simultaneously. Energy intake was calculated by taking the product of food intake in grams with the energy density of the diet in kcal. UC Davis used a PIXIMus DEXA scanner under anesthesia for body composition measurements. Other sites used an NMR-based Bruker minispec without anesthesia. Reported room temperatures for the UC Davis, UMass, Yale, and Vanderbilt MMPC sites were 22.0, 21.1, 21.5 and 22.5°C respectively. For indirect calorimetry, the Vanderbilt site used a temperature-controlled environment of 24°C. Beam breaks reported from Columbus Instruments, Sable Systems International, and TSE Systems correspond to different distances. Accordingly, locomotor activity was calculated as beam breaks as a percent of the global maximum per site.

IMPC experimental description
IMPC data were collected as described (Rozman et al., 2018). In version 10.0 of this dataset, the 24 hr averaged metabolic data from more than 30,000 mice is publicly available. The IMPC data uses highly similar indirect calorimetry protocols across 10 sites in 8 countries. All indirect calorimetry data were collected from 11 week old mice, except for body composition data. Lean and fat masses corresponding to week 11 were estimated for each mouse based on body composition percentages collected at week 14. The calorie content and macronutrient compositions of the chow diets used were provided when requested. The IMPC 10.0 sites uniformly provide a single value reported for oxygen consumption, carbon dioxide release, RER, and EE, although additional hourly data is available through their API. Data were not consistently present at all sites for both males and females, and sites did not uniformly report locomotor activity, precise housing temperatures, or food intake data. Similarly, acclimation was performed only at a subset of sites and pre-acclimation data was not readily available. On initial visualization, the data deposited showed several abnormalities. These included one site where larger animals paradoxically consumed less oxygen than smaller animals, demonstrating an inverse mass effect (Supp Fig. 5A). When contacted this site quickly identified and corrected a persistent error in their data analysis pipeline which will be rectified in a future IMPC data release. A different site reported erroneous food intake data with each animal consuming exactly 0.05 g of food per hour regardless of body mass or genotype. This site also provided an updated dataset with corrected food intake values (Supp Fig.  5B). Lastly, we observed differences in methods to calculate EE from indirect calorimetry data. When calculating EE using the Weir formula (Weir, 1949) we find 2 sites which produced different values, likely due to implementation of other formulae including the Lusk equation (Lusk, 1993) (Supp Fig. 5C). To improve consistency, we re-calculated data for all mice using the Weir equation. Erroneously keyed data were corrected as described in Supp Fig. 6. One strain was excluded due to highly variable values within the genotype in a subset of mice, Fbxl19. The acronyms describing the IMPC sites have changed in some instances. For simplicity, the following sites are indicated by the country in which they are located. For countries with 2 sites, the state or county is indicated (Table 2).  (Houtkooper et al., 2011). EE values during voluntary wheel running exercise in 6 month old chow-fed (Teklad 7022) C57BL/6 male mice at 22°C are as reported in the supplemental materials from T.J. O'Neal et al. (O'Neal et al., 2017). Studies of the effect of b3-adrenergic stimulation and of temperature change were performed in a temperature-controlled chamber enclosing the indirect calorimeter. Both studies were conducted in Boston MA.
For b3 stimulation, 6.5-month-old chow-fed (LabDiet 5053) C57BL/6J global Cdkal1 -/or WT littermate male mice were maintained at 30°C for 24 hr. Mice received an IP administration of 1 mg/kg CL, 5 hr into the light photoperiod. The mean EE over the 3 hr post-injection period was used in this analysis. No significant differences in EE were observed between genotypes; accordingly, data for all mice were pooled. EE measurements of Cdkal1 -/or WT littermate control male mice maintained at 23°C were used in Fig 5. The study of the effect of temperature on EE was performed on 10-week-old chow-fed (LabDiet 5053) C57BL/6J male mice with an incremental decrease in temperature. Mice were maintained at 30°C for 24 hr intervals between each 24 hr temperature challenge ranging from 28 to 6°C.

CalR analysis
CalR analysis was performed on the MMPC dataset (Supp Fig. 1) as described (Mina et al., 2018). The "remove outliers" feature was enabled to exclude from analysis data that were recorded during momentary cage opening.

Statistical Methods
Data input, cleaning, visualization, and statistical analysis were performed in the R programming language (R Core Team, 2019). The relative contribution of covariates were calculated with the relaimpo package (Groemping, 2006) The multiple linear regression model and model to quantify the explained variance for the MMPC experiment after 11 weeks on diet included institution, body composition, locomotor activity, photoperiod, diet, and acclimation. These models were not improved by the addition of equipment manufacturer or temperature as each site reported a single, unique temperature, and only 2 sites used a similar manufacturer of indirect calorimeter. The IMPC models included institution, body composition, locomotor activity, ambient temperature, sex and season. These models were not improved with addition of equipment manufacturer. The q-q plot was performed with a geom_qq using EE data from WT mice at the 7 sites examining both sexes (Wickham, 2009). All plots were produced with ggplot2 or CalR (Mina et al., 2018;Wickham, 2009). Unless otherwise specified, all reported data are based on the daily EE (DEE), 24-hr averaged EE values. Experiments were not performed at thermoneutrality, eliminating the possibility of determining basal EE or resting metabolic rate at thermoneutrality (Even and Nadkarni, 2012;Meyer et al., 2015).