Understanding the unexplained: The magnitude and correlates of individual differences in residual variance

Abstract Behavioral and physiological ecologists have long been interested in explaining the causes and consequences of trait variation, with a focus on individual differences in mean values. However, the majority of phenotypic variation typically occurs within individuals rather than among individuals (as indicated by average repeatability being less than 0.5). Recent studies have further shown that individuals can also differ in the magnitude of variation that is unexplained by individual variation or environmental factors (i.e., residual variation). Why individuals differ in residual variation remains largely unexplained, but the question is important from evolutionary, methodological, and statistical perspectives. Here, we broadly reviewed the literature on individual variation in behavior and physiology, and located 39 datasets with sufficient repeated measures to evaluate individual differences in residual variance. We then analyzed these datasets using methods that permit direct comparisons of parameters across studies. This revealed substantial and widespread individual differences in residual variance. The magnitude of individual variation appeared larger in behavioral traits than in physiological traits, and heterogeneity was greater in more controlled situations. We discuss potential ecological and evolutionary implications of individual differences in residual variance and suggest productive future research directions.

Guppies are the best-represented species, with 16 estimates from 7 studies across four independent lab groups. Next are hermit crabs, all from Mark Briffa's lab, with 9 estimates from 6 studies.
The best estimates (points) of CV_P show a slight positive skew, as does the credible distribution on these estimates, as shown by the asymmetry of the 95% credible intervals. Log-transforming the best estimates does appear to improve normality. This also makes the estimates in later models positive definite (i.e., CV_P > 0). However, log-transforming the credible distributions leads to a strong negative skew, far worse than the existing positive skew.
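This asymmetry can be illustrated with a quick sketch (hypothetical stand-in distributions, not the actual estimates): log-normal draws mimic the positively skewed best estimates, while a roughly symmetric credible distribution near zero becomes strongly negatively skewed once logged.

```python
import numpy as np

rng = np.random.default_rng(0)

def skew(x):
    """Sample skewness (third standardised moment)."""
    x = np.asarray(x, dtype=float)
    return float(((x - x.mean()) ** 3).mean() / x.std() ** 3)

# Stand-in for the positively skewed best estimates of CV_P:
est = rng.lognormal(mean=-1.2, sigma=0.5, size=20_000)

# Stand-in for a roughly symmetric credible distribution on one estimate,
# truncated at 0 because CV_P must be positive:
cred = rng.normal(0.15, 0.05, size=20_000)
cred = cred[cred > 0]

print(skew(est), skew(np.log(est)))    # positive skew; logging restores symmetry
print(skew(cred), skew(np.log(cred)))  # near-symmetric; logging induces negative skew
```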

Funnel plot
To start, here are plots of the best estimates (log-transformed on the x-axis) against the sample size (log(N_observations)) and against the precision, τ (the inverse of each estimate's standard error). When plotted against the sample sizes, estimates seem fairly randomly dispersed, indicating we do not have a large confirmation bias or file-drawer effect. The plots against τ show that larger estimates have lower precision, which is due to scaling and to the lower bound of 0 truncating the estimate variance. This effect is observable in the simulations for Table 3. Due to the skew among estimates, and because CV_P must be greater than 0, I have modelled CV_P on the log scale. However, as the credible distribution on each estimate is closer to symmetry without transforming, I have retained the error variance around each estimate on the observed scale. This means that for the simple first model calculating a weighted mean, the link function is moved to the right-hand side of the equation. This is done through the non-linear commands in brms.
A model of the transformed data would be:

log(CV_P,i) = μ + ε_i, with ε_i ~ Norm(0, se_i)

where se_i refers to the estimated precision of estimate i, i.e. se(SD.CVp). This is shown by the bold line around the point above.
When returned to the observed scale, these error variances become log-normal. Instead, the models below move the log-scaling to the right-hand side and omit the error variances (i.e., the se_i) from the link function. This retains the precision estimates on their original scale. The model can instead be written as:

CV_P,i = exp(μ + obs_i + Study_j) + ε_i, with ε_i ~ Norm(0, se_i)

Thus obs_i and Study_j are log-normal on the observed scale, with the mean given by the fixed effects, while ε_k ~ Norm(0, σ_k).
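As a sketch of this parameterisation (simulated with made-up parameter values, not the fitted brms model), the data-generating process keeps the log-link on the right-hand side while the known sampling error stays additive on the observed scale:

```python
import numpy as np

rng = np.random.default_rng(1)

n_study, per_study = 10, 4
n = n_study * per_study

mu = np.log(0.3)                        # assumed grand mean of CV_P, log scale
study = rng.normal(0.0, 0.4, n_study)   # Study_j ~ Norm(0, sigma_study)
obs = rng.normal(0.0, 0.2, n)           # obs_i ~ Norm(0, sigma_obs)
se = rng.uniform(0.01, 0.05, n)         # reported se(SD.CVp) for each estimate

eta = mu + np.repeat(study, per_study) + obs  # linear predictor on the log scale
cv_p = np.exp(eta) + rng.normal(0.0, se)      # error retained on the observed scale

# exp(eta) is log-normal and strictly positive, so the mean structure respects
# CV_P > 0, while each estimate keeps its original-scale measurement error.
print(cv_p.min(), cv_p.mean())
```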

Physiology vs. behaviour
Here is the analysis splitting physiological and behavioural traits. Physiology is quite broadly defined here, with 2 endocrine studies (2 estimates), 2 metabolism studies (5 estimates), a haemolymph study (1 estimate), and a performance study (1 estimate of sprint speed; Adolph and …). On average, the physiological traits have a lower CV_P than the behavioural traits, though the difference is not significant: -0.105 [-0.215, 0.022]. The haemolymph density study has a high CV_P, which drives the lack of a difference; if we remove this observation, the difference becomes significant. We see in the next section that metabolism is lower, and the two hormone studies also have low CV_P.

Separated by trait
The following analysis breaks traits down in more detail. For 'activity' and 'boldness' we have a good number of estimates: disproportionately guppies for activity and hermit crabs for boldness. Numerous studies end up pooled into "miscellaneous", as we have just one or two examples of those traits; for example, we have only 1 estimate each of haemolymph density, aggression, parental provisioning, exploration, and a few others. I have split these "miscellaneous" traits by whether they are physiological or behavioural.
I have kept separate among-estimate variances for physiology and behaviour.

Field vs. laboratory
Finally, the comparison of lab and field studies. A number of people have posited that individuals may differ in rIIV due to a lack of control over the environment: individuals commonly vary in contextual plasticity, which could give rise to heterogeneous residual variance, or some individuals might be assayed under more variable conditions than others. The analysis here statistically controls for the differences among trait types and retains the separate variances.
As you can see, the opposite is actually true, which we posit in the main text is due to the lack of control swamping out intrinsic individual differences in residual variance; i.e., differences in variability get lost when extra noise is added. This result does require a pinch of salt: the estimates in the field tend to be "miscellaneous" traits, boldness in the field is flight initiation distance rather than latency to emerge, and activity in the field is measured over much larger spatial scales than laboratory open-field assays. There are therefore likely confounds which affect this result in unknown ways.

Publication bias
While at the beginning of this document I plotted effect sizes against sample sizes and estimate precision, another way to evaluate potential reporting biases is to compare the studies which had (n = 39) and had not (n = 25) considered individual differences in residual variance. If these patterns were frequently identified through initial data exploration and visualisation, or simply not reported where the results came out null, then this would be evident when comparing the reported CV_P estimates of the studies that did not consider individual variation in rIIV against those that did.

Correlation with the mean
Next are the analyses of the generality of an association between the mean expression of a trait and rIIV. Only boldness and activity rates have a sufficient number of studies to be informative, and the "miscellaneous" traits are too variable in their significance to be informative.
First, I transform the correlation coefficients to Fisher's Z-correlation: z = (1/2) ln((1 + ρ)/(1 − ρ)). This is done for the upper and lower bounds of the credible distribution, and I then re-calculate the best estimate as the midpoint between those two values. The 95% CRI corresponds to 1.96 standard deviations either side of the mean, so the precision is calculated as σ = (z_upper − z_lower)/3.92.
The inverse of the Fisher's Z-correlation transform, ρ = (exp(2z) − 1)/(exp(2z) + 1), is then applied to the posterior.
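A minimal sketch of these transforms (the function names are my own, not taken from the analysis code):

```python
import math

def fisher_z(rho):
    """Fisher's Z-transform: z = 0.5 * ln((1 + rho) / (1 - rho))."""
    return 0.5 * math.log((1 + rho) / (1 - rho))

def inv_fisher_z(z):
    """Back-transform to a correlation: rho = (exp(2z) - 1) / (exp(2z) + 1)."""
    return (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)

def z_sd_from_cri(rho_lower, rho_upper):
    """SD on the Z scale, treating the 95% CRI as spanning 3.92 SDs."""
    return (fisher_z(rho_upper) - fisher_z(rho_lower)) / 3.92

# Midpoint 'best estimate' on the Z scale, then back-transformed:
lo, hi = fisher_z(-0.007), fisher_z(0.467)
best = inv_fisher_z((lo + hi) / 2)
print(best, z_sd_from_cri(-0.007, 0.467))
```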

Boldness
All of the boldness measures here have bolder animals taking lower values, except for Jolles et al. 2018; i.e., animals with longer latencies to emerge or longer flight initiation distances are considered "shyer". I have multiplied the correlations by -1 so that positive correlations equate to bolder animals being more variable. We have 20 observations of boldness, though one study did not report an estimate for the correlation coefficient and the raw data were not available. Returned from the Z scale to correlation coefficients, r = 0.236 [-0.007, 0.467]. There is a lot of heterogeneity within these studies, as seen by the variance in obs. Even within the hermit crab data, which make up 9 estimates, correlation values range from -0.773 to 0.626.

Activity
For activity, we have 20 estimates, ranging from r = -0.719 to 0.480. This gives a best estimate of r = -0.149 [-0.342, 0.046], so there is no indication of a meaningful correlation on average.

Correlation of rIIV among-traits
In the analyses of the raw data, where multiple traits were available for the same set of individuals, we ran the models as multivariate. This assessed the among-individual correlation in rIIV, i.e. whether individuals are generally (un)predictable across multiple traits. Where traits are functionally linked in the analyses (for instance, Allan et al. 2020 provide data on visual orientation distance and flight initiation distance), correlations at the among-individual level in both means and residual variances will be very high, so correlations in predictability are likely to be driven simply by this link. Residual correlations themselves were not assessable for all pairs of traits, and are problematic to estimate as they assume a constant correlation coefficient while the variances are changing. Instead, we extracted among-individual correlations in mean behaviour from all such data. The direction of the correlation in means is irrelevant for the proposed functional link, so here we use logit(r²) as an estimate of the extent of the correlation in means. Correlations in rIIV are predicted to be positive, so we again use the Z-correlation. Data are limited here, with just 19 estimates from 8 studies. For many studies there was little power to assess this correlation, and those estimates therefore carry very little weight in the model. We fit this model as multivariate to assess 1) the mean correlation of rIIV between different traits and 2) whether the extent of a correlation in means predicts the correlation in rIIV.

Simulation of CV_P
Here, I present the code which created the simulations that went into Figure 2. Firstly, I have created a simple simulation with all individual means fixed to 0. I have re-standardised the estimates throughout so that they are centred on 0 and the variances among BLUPs match their intended values, i.e. they are not perturbed from those values by random chance. The individuals in the simulation are then ranked on the observed data, not the values they were simulated from. This means that even for the null scenario, there appears to be some fanning of variance from left to right.
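A sketch of that null simulation (my own minimal re-creation, not the original code): individual means are fixed at 0, all residual SDs are identical, and individuals are then ranked by their observed spread.

```python
import numpy as np

rng = np.random.default_rng(42)
n_ind, n_obs = 30, 20   # as in the main-text figure

# Null scenario: identical means (0) and identical residual SD (1).
data = rng.normal(0.0, 1.0, size=(n_ind, n_obs))
data -= data.mean(axis=1, keepdims=True)   # fix each realised mean at exactly 0

# Rank individuals by their *observed* SD, as the figure does.
obs_sd = data.std(axis=1, ddof=1)
ranked = data[np.argsort(obs_sd)]

# Even though every individual was simulated with the same SD, the ranked
# observed SDs fan out from left to right purely through sampling noise.
print(np.sort(obs_sd)[[0, -1]])
```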

Plot code
This initial plot is the one which appears in the main text, with 30 individuals simulated with 20 observations for each individual.

Small sample size
Finally, as stated in the main text, the illusion of among-individual variation can be seen when sample sizes are small. While the models do a good job of parsing this sampling noise out (if distributional assumptions are met), it does mean we should be careful in how we perceive effects when exploring and visualising data. Here is the equivalent plot to the one above, except that I have instead taken the lower threshold of the datasets we considered: 5 observations per individual. Sampling more individuals will of course help the model's power, but will not help in this visualisation exercise.
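The effect of sample size on this illusion can be sketched by comparing how much the observed per-individual SDs spread out under the null at 5 versus 20 observations per individual (again a hypothetical re-creation, not the original code):

```python
import numpy as np

rng = np.random.default_rng(7)
n_ind, n_sims = 30, 2_000

def sd_spread(n_obs):
    """SD of the observed per-individual SDs under the null, averaged over
    many simulated datasets: how strongly the SDs appear to fan out."""
    draws = rng.normal(0.0, 1.0, size=(n_sims, n_ind, n_obs))
    return draws.std(axis=2, ddof=1).std(axis=1).mean()

spread_5, spread_20 = sd_spread(5), sd_spread(20)
print(spread_5, spread_20)  # apparent heterogeneity shrinks as n_obs grows
```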