New approach for the identification of implausible values and outliers in longitudinal childhood anthropometric data

Purpose We aimed to demonstrate the use of jackknife residuals to take advantage of the longitudinal nature of available growth data in assessing potential biologically implausible values and outliers. Methods Artificial errors were induced in 5% of length, weight, and head circumference measurements, measured on 1211 participants from the Maternal Vitamin D for Infant Growth (MDIG) trial from birth to 24 months of age. Each child's sex- and age-standardized z-score or raw measurements were regressed as a function of age in child-specific models. Each error responsible for a biologically implausible decrease between a consecutive pair of measurements was identified based on the higher of the two absolute values of jackknife residuals in each pair. In further analyses, outliers were identified as those values beyond fixed cutoffs of the jackknife residuals (e.g., greater than +5 or less than −5 in primary analyses). Kappa, sensitivity, and specificity were calculated over 1000 simulations to assess the ability of the jackknife residual method to detect induced errors and to compare these methods with the use of conditional growth percentiles and conventional cross-sectional methods. Results Among the induced errors that resulted in a biologically implausible decrease in measurement between two consecutive values, the jackknife residual method identified the correct value in 84.3%–91.5% of these instances when applied to the sex- and age-standardized z-scores, with kappa values ranging from 0.685 to 0.795. Sensitivity and specificity of the jackknife method were higher than those of the conditional growth percentile method, but specificity was lower than for conventional cross-sectional methods. Conclusions Using jackknife residuals provides a simple method to identify biologically implausible values and outliers in longitudinal child growth data sets in which each child contributes at least 4 serial measurements.


Introduction
Child growth in stature and body dimensions is a continuous and dynamic process that is optimally studied through longitudinal follow-up. As such, many epidemiologic studies are designed to collect repeated anthropometric measures of each child across successive time points to describe growth patterns, identify predictors, and assess associations with later health outcomes [1e3]. Standardized procedures for the measurement and collection of anthropometry data have been established to maximize data quality [4,5], and advanced analytical strategies are employed to accommodate the longitudinal nature of the data collected in such studies [6]; similarly, rigorous quality control procedures during data cleaning are warranted to identify outliers and implausible values.
For cross-sectional studies in which each child only contributes one set of measurements, identification of outliers and implausible values is limited to the use of fixed cutoffs that were derived from a reference population, such as those established for the WHO Child Growth Standards [7] or from the observed distributional properties of the study population. However, with serial anthropometric measurements collected in the same individual, the growth trajectory of each individual provides another basis upon which to assess the biological plausibility of any given measurement for that individual. For example, a decrease in length or height between two successive time points is biologically implausible and thus indicates that at least one of the values was incorrectly measured or recorded. However, without additional information, it is often challenging to determine which of the two values induced the implausible trajectory. Many longitudinal studies use only conventional cross-sectional approaches to identify outliers; when the longitudinal nature of the data is taken into consideration, the methods used to identify these implausible values are often not described, or exclusions are made on a case-by-case basis [8].
Yang et al. [9] recently described the application of conditional growth percentiles for systematically identifying outliers and implausible values in longitudinal childhood anthropometric data. In their example, a hierarchical model of serial weight measurements as a function of age was constructed to estimate an individual's weight percentile at time t, while conditioning on the individual's weight percentile at time tÀ1. This approach implies that the plausibility of a given measurement is solely based on the preceding measurement (and therefore cannot be applied to an individual's first measurement), yet the expected value is contingent on the overall growth trajectory of the study population.
We propose an alternative longitudinal approach to assess for outliers and implausible values using growth trajectories that are fit to each individual's anthropometric data, in which the identification of outliers and implausible values does not depend on the distribution of measurements in the whole study population. In this study, we aimed to use a longitudinal growth data set with artificially induced errors to demonstrate how jackknife residuals of linear models of z-scores or raw growth data as a function of age can be used to identify biologically implausible values and outliers. We further aimed to compare the sensitivity and specificity of jackknife residuals to alternative methods of data cleaning, including conditional growth percentiles and conventional crosssectional approaches.

Data source
Anthropometric data collected from infants enrolled in the Maternal Vitamin D for Infant Growth (MDIG) trial were used. Data collection including anthropometry is ongoing; therefore, data available up to January 26, 2017 were used for this study. The MDIG trial methods have been previously described [10]. In brief, the MDIG trial is a randomized placebo-controlled dose-ranging trial of vitamin D supplementation during pregnancy and lactation in Dhaka, Bangladesh. Length, weight, and head circumference measurements are scheduled in tri-monthly intervals from birth to 24 months of age, plus an additional measurement taken at 2, 4, 6, or 8 weeks of age chosen through random assignment, with some variability between infants in actual timing of measurement collection. Age-and sex-standardized z-scores for length, weight, and head circumference measurements were generated using a combination of growth references: the Intergrowth-21st Newborn Size standards; the Intergrowth-21st International Postnatal Growth Standards for Preterm Infants; and the World Health Organization (WHO) Child Growth Standards (Supplementary Material Methods). Data cleaning of any natural occurring errors in the data set was not conducted to preclude biasing the results in favor of one method over another.

Simulation of implausible values and outliers
Simulated outliers and implausible values were randomly generated in the existing data set by deliberate introduction of data errors. In primary analyses, we randomly selected 5% of all encounters for error induction, in which values were randomly shifted upward or downward in relation to the original observed values. The magnitude of the errors (i.e. differences between original and shifted values) followed a normal distribution of mean ¼ 0 with a standard deviation based off of the derived standard deviation of the raw anthropometric measurements in the WHO Multicentre Growth Reference Study that is the basis for the WHO child growth standards [7]. As such, the standard deviation of the errors varied by type of measurement, sex, and age of the infant. In sensitivity analyses, we used error rates of 10% and 15%, and standard deviations of 2-and 3-times the age-and sex-specific standard deviation for the corresponding anthropometric measure. A Monte Carlo approach was used, in which each scenario for the varying error rates and error generation methods was simulated 1000 times.

Identifying biologically implausible values
Jackknife residuals were applied to identify the incorrect measurement in instances where errors introduced into the anthropometric data set resulted in a biologically implausible decrease in a child's length, weight, or head circumference from one time point to the next. Jackknife (or externally studentized) residuals, r (-i) , are generated from regression residuals, e i , that are scaled by a function of the mean squared error with the i th observation deleted, MSE (-i) , and the leverage, h i : Jackknife residuals are expected to follow a t distribution with (neke2) degrees of freedom, where n is the number of observations and k is the number of parameters in the fitted model, thereby giving the distribution a mean of 0 and standard deviation slightly greater than 1. Given that the k th observation is an outlier, the jackknife residuals of other observations will shrink toward zero due to an overestimation of MSE, whereas the jackknife residual of the k th observation will not. As such, jackknife residuals respond more strongly to the presence of a single outlier than does the standardized residual [11].
Any decrease in raw length or head circumference measurements was considered to be biologically implausible, whereas a decrease of greater than 15% in the raw measurements for weight was considered biologically implausible. This analysis addressed instances in which an error in the data set could be clearly identified, but the exact time point at which the error occurred was not as easily discerned. Therefore, these analyses were limited to children for whom there was at least one implausible decrease induced in anthropometric measures.
All individuals with a biologically implausible decrease between any two measurement time points were first identified based on the criteria listed previously. Separately for each child, linear regression was used to fit a straight line through the individual's sex-and age-standardized z-score of the corresponding anthropometry measurement as a function of age: where "i" denotes the i th individual and "j" denotes the j th time point. For raw measurements, each individual's measurements were regressed on the square root of age (t ½ ) to model a curvilinear relationship in which growth rates vary with age [12]: Each measurement of a given individual is assessed for adequate fit to the modeled trajectory for that individual using jackknife residuals (Supplementary Material Methods). We compared the absolute values of the jackknife residuals for the two adjacent time points that spanned the interval across which each biologically implausible decrease occurred. For each pair of values, the value with the largest absolute jackknife residual was labeled as the incorrect value, irrespective of the absolute magnitude of the jackknife residual.
Kappa statistics were computed to assess the performance of these approaches in identifying the correct biologically implausible value (i.e., agreement between the classification by the jackknife residual method vs. the true classification of values as induced errors or unmodified measurements). Analyses were restricted to infants for whom four or more measurement time points were available, since the residual method cannot be applied to individuals with fewer than (kþ2) measurements available, where k represents the number of parameters in the model (Supplementary Material Methods). In addition, biologically implausible decreases in the original MDIG data set were excluded from this analysis since the time point at which the error occurred is unknown.

Comparison of sensitivity and specificity across methods for identifying outliers
To assess the sensitivity and specificity of the jackknife residual method for detecting induced errors, analyses were no longer restricted to instances in which there were implausible decreases in size between subsequent time points, although many of the identified errors were expected to overlap with the biologically implausible values identified in the previous analysis. The modeling strategies were identical to those described previously. All measurements with a jackknife residual below À5 or above þ5 were considered outliers. Sensitivity analyses were conducted in which cutoffs of ±3 and ±7 were used instead.
The conditional growth percentile method outlined by Yang et al. was also applied. A random-effects model of the raw anthropometric measurement as a function of age was constructed for the whole study population, using a restricted cubic spline with five knots, where knot locations were based on Harrell's recommendations [13]. Conditional percentiles for each measurement were estimated, and measurements which were below À4 SD or above þ4 SD were considered outliers, as implemented by Yang et al. [9]. As per the demonstrations provided by Yang et al., conditional growth percentiles were calculated only for raw measurements and not for their corresponding z-scores because one of its strengths is that the method can be applied even when external standards are not available.
Finally, two traditional cross-sectional approaches to identify outliers and biologically implausible values were applied: (1) using the cutoffs for biologically implausible values that were derived from the WHO Child Growth Standards (<À6 SD or >6 SD for LAZ, <À6 SD or >5 SD for WAZ, and <À5 SD or >5 SD for HCAZ) [7] and (2) using cutoffs of 4 SD below or above the observed population average at the given time point.
For each method, the sensitivity and specificity of the approach was assessed, whereby knowledge of which errors were artificially introduced into the data set was considered the "gold standard". Sensitivities and specificities were calculated for the whole data set, regardless of the limitations of each method with respect to the detection of errors under certain conditions (e.g., residual method cannot be applied to individuals with fewer than 4 measurements; conditional growth percentile method cannot be applied to an individual's first measurement) and regardless of whether a given measurement was a suspected outlier in the original data set. As such, the sensitivity and specificity of any of these methods were not expected to be 100%. Sensitivity analyses also calculated sensitivity and specificity of these approaches after stratifying by time point or by number of measurement points per individual.

Results
A total of 8868 length measurements, 8883 weight measurements and 8888 head circumference measurements were available from 1211 infants in the MDIG trial (Table 1). Because data collection is still ongoing, there were fewer measurements available at later time points.
After inducing errors at a 5% rate, applying the jackknife residual method to either sex-and age-standardized z-scores or raw measurements performed comparably for both length and weight; the induced errors within each pair were correctly identified in an average of 88%e92% of pairs and kappa statistics ranged from 0.760 to 0.795 (Table 2). For head circumference, applying the jackknife residual method to the sex-and age-standardized z-scores performed better than when applied to the raw measurements ( Table 2).
When using the jackknife residuals method to identify any induced error, sensitivities ranged from 10.7% to 14.1% and specificities ranged from 97.4% to 97.6% when applied to sex-and agestandardized z-scores for length, weight, and head circumference ( Table 3). Sensitivity estimates were lower when the jackknife residual method was used for raw length, weight, or head circumference measurements, although specificities were similar to the models based on z-scores (Table 3). Alternative methods to identify induced errors in length, weight, and head circumference measurements had much lower sensitivities ( Table 3). The conditional growth percentile method had specificities that were slightly lower than the jackknife residual approach, whereas the conventional cross-sectional methods were very insensitive (<1%) but had nearly perfect specificities (>99%) ( Table 3).
As expected, sensitivity decreased and specificity increased with increasing absolute values of cutoffs used for the jackknife residual  Table S2). When stratified by timing of the measurement, higher sensitivity and lower specificity were also generally observed when the residual method was applied to the first measurement taken for a given individual compared with midtrajectory visits or last visits, although the pattern of differences in sensitivities was not evident for raw measurements (Supplementary Material Table S3). Substantial differences in sensitivity and specificity were not observed between midtrajectory visits and last visits when the conditional growth percentile method was applied. Increasing the overall induced error rate to 10% or 15% resulted in a decrease in sensitivity for the residual method, but specificity was largely unchanged (Supplementary Material Table S4). In contrast, doubling or tripling the width of the distribution of the magnitudes of errors resulted in increased sensitivity, but specificity remained fairly constant (Supplementary Material Table S4). For the alternative methods, changes in the error rate had smaller effects on sensitivities and specificities (Supplementary Material Table S4). As a case study, the jackknife residual method was applied to the original MDIG data set without induced errors. For length, 21 pairs of measurements with biologically implausible decreases were identified from the data set. An example of an individual with such a pair of measurements is presented in Figure 2, where applying the jackknife residual method to LAZ and raw measurements were consistent in identifying the incorrect measurement time point, as represented by the red marker.
To reduce the probability of labeling true measurements as errors/outliers by this process, a cutoff of ±5 was used in general, with a more extreme cutoff of ±6 applied to individuals with only 4 to 5 measurements, or if it was an individual's first measurement. As such, of the 8868 available length measurements, the residual method identified 133 (1.50%) outliers when applied to LAZ, and 139 (1.57%) outliers when applied to the raw measurements, with an overlap of 43 (0.48%) measurements that were identified by both methods. A total of 85 (0.96%) measurements could not be evaluated using this method, as these individuals had fewer than 4 separate measurements taken. An example of an outlier identified by both methods is presented in Figure 3.

Discussion
We demonstrated the use of a novel, simple, and objective approach to identify outliers and biologically implausible values in longitudinal growth data. Although the regression models chosen may provide a crude fit to the data, relative to more sophisticated modeling strategies, this was done intentionally to preclude overfitting the data and biasing the residuals toward the null. The simplicity of the models chosen to assess the jackknife residuals of given measurements will allow this method to be easily applied, especially for large data sets in which manual inspection may not be feasible.
In assessing the application of the jackknife residual method to identify the incorrect measurement within pairs of adjacent values with a biologically implausible decrease, incorrectly labeling an unmodified value as an error or failing to correctly identify an induced error was largely due to instances in which errors were introduced in both measurements within the pair or if the participant had many errors introduced at other measurement time points. As such, the use of the jackknife residual method to identify the incorrect measurement in instances of a biologically implausible decrease is predicated on the assumption that one measurement is correct whereas the other is notdthis method is Table 2 Comparison of using jackknife residuals from linear versus nonlinear models of z-scores or raw growth data, respectively, as a function of age to identify biologically implausible decreases in length, weight, and head circumference measurements over 1000 simulations with an induced error rate of 5% considered to be biologically implausible, whereas a decrease of greater than 15% in the raw measurements for weight were considered biologically implausible. y Agreement between the jackknife residual method and truth in the classification of induced plausible values. z Linear equation of sex-and age-standardized z-score as a function of age x Raw anthropometric measurement as a function of square root age ( . y Raw anthropometric measurement as a function of square root of age ( . z Based on a random effects restricted cubic spline (with 5 knots) model. x For LAZ, <À6 SD or >6 SD; for WAZ, <À6 SD or >5 SD; and for HCAZ, <À5 SD or >5 SD [7]. unable to identify instances in which errors occurred at both time points. In addition, increased number and magnitude of errors per individual will distort growth trajectories fit to each individual's measurements and thereby reduce the utility of this approach. Errors in head circumference measurements were more difficult to identify, likely due to the smaller absolute change in head circumference that occurs during development, relative to length and weight. Errors which were large enough to cause a biologically implausible decrease in measurements only caused minor changes in the overall head circumference trajectory, making it difficult for the jackknife residual method, as well as other methods, to discern between correct and incorrect measurements.
The accuracy of the jackknife residual method when applied to age-and sex-standardized z-scores versus raw measurements was comparable, except for head circumference, where using a ±5 cutoff resulted in much higher sensitivity and slightly lower specificity when using HCAZ rather than the raw measurements. Although observed sensitivities may appear low, this was expected since the magnitude of many induced errors were quite small and were very unlikely to be detected using any available method. In addition, since precleaning of the data set was not conducted, naturally occurring errors could be identified as errors and would therefore reduce estimated specificities. However, this would affect not just the jackknife residual method but all methods that were assessed, although likely to different extents. For example, conventional cross-sectional methods have overall lower sensitivity, and therefore are less likely to detect naturally occurring errors. As such, specificity of these methods is less likely to be reduced by naturally Absolute value of cut-offs used for jackknife residual method on HCAZ Sensitivity Specificity Fig. 1. Sensitivity and specificity of the jackknife residual method for detection of outliers in child (A) raw length, (B) length-for-age z-score, (C) raw weight, (D) weight-for-age z-score, (E) raw head circumference, and (F) head circumference-for-age z-score data using cutoffs from ±3 to ± 8. occurring errors than for methods which have higher sensitivity such as the jackknife residual method. The effect on estimates of sensitivity are more difficult to predict, but given that the true prevalence of errors in the data set is greater than how much was induced, our estimates of sensitivity may also be an underestimate of the true sensitivity because our sensitivity analyses have shown that the sensitivity of these methods decreased with increasing error rate (Table S4). We also showed that the timing of the measurement as well as the number of measurements available for a given individual has implications for the sensitivity and specificity of the jackknife residual method, but can be accounted for by combining a variety of different cutoffs when flagging potential outliers. Both sensitivity and specificity were lower when applying the conditional growth percentile method to identify the induced errors compared with the jackknife residual method. The substantially reduced specificity can be attributed to the calculation of specificity in the whole data set, rather than in the subset of measurements to which the method can be applied. For example, the jackknife residual method could not be applied to measurements of individuals who have fewer than four total measurements, which comprised approximately 1% of all measurements in the data set, effectively reducing the observed overall sensitivity of this method by 0.5% and specificity by 0.95% (assuming errors were induced in 5% of those observations, as expected since they were randomly generated). In contrast, the conditional growth percentile method could not be applied to the first measurement time point of each individual, which comprised approximately 13.6% of all measurements in the data set, thus affecting the observed overall sensitivity and specificity of the method to a much greater extent and thereby highlighting the most consequential limitation of this method. The reduced sensitivity of the conditional growth percentile method may be attributed to the use of the 4 SD threshold as recommended by Yang et al [9] to prioritize specificity. Similarly, careful consideration of the sensitivity-specificity tradeoff is required for the jackknife residual method, and further inquiry and investigation into outliers flagged by this method should be conducted. Unsurprisingly, conventional cross-sectional methods performed quite poorly in discriminating between unmodified measurements and induced errors. Although they had near-perfect specificity, a very limited number of the induced errors were detected by these methods, resulting in extremely poor sensitivity. Although lowered cutoffs were assessed to try to increase the sensitivity of these methods (data not shown), substantial increases in sensitivity were not observed until cutoffs were lowered to values that would have eliminated measurements which were likely to be plausible.
Although we demonstrated that jackknife residuals are a practical approach to identify both biologically implausible values and outliers, assumptions were made regarding the functional forms used to reflect the shape of the growth trajectory. Although we used A B Fig. 2. Example of a participant for whom an error was identified within a pair of values in which there was a biologically implausible decrease in length between two adjacent encounters (shown in hollow circles). The error was similarly identified when the jackknife residual method was applied to (A) length-for-age z-scores (LAZ) or (B) raw length measurements. Each measurement is labeled with its corresponding jackknife residual values. Fig. 3. Example of a participant for whom an outlier (shown in hollow circles) was identified when the jackknife residual method is applied to (A) length-for-age z-score (LAZ) or (B) raw length measurement. Each measurement is labeled with its corresponding jackknife residual values.

A B
linear equations of z-scores as a function of age or raw anthropometric measurement as a function of square root of age, the general shape of the growth trajectory can be described using other functional forms. Our models were selected on the basis of being generalized forms of the expected growth trajectories of infants from birth to 2 years of age. For example, the linear equation of z-scores as a function of age represents the expected trajectory of a child maintaining a z-score at 0, whereas the square root equation reflects the rapid but decelerating rate of change in raw size measurements that occurs in the postnatal period. Different functional forms may be needed to account for alternate patterns of growth that may be expected with other types or timing of measurements. However, caution is warranted against overfitting the data, as not only will this increase the number of measurements needed per individual, where (k þ 2) measurements are needed for a model with k parameters, residuals are biased closer toward the null when the data are overfit. In addition, our simulations assumed an arbitrary error rate and distribution for the magnitude of the errors, which may be impossible to characterize in real longitudinal growth data. However, our sensitivity analyses indicated that although increasing the error rate or magnitude of these errors has implications for the sensitivity of the jackknife residual method, its specificity remains relatively constant. While no single method will be able to identify all errors in a longitudinal growth data set, a combination of approaches, such as applying the jackknife residual method using various regression equations as well as the conditional growth percentile method, may provide the best sensitivity without erroneously identifying real measurements as errors. A multitude of factors, including type of measurement, number of measurements available per individual, and the timing between measurements, should be considered in deciding on a data cleaning strategy. Ultimately, the acceptable balance of sensitivity and specificitydthereby determining the parameters used to implement the method (e.g., cutoff values, use of variable cutoff values for different time points)dis determined by individual investigators based on the study design and overall sample size. Flagged values should undergo manual review and adjudication before being excluded from further analyses; therefore, the choice of cutoffs may be determined by available resources to undertake a manual review process.
In conclusion, the use of jackknife residuals provides a simple and flexible method to identify biologically implausible values and outliers in longitudinal growth data in studies in which most children have at least 4 serial measurements. The detection and correction (or exclusion, if necessary) of measurement errors can increase precision in analyses to identify determinants of growth trajectories or the effects of child growth on later health outcomes.

Appendix A. Supplementary methods
Generation of sex-and age-standardized z-scores for length, weight, and head circumference measurements (1) Sex-and age-standardized z-scores for length, weight, and head circumference measurements were generated using the Intergrowth-21st Newborn Size standards for length, weight, and head circumference measurements taken within 48 hours of birth; (2) the Intergrowth-21st International Postnatal Growth Standards for Preterm Infants for length, weight, and head circumference measurements taken up to 64 weeks of postmenstrual age among preterm infants (born earlier than 37 weeks of gestation); and (3) the World Health Organization (WHO) Child Growth Standards for length, weight, and head circumference measurements among term infants and measurements taken after 64 weeks of postmenstrual age among preterm infants.
Models for individual sex-and age-standardized z-scores and raw measurements for the jackknife residual method Linear regression was used to fit a straight line through the individual's sex-and age-standardized z-score of the corresponding anthropometry measurement as a function of age: (1) where "i" denotes the i th individual and "j" denotes the j th time point.
For raw measurements, each individual's measurements were regressed on the square root of age (t ½ ) to model a curvilinear relationship:

Minimum number of measurements required per individual
In both the sex-and age-standardized z-score and raw measurement models, two parameters are being estimated: one for the intercept and one for age or square root of age. Because a line fit to two measurements is simply directly connecting one point to the other, a minimum of three measurements is required in order for there to be some variation (i.e., residual not equal to zero) around the fitted trajectory. In the case of calculating jackknife residuals, a fourth measurement is required since mean squared error is calculated with the i th observation dropped. In general, (k þ 2) measurements per infant are required for the jackknife residual method, where k represents the number of parameters in the model.

Table S1
Comparison of various cutoffs to identify induced errors in length, weight, and head circumference using the jackknife residual method over 1000 simulations with an induced error rate of 5% . y Raw anthropometric measurement as a function of square root of age (