Cumulative deviation of a subpopulation from the full population

Assessing equity in the treatment of a subpopulation often involves assigning numerical "scores" to all individuals in the full population such that similar individuals get similar scores; matching via propensity scores or appropriate covariates is common, for example. Given such scores, individuals with similar scores may or may not attain similar outcomes independent of the individuals' memberships in the subpopulation. The traditional graphical methods for visualizing such inequities are known as "reliability diagrams" or "calibration plots": they bin the scores into a partition of all possible values and, for each bin, plot both the average outcome for individuals in the subpopulation and the average outcome for all individuals; comparing the graph for the subpopulation with that for the full population gives some sense of how the averages for the subpopulation deviate from the averages for the full population. Unfortunately, real data sets contain only finitely many observations, limiting the usable resolution of the bins, and so the conventional methods can obscure important variations due to the binning. Fortunately, plotting the cumulative deviation of the subpopulation from the full population, as proposed in this paper, sidesteps the problematic coarse binning. The cumulative plots encode subpopulation deviation directly as the slopes of secant lines for the graphs. Slope is easy to perceive even when the constant offsets of the secant lines are irrelevant. The cumulative approach avoids binning that smooths over deviations of the subpopulation from the full population. Such cumulative aggregation furnishes both high-resolution graphical methods and simple scalar summary statistics (analogous to those of Kuiper and of Kolmogorov and Smirnov used in statistical significance testing for comparing probability distributions).

The classical methods partition the real line into ℓ disjoint intervals with endpoints B_1, B_2, ..., B_{ℓ−1} for the subpopulation and (possibly different) endpoints B̃_1, B̃_2, ..., B̃_{ℓ−1} for the full population, such that B_1 < B_2 < · · · < B_{ℓ−1} and B̃_1 < B̃_2 < · · · < B̃_{ℓ−1}. We can then form the average outcomes Y_k for the subpopulation and Ỹ_k for the full population, for k = 1, 2, ..., ℓ, under the convention that B_0 = B̃_0 = −∞ and B_ℓ = B̃_ℓ = ∞. We also calculate the average scores X_k in the bins for the subpopulation and X̃_k for the full population, for k = 1, 2, ..., ℓ, under the same convention. A graphical method for assessing the deviation of the subpopulation from the full population is then to scatterplot the pairs (X_1, Y_1), (X_2, Y_2), ..., (X_ℓ, Y_ℓ) in black and the pairs (X̃_1, Ỹ_1), (X̃_2, Ỹ_2), ..., (X̃_ℓ, Ỹ_ℓ) in gray. Comparing the black plotted points (possibly connected with black lines) to the gray plotted points (possibly connected with gray lines) then indicates how much the subpopulation deviates from the full population. Especially when assessing the calibration or reliability of probabilistic predictions, this graphical method is known as a "reliability diagram" or "calibration plot," as reviewed, for example, by [1] and [2].
A full review of the literature is available in section "Introduction to assessing calibration" of Appendix A.
There are at least two common choices of the bins whose endpoints are B_1, B_2, ..., B_{ℓ−1} (and similarly for the bins whose endpoints are B̃_1, B̃_2, ..., B̃_{ℓ−1}). The first is to make B_1, B_2, ..., B_{ℓ−1} equispaced. The second is to select B_1, B_2, ..., B_{ℓ−1} such that the number of scores from the subpopulation that fall in the kth bin, that is, #{j : B_{k−1} < S_{i_j} ≤ B_k}, is the same for all k (aside from the rightmost bin, that for k = ℓ, if n is not perfectly divisible by ℓ). In both cases, the number ℓ of bins is a parameter that we can vary to trade off higher-confidence estimates against finer resolution in detecting deviation as a function of score (and vice versa). Unfortunately, no choice can fully offset how the difference between the subpopulation and the full population is typically the primary interest, whereas the standard plot bins the subpopulation and the full population separately (potentially smoothing away information due to the discretization). Plotting the difference directly would solve this particular problem. Even then, however, no choice of bins can be optimal for all possible distributions of scores or for all possible distributions of deviations between the subpopulation and the full population. Binning will always discretize the distributions, smoothing away potentially important information.
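For concreteness, the two common binning choices can be sketched as follows in Python; the function, its arguments, and the synthetic data are our own illustrations, not code from the paper.

```python
import numpy as np

def reliability_curve(scores, results, n_bins=10, equispaced=True):
    """Binned average outcome versus average score, as in a classical
    reliability diagram (a sketch; names are ours, not the paper's)."""
    scores = np.asarray(scores, dtype=float)
    results = np.asarray(results, dtype=float)
    if equispaced:
        # First choice: equispaced bin endpoints.
        edges = np.linspace(scores.min(), scores.max(), n_bins + 1)
    else:
        # Second choice: each bin holds (roughly) the same number of scores.
        edges = np.quantile(scores, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    xs, ys = [], []
    for k in range(n_bins):
        mask = (scores > edges[k]) & (scores <= edges[k + 1])
        if mask.any():
            xs.append(scores[mask].mean())   # average score in bin k
            ys.append(results[mask].mean())  # average outcome in bin k
    return np.array(xs), np.array(ys)

rng = np.random.default_rng(0)
s = rng.uniform(size=1000)
r = rng.binomial(1, s)  # well-calibrated synthetic Bernoulli outcomes
x, y = reliability_curve(s, r, n_bins=10, equispaced=True)
x2, y2 = reliability_curve(s, r, n_bins=10, equispaced=False)
```

Since the synthetic outcomes here are calibrated to the scores, the plotted points (x, y) should hug the diagonal, up to the binning noise discussed above.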
Fortunately, binning so coarsely is unnecessary in the methods proposed below. The methods discussed below employ exactly one bin per score S_{i_j}, j = 1, 2, ..., n, thus viewing the subpopulation on its own terms, dictated by the observations from the subpopulation. Employing only one bin per score would be nonsensical in the classical plots, as the classical plots would then trade off all statistical confidence for the finest resolution possible. The cumulative methods sacrifice no resolution that is available in the observed data for the subpopulation.
For a simple illustrative example, Fig. 1 displays both the conventional reliability diagrams and the cumulative plot proposed below. A detailed description is available in section "Synthetic" below. The lowermost two rows of Fig. 1 are the classical diagrams, with m = 50,000 and n = 5000; there are ℓ = 10 bins for each diagram in the second row and ℓ = 50 for each diagram in the third row. In the lowermost two rows, the bins are equispaced along the scores in the leftmost plots, whereas each bin contains the same number of scores from S_{i_1}, S_{i_2}, ..., S_{i_n} (or from S_1, S_2, ..., S_m) in the rightmost plots. The diagram of Fig. 2 plots the ordered pairs (S_1, P_1), (S_2, P_2), ..., (S_m, P_m) in gray and (S_{i_1}, P_{i_1}), (S_{i_2}, P_{i_2}), ..., (S_{i_n}, P_{i_n}) in black, where P_1, P_2, ..., P_m are the expected values of R_1, R_2, ..., R_m, respectively; for this example, R_1, R_2, ..., R_m are drawn independently from Bernoulli distributions with probabilities of success P_1, P_2, ..., P_m. Thus, Fig. 2 depicts the "ground-truth" expectations that the lowermost two rows of classical plots in Fig. 1 are trying to characterize. The gray points correspond to the full population, while the solid black points correspond to the subpopulation.
The topmost row of Fig. 1 displays both the cumulative plot introduced below and its ideal noiseless "ground-truth" constructed using the expected values P_1, P_2, ..., P_m of the random observations R_1, R_2, ..., R_m. Leaving elucidation of the cumulative plots and their construction to section "Methods" below, we just point out here that the deviation of the subpopulation from the averages over the full population across an interval is equal to the slope of the secant line for the graph over that interval, aside from expected stochastic fluctuations detailed in section "Significance of stochastic fluctuations" below. Steep slopes correspond to substantial deviations across the ranges of scores where the slopes are steep, with the deviation over an interval exactly equal to the expected value of the slope of the secant line for the graph over that interval. The cumulative plot closely resembles its ideal ground-truth in Fig. 1, and the Kolmogorov-Smirnov and Kuiper metrics conveniently summarize the statistical significance of the overall deviation across all scores, in accord with sections "Scalar summary statistics" and "Significance of stochastic fluctuations" below.
The structure of the remainder of the present paper is as follows: Section "Methods" details the statistical methodology that section "Results and discussion" illustrates via numerical examples. Section "Results and discussion" also proposes avenues for further development. Section "Conclusion" then briefly summarizes the methods and numerical results. Appendix A describes methods for calibrating probabilistic predictions that are analogous to the cumulative methodology introduced in section "Methods" for assessing deviation of a subpopulation from the full population. Appendix A also contains a literature review, in section "Introduction to assessing calibration"; please consult that section for a discussion of related work. Appendix B warns about potentially tempting overinterpretations of the various plots. Table 1 gives a glossary of the notation used throughout except in the appendices, while Table 2 gives a glossary of the notation used in the appendices.

[Caption of Fig. 1: Figure 2 displays the ground-truth reliability diagram. The reliability diagram with 50 bins that each contain the same number of scores from the subpopulation is able to detect the notch around scores of 0.25; however, the oscillation of the bin frequencies for the subpopulation complicates disentangling real variations from statistical noise. The reliability diagrams that each have only 10 bins exhibit fewer random oscillations, but smear out the notch. In the reliability diagrams, the averages for the subpopulation are black, while the averages for the full population are gray. In the top row, the plot of cumulative deviation resolves the notch nicely while displaying minimal random fluctuations across the full range of scores. The scalar summary statistics of Kuiper and of Kolmogorov and Smirnov very successfully detect the statistically significant deviation of the subpopulation from the full population.]

Tygert J Big Data (2021) 8:117

Methods

This section formulates the cumulative statistics mathematically, with section "High-level strategy" proposing a workflow for large-scale analysis. Section "Graphical method" details the graphical methods. Section "Scalar summary statistics" details the scalar summary metrics. Section "Significance of stochastic fluctuations" discusses statistical significance for both the graphical methods and the summary statistics. Section "Weighted sampling" presents a generalization of these statistical methodologies to the case of weighted samples (beyond just equally or uniformly weighted).

High-level strategy
This subsection suggests a hybrid method for large-scale data analysis. When there are many data sets and subpopulations to assess, a two-step approach may be the most practical:
1. A screening stage assigning a single scalar summary statistic to each pair of data set and subpopulation (where the size of the statistic measures the deviation of the subpopulation from the full population).
2. A detailed drill-down for each pair of data set and subpopulation whose scalar summary statistic is large, drilling down into how the deviation of the subpopulation from the full population varies as a function of score.
The drill-down relies on graphically displaying the deviation of the subpopulation from the full population as a function of score; the scalar statistic for the first stage simply summarizes the overall deviation across all scores, as either the maximum absolute deviation of the graph or the size of the range of deviations in the graphical display. Thus, for each pair of data set and subpopulation, both stages leverage a graph; the first stage collapses the graph into a single scalar summary statistic. The following subsection constructs the graph.
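The two-stage workflow can be sketched as follows; this is a minimal illustration of ours, in which the function name, the dictionary layout, and the threshold are all placeholders, and each per-pair sequence stands for the cumulative differences constructed in section "Graphical method" below.

```python
def screen(pairs, threshold):
    """Stage 1: flag (data set, subpopulation) pairs whose scalar summary
    statistic -- here the size of the range of deviations -- is large."""
    flagged = []
    for name, diff in pairs.items():   # diff: cumulative differences F_k - F~_k
        stat = max(diff) - min(diff)   # Kuiper-style range of the deviations
        if stat > threshold:
            flagged.append(name)       # stage 2 would drill down into these
    return flagged

flagged = screen({"pair-a": [0.0, 0.01, 0.02], "pair-b": [0.0, 0.2, -0.3]},
                 threshold=0.1)
```

Only the flagged pairs then warrant the graphical drill-down of stage 2.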

Graphical method
This subsection details the construction of cumulative plots. Two sequences will define the cumulative plot, based on the notation set in the introduction. The cumulative sequence for the subpopulation is

F_k = (1/n) Σ_{j=1}^{k} R_{i_j}   (5)

for k = 1, 2, ..., n.
Facilitating comparison of the subpopulation with the full population, the average result for the full population in a bin around S_{i_k} is

R̃_{i_k} = ( Σ_{i : B_{k−1} < S_i ≤ B_k} R_i ) / #{i : B_{k−1} < S_i ≤ B_k}   (6)

for k = 1, 2, ..., n, where the thresholds for the bins are

B_k = (S_{i_k} + S_{i_{k+1}}) / 2   (7)

for k = 0, 1, 2, ..., n, under the convention that S_{i_0} = −∞ and S_{i_{n+1}} = ∞ (so B_0 = −∞ and B_n = ∞).
The cumulative sequence for the full population at the subpopulation's subset of scores is

F̃_k = (1/n) Σ_{j=1}^{k} R̃_{i_j}   (8)

for k = 1, 2, ..., n, where R̃_{i_j} is defined in (6).
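For concreteness, here is a sketch in Python of the construction just described: one bin per subpopulation score, with bin thresholds at the midpoints between successive scores. The function name and synthetic data are ours, and the sketch assumes the scores contain no ties.

```python
import numpy as np

def cumulative_difference(scores, results, sub_idx):
    """F_k - F~_k for k = 1, ..., n (a sketch; variable names are ours)."""
    scores = np.asarray(scores, dtype=float)
    results = np.asarray(results, dtype=float)
    sub = np.asarray(sub_idx)
    sub = sub[np.argsort(scores[sub], kind="stable")]  # i_1, ..., i_n sorted by score
    s_sub = scores[sub]
    n = len(sub)
    # Bin thresholds: midpoints between successive subpopulation scores.
    B = np.empty(n + 1)
    B[0], B[n] = -np.inf, np.inf
    B[1:n] = (s_sub[:-1] + s_sub[1:]) / 2
    # R~_{i_k}: average result over the full population in the bin around S_{i_k}.
    r_tilde = np.empty(n)
    for k in range(n):
        mask = (scores > B[k]) & (scores <= B[k + 1])
        r_tilde[k] = results[mask].mean()
    F = np.cumsum(results[sub]) / n    # cumulative sequence for the subpopulation
    F_tilde = np.cumsum(r_tilde) / n   # cumulative sequence for the full population
    return F - F_tilde

rng = np.random.default_rng(1)
m, n = 20000, 500
s = rng.uniform(size=m)
r = rng.binomial(1, s).astype(float)          # calibrated Bernoulli outcomes
sub = rng.choice(m, size=n, replace=False)    # a random subpopulation
diff = cumulative_difference(s, r, sub)
```

Here the subpopulation is drawn at random from the full population, so the cumulative differences should stay small, on the order of 1/√n.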
Although the accumulation from lower scores might at first glance appear to overwhelm the contributions from higher scores, a plot of F_k − F̃_k as a function of k will display deviation of the subpopulation from the full population for any score solely in slopes that deviate significantly from 0; any problems accumulated from earlier, lower scores pertain only to the constant offset from 0, not to the slope deviating from 0. In fact, the increment in the expected difference F_j − F̃_j from j = k − 1 to j = k is

E[F_k − F̃_k] − E[F_{k−1} − F̃_{k−1}] = ( E[R_{i_k}] − E[R̃_{i_k}] ) / n;   (9)

thus, on a plot with the values for k spaced 1/n apart, the slope from j = k − 1 to j = k is

Δ_k = E[R_{i_k}] − E[R̃_{i_k}],   (10)

where the expected value of the average result R̃_{i_k} in the bin around S_{i_k} is equal to the average of the expected results, that is,

E[R̃_{i_k}] = ( Σ_{i : B_{k−1} < S_i ≤ B_k} E[R_i] ) / #{i : B_{k−1} < S_i ≤ B_k}   (11)

for k = 1, 2, ..., n, and R̃_{i_k} is defined in (6). The subpopulation deviates from the full population for the scores near S_{i_k} when Δ_k is significantly nonzero, that is, when the slope of the plot of F_k − F̃_k deviates significantly from horizontal over a significantly long range.
To emphasize: the deviation of the subpopulation from the full population over a contiguous range of S_{i_k} is the slope of the secant line for the plot of F_k − F̃_k as a function of k/n over that range, aside from expected random fluctuations. Figure 1 presents a simple illustrative example, and many examples analyzing data from a popular data set in computer vision, "ImageNet," are available in section "ImageNet". The leftmost plot in the topmost row of Fig. 1 graphs F_k − F̃_k versus k/n; the rightmost plot in the top row is the ideal noiseless "ground-truth" constructed using the precise expected values of the random observations R_1, R_2, ..., R_m (as detailed in section "Synthetic", the exact expectations are available for constructing the ground-truth in the synthetic example corresponding to Fig. 1). Steep slopes correspond to substantial deviations across the ranges of scores where the slopes are steep, with the deviation over an interval exactly equal to the expected value of the slope of the secant line for the graph over that interval. The cumulative plot nicely matches its ideal ground-truth in Fig. 1.
The following subsection discusses metrics summarizing how much the graph deviates from 0 (needless to say, if the slopes of the secant lines are all nearly 0, then the whole graph cannot deviate much from 0). Section "Significance of stochastic fluctuations" then leverages these metrics in a discussion of the expected stochastic fluctuations mentioned above.

Scalar summary statistics
This subsection details the construction of scalar statistics which provide broad-brush summaries of the plots introduced in the previous subsection.
Two standard metrics for the overall deviation of the subpopulation from the full population over the full range of scores, accounting for expected random fluctuations, are that due to Kolmogorov and Smirnov, the maximum absolute deviation

G = max_{1 ≤ k ≤ n} |F_k − F̃_k|,   (12)

and that due to Kuiper, the size of the range of the deviations

D = max_{0 ≤ k ≤ n} (F_k − F̃_k) − min_{0 ≤ k ≤ n} (F_k − F̃_k),   (13)

where F_0 = 0 = F̃_0; the following remark explains the reason for including F_0 and F̃_0 (a reason that often makes D modestly preferable to G). Under appropriate statistical models, G and D can form the basis for tests of statistical significance, the context in which they originally appeared; see, for example, Section 14.3.4 of [3]. For assessing statistical significance (rather than overall effect size), G and D should be rescaled larger by a factor proportional to √n; further discussion of the rescaling is available in the next subsection. The captions of the figures report the values of these summary statistics for all examples.
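A sketch of both summary statistics in Python (the function name is ours), prepending F_0 = 0 = F̃_0 as above:

```python
import numpy as np

def summary_stats(diff):
    """Kolmogorov-Smirnov and Kuiper statistics of the cumulative
    differences F_k - F~_k, with F_0 = 0 = F~_0 prepended."""
    d = np.concatenate(([0.0], np.asarray(diff, dtype=float)))
    G = np.max(np.abs(d))      # Kolmogorov-Smirnov: maximum absolute deviation
    D = np.max(d) - np.min(d)  # Kuiper: size of the range of the deviations
    return G, D

G, D = summary_stats([0.01, 0.03, -0.02, 0.00])
```

For this toy input, G = 0.03 and D = 0.03 − (−0.02) = 0.05; note that G ≤ D ≤ 2G always holds, since 0 lies among the prepended values.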

Remark 1
The statistic D from (13) has the same statistical power across all indices. Indeed, shifting the index in the definitions of F_k and F̃_k where the summation starts has no effect on the value of D, for the following reasons: Recall that an integral whose lower limit of integration is greater than the upper limit is simply the negative of the integral with the lower and upper limits of integration interchanged. Thus, the natural generalization of the summations defining F_k in (5) and F̃_k in (8), when starting from an arbitrary index ℓ, is

F_k^{(ℓ)} = (1/n) Σ_{j=ℓ+1}^{k} R_{i_j} − (1/n) Σ_{j=k+1}^{ℓ} R_{i_j}   (14)

and

F̃_k^{(ℓ)} = (1/n) Σ_{j=ℓ+1}^{k} R̃_{i_j} − (1/n) Σ_{j=k+1}^{ℓ} R̃_{i_j}   (15)

for k = 0, 1, 2, ..., n; ℓ = 0, 1, 2, ..., n, with the case ℓ = 0 reducing to (5) and (8). Of course, the first summation in the right-hand side of (14) vanishes when k ≤ ℓ and the second summation vanishes when k ≥ ℓ; similarly, the first summation in the right-hand side of (15) vanishes when k ≤ ℓ and the second summation vanishes when k ≥ ℓ. Consideration of each case, for k < ℓ, for k = ℓ, and for k > ℓ, yields that

F_k^{(ℓ)} = F_k − F_ℓ   (16)

and

F̃_k^{(ℓ)} = F̃_k − F̃_ℓ   (17)

for k = 0, 1, 2, ..., n; ℓ = 0, 1, 2, ..., n. The definition

D^{(ℓ)} = max_{0 ≤ k ≤ n} ( F_k^{(ℓ)} − F̃_k^{(ℓ)} ) − min_{0 ≤ k ≤ n} ( F_k^{(ℓ)} − F̃_k^{(ℓ)} ),   (18)

when combined with (16) and (17), then yields that

D^{(ℓ)} = D   (19)

for ℓ = 0, 1, 2, ..., n, where D is from (13). This shows that the statistic D has the same statistical power for any index, as the statistic is invariant to shifts in where the summation for the cumulative differences starts.
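A quick numerical check of this invariance; the simulated increments below are merely illustrative.

```python
import numpy as np

# Simulated increments of F_k - F~_k, purely illustrative.
rng = np.random.default_rng(3)
d = np.concatenate(([0.0], np.cumsum(rng.normal(size=300) / 300)))  # d_0 = 0

D = np.max(d) - np.min(d)  # Kuiper statistic, per (13)

# Starting the summations at index l subtracts the constant d_l from every
# entry, per (16) and (17); a constant shift leaves the range unchanged.
D_shifted = [np.max(d - d[l]) - np.min(d - d[l]) for l in range(len(d))]
```

Every entry of D_shifted coincides with D, up to rounding.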

Significance of stochastic fluctuations
This subsection discusses statistical significance both for the graphical methods of section "Graphical method" and for the summary statistics of section "Scalar summary statistics".
The plot of F_k − F̃_k as a function of k/n automatically includes some "confidence bands" courtesy of the discrepancy F_k − F̃_k fluctuating randomly as the index k increments; at the very least, the "thickness" of the plot coming from the random fluctuations gives a sense of "error bars." To give a rough indication of the size of the fluctuations of the maximum deviation expected under the hypothesis that the subpopulation does not deviate from the full population in the actual underlying distributions, the plots should include a triangle centered at the origin whose height above the origin is proportional to 1/√n. Such a triangle can be a proxy for the classic confidence bands around an empirical cumulative distribution function introduced by Kolmogorov and Smirnov, as reviewed by [4]. Indeed, a driftless, purely random walk deviates from zero by roughly √n after n steps, so a random walk scaled by 1/n deviates from zero by roughly 1/√n. Identification of deviation between the subpopulation and the full population is reliable when focusing on long ranges (as a function of k/n) of steep slopes for F_k − F̃_k; the triangle gives a sense of the length scale for variations that arise solely due to randomness, even in the absence of any actual underlying deviation between the subpopulation and the full population. A simple illustrative example is available in Fig. 1, and many examples analyzing data from a popular data set in computer vision, "ImageNet," are available in section "ImageNet".
In cases for which either R_i = 0 or R_i = 1 for each i = 1, 2, ..., m, and for which the scores are nothing but the probabilities of success, that is, S_i is the probability that R_i = 1 (where i = 1, 2, ..., m), the tip-to-tip height of the triangle centered at the origin should be 4/n times the standard deviation of the sum of independent Bernoulli variates with success probabilities S_{i_1}, S_{i_2}, ..., S_{i_n}, that is,

4 √( Σ_{k=1}^{n} S_{i_k} (1 − S_{i_k}) ) / n.

This height will be representative to within a factor of √2 or so provided that the subpopulation is a minority of the full population; see Remark 7 and section "Significance of stochastic fluctuations for assessing calibration" of Appendix A below. Needless to say, similar remarks pertain whenever the variance of R_i is a known function of S_i for each i = 1, 2, ..., m.
In cases for which either R_i = 0 or R_i = 1 for each i = 1, 2, ..., m, and for which there are many scores from the full population in the bin for each score from the subpopulation, that is, #{i : B_{k−1} < S_i ≤ B_k} is large for every k = 1, 2, ..., n, the average of the outcomes for each bin will be a good approximation to the average of the underlying probabilities of success for that bin, that is,

R̃_{i_k} ≈ ( Σ_{i : B_{k−1} < S_i ≤ B_k} E[R_i] ) / #{i : B_{k−1} < S_i ≤ B_k}   (20)

for k = 1, 2, ..., n, where B_k is from (7), E[R̃_{i_k}] is from (11), and E[R_i] is the probability that the outcome is a success, that is, the probability that R_i = 1. In such cases, the tip-to-tip height of the triangle at the origin should be 4/n times the standard deviation of the sum of independent Bernoulli variates with success probabilities E[R̃_{i_1}], E[R̃_{i_2}], ..., E[R̃_{i_n}]; we may use (20) to approximate this height as four times

σ = √( Σ_{k=1}^{n} R̃_{i_k} (1 − R̃_{i_k}) ) / n,   (21)

where R̃_{i_k} is the left-hand side of (20), as seen from (6). The triangles in the figures all have this height when each R_i is either 0 or 1, since the numerical results reported in the following section pertain to the case in which there are quite a few scores from the full population in the bin for each score from the subpopulation. When R_i can take on more than two possible values, the figures instead use the empirical variance of the members from the full population for each bin from B_{k−1} to B_k, that is, we replace (21) with

σ = √( Σ_{k=1}^{n} σ̂_k² ) / n,   (22)

where the empirical variance is

σ̂_k² = ( Σ_{i : B_{k−1} < S_i ≤ B_k} (R_i − R̃_{i_k})² ) / #{i : B_{k−1} < S_i ≤ B_k}   (23)

for k = 1, 2, ..., n, with R̃_{i_k} defined in (6); R̃_{i_k} is also the left-hand side of (20).
Rigorous justification of (20) is straightforward: the expected value of the left-hand side of (20) is the right-hand side of (20), and the variance of the left-hand side of (20) is

( Σ_{i : B_{k−1} < S_i ≤ B_k} E[R_i] (1 − E[R_i]) ) / ( #{i : B_{k−1} < S_i ≤ B_k} )² ≤ 1 / ( 4 · #{i : B_{k−1} < S_i ≤ B_k} ),

so the standard deviation of the left-hand side of (20) is at most 1 / ( 2 √( #{i : B_{k−1} < S_i ≤ B_k} ) ), which converges to 0 as #{i : B_{k−1} < S_i ≤ B_k} → ∞.

Remark 2

Interpreting the scalar summary statistics G and D from (12) and (13) is straightforward in these latter cases, using σ defined in (21) or (22). Indeed, under the null hypothesis that the subpopulation has no deviation from the average values of the full population at the corresponding scores, the expected value of G/σ is less than or equal to the expected value of the maximum (over a subset of the unit interval [0, 1]) of the absolute value of the standard Brownian motion over [0, 1], in the limit n → ∞ and #{i : B_{k−1} < S_i ≤ B_k} → ∞ for all k = 1, 2, ..., n. As reviewed below in Remark 7, the expected value of the maximum of the absolute value of the standard Brownian motion over the unit interval [0, 1] is √(π/2) ≈ 1.25; and the discussion by [5] immediately following Formula 44 of the associated arXiv publication shows that the probability distribution of the maximum of the absolute value of the standard Brownian motion over [0, 1] is sub-Gaussian, decaying past its mean √(π/2) ≈ 1.25. So, values of G/σ much greater than 1.25 imply that the subpopulation's deviation from the full population is significant, while values of G/σ close to 0 imply that G did not detect any statistically significant deviation. Similar remarks pertain to D, since G ≤ D ≤ 2G.
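A sketch of this interpretation on synthetic Bernoulli data, taking the exact success probabilities as a stand-in for the full population's bin averages; all names and the synthetic setup are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
s = np.sort(rng.uniform(0.1, 0.9, size=n))  # subpopulation scores, sorted
r = rng.binomial(1, s)                      # Bernoulli outcomes, calibrated to the scores

# Under the null hypothesis, F_k - F~_k behaves like a scaled random walk;
# here F~ is built from the exact probabilities s as a stand-in.
diff = np.cumsum(r - s) / n
sigma = np.sqrt(np.sum(s * (1.0 - s))) / n  # as in (21), with exact probabilities
G = np.max(np.abs(np.concatenate(([0.0], diff))))
ratio = G / sigma
```

Since the data are generated under the null, the ratio G/σ should be of the order of the mean 1.25 quoted above, not far beyond it.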

Weighted sampling
This subsection generalizes the methods of the preceding subsections to the case of weighted samples.
Specifically, some data sets include a weight for how much each observation should contribute to the data analysis. In the setting of section "Graphical method" above, such a data set would supplement the results R_1, R_2, ..., R_m and scores S_1, S_2, ..., S_m with positive weights W_1, W_2, ..., W_m. Section "Graphical method" will correspond to the special case that W_1 = W_2 = · · · = W_m (admittedly, the "special" case is the standard in practice).

[Footnote 2: A freely available preprint of [5] is available at https://arxiv.org/pdf/1401.4939.pdf.]
With weights, the cumulative sequence for the subpopulation, replacing (5), becomes

F_k = ( Σ_{j=1}^{k} W_{i_j} R_{i_j} ) / ( Σ_{j=1}^{n} W_{i_j} )   (25)

for k = 1, 2, ..., n. The average result for the full population in a bin around S_{i_k}, replacing (6), becomes

R̃_{i_k} = ( Σ_{i : B_{k−1} < S_i ≤ B_k} W_i R_i ) / ( Σ_{i : B_{k−1} < S_i ≤ B_k} W_i )   (26)

for k = 1, 2, ..., n, where (7) defines B_0, B_1, ..., B_n, the thresholds for the bins.

[Displaced figure caption: Figure 4 displays the ground-truth reliability diagram. Distinguishing random fluctuations from real variations is difficult in the reliability diagrams with 50 bins each. The reliability diagrams that each have only 10 bins could be misleading, as the depicted variations in the subpopulation's outcomes are grossly lower than the actual underlying variations as a function of score. The plot of cumulative deviation is far from perfect, yet captures the exact expectations quite well qualitatively and tolerably well quantitatively. The scalar summary statistics have trouble detecting the significant deviation of the subpopulation from the full population. This illustrates a blind spot in the Kolmogorov-Smirnov and Kuiper statistics, namely, they have a hard time detecting oscillatory discrepancies that average away upon summation. Neither the Kolmogorov-Smirnov metric nor the Kuiper metric is very sensitive to high-frequency deviations whose mean is small.]
The cumulative sequence for the full population at the subpopulation's subset of scores, replacing (8), becomes

F̃_k = ( Σ_{j=1}^{k} W_{i_j} R̃_{i_j} ) / ( Σ_{j=1}^{n} W_{i_j} )   (27)

for k = 1, 2, ..., n. The cumulative sequence of weights is

A_k = ( Σ_{j=1}^{k} W_{i_j} ) / ( Σ_{j=1}^{n} W_{i_j} )   (28)

for k = 1, 2, ..., n. In a plot of the weighted cumulative differences (25) and (27) versus the cumulative weights (28), F_1 − F̃_1, F_2 − F̃_2, ..., F_n − F̃_n are the ordinates (vertical coordinates), and A_1, A_2, ..., A_n are the corresponding abscissae (horizontal coordinates); the expected value of the slope from A_{k−1} to A_k is

Δ_k = ( W_{i_k} E[R_{i_k}] − W_{i_k} E[R̃_{i_k}] ) / W_{i_k} = E[R_{i_k}] − E[R̃_{i_k}],   (29)

where the expected value of the weighted average result R̃_{i_k} in the bin around S_{i_k} is equal to the weighted average of the expected results, that is,

E[R̃_{i_k}] = ( Σ_{i : B_{k−1} < S_i ≤ B_k} W_i E[R_i] ) / ( Σ_{i : B_{k−1} < S_i ≤ B_k} W_i )   (30)

for k = 1, 2, ..., n, and R̃_{i_k} is defined in (26). Thus, Δ_k defined in (29) is equal to the expected value of the deviation of the subpopulation from the full population for the scores near S_{i_k}, with W_{i_k} canceling in the rightmost identity of (29). Hence, the subpopulation deviates from the full population for the scores near S_{i_k} when Δ_k is significantly nonzero, that is, when the slope of the plot of F_k − F̃_k versus A_k deviates significantly from horizontal over a significantly long range.

[Displaced figure caption: Figure 6 displays the ground-truth reliability diagram. The observed reliability diagrams fail to depict the underlying discontinuous jumps in the subpopulation's expected outcomes as a function of the score. The plot of cumulative deviation succeeds in resolving some of the corresponding cusps, but does exhibit significant random fluctuations nearly as high as a quarter of the height of the triangle at the origin. The scalar summary statistics successfully detect the statistically significant deviation of the subpopulation from the full population.]

To emphasize: the deviation of the subpopulation from the full population over a contiguous range of S_{i_k} is the slope of the secant line for the plot of F_k − F̃_k as a function of A_k over that range, aside from expected random fluctuations.
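A sketch in Python of the weighted construction (25)-(28); the function name and synthetic data are ours, and the sketch assumes no tied scores.

```python
import numpy as np

def weighted_cumulative(scores, results, weights, sub_idx):
    """Weighted cumulative differences F_k - F~_k and abscissae A_k
    (a sketch following (25)-(28); variable names are ours)."""
    s = np.asarray(scores, dtype=float)
    r = np.asarray(results, dtype=float)
    w = np.asarray(weights, dtype=float)
    sub = np.asarray(sub_idx)
    sub = sub[np.argsort(s[sub], kind="stable")]   # sort so S_{i_1} <= ... <= S_{i_n}
    n = len(sub)
    B = np.empty(n + 1)
    B[0], B[n] = -np.inf, np.inf
    B[1:n] = (s[sub][:-1] + s[sub][1:]) / 2        # thresholds, as in (7)
    r_tilde = np.empty(n)
    for k in range(n):                             # weighted bin averages, as in (26)
        mask = (s > B[k]) & (s <= B[k + 1])
        r_tilde[k] = np.sum(w[mask] * r[mask]) / np.sum(w[mask])
    wtot = np.sum(w[sub])
    F = np.cumsum(w[sub] * r[sub]) / wtot          # (25)
    F_tilde = np.cumsum(w[sub] * r_tilde) / wtot   # (27)
    A = np.cumsum(w[sub]) / wtot                   # cumulative weights, as in (28)
    return F - F_tilde, A

rng = np.random.default_rng(4)
m, n = 5000, 200
s = rng.uniform(size=m)
r = rng.binomial(1, s).astype(float)
w = rng.uniform(0.5, 1.5, size=m)                 # positive observation weights
sub = rng.choice(m, size=n, replace=False)
diff, A = weighted_cumulative(s, r, w, sub)
```

With uniform weights this reduces to the unweighted construction, with A_k = k/n.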
A simple illustrative example is available in Fig. 7 of section "Synthetic", and many examples analyzing data from the U.S. Census Bureau are available in section "American Community Survey of the U.S. Census Bureau".
The slope of line segments connecting the points in the plot of F k −F k versus A k is constant between successive values of k, and those successive values are spaced further apart on the horizontal axis when the weight W i k is larger. A plotted line that is straight for a wide horizontal range is therefore indicative of a large weight. Moreover, setting the (major) ticks on the upper horizontal axis at the positions corresponding to equispaced values for k visually depicts the distribution of weights; including equispaced minor ticks on the same upper horizontal axis provides a comparison to the case of uniform weighting.
In cases for which either R_i = 0 or R_i = 1 for each i = 1, 2, ..., m, and for which there are many scores from the full population in the bin for each score from the subpopulation, that is, #{i : B_{k−1} < S_i ≤ B_k} is large for every k = 1, 2, ..., n, the tip-to-tip height of the triangle at the origin should be four times

σ = √( Σ_{k=1}^{n} (W_{i_k})² R̃_{i_k} (1 − R̃_{i_k}) ) / ( Σ_{j=1}^{n} W_{i_j} ),   (31)

where R̃_{i_k} is defined in (26); that is, we replace (21) with (31). When R_i can take on more than two possible values, the figures instead use the empirical variance of the members from the full population for each bin from B_{k−1} to B_k, that is, we replace (22) and (31) with

σ = √( Σ_{k=1}^{n} (W_{i_k})² σ̂_k² ) / ( Σ_{j=1}^{n} W_{i_j} ),   (32)

where the empirical variance is

σ̂_k² = ( Σ_{i : B_{k−1} < S_i ≤ B_k} W_i (R_i − R̃_{i_k})² ) / ( Σ_{i : B_{k−1} < S_i ≤ B_k} W_i )   (33)

for k = 1, 2, ..., n, with R̃_{i_k} defined in (26). The numerators of (31) and (32) include the square (W_{i_k})², unlike the other formulae.
The scalar summary statistics due to Kuiper and to Kolmogorov and Smirnov of course use the same formulae (12) and (13) as in the unweighted (or uniformly weighted) case, except for replacing the definition of F_k from (5) with the definition from (25) and the definition of F̃_k from (8) with the definition from (27).

[Displaced figure caption: Figure 8 displays the ground-truth reliability diagram. The cumulative plot displays the distinguished observation from the subpopulation as a straight, steep jump at its score around 0.75; the constant slope of that steep jump shows that the corresponding high deviation between the subpopulation and the full population is due to a single highly weighted observation. This single observation has no effect on the slopes in the rest of the cumulative plot, whereas the few highly weighted observations dramatically (perhaps misleadingly?) alter the bins around scores of 0.75 in the observed reliability diagrams. The scalar summary statistics report statistically significant deviation of the subpopulation from the full population, though the steep jump in the cumulative plot reduces the effectiveness of the scalar statistics.]

Remark 4
We can adapt to the case of weighted sampling the classical methods discussed in the introduction. As in the introduction, we choose some partition of the real line into ℓ disjoint intervals with endpoints B_1, B_2, ..., B_{ℓ−1} and another (possibly the same) ℓ disjoint intervals with endpoints B̃_1, B̃_2, ..., B̃_{ℓ−1}. With the bin averages now weighted by W_1, W_2, ..., W_m, a graphical method for assessing the deviation of the subpopulation from the full population is then the scatterplot of the pairs (X_1, Y_1), (X_2, Y_2), ..., (X_ℓ, Y_ℓ) in black and the pairs (X̃_1, Ỹ_1), (X̃_2, Ỹ_2), ..., (X̃_ℓ, Ỹ_ℓ) in gray. Comparing the black plotted points (possibly connected with black lines) to the gray plotted points (possibly connected with gray lines) gives an indication of deviation of the subpopulation from the full population. Two natural choices of the bins whose endpoints are B_1, B_2, ..., B_{ℓ−1} (similar choices pertain to the bins whose endpoints are B̃_1, B̃_2, ..., B̃_{ℓ−1}) are as follows: the first is to make B_1, B_2, ..., B_{ℓ−1} equispaced; the second is to select the endpoints such that the per-bin ratio of weights denoted W_2/W_1 in the plot titles has a similar value for all k = 1, 2, ..., ℓ. Remark 5 below details the procedure we followed in the second case; the plots entitled, "reliability diagram (W_2/W_1 is similar for every bin)," display this second possible choice of bins. The plots entitled simply, "reliability diagram," display the first possible choice of bins.

[Caption of Fig. 9: Eskimo Dog or Husky, with scores (S_1, S_2, ..., S_m) being the negative log-likelihoods; n = 1300; Kuiper's statistic is 0.08082/σ = 6.363, Kolmogorov's and Smirnov's is 0.08082/σ = 6.363. None of the reliability diagrams is able to smooth away the irrelevant variations while simultaneously capturing the severe deviation at the lowest negative log-likelihoods. The scalar summary statistics very successfully detect the statistically highly significant deviation of the subpopulation from the full population.]
In the special case that the weights are uniform, that is, W_1 = W_2 = · · · = W_m, the second choice of bins results in every bin containing about the same number of scores from the subpopulation for k = 1, 2, ..., ℓ.

[Caption of Fig. 10: Eskimo Dog or Husky, with scores being the probabilities; n = 1300; Kuiper's statistic is 0.08082/σ = 6.363, Kolmogorov's and Smirnov's is 0.06959/σ = 5.478. As in Fig. 9, none of the reliability diagrams with the same number of subpopulation scores per bin is able to smooth away the irrelevant variations while simultaneously capturing the severe deviation at the highest probabilities (however, the reliability diagram with 10 bins that are equispaced in probabilities captures everything nicely). The scalar summary statistics very successfully detect the statistically highly significant deviation of the subpopulation from the full population.]

Remark 5
In the case of weighted sampling, the most useful reliability diagrams are usually those entitled, "reliability diagram ( W 2 / W 1 is similar for every bin). " These diagrams construct ℓ bins with endpoints B 0 , B 1 , ..., B ℓ such that U 1 ≈ U 2 ≈ . . . ≈ U ℓ , where U k is defined in (38). These diagrams also construct l̃ bins with endpoints B̃ 0 , B̃ 1 , ..., B̃ l̃ such that Ũ 1 ≈ Ũ 2 ≈ . . . ≈ Ũ l̃ , where Ũ k is defined in (39). The algorithmic details are as follows: given a value U for which hopefully U k ≈ U for all k = 1, 2, ..., ℓ , we set B 0 = −∞ and, iterating from k = 1 to k = ℓ , incrementally increase B k to the least value greater than B k−1 such that U k ≤ U . If this causes bin ℓ (the bin for the highest scores) to contain less than half as many subpopulation observations as bin ℓ − 1 , we merge the final two bins into one. In the aforementioned algorithm, we compute U via the formula (40), in which p 1 , p 2 , ..., p n are a uniformly random permutation of the integers i 1 , i 2 , ..., i n , and l is the desired number of bins. Calculating U via the heuristic (40) worked well for all examples reported below, producing U 1 ≈ U 2 ≈ . . . ≈ U ℓ with ℓ close to l; the procedure yielding Ũ 1 ≈ Ũ 2 ≈ . . . ≈ Ũ l̃ is similar (in fact, the implementation is identical, viewing the full population as a subpopulation of itself).

Fig. 11 Night snake (Hypsiglena torquata), with scores being the negative log-likelihoods; n = 1300; Kuiper's statistic is 0.1365/σ = 11.35 , Kolmogorov's and Smirnov's is 0.1365/σ = 11.35 . Bins equispaced across either the subpopulation's observations or the scores (where the scores are negative log-likelihoods in this figure) cannot resolve the severe deviations at low scores without being overly noisy elsewhere. The scalar summary statistics extremely successfully detect the statistically highly significant deviation of the subpopulation from the full population.

Fig. 12 Night snake (Hypsiglena torquata), with scores being the probabilities; n = 1300; Kuiper's statistic is 0.1365/σ = 11.35 , Kolmogorov's and Smirnov's is 0.1353/σ = 11.24 . As in Figure 11, uniform resolution in the binned plots (whether with respect to observations or to scores) is insufficient. The scalar summary statistics extremely successfully detect the statistically highly significant deviation of the subpopulation from the full population.

Tygert J Big Data (2021) 8:117
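The iterative bin construction of Remark 5 can be sketched in a few lines. Since this excerpt does not reproduce the definition (38) of U k, the sketch below substitutes the bin's total weight for the per-bin statistic; that substitution is an assumption for illustration, not the paper's exact statistic.

```python
import numpy as np

def greedy_bins(scores, weights, U):
    """Greedy sketch of the bin construction in Remark 5: scan the scores in
    increasing order and close the current bin as soon as adding one more
    observation would push the bin's statistic past the target U.  The
    per-bin statistic here is simply the bin's total weight -- a stand-in
    for the quantity U_k of (38), which this excerpt does not reproduce."""
    order = np.argsort(scores)
    endpoints = []   # interior endpoints B_1, B_2, ..., B_{l-1}
    u = 0.0          # statistic of the bin under construction
    prev = None      # last score placed in the current bin
    for j in order:
        if u > 0.0 and u + weights[j] > U:
            endpoints.append(prev)   # close the current bin at its last score
            u = 0.0
        u += weights[j]
        prev = scores[j]
    return endpoints

# Six equally weighted scores with target 2 per bin yield three bins.
edges = greedy_bins([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [1, 1, 1, 1, 1, 1], 2)
```

With uniform weights this reduces to the equal-count binning discussed above, which is consistent with the special case noted in the text.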

Results and discussion
This section illustrates via numerous examples the previous section's methods, together with the traditional plots (so-called "reliability diagrams") discussed in the introduction. Section "Synthetic" presents several (hopefully insightful) toy examples. Section "ImageNet" analyzes a popular, unweighted data set of images, ImageNet. Section "American Community Survey of the U.S. Census Bureau" analyzes a weighted data set, the most recent (year 2019) American Community Survey of the United States Census Bureau. Section "Future outlook" proposes directions for future developments.
The figures display the classical calibration plots ("reliability diagrams") as well as both the plots of cumulative differences and the exact expectations in the absence of noise from random sampling when the exact expectations are known (as with synthetically generated data). The captions of the figures discuss the numerical results depicted.
The title, "subpopulation deviation is the slope as a function of k/n, " labels a plot of F k − F̃ k from (5) and (8) as a function of k/n. In each such plot, the upper axis specifies k/n, while the lower axis specifies S i k for the corresponding value of k. The title, "subpopulation deviation is the slope as a function of A k , " labels a plot of F k − F̃ k from (25) and (27) versus the cumulative weight A k from (28). In each such plot, the major ticks on the upper axis specify k/n, while the major ticks on the lower axis specify S i k for the corresponding value of k; the points in the plot are the ordered pairs (A k , F k − F̃ k ) for k = 1, 2, ..., n, with A k being the abscissa and F k − F̃ k being the ordinate.

Fig. 13 Sunglasses, with scores being the probabilities; n = 1300; Kuiper's statistic is 0.03597/σ = 2.935 , Kolmogorov's and Smirnov's is 0.03365/σ = 2.745 . The reliability diagrams with only 10 or 30 bins each fail to resolve the very high deviation for probabilities greater than 0.9 (unlike the diagram with 50 bins); the smaller numbers of bins average away interesting behavior, without warning. The scalar summary statistics detect some statistically significant deviation, yet both are blind to the serious deviation for the highest scores that the plot of cumulative differences displays prominently; the steep drop at the highest scores in the cumulative plot has little to no effect on the Kolmogorov-Smirnov or Kuiper metrics, unfortunately.
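A minimal sketch of computing these ordinates against the abscissas k/n follows. Since (5) and (8) are not reproduced in this excerpt, the reference values cumulated for the full population are computed under the assumption that the full population's outcomes get averaged in finest-scale bins around the subpopulation's scores, as the paper's overall construction describes.

```python
import numpy as np

def cumulative_deviation(sub_scores, sub_results, full_scores, full_results):
    """Sketch of the cumulative-difference curve whose secant slopes display
    the subpopulation's deviation.  F_k cumulates the subpopulation's
    outcomes; the reference cumulates, for each subpopulation score, the
    average outcome of the full-population members nearest to that score
    (the finest scale the subpopulation's sampling permits)."""
    s = np.asarray(sub_scores, float)
    r = np.asarray(sub_results, float)
    order = np.argsort(s)
    s, r = s[order], r[order]
    n = len(s)
    fs = np.asarray(full_scores, float)
    fr = np.asarray(full_results, float)
    # Assign every member of the full population to its nearest subpopulation score.
    idx = np.clip(np.searchsorted(s, fs), 1, n - 1)
    nearest = np.where(np.abs(fs - s[idx - 1]) <= np.abs(fs - s[idx]), idx - 1, idx)
    sums = np.bincount(nearest, weights=fr, minlength=n)
    counts = np.bincount(nearest, minlength=n)
    ref = sums / np.maximum(counts, 1)       # full-population average per bin
    k_over_n = np.arange(1, n + 1) / n
    return k_over_n, np.cumsum(r - ref) / n  # abscissas, cumulative differences

# When the "subpopulation" is the whole population, the curve is identically 0.
ks, dev = cumulative_deviation([0.1, 0.4, 0.8], [1, 0, 1], [0.1, 0.4, 0.8], [1, 0, 1])
```

Plotting dev against ks (or against the cumulative weights A k in the weighted case) yields the graphs whose secant slopes the titles describe.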
To give a sense of the uncertainties in the classical, binned plots, we vary the number of bins and observe how the plotted values vary. Displaying the bin frequencies is another way to indicate uncertainties, as suggested, for example, by [6]. Still other possibilities could use kernel density estimation, as suggested, for example, by [7] and [8]. Such uncertainty estimates require selecting widths for the bins or kernel smoothing; varying the widths (as done in the present paper) avoids having to make what would otherwise be a rather arbitrary choice. A thorough survey of the various possibilities is available in Chapter 8 of [8].

As in Fig. 13, the reliability diagrams with only 10 or 30 bins each underplay the very high deviation for the highest probabilities, smoothing away big deviation without warning. As in Fig. 13, the scalar summary statistics detect statistically significant deviation, while both are blind to the significant deviation for the highest scores that the plot of cumulative differences highlights; the steep drop at the highest scores in the cumulative plot has essentially no impact on the Kolmogorov-Smirnov or Kuiper metrics, unfortunately.

As noted in the introduction, there are two canonical choices for the bins in the case of unweighted (or uniformly weighted) sampling: {1} make the average of S i k (or S i ) in each bin be approximately equidistant from the average of S i k (or S i ) in each neighboring bin or {2} make the number of S i k (or S i ) in every bin (except perhaps for the last) be the same. The figures label the first, more conventional possibility with the short title, "reliability diagram, " and the second possibility with the longer title, "reliability diagram (equal number of subpopulation scores per bin). " As noted in Remark 4, there are two natural choices for the bins in the case of weighted sampling: {1} make the weighted average of S i k (or S i ) in each bin be approximately equidistant from the weighted average of S i k (or S i ) in each neighboring bin or {2} follow Remark 5 above. The figures label the first possibility with the short title, "reliability diagram, " and the second possibility with the longer title, "reliability diagram ( W 2 / W 1 is similar for every bin). " Setting the number of bins together with any of these choices fully specifies the bins. As discussed earlier, we vary the number of bins, since there is no perfect setting: using fewer bins offers higher-confidence estimates, yet limits the resolution for detecting deviations and for assessing how the deviations vary as a function of S i k .

Synthetic
In this subsection, the examples draw observations at random from various statistical models so that the underlying "ground-truth" is available. To generate the corresponding figures, Figs. 1, 2, 3, 4, 5, 6, 7, and 8, we specify values for the scores S 1 , S 2 , ..., S m and for the indices i 1 , i 2 , ..., i n . We also select probabilities P 1 , P 2 , ..., P m and then independently draw the outcomes R 1 , R 2 , ..., R m from the Bernoulli distributions with parameters P 1 , P 2 , ..., P m , respectively. The first three examples, corresponding to Figs. 1, 2, 3, 4, 5, and 6, use unweighted (or, equivalently, uniformly weighted) data; the fourth example, corresponding to Figs. 7 and 8, uses weighted data, in which each pair of score S i and result R i comes with an additional positive scalar weight W i . Appendix B illustrates, via further examples, the degenerate case in which deviation is absent.

The deviation of the subpopulation below the full population that occurs for scores greater than 1.5 is apparent only in the cumulative plot or in the (very noisy) reliability diagrams whose bins are roughly equispaced along the scores. The scalar summary statistics detect somewhat significant deviation, but not as much as would be possible if the Kolmogorov-Smirnov and Kuiper metrics were to take fully into account the steep drop that occurs in the plot of cumulative differences.

The top rows of Figs. 1, 3, and 5 plot F k − F̃ k from (5) and (8) as a function of k/n, with the rightmost plot displaying its noiseless expected value rather than using the random observations R 1 , R 2 , ..., R m . The top row of Figure 7 plots F k − F̃ k from (25) and (27) versus the cumulative weight A k from (28), with the rightmost plot displaying its noiseless expected value rather than using the random observations R 1 , R 2 , ..., R m . Figs.
2, 4, 6, and 8 plot the pairs (S 1 , P 1 ), (S 2 , P 2 ) , ..., (S m , P m ) in gray, and plot the pairs (S i 1 , P i 1 ), (S i 2 , P i 2 ) , ..., (S i n , P i n ) in black, producing ground-truth diagrams that the lowermost two rows of plots from the associated Figs. 1, 3, 5, and 7 are trying to estimate using only the observations R 1 , R 2 , ..., R m , without access to the underlying probabilities P 1 , P 2 , ..., P m .
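The generation recipe above (independent Bernoulli draws R j with success probabilities P j attached to scores S j, plus a choice of subpopulation indices) can be sketched as follows; the particular probabilities and subpopulation indices here are made up for illustration, not those of the paper's examples.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 50_000
scores = (np.arange(1, m + 1) - 0.5) / m      # S_j = (j - 0.5)/m, equispaced in (0, 1)
# Hypothetical expected outcomes P_j; the paper's examples build these from
# mixtures of subpopulations, which this excerpt does not fully specify.
probs = 0.5 + 0.4 * np.sin(2 * np.pi * 10 * scores)
results = rng.binomial(1, probs)              # R_j ~ Bernoulli(P_j), independently
indices = np.arange(19, m, 20)                # hypothetical subpopulation i_1 < i_2 < ...
```

With the ground-truth probs in hand, the noiseless expected cumulative curves can be computed exactly, which is what the rightmost plots in the top rows of the synthetic figures display.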
For the examples, we consider full populations which are mixtures of many subpopulations, including for each example one specific subpopulation that we analyze for deviations from the average over the full population. To make the models realistic, we assign expected outcomes to the members of the full population such that the expected outcomes span a continuous range which includes the expected outcomes attained by members of the subpopulation being analyzed for deviations. The reliability diagrams use gray points and gray connecting-lines to indicate the full population, and solid black points and black connecting-lines to indicate the specific subpopulation under consideration.

As in Fig. 16, the deviation of the subpopulation below the full population that occurs for probabilities less than 0.1 is apparent only in the cumulative plot or in the very noisy reliability diagrams whose bins are roughly equispaced along the probabilities. The scalar summary statistics again detect somewhat significant deviation, though not as much as if the Kolmogorov-Smirnov and Kuiper metrics could better account for the steep drop that happens in the plot of cumulative differences.
For the first example, corresponding to Figs. 1 and 2, we consider a mixture of subpopulations with a reasonably wide range of expected outcomes for most ranges of scores, except for a narrow notch around scores of 0.25 which lacks the significant subpopulation deviation. We select for the specific subpopulation analyzed for deviations a subpopulation with among the highest expected outcomes around every score. The scores are S j = ((j − 0.5)/m) 2 for j = 1, 2, ..., m, thus more concentrated near 0 than near 1. For the indices i 1 , i 2 , ..., i n of the subpopulation, we start with the integer multiples of 20 and then add three levels of refinement increasingly focused around m/2, where m/2 corresponds to the middle of the notch. The total number of scores considered is m = 50,000, and (following the refinement) the number of subpopulation indices is n = 5000.

Fig. 18 Sidewinder/horned rattlesnake (Crotalus cerastes), with scores being the negative log-likelihoods; n = 1300; Kuiper's statistic is 0.1343/σ = 10.47 , Kolmogorov's and Smirnov's is 0.1343/σ = 10.47 . The reliability diagrams whose bins each contain roughly the same number of subpopulation scores are either noisier or missing the severest deviations in comparison with the other reliability diagrams; so there do exist cases in which choosing bins that are nearly equispaced along the scores works better than choosing bins that each contain roughly the same number of subpopulation scores. The scalar summary statistics extremely successfully detect the statistically highly significant deviation of the subpopulation from the full population.
For the second example, corresponding to Figs. 3 and 4, we consider a mixture of subpopulations that includes one whose expected outcomes oscillate smoothly between the minimum and the maximum of the expected outcomes of all subpopulations as a function of the score; we analyze that particular subpopulation for deviations from the full population. The scores are S j = (j − 0.5)/m for j = 1, 2, ..., m, hence equispaced between 0 and 1. The total number of scores considered is m = 50,000. For the indices i 1 , i 2 , ..., i n of the subpopulation, we raise to the power 4/3 each positive integer, then round to the nearest integer, and finally retain the lowest n = 3300 unique resulting integers.

As in Figure 18, the reliability diagrams whose bins each contain roughly the same number of subpopulation scores either miss the severest deviations or are noisier compared to the other reliability diagrams for this example; so sometimes choosing bins that are approximately equispaced along the scores works better than choosing bins that each contain a similar number of subpopulation scores. The scalar summary statistics extremely successfully detect the statistically highly significant deviation of the subpopulation from the full population.

Fig. 20 The uppermost plots are for the class of Fig. 9, the middlemost plots are for the night snake (Hypsiglena torquata) of Fig. 11, and the lowermost plots are for the sidewinder/horned rattlesnake (Crotalus cerastes) of Fig. 18; each plot on the right zooms in on the origin of the plot to its left, adjusting the height of the triangle at the origin, the number n of observations, and the Kuiper and Kolmogorov-Smirnov statistics to reflect the smaller range of scores on the horizontal axis.

For the third example, corresponding to Figs.
5 and 6, we consider a mixture of subpopulations similar to that for the second example, but this time selecting for analysis a subpopulation whose expected outcomes oscillate in discrete steps between the minimum and the maximum of the expected outcomes of all subpopulations as a function of the score. The scores are S j = √((j − 0.5)/m) for j = 1, 2, ..., m, thus less concentrated near 0 than near 1. For the indices i 1 , i 2 , ..., i n of the subpopulation, we generate a random permutation of the integers 1, 2, ..., m, retain the first n, and then sort them. The total number of scores considered is m = 50,000, and the number of subpopulation indices is n = 2500.

Fig. 21
San Joaquin County, reporting the number of related children in the household, with scores being log 10 of the adjusted household income; n = 2282; Kuiper's statistic is 0.2449/σ = 9.120 , Kolmogorov's and Smirnov's is 0.2429/σ = 9.045 . The lack of deviation between the subpopulation and the full population right near scores of 4.0 is difficult to discern in the reliability diagrams with 10 or 20 bins each. The reliability diagrams with around 100 bins each do display the lack of deviation near scores of 4.0, but the rest of these diagrams is very noisy. The cumulative plot nicely captures the lack of deviation near scores of 4.0. Overall, the scalar summary statistics detect highly statistically significant deviation.

For the fourth example, corresponding to Figs. 7 and 8, we consider weighted samples (whereas the three previous examples all used unweighted or, equivalently, uniformly weighted data, in which all weights are equal). We start by constructing the full population as a mixture of subpopulations whose probabilities of success oscillate between 0 and 0.2, with the subpopulation being analyzed selected uniformly at random such that the probability of success for each observation is 0. The scores are S j = (j − 0.5)/m for j = 1, 2, ..., m, hence equispaced between 0 and 1. The total number of scores considered is m = 50,000, and the number of subpopulation indices is n = 2500. We weight all the observations just mentioned equally, with weight 1, and then introduce three outliers near the score 0.75: the first is an observation from the subpopulation, whose probability of success we alter to 1 and whose weight we set to 0.02n; the other two are the two members of the full population outside the subpopulation whose scores are adjacent to the score for that observation from the subpopulation, whose probabilities of success we set to 0 and 1, both with the same weight 0.002m.
The captions of the figures comment on the numerical results displayed.

Fig. 22
Humboldt County, reporting whether no one in the household 14 or over speaks English very well, with scores being log 10 of the adjusted household income; n = 583; Kuiper's statistic is 0.1028/σ = 6.062 , Kolmogorov's and Smirnov's is 0.1028/σ = 6.062 . A single, highly weighted outlying observation from the subpopulation corrupts a bin at the highest scores in each reliability diagram. The cumulative plot includes this outlying observation, too, but displays the observation as an unmistakable steep jump in the plotted curve; the constant slope of that steep jump shows that the corresponding high deviation between the subpopulation and the full population is due to a single highly weighted observation. This single observation has no effect on the slopes in the rest of the cumulative plot, so this problematic observation corrupts only the reliability diagrams and not the plot of cumulative differences; an analogue of Simpson's Paradox of [9] afflicts the reliability diagrams but not the cumulative plot. The scalar summary statistics report highly statistically significant deviation commensurate with all the plots, albeit slightly less due to the steep jump in the cumulative plot. These behaviors are analogous to those displayed in Fig. 7 above (see also Remark 6 above).

If (as in Figs. 7 and 8) the weight W i for some single observation is extraordinarily disproportionately high, then any bin containing that observation will be strongly biased toward only that individual observation, blind to the other observations in the bin.
Such a phenomenon can lead to misleading behavior akin to Simpson's Paradox of [9] in the canonical binned reliability diagrams; for instance, the subpopulation may very well attain results R i 1 , R i 2 , ..., R i n that are less than all others in the full population except for the one disproportionately heavily weighted one, yet if the results R i 1 , R i 2 , ..., R i n for the subpopulation are greater than the result R i corresponding to the one disproportionately large weight W i , then any bin in reliability diagrams containing that heavily weighted observation will show that the subpopulation attains greater results than the weighted average over the full population. In contrast, a single heavily weighted observation affects only one subpopulation observation in the plots of cumulative differences, merely introducing a jump in the constant offset at the score corresponding to the one observation (and the jump will be disproportionately large only in proportion to the weight of the corresponding observation from the subpopulation). Figure 7 illustrates related behavior, and Fig. 22 below exhibits similar behavior for the data from the U.S. Census Bureau analyzed there.

Fig. 23 Orange County, reporting whether the household has high-speed access to the internet, with scores being log 10 of the adjusted household income; n = 10,680; Kuiper's statistic is 0.02499/σ = 6.013 , Kolmogorov's and Smirnov's is 0.02484/σ = 5.979 . The severe deviation at the very lowest scores is misleadingly underestimated in the reliability diagrams with 10 or 20 bins each, even as compared to the diagrams with around 100 bins each. However, the diagrams with around 100 bins each are far too noisy for other scores. Only the plot of cumulative differences resolves the severe deviation at the lowest scores while being informative at the other scores, too. The scalar summary statistics indicate that the overall deviation is highly statistically significant.
Comparing the subpopulation against the full population via cumulative differences is thus similar to subtracting off baseline rates for calibration (that is, similar to "de-trending").
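A tiny numeric example makes the bin-domination effect concrete: the subpopulation's outcomes sit below every ordinary outcome in the bin, yet a single heavily weighted observation drags the full population's weighted bin average below the subpopulation's average, so the binned diagram reports the subpopulation as over-performing. The specific numbers are illustrative, not from the paper.

```python
import numpy as np

# One bin of a weighted reliability diagram: the subpopulation's outcomes
# (0.4) are below every ordinary full-population outcome (0.6), yet a single
# observation with outcome 0 and weight 100 dominates the weighted average.
sub_r = np.array([0.4, 0.4, 0.4])
sub_w = np.array([1.0, 1.0, 1.0])
other_r = np.array([0.6, 0.6, 0.0])      # last entry is the heavy outlier
other_w = np.array([1.0, 1.0, 100.0])

full_r = np.concatenate([sub_r, other_r])
full_w = np.concatenate([sub_w, other_w])

sub_avg = np.average(sub_r, weights=sub_w)     # 0.4
full_avg = np.average(full_r, weights=full_w)  # about 0.023
# The binned diagram shows the subpopulation ABOVE the full population,
# even though its outcomes are below every ordinary outcome in the bin.
```

In a cumulative plot, by contrast, that heavy observation contributes only a single jump in the constant offset, leaving the slopes elsewhere untouched.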

Fig. 24
Los Angeles County, reporting the number of people in the household, with scores being log 10 of the adjusted household income; n = 35,364; Kuiper's statistic is 0.06674/σ = 9.605 , Kolmogorov's and Smirnov's is 0.06495/σ = 9.347 . Discerning the analogue of the dip in the plot of cumulative differences at the highest scores is possible yet difficult in the reliability diagrams, while being unmistakable in the cumulative plot; the reliability diagrams with enough bins do reflect the corresponding deviation at the highest scores, but are hard to interpret without the accompanying cumulative plot. The scalar summary statistics report very highly statistically significant deviation, largely since the number of observations from this largest county in California is so large

ImageNet
This subsection analyzes the standard training data set "ImageNet-1000" of [10]. Each of the thousand labeled classes in the data set forms a natural subpopulation to consider; each class considered consists of n = 1300 images of a particular noun (such as a "sidewinder/horned rattlesnake, " a "night snake, " or an "Eskimo Dog or Husky"). Some classes in the data set contain fewer than 1300 images, so that the total number of members of the data set is m = 1,281,167, but each subpopulation we analyze below corresponds to a class with 1300 images. The particular classes reported below are cherry-picked to illustrate a variety of problems that can afflict the classical reliability diagrams; not all classes exhibit such problems, nor are these all the classes that exhibit problems (many, many others have similar issues). The images are unweighted (or, equivalently, uniformly weighted), not requiring the methods of section "Weighted sampling" above. To generate the corresponding figures, we calculate the scores S 1 , S 2 , ..., S m using the pretrained ResNet18 classifier of [11] from the computer-vision module, "torchvision, " in the PyTorch software library of [12]; the score for an image can be either {1} the probability assigned by the classifier to the class predicted to be most likely or {2} the corresponding negative log-likelihood (that is, the negative of the natural logarithm of the probability), with the scores randomly perturbed by about one part in 10 8 to guarantee their uniqueness. For i = 1, 2, ..., m, the result R i corresponding to a score S i is R i = 1 when the class predicted to be most likely is the correct class, and R i = 0 otherwise. The figures below omit display of reliability diagrams whose bins are equispaced with respect to the scores in the typical case that these reliability diagrams are so noisy as to be useless (the figures do display the diagrams when they are not too noisy). 
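The mapping from classifier outputs to scores S i and results R i just described can be sketched as follows. The arrays top_prob, pred, and label are hypothetical stand-ins for what running the pretrained ResNet18 of torchvision over ImageNet would produce; actually evaluating the network is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def scores_and_results(top_prob, pred, label, use_log=False):
    """Form scores S_i and results R_i from a classifier's outputs, as the
    text describes: the score is either the top predicted probability or its
    negative natural logarithm, perturbed by about one part in 1e8 to
    guarantee uniqueness; R_i = 1 exactly when the top prediction is the
    correct class."""
    p = np.asarray(top_prob, float)
    s = -np.log(p) if use_log else p.copy()
    s *= 1 + 1e-8 * rng.standard_normal(len(s))   # break exact ties
    r = (np.asarray(pred) == np.asarray(label)).astype(int)
    return s, r

s, r = scores_and_results([0.9, 0.9, 0.5], [1, 2, 3], [1, 0, 3])
```

The tiny multiplicative perturbation leaves the reliability diagrams and cumulative plots visually unchanged while ensuring the scores can be totally ordered.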
Figures 9,10,11,12,13,14,15,16,17,18,and 19 present several examples, listing in the captions the associated names of the classes for the subpopulations. Figure 20 illustrates the utility of the zooming proposed in Remark 3 above. In all figures, the plots of cumulative differences depict all observed phenomena clearly.

American Community Survey of the U.S. Census Bureau
This subsection analyzes the latest (year 2019) microdata from the American Community Survey of the U.S. Census Bureau; specifically, we consider the full population to be the members from all counties in California put together, and consider the subpopulation to be the observations from an individual county. The sampling in this survey is weighted; we retain only those members whose weights ("WGTP" in the microdata) are nonzero, and discard any member whose household personal income ("HINCP") is zero or for which the adjustment factor to income ("ADJINC") is reported as missing. For the scores, we use the logarithm to base 10 of the adjusted household personal income (the adjusted income is "HINCP" times "ADJINC, " divided by one million when "ADJINC" omits its decimal point in the integer-valued microdata), randomly perturbing the scores by about one part in 10 8 to guarantee their uniqueness. The captions on the figures specify which variables from the data set we use for the results R 1 , R 2 , ..., R m (different figures consider different variables), with m = 134,094. Figures 21, 22, 23, 24, 25, and 26 present the examples. In all figures, the cumulative plots display all significant features clearly.
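The preprocessing just described can be sketched as follows. The rows here are hypothetical dictionaries keyed by the microdata's field names, and the guard against nonpositive adjusted incomes is an added assumption, not something the text specifies.

```python
import math

def acs_scores(rows):
    """Sketch of the preprocessing described above: keep households with
    nonzero weight WGTP, discard zero HINCP or missing ADJINC, and score by
    log10 of the adjusted income HINCP * ADJINC / 1e6 (the division by one
    million undoes ADJINC's omitted decimal point in the integer microdata)."""
    kept = []
    for row in rows:
        if not row.get("WGTP"):                          # zero or missing weight
            continue
        if not row.get("HINCP") or row.get("ADJINC") is None:
            continue
        income = row["HINCP"] * row["ADJINC"] / 1_000_000
        if income <= 0:                                  # added guard (assumption)
            continue
        kept.append((math.log10(income), row["WGTP"]))
    return kept   # (score, weight) pairs; scores still need tie-breaking noise

rows = [
    {"WGTP": 10, "HINCP": 100000, "ADJINC": 1010145},
    {"WGTP": 0, "HINCP": 50000, "ADJINC": 1010145},
    {"WGTP": 5, "HINCP": 0, "ADJINC": 1010145},
    {"WGTP": 7, "HINCP": 50000, "ADJINC": None},
]
kept = acs_scores(rows)
```

The random perturbation by about one part in 10^8, as in the ImageNet case, would then be applied to the retained scores.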

Future outlook
This subsection suggests generalizations suitable for future research. The present paper plots the cumulative differences between the results of a subpopulation and the results of the full population, accumulated as a function of a scalar real-valued score. If instead the scores to be accumulated are from a Euclidean or Hilbert space of more than one dimension, a natural generalization is to impose a total order on the scores via a space-filling curve such as the Peano or Hilbert curves, effectively replacing the space of more than one dimension with a one-dimensional ordering. Space-filling curves constructed via quad-trees in two-dimensional space or oct-trees in three-dimensional space can be especially efficient. Google's S2 Geometry Library, which drives Google Maps, Foursquare, MongoDB, et al., employs a Hilbert curve.

The present paper focuses on deviations of a subpopulation from the full population; comparing the subpopulation to the full population is always legitimate, as binning or interpolating from the scores of the full population to those of the subpopulation happens at a scale finer than the subpopulation's sampling, and there is always at least one score in the full population corresponding to each score in the subpopulation (namely, the same score). In contrast, assessing deviations between two different subpopulations that lack an explicit pairing for each score can be problematic. Indeed, the ranges of scores for the subpopulations may not even overlap: the scores for one subpopulation can all be significantly less than all the scores for another subpopulation, for instance. Binning or interpolating from one subpopulation to another can be ill-posed. And even when binning or interpolating from one subpopulation to a second subpopulation is well-posed, binning or interpolating from that second subpopulation back to the first can be ill-posed; the comparison need not be well-posed in both directions.
In general, a partial ordering governs the comparisons-we can always analyze deviations of any subpopulation from any larger population containing the subpopulation, but cannot always reliably analyze deviations between two different subpopulations of the same larger population. Applications demanding direct comparison between different subpopulations are currently the subject of intensive investigation, far beyond the scope of the present paper except in the special case that the different subpopulations are paired for each score. One possibility for the general case is to bin both subpopulations being compared to the same finest binning such that both subpopulations have at least one score in each bin; however, interpreting significance in this general case can be tricky. Forthcoming work will treat the special case for which no score associated with any observation from either subpopulation being compared is exactly equal to the score for any other observation from the two subpopulations.
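To make the space-filling-curve idea above concrete, here is a quad-tree (Morton, or Z-order) key that imposes a total order on two-dimensional scores by interleaving bits of the quantized coordinates. A Hilbert curve, as used by S2, improves locality but plays the same role; this Morton variant is an illustrative substitute, not the paper's prescription.

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of the quantized coordinates (x, y) in [0, 1)^2,
    giving a quad-tree (Z-order) total order on two-dimensional scores."""
    xi = min(int(x * (1 << bits)), (1 << bits) - 1)
    yi = min(int(y * (1 << bits)), (1 << bits) - 1)
    key = 0
    for b in range(bits):
        key |= ((xi >> b) & 1) << (2 * b + 1)   # x bits at odd positions
        key |= ((yi >> b) & 1) << (2 * b)       # y bits at even positions
    return key

# Ordering 2-D scores: sort by the Morton key, then accumulate as in 1-D.
pts = [(0.9, 0.9), (0.1, 0.2), (0.4, 0.6)]
ordered = sorted(pts, key=lambda p: morton_key(*p))
```

Once the multidimensional scores carry a total order, the cumulative differences and the Kuiper and Kolmogorov-Smirnov summaries apply unchanged.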

Conclusion
Plotting the cumulative differences between the outcomes for the subpopulation and those for the full population binned to the subpopulation's scores avoids the arbitrary selection of widths for bins or smoothing kernels that the conventional reliability diagrams and calibration plots require. The plot of cumulative differences displays deviation directly as the slope of secant lines for the graph; such slope is easy to perceive independent of any irrelevant constant offset of a secant line. The graph of cumulative differences very directly enables detection and quantification of deviation between the subpopulation and full population, along with identification of the corresponding ranges of scores. The cumulative differences estimate the distribution of deviation fully nonparametrically, letting the data observations speak for themselves (or nearly for themselves; the triangle at the origin helps convey the scale of a driftless random walk's expected random fluctuations). As seen in the figures, the graph of cumulative differences automatically adapts its resolving power to the distributions of deviations and scores for the subpopulation, not imposing any artificial grid of bins or fixed-width smoothing kernel, unlike the traditional reliability diagrams and calibration plots. The scalar metrics of Kuiper and of Kolmogorov and Smirnov conveniently summarize the overall statistical significance of deviations displayed in the graph.

Introduction to assessing calibration
This appendix treats calibration. Specifically, we consider n observations R 1 , R 2 , ..., R n of the outcomes of independent Bernoulli trials and would like to test the hypothesis that R j is drawn from the Bernoulli distribution whose probability of success is S j , that is, R j ∼ Bernoulli(S j ) for all j = 1 , 2, ..., n, where the corresponding probabilities of success are S 1 , S 2 , ..., S n ; following the usual conventions, R j = 1 when the outcome is a success and R j = 0 when the outcome is a failure. We view R 1 , R 2 , ..., R n (but not S 1 , S 2 , ..., S n ) as random. Without loss of generality, we order the success probabilities (preserving the pairing of R j with S j for every j) such that S 1 ≤ S 2 ≤ · · · ≤ S n , breaking any ties at random and perturbing the probabilities so that S 1 < S 2 < · · · < S n .
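The setting above admits a compact sketch of the cumulative assessment developed later in the appendix. The exact formulas for the cumulative sums and for the normalization σ appear only in later sections, so the versions below are paraphrases: σ is chosen to match the standard deviation of the final cumulative sum under the hypothesis of perfect calibration, which is an assumption of this sketch.

```python
import numpy as np

def cumulative_calibration(S, R):
    """Sketch of the cumulative assessment of calibration: sort by the
    success probabilities S_j, accumulate the differences between observed
    outcomes R_j and predicted probabilities S_j, and summarize with the
    Kuiper and Kolmogorov-Smirnov statistics."""
    order = np.argsort(S)
    S = np.asarray(S, float)[order]
    R = np.asarray(R, float)[order]
    n = len(S)
    diff = np.cumsum(R - S) / n                   # cumulative differences
    sigma = np.sqrt(np.sum(S * (1 - S))) / n      # random-walk scale (assumed form)
    kuiper = max(diff.max(), 0.0) - min(diff.min(), 0.0)
    ks = np.abs(diff).max()
    return diff, kuiper / sigma, ks / sigma

diff, kuiper, ks = cumulative_calibration([0.2, 0.4, 0.6, 0.8], [0, 0, 1, 1])
```

Miscalibration over an interval of probabilities then appears as the slope of the secant line of diff over that interval, exactly as for the subpopulation plots in the main text.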
As in section "Introduction", the classical methods require partitioning the unit interval [0, 1] into ℓ disjoint intervals with endpoints B 1 , B 2 , ..., B ℓ such that 0 < B 1 < B 2 < · · · < B ℓ−1 < B ℓ = 1 . We can then form the averages Y k of the outcomes R j whose success probabilities S j fall in the kth bin, for k = 1 , 2, ..., ℓ , under the convention that B 0 < 0 . We also calculate the average success probabilities X k in the bins for k = 1 , 2, ..., ℓ , under the same convention that B 0 < 0 . A graphical method for assessing calibration is then to scatterplot the pairs (X 1 , Y 1 ) , (X 2 , Y 2 ) , ..., (X ℓ , Y ℓ ) , along with the line connecting the origin (0, 0) to the point (1, 1); a pair (X k , Y k ) lies on that line when its calibration is ideal, as then Y k = X k . This graphical method for assessing the calibration or reliability of probabilistic predictions is known as a "reliability diagram" or "calibration plot, " as reviewed, for example, by [13]. Copious examples of such reliability diagrams are available in the figures below, as well as in the works of [1, 2, 6-8, 14-17] and many others; those works consider applications ranging from weather forecasting to medical prognosis to fairness in criminal justice to quantifying the uncertainty in predictions of artificial neural networks. An approach closely related to reliability diagrams is to smooth over the binning using kernel density estimation, as discussed by [7, 8] and others.

Fig. 27 Calibration of the full ImageNet-1000 training data set (the scores are the probabilities); n = 1,281,167; Kuiper's statistic is 0.03306/σ = 111.7 , Kolmogorov's and Smirnov's is 0.03306/σ = 111.7 . The reliability diagram with 100 bins, each with roughly the same number of observations, looks best among the reliability diagrams; the reliability diagrams with 1000 bins each display unreal stochastic variations. Only the cumulative plot conveniently reveals that a third of all observations (specifically, those with probabilities of at least 0.97) are well-calibrated. The scalar summary statistics report profoundly statistically significant miscalibration, courtesy of the large number of observations (the actual effect size is more modest, as seen by the values without dividing by σ).
As in section "Introduction", two common choices of the bins whose endpoints are B 1 , B 2 , ..., B ℓ are {1} to make B 1 , B 2 , ..., B ℓ be equispaced, and {2} to make the number of success probabilities that fall in the kth bin, that is, #{j : B k−1 < S j ≤ B k } , be the same for all k (aside from the rightmost bin, that for k = ℓ , if n is not perfectly divisible by ℓ ). In contrast to the classical approach, the methods of the following subsections avoid binning. Recently, [18] and [19] have forcefully reiterated serious problems with existing binned methodologies.
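The two bin choices just listed can be sketched as follows for success probabilities in [0, 1]; the exact remainder-handling convention when ℓ does not divide n is an assumption matching the parenthetical above.

```python
import numpy as np

def bin_endpoints(S, l, equal_counts=False):
    """The two bin choices above for probabilities S_1 <= ... <= S_n in [0, 1]:
    {1} endpoints B_1, ..., B_l equispaced, or {2} endpoints placed so that
    each count #{j : B_{k-1} < S_j <= B_k} is the same (the rightmost bin
    absorbing any remainder when l does not divide n)."""
    S = np.sort(np.asarray(S, float))
    if not equal_counts:
        return np.linspace(1.0 / l, 1.0, l)            # B_k = k / l
    per = len(S) // l
    # The right endpoint of bin k is the largest S_j it contains (B_l = 1).
    ends = [S[per * k - 1] for k in range(1, l)] + [1.0]
    return np.array(ends)

eq = bin_endpoints(np.arange(1, 11) / 10, 5)
ec = bin_endpoints(np.arange(1, 11) / 10, 5, equal_counts=True)
```

For probabilities spread evenly over (0, 1] the two choices coincide; they diverge when the S_j cluster, which is exactly when the choice of binning starts to matter.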
There have been a number of proposals to measure calibration via Kolmogorov-Smirnov statistics, without binning, including by [18] and [13]. Section 3.2 of [15], Chapter 8 of [8], and Fig. 1 of [18] also point to the utility of cumulative reliability diagrams and plots somewhat similar to those in the present paper. The particular plots proposed below focus on calibration specifically, encoding miscalibration directly as the slopes of secant lines for the graphs. Such plots lucidly depict miscalibration with significant quantitative precision. Popular graphical methods for assessing calibration appear not to leverage the key to the approach advocated below, namely that slope is easy to assess visually even when the constant offset of the graph (or portion of the graph under consideration) is arbitrary and meaningless.
To illustrate with an example, Figure 27 displays both the classical reliability diagrams as well as the plot of cumulative differences proposed below. Section "ImageNet examples for assessing calibration" below describes the figure in detail. The lowermost two rows of Fig. 27 are the classical diagrams, with n = 1,281,167; there are ℓ = 1000 bins for each diagram in the middlemost row and ℓ = 100 for each diagram in the lowermost row. The bins are equispaced along the probabilities in the leftmost reliability diagrams, whereas each bin contains the same number of probabilities from S_1, S_2, ..., S_n in the rightmost reliability diagrams. The light gray lines indicate "error bars" constructed via bootstrapping, as elaborated in section "Results and discussion for assessing calibration" below.
The topmost row of Figure 27 displays the cumulative plot. Leaving elucidation of the cumulative plots and their construction to section "Methods for assessing calibration" below, we just point out here that miscalibration across an interval is equal to the slope of the secant line for the graph over that interval, aside from expected stochastic fluctuations detailed in section "Significance of stochastic fluctuations for assessing calibration" below. Steep slopes correspond to substantial miscalibration across the ranges of probabilities where the slopes are steep, with the miscalibration over an interval exactly equal to the expected value of the slope of the secant line for the graph over that interval. In Fig. 27, the cumulative plot reveals a curious changepoint in calibration that is difficult to notice in the conventional reliability diagrams, and the Kolmogorov-Smirnov and Kuiper metrics conveniently summarize the statistical significance of the overall deviation across all scores, in accordance with sections "Scalar summary statistics for assessing calibration" and "Significance of stochastic fluctuations for assessing calibration" below.

The structure of the remainder of this appendix is as follows: section "Methods for assessing calibration" details the cumulative methods, and section "Results and discussion for assessing calibration" then illustrates them via several examples. Table 2 gives a glossary of the notation used in the appendices, while Table 1 gives a glossary of the notation used prior to the appendices.

[Displaced figure caption: Figure 29 displays the ground-truth reliability diagram. All plots, whether cumulative or conventional, appear to work well.]

[Displaced figure caption: Figure 31 displays the ground-truth reliability diagram. All plots, whether cumulative or conventional, appear to work well.]

Methods for assessing calibration
This subsection provides a detailed mathematical formulation of the cumulative methods, with section "High-level strategy for assessing calibration" outlining an approach to large-scale analysis. Section "Graphical method for assessing calibration" details the graphical methods. Section "Scalar summary statistics for assessing calibration" details the scalar summary metrics. Section "Significance of stochastic fluctuations for assessing calibration" treats statistical significance for both the graphical methods and the summary statistics.

[Displaced figure caption: Figure 33 displays the ground-truth reliability diagram. The conventional plots become increasingly problematic as n reduces to 100 from 10,000 and 1000 in Figures 28 and 30, whereas the cumulative plot still detects roughly the right level of miscalibration for 0 ≲ S_k ≲ 0.2 and 0.8 ≲ S_k ≲ 1; the cumulative plot indicates that too little data is available for 0.2 ≲ S_k ≲ 0.8 to detect any statistically significant miscalibration in that range of S_k (note the size of the triangle centered at the origin of the plot).]

High-level strategy for assessing calibration
This subsubsection proposes a process for large-scale data analysis.
As in section "High-level strategy", we can take a two-step approach when there are many data sets to analyze:
1. A screening stage assigning a single scalar summary statistic to each data set (where the size of the statistic measures miscalibration).
2. A detailed drill-down for each data set whose scalar summary statistic is large, drilling down into the miscalibration's variation as a function of the probability of success.
The drill-down relies on graphical display of miscalibration; the scalar statistic for the first stage simply summarizes the overall miscalibration across all success probabilities, as either the maximum absolute deviation of the graph from the ideal or the size of the range of deviations. Thus, for each data set, both stages are based on a graph; the first stage collapses the graph into a single scalar summary statistic. The following subsubsection constructs the graph.
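The two-stage strategy can be sketched in code along the following lines. The statistic used here is the maximum absolute cumulative difference divided by σ, and the threshold, the function names, and the dictionary-of-data-sets interface are all illustrative assumptions, not prescriptions from the paper:

```python
import numpy as np

def screen(datasets, threshold=4.0):
    """Stage 1: flag each data set whose scalar summary statistic, divided by
    the scale sigma, exceeds `threshold`; the flagged data sets are the
    candidates for the stage-2 graphical drill-down.

    datasets: dict mapping a name to a pair (s, r) of scores and outcomes.
    """
    flagged = []
    for name, (s, r) in datasets.items():
        s = np.asarray(s, dtype=float)
        r = np.asarray(r, dtype=float)
        order = np.argsort(s)          # cumulate in order of increasing score
        diff = np.cumsum(r[order] - s[order]) / len(s)
        # sigma = 1/n times the std. dev. of the sum of Bernoulli variates
        sigma = np.sqrt(np.sum(s * (1 - s))) / len(s)
        if sigma > 0 and np.max(np.abs(diff)) / sigma > threshold:
            flagged.append(name)
    return flagged
```

A well-calibrated data set yields a small statistic and survives screening, while a grossly miscalibrated one is flagged for drill-down.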

Graphical method for assessing calibration
This subsubsection details the construction of cumulative plots for calibration. The cumulative response is

F_k = (R_1 + R_2 + · · · + R_k)/n   (44)

for k = 1, 2, ..., n.
Under the null hypothesis (41), the expected cumulative response (or, just as well, the cumulative expected response) is

F̃_k = (S_1 + S_2 + · · · + S_k)/n   (45)

for k = 1, 2, ..., n.
A plot of F_k − F̃_k as a function of k displays miscalibration directly as slopes that deviate significantly from 0; indeed, the increment in the expected difference F_j − F̃_j from j = k − 1 to j = k is

E[F_k − F̃_k] − E[F_{k−1} − F̃_{k−1}] = (E[R_k] − S_k)/n;

thus, on a plot with the values for k spaced 1/n apart, the slope from j = k − 1 to j = k is R_k − S_k for k = 1, 2, ..., n, with expected value E[R_k] − S_k. The miscalibration for success probabilities near S_k is substantial when E[R_k] − S_k is significantly nonzero, that is, when the slope of the plot of F_k − F̃_k deviates significantly from horizontal over a significantly long range.

Tygert J Big Data (2021) 8:117

[Displaced figure caption: Figure 37 displays the ground-truth reliability diagram. As in Figure 34, all plots work sufficiently well, though the observed reliability diagrams may be misleading relative to the exact expectations, at least without careful attention to the variation with the number of bins.]
To emphasize: miscalibration over a contiguous range of S_k is the slope of the secant line for the plot of F_k − F̃_k as a function of k/n over that range, aside from expected stochastic fluctuations. The following subsubsection reviews two metrics that summarize how much the graph deviates from 0 (needless to say, if the slopes of the secant lines are all nearly 0, then the whole graph cannot deviate much from 0). Considering these metrics, section "Significance of stochastic fluctuations for assessing calibration" then discusses the expected random fluctuations. Many examples of plots are available in section "Results and discussion for assessing calibration".
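The construction of the cumulative plot can be sketched as follows (a sketch assuming numpy; the sorting of the pairs by score is an assumption matching the plots' lower axes, and the function names are illustrative):

```python
import numpy as np

def cumulative_plot_values(s, r):
    """Return (k/n, F_k - F~_k) for k = 1, ..., n, where
    F_k  = (R_1 + ... + R_k)/n  is the cumulative response and
    F~_k = (S_1 + ... + S_k)/n  is the expected cumulative response."""
    s = np.asarray(s, dtype=float)
    r = np.asarray(r, dtype=float)
    n = len(s)
    order = np.argsort(s)              # cumulate in order of increasing score
    diff = np.cumsum(r[order] - s[order]) / n
    return np.arange(1, n + 1) / n, diff

def secant_slope(x, diff, a, b):
    """Slope of the secant line over indices a..b; this estimates the average
    miscalibration E[R_k] - S_k for scores in that range."""
    return (diff[b] - diff[a]) / (x[b] - x[a])
```

Plotting `diff` against the first returned array (for instance with matplotlib) gives the cumulative plot; the final value of `diff` is simply the difference between the overall average outcome and the overall average score.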

Scalar summary statistics for assessing calibration
This subsubsection details the construction of summary statistics which collapse the plots introduced in the previous subsubsection into scalars. The captions of the figures report the values of the statistics for the corresponding examples.
Two standard metrics for miscalibration over the full range of success probabilities that account for expected random fluctuations are that due to Kolmogorov and Smirnov, the maximum absolute deviation

G = max_{0 ≤ k ≤ n} |F_k − F̃_k|,

and that due to Kuiper, the size of the range of the deviations

D = max_{0 ≤ k ≤ n} (F_k − F̃_k) − min_{0 ≤ k ≤ n} (F_k − F̃_k),

where F_0 = 0 = F̃_0; Remark 1 of section "Scalar summary statistics" explains the reason for including F_0 and F̃_0. Under the null hypothesis (41), the distributions of G and D are known, and can form the basis for tests of statistical significance, as described, for example, by Section 14.3.4 of [3]. The distributions are easy to calculate under the null hypothesis (41), as then the sequence F_1, F_2, ..., F_n defined in (44) is a random walk with fully specified transition probabilities S_1, S_2, ..., S_n. For assessing statistical significance (rather than overall effect size), G and D should be divided by σ, where σ is 1/n times the standard deviation of the sum of independent Bernoulli variates whose probabilities of success are S_1, S_2, ..., S_n, that is,

σ = (1/n) √( Σ_{j=1}^n S_j (1 − S_j) );

the following remark explains why.

[Displaced figure caption: Figure 39 displays the ground-truth reliability diagram. The plots, whether cumulative or conventional, reveal similar information here, though the reliability diagrams with an equal number of observations per bin provide more reliable estimates than the other reliability diagrams. The cumulative plot is perhaps the easiest to interpret: the miscalibration is significant for 0.1 ≲ S_k ≲ 0.23 and 0.27 ≲ S_k ≲ 0.6, with about the correct amount of miscalibration (the amount is correct since the secant lines have the expected slopes).]

[Displaced figure caption: Figure 41 displays the ground-truth reliability diagram. The cumulative plot captures more of the oscillations in the miscalibration, as do to some extent the reliability diagrams with an equal number of observations per bin; however, the variations in the reliability diagrams could be difficult to interpret without access to the ground-truth exact expectations.]
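The two summary statistics and the scale σ can be computed together, as in the following sketch (assuming numpy, and taking G to be the maximum absolute deviation and D the size of the range, as in the text; the function name is illustrative):

```python
import numpy as np

def calibration_summaries(s, r):
    """Kolmogorov-Smirnov and Kuiper summaries of the cumulative differences,
    with F_0 = 0 = F~_0 included, plus the scale sigma for significance."""
    s = np.asarray(s, dtype=float)
    r = np.asarray(r, dtype=float)
    n = len(s)
    order = np.argsort(s)              # the walk runs over increasing scores
    diff = np.concatenate(([0.0], np.cumsum(r[order] - s[order]) / n))
    g = np.max(np.abs(diff))                   # maximum absolute deviation
    d = np.max(diff) - np.min(diff)            # size of the range of deviations
    sigma = np.sqrt(np.sum(s * (1 - s))) / n   # 1/n times the std. dev. of the sum
    return g, d, sigma
```

For significance testing, divide g and d by sigma; D can exceed G when the cumulative differences take both signs, since the range then combines the positive and negative excursions.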

Remark 7
Under the null hypothesis (41) that assumes that the outcomes R_1, R_2, ..., R_n are drawn independently from Bernoulli distributions whose probabilities of success are S_1, S_2, ..., S_n, the expected value of G/σ is less than or equal to the expected value of the maximum (over a subset of the unit interval [0, 1]) of the absolute value of the standard Brownian motion over the unit interval [0, 1], in the limit n → ∞. As discussed by [5] (with x = 0 and D = 1 in Formula 46 from the associated arXiv publication 6; or see immediately before Remark I on the relevant StackExchange thread 7), the expected value of the maximum of the absolute value of the standard Brownian motion over [0, 1] is √(π/2) ≈ 1.25. The discussion by [5] immediately following Formula 44 from the associated arXiv publication 8 shows that the probability distribution of the maximum of the absolute value of the standard Brownian motion over [0, 1] is sub-Gaussian, decaying rapidly past its mean √(π/2) ≈ 1.25. Values of G/σ much greater than 1.25 imply serious miscalibration, while values of G/σ close to 0 imply that G did not detect any statistically significant miscalibration. Needless to say, similar remarks pertain to D.

Significance of stochastic fluctuations for assessing calibration
This subsubsection discusses statistical significance both for the graphical methods of section "Graphical method for assessing calibration" and for the summary statistics of section "Scalar summary statistics for assessing calibration".
The plot of F_k − F̃_k as a function of k/n automatically includes some "error bars" courtesy of the discrepancy F_k − F̃_k fluctuating randomly as the index k increments. Of course, the standard deviation of a Bernoulli variate whose expected value is S_j is √(S_j (1 − S_j)), which is smaller both for S_j near 0 and for S_j near 1. To indicate the size of the fluctuations, the plots should include a triangle centered at the origin whose height above the origin is 2/n times the standard deviation of the sum of independent Bernoulli variates with success probabilities S_1, S_2, ..., S_n; thus, the height of the triangle above the origin (where the triangle itself is centered at the origin) is

2 √( Σ_{j=1}^n S_j (1 − S_j) ) / n.

The expected deviation of F_k − F̃_k from 0 (at any specified value for k) is no greater than this height, under the assumption that the responses R_1, R_2, ..., R_n are draws from independent Bernoulli distributions with the correct success probabilities S_1, S_2, ..., S_n, that is, under the null hypothesis (41). The triangle is similar to the classic confidence bands around an empirical cumulative distribution function given by Kolmogorov and Smirnov, as reviewed by [4].
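The height of the triangle is a one-line computation; the following sketch (assuming numpy, with an illustrative function name) makes the formula concrete:

```python
import numpy as np

def triangle_height(s):
    """Height above the origin of the triangle centered at the origin:
    2/n times the standard deviation of the sum of independent Bernoulli
    variates with success probabilities S_1, ..., S_n."""
    s = np.asarray(s, dtype=float)
    return 2.0 * np.sqrt(np.sum(s * (1.0 - s))) / len(s)

# For n = 100 scores all equal to 0.5, the height is 2 * sqrt(25) / 100 = 0.1.
```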

Results and discussion for assessing calibration
This subsection illustrates via several examples the previous subsection's methods, together with the traditional plots (the so-called "reliability diagrams") discussed in section "Introduction to assessing calibration". 9 To generate the figures in section "Synthetic examples for assessing calibration", we specify values for S_1, S_2, ..., S_n and for P_1, P_2, ..., P_n differing from S_1, S_2, ..., S_n, then independently draw R_1, R_2, ..., R_n from the Bernoulli distributions with parameters P_1, P_2, ..., P_n, respectively. Ideally the plots would show how and where P_1, P_2, ..., P_n differ from S_1, S_2, ..., S_n. Appendix B considers the case in which P_k = S_k for all k = 1, 2, ..., n.
The top rows of the figures with three rows plot F_k − F̃_k from (44) and (45) as a function of k/n, with the rightmost plot displaying its noiseless expected value rather than using the observations R_1, R_2, ..., R_n (Figure 27 omits the rightmost plot since the expected values are unknown in that case). In each of these plots, the upper axis specifies k/n, while the lower axis specifies S_k for the corresponding value of k. The lowermost two rows of the figures with three rows plot the pairs (X_1, Y_1), (X_2, Y_2), ..., (X_ℓ, Y_ℓ) from (42) and (43), with the rightmost plots using an equal number of observations per bin. The dark black lines and points of the left and right plots in the lowermost two rows of Figures 28, 30, and 32 are in fact identical, since S_1, S_2, ..., S_n are equispaced for those examples (so equally wide bins contain equal numbers of observations). The figures with only a single diagram plot pairs (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) from (42) and (43), but this time using their noiseless expected values (S_1, P_1), (S_2, P_2), ..., (S_n, P_n) instead of using the random observations R_1, R_2, ..., R_n.

[Fig. 41 caption: Ground-truth reliability diagram for Fig. 40.]
Perhaps the simplest, most straightforward method to gauge uncertainty in the binned plots is to vary the number of bins and observe how the plotted values vary. All figures displayed employ this method, with the number of bins increased in the second rows of plots beyond the number of bins in the third rows of plots. The figures also include the "error bars" resulting from one of the bootstrap resampling schemes proposed by [14], obtained by drawing n observations independently and uniformly at random with replacement from (S_1, R_1), (S_2, R_2), ..., (S_n, R_n) and then plotting (in light gray) the corresponding reliability diagram, repeating for a total of 20 times (thus displaying 20 gray lines per plot). The chance that all 20 lines are unrepresentative of the expected statistical variations would be roughly 1/20 = 5%, so plotting these 20 lines corresponds to approximately 95% confidence. An alternative is to display the bin frequencies as suggested, for example, by [6]. Other possibilities often involve kernel density estimation, as suggested, for example, by [7] and [8]. All such methods require selecting widths for the bins or for the kernel smoothing; avoiding having to make what is a necessarily somewhat arbitrary choice is possible by varying the widths, as done in the plots of the present paper. Chapter 8 of [8] comprehensively reviews the extant literature.

[Displaced figure caption: Figure 43 displays the ground-truth reliability diagram. As in Fig. 40, the cumulative plot resolves more of the oscillations in the miscalibration, as do somewhat the reliability diagrams with an equal number of observations per bin; that said, the variations in the reliability diagrams could be hard to interpret without access to the exact expectations.]
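The bootstrap resampling scheme just described can be sketched as follows (a sketch assuming numpy, with equispaced bins for simplicity; the function name, the seed, and the default of 20 resamples follow the text but are otherwise illustrative):

```python
import numpy as np

def bootstrap_reliability(s, r, ell, n_resamples=20, seed=0):
    """Resample the pairs (S_j, R_j) with replacement and rebuild the binned
    averages each time, yielding the light-gray curves (20 per plot)."""
    rng = np.random.default_rng(seed)
    s = np.asarray(s, dtype=float)
    r = np.asarray(r, dtype=float)
    n = len(s)
    edges = np.linspace(0.0, 1.0, ell + 1)
    curves = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)       # draw n pairs with replacement
        sb, rb = s[idx], r[idx]
        # assign each resampled score to one of the ell equispaced bins
        k = np.clip(np.digitize(sb, edges[1:-1], right=True), 0, ell - 1)
        xs = [sb[k == b].mean() for b in range(ell) if (k == b).any()]
        ys = [rb[k == b].mean() for b in range(ell) if (k == b).any()]
        curves.append((np.array(xs), np.array(ys)))
    return curves
```

Plotting all returned curves in light gray behind the reliability diagram then conveys roughly 95% confidence, per the heuristic above.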
We may set the widths of the bins such that either (1) the average of S_k for k in each bin is approximately equidistant from the average of S_k for k in each neighboring bin, or (2) the range of k for every bin has the same width. Both options are natural; the first is the canonical choice, whereas the second ensures that the error bars would be similarly sized for every bin. The figures display both possibilities, with the first on the left and the second on the right. Setting the number of bins together with either of these choices fully specifies the bins. As discussed earlier, we vary the number of bins since there is no perfect setting: using fewer bins offers estimates with higher confidence yet limits the resolution for detecting miscalibration and for assessing the dependence of calibration as a function of S_k.

Synthetic examples for assessing calibration
In this subsubsection, the examples are sampled randomly from various statistical models so that the underlying "ground truth" is known explicitly. Figures 28, 29, 30, 31, 32, and 33 all draw from the same underlying distribution that deviates linearly as a function of k from the distribution of S_k, and S_1, S_2, ..., S_n are equispaced; Figures 28 and 29 set n = 10,000, Figures 30 and 31 set n = 1000, and Figures 32 and 33 set n = 100. Overall, the cumulative plots seem more informative (or at least easier to interpret) in Figures 28, 29, 30, 31, 32, and 33, but only mildly. Figures 34, 35, 36, 37, 38, and 39 all draw from the same underlying distribution that is overconfident (lying above the perfectly calibrated ideal), with the overconfidence peaking for S_k around 0.25 (aside from a perfectly calibrated notch right around 0.25).

In all cases with n = 10,000, that is, in Figures 28, 34, and 40, the scalar summary statistics detect extremely statistically significant miscalibration. Moreover, in all cases with n = 1000, that is, in Figures 30, 36, and 42, the scalar summary statistics detect highly statistically significant miscalibration. In all cases with n = 100, that is, in Figures 32, 38, and 44, the scalar summary statistics detect some statistically significant miscalibration, though not with nearly as much confidence as when n = 1000 or (even more starkly) as when n = 10,000. Beware when conducting multiple tests of significance (or be sure to adjust the confidence level accordingly).

[Displaced figure caption: Figure 45 displays the ground-truth reliability diagram. Just like in Figs. 40 and 42, the cumulative plot captures more of the oscillations in the miscalibration, as do to some extent the reliability diagrams with an equal number of observations per bin; even so, the variations in the reliability diagrams are difficult to interpret without access to the exact expectations.]
Figures 46, 47, 48, 49, 50, and 51 illustrate these cautions; these figures are analogous to those presented in Appendix A, but with the observations drawn from the same predicted probabilities of success used to generate the graphs, so that the discrepancy from perfect calibration should be statistically insignificant. More precisely, Figures 46, 47, 48, 49, 50, and 51 all set S_k to be proportional to (k − 0.5)^2 and draw R_1, R_2, ..., R_n from independent Bernoulli distributions with success probabilities S_1, S_2, ..., S_n, respectively; this corresponds to setting P_k = S_k for all k = 1, 2, ..., n, in the numerical experiments of Appendix A. Figures 46, 48, and 50 consider n = 10,000, n = 1000, and n = 100, respectively. Please note that the ranges of the vertical axes for the top rows of plots are tiny. The leftmost topmost plots in Figs. 46, 48, and 50 look like driftless random walks; in fact, they really are driftless random walks. The variations of the graphs are comparable to the heights of the triangles centered at the origins. Comparing the second rows with the third rows shows that the deviations from perfect calibration are consistent with the expected random fluctuations. Indeed, all plots in this appendix depict only small deviations from perfect calibration, as expected (and as desired). None of the scalar summary statistics significantly exceeds its expected value under the null hypothesis of (41).
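The null-hypothesis experiment just described can be reproduced in a few lines. The sketch below (assuming numpy; the normalizing constant making S_k proportional to (k − 0.5)^2 and the function name are illustrative choices) draws R_k from Bernoulli(S_k) and reports the scaled maximum absolute cumulative difference, which should rarely stray far above the √(π/2) ≈ 1.25 benchmark of Remark 7:

```python
import numpy as np

def null_simulation(n, seed=1):
    """Draw R_k ~ Bernoulli(S_k) with S_k proportional to (k - 0.5)^2, so the
    null hypothesis holds and F_k - F~_k is a driftless random walk; return
    the maximum absolute cumulative difference divided by sigma."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, n + 1)
    s = (k - 0.5) ** 2 / (n - 0.5) ** 2        # scores scaled into (0, 1]
    r = (rng.random(n) < s).astype(float)      # Bernoulli draws, P(R_k = 1) = S_k
    diff = np.concatenate(([0.0], np.cumsum(r - s) / n))
    sigma = np.sqrt(np.sum(s * (1 - s))) / n
    return np.max(np.abs(diff)) / sigma
```

Across seeds, the returned values cluster near the mean of the maximum absolute value of a standard Brownian motion, consistent with the driftless walks seen in Figs. 46, 48, and 50.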