A graphical method of cumulative differences between two subpopulations

Comparing the differences in outcomes (that is, in “dependent variables”) between two subpopulations is often most informative when comparing outcomes only for individuals from the subpopulations who are similar according to “independent variables.” The independent variables are generally known as “scores,” as in propensity scores for matching or as in the probabilities predicted by statistical or machine-learned models, for example. If the outcomes are discrete, then some averaging is necessary to reduce the noise arising from the outcomes varying randomly over those discrete values in the observed data. The traditional method of averaging is to bin the data according to the scores and plot the average outcome in each bin against the average score in the bin. However, such binning can be rather arbitrary and yet greatly impacts the interpretation of displayed deviation between the subpopulations and assessment of its statistical significance. Fortunately, such binning is entirely unnecessary in plots of cumulative differences and in the associated scalar summary metrics that are analogous to the workhorse statistics of comparing probability distributions—those due to Kolmogorov and Smirnov and their refinements due to Kuiper. The present paper develops such cumulative methods for the common case in which no score of any member of the subpopulations being compared is exactly equal to the score of any other member of either subpopulation.

graphical methods allow for in-depth investigation into the variation of the deviations as a function of score. The graphical methods (and hence an intuitive interpretation of the associated scalar summary statistics) rely on the weighting used by [10][11][12], which is different from the weighting used by the otherwise closely related approach of [13] and the others cited by [14]. The scalar summary statistics of [10] are almost the same as those in the present paper, but for the simpler setting in which each score comes with precisely one observation from one subpopulation and one observation from the other subpopulation. The scalar summary statistics of [11,12] are analogues of those from the appendix of [1] in the special case that the parametric regression function they consider is nothing but the identity function on the unit interval [0, 1].
The graphs introduced in the present paper are easy to interpret. For instance, in the topmost plots (a and b) of Fig. 1, the deviation between the two subpopulations over a range of scores is simply the expected slope of the secant line for the graph over that range of scores, as a function of the index k/n (positive slope indicates that the responses for one subpopulation are greater on average than those for the other subpopulation, while negative slope indicates that the responses for the former subpopulation are less than the latter's on average). Long ranges of steep slopes correspond to ranges of scores for which the average responses are significantly different between the two subpopulations; the triangle along the vertical axis on the left of each plot indicates the magnitude of the deviation across the full range of scores that would be statistically significant at around the 95% confidence level. The connection with statistical significance also motivated related works, including that of [15,16], which offer Kolmogorov-Smirnov metrics to help gauge calibration of probabilistic predictions, much like in the appendix of [1]. Similarly, Section 3.2 of [17] and Chapter 8 of [5] propose cumulative reliability diagrams, albeit without leveraging the key to the approach of the present paper, namely that slope is easy to assess visually even when the constant offset of the part of a graph under consideration is arbitrary and uninformative. Detailed explanation of statistical significance and Fig. 1 is available in sections "Methods" and "Results and discussion" below.
Section "Methods" introduces the methodology of cumulative differences, both for graphs of the differences and for the scalar metrics of Kuiper and of Kolmogorov and Smirnov that summarize the graphs' deviation away from being perfectly flat. Section "Results and discussion" presents several illustrative examples, via both simple synthetic and complicated real data sets. Section "Conclusion" concludes the paper with a brief discussion. Table 1 summarizes the notation used throughout the present paper. Readers interested mainly in seeing results and comparisons of the proposed methods to the old standbys may wish to start with section "Results and discussion".

Methods
This section details the methodology proposed in the present paper. Section "Approach to big data" breaks data analysis into two stages: a first, broad-brush stage of screening for potentially significant deviations across many data sets and pairs of subpopulations, and a second, finely detailed investigation of the variations in the deviations as a function of score. Section "Unweighted sampling" develops the graphical method for the second stage, in the simplest case of unweighted sampling. Section "Scalar summary statistics" then collapses the graphs of section "Unweighted sampling" into scalar statistics useful for the first, broad-brush stage. Section "Significance of stochastic fluctuations" explains how to gauge statistical significance. Finally, section "Weighted sampling" treats the case of weighted sampling, generalizing the previous sections to the more complicated case of data with weights.
Fig. 1 (caption, partially recovered): The reliability diagrams with only 10 bins each (c and d) smooth out the jumps at high scores, and while the reliability diagrams with 50 bins each (e and f) give some indication of the jumps, the jumps still get smoothed over, while the bins for lower scores are too narrow to average away noise well. The cumulative graph (a) clearly displays the jumps, while remaining easily interpretable at lower scores. The statistics of Kuiper and of Kolmogorov and Smirnov are both several times greater than σ, so both reflect that the deviation displayed in the graphs is highly statistically significant.

Approach to big data
This subsection proposes a two-step approach to analyzing multiple data sets and subpopulations (the same approach taken by [1] in a related setting):
1. Calculate a single scalar summary statistic for each data set for each pair of subpopulations of interest, such that the size of the statistic measures the deviation between the subpopulations.
2. Analyze in graphic detail each data set and pair of subpopulations whose scalar summary statistic is large, graphing how the deviation between the subpopulations varies as a function of score.
The scalar statistic for the first step simply summarizes the overall deviation across all scores, as either the maximum absolute deviation of the second step's graph or the size of the range of deviations in the graph. Thus, both steps rely on a graph, with the first stage collapsing the graphical display into a single scalar summary statistic. The following subsection details the construction of this graph, for the case of unweighted sampling (later, section "Weighted sampling" treats the weighted case).
Table 1 (notation; partially recovered): a quantity defined in (1) and (2) for both unweighted and weighted sampling; the expected slope of $C_j$ from j = k to j = k + 1; $T_k$, the total weight for $R^0_{k/2}$ or the corresponding $R^1$ average; $W_k$, the aggregated weight (not applicable for unweighted sampling, defined in (14) for weighted sampling).

Unweighted sampling
This subsection presents the special case in which the observations are unweighted (or, equivalently, uniformly or equally weighted). Section "Weighted sampling" treats the more general case of weighted observations, which is more complicated. The present and all following subsections focus on a single data set together with a single pair of subpopulations; the previous subsection outlines a strategy for handling multiple data sets and pairs of subpopulations, based on the processing of individual cases. The data being considered should be observations of independent responses, with each response taking one of finitely many real-valued possibilities, and with each (random) response being paired with a real-valued score viewed as given, not random (the responses across the different scores should be independent). Hence, the scores can take on any real values, whereas the responses should be drawn from discrete distributions. In the present paper, the scores from the observations in both subpopulations put together must be distinct: the score for every observation from either subpopulation must be unique or else slightly perturbed to become different from all the other scores (perturbing as little as possible while accounting for roundoff, for instance).
Under this assumption of uniqueness, a graphical method for analyzing deviation between the outcomes of the two subpopulations as a function of score comprises the following procedure:
1. Merge all scores into a single sequence.
2. Sort the merged sequence into ascending order and let "subpopulation 0" denote the subpopulation associated with the first (the least) score in the sorted sequence.
3. Partition the sorted sequence into blocks such that the scores in every other block all come from subpopulation 0, interleaved with blocks in which all scores come from subpopulation 1; that is to say:
(a) the scores in the first (lowest) block all come from subpopulation 0,
(b) the scores in the second lowest block all come from subpopulation 1,
(c) the scores in the third lowest block all come from subpopulation 0,
(d) the scores in the fourth lowest block all come from subpopulation 1,
(e) and so on, alternating between the two subpopulations, with all scores in each block coming from only one of the subpopulations.
4. Denote by $S^0_k$ the (arithmetic) average of the scores in the (2k + 1)th block and denote by $S^1_k$ the average of the scores in the (2k + 2)th block; denote by $R^0_k$ the average of the responses (the random outcomes) corresponding to the scores in the (2k + 1)th block and denote by $R^1_k$ the average of the responses (the random outcomes) corresponding to the scores in the (2k + 2)th block.
5. Form the sequence of average differences with even-indexed entries
$$D_{2k} = \frac{(R^0_k - R^1_{k-1}) + (R^0_k - R^1_k)}{2} \tag{1}$$
and odd-indexed entries
$$D_{2k+1} = \frac{(R^0_k - R^1_k) + (R^0_{k+1} - R^1_k)}{2}. \tag{2}$$
6. Graph as a function of j/n the sequence of cumulative average differences
$$C_j = \frac{1}{n} \sum_{k=0}^{j-1} D_k \tag{3}$$
for j = 1, 2, ..., n, where n is the length of the sequence $D_0, D_1, \dots, D_{n-1}$ from the previous step. Supplement $C_1, C_2, \dots, C_n$ with
$$C_0 = 0. \tag{4}$$
Figure 2 illustrates Steps 1-4, while Fig. 3 illustrates Step 5.
The increment in the expected cumulative average difference from j = k to j = k + 1 is
$$\mathbb{E}[C_{k+1} - C_k] = \frac{\mathbb{E}[D_k]}{n},$$
so that the expected slope of a graph of $C_k$ versus k/n is
$$\frac{\mathbb{E}[C_{k+1} - C_k]}{1/n} = \mathbb{E}[D_k],$$
which is simply the expected value of the difference between the two subpopulations. Thus, the slope of a secant line over a long range of k/n for the graph of $C_k$ versus k/n becomes the average difference in responses between the subpopulations.
Fig. 2 (caption, partially recovered): The averages of the responses for subpopulation 0 corresponding to the indicated blocks of observed scores are $R^0_0, R^0_1, \dots, R^0_9$, while the averages of the responses for subpopulation 1 are $R^1_0, R^1_1, \dots, R^1_9$. The scores need not range from 0 to 1 as in the present figure, but that is a common case.
Fig. 3 (caption, partially recovered): In each of these subfigures, the operation indicated by "+" sums its two inputs and the operations indicated by "−" subtract their inputs, with one of these "−" operations subtracting its rightmost input from its leftmost input, while the other subtracts its leftmost input from its rightmost input. In all cases, the operations indicated by "−" subtract subpopulation 1 from subpopulation 0, in that order. The operation indicated by "÷2" divides its input by 2. These subfigures depict visually Formulae (1) and (2), respectively.
Figure 1 presents a synthetic example from section "Synthetic" below for which the ground-truth is known explicitly. In accord with (5), the topmost plots (a and b) of Fig. 1 display deviation between the two subpopulations over a range of scores as the expected slope of the secant line for the graph over that range of scores, as a function of the index k/n given along the horizontal axis. As mentioned in the introduction, long ranges of steep slopes correspond to ranges of scores for which the average responses are significantly different between the two subpopulations, with the triangle along the vertical axis on the left of each plot indicating the magnitude of the deviation across the full range of scores that would be statistically significant at around the 95% confidence level.
Section "Significance of stochastic fluctuations" below provides details on statistical significance and the computation of the triangle's height.
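The six-step procedure of this subsection can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's implementation: the function name `cumulative_differences` is hypothetical, the differences follow the centered form described in Remark 1, and the treatment of the two end blocks (which lack a neighbor on one side) is an assumption, namely that they are simply skipped.

```python
from itertools import accumulate, groupby


def cumulative_differences(scores_a, responses_a, scores_b, responses_b):
    """Return (D, C) for two subpopulations whose scores are all distinct.

    Whichever subpopulation owns the least score plays the role of
    "subpopulation 0", so the sign of D is relative to that subpopulation.
    """
    merged = [(s, r, "a") for s, r in zip(scores_a, responses_a)]
    merged += [(s, r, "b") for s, r in zip(scores_b, responses_b)]
    if len({s for s, _, _ in merged}) != len(merged):
        raise ValueError("scores must be distinct; perturb them slightly first")
    # Steps 1-2: merge and sort by score.
    merged.sort(key=lambda item: item[0])
    # Steps 3-4: split into maximal runs from a single subpopulation and
    # average the responses in each run; even positions hold subpopulation 0.
    blocks = []
    for _, run in groupby(merged, key=lambda item: item[2]):
        values = [r for _, r, _ in run]
        blocks.append(sum(values) / len(values))
    # Step 5: average of the backward and forward differences at each block.
    # ASSUMPTION: only interior blocks (with both neighbors) contribute.
    diffs = []
    for i in range(1, len(blocks) - 1):
        centered = (2 * blocks[i] - blocks[i - 1] - blocks[i + 1]) / 2
        # Negate at odd positions so every difference is "0 minus 1".
        diffs.append(centered if i % 2 == 0 else -centered)
    # Step 6: cumulative averages, with C_0 = 0 prepended.
    n = len(diffs)
    cumulative = [0.0] + [total / n for total in accumulate(diffs)]
    return diffs, cumulative
```

For instance, if every response in the subpopulation with the least score is 1 and every response in the other is 0, each entry of `diffs` equals 1 and `cumulative` rises linearly from 0 to 1, so the slope of the graph is the difference in average responses.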

Remark 1
The blocked sequence of responses is
$$R^0_0, R^1_0, R^0_1, R^1_1, R^0_2, R^1_2, \dots$$
The backward differences are
$$R^0_k - R^1_{k-1} \tag{7}$$
and
$$R^1_k - R^0_k, \tag{8}$$
while the forward differences are
$$R^0_k - R^1_k \tag{9}$$
and
$$R^1_k - R^0_{k+1}, \tag{10}$$
so that $D_{2k}$ from (1) is the average of (7) and (9) while $D_{2k+1}$ from (2) is the negative of the average of (8) and (10). The reason for $D_{2k+1}$ to be the negative is to align with $D_{2k}$ when summing them in (3): the differences need to be in the same direction for the sum to make sense, and the negative synchronizes the directions of the differences (which would otherwise be alternating or staggered in the sequence); with the negative, the differences always compare subpopulation 0 to subpopulation 1, in that order.

Remark 2
In the absence of any reason to prefer backward differences to forward differences (or vice versa), we opt to average the two possibilities together. In the absence of any reason to prefer entries in the sequence with even indices ( D 0 , D 2 , D 4 , ...) to entries with odd indices ( D 1 , D 3 , D 5 , ...), we include both.

Scalar summary statistics
This subsection constructs standardized statistics which summarize in single scalars the plots of the previous subsection. Two standard metrics for the overall deviation between the two subpopulations over the full range of scores, both of which take into account expected random fluctuations, are that due to Kolmogorov and Smirnov, the maximum absolute deviation
$$G = \max_{0 \le k \le n} |C_k|, \tag{11}$$
and that due to Kuiper, the size of the range of the deviations
$$H = \max_{0 \le k \le n} C_k - \min_{0 \le k \le n} C_k, \tag{12}$$
where $C_0$ is defined in (4) and $C_1, C_2, \dots, C_n$ are defined in (3). Under appropriate statistical models, G and H can form the basis for tests of statistical significance, the context in which they originally appeared; see, for example, Section 14.3.4 of [18]. To assess statistical significance (rather than absolute effect size), G and H should be rescaled larger by a factor proportional to $\sqrt{n}$; further discussion of the rescaling is available in the next subsection. Needless to say, if the graph constructed in the previous subsection is fairly flat for all scores (which indicates a lack of deviation between the subpopulations for all scores), then both the maximum absolute deviation of the graph and the size of the range of deviations (G and H, respectively) will be close to 0. The captions of the figures report the values of these scalar statistics for numerical examples.
... (12), as well as why H is often slightly preferable to G.
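Both scalar statistics are one-line reductions of the cumulative sequence. The following minimal sketch assumes the input list already starts with $C_0 = 0$ followed by $C_1, \dots, C_n$; the function names are hypothetical.

```python
def kolmogorov_smirnov(cumulative):
    """Maximum absolute deviation of C_0, C_1, ..., C_n from zero (G)."""
    return max(abs(c) for c in cumulative)


def kuiper(cumulative):
    """Size of the range of C_0, C_1, ..., C_n (H)."""
    return max(cumulative) - min(cumulative)
```

Because $C_0 = 0$ belongs to the sequence, H is never smaller than G and never larger than twice G.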

Significance of stochastic fluctuations
This subsection discusses statistical significance both for the graphical methods of section "Unweighted sampling" and for the summary statistics of section "Scalar summary statistics".
The graph of $C_k$ as a function of k/n generally displays some "confidence bands" due to $C_k$ fluctuating randomly as the index k increments; the "thickness" of the plot arising from the random fluctuations gives some sense of "error bars." To indicate the rough size of the fluctuations of the maximum deviation expected under the hypothesis that the actual underlying response distributions of the two subpopulations are the same, the plots should include a triangle centered at the origin whose height above the origin is proportional to $1/\sqrt{n}$. The triangle is similar to the conventional confidence bands around an empirical cumulative distribution function introduced by Kolmogorov and Smirnov, as reviewed by [19]: a driftless, purely random walk deviates from zero by roughly $\sqrt{n}$ after n steps, so a random walk scaled by 1/n deviates from zero by roughly $1/\sqrt{n}$. Identification of deviation between the two subpopulations is reliable when focusing on long ranges of steep slopes (as a function of k/n) for $C_k$; the triangle gives a sense of the length scale for the largest stochastic variations that are likely to happen even when there is no underlying deviation between the subpopulations. The remainder of the present subsection derives this conservative upper bound on the length scale in cases for which the value of every observed response is either 0 or 1.
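The random-walk scaling invoked above is easy to check numerically. The sketch below is only an illustration under simplifying assumptions (independent steps of +1 or −1, a fixed seed, a hypothetical function name); it is not the construction of $C_k$ itself.

```python
import math
import random


def scaled_walk_max(n, seed=0):
    """Largest absolute excursion of a driftless +/-1 walk, scaled by 1/n."""
    rng = random.Random(seed)
    position, largest = 0, 0
    for _ in range(n):
        position += rng.choice((-1, 1))
        largest = max(largest, abs(position))
    return largest / n


n = 10_000
# If the scaled walk deviates by roughly 1/sqrt(n), this ratio is of order 1.
ratio = scaled_walk_max(n) * math.sqrt(n)
```

Repeating the experiment for several values of n and several seeds should leave `ratio` of order one, which is the sense in which the triangle's height is proportional to $1/\sqrt{n}$.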
The long-range deviations of C 0 , C 1 , C 2 , ..., C n from zero can be biased even when the two subpopulations are drawn from the same underlying distribution as a function of score; however, the use of centered, second-order differences in (1) and (2) makes this a second-order effect. In the sequel, we make two assumptions about bias: {1} the bias arising from averaging together multiple responses at slightly different scores into a single R 0 k or R 1 k is offset by the reduction in variance due to the averaging, and {2} the bias arising from taking differences of responses from the different subpopulations at slightly different scores is negligible in comparison with the square root of the accumulated variance. The first assumption can be especially reasonable when the scores considered for a single R 0 k or R 1 k are in reality drawn at random from some probability distribution, such that the variance in the probabilities of success for the associated Bernoulli responses is comparable to the variance of a Bernoulli variate with a given probability of success. In such cases, the first assumption permits us to regard each R 0 k or R 1 k as contributing no more to the long-range deviation than a single Bernoulli variate would. The second assumption means that we will neglect the second-order effect of accumulated bias, which is often reasonable due to the use of second-order differences in (1) and (2).
In cases for which the value of every observed response is either 0 or 1, the tip-to-tip height of the triangle centered at the origin should be 8/n times the standard deviation of the sum of n independent Bernoulli variates. This is simply 8/n times the square root of the sum of the variances of n Bernoulli variates, which could be at most $(8/n)\sqrt{n/4} = 4\sigma$, where
$$\sigma = \frac{1}{\sqrt{n}},$$
since the variance of a Bernoulli variate is $p(1 - p) \le 1/4$, where p is the unknown probability of success. Note that the factor 8 incorporates a factor of 2 for the triangle extending both above and below the origin, a factor of 2 to extend for 2 standard deviations rather than just 1 (setting the confidence level at approximately 95%), a factor of $\sqrt{2}$ due to the dependency between the even- and odd-indexed entries in the sequence of second-order differences from (1) and (2), and a factor of $\sqrt{2}$ to account for having 2 independently drawn subpopulations. Needless to say, the upper bound of 4σ is often somewhat loose in practice, as the two assumptions discussed in the previous paragraph yield rather conservative guarantees. Tighter bounds may exist in settings for which the scores are drawn from a specified probability distribution (unlike in the setting of the present paper).
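The arithmetic behind the triangle can be spelled out as a short check. The sketch assumes the bound is read as (8/n) times the square root of n/4, which gives σ = 1/√n; the function names are hypothetical.

```python
import math


def sigma_unweighted(n):
    """Scale of the fluctuations, assuming sigma = 1 / sqrt(n)."""
    return 1 / math.sqrt(n)


def triangle_height(n):
    """Conservative tip-to-tip height of the triangle at the origin."""
    # 2 (above and below) * 2 (two standard deviations)
    # * sqrt(2) (dependent even and odd entries) * sqrt(2) (two subpopulations)
    factor = 2 * 2 * math.sqrt(2) * math.sqrt(2)
    # Each Bernoulli variance is at most 1/4, so the sum is at most n/4.
    return (factor / n) * math.sqrt(n / 4)
```

For n = 100, `triangle_height(100)` is approximately 0.4, which is four times `sigma_unweighted(100)`.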

Weighted sampling
This subsection presents the general case in which the observations come with weights, where each weight is a positive real number associated with the corresponding observation. Section "Unweighted sampling" treats the special case of unweighted (or, equivalently, uniformly or equally weighted) observations, which is simpler.
The weighted case uses the same procedure as in section "Unweighted sampling", but with $S^0_k$, $S^1_k$, $R^0_k$, and $R^1_k$ being weighted averages rather than unweighted averages (the weighted average for each $S^0_k$, $S^1_k$, $R^0_k$, and $R^1_k$ should be normalized separately). Then, we define $T_{2k}$ to be the average of the weights associated with the scores whose weighted average is $S^0_k$, and define $T_{2k+1}$ to be the average of the weights associated with the scores whose weighted average is $S^1_k$. Setting $W_k$ to be the sum of the weights associated with $D_k$ defined in (1) and (2), that is, the quantity given in (14), Formula (3) generalizes to
$$C_j = \frac{\sum_{k=0}^{j-1} W_k D_k}{\sum_{k=0}^{n-1} W_k} \tag{15}$$
for j = 1, 2, ..., n, while $C_0 = 0$ exactly as before in Formula (4). In the weighted case, the abscissae (that is, the horizontal coordinates) for the graph consist of the normalized aggregated weights
$$A_j = \frac{\sum_{k=0}^{j-1} W_k}{\sum_{k=0}^{n-1} W_k} \tag{16}$$
for j = 1, 2, ..., n, and
$$A_0 = 0.$$
The original, unweighted procedure of section "Unweighted sampling" yields precisely the same results as the weighted procedure of the present subsection in the special case that the weights for the original observations are all the same.
The increment in the expected cumulative weighted average difference from j = k to j = k + 1 is
$$\mathbb{E}[C_{k+1} - C_k] = \frac{W_k \, \mathbb{E}[D_k]}{\sum_{j=0}^{n-1} W_j}, \tag{18}$$
while the increment in the normalized aggregated weights from j = k to j = k + 1 is
$$A_{k+1} - A_k = \frac{W_k}{\sum_{j=0}^{n-1} W_j}, \tag{19}$$
so that the expected slope of a graph of $C_k$ versus $A_k$ is the ratio of (18) to (19), that is,
$$\frac{\mathbb{E}[C_{k+1} - C_k]}{A_{k+1} - A_k} = \mathbb{E}[D_k], \tag{20}$$
which is none other than the expected value of the difference between the two subpopulations. Thus, the slope of a secant line over a long range of k for the graph of $C_k$ versus $A_k$ becomes the average difference in responses between the subpopulations.
The scalar summary statistics in the weighted case are given by the same formulae from section "Scalar summary statistics" as for the unweighted case, just using $C_j$ from (15) in place of $C_j$ from (3). In cases for which the value of every observed response is either 0 or 1, the tip-to-tip height of the triangle centered at the origin analogous to that from section "Significance of stochastic fluctuations" could be set conservatively at 4σ, where
$$\sigma = \frac{\sqrt{\sum_{k=0}^{n-1} (W_k)^2}}{\sum_{k=0}^{n-1} W_k},$$
which is an upper bound on the worst case under the same two assumptions as in section "Significance of stochastic fluctuations".
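Given the differences $D_k$ and the aggregated weights $W_k$, the weighted graph takes only a few lines to assemble. The sketch below takes both sequences as inputs, because the formula for $W_k$ is not reproduced here. It also assumes that the weighted σ is the square root of the sum of the squared $W_k$ divided by the sum of the $W_k$, which reduces to 1/√n for equal weights. The function name is hypothetical.

```python
import math
from itertools import accumulate


def weighted_cumulative(diffs, weights):
    """Return (A, C, sigma) from differences D_k and aggregated weights W_k."""
    total = sum(weights)
    # Weighted cumulative differences, with C_0 = 0 prepended.
    products = (d * w for d, w in zip(diffs, weights))
    cumulative = [0.0] + [c / total for c in accumulate(products)]
    # Normalized aggregated weights serve as abscissae, with A_0 = 0.
    abscissae = [0.0] + [a / total for a in accumulate(weights)]
    # ASSUMPTION: conservative scale sqrt(sum of W_k^2) / (sum of W_k).
    sigma = math.sqrt(sum(w * w for w in weights)) / total
    return abscissae, cumulative, sigma
```

With equal weights the abscissae become k/n and the cumulative values match the unweighted ones, consistent with the statement that the two procedures agree in that special case.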

Remark 4
The classical methods for reliability diagrams discussed in the introduction easily adapt to the case of weighted sampling. Rather than plotting the plain, unweighted average of responses against the unweighted average of scores in each bin, the weighted case involves plotting the weighted average of responses against the weighted average of scores in each bin. Two natural choices of bins in the weighted case are {1} make the widths of the bins all be the same or {2} use the binning of the following remark (Remark 5). As in the unweighted case, the second choice can adapt to each subpopulation under consideration, with each subpopulation having its own binning.
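A weighted reliability diagram with bins of equal width (the first choice above) can be sketched as follows. The function name and the decision to drop empty bins are assumptions made for illustration.

```python
def weighted_reliability_diagram(scores, responses, weights, nbins):
    """Return (weighted mean score, weighted mean response) per nonempty bin."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / nbins
    bins = [[] for _ in range(nbins)]
    for s, r, w in zip(scores, responses, weights):
        # The largest score falls into the last bin rather than a new one.
        index = min(int((s - lo) / width), nbins - 1) if width > 0 else 0
        bins[index].append((s, r, w))
    points = []
    for members in bins:
        if not members:
            continue  # ASSUMPTION: empty bins are omitted from the plot
        total = sum(w for _, _, w in members)
        points.append((
            sum(w * s for s, _, w in members) / total,
            sum(w * r for _, r, w in members) / total,
        ))
    return points
```

Calling the function once per subpopulation gives the two sets of points to overlay in a single plot.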

Remark 5
In the case of weighted sampling, the most useful reliability diagrams are usually those entitled, "reliability diagram ($\|W\|_2 / \|W\|_1$ is similar for every bin)." These diagrams construct bins such that, for every bin, the ratio of the sum of the squares of the bin's weights to the square of the sum of the bin's weights is similar for every bin. Remark 5 of [1] details the specific procedure employed for setting the bins.
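The quantity that this binning tries to equalize is simple to compute for any candidate bin. The helper below is a hypothetical illustration of that ratio only; it does not reproduce the bin-setting procedure of [1].

```python
def weight_concentration(weights):
    """Sum of squared weights divided by the square of the summed weights."""
    return sum(w * w for w in weights) / sum(weights) ** 2
```

For a bin of m equal weights the ratio is 1/m, so a similar ratio across bins plays the same role as a similar number of observations per bin in the unweighted case.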

Results and discussion
This section illustrates via numerous examples the previous section's methods, including comparisons with the canonical plots (the "reliability diagrams") discussed in the introduction. Section "Synthetic" presents several synthetic examples. Section "ImageNet" gives examples from a popular, unweighted data set of images, ImageNet. Section "American Community Survey of the U.S. Census Bureau" considers a weighted data set, the year 2019 American Community Survey of the United States Census Bureau. Finally, section "Cautions" issues a warning about possible overinterpretations of the plots (both for the cumulative graphs and for the classical reliability diagrams) and suggests following [1] by comparing a subpopulation to the full population (when apposite). The figures display the reliability diagrams (that is, the classical calibration plots) as well as both the graphs of cumulative differences and the exact expectations in the absence of the random sampling's noise (the figures include the exact expectations only when they are known, as for the synthetic data). The captions of the figures discuss the numerical results depicted.
The title, "subpopulation deviation is the slope as a function of k/n," labels a plot of $C_k$ from (3) as a function of k/n. In each such plot, the upper axis specifies k/n, while the lower axis specifies the score for the corresponding value of k. The title, "subpopulation deviation is the slope as a function of $A_k$," labels a plot of $C_k$ from (15) versus the cumulative weight $A_k$ from (16). In each such plot, the major ticks on the upper axis specify k/n, while the major ticks on the lower axis specify the score for the corresponding value of k; the points in the plot are the ordered pairs $(A_k, C_k)$ for k = 1, 2, ..., n, with $A_k$ being the abscissa and $C_k$ being the ordinate. (The abscissa is the horizontal coordinate; the ordinate is the vertical coordinate.) In all cases, if the second subpopulation ends up being subpopulation 0 in the notation of section "Methods", then the cumulative graph actually plots $-C_k$ rather than $C_k$ (in the same notation of section "Methods").
The titles, "reliability diagram," "reliability diagram (equal number of subpopulation scores per bin)," and "reliability diagram ($\|W\|_2 / \|W\|_1$ is similar for every bin)," label plots of the pairs from the introduction (in the unweighted case) or from Remark 4 (in the case of weighted sampling), with the pairs from the first subpopulation in black and the pairs from the second subpopulation in gray.
In the traditional, binned plots, we vary the number of bins to see how the plotted values vary. Displaying the bin frequencies is another way to indicate uncertainties, as suggested, for example, by [20]. Still other possibilities for uncertainty quantification could use kernel density estimation, as suggested, for example, by [6,21] and [5]. Such uncertainty estimates involve setting widths for the bins or kernel smoothing; such settings are fairly arbitrary and actually unnecessary when varying the widths as in the plots of the present paper. A comprehensive review of the various possibilities is available in Chapter 8 of [5].
As the introduction discusses, there are two standard choices for the bins when the sampling is unweighted (or uniformly weighted): {1} make the average of the scores in each bin be roughly equidistant from the average of the scores in each neighboring bin or {2} make the number of scores in every bin (except perhaps for the last) be the same. The figures label the first, more conventional possibility with the short title, "reliability diagram," and the second possibility with the longer title, "reliability diagram (equal number of subpopulation scores per bin)." As noted in Remark 4, there are two typical choices for the bins when the sampling is weighted: {1} make the weighted average of the scores in each bin be roughly equidistant from the weighted average of the scores in each neighboring bin or {2} follow Remark 5 above. The figures label the first possibility with the short title, "reliability diagram," and the second possibility with the longer title, "reliability diagram ($\|W\|_2 / \|W\|_1$ is similar for every bin)." Needless to say, reliability diagrams with fewer bins provide estimates that are less noisy, at the cost of restricting the resolution for detecting deviations and for resolving variations as a function of the score.
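The two unweighted binning choices can be compared with a small helper. This is a sketch under stated assumptions: the first choice is approximated by bins of equal width between the least and greatest scores, empty bins are dropped, and the function name is hypothetical.

```python
def reliability_diagram(scores, responses, nbins, equal_count=False):
    """Return (mean score, mean response) for each nonempty bin."""
    pairs = sorted(zip(scores, responses))
    if equal_count:
        # Same number of scores per bin, except perhaps for the last bin.
        size = -(-len(pairs) // nbins)  # ceiling division
        bins = [pairs[i:i + size] for i in range(0, len(pairs), size)]
    else:
        # ASSUMPTION: equal-width bins stand in for "roughly equidistant".
        lo, hi = pairs[0][0], pairs[-1][0]
        width = (hi - lo) / nbins
        bins = [[] for _ in range(nbins)]
        for s, r in pairs:
            index = min(int((s - lo) / width), nbins - 1) if width > 0 else 0
            bins[index].append((s, r))
    return [
        (sum(s for s, _ in b) / len(b), sum(r for _, r in b) / len(b))
        for b in bins
        if b
    ]
```

Rerunning the helper with different values of `nbins` shows the trade-off described above: fewer bins give smoother averages but coarser resolution in score.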

Synthetic
This subsection presents several toy examples that consider instructive "ground-truth" statistical models and generate observations at random from them. The examples set values for the scores and expected values of the responses, and then independently draw the observed responses from the Bernoulli distributions whose probabilities of success are those expected values.
Each top row of Figs. 1, 4, 5, and 6 plots $C_1, C_2, \dots, C_n$ from (3) as a function of k/n, with the rightmost plot displaying its noiseless expected value rather than using the random observations ($R^0_k$ and $R^1_k$). (Technically speaking, the top row of Fig. 5 actually plots $-C_1, -C_2, \dots, -C_n$, since for Fig. 5 the second subpopulation ends up being subpopulation 0 in the notation of section "Methods".) The first three examples illustrate how well the various plots can detect substantial deviations, while the fourth example illustrates how the plots look in the absence of any deviation. For the first example, corresponding to Fig. 1, the scores for the first subpopulation are $0.5(1 + 2^3 (x - 0.5)^3)$ for 10,000 values of x drawn uniformly at random from the unit interval [0, 1], whereas the scores for the second subpopulation are 7000 values drawn uniformly at random from the unit interval [0, 1] (the latter values are also equal to $0.5(1 + 2(x - 0.5))$ for 7000 values of x drawn uniformly at random from the unit interval [0, 1]). The expected values are as indicated in the lowermost plot of Fig. 1, with the expected values for each subpopulation varying smoothly as a function of the score, aside from swapping the values between the two subpopulations for scores in a short range near 0.9. The deviation in the expected values between the subpopulations is substantial for this example.
For the second example, corresponding to Fig. 4, the scores for the first subpopulation are [...]
For the fourth example, corresponding to Fig. 6, the scores are the same as in the first example, and the expected values are equal to the scores. Since the expected values are equal to the scores, the expected values are given by the same function of the score for both subpopulations, and thus there is no deviation between the expected responses for the subpopulations in this example.
The captions of the figures comment on the numerical results displayed.
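The fourth example is fully specified in the text, so it can be regenerated as a sketch. The seed, the function name, and the use of Python's `random` module are assumptions; the draws will not match those behind Fig. 6.

```python
import random


def fourth_example(seed=0, size0=10_000, size1=7_000):
    """Scores as in the first example; expected responses equal the scores."""
    rng = random.Random(seed)
    # First subpopulation: 0.5 (1 + 2^3 (x - 0.5)^3) with x uniform on [0, 1].
    scores0 = [0.5 * (1 + 2**3 * (rng.random() - 0.5) ** 3) for _ in range(size0)]
    # Second subpopulation: scores uniform on [0, 1].
    scores1 = [rng.random() for _ in range(size1)]
    # Bernoulli responses whose probability of success is the score itself.
    responses0 = [1 if rng.random() < s else 0 for s in scores0]
    responses1 = [1 if rng.random() < s else 0 for s in scores1]
    return scores0, responses0, scores1, responses1
```

Since both subpopulations share the same expected response at every score, any slope seen in a cumulative graph of these draws reflects noise only.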

ImageNet
This subsection applies the methods of section "Methods" to the training data set "ImageNet-1000" of [22], which contains a thousand labeled classes. Each class forms a natural subpopulation to consider, with each class considered consisting of 1300 images of a particular noun (such as a "cheetah," a "night snake," or an "Eskimo Dog or Husky"). The total number of members of the data set over all classes is 1,281,167, as some classes in the data set contain fewer than 1300 images, but each subpopulation considered below comes from a class with 1300 images. The images are unweighted (or, equivalently, uniformly or equally weighted), not requiring the methods of section "Weighted sampling" above. We calculate the scores using the pretrained ResNet18 classifier of [23] from the computer-vision module, "torchvision," in the PyTorch software library of [24]; the score for an image is the negative of the natural logarithm of the probability assigned by the classifier to the class predicted to be most likely, with the scores randomly perturbed by about one part in $10^8$ to guarantee their uniqueness. The response (also known as "result" or "outcome") corresponding to a given score takes the value 1 when the class predicted to be most likely is the correct class; the response takes the value 0 otherwise. Figures 7, 8, and 9 present three examples; the captions first list the names of the classes for the subpopulations and then compare the different kinds of plots.
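The mapping from classifier outputs to scores and responses can be sketched without any deep-learning library, taking the predicted class probabilities as plain lists. The function name and the multiplicative form of the perturbation are assumptions; the text states only that the perturbation is about one part in $10^8$.

```python
import math
import random


def scores_and_responses(probabilities, labels, seed=0):
    """Score = -ln(top probability); response = 1 if the top class is correct."""
    rng = random.Random(seed)
    scores, responses = [], []
    for probs, label in zip(probabilities, labels):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        score = -math.log(probs[predicted])
        # ASSUMPTION: relative perturbation of about one part in 10^8.
        scores.append(score * (1 + 1e-8 * rng.uniform(-1, 1)))
        responses.append(1 if predicted == label else 0)
    return scores, responses
```

A confident prediction therefore has a score near 0, and less confident predictions have larger scores.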

American Community Survey of the U.S. Census Bureau
This subsection applies the methods of section "Weighted sampling" to the latest (year 2019) microdata from the American Community Survey of the United States Census Bureau; specifically, we consider each subpopulation to be the observations from a county in California. The sampling in this survey is weighted, and we retain only those members whose weights ("WGTP" in the microdata) are nonzero, omitting any member whose household personal income ("HINCP") is zero or for which the adjustment factor to income ("ADJINC") is missing. The scores are the logarithm to base 10 of the adjusted household personal income (the adjusted income is "HINCP" times "ADJINC," divided by one million when "ADJINC" omits its decimal point in the integer-valued microdata), and we randomly perturb the scores by about one part in $10^8$ to guarantee their uniqueness. The response (also known as "result" or "outcome") for a given score takes the value 1 when the corresponding household has limited English speaking (limited English speaking refers to a household in which every member strictly older than 13 has some difficulty speaking English); the response takes the value 0 when the corresponding household is fully English speaking. Table 2 lists the numbers of scores in the subpopulations prior to any binning. Figures 10, 11, 12, 13, 14, and 15 present several examples; the captions first list the names of the counties corresponding to the subpopulations considered and then compare the reliability diagrams with the cumulative graph.
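The filtering and scoring rules of this subsection translate into a short routine over records represented as dictionaries. The function name and record layout are hypothetical. The sketch also skips negative incomes so that the logarithm is defined, which is an assumption beyond the text, and it leaves out the random perturbation.

```python
import math


def acs_scores(records):
    """Return (scores, weights) for households that pass the filters."""
    scores, weights = [], []
    for record in records:
        weight = record.get("WGTP")
        income = record.get("HINCP")
        adjustment = record.get("ADJINC")
        # Keep nonzero weights only; drop zero income or a missing adjustment.
        if not weight or income is None or adjustment is None:
            continue
        # ASSUMPTION: nonpositive incomes are skipped so log10 is defined.
        if income <= 0:
            continue
        # ADJINC is stored as an integer with its decimal point omitted.
        adjusted = income * adjustment / 1_000_000
        scores.append(math.log10(adjusted))
        weights.append(weight)
    return scores, weights
```

The returned weights are the per-observation weights to use when forming the weighted averages of section "Weighted sampling".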

Cautions
This subsection warns about some limitations of both the methods of the present paper and the conventional reliability diagrams. The fourth example from section "Synthetic", with its corresponding Fig. 6, emphasizes a cautionary note: avoid hallucinating deviations between the subpopulations on account of statistically insignificant random fluctuations! The indicators such as σ and the triangle at the origin discussed in sections "Scalar summary statistics", "Significance of stochastic fluctuations", and "Weighted sampling" are critical for the proper interpretation of statistical significance. (Note that similar questions of significance also arise for the conventional reliability diagrams, on account of multiple testing: error bars for each bin could report 95% confidence intervals, for instance, but then 1 out of every 20 such bins would be expected to report results exceeding its error bar.) A chief drawback of the approach of the present paper is the limitation highlighted in the abstract, in the introduction, and in an italicized sentence of section "Methods", too: the score for every observation in either subpopulation must not be exactly equal to the score for any other observation from the subpopulations. Of course, one way to enforce the required uniqueness of scores is to perturb them at random slightly. Another drawback is that the observations from one subpopulation get compared to observations from the other subpopulation at slightly different scores; although the bias that this introduces in the cumulative approach is less than in the classical reliability diagrams, the bias is still there and potentially worrisome. An ideal means of circumventing such drawbacks is to compare a subpopulation to the full population as detailed by [1]. The approach of [1] is effectively ideal and should be the method of choice whenever applicable. The approach of the present paper is only relevant when comparing subpopulations directly is necessary.
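One way to enforce the uniqueness of scores mentioned above is to perturb only the duplicated values until all scores differ. This is a sketch of that idea, not the paper's procedure, which perturbs the scores at random; the function name and default parameters are assumptions.

```python
import random


def make_unique(scores, relative=1e-8, seed=0):
    """Perturb repeated scores slightly until every score is distinct."""
    rng = random.Random(seed)
    result = list(scores)
    while len(set(result)) < len(result):
        seen = set()
        for i, value in enumerate(result):
            if value in seen:
                # Scale by the value itself, or by 1 when the value is zero.
                scale = abs(value) if value != 0 else 1.0
                result[i] = value + relative * scale * rng.uniform(-1, 1)
            seen.add(result[i])
    return result
```

Scores that are already distinct pass through unchanged, so the perturbation stays as small as possible.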

Conclusion
The plot of cumulative differences between the two subpopulations is easy to interpret: the slope of a secant line for the graph over a long range becomes the average difference between the two subpopulations, and slope is easy to gauge irrespective of any constant offset of the secant line. The plots for the examples of section "Results and discussion" clearly demonstrate many advantages of the cumulative approach over the classical reliability diagrams, and the scalar summary statistics of Kuiper and of Kolmogorov and Smirnov usually faithfully reflect significant differences between the subpopulations if any occur across the full range of scores in the plots. The graphs of cumulative differences avoid explicitly making a trade-off between statistical confidence and resolution as a function of score, a trade-off that is inherent to the traditional binned diagrams.