Seeing through noise in power laws

Despite widespread claims of power laws across the natural and social sciences, evidence in data is often equivocal. Modern data and statistical methods reject even classic power laws such as Pareto’s law of wealth and the Gutenberg–Richter law for earthquake magnitudes. We show that the maximum-likelihood estimators and Kolmogorov–Smirnov (K-S) statistics in widespread use are unexpectedly sensitive to ubiquitous errors in data such as measurement noise, quantization noise, heaping and censorship of small values. This sensitivity causes spurious rejection of power laws and biases parameter estimates even in arbitrarily large samples, which explains inconsistencies between theory and data. We show that logarithmic binning by powers of λ > 1 attenuates these errors in a manner analogous to noise averaging in normal statistics and that λ thereby tunes a trade-off between accuracy and precision in estimation. Binning also removes potentially misleading within-scale information while preserving information about the shape of a distribution over powers of λ, and we show that some amount of binning can improve sensitivity and specificity of K-S tests without any cost, while more extreme binning tunes a trade-off between sensitivity and specificity. We therefore advocate logarithmic binning as a simple essential step in power-law inference.

1. Introduction. Power laws, where a quantity $Y$ scales as a constant power of another quantity $X$ according to the form $Y \propto X^{\alpha}$, are ubiquitous in nature and arise for a variety of reasons (Reed and Hughes, 2002; Willinger et al., 2004; Newman, 2005; Gabaix, 2016). In biology, power laws appear as allometric relationships in physiology and morphology. Such allometries classically established theoretical limits on the heights of trees (Thompson, 1917) and the weights of dinosaurs (Anderson, Hall-Martin and Russell, 1985). In physics, power laws are a hallmark of self-similar and scale-free systems, where the exponent $\alpha$ specifies how to translate from one scale to another. Power laws have been observed or claimed in systems as wide-ranging as skeletal morphology, terrorism, baby names (Hahn and Bentley, 2003) and urban infrastructure (Bettencourt, 2013), while the standards for scientific or statistical justification have varied widely between fields.
Probability distributions with power-law tails are a special case where the probability of an observation scales as a power of its magnitude, with the consequence that extreme-magnitude events are far more likely than in the normal or exponential distributions. In the case of a power-law distribution over a continuous quantity $x$, both the tail distribution function
\[ \Pr(X \ge x) = \left(\frac{x}{x_m}\right)^{-\alpha} \tag{1} \]
and the probability density function
\[ p(x) = \alpha x_m^{\alpha} x^{-(\alpha+1)} \tag{2} \]
take the form of a power law above some threshold value $x_m > 0$. This continuous case is commonly called the Pareto distribution, so named for Pareto's 1895 observation of power-law scaling in the frequencies of extreme wealth and income (Pareto, 1895). Earthquake magnitudes were also classically observed to follow a power-law distribution, which inspired the ongoing practice of recording earthquake magnitudes on the logarithmic Richter scale (Richter, 1935; Gutenberg and Richter, 1944).
Historically, regression on log-log plots was the means to estimate the exponent α in power laws by fitting the equation (log Y ) = α(log X) + c to the data.These methods have been equally applied to probability distributions as to bivariate relationships such as body mass and femur length.For example Gutenberg and Richter (1944) used ordinary least squares, taking Y to be earthquake frequencies within binned magnitude categories X, in order to fit α using Eq. 2. Different regression models assume different error models for X and Y (Warton et al., 2006).Ordinary least squares assumes perfect knowledge of X and identical normal error on Y , whereas major axis regressions allow errors on both X and Y .None of these error models, however, are exactly matched to samples from a probability distribution (Bauke, 2007).
Samples do not have the same statistical variability as errors on independent measurements, and parameters of probability distributions have different constraints than regression parameters. For example, probability distributions require that $\int_{x_m}^{\infty} p(x)\,dx = 1$, but using regression to infer the slope, $-(\alpha + 1)$, and intercept, $\log(\alpha x_m^{\alpha})$, does not guarantee that this is the case and typically produces a contradiction. Regression methods are also considered potentially problematic as there are few theoretical expectations or guarantees about the accuracy or precision of the resulting estimates (Goldstein, Morris and Yen, 2004; Bauke, 2007). Furthermore, residuals offer little useful information about goodness of fit if the error models are inappropriate to begin with. Regression nonetheless often happens to produce accurate estimates of $\alpha$ when applied to the empirical cumulative distribution function, Eq. 1, taking $X$ to be the magnitudes in the data and $Y$ to be their quantiles. This method remains popular in some fields and continues to be improved (Gabaix and Ibragimov, 2011).
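The regression on the empirical distribution function described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any cited paper: the function names `pareto_sample` and `ols_tail_exponent` are hypothetical, the tail fraction of the i-th smallest point is taken to be (n − i)/n, and samples are drawn by inverting Eq. 1.

```python
import math
import random


def pareto_sample(n, alpha, x_m=1.0, rng=random):
    """Inverse-transform sampling: solve (x / x_m) ** -alpha = u for x."""
    # 1 - rng.random() lies in (0, 1], so the power is always finite.
    return [x_m * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]


def ols_tail_exponent(xs):
    """Minus the least-squares slope of log(tail fraction) against log(x)."""
    xs = sorted(xs)
    n = len(xs)
    log_x = [math.log(x) for x in xs]
    # The i-th smallest point (0-based) has empirical tail fraction (n - i) / n,
    # which is never zero, so its logarithm is always defined.
    log_s = [math.log((n - i) / n) for i in range(n)]
    mean_x = sum(log_x) / n
    mean_s = sum(log_s) / n
    sxy = sum((a - mean_x) * (b - mean_s) for a, b in zip(log_x, log_s))
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    return -sxy / sxx


if __name__ == "__main__":
    sample = pareto_sample(500, alpha=1.5, rng=random.Random(0))
    print(ols_tail_exponent(sample))
```

On data that lie exactly on a power-law tail the slope recovers the exponent exactly; on random samples the estimate scatters around it, with no guarantee on its precision, as the text notes.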
Another branch of well-developed theory treats the general case of estimating parameters and gauging goodness of fit given samples from a hypothesized probability distribution. Parameter estimation using maximum likelihood is guaranteed to be asymptotically normal, unbiased, consistent, and optimally efficient in the limit of large data (Cramér, 1999) and corresponds to maximum a posteriori estimation under a flat prior in Bayesian statistics (Jaynes, 2003). Maximum-likelihood estimators (MLEs) have been derived specifically for power-law exponents (Muniruzzaman, 1957; Virkar et al., 2014) and correspond to the Hill index estimator in extremal value theory (Hill et al., 1975). Likewise, gauging goodness of fit using Kolmogorov-Smirnov (K-S) statistics is broadly applicable, firmly grounded in probability theory, and easily adaptable to a hypothesis testing framework (Massey, 1951). Goodness-of-fit tests based on the K-S statistic have also been developed for power-law data (Goldstein, Morris and Yen, 2004; Clauset, Shalizi and Newman, 2009) and binned power-law data (Virkar et al., 2014).
However, there are clues that MLEs and goodness-of-fit tests based on K-S statistics give unreasonable answers for power laws in practice. In fields such as neuroscience and vascular biology, the goal is to compare empirical estimates to sound theoretical predictions, and here it is sometimes the estimates that have noticeable flaws, leading empiricists to question (Langlois, Cousineau and Thivierge, 2014) or refuse (Newberry, Ennis and Savage, 2015) maximum likelihood methods. For example, scaling exponents $\alpha$ for the diameters of branching organs such as tree branches, blood vessels and bronchia have a long-established theoretical range from 2 to 3 as a direct consequence of fluid mechanics (Murray, 1926; Zamir and Medeiros, 1982; West, Brown and Enquist, 1997), yet MLEs of $\alpha$ routinely fall well outside this range even on ample high-quality, hand-curated data (Yeh et al., 1976; Newberry and Savage, 2019). Goodness-of-fit tests using K-S statistics, on the other hand, reject the power law in the otherwise exemplary empirical cases of wealth and earthquake magnitudes (Clauset, Shalizi and Newman, 2009).
Here we explain how such discrepancies between statistical and scientific conclusions can be attributed to inappropriate assumptions about error.The theoretical justifications for MLEs and K-S statistics assume that the hypothesized distribution of the sample is known exactly, including any errors associated with each data point.This assumption almost never holds in practice: a hypothesized distribution is only an approximation to the distribution of the empirical data, because the real process of generating scientific data incorporates known and unknown error sources including measurement and recording errors.
We show that even an error of ±0.2 in a dataset that spans a range from 1 to 100 is capable of biasing the MLE by more than 10%. Whereas in normal statistics, a random measurement can be combined with an unbiased error without affecting the shape of the sample distribution or biasing estimates of the mean, we show that in power-law distributions, even small normal measurement errors qualitatively change the shape of the sample distribution and bias estimates of the slope $\alpha$. Small errors can then alter conclusions of goodness-of-fit tests in common sample sizes.
Unfortunately, the MLEs and K-S statistics developed for power laws fail to achieve theoretical guarantees once small errors are involved. Thus while linear regression uses an inappropriate error model for samples from a power-law distribution, so too does naïve application of maximum likelihood for samples with routine and otherwise negligible error. In principle, better MLEs could incorporate specifications of the error distribution and its parameters (Gillespie, 2017), but this requires detailed knowledge of the specific dataset and a perfect specification is impossible in practice.
We offer a novel method to tune the robustness of MLEs and K-S statistics to small errors without relying on specific information about the errors. Rather than model error explicitly, we reduce its effect by binning the input data as depicted in Fig. 1. Small errors by definition typically preserve the order of magnitude of each data point, and hence small errors and relatively large bins rarely allow data to move between bins, limiting the possible influence of errors. The power law, meanwhile, specifies frequencies across orders of magnitude, and so binning the data by order of magnitude preserves much of the useful information for inference.
imsart-aoas ver. 2020/08/06 file: manuscript.tex date: January 16, 2023
Binning by orders of magnitude is a case of logarithmic binning, where the bin boundaries are integer powers of a ratio $\lambda > 1$. Logarithmic binning and the power-law distribution are both self-similar, and so power-law samples also follow the power law with the original exponent after logarithmic binning (Newberry and Savage, 2019). The discrete distribution of the binned data has the same shape as the original, but ignores errors that are small relative to the binning ratio $\lambda$. Taking the limit of $\lambda \to 1^{+}$ recovers inference using the continuous Pareto distribution. The customary Pareto inference method is therefore an extreme case which is the most sensitive to error. The discrete power law thus provides a more robust general model for power-law distributed data, recovering the benefits of MLEs and K-S statistics even in the presence of small, unspecified errors.
We validate in simulation that logarithmic binning attenuates biases in estimates as well as restores specified false positive rates in goodness-of-fit tests on power-law data with noise. Furthermore, these benefits can be achieved with a known and relatively small cost in increased statistical error on the estimator (Efron and Hinkley, 1978; Newberry and Savage, 2019). We further find no impact on false negative rates in rejecting non-power-law data for some binning schemes, whereas others negotiate a tradeoff between false positive and false negative rates. These results show that logarithmic binning preserves most of the useful information for parameter estimation and hypothesis tests about power laws, while removing extraneous and misleading effects of noise.
We further show that observable errors in empirical datasets have caused biases in past inferences and incorrect conclusions about whether data originates from a power-law distribution.For example, errors such as rounding to the nearest tenth, on either a linear or logarithmic scale, can cause goodness-of-fit tests to reject data that otherwise fits the power law, whereas distributions with visually noticeable curvature across their entire range can be accepted as a power law unless binning induces the tests to adopt a broader perspective.We therefore conclude with the recommendation of logarithmic binning as a first step of inference in power law distributions.
2. Method. We propose logarithmic binning as a smoothing method to remove errors and reduce bias in parameter estimates and measures of goodness of fit. Given the minimum possible data value $x_m$, the logarithmic binning scheme is fully specified with a continuous parameter $\lambda > 1$ specifying the ratio between adjacent bin boundaries. This bin width, $\lambda$, then controls the amount of smoothing. The limit $\lambda \to 1^{+}$ corresponds to no binning, since every unique data point occupies its own bin and the binned and unbinned data converge, whereas large $\lambda$ such as 2 or 10 bin the data by orders of magnitude and smooth out all information within each order.
Given input data $x_i$, we assign the binned value $\lfloor x_i \rfloor_\lambda = x_m \lambda^k$, where $k$ corresponds to the closest integer power of $\lambda$, rounding down. We denote binning using a floor operator with a subscript $\lambda$ because logarithmic binning is a floor operation in $\log_\lambda$ space: in terms of the usual integer floor, $\lfloor x \rfloor$, logarithmic binning is
\[ \lfloor x \rfloor_\lambda = x_m \lambda^{\lfloor \log_\lambda (x / x_m) \rfloor}. \]
We can derive an expression for the distribution of binned data. The possible values after binning are $x_m \lambda^k$ for $k = 0, 1, 2, \ldots$. By integrating the probability density function of the continuous power law (Eq. 2) over the range of each bin, the probability mass function for binned power-law data is given by
\[ p_d(x_m \lambda^k) = \int_{x_m \lambda^k}^{x_m \lambda^{k+1}} p(x)\,dx = \lambda^{-\alpha k}\left(1 - \lambda^{-\alpha}\right). \tag{3} \]
This discrete distribution is also a power-law distribution since $\ln p_d = -\alpha \ln(x_m \lambda^k) + c$, with exponent $\alpha$ equal to the $\alpha$ in Eq. 1. Logarithmic binning is the only binning scheme that preserves this property (Newberry and Savage, 2019). The continuous power law is scale-invariant, whereas the discrete power law (Eq. 3) has a discrete-scale invariance (Sornette, 1998) with the same scaling exponent.
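The binning operation and the binned probability mass function can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, inputs are assumed positive, and the boundary guard is added here only to protect against floating-point error in the logarithm. The mass in bin k is taken to be the integral of the Pareto density over [x_m λ^k, x_m λ^(k+1)).

```python
import math


def log_bin_index(x, lam, x_m=1.0):
    """Index k such that x_m * lam**k <= x < x_m * lam**(k + 1); requires x > 0."""
    k = math.floor(math.log(x / x_m) / math.log(lam))
    # Guard against rounding error when x sits on (or next to) a bin boundary.
    if x_m * lam ** (k + 1) <= x:
        k += 1
    elif x_m * lam ** k > x:
        k -= 1
    return k


def log_bin(x, lam, x_m=1.0):
    """Binned value: the lower boundary of the bin that contains x."""
    return x_m * lam ** log_bin_index(x, lam, x_m)


def binned_pmf(k, alpha, lam):
    """Mass of bin k: integral of the Pareto density over the bin."""
    return lam ** (-alpha * k) * (1.0 - lam ** (-alpha))


if __name__ == "__main__":
    print(log_bin(12.8, 2.0))                                  # lower edge of [8, 16)
    print(sum(binned_pmf(k, 1.5, 2.0) for k in range(200)))    # masses sum to ~1
```

Successive bin masses differ by the constant factor λ^(−α), which is the sense in which the binned distribution remains a power law with the same exponent.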
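For parameter estimation, we use the maximum likelihood estimator for the discrete distribution given by Newberry and Savage (2019), using the logarithmically binned data $\lfloor x_i \rfloor_\lambda$,
\[ \hat{\alpha}_\lambda = \log_\lambda\left(1 + \left[\frac{1}{n}\sum_{i=1}^{n} \log_\lambda \frac{\lfloor x_i \rfloor_\lambda}{x_m}\right]^{-1}\right). \tag{4} \]
This estimator is notably undefined for $\lambda = 1$. However, in the limit $\lambda \to 1^{+}$, $\hat{\alpha}_\lambda$ converges to the classical MLE for the Pareto distribution due to Muniruzzaman (1957), and so we call this estimator $\hat{\alpha}_1$,
\[ \hat{\alpha}_1 = \left[\frac{1}{n}\sum_{i=1}^{n} \ln \frac{x_i}{x_m}\right]^{-1}. \tag{5} \]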
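Both estimators can be written compactly. The sketch below assumes that the binned estimator is the maximum of the likelihood of the discrete power law, which gives log_λ(1 + 1/k̄) with k̄ the mean bin index, and that the unbinned estimator is the classical Pareto MLE; the function names are hypothetical and data are assumed to lie at or above x_m.

```python
import math


def bin_index(x, lam, x_m=1.0):
    """Index k with x_m * lam**k <= x < x_m * lam**(k + 1)."""
    k = math.floor(math.log(x / x_m) / math.log(lam))
    # Guard against rounding error at bin boundaries.
    if x_m * lam ** (k + 1) <= x:
        k += 1
    elif x_m * lam ** k > x:
        k -= 1
    return k


def alpha_hat_binned(xs, lam, x_m=1.0):
    """Exponent estimate from logarithmically binned data, for lam > 1."""
    k_bar = sum(bin_index(x, lam, x_m) for x in xs) / len(xs)
    if k_bar <= 0:
        raise ValueError("data must occupy at least two bins")
    return math.log1p(1.0 / k_bar) / math.log(lam)


def alpha_hat_pareto(xs, x_m=1.0):
    """Classical unbinned Pareto estimate: the limit of the binned one as lam -> 1."""
    return len(xs) / sum(math.log(x / x_m) for x in xs)


if __name__ == "__main__":
    data = [1.5, 3.0, 5.0, 9.0]
    print(alpha_hat_binned(data, 2.0), alpha_hat_binned(data, 1.0001), alpha_hat_pareto(data))
```

With λ only slightly above 1 the binned estimate is numerically close to the unbinned one, which illustrates the limit stated in the text.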
The variance on the estimator and hence the statistical error increase polynomially with λ as O(λ 2α ).Conversely as λ decreases, small errors are more likely to move data between bins and bias the estimator.Hence λ negotiates a tradeoff between potential bias and statistical error, with some intermediate optimum that depends on the nature and severity of errors in the data.The Pareto MLE α1 in common use occupies the extreme end of this spectrum with the least variance but also the greatest susceptibility to error.
As a goodness of fit measure, we use the Kolmogorov-Smirnov statistic D. The Kolmogorov-Smirnov statistic is the divergence between a sample and a reference distribution computed as the maximum difference between the empirical and null hypothesized cumulative distribution functions.For continuous distributions, the asymptotic distribution of D, the Kolmogorov distribution, is classically known exactly (Kolmogorov, 1933).The quantile of the D computed from a sample provides a measure of goodness of fit to the distribution (Massey, 1951).Extreme quantiles, corresponding to low p-values, indicate discrepancies between the sample and the hypothesized distribution.The Kolmogorov-Smirnov test rejects the null hypothesis, in favor of the alternative hypothesis that the data originate from some other distribution, if the p-value is less than the stipulated false positive error rate.The p-value cutoff is equal to the stipulated false positive rate because if a sample truly originates from the reference distribution, the distribution of its quantile and p-value are uniform.
The Kolmogorov-Smirnov statistic for logarithmically binned data $\lfloor x_i \rfloor_\lambda$ relative to the discrete power law (Eq. 3) is
\[ D = \max_{k \ge 0} \left| \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\left[\lfloor x_i \rfloor_\lambda \le x_m \lambda^k\right] - \left(1 - \lambda^{-\alpha (k+1)}\right) \right|, \tag{7} \]
where the sum over the indicator function $\mathbb{1}$ counts the number of data points in bins up to and including $x_m \lambda^k$, and $1 - \lambda^{-\alpha(k+1)}$ is the cumulative sum of Eq. 3 over the same bins. The resulting distribution of $D$, however, is not necessarily equal to the Kolmogorov distribution when the null hypothesized distribution is discrete or involves parameters estimated from the data (Noether, 1963; Walsh, 1963; Lilliefors, 1967). Therefore we build up an approximate distribution of $D$ under the null hypothesis by bootstrapping following Lilliefors (1967). We draw a sample of size $n$ from the discrete power law given by Eq. 3 with parameters $\lambda$ and $\hat{\alpha}_\lambda$, fit the parameter $\alpha_\lambda$ to this sample, then compute $D$ for this sample using the parameters $\lambda$ and $\alpha_\lambda$, thus simulating the steps for computing $D$ from the data. This procedure can be repeated on many samples from the discrete power law to build up an empirical distribution of $D$ that successively better approximates the true null distribution. The p-value is one minus the quantile of $D$ computed from the data among the samples of $D$ computed through bootstrapping. This goodness-of-fit test follows a popular method (Clauset, Shalizi and Newman, 2009; Gillespie, 2015) that has also been adapted to binned power-law data (Virkar et al., 2014). Our test differs only in that we do not jointly estimate $x_m$, as our analysis fixes $x_m = 1$. In lieu of computing precise p-values by bootstrapping many samples, we just as accurately judge whether $p < 0.05$ using only 19 bootstrapped values, concluding that $p < 0.05$ if the $D$ observed in the data exceeds all 19.

This logarithmic binning method allows the experimenter to specify $\lambda$ based on tolerances of bias and statistical error. The magnitude of bias cannot be estimated directly for unknown sources of error. We provide tolerances (see Validation) for the case of additive and multiplicative normal noise. In principle, bias can be minimized by choosing $\lambda$ to exceed the relative error on the vast majority of data points, or by choosing $\lambda$ as high as possible. Upper limits for $\lambda$ depend on the application. A given tolerance for statistical error imposes an upper limit on $\lambda$ given by Eq. 6. Even with unlimited tolerance for statistical error, data must occupy at least two bins for $\hat{\alpha}_\lambda$ to be defined. The number of bins is equal to $\lfloor \log_\lambda (\max_i x_i / x_m) \rfloor + 1$. Hence the maximum feasible $\lambda$ is equal to the proportional range of the data, $r := \max_i x_i / x_m$. The corresponding $\hat{\alpha}_r$ minimizes bias but also has the highest possible statistical error and is useless for foreseeable practical applications.

Hypothesis testing imposes more restrictive upper limits on $\lambda$. Generally, $\lambda$ should be constrained by the capacity to reject alternative distributions. What values this constraint imposes unfortunately depends on the alternative distribution, which is often unspecified. One extreme upper limit is given by requiring at least three bins. With data in only two bins, $\hat{\alpha}_\lambda$ can typically be chosen to fit the data exactly so that $D < 1/n$ in every bootstrap sample. The probability that a bootstrap run contains fewer than three bins is given by the formula
\[ \Pr\left(\max_i x_i < x_m \lambda^2\right) = \left(1 - \lambda^{-2\alpha}\right)^n, \tag{8} \]
derived from Eq. 1. Setting this expression equal to the tolerable fraction of bootstrap simulations with fewer than three bins gives the theoretical upper limit on $\lambda$. In practice, other considerations such as false negative rates will further restrict the upper limit of $\lambda$ for purposes of hypothesis testing.
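The bootstrap test of this section, including the shortcut of comparing the observed D against 19 bootstrapped values, can be sketched as below. It is a sketch under assumptions, not the authors' code: it operates on non-negative bin indices k_i (data below x_m are assumed to have been excluded), all function names are hypothetical, and a bootstrap sample that falls entirely in the first bin is assigned D = 0 as a labeled convention, since the fit is degenerate there.

```python
import math
import random


def fit_alpha(ks, lam):
    """MLE of the exponent from bin indices; needs at least one index > 0."""
    k_bar = sum(ks) / len(ks)
    return math.log1p(1.0 / k_bar) / math.log(lam)


def ks_statistic(ks, alpha, lam):
    """Largest gap between empirical and model CDFs over bins 0..max(ks)."""
    n = len(ks)
    counts = {}
    for k in ks:
        counts[k] = counts.get(k, 0) + 1
    d, cumulative = 0.0, 0
    for k in range(max(ks) + 1):
        cumulative += counts.get(k, 0)
        model_cdf = 1.0 - lam ** (-alpha * (k + 1))
        d = max(d, abs(cumulative / n - model_cdf))
    return d


def sample_bin_indices(n, alpha, lam, rng):
    """Draw n bin indices from the discrete power law (a geometric law in k)."""
    log_q = -alpha * math.log(lam)  # q = lam ** -alpha = Pr(k >= 1)
    return [math.floor(math.log(1.0 - rng.random()) / log_q) for _ in range(n)]


def rejects_power_law(ks, lam, n_boot=19, rng=random):
    """True if the observed D exceeds every one of n_boot bootstrapped values."""
    alpha = fit_alpha(ks, lam)
    d_obs = ks_statistic(ks, alpha, lam)
    for _ in range(n_boot):
        boot = sample_bin_indices(len(ks), alpha, lam, rng)
        if max(boot) == 0:
            d_boot = 0.0  # assumption: a one-bin sample is treated as fit exactly
        else:
            d_boot = ks_statistic(boot, fit_alpha(boot, lam), lam)
        if d_boot >= d_obs:
            return False
    return True
```

Each bootstrap replicate refits the exponent before computing D, mirroring the steps applied to the data; with 19 replicates, exceeding all of them corresponds to the p < 0.05 decision described above.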

3. Validation.
We validate the performance of binning by synthesizing pure continuous power-law samples and introducing normal additive and multiplicative (proportional) noise. We generate parameter estimates and goodness-of-fit measures for different levels of binning from no binning to the extremes allowed by the data. We compare the known, true parameters to the estimated parameters and compare the sensitivity and specificity of goodness-of-fit tests with and without binning.
Fig. 2. (a) [...] This noise nonetheless biases estimates of the log-log slope $-\alpha$ relative to the true $\alpha = 1.5$. Logarithmic binning with $\lambda = 2$ or $\lambda = 4$ (dotted lines) brings the slope estimates $\hat{\alpha}_\lambda$ closer to the true value. (b) Distributions of the slope estimates over 1000 samples of size $n = 500$ illustrate a tradeoff between accuracy and precision, where the most precise estimation methods are also the most inaccurate. (c) Distributions of p-values for rejecting the power law based on a K-S statistic are uniform when the data comes from a perfect power law, with or without binning (left). However, noise biases p-values (middle, right) so that the K-S statistic that assumes a continuous power law (blue) has $p < 0.05$ more than 50% of the time. Binning with $\lambda = 2$ or $\lambda = 4$ attenuates noise, brings p-value distributions closer to uniformity, and restores stipulated false positive rates.

We synthesize data by drawing $n$ independent samples $x_i$ from a Pareto distribution (Eqs. 1, 2) with parameters $\alpha$ and $x_m$. Without loss of generality, we set $x_m$ to 1, since the general case can always be mapped to $x_m = 1$ by converting the units of $x$. We simulate experimental error using either unbiased additive (normal) or multiplicative (lognormal) noise with variance $\sigma_+^2$ and $\sigma_\times^2$ respectively. The data with additive noise is constructed from a sample $x_i$ as $x_i + \mathrm{Normal}(0, \sigma_+)$ and multiplicative noise as $x_i \times \mathrm{Lognormal}(0, \sigma_\times)$. For $x_m \neq 1$, the corresponding additive standard deviation is $x_m \sigma_+$, since $\sigma_+$ has the same units as $x$, whereas the corresponding multiplicative standard deviation is still $\sigma_\times$ since $\sigma_\times$ is a dimensionless log-ratio. Fig. 2a shows tail distribution plots of samples generated with each kind of noise.
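The synthesis procedure can be sketched with the Python standard library. The helper names are hypothetical, and one choice is an assumption made here rather than something the text specifies: noisy values that fall below x_m (or below zero, for additive noise) are returned unchanged, leaving any filtering to the caller.

```python
import random


def pareto_sample(n, alpha, x_m=1.0, rng=random):
    """Inverse-transform sampling from the Pareto tail function."""
    # 1 - rng.random() lies in (0, 1], so every draw is finite and >= x_m.
    return [x_m * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]


def add_additive_noise(xs, sigma_add, x_m=1.0, rng=random):
    """x + Normal(0, x_m * sigma_add); results may fall below x_m."""
    return [x + rng.gauss(0.0, x_m * sigma_add) for x in xs]


def add_multiplicative_noise(xs, sigma_mult, rng=random):
    """x * Lognormal(0, sigma_mult); sigma_mult is a dimensionless log-ratio."""
    return [x * rng.lognormvariate(0.0, sigma_mult) for x in xs]


if __name__ == "__main__":
    rng = random.Random(0)
    clean = pareto_sample(500, alpha=1.5, rng=rng)
    noisy_add = add_additive_noise(clean, 0.2, rng=rng)
    noisy_mult = add_multiplicative_noise(clean, 0.2, rng=rng)
    print(min(clean), min(noisy_add), min(noisy_mult))
```

Passing an explicit `random.Random` instance keeps each synthetic sample reproducible.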
As a proof of principle, we take 1000 samples of size $n = 500$ and $\alpha = 1.5$ with and without additive and multiplicative noise with $\sigma_+ = \sigma_\times = 0.2$ in order to generate empirical distributions of estimates $\hat{\alpha}$ and bootstrapped p-values of the K-S statistic (Fig. 2). For each sample we estimate $\alpha$ and compute the p-value both without binning (Pareto) and using logarithmic binning with $\lambda = 2$ and $\lambda = 4$. We additionally estimate $\alpha$ as the slope by ordinary least squares regression on the empirical cumulative distribution function (the log of each data point versus the log of its quantile in the sample), for which no binning is necessary.
Fig. 2 shows how much the errors can bias estimates of $\alpha$ and p-values. All estimates and p-values behave as expected in samples without error: the estimates cluster around the true value and distributions of p-values are uniform regardless of $\lambda$. The averages of each MLE $\hat{\alpha}_\lambda$ are within 0.21% of the true value whereas the average regression estimate is within 1.2%. In samples with noise, however, the naïve Pareto MLE $\hat{\alpha}_1$ is biased by around 10%, confidence intervals on $\hat{\alpha}_1$ typically exclude the true value, and p-values approach zero. This observation parallels empirical findings that errors in the data such as measurement error or quantization noise can substantially bias MLEs in practice (Langlois, Cousineau and Thivierge, 2014; Newberry and Savage, 2019).
Coarser binning, with larger values of $\lambda$, brings the MLEs $\hat{\alpha}_\lambda$ closer to the true value on average and brings the distribution of p-values closer to uniform. In all noise treatments, binning with $\lambda = 4$ yields approximately uniform p-values and regression provides approximately unbiased estimates of $\alpha$.
Bias and variability of estimates of α illustrate a clear tradeoff between accuracy and precision.As expected for maximum likelihood estimation, α1 is the most efficient and the distribution of α1 is the most sharply peaked (Fig. 2b).However, α1 is correspondingly the most sensitive to errors in the data and thereby produces the most biased estimates.Binning makes the αλ progressively less sensitive to error for λ = 2 and λ = 4, while providing more precise estimates than linear regression.Estimates by linear regression are the most accurate as well as the most variable.On the whole, linear regression minimizes the squared difference between the estimates and the true value in this example, despite its lack of theoretical guarantees.
Noise also causes the bootstrapped K-S p-values to reject the Pareto distribution nearly all of the time (additive: 97%; multiplicative: 93%), in contrast to a stipulated false positive rate of 5% (Fig. 2c). In one sense, the test is correctly performing its job: the null hypothesis specifies that the sample originates from the Pareto distribution, whereas the distribution of the sample is Pareto convolved with noise. By a strict interpretation, the null hypothesis is correctly rejected. In a more practical sense, however, the data comes from a power law whether or not small errors are involved, and so rejecting the power law is incorrect and constitutes a false positive. All measurements have errors, and so in this sense, which we adopt here, errors in the data obscure the truth by causing extreme bias in the p-values.
Binning the data brings p-values back to an approximately uniform distribution, as shown in Fig. 2c. Binning blinds the test to differences in the shape of the empirical and hypothesized distributions within each scale, including differences due to discretization and measurement errors, while still accounting for the shape of the distribution across the range of the data. With $\lambda = 2$, the false positive rates fall to 23% and 34% on data with additive and multiplicative noise, respectively. With $\lambda = 4$, the rejection rates in these 1000 trials are consistent with the stipulated rejection rate of 5% at 5.4% and 6.1% respectively. More trials would reveal biases on the order of 1%, but these K-S p-values are valid for purposes of roughly controlling the false positive rate. That is, after binning by sufficiently large $\lambda$, data originating from a power law, errors or not, will be rejected as a power law with $p < 0.05$ roughly 5% of the time.
The binning ratio $\lambda$ controls the tradeoff between accuracy and precision in estimates and the robustness of hypothesis tests to noise, as shown in Fig. 3. Increasing $\lambda$ attenuates noise but also removes information from the sample. Biases in $\hat{\alpha}_\lambda$ therefore decrease with $\lambda$ whereas variability of the estimator (Eq. 6) increases. We empirically investigated this trade-off ([...] replicates; $n$ = 1,000,000; $\alpha = 1.5$; $\sigma_+, \sigma_\times = 0.1, 0.2$). For each noise treatment we found a $\lambda_{\mathrm{opt}}$ that minimized the total mean squared error on the estimator, $(\alpha - \hat{\alpha}_\lambda)^2$, incorporating both bias and variability. This $\lambda_{\mathrm{opt}}$ typically divided the data into two bins. Specifically, setting $\log \lambda / \log r = 0.50$ with $r = \max_i x_i / x_m$ equally divides the range of the data in log space (median: $r$ = 14,267, $\lambda_{0.50} = 119$). We observed $\log \lambda_{\mathrm{opt}} / \log r$ ranging from 0.49 to 0.63 across all noise treatments. Doubling the noise leads to only marginally larger $\log \lambda_{\mathrm{opt}} / \log r$, suggesting that equally dividing the range of the data into two bins is a reasonable generic prescription for minimizing bias and variability when detailed information about errors in the data is not available.

Binning with sufficiently large $\lambda$ restored stipulated false positive rates in hypothesis tests (Fig. 3b). On these large samples, the hypothesis test nearly always rejects the power law unless $\lambda$ is also large. The minimum $\lambda$ to achieve the stipulated false positive rates depends on the type and magnitude of noise and ranged from $\lambda = 20$ to $\lambda = 60$. For yet higher $\lambda$, p-values become conservatively biased from too few bins (see Method), causing fewer than the stipulated rate of false positives. In all four treatments, $\lambda$ in the range from 60 to 75 gave rejection rates that were not significantly biased upwards or downwards from the stipulated rate of 0.05 over 200 trials (one-tailed binomial p-values: 0.12 to 0.87). According to Eq. 8, the bias from too few bins at $\lambda = 75$ is less than 10%, or 0.005 in absolute terms, while the bias due to errors in the data is minimized. Thus robust p-values are available for $\lambda = 75$ across all treatments as well as for lower $\lambda$ in treatments with lower error.
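The two numerical claims above can be checked with a short calculation. The sketch assumes that Eq. 8 takes the form (1 − λ^(−2α))^n, which is what Eq. 1 gives for the probability that all n draws fall below x_m λ² (fewer than three bins); the function names are hypothetical.

```python
import math


def prob_fewer_than_three_bins(lam, alpha, n):
    """(1 - lam ** (-2 * alpha)) ** n, evaluated stably for large n."""
    return math.exp(n * math.log1p(-lam ** (-2.0 * alpha)))


def lambda_for_log_fraction(r, fraction=0.5):
    """The lambda satisfying log(lambda) / log(r) = fraction."""
    return r ** fraction


if __name__ == "__main__":
    print(prob_fewer_than_three_bins(75.0, 1.5, 1_000_000))  # below 0.10
    print(lambda_for_log_fraction(14267.0))                  # rounds to 119
```

Under that assumed form, λ = 75 with α = 1.5 and n = 1,000,000 gives a probability of about 0.093, consistent with the stated bound of 10%, and a range of r = 14,267 gives a half-range ratio that rounds to 119.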
Statistical power or sensitivity is the test's ability to correctly reject non-power-law data. Intuitively, sensitivity might decrease with $\lambda$ since binning removes information. We investigate how binning affects the ability to reject non-power-law data by applying the test to small samples ($n = 500$) from the tail of the lognormal distribution truncated below $x_m = 1$. The lognormal distribution, like the power law, is heavy-tailed and so its extremes can be difficult to distinguish from a power law. Both tail distributions can appear straight on a log-log plot, but the lognormal has curvature that depends on its parameters (Fig. 4a). The truncated lognormal distribution has three parameters, the mean and variance $\mu$ and $\sigma$ as well as the tail threshold $x_m$ that samples must exceed. The slope and curvature of the lognormal tail depend on all three parameters. We set $x_m$ to 1, use $\sigma$ to set the curvature, and finally choose $\mu$ so that the log-log slope of the lognormal tail distribution at $x_m$ is equal to a power law with $\alpha = 1.5$. In this scheme, the lognormal tail can approximate the power law arbitrarily well as the lognormal variance $\sigma$ increases.
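One way to choose µ as described is to match the local log-log slope of the lognormal tail function at x_m. The sketch below assumes that µ and σ are the mean and standard deviation of ln x; under that assumption the slope at x is −φ(z)/(σ(1 − Φ(z))) with z = (ln x − µ)/σ, so µ follows from solving φ(z)/(1 − Φ(z)) = ασ for z. The function names and the bisection bracket are choices made for this illustration.

```python
import math


def normal_hazard(z):
    """Standard normal density divided by the upper tail probability."""
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    upper_tail = 0.5 * math.erfc(z / math.sqrt(2.0))
    return pdf / upper_tail


def lognormal_mu_for_slope(alpha, sigma, x_m=1.0):
    """Find mu so the tail's log-log slope at x_m equals -alpha."""
    target = alpha * sigma
    lo, hi = -30.0, 30.0
    if not normal_hazard(lo) < target < normal_hazard(hi):
        raise ValueError("alpha * sigma is outside the bracketed range")
    # The hazard is increasing in z, so bisection converges to the unique root.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if normal_hazard(mid) < target:
            lo = mid
        else:
            hi = mid
    z = 0.5 * (lo + hi)
    return math.log(x_m) - sigma * z


def lognormal_tail_slope(x, mu, sigma):
    """d ln Pr(X >= x) / d ln x for a lognormal with log-scale mu, sigma."""
    z = (math.log(x) - mu) / sigma
    return -normal_hazard(z) / sigma


if __name__ == "__main__":
    mu = lognormal_mu_for_slope(alpha=1.5, sigma=1.0)
    print(mu, lognormal_tail_slope(1.0, mu, 1.0))
```

For σ = 1 and α = 1.5 this places µ slightly below zero, so that the threshold x_m = 1 sits roughly one standard deviation above the log-scale mean.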
For samples from the lognormal tail, we find that the test sensitivity is unaffected by binning, provided λ divides the data into at least four bins (Fig. 4b).The test's rejection rate on lognormal data depends strongly on the curvature of the lognormal distribution and the amount of data.For σ = 1, the test rejects lognormal data roughly 80% of the time regardless of λ for all λ < 2. The median range of these samples is 12.8, and so λ = 2 typically puts the maximum data point in the fourth bin, [8,16).As σ increases, so too does the maximum value in samples, which widens the range of λ that divide the data into four bins.One quick, intuitive rationalization for the significance of four bins is that only three points are required to detect curvature, while the topmost bin is unreliable and only partially occupied by the range of the data.
Choosing $\lambda$ within a certain range tunes a tradeoff between sensitivity and specificity. Our earlier measurements in noisy power-law samples of the same size ($n = 500$, Fig. 2) give the specificity. While the sensitivity is the rate of rejection on lognormal tail samples, the specificity is the rate of acceptance on noisy power-law samples or one minus the false positive rate. The test with $\lambda = 2$ and a p-value cutoff of 0.05 distinguishes lognormal tail samples with $\sigma = 1$ from noisy power-law samples with a sensitivity of 78% and specificities of 77% for additive noise and 66% for multiplicative with $\sigma_+, \sigma_\times = 0.2$. Using $\lambda < 2$ yields reduced specificity with no benefit to sensitivity, whereas $\lambda > 2$ tunes a tradeoff between sensitivity and specificity. Using $\lambda = 3$ for example gives sensitivity of 37% and specificities of 92% and 89% on the same samples. For $\lambda > 4$ sensitivity approaches zero and the test is useless. The interval $2 < \lambda < 4$ then offers reasonable tradeoffs between sensitivity and specificity, with lower $\lambda$ more sensitive and higher $\lambda$ more specific.
The foregoing results demonstrate cases in which binning has the desired effect of reducing the influence of small errors in the data at a modest cost of statistical power. These results hold more generally, while exact magnitudes of bias and rejection rates depend on the errors, the sample size, and α. The test sensitivity and how it depends on λ, on the other hand, depend on the possible alternative distributions, which we cannot enumerate explicitly. We can quantitatively generalize the results by varying σ and sample size n. Biases in α increase with greater error magnitudes and decrease with greater λ, while false positive rates as well as the feasible values of λ depend strongly on sample size (Fig. 5). We choose a range of errors from 0 (no error) to a maximum error high enough that the distribution clearly deviates from a straight line on a log-log plot and no one slope clearly represents the distribution (Fig. 5b). The error rates σ_+ and σ_× each have marginal effects on biases in α (Fig. 5a), whereas the biases do not depend on n because greater sample size merely causes the estimators to converge to their expected values. Precision of the estimates, by contrast, increases with sample size, as expected from Eq. 6. This feature can be problematic because, as sample size increases, estimators are more confident in an incorrect estimate and confidence intervals are more likely to exclude the true value. Fortunately, increasing n also increases the feasible range of λ and the maximum λ subject to a tolerable statistical error. Hence increasing sample size can mitigate bias by concomitantly increasing λ according to Eq. 6.

FIG 6. The noise tolerance for a given α and λ is the level of noise σ_+ or σ_× at which bootstrapped K-S p-values are less than 0.05 roughly 10% of the time. Shown here are noise tolerance contours for samples of size n=500. With this amount of error, p-values are biased enough to double the false positive rate relative to the stipulated rate of 0.05. More error quickly renders hypothesis tests meaningless, while less error, greater λ, or lower n can restore the actual false positive rates closer to the stipulated 5%. The inferred α̂_λ (pink contour lines) are also biased to some degree by this amount of noise, as they differ from the horizontal lines of true α. Bias in α̂_λ increases with σ_{+/×} but decreases with λ, so that as σ_{+/×} increases with λ, the contour lines remain roughly horizontal and bias remains within 10% of the true α.
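The kind of noise experiment discussed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulation code behind the figures: it assumes α is the exponent of the tail distribution P(X > x) = (x/x_m)^(−α), treats σ_+ and σ_× as standard deviations of normal noise on the natural and logarithmic scales, and simply discards values that noise pushes below x_m. The function names are hypothetical.

```python
import math
import random

def sample_pareto(n, alpha, x_m=1.0, rng=random):
    """Draw n Pareto samples by inverse-CDF sampling: x = x_m * u**(-1/alpha)."""
    return [x_m * (1.0 - rng.random()) ** (-1.0 / alpha) for _ in range(n)]

def add_noise(xs, sigma_add=0.0, sigma_mult=0.0, x_m=1.0, rng=random):
    """Apply multiplicative (lognormal) then additive (normal) noise,
    keeping only the values that remain at or above x_m."""
    noisy = (x * math.exp(rng.gauss(0.0, sigma_mult)) + rng.gauss(0.0, sigma_add)
             for x in xs)
    return [x for x in noisy if x >= x_m]

def continuous_mle(xs, x_m=1.0):
    """Unbinned maximum-likelihood estimate of alpha: n / sum(log(x / x_m))."""
    return len(xs) / sum(math.log(x / x_m) for x in xs)

rng = random.Random(1)  # fixed seed so the run is repeatable
clean = sample_pareto(100_000, alpha=1.5, rng=rng)
noisy = add_noise(clean, sigma_add=0.2, rng=rng)
print(continuous_mle(clean), continuous_mle(noisy))
```

Comparing the two printed estimates for increasing n shows the point made in the text: the estimate on clean data converges to the true α, while the estimate on noisy data converges to a different value, so larger samples sharpen the estimate without removing the bias.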
Rejection rates exhibit sigmoid threshold behavior in both σ and n, similar but opposite to the thresholds in λ (Fig. 5c, Fig. 3). For sufficiently small error or sample size, the test cannot detect errors, p-value distributions are close to uniform, and hence rejection rates are equal to the p-value cutoff, such as 0.05. As error or sample size increases, hypothesis tests are more likely to detect the errors and reject the power law, until eventually the errors or sample size are sufficient for the test to nearly always reject. Increasing λ increases the threshold of σ or n at which the test begins to detect errors and reject the power law. With λ = 1 (no binning) there is almost no range of tolerable error. For example, noise with σ_+ = 0.025 roughly doubles the false positive rate. For λ = 4, on the other hand, the noise must reach σ_+ ≥ 0.4 or σ_× ≥ 0.3 before the rejection rate is doubled. While noise can easily be increased to the point that no feasible binning is sufficient to restore the stipulated rejection rates, increasing n allows greater λ by increasing the range of the data. In Fig. 5, we set λ to n^{1/(3α)} (purple), designed to typically bin the data into 3 or 4 bins. Allowing this λ to depend on n preserves approximately unbiased p-values over the range of n from 100 to 100,000.
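The rule λ = n^{1/(3α)} can be made concrete with a short Python sketch. It relies on a heuristic that is an assumption of this example rather than a statement from the text: the largest of n Pareto samples with x_m = 1 is of order n^{1/α}, so the number of logarithmic bins spanned by the data is roughly log_λ(n^{1/α}). The function names are hypothetical.

```python
import math

def binning_ratio(n, alpha, bins=3):
    """lambda = n**(1 / (bins * alpha)); bins=3 gives the n**(1/(3*alpha)) rule."""
    return n ** (1.0 / (bins * alpha))

def typical_bin_count(n, alpha, lam):
    """log_lambda of n**(1/alpha), a rough scale for the largest of n
    Pareto samples when x_m = 1 (heuristic, not an exact expectation)."""
    return math.log(n ** (1.0 / alpha)) / math.log(lam)

for n in (100, 500, 100_000):
    lam = binning_ratio(n, alpha=1.5)
    print(n, round(lam, 2), round(typical_bin_count(n, 1.5, lam), 2))
```

Under this heuristic the rule spans three bins by construction for every n, and random variation in the sample maximum occasionally adds a fourth, which is consistent with the "3 or 4 bins" described above.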
The parameter α also influences the error threshold and magnitudes of bias in the MLEs. We varied α and λ over a grid of combinations and computed error thresholds as the σ_+ or σ_× required to double the rejection rate for a p-value cutoff of 0.05. For each α we varied λ from 1 to n^{1/(3α)}, which at its maximum still divided the data into at least four bins at least 70% of the time. We conducted a stochastic binary search to estimate these values σ̂_+ and σ̂_×, testing various candidate σ-values and concluding the search when the target rejection rate 0.10 was within a binomial 95% confidence interval of ±0.005 for trials in the interval σ ± 0.02σ. This search thus yields values σ̂ which are within 2% of some σ for which the rejection rate is within 5% of 0.10 approximately 95% of the time. The search also yields mean estimates α̂_λ at the inferred noise level.
Fig. 6 thereby gives an error tolerance for hypothesis tests and the maximum bias on estimates within that error tolerance, applicable to samples of n = 500. For example, at the values α = 1.5 and λ = 2, additive noise with σ_+ < 0.08 gives a rejection rate less than 0.1 and α̂_λ slightly less than 1.5, biased by approximately 0.05, corresponding to results presented in Fig. 5. The lines of inferred α̂_λ are approximately horizontal and approximately equal to α because limiting noise to levels that do not substantially affect hypothesis tests also limits noise that would substantially bias α. At the maximum such σ̂_+ or σ̂_× for any given α, the magnitude of bias |α̂_λ − α| is still less than 10% of α when binning by the corresponding λ.
Conversely, for a known amount of noise, Fig. 6 gives a lower bound for λ at a given inferred α. For example, on data with a known ∼10% proportional random noise, σ_× ≈ 0.1 and hence, if inferred α = 1.5, λ should be chosen to exceed roughly 2.2 in order to remain within a 10% tolerance of bias on α and a less-than-doubled false positive rate. Normal additive and multiplicative noise can be taken as a proxy for many kinds of measurement errors that routinely occur in data, and so Fig. 6 provides lower bounds on valid λ for a variety of error types in samples of size n = 500.

4. Results. We analyse three empirical cases and compare results with and without binning. We use data on earthquake magnitudes and wealth, which are historically believed to follow power-law distributions, as well as wildfire size, for which we have no evidence of a power law. The specific datasets are curated online by Clauset, Shalizi and Newman (2009) as demonstration cases of power-law inference. The earthquakes dataset contains 17,450 positive and valid samples, recorded as two-digit Richter magnitudes ranging from 0.5 to 7.8. The Richter scale records the logarithm to the base 10 of the amplitude of waves recorded by seismographs (Richter, 1935). We convert the data back to the natural scale by exponentiating the Richter magnitudes. The result is a dataset with discrete values proportional to integer powers of λ = 10^{0.1} ≈ 1.26. In other words, Richter magnitudes with two digits of precision are inherently a case of logarithmic binning with λ = 10^{0.1}. The wealth dataset is the Forbes list of the world's 399 richest people in 2003, including 261 billionaires. The wildfires dataset includes 203,784 measurements of wildfire area from one decade in the US, 99% of which are between 10^{−1} and 10^{3} acres.
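The statement that two-digit Richter magnitudes amount to logarithmic binning with λ = 10^{0.1} can be checked with a brief Python sketch. The constant and function names are hypothetical, and the example magnitudes are chosen for illustration rather than taken from the dataset.

```python
import math

STEP = 0.1            # magnitude resolution: one decimal place
LAMBDA = 10 ** STEP   # implied binning ratio, about 1.26

def to_natural_scale(magnitude):
    """Exponentiate a base-10 Richter magnitude to recover relative amplitude."""
    return 10.0 ** magnitude

def bin_index(magnitude, m_min=3.5):
    """Integer power of LAMBDA separating a magnitude from the minimum m_min."""
    return round((magnitude - m_min) / STEP)

magnitudes = [3.5, 3.6, 4.0, 7.8]  # illustrative values only
print([bin_index(m) for m in magnitudes])
# Adjacent magnitudes differ by exactly one factor of LAMBDA in amplitude.
print(math.isclose(to_natural_scale(3.6) / to_natural_scale(3.5), LAMBDA))
```

Rounding the index guards against floating-point error in the subtraction; every recorded magnitude then maps to an integer number of steps of size λ above the chosen minimum.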
We estimate the exponent α and conduct goodness-of-fit tests for all three datasets using the MLEs α̂_1 and α̂_λ and K-S statistics, just as with the synthetic data. For each dataset we compare three binning treatments: no binning (λ = 1), fine binning (small λ) and coarse binning (large λ), letting the small and large λ depend on the dataset. We also estimate the exponent with linear regression against the empirical cumulative distribution function, as with the synthetic data in Fig. 2. We stipulate a fixed x_m for each dataset, specifically, magnitude 3.5 for earthquakes, one billion dollars for wealth, and 6324 acres for wildfires.
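For readers who want a concrete picture of a binned estimator, the following Python sketch derives one from first principles. It is an illustration under stated assumptions and is not claimed to reproduce the exact estimator defined earlier in the paper: if the tail is Pareto above x_m, the logarithmic bin index k = floor(log_λ(x/x_m)) is geometric with P(k) = (1 − λ^(−α)) λ^(−αk), and maximizing that likelihood gives α = log_λ(1 + 1/mean(k)). The function name is hypothetical.

```python
import math

def binned_mle(xs, lam, x_m=1.0):
    """Estimate alpha from logarithmic bin indices k = floor(log_lam(x / x_m)).

    Assumes a Pareto tail above x_m, so that k is geometric and the
    maximum-likelihood solution is alpha = log_lam(1 + 1 / mean(k)).
    Values lying exactly on a bin edge may be misassigned by rounding error.
    """
    ks = [math.floor(math.log(x / x_m) / math.log(lam)) for x in xs if x >= x_m]
    k_bar = sum(ks) / len(ks)
    if k_bar == 0:
        raise ValueError("all samples fall in the first bin; alpha is not identifiable")
    return math.log(1.0 + 1.0 / k_bar) / math.log(lam)

# Four illustrative values with bin indices 0, 0, 1, 2 for lam = 2.
print(binned_mle([1.5, 1.5, 3.0, 5.0], lam=2.0))
```

Here the mean index is 0.75, so the estimate equals log_2(7/3). The estimator depends on the data only through the bin counts, which is why errors that move values within a bin leave it unchanged.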
The earthquake dataset incorporates errors known to complicate measurements, including partial censorship of small values, quantization error, and attraction to particular values. Fig. 7a demonstrates that these subtle errors have different effects on MLEs and goodness-of-fit tests at different λ. We avoid the partial censorship that causes visible curvature in Fig. 7a by choosing x_m = 10^{3.5}, or magnitude 3.5, across all treatments. This minimum magnitude without censorship is called the magnitude of completeness or m_c in seismology (Woessner and Wiemer, 2005) and depends on the density of the earthquake sensor network. The earthquake dataset also includes quantization error from the two-digit precision of Richter [...] celebrated reputation for tracking the wealth of billionaires. This dataset includes striking discrepancies in its quantification scheme, with values truncated to the nearest 5 million or 100 million depending on whether net worth exceeded one billion, evident in different levels of jaggedness below and above 1 in the curve of Fig. 7b. The log-log plot also suggests inconsistent representation of values between 4 billion and 10 billion, where many billionaires are represented as having either "4,000,000" or "5,000,000" while others are distinguished between 9.7 and 9.8 billion, particularly if net worth is just shy of a round number, resulting in noticeable "bumps" in the tail distribution. In this dataset, all three MLEs and regression obtain similar α estimates, but goodness-of-fit tests with and without binning draw distinct conclusions. As we observe in the earthquakes data, quantization noise can be sufficient for K-S p-values bootstrapped from the continuous Pareto distribution to reject the power law. Without binning, the MLE α̂_1 produces a slope aligned with a majority of data points in the dataset and consistent with the other estimates, but the K-S statistic rejects the hypothesis that wealth is drawn from a power-law distribution with p = 0.03. Binning by either λ = 2 or λ = 4 attenuates the impact of bumps on the goodness-of-fit test, which then consistently accepts the power law, vindicating Vilfredo Pareto's 1895 assertion.
The wildfire data represents a contrasting case with ample data recorded consistently at ±0.1 acre with up to six digits of precision. A basic visual inspection of the distribution in Fig. 7c reveals smooth and substantial curvature clearly inconsistent with a power law, even allowing for random variation. Methods for fitting x_m, however, have been devised to locate subsets of the distribution which do follow a power law. Clauset, Shalizi and Newman (2009) found power-law behavior above x_m = 6324 using K-S statistics fitting both x_m and α. Retaining only the 520 values above this threshold eliminates 99.7% of the data, indicating that the large majority of the data indeed does not follow a power law. Given x_m = 6324, however, MLEs are consistent regardless of binning whereas the regression estimate differs, possibly indicating curvature persisting beyond 6324. The p-values with and without binning, however, are poles apart. The goodness-of-fit test to the continuous Pareto distribution accepts the null hypothesis with p = 0.26, consistent with the stipulation for choosing x_m initially. Tuning the bin width, λ, however, drives the goodness-of-fit test towards rejecting the power law. For λ = 4, corresponding to four bins with only one data point in the fourth bin, we reject the power law with p = 0.01. As we see in the other empirical cases, K-S statistics for continuous data ascribe undue weight to smoothness of the data, whereas in the present case binning conversely causes the K-S test to ascribe more weight to the overall shape of the distribution. The result is that binning allows the K-S statistics to reveal curvature in the empirical distribution when they otherwise do not.

5. Discussion. On perfect power-law data, estimates and goodness-of-fit tests give consistent results with and without binning, and binning can only reduce statistical power (Virkar et al., 2014). When the data deviates from a perfect power law, however, binning may reduce biases in inferences. Logarithmic binning preserves a power law with the original exponent, and so we argue that the resulting discrete power law (Newberry and Savage, 2019) is a better baseline model for real data than the Pareto distribution. When conclusions depend on binning, as we observe in the empirical data, the data deviate from a perfect power law, possibly for trivial reasons such as small errors. In this case, inferences without binning cannot be trusted, as the deviations bias the inference, and inferences based on binned data are more robust.
We find that logarithmic binning can attenuate the effects of additive and multiplicative noise and linear and logarithmic quantization errors. Real data contains many more sources of error, such as censorship or reporting biases, various kinds of measurement error, or dependence between samples. The argument for logarithmic binning makes no reference to the exact structure of the errors, so long as the errors do not, on average, add data to or remove data from particular bins. This requirement is much less stringent than requiring samples to be within statistical deviation of the perfect continuous power law. For errors such as noise and quantization that have small effects on individual measurements, the requirements can be assured if the binning ratio λ is sufficiently large relative to the scale of the error. How large is sufficient depends on the purpose of the inference and the size of the sample (Figs. 3, 5, and 6).
Loose specification of errors is convenient in practice, because the details of error sources are often unknown. Nonetheless, we expect that binning is effective in attenuating some error types more than others. Binning may be less effective against errors that cause deviation from the pure power law over its full range, such as scale-dependent censorship (Woessner and Wiemer, 2005), contamination, or dependence between samples (Gerlach and Altmann, 2019). These cases need further investigation, but our study nonetheless offers some clues.
Model misspecification near x_m and log-periodic fluctuations are two common causes of deviation from pure power laws. We do not study these cases explicitly, but our results do have bearing. Often, values near x_m do not follow a perfect power law and log-log tail distribution plots show substantial curvature in small values. Small values may be censored, as in earthquake reporting, or may represent a bulk distribution with distinct behavior from the tail. Despite dealing only with small measurement noise, our validation procedure is nonetheless a reasonable proxy for both cases. Because we simulate noise by sampling from a power law above x_m, applying error, and then, of necessity, fitting using only the data above x_m, the procedure acts both to create a censorship process (because data reduced below x_m by noise are removed from the sample) and a bulk distribution distinct from its tail (because applying normal or lognormal error to power-law samples yields a unimodal distribution peaked near x_m). Thus samples with higher error, such as those in Fig. 5, show more of the notorious curvature for small values common in empirical data. In such cases, increasing x_m is known to reduce biases (Danielsson et al., 2001). Our results show that increasing λ also reduces bias from this type of error, offering an alternative or additional measure of conservatism besides increasing x_m.
Log-periodic fluctuations are another deviation from power-law behavior common in empirical data. This phenomenon can result from dependence between data points in the sample, such as large blood vessels branching into smaller vessels (Newberry, Ennis and Savage, 2015) or large social groups composed of proportionally smaller units (Zhou et al., 2005). While Pareto samples display a continuous scale invariance, data with log-periodic fluctuations has a discrete-scale invariance, with a fundamental scaling ratio λ set by the "wavelength" of the fluctuations (Sornette, 1998). If the bin width λ matches the scaling ratio λ, the discrete-scale invariance of the model matches that of the data, conformity to the discrete power law is restored and inferences will be valid (Newberry and Savage, 2019). Thus, we expect data with log-periodic fluctuations may be accurately fit with logarithmic binning as long as care is taken to choose bin widths λ that are integer powers of the fundamental scaling ratio of the data.
Our findings challenge some conventional wisdom about power law inference while addressing some ongoing concerns. Maximum likelihood estimation offers strong mathematical guarantees whereas regression has a manifestly incorrect error model (Goldstein, Morris and Yen, 2004; Bauke, 2007; Clauset, Shalizi and Newman, 2009; Stumpf and Porter, 2012). At the same time, the fragility of maximum likelihood in the case of power laws has been underappreciated, and we find that appropriate regression methods are relatively robust in practice (cf. Fig. 2, Gabaix and Ibragimov (2011)). Logarithmic binning offers MLEs the robustness of regression while maintaining the logical footings of maximum likelihood. The parameter λ offers a smooth interpolation between the ideal precision of the continuous MLE and the accuracy of a vague and broadly-applicable error model.
Considerable discussion in recent decades has questioned the validity and the value of power laws (Stumpf and Porter, 2012), including studies that have rejected putative power laws on statistical grounds (Clauset, Shalizi and Newman, 2009). Often, however, these studies have left zero tolerance for error (Gerlach and Altmann, 2019). Binning offers a way to re-evaluate the quality of empirical power laws with some allowance for trivial experimental or data collection error.
Debate about the validity of power laws has also sometimes conflated methods to detect power laws and methods to measure the exponent, whereas these are two quite different scientific concerns. Tuning λ allows empiricists to use different models within the same model family to answer different questions. For example, we find that four bins preserve the maximum ability to reject lognormal data in hypothesis tests, whereas two bins are sufficient for parameter inference, where a power law is necessarily assumed and minimizing bias is the utmost concern.
Finally, estimating the sample fraction for tail behavior or x_m is an old and ongoing problem for inference (Hall and Welsh, 1985; Drees et al., 2020). For simplicity, our analysis takes x_m = 1 to be given, whereas in practice, a true x_m parameter is typically unknown and may not ever perfectly separate a distribution's bulk behavior from its tail. Model-free approaches to estimating x_m have been developed for Pareto tails (Clauset, Shalizi and Newman, 2009) and binned power-law tails (Virkar et al., 2014). Underestimating x_m, however, introduces biases into estimates of α by contamination from the bulk. Binning offers no solution to accurately choosing x_m, but does attenuate error from accidental contamination from the bulk distribution resulting from underestimation of x_m.
We conclude that logarithmic binning, combined with appropriate maximum likelihood estimators and goodness-of-fit tests, offers a rough but effective control for the effects of common data errors on power-law inference. These errors otherwise make power-law inference unreliable. We call for better methods supporting robust inference in the many scientific contexts in which power laws arise. Given the ubiquity of power laws in nature and errors in data, we hope that the methods we describe here will be widely adopted.

FIG 2. Error in power-law data biases estimates of the exponent α and causes tests based on Kolmogorov-Smirnov (K-S) statistics to reject the power-law distribution. (a) Log-log tail distribution plots of samples from perfect data (left) and data with additive and multiplicative noise are visually almost indistinguishable (middle: Normal(0, σ_+^2 = 0.2), right: Lognormal(0, σ_×^2 = 0.2)). This noise nonetheless biases estimates of the log-log slope −α relative to the true α=1.5. Logarithmic binning with λ=2 or λ=4 (dotted lines) brings the slope estimates α̂_λ closer to the true value. (b) Distributions of the slope estimates over 1000 samples of size n = 500 illustrate a tradeoff between accuracy and precision, where the most precise estimation methods are also the most inaccurate. (c) Distributions of p-values for rejecting the power law based on a K-S statistic are uniform when the data comes from a perfect power law, with or without binning (left). However, noise biases p-values (middle, right) so that the K-S statistic that assumes a continuous power law (blue) has p<0.05 more than 50% of the time. Binning with λ=2 or λ=4 attenuates noise, brings p-value distributions closer to uniformity, and restores stipulated false positive rates.

FIG 3. The binning ratio λ tunes (a) a tradeoff between accuracy and precision of the estimator α̂_λ and (b) robustness of hypothesis tests, on data with additive and multiplicative noise. Here we simulate large datasets (n=1,000,000, median range r=14,267, 200 replicates) with additive and multiplicative noise with variance σ_+ and σ_× = 0.1 and 0.2. (a) Increasing λ brings estimates α̂_λ closer to the true α=1.5 but also increases the variance of the estimator. The optimal binning ratio λ_opt to minimize total mean squared error on the estimate depends on the type and magnitude of noise but roughly divides the data into two bins. (b) Hypothesis tests virtually always reject the power law unless λ is sufficiently large relative to noise. Rejection rates roughly equal the stipulated p-value cutoff 0.05 when λ ≈ 70 in all noise treatments, whereas yet greater values of λ produce conservatively biased p-values. The range of λ that achieves the stipulated false positive rate corresponds to dividing the data into 3-4 bins.

FIG 4. For (a) lognormal samples chosen to resemble the power law, (b) binning entails no decrease in statistical power below a threshold of λ. (a) Samples (n=500) from the tail of the lognormal distribution can approximate power-law samples arbitrarily well by increasing the lognormal variance σ. (b) The statistical power (rejection rate) on these samples is unaffected by binning provided λ < 2, corresponding to dividing the data into at least four bins.

FIG 5. Increasing noise σ_+ and σ_× (a) increases bias in estimates of α as well as (b) the rate of rejecting the power-law distribution in samples of size n = 500, while (c) increasing sample size n increases the rejection rate without affecting bias in α̂_λ in samples with σ_+, σ_× = 0.2. At the upper limit of the plotted range, where σ_+ = 1.0 and σ_× = 0.5, (d) the tail distributions of samples are notably curved and no longer resemble a power law on log-log plots, producing extremely biased estimates of the slope. Binning with λ = 2 or 4 (yellow, red) reduces bias in estimates and restores rejection rates to the stipulated 0.05 over a larger range of tolerable error. FIG