Shades of Dark Uncertainty and Consensus Value for the Newtonian Constant of Gravitation

The Newtonian constant of gravitation, $G$, stands out in the landscape of the most common fundamental constants owing to its surprisingly large relative uncertainty, which is attributable mostly to the dispersion of the values measured for it in different experiments. This study focuses on a set of measurements of $G$ that are mutually inconsistent, in the sense that the dispersion of the measured values is significantly larger than what their reported uncertainties suggest that it should be. Furthermore, there is a loosely defined group of measured values that lie fairly close to a consensus value that may be derived from all the measurement results, and then there are one or more groups with measured values farther away from the consensus value, some higher, others lower. This same general pattern is often observed in many interlaboratory studies and meta-analyses. In the conventional treatments of such data, the mutual inconsistency is addressed by inflating the reported uncertainties, either multiplicatively, or by the addition of random effects, both reflecting the presence of dark uncertainty. The former approach is often used by CODATA and by the Particle Data Group, and the latter is common in medical meta-analysis and in metrology. We propose a new procedure for consensus building that models the results using latent clusters with different shades of dark uncertainty, which assigns a customized amount of dark uncertainty to each measured value, as a mixture of those shades, and does so taking into account both the placement of the measured values relative to the consensus value, and the reported uncertainties. We demonstrate this procedure by deriving a new estimate for $G$, as a consensus value $G = 6.67408 \times 10^{-11} \,\text{m}^{3} \, \text{kg}^{-1} \, \text{s}^{-2}$, with $u(G) = 0.00024 \times 10^{-11} \,\text{m}^{3} \, \text{kg}^{-1} \, \text{s}^{-2}$.


Introduction
• That gravity is found in all bodies universally, and is proportional to the quantity of matter in each • If the matter of two globes gravitating towards each other is, in regions equally distant from their centres, homogeneous on all sides, then the weight of either globe towards the other will be inversely as the square of the distance between their centres — I. Newton

The NIST Reference on Constants, Units, and Uncertainty (https://physics.nist.gov/cuu/Constants/) includes a list of 22 "Frequently used constants", among them the Newtonian constant of gravitation, $G$, which has the largest relative standard uncertainty among these 22, by far, particularly after the values of the Planck constant $h$, elementary electrical charge $e$, Boltzmann constant $k$, and Avogadro constant $N_A$ were fixed in preparation for the redefinition of the International System of Units (SI) [48]. The constant $G$ appears as a proportionality factor in Newton's law of universal gravitation and in the field equations of Einstein's general theory of relativity [44].
The surprisingly large uncertainty associated with $G$ is mostly an expression of the dispersion of the values that have been measured for it, which exceeds by far the reported uncertainties associated with the individual measured values. Rothleitner and Schlamminger [64] suggest that "this gives reason to suspect hidden systematic errors in some of the experiments. An alternative explanation is that although the values are reported correctly, some of the reported uncertainties may be lacking significant contributions. The uncertainty budgets can include only what experimenters know and not what they do not know. This missing uncertainty is sometimes referred to as a dark uncertainty" [73].
Speake [69] summarizes the role that $G$ plays in classical and quantum physics, reviews the methods used to measure $G$ in laboratory experiments, and discusses the outstanding challenges facing such measurements, suggesting that improvements in the measurement of length are key to reducing the uncertainty associated with $G$, but also, somewhat discouragingly, suggesting that, owing to "a multitude of subtle problems", it may be a forlorn hope ever to achieve mutual agreement to within 10 parts per million.
Klein [33] offers Modified Newtonian Dynamics (MOND) [43] as a striking and provocative explanation for why some measured values should lie as far away from the currently accepted consensus value [46] as they do, and shows how they can be "corrected."

The principal aim of this contribution is to present a new approach to derive a consensus value from the set of mutually inconsistent measurement results for $G$ that the Task Group on Fundamental Constants of the Committee on Data for Science and Technology (CODATA, International Council for Science) used to produce its most recent recommendation of a value for $G$ [46], together with the two, more recent measurement results reported by Li et al. [39]. The procedure we propose is equally applicable to similar reductions of other, mutually inconsistent data sets obtained in interlaboratory comparisons and in meta-analyses [3; 14].
In Section 2 we review a few, particularly noteworthy measurements that directly or indirectly relate to $G$, beginning with the measurement of the density of the Earth undertaken by Henry Cavendish. Section 3 is focused on the evaluation of mutual consistency (or, homogeneity) of the measurement results: we review several ways in which mutual consistency has traditionally been gauged, and discuss how multiplicative and additive statistical models may be used to produce consensus values when the measurement results are mutually inconsistent. Section 4 addresses a common complaint about the use of models where dark uncertainty appears as a uniform penalty that applies equally to all measurement results being combined into the consensus value, regardless of whether the corresponding measured values lie close to or far from the consensus value, and motivates an alternative approach.
This novel approach regards the measured values as drawings from a mixture of probability distributions, effectively clustering the measurements into subsets with different levels (shades) of dark uncertainty (Section 5). If $n$ denotes the number of measurements one wishes to blend, then we consider mixtures whose number of components ranges from 1 to $n$, and use Bayesian model selection criteria to identify the best model. Section 6 presents the results obtained by application of the proposed model to the measurement results available for $G$.
The conclusions, presented in Section 7, include the observation that advances in the measurement of $G$ involve not only substantive developments in measurement methods, but also in the statistical modeling that informs productive data reductions and enables realistic uncertainty evaluations.

Historical retrospective
Cavendish [11, Page 520] lists 29 determinations of the relative density (or, specific gravity) $\rho_\oplus$ of the Earth. The first 6, produced in the experiments of August 5-7, 1797, were made using one particular wire to suspend the wooden arm of the apparatus bearing two small leaden balls. Cavendish found that this wire was insufficiently stiff, and he replaced it with a stiffer wire for the 23 determinations between August 12, 1797 and May 30, 1798 [11, Page 485]. This second group of 23 determinations has average 5.480 g/cm³. Cavendish points out that the range of these determinations is 0.75 g/cm³, "so that the extreme results do not differ from the mean by more than 0.38, or 1/14 of the whole, and therefore the density should seem to be determined hereby, to great exactness" [11, Page 521]. Following a brief recapitulation of sources of error discussed amply earlier in the paper, Cavendish [11, Page 522] concludes that "it seems very unlikely that the density of the earth should differ from 5.48 by so much as 1/14 of the whole." Since the standard deviation of those 23 determinations is 0.19 g/cm³, the aforementioned 0.38 g/cm³ ("1/14 of the whole") may be regarded as an expanded uncertainty for approximate 95 % coverage. (This is also in agreement with the conventional, crude estimate of the standard deviation as one fourth of the range, that is, (0.75 g/cm³)/4 ≈ 0.19 g/cm³ in this case [27].) In other words, Cavendish seems effectively to have regarded the 23 determinations made using the second wire as a sample from the distribution of the measurand, and used an assessment of their dispersion as an evaluation of what nowadays we would call standard uncertainty, rather than using anything like the standard deviation of the average of the same 23 determinations, which would have been $\sqrt{23} \approx 4.8$ times smaller than 0.19 g/cm³. (It should be noted that none of the terms probable error [5], mean error [21], standard deviation [54], or standard error [78] were in use at the time.)
The mass $m$ of an object lying on the surface of the ellipsoid defined in the World Geodetic System (WGS 84) [51], at geodetic latitude $\varphi$, satisfies $m g(\varphi) = G m M_\oplus / R^2(\varphi)$, where $G$ is the Newtonian constant of gravitation, $M_\oplus$ denotes the mass of the Earth, $g(\varphi)$ denotes the theoretical acceleration due to gravity (exclusive of the effect of the centrifugal acceleration due to the Earth's rotation), and $R(\varphi)$ denotes the Earth's geocentric radius at latitude $\varphi$. If $R_3 = 6\,371\,000.79$ m denotes the radius of a sphere with the same volume as the WGS 84 ellipsoid [47], then $M_\oplus = (4/3) \pi R_3^3 \rho_\oplus$. Therefore, $G = 3 g(\varphi) R^2(\varphi) / (4 \pi R_3^3 \rho_\oplus)$. Since $u(\rho_\oplus)/\rho_\oplus = 3.5\,\%$, and we take the geometry of WGS 84, and the latitude of Clapham Common, as known quantities, this is also the relative uncertainty associated with $G_C$, the value of $G$ implied by Cavendish's determination of $\rho_\oplus$. More impressive still is the fact that the error in $G_C$, relative to the CODATA 2014 recommended value, $G_{2014} = 6.674\,08 \times 10^{-11}$ m³ kg⁻¹ s⁻² [46], is only 0.34 %. The comparable relative "errors" associated with the contemporary measured values listed in Table 1 range from −0.033 % to 0.023 %, indicating that, in the intervening 220 years, the worst relative "error" in the determination of $G$ has been reduced by no more than 10-fold.
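The arithmetic above can be checked with a short computation. The sketch below is illustrative only: the latitude of Clapham Common (taken as approximately 51.46°) and the use of the WGS 84 Somigliana formula for the theoretical gravity $g(\varphi)$ are our assumptions, so the last digit of the resulting relative error depends on those conventions.

```python
import math

# WGS 84 defining parameters and Somigliana normal-gravity constants
a, b = 6378137.0, 6356752.314245            # ellipsoid semi-axes, m
ge, k, e2 = 9.7803253359, 1.931852652458e-3, 6.69437999014e-3

phi = math.radians(51.46)                   # approx. latitude of Clapham Common (assumption)
s2, c2 = math.sin(phi) ** 2, math.cos(phi) ** 2

# theoretical acceleration due to gravity on the ellipsoid, m/s^2
g = ge * (1 + k * s2) / math.sqrt(1 - e2 * s2)

# squared geocentric radius R^2(phi) of the WGS 84 ellipsoid
R2 = ((a * a) ** 2 * c2 + (b * b) ** 2 * s2) / (a * a * c2 + b * b * s2)

R3 = 6371000.79    # radius of the sphere with the same volume as the ellipsoid, m
rho = 5480.0       # Cavendish's mean density of the Earth, kg/m^3

G_C = 3 * g * R2 / (4 * math.pi * R3 ** 3 * rho)   # G = 3 g R^2 / (4 pi R3^3 rho)
rel_err = G_C / 6.67408e-11 - 1                    # relative to CODATA 2014
```

With these conventions the computation yields $G_C \approx 6.70 \times 10^{-11}$ m³ kg⁻¹ s⁻², a few tenths of a percent above $G_{2014}$, consistent with the 0.34 % quoted above.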
$G$ was of no concern to Cavendish, and neither did Newton introduce it in the Principia [50]. More than 70 years would have to elapse after Cavendish "weighed the Earth" before even a particular symbol would be advanced for the gravitational constant, and the symbol at first was "$f$", not "$G$" [15].
According to Hartmut Petzold (formerly with the Deutsches Museum, Munich, personal communication), the birthday of the expression "gravitational constant" was on one of these three days, December 16-18, 1884: on December 16th, Arthur König and Franz Richarz submitted a handwritten proposal to measure "the mean density of the earth"; two days later Helmholtz presented their proposal to the Royal Prussian Academy of Sciences in Berlin with the modified title "A new method for determining the gravitational constant" [61].
In the evening session of June 8, 1894, of the Royal Institution of Great Britain, Charles Vernon Boys also used the symbol $G$ when he made a presentation on the Newtonian constant of gravitation, and announced 6.6576 × 10⁻¹¹ m³ kg⁻¹ s⁻² as the "adopted result" derived from experiments using gold and lead balls in a torsion balance [7]. The relative difference between this determination and CODATA's $G_{2014}$ is −0.25 %. In this study we focus on the set of measurement results listed in Table 1, including the two recent results of Li et al. [39], obtained using the time-of-swing (TOS) method and the angular-acceleration-feedback (AAF) method for the torsion pendulum, which have the smallest associated uncertainties achieved thus far [67].
Table 1 includes the results that CODATA used to produce the 2014 recommended value for $G$, together with two, more recent determinations. Since some of these results differ from their originally published versions, the following remarks clarify the precise provenance of all the measurement results listed. For the sake of brevity, we use the scale factor $\gamma = 10^{-11}$ m³ kg⁻¹ s⁻².
NIST-82 The result published originally, $G/\gamma = 6.6726 \pm 0.0005$ [41], had not been corrected for an effect caused by the anelasticity of the torsion fiber. The corresponding result listed in Table 1 reflects an anelasticity correction applied by CODATA. It should be noted that the change in the reported uncertainty (down from 0.0005 in the original publication, to the 0.000 43 in Table 1) is not a consequence of this correction but results from a refinement of the uncertainty analysis that the authors did between the time when the result was first published and when it was used for the 1986 adjustment of the fundamental physical constants [13].

LANL-97
In 2010, CODATA corrected the result published originally, $G/\gamma = 6.6740 \pm 0.0007$ [2], to take into account uncertainties in the measurement of the quality factor of the torsion pendulum. The quality factor is needed to calculate the correction caused by the anelastic properties of the fiber.
UWash-00 The measured value listed in the original work [25], $G/\gamma = 6.674\,215 \pm 0.000\,092$, is 6 × 10⁻⁶ smaller, in relative terms, than the value used by CODATA. After the result was published, the authors noticed the omission of a small effect and communicated a corrected value to CODATA. The small effect was caused by a mass that is mounted on the top of the torsion fiber, and is itself suspended by a thicker fiber. In this experiment, the gravitational torque is counteracted by the inertia of the pendulum in an accelerated rotating frame. The acceleration acts also on the pre-hanger, and its effect must be taken into account. No erratum is publicly available.

HUST-05
The measurement result published originally in 1999, $G/\gamma = 6.6699 \pm 0.0007$ [40], differs appreciably from the corresponding result used by CODATA. The measured value is lower than its CODATA counterpart, with a relative difference of 3.5 × 10⁻⁴. However, two needed corrections had not been applied: first, for the gravitational effect of the air that is displaced by the field masses; second, for the density inhomogeneity of the source masses. The result, as updated in 2005, became $G/\gamma = 6.6723 \pm 0.0009$ [28], where the updated measured value is larger than CODATA's, the relative difference being 1.1 × 10⁻⁵. In 2014, CODATA applied a third correction, for the anelasticity of the fiber.

JILA-10
The authors of the original work [52], which listed $G/\gamma = 6.672\,34 \pm 0.000\,14$ as measurement result, realized that two effects had been miscalculated. In 2018, they sent an erratum to CODATA reporting a corrected value of $G/\gamma = 6.672\,60 \pm 0.000\,25$. First, the pendulum bob rotates under excursion from the equilibrium position owing to a differential stretching of the support wire. The rotation is different in the calibration mode from the measurement mode. The second effect also has to do with the rotation of the bob: if the laser beam is not perfectly centered on the mass centers, a rotation can cause an apparent length change (Abbe effect). The Abbe effect was not properly calculated in the initial publication. These two effects have different signs, yielding a final result that differs relatively only by +3.9 × 10⁻⁵ from the value in the original publication. An erratum has been submitted for publication in Physical Review Letters.

BIPM-14
The measurement result reported originally, $G/\gamma = 6.675\,45 \pm 0.000\,18$ [57], was superseded by the result listed in an erratum published in 2014 [58]. This was the value used by CODATA. The relative change in value of −13.5 × 10⁻⁵ was caused by the density inhomogeneity of the source masses. In the original publication, the corresponding correction had inadvertently been applied twice.

UCI-14
In the original publication [49], the authors reported a slightly (by 3 × 10⁻⁶, in relative terms) smaller value, $G/\gamma = 6.674\,33 \pm 0.000\,13$. The reported value is an average of three measurements. The authors used an unweighted average, while CODATA used a weighted average and took into account the correlations between the three results.

Mutual consistency
A set of measurement results, comprising pairs of measured values and associated standard uncertainties, for example $\{(x_j, u(x_j))\}$, for $j = 1, \dots, n$, as in Table 1, is said to be mutually consistent (or, homogeneous) when the variability of the measured values is statistically comparable to the reported uncertainties: for example, when the standard deviation of the $\{x_j\}$ is practically indistinguishable from the "typical" $u(x_j)$ (say, their median).
The standard deviation of the $\{x_j\}$ in Table 1 is 0.001 09 × 10⁻¹¹ m³ kg⁻¹ s⁻², while the median of the $\{u(x_j)\}$ is 0.000 26 × 10⁻¹¹ m³ kg⁻¹ s⁻²: the former is 4.2 times larger than the latter, indicating that the measured values are much more dispersed than their associated uncertainties suggest they should be.
This implies that either the different experiments are measuring different measurands, or there are sources of uncertainty yet unrecognized that are not expressed in the reported uncertainties. If the different experiments indeed are measuring the same measurand, then these uncertainties are much too small, and the lurking, yet unrecognized "extra" component is what Thompson and Ellison [73] felicitously have dubbed dark uncertainty, because it is perceived only once independent results are inter-compared. Dark uncertainty may derive from a single source of uncertainty or from multiple sources.
Cochran's test, which is the conventional chi-squared test of mutual consistency, is very widely used, even though it suffers from important limitations and misunderstandings [26]. For the measurement results in Table 1, the test statistic is $Q = 198$ on 15 degrees of freedom: since the reference distribution is chi-squared with 15 degrees of freedom, $\chi^2_{15}$, the $p$-value of the test is essentially zero, hence the conclusion of heterogeneity.
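For illustration, both diagnostics described above, the crude comparison of dispersions and Cochran's $Q$, can be computed in a few lines. The values below are hypothetical stand-ins (Table 1 is not reproduced here), so only the pattern, not the numbers, matches the text:

```python
from statistics import median, stdev

# hypothetical measurement results (x_j, u(x_j)), in units of 1e-11 m^3 kg^-1 s^-2
x = [6.6726, 6.6740, 6.67425, 6.6723, 6.67260, 6.67554, 6.67435]
u = [0.0005, 0.0007, 0.0001, 0.0009, 0.00025, 0.00016, 0.00013]

# crude consistency check: dispersion of the values vs. "typical" reported uncertainty
ratio = stdev(x) / median(u)

# Cochran's Q: weighted sum of squared deviations about the weighted mean,
# referred to a chi-squared distribution with n - 1 degrees of freedom
w = [1 / uj ** 2 for uj in u]
xbar = sum(wj * xj for wj, xj in zip(w, x)) / sum(w)
Q = sum(wj * (xj - xbar) ** 2 for wj, xj in zip(w, x))
```

A $p$-value then requires the chi-squared survival function (for example, `scipy.stats.chi2.sf(Q, n - 1)`); with data as heterogeneous as these, both `ratio` far above 1 and $Q$ far above $n-1$ point to the same conclusion of inconsistency.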

Multiplicative models
Birge [6] suggested an approach for the combination of mutually inconsistent measurement results that involves: first, inflating the reported standard uncertainties using a multiplicative inflation factor $\kappa$ sufficiently large to make the results mutually consistent; second, combining the measured values into a weighted average whose weights are inversely proportional to the squared (inflated) uncertainties. This choice of value for $\kappa$ makes Cochran's $Q$ statistic equal to its expected value, hence $\kappa$ is a method-of-moments estimate. Birge's approach is used routinely by the Particle Data Group (pdg.lbl.gov) [71], and also by CODATA to produce recommended values for some of the fundamental physical constants, $G$ in particular [46].
The inflation factor may be determined in many other ways. For example, as the smallest multiplier $\kappa$ for the $\{u(x_j)\}$ that yields a value of the chi-squared statistic as large as possible yet shy of the critical value for the test. For the data in Table 1, and for a test whose probability of Type I error is 0.05 (the probability of incorrectly rejecting the hypothesis of homogeneity), the critical value is 24.996, and the corresponding, smallest inflation factor that achieves homogeneity is $\kappa = 2.813$.
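This choice of $\kappa$ follows from the two numbers just quoted: multiplying every $u(x_j)$ by $\kappa$ divides $Q$ by $\kappa^2$, so the smallest factor that brings $Q$ below the critical value is their square root ratio. A minimal sketch, using the rounded $Q = 198$ from the text:

```python
import math

Q = 198.0       # Cochran's statistic for Table 1, as quoted in the text (rounded)
crit = 24.996   # 0.95 quantile of the chi-squared distribution with 15 degrees of freedom

# multiplying each u(x_j) by kappa divides Q by kappa^2, so homogeneity
# (Q / kappa^2 <= crit) is first achieved at
kappa = math.sqrt(Q / crit)
```

The result, about 2.81, matches the 2.813 quoted above up to the rounding of $Q$.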
The statistical model underlying the multiplicative adjustment of the uncertainties regards the $j$th measured value as the true value plus an error commensurate with $u(x_j)$ and magnified by $\kappa$. More precisely, $x_j = \mu + \kappa \varepsilon_j$, where the $\{\varepsilon_j\}$ are modeled as non-observable outcomes of independent Gaussian random variables all with mean 0 but with standard deviations $\{u(x_j)\}$. The consequence is that the effective measurement errors $\{\kappa \varepsilon_j\}$ are then Gaussian random variables with standard deviations $\{\kappa\, u(x_j)\}$.
The two choices of $\kappa$ reviewed above appear reasonable but are ad hoc (yet another ad hoc choice is discussed in Section 3.2). A principled, and generally preferable, alternative is maximum likelihood estimation, whereby the "optimal" consensus value and inflation factor maximize a product of Gaussian densities evaluated at the measured values $\{x_j\}$, all with the same mean $\mu$, and standard deviations $\{\kappa\, u(x_j)\}$. The idea here is to select values for $\mu$ and $\kappa$ that render the data "most likely." The maximum likelihood estimates derived from the data in Table 1 are $\hat{\mu} = 6.674\,29 \times 10^{-11}$ m³ kg⁻¹ s⁻² and $\hat{\kappa} = 3.5$. The evaluations of the associated uncertainties, obtained using the parametric statistical bootstrap [20], are $u(\hat{\mu}) = 0.000\,13 \times 10^{-11}$ m³ kg⁻¹ s⁻² (Table 4, row BRM), and $u(\hat{\kappa}) = 0.6$. A 95 % coverage interval for $\kappa$ ranges from 2.2 to 4.6, thus suggesting that any estimate of the inflation factor is bound to be clouded by very substantial uncertainty. Figure 1 depicts the results.

Rothleitner and Schlamminger [64] sound a note of despair at the conclusion of their review of the history, status, and prospects for improvement of the measurements of $G$: "Given the current situation in the measurement of $G$, it is difficult to see how our knowledge of $G$ can be improved, for example, $\chi^2$ will not decrease by adding new experiments, as it is a sum of squares and can increase only with new data. The Birge ratio can decrease by increasing $\sqrt{n-1}$ in the denominator; however, this will be a slow process. If an additional 13 experiments are performed (which could take another 30 years if past experiments are an indication), $R_B$ can be reduced by a factor 1.4 if the values are close to the current average value. It is equally difficult to see how the multiplicative factor that CODATA used to bring all normalized residuals below two can be decreased. Thus, decreasing the current uncertainty assigned to the recommended value of G does not seem to be possible - at least, not in the foreseeable future."

Although we agree that reducing the uncertainty associated with $G$ is an outstanding challenge in precision measurement, we believe that conventional metrics of mutual inconsistency, be they Cochran's $Q$ or the Birge ratio, are not the most informative means to gauge progress or lack thereof, and that more productive avenues for data reduction are available, as we shall illustrate forthwith. Furthermore, in Section 3.2 we show that the goal of bringing "all normalized residuals below two" is excessively restrictive, hence ought not to be used as a quality criterion whereon to judge the mutual consistency of any collection of measurement results.
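Under the multiplicative model the maximum likelihood estimates have closed forms: $\hat{\mu}$ is the ordinary weighted mean ($\kappa$ cancels from the weights), and $\hat{\kappa}^2 = Q(\hat{\mu})/n$. A minimal sketch, with hypothetical data standing in for Table 1 (note that $\sqrt{198/16} \approx 3.5$, matching the estimate quoted above):

```python
import math

# hypothetical (x_j, u(x_j)) pairs standing in for Table 1
x = [6.6726, 6.6740, 6.67425, 6.6723, 6.67260, 6.67554, 6.67435]
u = [0.0005, 0.0007, 0.0001, 0.0009, 0.00025, 0.00016, 0.00013]
n = len(x)

# under x_j = mu + kappa*eps_j: MLE of mu is the weighted mean (kappa cancels),
# and the MLE of kappa^2 is the mean squared standardized deviation Q/n
w = [1 / uj ** 2 for uj in u]
mu_hat = sum(wj * xj for wj, xj in zip(w, x)) / sum(w)
kappa_hat = math.sqrt(sum((xj - mu_hat) ** 2 / uj ** 2 for xj, uj in zip(x, u)) / n)

def loglik(mu, kappa):
    """Gaussian log-likelihood with standard deviations kappa * u(x_j)."""
    return sum(-math.log(kappa * uj) - 0.5 * ((xj - mu) / (kappa * uj)) ** 2
               for xj, uj in zip(x, u))
```

Perturbing either estimate lowers the log-likelihood, which confirms that the closed forms are indeed the maximizers; the bootstrap uncertainties reported above require re-fitting on simulated data sets.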

Normalized residuals
Mohr and Taylor [45] introduce the notion of "normalized residual" in the context of the nonlinear least squares method that CODATA has been using to derive adjusted values of the fundamental constants from a collection of measurement results. In the present, single-measurand setting, the normalized residual of the $j$th result is $r_j = (x_j - \hat{\mu})/u(x_j)$, where $\hat{\mu}$ denotes the adjusted (consensus) value. For the 2014 adjustment of the value of $G$, "the Task Group decided that it would be more appropriate to follow its usual approach of treating inconsistent data, namely, to choose an expansion factor that reduces each $|r_j|$ to less than 2" [46]. The idea is aligned with Birge's approach, that $u(x_j)$ should be replaced by $\kappa\, u(x_j)$, where $\kappa$ is the aforementioned expansion factor, thereby reducing the magnitude of the residuals. The adjusted value is the weighted average of the $\{x_j\}$, with weights proportional to $1/(\kappa\, u(x_j))^2$.
This approach is justified by the belief that the $\{r_j\}$ should be approximately like a sample from a Gaussian distribution, which Figure 2 indeed supports. There are, however, two issues with this approach to achieving mutual consistency.
First, and this is the minor issue, the denominator of $r_j = (x_j - \hat{\mu})/u(x_j)$ should be $u(x_j - \hat{\mu})$, not $u(x_j)$, because the latter does not recognize the uncertainty associated with $\hat{\mu}$, or the correlation between $x_j$ and $\hat{\mu}$. Table 2 lists the values of the normalized residuals $\{r_j\}$ as defined conventionally, and their counterparts $\{r_j^* = (x_j - \hat{\mu})/u(x_j - \hat{\mu})\}$ involving the correct denominator (evaluated using the parametric statistical bootstrap [20]). The differences between corresponding values indeed are minor and largely inconsequential in this case.
Second, and this is the major issue, if the (properly) normalized residuals indeed are like a sample of size $n$ from a Gaussian distribution with mean 0 and standard deviation 1, then, according to the Fisher-Tippett-Gnedenko theorem [24], the expected value of the largest residual is approximately equal to $(1 - \gamma)\Phi^{-1}(1 - 1/n) + \gamma\,\Phi^{-1}(1 - 1/(en))$, where $\Phi^{-1}$ denotes the quantile function (inverse of the cumulative probability distribution function) of the Gaussian distribution with mean 0 and standard deviation 1, $e \approx 2.718\,282$ is Euler's number, and $\gamma \approx 0.577\,215\,7$ is the Euler-Mascheroni constant. This expected value increases with $n$, and it is already 1.8 for $n = 16$.
Furthermore, when there are $n = 16$ normalized residuals, and the data are mutually consistent and the underlying statistical model applies, the odds are better than even (53 % probability, in fact) that at least one residual will have absolute value greater than 2. Therefore, and in general, requiring that all normalized residuals, after application of the expansion factor, should have absolute values less than 2 leads to excessively large expansion factors.
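Both figures quoted above, the expected maximum of about 1.8 and the 53 % probability, can be reproduced with the standard library alone (`statistics.NormalDist` provides $\Phi$ and $\Phi^{-1}$):

```python
import math
from statistics import NormalDist

n = 16
Phi = NormalDist()                 # standard Gaussian distribution
euler_gamma = 0.5772156649        # Euler-Mascheroni constant

# Fisher-Tippett-Gnedenko approximation to the expected value of the largest
# of n independent standard Gaussian variables
e_max = ((1 - euler_gamma) * Phi.inv_cdf(1 - 1 / n)
         + euler_gamma * Phi.inv_cdf(1 - 1 / (math.e * n)))

# probability that at least one of n normalized residuals exceeds 2 in magnitude
p_inside = Phi.cdf(2) - Phi.cdf(-2)
p_at_least_one = 1 - p_inside ** n
```

For $n = 16$ this gives `e_max` ≈ 1.80 and `p_at_least_one` ≈ 0.53, as stated in the text.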

Additive models
An alternative treatment, which indeed is the most prevalent approach to blending independent measurement results, from medicine to metrology, including when the results are mutually inconsistent, involves an additive model for the measured values, of the form $x_j = \mu + \lambda_j + \varepsilon_j$, for $j = 1, \dots, n$.
This model acknowledges the possibility that the different experiments may be measuring different quantities, by introducing experiment effects $\{\lambda_j\}$ such that, given $\lambda_j$, the expected value of $x_j$ is $\mu + \lambda_j$. The standard deviation of the measurement error $\varepsilon_j$ is the reported uncertainty, $u(x_j)$. Since the experiment effects may be indistinguishable from zero, this model can also accommodate mutually consistent data.
On first impression, it may seem that the model is non-identifiable: that by making $\lambda_j$ large and $\varepsilon_j$ small, or vice versa, the same value of $x_j$ may be reproduced. However, the fact that the data are not only the $\{x_j\}$ but also the $\{u(x_j)\}$ resolves the potential ambiguity: since the $\{\varepsilon_j\}$ are comparable to their corresponding $\{u(x_j)\}$, if the $\{x_j\}$ turn out to be appreciably more dispersed than the $\{u(x_j)\}$ intimate, then this suggests that the $\{\lambda_j\}$ cannot all be zero. The most common modeling assumption is that the $\{\lambda_j\}$ are a sample from a Gaussian distribution with mean 0 and standard deviation $\tau$, which quantifies the dark uncertainty. Koepke et al. [36] discuss several variants of this random effects model, and describe procedures to fit them to measurement data. Some of these procedures are implemented in the NIST Consensus Builder, which is a Web-based application publicly and freely available at https://consensus.nist.gov [37].
The DerSimonian-Laird procedure to fit random effects models to measurement data is the one used most commonly in meta-analysis in medicine [18; 19]. This procedure yields the conventional weighted mean when the estimate of dark uncertainty is 0.
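A minimal sketch of the DerSimonian-Laird procedure follows, using its standard method-of-moments form; the data are hypothetical stand-ins for Table 1, not the values actually used in this study:

```python
import math

def dersimonian_laird(x, u):
    """DerSimonian-Laird consensus value and dark-uncertainty estimate tau."""
    w = [1 / uj ** 2 for uj in u]
    sw = sum(w)
    xw = sum(wj * xj for wj, xj in zip(w, x)) / sw        # plain weighted mean
    Q = sum(wj * (xj - xw) ** 2 for wj, xj in zip(w, x))  # Cochran's statistic
    n = len(x)
    # method-of-moments estimate of tau^2, truncated at zero
    tau2 = max(0.0, (Q - (n - 1)) / (sw - sum(wj ** 2 for wj in w) / sw))
    # re-weight with the dark uncertainty folded into each variance
    wstar = [1 / (uj ** 2 + tau2) for uj in u]
    mu = sum(wj * xj for wj, xj in zip(wstar, x)) / sum(wstar)
    return mu, math.sqrt(tau2)

# hypothetical, mutually inconsistent results
x = [6.6726, 6.6740, 6.67425, 6.6723, 6.67260, 6.67554, 6.67435]
u = [0.0005, 0.0007, 0.0001, 0.0009, 0.00025, 0.00016, 0.00013]
mu_dl, tau_dl = dersimonian_laird(x, u)
```

When the data are homogeneous, $Q \le n - 1$, the truncation sets $\hat{\tau} = 0$, and the procedure reduces to the conventional weighted mean, as noted above.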
The version of the DerSimonian-Laird procedure implemented in the NIST Consensus Builder estimates $\mu$ as 6.673 99 × 10⁻¹¹ m³ kg⁻¹ s⁻², with associated standard uncertainty $u(\hat{\mu}) = 0.000\,25 \times 10^{-11}$ m³ kg⁻¹ s⁻² (including the Knapp-Hartung adjustment [35]), and dark uncertainty $\hat{\tau}_{\mathrm{DL}} = 0.000\,56 \times 10^{-11}$ m³ kg⁻¹ s⁻². These results are depicted in Figure 3, and appear in Table 4, row DL. Several other versions of the additive random effects model are implemented in various packages for the R environment for statistical data analysis and graphics [60], including: metafor [76] (used to produce the estimates labeled ML, MP, and REML in Table 4); metaplus [4] (for the estimate STU in Table 4); and metamisc [16] (for the estimate MM in Table 4), among many others.

Shades of Dark Uncertainty
A comparison of Figures 1 and 3, and of the underlying models and corresponding numerical results, reveals important and obvious differences, as well as a noteworthy commonality: the consensus values, although numerically different, neither differ significantly from one another once their associated uncertainties are taken into account, nor do they differ significantly from the 2014 CODATA recommended value, even though both incorporate the two measurement results of Li et al. [39], which CODATA did not use in 2014.

The pattern of the measurement results depicted in Figure 3 is fairly typical: on the one hand, there is a cluster of results (including UZur-06 and HUST-TOS-18) that, all by themselves, would be mutually consistent and indeed have measured values that lie quite close to the consensus value; on the other hand, there is another cluster (including BIPM-01, JILA-10 and BIPM-14) whose measured values lie much farther afield, to either side of the consensus value.
To increase the flexibility of additive random effects models, in particular to enable them to cope with such a mixed bag of results, and to alleviate the inequities arising from applying the same dark uncertainty penalty to all the results, regardless of how they are situated relative to the consensus value, we have developed a new model that yields different evaluations of dark uncertainty for different subsets of the measurement results. We call the corresponding, different $\tau$s shades of dark uncertainty.
This new model, which we introduce in the next section, represents the probability distributions of the measured values as mixtures of distributions, similarly to how the linear opinion pool, implemented in the NIST Consensus Builder [36], represents them. (The results of applying the linear opinion pool to the data in Table 1 are labeled LOP in Table 4.)

For a simple example of a mixture, consider two dice: one is cubic, with faces numbered 1 through 6; the other is dodecahedral, with faces numbered 1 through 12; the faces of each die are equally likely to land up when the die is rolled. Suppose that one die is chosen at random, so that the cubic die is twice as likely to be chosen as the dodecahedral die, and then it is rolled. The probability distribution of the outcome is a mixture of two discrete, uniform distributions: for example, the probability of a four is (2/3) × (1/6) + (1/3) × (1/12) = 5/36.
And if one is told that a four turned up, but not which die was rolled, then one can use Bayes' rule [17; 56] to infer that it was the cubic die with ((1/6) × (2/3))/(5/36) = 80 % probability. Given the results of multiple realizations of this procedure (choosing a die at random and rolling it), one may then compute the probabilities of the outcomes having originated in the cubic die. Those outcomes for which this probability is greater than 50 % may be said to form one cluster, and the others a different cluster.
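The two numbers in this dice example can be verified with exact arithmetic:

```python
from fractions import Fraction

# P(choose cubic die) = 2/3, P(choose dodecahedral die) = 1/3
p_cube, p_dodec = Fraction(2, 3), Fraction(1, 3)

# mixture probability of rolling a four: weighted sum of the two uniform components
p_four = p_cube * Fraction(1, 6) + p_dodec * Fraction(1, 12)

# posterior probability that the cubic die was rolled, given a four (Bayes' rule)
p_cube_given_four = (Fraction(1, 6) * p_cube) / p_four
```

The computation confirms `p_four` = 5/36 and `p_cube_given_four` = 4/5, that is, 80 %.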

Bayesian mixture model
The mixture model that we propose is parametric and Bayesian, and depends on the number, $K$, of shades of dark uncertainty to be entertained. "Parametric" means that all probability distributions are determined by a finite number of scalar parameters. "Bayesian" means that the data $\{(x_j, u(x_j))\}$ are modeled as observed values of random variables, that the unknowns (the true value $\mu$ of $G$, the probabilities of membership in the latent clusters, and the shades of dark uncertainty) are modeled as non-observable random variables, and that the information the data hold about the unknowns is extracted by application of Bayes' rule and distilled into the posterior distribution of the unknowns (which is the conditional distribution of the parameters given the data). Subsection 5.1 characterizes the model given the number, $K$, of components in the mixture, and Subsection 5.2 describes how a value for $K$ is chosen automatically, from among the models corresponding to $K = 1, 2, \dots, n$, so that the procedure produces the "best" model, according to a Bayesian model selection criterion.

Model definition
Mixture models do not actually partition the measured values into clusters, each with its own shade of dark uncertainty. Instead, each measured value belongs to all the latent clusters simultaneously, but typically with rather different probabilities of belonging to each one of them. This fuzzy reality notwithstanding, it is often a useful simplification to say that a measured value belongs to the latent cluster that it has the largest posterior probability of belonging to. Accordingly, and to present the results vividly, in Section 6 we "assign" each measurement to the latent cluster that the measurement has the largest posterior probability of belonging to, the so-called maximum a posteriori (MAP) estimate of cluster membership.
The distributions being mixed (which define the latent clusters) are Gaussian, and they have different standard deviations, which are determined by the shades of dark uncertainty, $\tau_1, \dots, \tau_K$. The results include an estimate of $\mu$, an evaluation of the associated uncertainty, estimates of the $\{\tau_k\}$, as well as the identification of the latent cluster that each measurement result most likely belongs to.
Since the model is Bayesian and will be fit to the measurement results via Markov Chain Monte Carlo (MCMC) [23], not only estimates and standard uncertainties, but also coverage (credible) intervals, may easily be derived for all the parameters in the model: $G$, the $\{\tau_k\}$, and the cluster membership probabilities $\pi_j = (\pi_{j,1}, \dots, \pi_{j,K})$, where $\pi_{j,K} = 1 - (\pi_{j,1} + \cdots + \pi_{j,K-1})$, for $j = 1, \dots, n$, and $\pi_{j,k}$ denotes the probability that measurement $j$ belongs to cluster $k$, for $k = 1, \dots, K$. Therefore, the model corresponding to a particular value of $K$ has $1 + K + n(K-1)$ parameters.
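The parameter count can be made concrete with a short sketch (ours, for illustration only): one consensus value $G$, $K$ shades of dark uncertainty, and $K-1$ free membership probabilities per measurement, since the $K$-th probability is fixed by the sum-to-one constraint.

```python
def n_parameters(n: int, K: int) -> int:
    """Number of free parameters in the mixture model with n measurements
    and K latent clusters: G (1), tau_1..tau_K (K), and K-1 free
    membership probabilities for each of the n measurements."""
    return 1 + K + n * (K - 1)

# With n = 16 measurements of G: the common single-shade model (K = 1)
# has only 2 parameters, whereas the selected K = 2 model has 19.
print(n_parameters(16, 1), n_parameters(16, 2))
```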
The reported uncertainties $\{u(x_j)\}$, even though they are data, are treated as known quantities on the assumption that they are based on infinitely many degrees of freedom. In cases where they are not, the model can easily be modified to accommodate the finite numbers of degrees of freedom that the $\{u(x_j)\}$ may be based on.
The model is hierarchical [22]: given $G$, the $\{\tau_k\}$, and the latent cluster memberships $\{z_j\}$, the measured values are modeled as observed outcomes of Gaussian random variables, with $x_j$ having a Gaussian distribution with mean $G$ and standard deviation $\sigma_j$ such that $\sigma_j^2 = u^2(x_j) + \tau_{z_j}^2$. We implemented this model in the JAGS language [55], and then used the implementation in R function jags, defined in package R2jags [70], to produce samples from the posterior distribution of all the parameters via MCMC.
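To make the hierarchical structure concrete, here is a deliberately simplified, self-contained sampler in Python (ours, not the JAGS implementation used in the paper). It fits a model of the same form, $x_j \sim \mathrm{N}(G,\, u^2(x_j) + \tau_{z_j}^2)$, but with shared mixture weights standing in for the per-measurement membership probabilities, and with naive priors (flat on $G$ and on the $\tau_k$, Dirichlet on the weights):

```python
import numpy as np

def gibbs_mixture(x, u, K=2, n_iter=1500, burn=500, seed=1):
    """Toy Gibbs/Metropolis sampler for x_j ~ N(G, u_j^2 + tau_{z_j}^2).
    Simplifications relative to the paper's model: shared mixture weights
    instead of per-measurement probabilities, flat priors on G and tau_k."""
    rng = np.random.default_rng(seed)
    n = len(x)
    G = np.mean(x)
    tau = np.linspace(0.5, 1.5, K) * np.std(x)   # crude starting shades
    w = np.full(K, 1.0 / K)
    z = rng.integers(0, K, size=n)
    keep_G, keep_tau = [], []
    for it in range(n_iter):
        # 1. cluster memberships z_j given G, tau, w
        var = u[:, None] ** 2 + tau[None, :] ** 2              # n x K
        logp = (np.log(w)[None, :] - 0.5 * np.log(var)
                - 0.5 * (x[:, None] - G) ** 2 / var)
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=p[j]) for j in range(n)])
        # 2. G given z, tau: flat prior gives a weighted-mean Gaussian update
        s2 = u ** 2 + tau[z] ** 2
        wts = 1.0 / s2
        G = rng.normal((wts * x).sum() / wts.sum(), np.sqrt(1.0 / wts.sum()))
        # 3. each tau_k via random-walk Metropolis on log(tau_k)
        for k in range(K):
            prop = tau[k] * np.exp(0.2 * rng.normal())
            def loglik(t):
                v = u ** 2 + t ** 2
                m = (z == k)
                return -0.5 * np.sum(np.log(v[m]) + (x[m] - G) ** 2 / v[m])
            # log(prop/tau[k]) is the Jacobian of the log-scale proposal
            if (np.log(rng.uniform())
                    < loglik(prop) - loglik(tau[k]) + np.log(prop / tau[k])):
                tau[k] = prop
        # 4. mixture weights given the membership counts
        counts = np.bincount(z, minlength=K)
        w = rng.dirichlet(1.0 + counts)
        if it >= burn:
            keep_G.append(G)
            keep_tau.append(tau.copy())
    return np.array(keep_G), np.array(keep_tau)
```

On synthetic data with two noise regimes, the posterior draws of $G$ concentrate near the weighted mean of the more consistent measurements; the analysis reported in the paper relies on JAGS/R2jags as described above.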

Model selection
Since the mixture representation of the dark uncertainty that appears in the second term on the right-hand side of Equation (1) involves latent clusters and not a partition of the measurements into actual clusters, in principle there is no constraint on the number, $K$, of latent clusters. However, common sense dictates that there ought not to be more clusters than the number, $n$, of measurements being combined, hence $1 \le K \le n$.
We consider the models corresponding to $K = 1, \dots, n$ in turn, and use each one to predict the value of $G$ that a future, independent experiment may produce.
Then we choose the model that makes the most accurate predictions. To be able to explain how this is done, even while omitting all of the technical details, we need to introduce some notation.
Let $D$ denote the data in hand ($n$ measured values and their associated uncertainties), and let $\theta_K$ denote the parameters of the model defined in Subsection 5.1 with $K$ latent clusters. Therefore, $\theta_K$ includes the unknown value of the Newtonian constant of gravitation, $G$, the shades of dark uncertainty $\{\tau_k\}$, and the probabilities, $\{\pi_{j,k}\}$, of membership in the latent clusters. The probability density of the data given the parameters is $p(D \mid \theta_K)$, and $p(\theta_K)$ is the prior probability density of the parameters. The density of the posterior distribution of the parameters given the data is given by Bayes's rule [17]: $p(\theta_K \mid D) = p(D \mid \theta_K)\,p(\theta_K)/p(D)$, where $p(D) = \int p(D \mid \theta_K)\,p(\theta_K)\,\mathrm{d}\theta_K$, and the integral is over the set of possible values of the parameters.
Our goal is to select the value of $K$ for which $h_K(x^* \mid D)$ is largest, where $x^*$ denotes a future measurement, and $h_K$ is the posterior predictive density defined as $h_K(x^* \mid D) = \int p(x^* \mid \theta_K)\,p(\theta_K \mid D)\,\mathrm{d}\theta_K$ [23]. Since this future observation $x^*$ is speculative (hence, unknown), the best we can do is estimate $h_K(x^* \mid D)$ pretending that $x^*$ is one of the results that we have, and that $D$ comprises all the results that we have except that one.
For model selection, we rely on the Bayesian leave-one-out cross-validation score (LOO), which gauges the posterior predictive acumen of the model under consideration. To compute it, the model is fitted to $D_{-j}$ (all the measurements except the $j$th), and the corresponding predictive density is evaluated at $x_j$ (the measurement left out, here playing the role of future, independent measurement), this process being repeated for $j = 1, \dots, n$. Thus, for each number of latent clusters $K$, the model is fitted $n$ times, producing posterior densities $p_{-1,K}, \dots, p_{-n,K}$, each based on $n-1$ measurements, and $\log h_K(x^* \mid D)$ is estimated by the cross-validated predictive accuracy score $\mathrm{LOO}_K = \sum_{j=1}^{n} \log h_K(x_j \mid D_{-j})$, which we then transform into the LOO information criterion, $\mathrm{LOOIC}_K = -2 \times \mathrm{LOO}_K$, which is numerically comparable to Akaike's Information Criterion (AIC), a widely used model selection criterion [8].
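The brute-force computation can be illustrated exactly in the simplest case: a common-mean Gaussian model with known variances and a flat prior on $G$ (the $K = 0$ case below), where every leave-one-out posterior predictive density is Gaussian in closed form and no MCMC is needed. This sketch (ours; function names are assumptions) is an illustration of the definition, not the paper's computation:

```python
import numpy as np
from math import log, pi

def looic_common_mean(x, u):
    """Exact LOOIC for the common-mean model x_j ~ N(G, u_j^2) with a flat
    prior on G and no dark uncertainty. Leaving out x_j, the posterior of G
    is N(g_hat, 1/sum w) with w_i = 1/u_i^2, so the predictive density for
    x_j is N(g_hat, 1/sum w + u_j^2)."""
    x, u = np.asarray(x, float), np.asarray(u, float)
    w = 1.0 / u ** 2
    loo = 0.0
    for j in range(len(x)):
        mask = np.arange(len(x)) != j
        wj = w[mask]
        g_hat = np.sum(wj * x[mask]) / wj.sum()   # weighted mean without x_j
        pred_var = 1.0 / wj.sum() + u[j] ** 2      # predictive variance for x_j
        loo += -0.5 * (log(2 * pi * pred_var) + (x[j] - g_hat) ** 2 / pred_var)
    return -2.0 * loo
```

Mutually consistent data yield a smaller (better) LOOIC than over-dispersed data with the same reported uncertainties, which is exactly why the criterion penalizes models that ignore dark uncertainty.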
Since determining each $p_{-j,K}$ involves an MCMC run, the procedure outlined in the previous paragraph requires $n$ MCMC runs for each value of $K$. However, R package loo [75] offers a shortcut to this onerous procedure and produces an approximation to the foregoing sum of log predictive densities using the results of a single MCMC run.
Since the LOOIC involves the data and MCMC sampling, it is surrounded by uncertainty, which we have evaluated using R function loo, defined in the package of the same name. In general, the smaller the LOOIC, the better the model. However, differences between values of LOOIC have to be interpreted taking their associated uncertainties into account, as we explain in Section 6.

Similar models
There is a growing collection of models whose purpose and devices are similar to the model we described above. Here we mention only a few of these alternatives.
Burr and Doss [10] describe a Bayesian semi-parametric model for random-effects meta-analysis in the form of a Dirichlet mixture, which is implemented in R package bspmma [9]. Jara et al. [31] present Bayesian non-parametric and semi-parametric models for a wide range of applications, including for linear, mixed-effects models used in meta-analysis, using a Dirichlet process prior distribution, or a mixture of Dirichlet process prior distributions [72], for the distribution of the random effects. Both R packages DPpackage [30; 31] and dirichletprocess [63] facilitate the use of these priors. Jagan and Forbes [29] propose adjusting (typically inflating) each reported uncertainty just enough to achieve mutual consistency, with the adjustments obtained by minimization of a relative entropy criterion. The results may be interpreted as involving estimates of dark uncertainty that are tailored for each measurement result individually. Our proposal and Rukhin [65]'s are similar in that they both model the additional uncertainty directly, and not through the distribution of the random effects as is done in most other models. The main differences between our approach and Rukhin [65]'s are the following:
• Our mixture model comprises latent clusters, and each measurement may belong to all the clusters simultaneously, possibly with different probabilities, hence its effective dark uncertainty is a mixture of the shades of dark uncertainty of the latent clusters; Rukhin [65] partitions the measurements into clusters and assigns a particular, same value of dark uncertainty to all the measurements in the same cluster.
• Rukhin [65] assumes that the measurements in one of the clusters are mutually consistent, hence that it has no dark uncertainty (the "null" cluster). In most cases there will be multiple clusters whose measurements are mutually consistent, and the results may depend on which one is chosen to play the role of "null" cluster.

Results
Table 3 lists the values of the model selection criterion LOOIC, and associated uncertainties, for the models corresponding to $K = 0, 1, \dots, 16$ shades of dark uncertainty. The case with $K = 0$ is the common mean model, $x_j = G + \epsilon_j$ (cf. Equation (1)), which does not recognize dark uncertainty, and is vastly inferior to the models that do recognize it. As $K$ increases from 1 to $n$, the LOOIC undergoes its largest drop in value from $K = 1$ to $K = 2$, where it reaches its minimum, thus suggesting that the best model should have $K = 2$ latent clusters. However, the large uncertainties associated with the LOOIC caution that this choice is only nominally better than any other. One of the reasons why the LOOIC does not achieve a sharp, deep minimum, and instead keeps hovering near its minimum as $K$ increases above 2, is that for some of the larger values of $K$, the number of effective latent clusters is much smaller than $K$. For example, when $K = 10$, there are only 5 different MAP estimates of cluster "membership", that is, 5 different effective latent clusters. Next we explain what we mean by "effective latent clusters."

In Subsection 5.1 we pointed out that we "assign" each measurement to the latent cluster that the measurement has the largest posterior probability of belonging to: the so-called maximum a posteriori (MAP) estimate of cluster membership. These MAP assignments are reflected in the different colors of the labels in the figure. Recognizing that the model with $K = 2$, although nominally the best, is not head and shoulders above the other models with $K \ge 1$, we further invoke the general principle that, everything else being just about comparable, one is well advised to take the simpler model: therefore, we proceed on the assumption that the best model has $K = 2$ latent clusters. This choice is also supported by the fact that the model with $K = 2$ assigns clearly smaller amounts of dark uncertainty to UWash-00 and to UZur-06 than to results that are similarly precise, or even more precise, but lie farther away from the consensus value. The model with $K = 1$ would be incapable of drawing such distinctions.
The MCMC procedure yielded a sample of size 512 000 drawn from the joint posterior distribution of the parameters, obtained by collating every 25th outcome from each of four chains of length $4 \times 10^{6}$, after a burn-in of $8 \times 10^{5}$ iterations per chain. Each point in this sample comprises one value for $G$, values for $\tau_1$ and $\tau_2$, cluster memberships $z_1, \dots, z_n$, and cluster membership probabilities $\pi_{1,1}$, $\pi_{1,2} = 1 - \pi_{1,1}$, ..., $\pi_{n,1}$, $\pi_{n,2} = 1 - \pi_{n,1}$ for all the measurements in Table 1.
As shown in Figure 6, the dark uncertainty assigned to HUST-AAF-18 is closer to the dark uncertainty assigned to NIST-82 than to the dark uncertainty assigned to HUST-TOS-18, even though, on the one hand, the standard uncertainty reported for HUST-AAF-18 is quite similar to the standard uncertainty reported for HUST-TOS-18, and, on the other hand, HUST-AAF-18 lies much closer to the consensus value than NIST-82. The reason is that cluster membership is determined by the distance to the consensus value gauged in terms of the reported standard uncertainty: from this viewpoint, HUST-AAF-18 is just about as far from the consensus value as NIST-82, and much farther from it than HUST-TOS-18. A word of explanation is in order for how the lengths of the thin lines were determined. The thin line centered at $x_j$ represents $x_j \pm \sigma_j$, where $\sigma_j$ was defined in Equation (2). However, this is not how the $\{\sigma_j\}$ were computed.
The approach we took for computing $\sigma_j$ tracks, as closely as possible, the actual way in which the MCMC unfolds: each time the MCMC process generates an acceptable sample, it provides a cluster membership $z_j \in \{1, 2\}$ for $x_j$, and also produces values of $\tau_1$ and $\tau_2$ for the latent clusters: the corresponding value of $\sigma_j^2$ is $u^2(x_j) + \tau_{z_j}^2$. The value used for $\sigma_j$ in this figure is the square root of the median of the 512 000 samples of $\sigma_j^2$ computed as just described, for each $j = 1, \dots, n$.
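The per-measurement computation just described can be sketched as follows (Python, ours for illustration; the array shapes and 0-based cluster labels are assumptions about how MCMC output might be stored):

```python
import numpy as np

def sigma_from_draws(u_j, z_draws, tau_draws):
    """Total uncertainty sigma_j for one measurement from MCMC output:
    for each posterior draw, sigma_j^2 = u(x_j)^2 + tau_{z_j}^2, with the
    shade tau picked out by that draw's cluster membership z_j; sigma_j is
    the square root of the median of those sigma_j^2 values.

    u_j       : reported standard uncertainty of measurement j (scalar)
    z_draws   : (S,) cluster membership of measurement j in each of S draws
    tau_draws : (S, K) shades of dark uncertainty in each draw
    """
    tau_selected = tau_draws[np.arange(len(z_draws)), z_draws]
    sigma2 = u_j ** 2 + tau_selected ** 2
    return np.sqrt(np.median(sigma2))
```

For instance, with $u(x_j) = 3$ and every draw assigning the measurement to a cluster with $\tau = 4$, the function returns $\sqrt{3^2 + 4^2} = 5$.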

Conclusions
The mixture model introduced in Section 5, driven by the model selection criteria outlined in Subsection 5.2, by and large achieved one of the central goals of this contribution: to impart flexibility to the conventional laboratory effects model, in particular by successfully addressing the long-standing grievance resulting from penalizing (with dark uncertainty) measured values lying close to the consensus value as severely as those that lie farther afield.
This model truly excels in producing shades of dark uncertainty that are finely attuned to the structure of the data, in particular to how the measured values are arranged relative to one another and relative to the consensus value, while taking their associated uncertainties into account, as Figure 6 shows. Furthermore, the model does all this without widening the uncertainty associated with the consensus value.

[Figure 6 caption: The thin line segments that extend the thick segments indicate the contribution from the mixture of the two shades of dark uncertainty in the model, possibly of different size for the different measurements, but generally larger for those whose labels are highlighted in orange. The horizontal green line represents the consensus value, and the light green band represents the associated standard uncertainty. The horizontal brown line and light brown band are as in Figure 3. The colors of the labels at the bottom indicate the more likely clusters that the different measurement results belong to: these colors are the same that are used in Figures 4 and 5. Since cluster membership is determined by the distance from the measured value to the consensus value gauged in terms of the reported standard uncertainty, the dark uncertainty assigned to HUST-AAF-18 is more comparable to the dark uncertainty assigned to NIST-82 than to the dark uncertainty assigned to HUST-TOS-18.]

Table 4: Consensus values, standard uncertainties, and 95 % coverage intervals (Lwr95, Upr95) for $G$, and estimates of shades of dark uncertainty ($\tau$, or $\tau_1$ and $\tau_2$ for BMM) produced by different statistical models and methods of data reduction. BRE = Birge's approach with inflation factor that makes Cochran's $Q$ equal to its expected value. BRM = Birge's approach with inflation factor equal to its maximum likelihood estimate. BRQ = Birge's approach with smallest inflation factor that makes the data consistent according to Cochran's $Q$ test. MTE = Weighted average after expansion of standard uncertainties to achieve normalized residuals with absolute value less than 2. DL = DerSimonian-Laird with Knapp-Hartung adjustment. ML = Gaussian maximum likelihood. MP = Mandel-Paule. REML = Restricted Gaussian maximum likelihood. STU = Random effects are a sample from a Student's $t$ distribution. BG = Bayesian hierarchical model from the NIST Consensus Builder [37], with the estimate of $G$ set to the median of its posterior distribution. LAP = Laboratory effects and measurement errors modeled as samples from Laplace distributions [66]. MM = Bayesian model using a non-informative Gaussian prior distribution for $G$ and a uniform prior distribution for the dark uncertainty, implemented in R function uvmeta defined in package metamisc [16]. LOP = Linear opinion pool from the NIST Consensus Builder [37]. Throughout, the dark uncertainty expresses the overall dispersion of the measured values above and beyond what their associated, reported uncertainties suggest.
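As one concrete instance of the methods compared in Table 4, the basic DerSimonian-Laird estimator (here without the Knapp-Hartung adjustment used for the DL entry) admits a short closed-form sketch. This is an illustrative re-implementation, not the code behind Table 4:

```python
import numpy as np

def dersimonian_laird(x, u):
    """Basic DerSimonian-Laird estimates: method-of-moments dark
    uncertainty tau, and the resulting consensus value with its standard
    uncertainty (no Knapp-Hartung adjustment)."""
    x, u = np.asarray(x, float), np.asarray(u, float)
    w = 1.0 / u ** 2
    xbar = np.sum(w * x) / w.sum()                 # fixed-effects weighted mean
    Q = np.sum(w * (x - xbar) ** 2)                # Cochran's Q statistic
    c = w.sum() - np.sum(w ** 2) / w.sum()
    tau2 = max(0.0, (Q - (len(x) - 1)) / c)        # moment estimate of tau^2
    wstar = 1.0 / (u ** 2 + tau2)                  # random-effects weights
    mu = np.sum(wstar * x) / wstar.sum()
    return mu, np.sqrt(1.0 / wstar.sum()), np.sqrt(tau2)
```

Mutually consistent data give $\hat\tau = 0$ and reproduce the ordinary weighted mean, whereas over-dispersed data yield $\hat\tau > 0$ and a correspondingly wider uncertainty for the consensus value.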
In this contribution we have argued in favor of model-based approaches to consensus building (as opposed to ad hoc approaches like those that are driven by the Birge ratio or by the sizes of the absolute values of normalized residuals), particularly when faced with measurement results that are mutually inconsistent.
Additive laboratory random effects models using mixtures, like the model we introduced in Section 5, seem especially promising, as they are able to identify subsets of results that appear to express different shades of dark uncertainty, and then to weigh them differently in the process of consensus building, yet without disregarding the information provided by any of the results being combined.
We have assembled in Table 4 the results produced by an assortment of methods, not only to provide perspective on the new method we are proposing (BMM), but also because arguably any of these methods could reasonably be selected by different professional statisticians working in collaboration with physicists engaged in the measurement of $G$. Even though the underlying models and specific validating assumptions differ, choosing one or another reflects mostly inessential differences in training, preference, and experience in statistical data modeling and analysis.
However, the variability of these estimates of $G$, which is attributable solely to differences in approach, model selection, and data reduction technique, amounts to about 78 % of the median of the values of $u(G)$ listed in the same table, and to about 20 % of the median of the values of $\tau$.
In other words, the statistical "noise" (that is, the vagaries and incidentals that would lead one researcher to opt for a particular model and method of data reduction, and another for a different model and method) is clearly not negligible. Therefore, the development, dissemination, and widespread adoption of best practices in statistical modeling and data analysis for consensus building will be contributing factors in reducing the uncertainty associated with any consensus value that may be derived from an ever growing set of reliable measurement results for $G$ obtained by increasingly varied measurement methods.
Finally, we express our belief that one feature already apparent in the collection of measurement results assembled in Table 1 provides the greatest hope yet for appreciable progress in the years to come: the fact that fundamentally different, truly independent measurement methods have been employed, relying on different physical laws, and yet there has been convergence toward a believable consensus, even if it still falls short of achieving a reduction in relative uncertainty to levels comparable to what prevails for other fundamental constants.