Assessing Differences Between Results Determined According to the Guide to the Expression of Uncertainty in Measurement

In some metrology applications multiple results of measurement for a common measurand are obtained and it is necessary to determine whether the results agree with each other. A result of measurement based on the Guide to the Expression of Uncertainty in Measurement (GUM) consists of a measured value together with its associated standard uncertainty. In the GUM, the measured value is regarded as the expected value and the standard uncertainty is regarded as the standard deviation, both known values, of a state-of-knowledge probability distribution. A state-of-knowledge distribution represented by a result need not be completely known. Then how can one assess the differences between the results based on the GUM? Metrologists have for many years used the Birge chi-square test as ‘a rule of thumb’ to assess the differences between two or more measured values for the same measurand by pretending that the standard uncertainties were the standard deviations of the presumed sampling probability distributions from random variation of the measured values. We point out that this is a misuse of the standard uncertainties; the Birge test and the concept of statistical consistency motivated by it do not apply to the results of measurement based on the GUM. In 2008, the International Vocabulary of Metrology, third edition (VIM3) introduced the concept of metrological compatibility. We propose that the concept of metrological compatibility be used to assess the differences between results based on the GUM for the same measurand. A test of the metrological compatibility of two results of measurement does not conflict with a pairwise Birge test of the statistical consistency of the corresponding measured values.

… measurement procedure in the same conditions. Therefore, the consistency of measured values assessed by the Birge test is statistical consistency. The Birge test applies to uncorrelated measured values only. In Sec. 2, we review a concept of statistical consistency motivated by the Birge test. The idea of statistical consistency belongs to the period when the error analysis view of measurements was prevalent. The error analysis view of measurements was a hindrance to communicating the results of measurement and to advancing the science and technology of measurement. Therefore, leading authorities in the field of metrology developed the Guide to the Expression of Uncertainty in Measurement (GUM) [2]. According to the GUM, a result of measurement consists of a measured value together with its associated standard uncertainty. In the GUM, the measured value is regarded as the expected value and the standard uncertainty is regarded as the standard deviation, both known values, of a state-of-knowledge probability distribution. A state-of-knowledge distribution represented by a result of measurement need not be completely known. We note in Sec. 3 that the Birge test and the concept of statistical consistency motivated by it are not applicable to the results of measurement based on the GUM. Then how can one assess the differences between results based on the GUM for the same measurand? In 2008, the International Vocabulary of Metrology, third edition (VIM3) [3] introduced the concept of metrological compatibility of two or more results of measurement determined according to the GUM. In Sec. 4, we review the VIM3 concept of metrological compatibility and propose that this concept be used to assess the differences between multiple results based on the GUM for the same measurand. In Sec. 5, we show that a test of the metrological compatibility of two results of measurement does not conflict with a pairwise Birge test of the statistical consistency of the corresponding measured values.

The Birge Test and Concept of Statistical Consistency
Suppose $x_1, \ldots, x_n$ are $n$ measured values for a common measurand which is believed to be sufficiently stable. The Birge test is based on regarding the measured values $x_1, \ldots, x_n$ as realizations of random draws from their presumed sampling pdfs. A sampling pdf models possible outcomes in contemplated replications of a measurement procedure subject to random effects in the same conditions. Therefore, the consistency (lack of significant differences between measured values) assessed by the Birge test is statistical consistency. The Birge test is applicable when the sampling pdfs of the measured values $x_1, \ldots, x_n$ are uncorrelated. The Birge test requires knowledge of the variances $\sigma_1^2, \ldots, \sigma_n^2$ of the sampling pdfs of $x_1, \ldots, x_n$, respectively. Statistical consistency of the measured values $x_1, \ldots, x_n$ means that their expected values are indistinguishable¹ in view of the corresponding variances. Specifically, the Birge test checks whether the measured values $x_1, \ldots, x_n$ may be modeled as realizations from normal (Gaussian) sampling pdfs with unknown but equal expected values and known variances $\sigma_1^2, \ldots, \sigma_n^2$. Birge proposed that to check the consistency of the measured values $x_1, \ldots, x_n$, one can calculate the test statistic

$$R^2 = \frac{1}{n-1} \sum_{i=1}^{n} w_i (x_i - x_W)^2, \tag{1}$$

where $w_i = 1/\sigma_i^2$, for $i = 1, 2, \ldots, n$, and $x_W = \sum_i w_i x_i / \sum_i w_i$ is the weighted mean of $x_1, \ldots, x_n$. If the calculated value of $R^2$ is substantially larger than one, then the dispersion of $x_1, \ldots, x_n$ is greater than what can be expected from the normal pdfs with equal expected values and known variances $\sigma_1^2, \ldots, \sigma_n^2$. In that case the measured values $x_1, \ldots, x_n$ can be declared to be statistically inconsistent.
¹ In statistical literature the term consistency is applied to a statistical estimator. A point statistical estimator is said to be consistent if it approaches the parameter being estimated as the sample size increases.
Journal of Research of the National Institute of Standards and Technology, Volume 115, Number 6, November-December 2010
Statistical interpretation of the Birge test: Birge was a physicist who proposed his test independently of, and before the establishment of, much of the statistical theory as it is known today. However, the Birge test of consistency can now be interpreted as a classical (sampling theory) statistical test of hypothesis. The measured values $x_1, \ldots, x_n$ are presumed to have normal sampling pdfs with unknown but equal expected values and variance-covariance matrix $\tau^2 \times \mathrm{Diag}[\sigma_1^2, \ldots, \sigma_n^2]$, where $\tau^2$ is an unknown parameter and $\sigma_1^2, \ldots, \sigma_n^2$ are known. The null hypothesis $H_0$ is that $\tau^2 \le 1$ and the alternative hypothesis $H_1$ is that $\tau^2 > 1$. The null hypothesis $H_0$ means that the variances of $x_1, \ldots, x_n$ are not greater than $\sigma_1^2, \ldots, \sigma_n^2$, respectively. The alternative hypothesis $H_1$ means that the variances of $x_1, \ldots, x_n$ are greater than $\sigma_1^2, \ldots, \sigma_n^2$ [4]. The classical p-value $p_C$ is the maximum probability under the null hypothesis of realizing in contemplated replications of the $n$ measurements a value of the test statistic more extreme than its realized (calculated) value. The classical p-value of a realization of $(n-1)R^2$ is

$$p_C = \Pr\{\chi^2(n-1) \ge (n-1) R^2\}, \tag{2}$$

where $\chi^2(n-1)$ denotes a variable with the chi-square probability distribution with $(n-1)$ degrees of freedom [4]. If the classical p-value $p_C$ is too small, say, less than 0.05, then the null hypothesis is rejected with level of significance 0.05 or less. A rejection of the null hypothesis means that the dispersion of the measured values $x_1, \ldots, x_n$ is greater than what can be expected from normal distributions for $x_1, \ldots, x_n$ with equal expected values and stated variances $\sigma_1^2, \ldots, \sigma_n^2$, respectively. The dispersion of $x_1, \ldots, x_n$ can be greater than expected under the null hypothesis because either the variances of $x_1, \ldots, x_n$ are greater than $\sigma_1^2, \ldots, \sigma_n^2$ or their expected values are not equal. If the stated variances $\sigma_1^2, \ldots, \sigma_n^2$ are not questionable, then the assumption that the expected values of $x_1, \ldots, x_n$ are equal appears to be unreasonable. In that case, the measured values $x_1, \ldots, x_n$ can be declared to be statistically inconsistent.
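As an illustration of (1) and (2), the following minimal Python sketch computes the weighted mean, the Birge ratio, and the classical p-value using only the standard library. The function names `chi2_sf` and `birge_test` and the numerical data are hypothetical and are not taken from the cited references; the chi-square upper tail is evaluated with the usual recurrence on the degrees of freedom, starting from the closed forms for one and two degrees of freedom.

```python
import math

def chi2_sf(x, dof):
    """Upper-tail probability Pr{chi-square(dof) >= x}, standard library only."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    # Closed-form starting values: dof = 1 (odd case) or dof = 2 (even case).
    if dof % 2:
        q, k = math.erfc(math.sqrt(half)), 1
    else:
        q, k = math.exp(-half), 2
    # Recurrence Q(x | k + 2) = Q(x | k) + (x/2)^(k/2) exp(-x/2) / Gamma(k/2 + 1).
    while k < dof:
        q += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return q

def birge_test(values, sigmas):
    """Return (weighted mean, Birge ratio R^2, classical p-value); needs n >= 2."""
    n = len(values)
    weights = [1.0 / s ** 2 for s in sigmas]
    x_w = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    r2 = sum(w * (x - x_w) ** 2 for w, x in zip(weights, values)) / (n - 1)
    return x_w, r2, chi2_sf((n - 1) * r2, n - 1)

# Hypothetical data: three measured values with equal stated standard deviations.
x_w, r2, p_c = birge_test([10.0, 10.2, 9.9], [0.1, 0.1, 0.1])
print(f"x_W = {x_w:.4f}, R^2 = {r2:.4f}, p_C = {p_c:.4f}")
```

For these illustrative numbers the Birge ratio is 7/3 and the p-value is exp(-7/3), roughly 0.097, so the null hypothesis would not be rejected at the 0.05 level even though the ratio exceeds one.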
Limitations of the Birge test: A limitation of the Birge test is that it is applicable for uncorrelated measured values $x_1, \ldots, x_n$ only. However, it can be easily generalized to correlated measured values $x_1, \ldots, x_n$ whose covariances, denoted by $\sigma_{12}, \ldots, \sigma_{(n-1)n}$, are known [4]. The Birge test suggests the following notion of the statistical consistency of the measured values $x_1, \ldots, x_n$: The measured values $x = (x_1, \ldots, x_n)^t$ are said to be statistically consistent if their dispersion is not greater than what can be expected from the normal consistency model which postulates that the joint $n$-variate sampling pdf of $x$ is normal $N(\mathbf{1}\mu, D)$ with unknown expected value $\mathbf{1}\mu$ and variance-covariance matrix $D = [\sigma_{ij}]$, where $\mathbf{1} = (1, \ldots, 1)^t$, $\sigma_{ij}$ is the covariance between $x_i$ and $x_j$, and $\sigma_{ii} = \sigma_i^2$ for $i, j = 1, 2, \ldots, n$ [4].
Another limitation of the Birge test (and of its generalized version for correlated measured values) is that it is a one-sided test of hypothesis which checks whether the dispersion of $x_1, \ldots, x_n$ is more than what can be expected from a normal consistency model. A review of the Birge test in [5] notes that if the realized value of the Birge test statistic $R^2$ is substantially less than one, then the stated variances $\sigma_1^2, \ldots, \sigma_n^2$ may well be too large. To avoid declarations of statistical consistency from overstated variances, the following definition of statistical consistency was proposed in [6].

Definition of statistical consistency:
The measured values $x = (x_1, \ldots, x_n)^t$ are said to be statistically consistent if they reasonably fit the normal consistency model which postulates that the joint $n$-variate sampling pdf of $x$ is normal $N(\mathbf{1}\mu, D)$ with unknown expected value $\mathbf{1}\mu$ and variance-covariance matrix $D = [\sigma_{ij}]$.
This definition requires a different approach for testing statistical consistency than the Birge test and its generalized version for correlated values. A modern method to assess the fit of a statistical model to the data is Bayesian posterior predictive checking [6]. Posterior predictive checking is a Bayesian adaptation of classical (sampling theory) statistical hypothesis testing. A function of the data (and possibly unknown parameters) called a 'discrepancy measure' is defined to characterize a potential discrepancy between the statistical model and the data. The posterior predictive p-value $p_P$ of a discrepancy measure $T(x)$ is the probability of realizing in contemplated replications a value of the discrepancy measure more extreme than its realized value. If the posterior predictive p-value is close to zero (or to one), then the fit of the statistical model to the data is suspect.
If the measured values $x_1, \ldots, x_n$ were uncorrelated, then the statistic $T_c(x) = (n-1)R^2$ is a useful discrepancy measure to check the overall fit of the normal consistency model $N(\mathbf{1}\mu, D)$ to the measured values $x_1, \ldots, x_n$. As discussed in [6, Sec. 2.4], the posterior predictive p-value of the realized discrepancy measure $T_c(x) = (n-1)R^2$ is

$$p_P = \Pr\{\chi^2(n-1) \ge (n-1) R^2\}. \tag{3}$$

We note that (3) is identical to the classical p-value $p_C$ given in (2). Thus Bayesian posterior predictive checking of the discrepancy measure $T_c(x) = (n-1)R^2$ is equivalent to the Birge test of statistical consistency.
Bayesian posterior predictive checking can be used to investigate any number of potential discrepancies between the statistical model and the data. To assess the difference between two particular measured values $x_i$ and $x_j$, the statistic $T_{i-j}(x) = |x_i - x_j|$ is a useful discrepancy measure, for $i, j = 1, 2, \ldots, n$ and $i \ne j$. The Bayesian posterior predictive p-value of the realized discrepancy measure $|x_i - x_j|$ is

$$p_P = \Pr\left\{ |Z| \ge \frac{|x_i - x_j|}{\sqrt{\sigma_i^2 + \sigma_j^2 - 2\rho_{ij}\sigma_i\sigma_j}} \right\}, \tag{4}$$

where $\rho_{ij}$ is the correlation coefficient between the presumed normal sampling pdfs of $x_i$ and $x_j$; the covariance between $x_i$ and $x_j$ is $\sigma_{ij} = \rho_{ij}\sigma_i\sigma_j$, and $Z$ denotes a variable with standard normal distribution $N(0, 1)$ [6, Sec. 3.2]. A posterior predictive p-value $p_P$ close to zero suggests that the difference between $x_i$ and $x_j$ is larger than what can be expected from the normal statistical consistency model $N(\mathbf{1}\mu, D)$. That is, the measured values $x_i$ and $x_j$ do not seem to have the same expected value and hence they are not mutually statistically consistent.
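The pairwise check can be sketched in a few lines of Python. The sketch assumes that the p-value is the two-sided standard normal tail probability of the absolute difference divided by the standard deviation of the difference, which is what the quantities defined above (the correlation coefficient, the covariance, and the standard normal variable Z) suggest. The function name `pairwise_p_value` and the data are hypothetical.

```python
import math

def pairwise_p_value(x_i, x_j, sigma_i, sigma_j, rho_ij=0.0):
    """Two-sided normal tail probability Pr{|Z| >= |x_i - x_j| / sd(x_i - x_j)}."""
    # Variance of the difference, using covariance sigma_ij = rho_ij * sigma_i * sigma_j.
    var_diff = sigma_i ** 2 + sigma_j ** 2 - 2.0 * rho_ij * sigma_i * sigma_j
    z = abs(x_i - x_j) / math.sqrt(var_diff)
    # For Z ~ N(0, 1), Pr{|Z| >= z} = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))

# Hypothetical pair of measured values, first uncorrelated, then positively correlated.
p_uncorrelated = pairwise_p_value(10.0, 10.2, 0.1, 0.1)
p_correlated = pairwise_p_value(10.0, 10.2, 0.1, 0.1, rho_ij=0.5)
print(p_uncorrelated, p_correlated)
```

A positive correlation shrinks the standard deviation of the difference, so the same absolute difference yields a smaller p-value in the correlated case.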

Concept of Statistical Consistency Does Not Apply to Results Based on the GUM
A result of measurement determined according to the GUM consists of a measured value together with its associated standard uncertainty. Suppose $[x_1, u(x_1)], \ldots, [x_n, u(x_n)]$ are $n$ results of measurement for a common measurand, where $x_1, \ldots, x_n$ are the measured values and $u(x_1), \ldots, u(x_n)$ are the corresponding standard uncertainties. According to the GUM, a measured value $x_i$ and its associated standard uncertainty $u(x_i)$ represent a state-of-knowledge pdf attributed to the measurand, for $i = 1, 2, \ldots, n$. Following the GUM, we use the symbol $X_i$ for a quantity as well as for a variable with a state-of-knowledge pdf about the quantity $X_i$ represented by the result $[x_i, u(x_i)]$, for $i = 1, 2, \ldots, n$. The measured value $x_i$ is regarded as the expected value $E(X_i)$ and the standard uncertainty $u(x_i)$ is regarded as the standard deviation $S(X_i)$ of the pdf of $X_i$, for $i = 1, 2, \ldots, n$. The mainstream GUM requires knowledge of only the expected value $E(X_i)$ and the standard deviation $S(X_i)$ of a state-of-knowledge pdf of $X_i$. The GUM does not require that the state-of-knowledge pdf of $X_i$ be completely known. When the state-of-knowledge pdfs of $X_1, \ldots, X_n$ are correlated, the correlation coefficients are assumed to be known. Following the GUM, we denote the correlation coefficient $R(X_i, X_j)$ between the state-of-knowledge pdfs of $X_i$ and $X_j$ by the symbol $r(x_i, x_j)$. Note that $\{x_1, \ldots, x_n\}$, $\{u(x_1), \ldots, u(x_n)\}$, and $\{r(x_1, x_2), \ldots, r(x_{n-1}, x_n)\}$ are symbols for known values.
For many years, metrologists have used the Birge test as 'a rule of thumb' to assess the consistency of the measured values by treating the squared standard uncertainties $u^2(x_1), \ldots, u^2(x_n)$ as the known variances $\sigma_1^2, \ldots, \sigma_n^2$ of the presumed normal (Gaussian) sampling pdfs of the measured values $x_1, \ldots, x_n$; see, for example, [8]. The guideline for the analysis of key comparisons developed by the BIPM Director's Advisory Group on Uncertainties recommends the use of the Birge chi-square test to assess the consistency of measured values by treating the squared standard uncertainties as the known variances of the presumed sampling pdfs of the measured values [9]. The consistency of the measured values from CIPM key comparisons and supplementary comparisons is almost always assessed using the Birge test [10].
The squared standard uncertainties $u^2(x_1), \ldots, u^2(x_n)$ cannot in any logical sense be identified with the known variances $\sigma_1^2, \ldots, \sigma_n^2$ of the presumed normal (Gaussian) sampling pdfs of the measured values $x_1, \ldots, x_n$. The standard deviation of a sampling pdf represents possible dispersion from random variation in contemplated replications of the measurement procedures. A standard uncertainty expresses the dispersion of a state-of-knowledge pdf which could be attributed to the measurand based on all available statistical and non-statistical information. A standard uncertainty includes all significant components, whether arising from random effects or from corrections applied for systematic effects. In measurements done in high echelon laboratories, the component of uncertainty arising from random effects is generally a very small part of the combined standard uncertainty. Treating the squared standard uncertainties $u^2(x_1), \ldots, u^2(x_n)$ determined according to the GUM as the known variances $\sigma_1^2, \ldots, \sigma_n^2$ from random variation (in contemplated replications of the measurements) is a misuse of the standard uncertainties. Also, as noted earlier, the state-of-knowledge pdfs represented by the results $[x_1, u(x_1)], \ldots, [x_n, u(x_n)]$ may not be completely known. Therefore the Birge test and the concept of statistical consistency motivated by the Birge test do not apply to the results of measurement determined according to the GUM.

Based on the GUM
A measured quantity value [3, definitions 1.19 and 2.10] is a product of a numerical value and a measurement unit. The measurement unit implies that the measured value is traceable to a reference for that measurement unit. A result of measurement (measured value together with its associated standard uncertainty) is traceable to a reference only if the result can be related to a practical realization of that reference through a  [3, definition 2.46]. Metrological comparability does not imply that the measured values have similar magnitudes. Thus, for example, distance between my apartment and my office expressed in meters is metrologically comparable to the distance between my apartment and the moon also expressed in meters. The concept of metrological compatibility discussed in the next section applies only to those results of measurement for a common measurand which are metrologically comparable. That is, the results must be traceable to the same reference.
The concept of statistical consistency can be applied to any set of numerical values which have similar magnitudes. They do not have to be measured values. Thus, for example, one can test statistical consistency of deviations (or relative deviations expressed as percentage) from a benchmark value. Although a metrologist is expected to assess consistency of only those measured values which have the same measurement unit, it is not a requirement of statistical consistency.
All $n$ results $[x_1, u(x_1)], \ldots, [x_n, u(x_n)]$ for a common measurand must be traceable to the same reference for them to be metrologically comparable [3, definition 2.46]. The VIM3 concept of metrological compatibility is defined for two results of measurement at a time. The following definition is an elaboration of the succinct definition given in VIM3 [3, definition 2.47].
Definition of metrological compatibility: Two metrologically comparable results $[x_1, u(x_1)]$ and $[x_2, u(x_2)]$ for the same measurand are said to be metrologically compatible if

$$\frac{|x_1 - x_2|}{\sqrt{u^2(x_1) + u^2(x_2) - 2\, r(x_1, x_2)\, u(x_1)\, u(x_2)}} \le \kappa \tag{5}$$

for a specified threshold $\kappa$, where $r(x_1, x_2)$ is a symbol for the correlation coefficient $R(X_1, X_2)$ between the variables $X_1$ and $X_2$. The quantity in the denominator of (5) is the standard deviation of the state-of-knowledge pdf for $X_1 - X_2$, which may be incompletely determined. When the pdfs represented by $[x_1, u(x_1)]$ and $[x_2, u(x_2)]$ are uncorrelated, then $R(X_1, X_2) = 0$ and (5) reduces to

$$\frac{|x_1 - x_2|}{\sqrt{u^2(x_1) + u^2(x_2)}} \le \kappa. \tag{6}$$

A set of metrologically comparable results $[x_1, u(x_1)], [x_2, u(x_2)], \ldots, [x_n, u(x_n)]$ for the same measurand is said to be metrologically compatible if for every one of the $n(n-1)/2$ pairs of results $[x_i, u(x_i)]$ and $[x_j, u(x_j)]$ we have

$$\frac{|x_i - x_j|}{\sqrt{u^2(x_i) + u^2(x_j) - 2\, r(x_i, x_j)\, u(x_i)\, u(x_j)}} \le \kappa \tag{7}$$

for a specified threshold $\kappa$ [3, definition 2.47]. The VIM3 does not discuss how the threshold $\kappa$ should be determined. A conventional value of $\kappa$ is two.
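The pairwise criterion and its extension to a set of results can be sketched as follows. This is an illustrative Python sketch rather than a prescribed procedure: the function names, the dictionary used to pass correlation coefficients, the treatment of unspecified pairs as uncorrelated, and the data are all assumptions made for the example.

```python
import math
from itertools import combinations

def compatibility_ratio(x1, u1, x2, u2, r=0.0):
    """|x1 - x2| divided by the standard uncertainty of the difference X1 - X2."""
    u_diff = math.sqrt(u1 ** 2 + u2 ** 2 - 2.0 * r * u1 * u2)
    return abs(x1 - x2) / u_diff

def is_compatible_set(results, kappa=2.0, corr=None):
    """Check every one of the n(n-1)/2 pairs in a list of (x, u) results.

    corr maps index pairs (i, j) with i < j to correlation coefficients;
    pairs that are not listed are treated as uncorrelated (an assumption).
    """
    corr = corr or {}
    return all(
        compatibility_ratio(*results[i], *results[j], corr.get((i, j), 0.0)) <= kappa
        for i, j in combinations(range(len(results)), 2)
    )

# Hypothetical results (measured value, standard uncertainty) for one measurand.
results = [(10.0, 0.1), (10.2, 0.1), (9.9, 0.1)]
print(is_compatible_set(results))             # threshold kappa = 2
print(is_compatible_set(results, kappa=3.0))  # a looser, agreed-upon threshold
```

With these numbers the pair (10.2, 9.9) has a ratio of about 2.12, so the set fails with a threshold of two but passes with a threshold of three, which shows how the conclusion depends on the agreed value of the threshold.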
The concept of metrological compatibility can be used to assess the differences between the results of measurement based on the GUM for the same measurand. The concepts of metrological comparability and compatibility do not require that the state-of-knowledge pdfs represented by the results $[x_1, u(x_1)], [x_2, u(x_2)], \ldots, [x_n, u(x_n)]$ be completely known. Thus they fit the GUM. When the set of results $[x_1, u(x_1)], \ldots, [x_n, u(x_n)]$ is metrologically compatible, we can say that the differences between the measured values $x_1, \ldots, x_n$ are insignificant in view of the uncertainties $u(x_1), \ldots, u(x_n)$.
To assess metrological compatibility of results based on the GUM using the criteria (5), (6), or (7), the threshold $\kappa$ needs to be specified. A proper choice of $\kappa$ is to a large extent a matter of agreement because it requires accepting the economic consequences of that choice. Although a conventional value of $\kappa$ is two, the interested parties could agree on a different value depending on the application. Once the value of the threshold $\kappa$ is set, the conclusion of a test of metrological compatibility based on the VIM3 definition is dichotomous: a set of results is either metrologically compatible or incompatible. The concept of metrological compatibility is being used by metrologists who are familiar with it; see, for example, [11,12].
The VIM3 definition of metrological compatibility can be easily extended to metrological compatibility of a set of results and a reference result $[x_R, u(x_R)]$, where $x_R$ is the reference value with standard uncertainty $u(x_R)$. Suppose the pdfs represented by the measurement results are uncorrelated with the pdf represented by the reference result. A set of results $[x_1, u(x_1)], \ldots, [x_n, u(x_n)]$ metrologically comparable with a reference result $[x_R, u(x_R)]$ is compatible if

$$\frac{|x_i - x_R|}{\sqrt{u^2(x_i) + u^2(x_R)}} \le \kappa \tag{8}$$

for $i = 1, 2, \ldots, n$ [13]. Similarly, a set of results $[x_1, u(x_1)], \ldots, [x_n, u(x_n)]$ metrologically comparable with a combined result $[x_C, u(x_C)]$, where $x_C$ is the combined value (such as an arithmetic mean or a weighted mean) with standard uncertainty $u(x_C)$, is compatible if

$$\frac{|x_i - x_C|}{\sqrt{u^2(x_i) + u^2(x_C) - 2\, r(x_i, x_C)\, u(x_i)\, u(x_C)}} \le \kappa, \tag{9}$$

where $r(x_i, x_C)$ denotes the correlation coefficient between the pdfs represented by $[x_i, u(x_i)]$ and $[x_C, u(x_C)]$, for $i = 1, 2, \ldots, n$ [13].
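Both extensions can be illustrated with one small Python sketch. The function name `compatible_with` and the data are hypothetical. The combined-result example further assumes that the combined value is the weighted mean, with weights equal to the reciprocals of the squared standard uncertainties, of mutually uncorrelated results; under that assumption, propagation of uncertainty gives a covariance between each result and the weighted mean equal to the squared uncertainty of the mean, so the correlation coefficient is the ratio of the uncertainty of the mean to the uncertainty of the result.

```python
import math

def compatible_with(results, target, kappa=2.0, corr=None):
    """Check each (x_i, u_i) against a target result (x_T, u_T).

    corr[i] is the correlation coefficient between result i and the target;
    leave it as None for a reference result uncorrelated with the results.
    """
    x_t, u_t = target
    flags = []
    for i, (x, u) in enumerate(results):
        r = 0.0 if corr is None else corr[i]
        u_diff = math.sqrt(u ** 2 + u_t ** 2 - 2.0 * r * u * u_t)
        flags.append(abs(x - x_t) / u_diff <= kappa)
    return flags

# Hypothetical results and an uncorrelated reference result.
results = [(10.0, 0.1), (10.2, 0.1), (10.4, 0.1)]
flags_reference = compatible_with(results, (10.05, 0.05))

# Combined result: weighted mean of the (assumed uncorrelated) results.
weights = [1.0 / u ** 2 for _, u in results]
x_c = sum(w * x for w, (x, _) in zip(weights, results)) / sum(weights)
u_c = 1.0 / math.sqrt(sum(weights))
corr_with_mean = [u_c / u for _, u in results]   # holds only under the stated assumption
flags_combined = compatible_with(results, (x_c, u_c), corr=corr_with_mean)
print(flags_reference, flags_combined)
```

Ignoring the correlation between a result and a combined value that was computed from it would overstate the uncertainty of the difference, which is why the correlation term appears in the check against the combined result.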

Concluding Remarks
For many years, metrologists have used the Birge chi-square test as 'a rule of thumb' to assess the differences between two or more measured values for the same measurand by pretending that the squared standard uncertainties were the known variances of the presumed normal sampling pdfs of the measured values. This is misuse of the standard uncertainties based on the GUM. The Birge test and the concept of statistical consistency do not apply to the results of measurement based on the GUM. As discussed in this paper, the VIM3 concept of metrological compatibility can be used to assess the differences between the results of measurement determined according to the GUM. Thus metrologists can start using the VIM3 concept of metrological compatibility in place of the Birge test to assess the differences between multiple results of measurement of the same measurand.
The following is a pertinent question. Could the conclusions (about mutual agreement of results) based on the VIM3 concept of metrological compatibility and the Birge test (based on treating squared standard uncertainties as the known variances of sampling pdfs of measured values) differ? It is difficult to directly compare the Birge test and a test of metrological compatibility because the former is defined for an arbitrary integer $n > 1$ and the latter is defined for only two results at a time. For pairwise comparisons ($n = 2$), the Birge test statistic $R^2 = \sum_i w_i (x_i - x_W)^2 / (n-1)$ reduces to

$$R^2 = \frac{(x_1 - x_2)^2}{\sigma_1^2 + \sigma_2^2}, \tag{10}$$

which is the square of $(x_1 - x_2)/\sqrt{\sigma_1^2 + \sigma_2^2}$. Under the null hypothesis that the presumed normal sampling pdfs of $x_1$ and $x_2$ have the same expected value, the distribution of $(x_1 - x_2)/\sqrt{\sigma_1^2 + \sigma_2^2}$ is normal $N(0, 1)$. Therefore, when $n = 2$, the normal distribution can be used to assess the absolute difference $|x_1 - x_2|$. The square of a normal $N(0, 1)$ variable has the chi-square distribution $\chi^2(1)$ with one degree of freedom. Therefore the square of the $(1 - \alpha/2) \times 100$-th percentile $z_{[1-\alpha/2]}$ of the normal $N(0, 1)$ distribution is equal to the $(1 - \alpha) \times 100$-th percentile $\chi^2(1)_{[1-\alpha]}$ of the $\chi^2(1)$ distribution. Thus the realized value of (10) being less than $\chi^2(1)_{[1-\alpha]}$ is equivalent to the absolute ratio $|x_1 - x_2|/\sqrt{\sigma_1^2 + \sigma_2^2}$ being less than $z_{[1-\alpha/2]}$.
It follows that a declaration of Birge statistical consistency when the classical p-value $p_C$ of the Birge test (2) is not less than 0.05 (for example) is equivalent to the realization that

$$\frac{|x_1 - x_2|}{\sqrt{\sigma_1^2 + \sigma_2^2}} \le z_{[0.975]} \approx 1.96 \approx 2. \tag{11}$$

We note from (6) and (11) that if the threshold $\kappa$ for metrological compatibility is set as $\kappa = 2$, then the conclusion of a check of metrological compatibility between a pair of results $[x_1, u(x_1)]$ and $[x_2, u(x_2)]$ would be essentially identical to the assessment of statistical consistency between $x_1$ and $x_2$ based on the Birge test by (wrongly) treating $u^2(x_1)$ and $u^2(x_2)$ as $\sigma_1^2$ and $\sigma_2^2$, respectively (and treating the correlation coefficient $R(X_1, X_2)$ as $\rho_{12}$, which is zero in the Birge test). Therefore a pairwise Birge test of statistical consistency and a test of metrological compatibility do not conflict.
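The pairwise equivalence can be checked numerically. The following Python sketch, with hypothetical data and variable names, computes the Birge ratio from the general weighted-mean formula for two values, confirms that it equals the square of the compatibility ratio for uncorrelated results, and compares the decision at the 0.05 level with the decision for a threshold of two. It relies on the fact that the upper tail of a chi-square variable with one degree of freedom equals the two-sided tail of a standard normal variable.

```python
import math

def birge_ratio(values, sigmas):
    """Birge ratio R^2 computed from the general weighted-mean formula."""
    weights = [1.0 / s ** 2 for s in sigmas]
    x_w = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    return sum(w * (x - x_w) ** 2 for w, x in zip(weights, values)) / (len(values) - 1)

# Hypothetical pair of results; u1 and u2 are (wrongly) used as sigma_1 and sigma_2.
x1, u1, x2, u2 = 10.0, 0.1, 10.3, 0.12
ratio = abs(x1 - x2) / math.sqrt(u1 ** 2 + u2 ** 2)   # compatibility ratio, uncorrelated case
r2 = birge_ratio([x1, x2], [u1, u2])                  # general formula with n = 2

# Pr{chi-square(1) >= R^2} = Pr{|Z| >= sqrt(R^2)} = erfc(sqrt(R^2 / 2)).
p_c = math.erfc(math.sqrt(r2 / 2.0))
z_975 = 1.959963984540054   # 97.5th percentile of N(0, 1)

birge_consistent = p_c >= 0.05
within_z_limit = ratio <= z_975
compatible_kappa_2 = ratio <= 2.0
print(ratio, r2, p_c, birge_consistent, within_z_limit, compatible_kappa_2)
```

The first two decisions always coincide. The decision with a threshold of two can differ from them only when the ratio falls in the narrow interval between about 1.96 and 2.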