Combining Results From Multiple Evaluations of the Same Measurand

According to the Guide to the Expression of Uncertainty in Measurement (GUM), a result of measurement consists of a measured value together with its associated standard uncertainty. The measured value and the standard uncertainty are interpreted as the expected value and the standard deviation of a state-of-knowledge probability distribution attributed to the measurand. We discuss the term metrological compatibility introduced by the International Vocabulary of Metrology, third edition (VIM3) for lack of significant differences between two or more results of measurement for the same measurand. Sometimes a combined result of measurement from multiple evaluations of the same measurand is needed. We propose an approach for determining a combined result which is metrologically compatible with the contributing results.


Introduction
One function of calibration laboratories, measurement standards organizations, national metrology institutes (NMIs), and international organizations such as the International Bureau of Weights and Measures (BIPM), the International Organization for Standardization (ISO), the International Organization of Legal Metrology (OIML), and the International Electrotechnical Commission (IEC) is to ensure that the differences between measured values for the same measurand, determined in various places, at various times, and by various measurement procedures, are insignificant. Without this assurance, the world's commerce, trade, manufacturing, engineering, and scientific research would be chaotic.
The older view of uncertainty in measurement, based on statistical error analysis, is inappropriate for the rapidly advancing science and technology of measurement. Therefore the world's leading authorities in metrology developed a new concept of uncertainty in measurement. This concept is described in the Guide to the Expression of Uncertainty in Measurement (GUM) [1] and extended in the International Vocabulary of Metrology, third edition (VIM3) [2]. In accordance with the GUM and the VIM3, a result of measurement is generally expressed as a pair of values: a measured quantity value and its associated standard uncertainty. The measured value and the standard uncertainty together represent a range of values being attributed to the measurand [2, Sec. 2.9]. Suppose $[x_1, u(x_1)], \ldots, [x_n, u(x_n)]$ are $n$ different results of measurement for a common measurand believed to be sufficiently stable, where $x_1, \ldots, x_n$ are the measured values and $u(x_1), \ldots, u(x_n)$ are the corresponding standard uncertainties. In the GUM concept of uncertainty, a measured value $x_i$ and its associated standard uncertainty $u(x_i)$ are regarded, respectively, as the expected value and the standard deviation of an incompletely determined state-of-knowledge probability density function (pdf) attributed to the common measurand, for $i = 1, 2, \ldots, n$ [1].
Since the era of error analysis, metrologists have used the Birge chi-square test of statistical consistency to decide whether the differences between two or more measured values $x_1, \ldots, x_n$ are insignificant (Fig. 1). The Birge test is based on regarding the measured values $x_1, \ldots, x_n$ as realizations of random variables drawn from normal (Gaussian) sampling pdfs with unknown but equal expected values and known standard deviations [3]. When the measured values are correlated, they are regarded as realizations of a random vector drawn from a joint $n$-variate normal distribution with a known variance-covariance matrix, referred to as a normal consistency model. To assess statistical consistency of a set of measured values $x_1, \ldots, x_n$, a common practice is to pretend that the standard uncertainties $u(x_1), \ldots, u(x_n)$ are the known standard deviations of the presumed normal sampling pdfs of $x_1, \ldots, x_n$. It has previously been pointed out [4] that the Birge test and the concept of statistical consistency motivated by it do not apply to the results of measurement based on the GUM.
Recently, the VIM3 [2] introduced the idea of metrological compatibility, which can be used to assess the significance of the differences between two or more results of measurement for the same measurand (Fig. 2). As noted in [4], the concept of metrological compatibility fits with the GUM, and it can be used to assess the significance of differences between results based on the GUM for the same measurand. In Sec. 2, we discuss the VIM3 definition of metrological compatibility and its consequences in more detail than was done in [4]. When a set of results for the same measurand turns out to be incompatible, the seemingly anomalous results must be investigated. In Sec. 3, we discuss the importance of documenting information which may be needed in such investigations. Sometimes multiple evaluations of the same measurand need to be combined. A legitimate combined result must be metrologically compatible with the contributing results. In Sec. 4, we propose an approach for determining a combined result which is metrologically compatible with the contributing results, whether or not the results as available were compatible. In Sec. 5, we illustrate the proposed approach using published data from an interlaboratory evaluation of the same measurand. A brief summary is given in Sec. 6.

Fig. 1. Illustration of the classical approach to statistically evaluating and testing consistency of multiple measurements of the same measurand, presuming a randomly disturbed measurement process. Symbols: $Y$, (joint) measurand; $X_i$, indicated quantities; $\xi$, possible values of the quantities $X_i$; $q_1, \ldots, q_n$, vectors of the repeated observations $q_{ij}$, where $q_i = [q_{i1}, \ldots, q_{ik}]$.

The VIM3 Concept of Metrological Compatibility
Generally, the measurand (quantity intended to be measured) is a property of a material or of a phenomenon. In many scientific, industrial, and commercial measurements, the measurand is sufficiently stable between multiple evaluations. Our primary interest is in such applications. Suppose two or more measurement procedures are used to measure the same measurand. The measurement procedures may be (i) applications of the same method of measurement at different times or (ii) different implementations of a given method in different places or (iii) different methods.
A measured quantity value is a number together with a metrological reference (unit of measurement) expressing the magnitude of the quantity [2]. We assume that all results $[x_1, u(x_1)], \ldots, [x_n, u(x_n)]$ for a common measurand are traceable to the same metrological reference and hence are metrologically comparable. Following the GUM [1], we use the symbol $X_i$ for a variable with a state-of-knowledge pdf represented by the result $[x_i, u(x_i)]$, for $i = 1, 2, \ldots, n$. The measured value $x_i$ is regarded as the expected value $E(X_i)$ and the standard uncertainty $u(x_i)$ is regarded as the standard deviation $S(X_i)$ of the pdf of $X_i$. In the mainstream GUM, the pdf of $X_i$ is incompletely determined; the only things reliably known about it are the expected value $E(X_i) = x_i$ and the standard deviation $S(X_i) = u(x_i)$, for $i = 1, 2, \ldots, n$.

Metrological Compatibility of Two Particular Results
Metrological compatibility is defined for two results at a time. In the mainstream GUM, the difference $X_1 - X_2$ is a variable with an incompletely determined state-of-knowledge pdf for the difference between the values attributed by the two results $[x_1, u(x_1)]$ and $[x_2, u(x_2)]$ to the common measurand. The expected value and the standard deviation of the pdf of $X_1 - X_2$ are, respectively, $E(X_1 - X_2) = x_1 - x_2$ and $S(X_1 - X_2) = \sqrt{u^2(x_1) + u^2(x_2) - 2 r(x_1, x_2) u(x_1) u(x_2)}$, where $r(x_1, x_2)$ is the correlation coefficient between $X_1$ and $X_2$. Following the GUM, we use the symbol $u(x_1 - x_2)$ for the standard deviation $S(X_1 - X_2)$.
According to the VIM3 [2, Sec. 2.47], two metrologically comparable results $[x_1, u(x_1)]$ and $[x_2, u(x_2)]$ for a measurand, supposed to be stable, are metrologically compatible if $|x_1 - x_2| \le \kappa \, u(x_1 - x_2)$ for a chosen threshold $\kappa$. According to the VIM3 [2, Sec. 2.47, Note 1], if two measurements for a common measurand, thought to be constant, are not metrologically compatible, then there are two possibilities: (i) one or both of the measurements are incorrect (e.g., one or both of the measurement uncertainties are assessed as being too small) or (ii) the measurand changed between measurements.
We can use the VIM3 concept of metrological compatibility as a criterion to assess the significance of the differences between metrologically comparable results of measurement for the same measurand. In the mainstream GUM, the state-of-knowledge pdf represented by a result $[x_i, u(x_i)]$, for $i = 1, 2, \ldots, n$, is incompletely determined. Therefore, we need a quantitative measure for the difference between two fixed known results $[x_1, u(x_1)]$ and $[x_2, u(x_2)]$, each consisting of a measured value with standard uncertainty. Let us define a ζ-function, denoted by $\zeta(\Delta)$, as

$$\zeta(\Delta) = \frac{|E(\Delta)|}{S(\Delta)}, \tag{1}$$

where $\Delta$ is a difference such as $X_1 - X_2$, with expected value $E(\Delta)$ and standard deviation $S(\Delta)$. The value $\zeta(\Delta)$ is a measure for the significance of the difference $\Delta$. Even when a complete state-of-knowledge pdf of $\Delta$ is assumed, the metric (1) can be used to judge the significance of the difference. Based on this metric we can restate the VIM3 definition of metrological compatibility as follows [4].

Definition: Two metrologically comparable results $[x_1, u(x_1)]$ and $[x_2, u(x_2)]$ for the same measurand are said to be metrologically compatible if

$$\zeta(x_1 - x_2) = \frac{|x_1 - x_2|}{u(x_1 - x_2)} \le \kappa \tag{2}$$

for a chosen value of some threshold $\kappa$, where $u(x_1 - x_2) = \sqrt{u^2(x_1) + u^2(x_2) - 2 r(x_1, x_2) u(x_1) u(x_2)}$ and $r(x_1, x_2)$ is the correlation coefficient between the variables $X_1$ and $X_2$ with state-of-knowledge pdfs represented by the results $[x_1, u(x_1)]$ and $[x_2, u(x_2)]$.
In this definition, the value of $\kappa$ is a chosen threshold for declaring metrological compatibility (lack of significant difference) of two results. Values of $\zeta(x_1 - x_2)$ larger than $\kappa$ are regarded as significant. The results are compatible when the difference between the measured values $x_1$ and $x_2$ is insignificant in view of the standard uncertainties $u(x_1)$ and $u(x_2)$.
The VIM3 does not discuss how the threshold κ should be determined. A proper choice of the threshold κ is to a large extent a matter of agreement because it requires accepting the economic consequences of that choice. A conventional value of the threshold κ in metrology is two.
If a larger value for κ were agreed upon, then small differences would no longer be detectable. This would be a disadvantage in applications where detecting small differences is important. If a smaller value for κ were agreed upon, then many small differences would become significant even though they might be only a consequence of noisy measurements, and the economic consequences would be borne by the metrological community trying to provide compatible measurement systems.
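As an illustration, the compatibility criterion for two results can be written as a short Python sketch. The function names and the numerical values are hypothetical and are not taken from the paper. The default threshold of two follows the conventional value mentioned above, and the sketch assumes that the two results are not perfectly correlated with equal uncertainties, so that the uncertainty of the difference is nonzero.

```python
import math

def u_diff(u1, u2, r=0.0):
    """Standard uncertainty of the difference X1 - X2 for correlation coefficient r."""
    return math.sqrt(u1**2 + u2**2 - 2.0 * r * u1 * u2)

def zeta(x1, u1, x2, u2, r=0.0):
    """Significance measure |x1 - x2| / u(x1 - x2); assumes u(x1 - x2) > 0."""
    return abs(x1 - x2) / u_diff(u1, u2, r)

def compatible(x1, u1, x2, u2, r=0.0, kappa=2.0):
    """True if the two results are metrologically compatible for threshold kappa."""
    return zeta(x1, u1, x2, u2, r) <= kappa

# Hypothetical results: u(x1 - x2) = sqrt(0.09 + 0.16) = 0.5
print(zeta(10.0, 0.3, 10.5, 0.4))        # about 1.0, below kappa = 2
print(compatible(10.0, 0.3, 11.5, 0.4))  # zeta is about 3.0, so False
```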

Metrological Compatibility of a Set of Results
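According to the VIM3 [2, Sec. 2.47], a set of comparable results $[x_1, u(x_1)], \ldots, [x_n, u(x_n)]$ is metrologically compatible if each pair of results $[x_i, u(x_i)]$ and $[x_j, u(x_j)]$, for $i, j = 1, 2, \ldots, n$ and $i < j$, is metrologically compatible. We can use expression (2) in this case by replacing $x_1$ with $x_i$ and $x_2$ with $x_j$.
If for all pairs of results the values of $\zeta(x_i - x_j)$ are smaller than or equal to a chosen threshold $\kappa$, then the set of results $[x_1, u(x_1)], \ldots, [x_n, u(x_n)]$ is metrologically compatible. We can then say that the differences between the measured values $x_1, \ldots, x_n$ are insignificant in view of the uncertainties $u(x_1), \ldots, u(x_n)$.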
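The pairwise check for a set of results can be sketched in Python as follows. The function names, the optional correlation matrix argument `corr`, and the data are hypothetical. The sketch returns the index pairs that fail the criterion, which are the pairs that would call for investigation.

```python
import math
from itertools import combinations

def zeta(xi, ui, xj, uj, r=0.0):
    """|xi - xj| / u(xi - xj) for correlation coefficient r."""
    return abs(xi - xj) / math.sqrt(ui**2 + uj**2 - 2.0 * r * ui * uj)

def incompatible_pairs(values, uncerts, kappa=2.0, corr=None):
    """Return all index pairs (i, j), i < j, whose zeta exceeds kappa.

    corr is an optional matrix of correlation coefficients; inputs are
    treated as uncorrelated when it is omitted (an assumption of this sketch).
    """
    bad = []
    for i, j in combinations(range(len(values)), 2):
        r = 0.0 if corr is None else corr[i][j]
        if zeta(values[i], uncerts[i], values[j], uncerts[j], r) > kappa:
            bad.append((i, j))
    return bad

# Hypothetical data: the third result differs significantly from the other two
values = [10.0, 10.2, 11.5]
uncerts = [0.3, 0.4, 0.3]
print(incompatible_pairs(values, uncerts))  # [(0, 2), (1, 2)]
```

An empty list means that the set is metrologically compatible for the chosen threshold.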
Note 1: A conventional idea is that if the number $n$ of the measured values $x_1, \ldots, x_n$ is large, it is natural to expect one or more of them to be significantly different from the rest. This idea comes from the theory of sampling from probability distributions having long tails which extend, for example, beyond two standard deviations. If the measurement procedures are properly carried out and the results of measurement are properly evaluated according to the GUM, taking into account all important influence quantities, then a set of results for the same measurand should be metrologically compatible. When some results of measurement seem anomalous, they require explanation rather than acceptance. Often, anomalous results are a consequence of missing important influence quantities.

Metrological Compatibility With a Reference Result
Suppose that in addition to the $n$ measurement procedures, which yield the comparable results $[x_1, u(x_1)], \ldots, [x_n, u(x_n)]$, where $n \ge 2$, the same measurand is measured by a higher-echelon measurement procedure (or laboratory) yielding the reference result $[x_R, u(x_R)]$, where $x_R$ is the reference value with standard uncertainty $u(x_R)$. Alternatively, the common measurand may be a certified reference material of reference value $x_R$ with standard uncertainty $u(x_R)$, which are not revealed before all $n$ results of measurement are reported. We will use the symbol $X_R$ for a variable with a state-of-knowledge pdf represented by the result $[x_R, u(x_R)]$. In general, the uncertainty $u(x_R)$ associated with the reference value $x_R$ is smaller than the uncertainties $u(x_1), \ldots, u(x_n)$ associated with the measured values $x_1, \ldots, x_n$.
If for all differences between the measured values $x_i$ and the reference value $x_R$ the values $\zeta(x_i - x_R)$ are smaller than or equal to a chosen threshold $\kappa$, then the set of results $[x_1, u(x_1)], \ldots, [x_n, u(x_n)]$ is metrologically compatible with the reference value $x_R$. We can then say that the differences between the measured values $x_1, \ldots, x_n$ and the reference value $x_R$ are insignificant in view of the uncertainties $u(x_1), \ldots, u(x_n)$ and $u(x_R)$.
One should not confuse the difference $\zeta(x_i - x_R)$ with $E_n$-values, which do not seem to be uniquely defined.

Metrological Compatibility With a Combined Result
Suppose the results $[x_1, u(x_1)], \ldots, [x_n, u(x_n)]$ are combined into a single result $[x_C, u(x_C)]$, where $x_C$ is the combined value and $u(x_C)$ is the standard uncertainty associated with $x_C$. We will use the symbol $X_C$ for a variable with a state-of-knowledge pdf represented by $[x_C, u(x_C)]$. In accordance with the GUM, the combined variable $X_C$ for a value of the measurand should be defined as a measurement function of the input variables $X_1, \ldots, X_n$. Often, $X_C$ is set as a convex linear combination of $X_1, \ldots, X_n$ with non-negative weights $a_1, \ldots, a_n$ which sum to one. Thus a measurement function for $X_C$ is often of the form

$$X_C = \sum_i a_i X_i, \tag{4}$$

where $a_i \ge 0$, for $i = 1, 2, \ldots, n$, and $\sum_i a_i = 1$. Since (4) is a linear function of the $X_i$, the expected value $E(X_C)$ of $X_C$ is the combined value $x_C$, where

$$x_C = \sum_i a_i x_i, \tag{5}$$

and the standard deviation $S(X_C)$ of $X_C$ is the standard uncertainty $u(x_C)$, where

$$u^2(x_C) = \sum_i a_i^2 u^2(x_i) + 2 \sum_{i<j} a_i a_j \, r(x_i, x_j) \, u(x_i) u(x_j). \tag{6}$$

If $a_i = 1/n$ for $i = 1, 2, \ldots, n$, then $X_C$ reduces to the arithmetic average $X_A = (1/n) \sum_i X_i$. The expected value $E(X_A)$ is $x_A = (1/n) \sum_i x_i$ and the standard deviation $S(X_A)$, denoted by $u(x_A)$, can be determined from (6). If the pdfs for $X_1, \ldots, X_n$ are uncorrelated, then

$$u^2(x_A) = \frac{1}{n^2} \sum_i u^2(x_i). \tag{7}$$

If $a_i = w_i / \sum_i w_i$, where $w_i = 1/u^2(x_i)$, then $X_C$ reduces to the weighted mean $X_W = \sum_i w_i X_i / \sum_i w_i$ with weights inversely proportional to the variances $u^2(x_1), \ldots, u^2(x_n)$. The expected value $E(X_W)$ is $x_W = \sum_i w_i x_i / \sum_i w_i$ and the standard deviation $S(X_W)$, denoted by $u(x_W)$, can be determined from (6). If the pdfs for $X_1, \ldots, X_n$ are uncorrelated, then

$$u^2(x_W) = \frac{1}{\sum_i w_i} = \frac{1}{\sum_i 1/u^2(x_i)}. \tag{8}$$

If for all differences between the measured values $x_i$ and the combined value $x_C$ the values $\zeta(x_i - x_C)$ are smaller than or equal to a chosen threshold $\kappa$, then the set of results $[x_1, u(x_1)], [x_2, u(x_2)], \ldots, [x_n, u(x_n)]$ is metrologically compatible with the combined value $x_C$. Then we can say that the differences between the measured values $x_1, \ldots, x_n$ and the combined value $x_C$ are insignificant in view of the uncertainties $u(x_1), \ldots, u(x_n)$.
In evaluating $u(x_i - x_C)$ the correlation coefficient between $X_i$ and $X_C$ must be included because the pdfs of $X_i$ and $X_C$ are always correlated, for $i = 1, 2, \ldots, n$. For example, if the pdfs for $X_1, \ldots, X_n$ are uncorrelated, then the covariance between $X_i$ and $X_C$ is $a_i u^2(x_i)$ and the variance of $X_i - X_C$ is

$$u^2(x_i - x_C) = u^2(x_i) + u^2(x_C) - 2 a_i u^2(x_i), \tag{9}$$

where $u^2(x_C) = \sum_j a_j^2 u^2(x_j)$. If $a_i = 1/n$, for $i = 1, 2, \ldots, n$, then $x_C$ reduces to the arithmetic average $x_A = (1/n) \sum_i x_i$ and the uncertainty $u(x_i - x_C)$ given in (9) reduces to

$$u(x_i - x_A) = \sqrt{\left(1 - \frac{2}{n}\right) u^2(x_i) + u^2(x_A)}. \tag{10}$$

If $a_i = w_i / \sum_i w_i$, where $w_i = 1/u^2(x_i)$, for $i = 1, 2, \ldots, n$, then $x_C$ reduces to the weighted mean $x_W = \sum_i w_i x_i / \sum_i w_i$ and, because $a_i u^2(x_i) = u^2(x_W)$, the uncertainty $u(x_i - x_C)$ given in (9) reduces to

$$u(x_i - x_W) = \sqrt{u^2(x_i) - u^2(x_W)}. \tag{11}$$

If the uncertainties $u(x_1), u(x_2), \ldots, u(x_n)$ are all equal to $u(x)$, say, then $x_W$ reduces to $x_A$ and $u^2(x_W)$ reduces to $u^2(x_A) = u^2(x)/n$. Then both (10) and (11) reduce to

$$u(x_i - x_A) = u(x) \sqrt{\frac{n-1}{n}}. \tag{12}$$

Note 2: Sometimes, the standard uncertainties $u(x_1), u(x_2), \ldots, u(x_n)$ are not all reliably determined. Also, the standard uncertainties are frequently an inappropriate basis for assigning the weights $a_1, a_2, \ldots, a_n$ to the measured values $x_1, x_2, \ldots, x_n$ to determine a combined result. Therefore the weighted mean $x_W$ may be inappropriate for combining the values. Thus, in our view, the arithmetic mean $x_A$ should be regarded as a default combined value.
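The following Python sketch illustrates a convex linear combination of uncorrelated results and the significance measure of each result against the combined value. It assumes uncorrelated inputs, so that the variance of $X_i - X_C$ is $u^2(x_i) + u^2(x_C) - 2 a_i u^2(x_i)$, and at least two results with positive weights. The function names and data are hypothetical.

```python
import math

def combine(values, uncerts, weights):
    """Combined value and standard uncertainty for uncorrelated inputs."""
    x_c = sum(a * x for a, x in zip(weights, values))
    u_c = math.sqrt(sum((a * u) ** 2 for a, u in zip(weights, uncerts)))
    return x_c, u_c

def zeta_to_combined(values, uncerts, weights):
    """zeta(x_i - x_C) for each result, including the covariance a_i * u_i**2."""
    x_c, u_c = combine(values, uncerts, weights)
    out = []
    for x, u, a in zip(values, uncerts, weights):
        var = u**2 + u_c**2 - 2.0 * a * u**2
        out.append(abs(x - x_c) / math.sqrt(var))
    return out

def arithmetic_weights(n):
    """Equal weights 1/n (arithmetic average)."""
    return [1.0 / n] * n

def inverse_variance_weights(uncerts):
    """Weights proportional to 1/u_i**2, normalized to sum to one (weighted mean)."""
    w = [1.0 / u**2 for u in uncerts]
    total = sum(w)
    return [wi / total for wi in w]

# Hypothetical data
values, uncerts = [10.0, 10.4], [0.3, 0.4]
print(combine(values, uncerts, arithmetic_weights(2)))
print(zeta_to_combined(values, uncerts, arithmetic_weights(2)))
print(zeta_to_combined(values, uncerts, inverse_variance_weights(uncerts)))
```

Omitting the covariance term would overstate $u(x_i - x_C)$ and could make incompatible results appear compatible.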

Information Needed to Determine Sources of Incompatibility
A purpose of assessing metrological compatibility is to demonstrate lack of significant difference between the results of measurement for a common measurand. If a set of results turns out to be metrologically incompatible then the measurement procedures and calculations underlying the seemingly anomalous results should be investigated. Every result of measurement should have supporting documents which include the measurement function (measurement equation) and complete uncertainty budget. If the influence quantities, uncertainty components, and correlation coefficients identified in the uncertainty budget are reasonable then in search of the possible sources of incompatibility one must look into potential influence quantities not included in the uncertainty budget.
Investigations to determine the sources of incompatibility are generally done in retrospect, long after the measurements have been completed. Therefore investigators need documentation of the particular application of the measurement procedure. In the absence of such documentation it may be difficult to determine possible sources of incompatibility.
Note 3: We hope that in the not too distant future, metrologists and information technology experts will collaborate to develop tools which make it easier for metrologists to document in real time the actual measurement procedure while the measurements are being done. Such documentation should be helpful in identifying all potentially important influence quantities.

Determination of a Combined Value and Its Associated Uncertainty
Even when the common measurand is sufficiently stable, the results $[x_1, u(x_1)], \ldots, [x_n, u(x_n)]$ can exhibit large variation. Metrological incompatibility occurs when some or all results (measured values or standard uncertainties) are improperly determined. Frequently, improper results are a consequence of missing important influence quantities. For example, in many chemical measurements, the measurand is the amount of one component in a sample of multi-component material. The other components can interfere with the measurements. Frequently, it is impossible to know all potential interferences. Therefore, it is difficult to be sure that all significant influence quantities have been accounted for in determining the measured values and uncertainties.
For a combined result $[x_C, u(x_C)]$ to be legitimate it should be metrologically compatible with the contributing results of measurement $[x_1, u(x_1)], \ldots, [x_n, u(x_n)]$. Therefore we propose the following principle.
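Principle for combining multiple results for the same measurand: Determine the combined result $[x_C, u(x_C)]$ from the expressions (5) and (6) if the contributing results are metrologically compatible with it; otherwise, apply to the contributing results corrections with zero expected values and a common additional variance, so that the corrected results are compatible with the combined result.

This approach was first proposed in [5] and has recently been used in [6]. Thus we define variables $Y_1, \ldots, Y_n$ with corrected state-of-knowledge pdfs for the common measurand as follows:

$$Y_i = X_i + \delta X_i, \quad i = 1, 2, \ldots, n, \tag{13}$$

where $\delta X_1, \ldots, \delta X_n$ are correction variables. Then a measurement function for the combined variable $Y_C$ is

$$Y_C = \sum_i a_i Y_i = \sum_i a_i (X_i + \delta X_i), \tag{14}$$

where $a_i \ge 0$, for $i = 1, 2, \ldots, n$, and $\sum_i a_i = 1$, and the pdfs for the correction variables $\delta X_1, \ldots, \delta X_n$ are mutually independent and independent of the pdfs for $X_1, \ldots, X_n$. The pdfs assigned to the correction variables express the limits of knowledge. Thus, we assign zero expected values and the same variance $u^2(\delta)$ to each of the correction variables: the expected value $E(\delta X_i)$ is zero and the variance $V(\delta X_i)$ is $u^2(\delta)$, for $i = 1, 2, \ldots, n$. It follows from (13) that the expected value $y_i$ and the variance $u^2(y_i)$ of the pdf for $Y_i$ are

$$y_i = x_i \quad \text{and} \quad u^2(y_i) = u^2(x_i) + u^2(\delta).$$

Likewise, it follows from (14) that $y_C = \sum_i a_i y_i = x_C$ and $u^2(y_C) = u^2(x_C) + u^2(\delta) \sum_j a_j^2$. Because the correction variables are independent of $X_1, \ldots, X_n$, the covariance between $Y_i$ and $Y_C$ exceeds that between $X_i$ and $X_C$ by $a_i u^2(\delta)$, so that

$$u^2(y_i - y_C) = u^2(x_i - x_C) + \left(1 - 2 a_i + \sum_j a_j^2\right) u^2(\delta).$$

If $u^2(\delta)$ is set as

$$u^2(\delta) = \max\left\{0, \left\{\frac{(x_i - x_C)^2/\kappa^2 - u^2(x_i - x_C)}{1 - 2 a_i + \sum_j a_j^2}\right\}_{i = 1, \ldots, n}\right\}, \tag{24}$$

then each of the corrected measured values $y_1, \ldots, y_n$ would be metrologically compatible with the combined measured value $y_C$. If the measured values $x_1, \ldots, x_n$ are compatible with the combined measured value $x_C$, then none of the $n$ quantities in the braces of (24) is positive and $u^2(\delta) = 0$. In that case the measurement function (14) reduces to (4) and the uncertainty associated with the combined measured value $x_C$ is given by (6).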

Arithmetic Average
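If $a_i = 1/n$, for $i = 1, 2, \ldots, n$, then $x_C$ reduces to the arithmetic average $x_A$ and, from (24),

$$u^2(\delta) = \max\left\{0, \left\{\frac{n}{n-1}\left[\frac{(x_i - x_A)^2}{\kappa^2} - u^2(x_i - x_A)\right]\right\}_{i = 1, \ldots, n}\right\}. \tag{25}$$

The combined value $y_C$ reduces to $y_A = (1/n) \sum_i y_i = (1/n) \sum_i x_i = x_A$. To assure that the measured values $y_1, \ldots, y_n$ are compatible with $y_A$ one can check that

$$\zeta(y_i - y_A) = \frac{|y_i - y_A|}{u(y_i - y_A)} \le \kappa \tag{26}$$

for all $i = 1, 2, \ldots, n$, where, as shown in the appendix,

$$u^2(y_i - y_A) = u^2(x_i - x_A) + \frac{n-1}{n} u^2(\delta). \tag{27}$$

Expressions for $u^2(x_i - x_A)$ and $u^2(\delta)$ are given in (10) and (25), respectively. The uncertainty associated with $y_A$ is, from (7),

$$u^2(y_A) = \frac{1}{n^2} \sum_i \left[u^2(x_i) + u^2(\delta)\right] = u^2(x_A) + \frac{u^2(\delta)}{n}. \tag{28}$$

If $u^2(\delta) = 0$, then (28) reduces to (7).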
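The adjustment for the arithmetic average can be sketched in Python as follows. The sketch assumes uncorrelated inputs and derives the common added variance directly from the requirement that every $\zeta(y_i - y_A)$ be at most $\kappa$, using $u^2(y_i - y_A) = (1 - 2/n) u^2(x_i) + u^2(x_A) + [(n-1)/n] u^2(\delta)$. The function name and the data are hypothetical.

```python
import math

def adjust_for_arithmetic_mean(values, uncerts, kappa=2.0):
    """Smallest common added variance making every result compatible with the mean.

    Assumes uncorrelated inputs and n >= 2. Returns the mean, its adjusted
    standard uncertainty, the added variance, and the adjusted zeta values.
    """
    n = len(values)
    x_a = sum(values) / n
    var_a = sum(u**2 for u in uncerts) / n**2

    # Variance of X_i - X_A for uncorrelated inputs
    var_diff = [(1.0 - 2.0 / n) * u**2 + var_a for u in uncerts]

    # Smallest non-negative added variance satisfying zeta <= kappa for all i
    candidates = [0.0]
    for x, vd in zip(values, var_diff):
        candidates.append(n / (n - 1.0) * ((x - x_a) ** 2 / kappa**2 - vd))
    var_delta = max(candidates)

    u_a = math.sqrt(var_a + var_delta / n)
    zetas = [abs(x - x_a) / math.sqrt(vd + (n - 1.0) / n * var_delta)
             for x, vd in zip(values, var_diff)]
    return x_a, u_a, var_delta, zetas

# Hypothetical data: the third value is far from the other two
x_a, u_a, var_delta, zetas = adjust_for_arithmetic_mean(
    [10.0, 10.2, 11.5], [0.3, 0.4, 0.3])
print(x_a, u_a, var_delta, max(zetas))
```

For these hypothetical data the added variance is positive and the largest adjusted $\zeta$ equals the threshold, which is what choosing the smallest sufficient variance implies.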

Weighted Mean
Since the variance associated with $y_i$ is $u^2(y_i) = u^2(x_i) + u^2(\delta)$, a weighted mean with weights inversely proportional to the variances of the results $y_1, \ldots, y_n$ is $y_W = \sum_i w_i y_i / \sum_i w_i$, where $y_i = x_i$ and $w_i = 1/u^2(y_i) = 1/[u^2(x_i) + u^2(\delta)]$, for $i = 1, 2, \ldots, n$. The measured values $y_1, \ldots, y_n$ are compatible with $y_W$ if

$$\zeta(y_i - y_W) = \frac{|y_i - y_W|}{u(y_i - y_W)} \le \kappa \tag{29}$$

for all $i = 1, 2, \ldots, n$. Analogous to (11),

$$u^2(y_i - y_W) = u^2(y_i) - u^2(y_W),$$

where $u^2(y_W) = 1/\sum_i w_i$. The variance $u^2(\delta)$ is the smallest value which would make the measured values $y_1, \ldots, y_n$ compatible with $y_W$. Such a value for $u^2(\delta)$ can be determined iteratively, using the value of $u^2(\delta)$ from (25) as a starting value.

(In the GUM, the same symbol $Y$ is also used for a quantity with a state-of-knowledge pdf for the common measurand.) If the measurand is defined in extensive detail, a true value $Y_{true}$ may be essentially unique. If the measurand is defined in less detail, then a range of values may be commensurate with its definition and any one of them qualifies as a true value $Y_{true}$ of the measurand. The concept of metrological compatibility relates to the observed differences between the measured values $x_1, \ldots, x_n$ rather than to the unobservable differences between the measured values and a true value $Y_{true}$ of the measurand. Therefore, regardless of whether the measured values $x_1, \ldots, x_n$ are compatible or incompatible with the combined value $x_C$, the measured values alone provide no information about the difference between $x_C$ and $Y_{true}$. In particular, metrological compatibility does not imply that the difference between $x_C$ and $Y_{true}$ is not significant. However, there is no factual knowledge about a potential significant difference between $x_C$ and $Y_{true}$. Therefore, a correction applied to $x_C$ for a potential significant difference between $x_C$ and $Y_{true}$, and an enlargement of the uncertainty $u(x_C)$ determined from (6) as discussed in [7], would be arbitrary.
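The iterative determination of $u^2(\delta)$ for the weighted mean, described earlier in this section, can be sketched in Python. The paper suggests iterating from the arithmetic-average value as a starting point; this sketch instead uses bisection, which is a substitute technique and assumes that once compatibility is reached it is retained for all larger values of $u^2(\delta)$. Uncorrelated inputs and at least two results are also assumed, and all names and data are hypothetical.

```python
import math

def weighted_state(values, uncerts, var_delta):
    """Weighted mean, its variance, and zeta values for a given added variance."""
    var_y = [u**2 + var_delta for u in uncerts]
    weights = [1.0 / v for v in var_y]
    total = sum(weights)
    y_w = sum(w * x for w, x in zip(weights, values)) / total
    var_w = 1.0 / total
    # For uncorrelated inputs, the variance of Y_i - Y_W is u^2(y_i) - u^2(y_W)
    zetas = [abs(x - y_w) / math.sqrt(v - var_w) for x, v in zip(values, var_y)]
    return y_w, var_w, zetas

def adjust_for_weighted_mean(values, uncerts, kappa=2.0, tol=1e-12):
    """Approximate the smallest added variance giving compatibility with y_W."""
    def ok(d):
        return max(weighted_state(values, uncerts, d)[2]) <= kappa

    if ok(0.0):
        return 0.0
    lo, hi = 0.0, max(u**2 for u in uncerts)
    while not ok(hi):          # expand until a compatible value is bracketed
        hi *= 2.0
    for _ in range(200):       # bisection; hi always satisfies the criterion
        mid = 0.5 * (lo + hi)
        if ok(mid):
            hi = mid
        else:
            lo = mid
        if hi - lo <= tol:
            break
    return hi

# Hypothetical data
values, uncerts = [10.0, 10.2, 11.5], [0.3, 0.4, 0.3]
d = adjust_for_weighted_mean(values, uncerts)
y_w, var_w, zetas = weighted_state(values, uncerts, d)
print(d, y_w, math.sqrt(var_w), max(zetas))
```

Because the weights depend on $u^2(\delta)$, the weighted mean itself shifts as the added variance grows, which is why a closed-form expression like the one for the arithmetic average is not used here.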

Combined Result From an Interlaboratory Evaluation
Columns 2 and 3 of Table 1 reproduce from [8, Table 3] the measured values, $c_{Lab}$, and the corresponding standard uncertainties, $u(c_{Lab})$, for the amount content of lead (Pb) in natural river water as determined by the eight laboratories² identified in column 1 of Table 1. We will use these data to illustrate calculation of a combined result. Suppose the arithmetic average $c_{Avg}$ = 62.79 nmol/kg is used as the combined measured value. The associated standard uncertainty based on the expression (7) is $u(c_{Avg})$ = 0.26 nmol/kg. Figures 3 and 4 show the measured values (given in column 2 of Table 1) and the arithmetic average $c_{Avg}$ = 62.79 nmol/kg along with the corresponding expanded uncertainty intervals (for coverage factor $k$ = 2). In Fig. 3, the expanded uncertainty intervals are based on the standard uncertainties as reported in [8] and reproduced in column 3 of Table 1; in particular, the standard uncertainty associated with $c_{Avg}$ is $u(c_{Avg})$ = 0.26 nmol/kg. In Fig. 4, the expanded uncertainty intervals are based on the adjusted (enlarged) standard uncertainties displayed in column 5 of Table 1; in particular, the standard uncertainty associated with $c_{Avg}$ is $u(c_{Avg})$ = 0.46 nmol/kg.

² Reference [8] is the final report of the CIPM international key comparison CCQM-K2. In this paper, we have used data from [8] to illustrate calculation of a combined result. We do not address data analysis of a key comparison to determine the key comparison reference value (KCRV) and the degrees of equivalence (DOE).

Table 1. The measured values $c_{Lab}$ for the amount content of Pb in natural river water and their associated standard uncertainties $u(c_{Lab})$, in nmol/kg, as reported in [8]. Also shown are the differences $\zeta(c_{Lab} - c_{Avg})$ based on the reported uncertainties and on the adjusted (enlarged) uncertainties.

In both Figs. 3 and 4, the expanded uncertainty intervals (for coverage factor $k$ = 2) for the measured values overlap with the expanded uncertainty interval for the arithmetic average $c_{Avg}$. However, not all of the eight results in Fig. 3 are metrologically compatible with the combined result $[c_{Avg}, u(c_{Avg})]$. This shows that there is no direct correspondence between the overlap of the expanded uncertainty intervals (for coverage factor $k$ = 2) and the VIM3 concept of metrological compatibility.
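The lack of correspondence between interval overlap and metrological compatibility can be illustrated with a small Python sketch. The numbers are hypothetical and are not the data of [8]. The covariance between a result and the combined value is passed in explicitly; for an uncorrelated convex combination it would be $a_i u^2(x_i)$.

```python
import math

def intervals_overlap(x_i, u_i, x_c, u_c, k=2.0):
    """True if the expanded uncertainty intervals x +/- k*u overlap."""
    return abs(x_i - x_c) <= k * (u_i + u_c)

def compatible_with_combined(x_i, u_i, x_c, u_c, cov, kappa=2.0):
    """True if |x_i - x_c| <= kappa * u(x_i - x_c), with covariance cov."""
    u_diff = math.sqrt(u_i**2 + u_c**2 - 2.0 * cov)
    return abs(x_i - x_c) <= kappa * u_diff

# Hypothetical result and combined value; cov = a_i * u_i**2 with a_i = 1/3
x_i, u_i = 10.9, 0.3
x_c, u_c = 10.0, 0.2
cov = 0.03

print(intervals_overlap(x_i, u_i, x_c, u_c))              # True
print(compatible_with_combined(x_i, u_i, x_c, u_c, cov))  # False
```

The sum $u(x_i) + u(x_C)$ is never smaller than $u(x_i - x_C)$ when the covariance is non-negative, so overlapping intervals are a weaker condition than compatibility.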