Compatibility verification of certified reference materials and user measurements

A problem that frequently occurs in metrology is one of assessing compatibility of data obtained by a user laboratory with the specified values and uncertainty estimates from the certificate of analysis. The user's data are summarized by an estimated measurand value and a confidence interval, which is typically based on a repeatability standard deviation, but may include other variance or bias components. If the lab's interval and the certificate interval do not overlap, or more generally when the ‘no-bias’ hypothesis is rejected, the user may seek guidance on how to confirm this lack of compatibility or how to rectify it. The suggested two-stage statistical approach demonstrates a confidence interval whose width is similar to that of the certificate, and a compatibility test of guaranteed power for the given bias magnitude. Practical computationally simple formulae for each stage sample size are provided.


Introduction: CRM incompatibility problem
Certified reference materials (CRMs) are well-characterized materials which are certified for one or more physical, chemical or biological properties, and are important to ensure the accuracy and compatibility of measurements. They are produced and sold in large and continually growing quantities throughout the industrialized world. 'CRMs are used for calibration, quality control and method validation purposes, as well as for the assignment of values to other materials . . . and to maintain or establish traceability to conventional scales' [12]. Metrological traceability of a measurement result is often achieved by using calibrations whose quantity values are themselves traceable. However, the calibration CRM should not be subsequently used for trueness control [4].
The National Institute of Standards and Technology (NIST) produces at least 1285 individual standard reference materials (SRM is the NIST trade name for its CRMs), covering products in the major categories of chemical composition, physical properties and engineering applications and selling approximately 33 000 units per year. To certify its SRMs, NIST as well as other National Metrology Institutes reports summary statistics with associated uncertainties leading to a coverage interval. The certified value represents a resource-intensive estimate of the 'true' value of the measurand, with the interval designed to bracket this value and to indicate its uncertainty. Often users are inclined to treat the certified value as a calibration point, while the certified uncertainty is ignored. Analysts who do try to use certified uncertainty sometimes regard the interval as a target band into which the user's values must fall in order to be compliant.
The frequently cited NIST Handbook for SRM Users [24] does not give sufficient guidance for compatibility testing or bias removal as a corrective action. Although there is a large, rapidly expanding literature on the subject of CRM certification [14], complaints about lack of clear guidance on the use of CRMs continue, e.g. [18]. Indeed the problem of formally judging the degree of compatibility (also called 'conformance', 'trueness assessment' or 'bias determination') between a CRM certified value with associated uncertainty and a user's best estimate with its uncertainty does not seem to be fully solved.
One of the most frequently fielded queries by NIST's Measurement Services Division is related to the situation when the user's interval does not intersect the CRM interval. The usual interpretation of nonoverlapping intervals is that the measurements are not CRM compatible (i.e. the hypothesis discussed in section 3 is convincingly rejected). Non-overlap, indeed, can be taken as one of the most serious indications of incompatibility, indicating the presence of possibly substantial bias.
Disjoint intervals will result in rejection of the compatibility hypothesis for virtually all existing statistical tests. More importantly, overlapping intervals do not always indicate the absence of bias.
According to the GUM [11], any significant bias should be corrected. There is an increasing number of publications on the problem of bias removal and formal assessment of the degree of compatibility between a CRM certified value and user's best estimates, e.g. [22,27,28,33]. To correct for bias, independent estimates of the bias correction uncertainty are required. Such estimates are hard or impossible to obtain in the context of just one CRM comparison where a mere replacement of the lab's mean by the certificate value should raise apprehension. This issue is touched upon in section 5.
The recent review [5] discusses more than 30 publications on the subject of measurement uncertainty and compatibility assessment over the last 15 years, among them the guidelines of ISO 10576-1 [13] and EURACHEM/CITAC [6]. See also [17]. These guides do not explicitly formulate the 'no bias' hypothesis in statistical terms. It is specified in section 3 without imposing restrictions on a lab's repeatability.
The main novel feature of the ISO 10576-1 standard [13] is to recommend a two-stage approach to trueness assessment. If one fails to accept compatibility at stage 1, repeat the measurement and check at stage 2 by pooling measurement results from the two stages. Acceptance/rejection then depends on acceptance/rejection of the combined sample average value and its standard error. The central part of this work is the CRM experiment planning in section 3 where we follow a similar approach. Explicit formulae are given for the sample size needed in principle for each of two objectives described there: one to attain a confidence interval of the width proportional to that of CRM's interval, another to get a test which rejects compatibility with a high probability when the true bias is large.
The second stage sample sizes and the testing methodology are discussed in section 4. This procedure does not necessarily recommend to perform a compatibility test at the first stage but accomplishes one of the goals above for the combined data using the standard deviation of the first sample.
We apply the suggested methodology to two examples in section 6. However, before the methods can be explored, there are minimal requirements the user's lab must meet to show its readiness to compare their results with certified values. These issues are considered in the next section.

Laboratory preparation
For the purposes of applying the suggested procedure, the laboratory should have beforehand or develop an understanding of both statistical characteristics of the information contained in the certificate of an appropriate CRM and of its own measurement performance with that CRM (the estimated measurand and the uncertainty). As we will see in the next section, it is imperative that the user has a fairly good idea about the relative uncertainty of the lab's measurements with regard to that given in the CRM certificate. This can be achieved only if the measurement procedure is under statistical control.
Irrespective of calibration, traceability and quality control issues, the CRM's role is to confirm the trueness of a user's measurement results. Before using a CRM for such purpose, a lab should decide if a standard test method will be implemented. If so, the method likely has previous precision statements obtained by labs participating in its assessment. These can be compared with the CRM expanded uncertainty. However, even limited in-house validation is desirable. If there is no published method, the lab's results must be contrasted with similar off-the-shelf methods or with the work carried out for other relevant techniques. Failing that, the lab can derive some guiding characteristics from customers' specifications such as minimum allowed quantities, relative length of a specific interval, number of significant digits required in reported results, etc.
Once good repeatability is ensured, measurement precision over longer time periods and, if appropriate, among multiple analysts or different instruments should be evaluated using real samples having typical or representative analyte levels [9]. Validation studies must also establish reliable estimates of the limits of detection and quantification. Attempts to evaluate measurement trueness before fully characterizing its precision are unlikely to yield reliable statistical conclusions.
When the lab decides that it is ready to use CRMs, the next task is to choose an appropriate CRM and to employ it correctly. This topic is outside the scope of this paper. It suffices to say that the chosen CRM should have the uncertainty of certified concentrations small relative to the uncertainty for intended use. It must be reasonably matched with the customarily analysed samples and analyte concentrations. These issues are discussed, for example, in [23,24,32].
We focus now on statistical aspects of a CRM experiment, in particular, how to choose the number of replicates needed to detect a bias of the given magnitude, when testing the hypothesis of no bias.

Sample size determination: noncentral t -distribution
Commonly the specifications indicated in a CRM certificate provide the estimated measurand $\mu_{\mathrm{crm}}$, i.e. the certified value, and the expanded uncertainty $U_{\mathrm{crm}} = U$. Thus, $\mu_{\mathrm{crm}} \pm U_{\mathrm{crm}}$ is the uncertainty interval for the measurand $\mu$. An expansion factor of 2 is traditionally used in metrology, so that $U_{\mathrm{crm}} = 2u$, where $u$ denotes the standard uncertainty (which commonly includes uncertainties resulting from systematic effects). For the large degrees of freedom assumed here for the CRM interval, the expanded uncertainty of a $(1-\alpha)100\%$ coverage interval is $z_{\alpha/2}u$, where $z_{\beta}$ denotes the $(1-\beta)$-quantile of the standard normal distribution. When $1-\alpha = 0.95$, $z_{\alpha/2} = 1.96 \approx 2$. If the degrees of freedom are small, the factor $z_{\alpha/2}$ should be replaced by a critical value of a t-distribution, which can lead to much wider intervals.
The user's replicated measurements, say, $x_1, \ldots, x_n$, are summarized by the value $\bar{x}$, which is the best estimate (typically the sample mean) of the measurand quantity, and $s$, which estimates the repeatability standard deviation. The number $n$ of measurements is the lab's sample size.
We suppose that $s$ does not depend on $\bar{x}$, although the relationship between the sample mean and $s$ can be quite complicated. Sometimes this independence can be approximately achieved by a suitable transformation of the $x$'s. In particular, when $s$ is proportional to $\bar{x}$ (as happens for some chemical measurands), the following results typically hold if $x_i$ is replaced by its logarithmic transform, $\log x_i$.
Thus we accept here the simplest setting, with $x_i$ being a realization of a Gaussian random variable with some mean $\mu_{\mathrm{crm}} + \Delta$, where $\Delta$ represents an unknown bias, and some unknown standard deviation $\tau$.
In this model the compatibility ('no bias') hypothesis $H_0$ means $\Delta = 0$. Following tradition, we denote by $\alpha$ the type 1 error or the significance level, i.e. the largest probability of false rejection of $H_0$. The type 2 error occurs when the null hypothesis is wrong but is not rejected. If $\beta$ represents the probability of accepting $H_0$ when $\Delta$ is a non-zero bias, which is the type 2 error, $1-\beta$ is called the power at $\Delta$. A good statistical test first of all has a small significance level not exceeding $\alpha$ for $\Delta = 0$, but also has a large power at least for sufficiently large $|\Delta|$ [20]. Under this model the lab's $(1-\alpha)$ confidence interval is $\bar{x} \pm t_{\alpha/2}(\nu)\,s/\sqrt{n}$, where $t_{\beta}(\nu)$ denotes the $(1-\beta)$-quantile of the t-distribution with $\nu = n-1$ degrees of freedom. The probability that this interval covers $\mu_{\mathrm{crm}} + \Delta$ is $1-\alpha$.
Fairly often in practice this interval and the CRM interval $\mu_{\mathrm{crm}} \pm U_{\mathrm{crm}}$ do not overlap, refuting compatibility. Mathematically this fact can be expressed as
$$|\bar{x} - \mu_{\mathrm{crm}}| > t_{\alpha/2}(\nu)\,\frac{s}{\sqrt{n}} + U_{\mathrm{crm}}. \qquad (1)$$
When (1) holds, the lab decides to reject the compatibility hypothesis. Such a procedure is recommended by ISO 10576-1 [13]. The significance level using (1) is always smaller than $\alpha$. An implication is that for small $|\Delta|$ the type 2 error is fairly large. Despite its intuitive appeal, the underpowered test (1) has a poor chance of detecting a bias when it is there. It is important to realize that overlapping intervals do not imply that the lab's mean coincides with $\mu_{\mathrm{crm}}$. Reference [30] suggests different formulations of the compatibility hypothesis in metrology and provides numerical power comparisons of various procedures. In this work we concentrate on the following t-test for two reasons. First of all, this test is the most commonly used technique. The second reason is that the properties of the two-stage procedure discussed in section 4 generally do not hold for other tests.
The classical t-test rejects compatibility when
$$|\bar{x} - \mu_{\mathrm{crm}}| \geq t_{\alpha/2}(\nu)\,\frac{s}{\sqrt{n}}. \qquad (2)$$
For $\Delta = 0$, the probability of false rejection using this test is exactly $\alpha$. Clearly the right-hand side of (2) is always smaller than the right-hand side of (1), so each time (1) rejects, (2) rejects as well. Therefore, the probability of false acceptance under (1) is larger than that for (2), and the latter test is more powerful. If $\Delta \neq 0$, the ratio $\sqrt{n}(\bar{x} - \mu_{\mathrm{crm}})/s$ has a noncentral t-distribution with $\nu$ degrees of freedom and noncentrality parameter $\sqrt{n}\Delta/\tau$. In addition to controlling the type 1 error ($\alpha$), one would like to have the type 2 error ($\beta$) as small as possible. Towards this end the user may specify the minimum non-zero value for the bias, $\Delta_c$, that is of concern. The choice of $\Delta_c$ in practice can cause difficulties. Indeed for the bias to be deemed significant, this critical value cannot be smaller than $U_{\mathrm{crm}}$, but realistically $\Delta_c$ should not be taken very large. We recommend limiting the values of $\Delta_c$ to the range $U_{\mathrm{crm}} \leq \Delta_c \leq 3U_{\mathrm{crm}}$, which can be motivated by the equations below.
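The difference between the two rejection rules can be illustrated with a minimal Python sketch. It assumes that the user interval is reported as $\bar{x}$ plus or minus a half-width, that rule (1) rejects when the distance $|\bar{x} - \mu_{\mathrm{crm}}|$ exceeds the sum of the two half-widths, and that rule (2) rejects when it reaches the user half-width alone. The function names are hypothetical; the numbers are the gallium figures quoted in section 6.

```python
def overlap_rule_rejects(x_bar, half_width, mu_crm, U_crm):
    """Rule (1), as assumed here: reject when the user interval
    x_bar +/- half_width and the CRM interval mu_crm +/- U_crm are disjoint."""
    return abs(x_bar - mu_crm) > half_width + U_crm


def t_test_rejects(x_bar, half_width, mu_crm):
    """Rule (2), as assumed here: reject when mu_crm is not strictly
    inside the user interval x_bar +/- half_width."""
    return abs(x_bar - mu_crm) >= half_width


# Gallium example of section 6: user interval 74 +/- 15, CRM interval 58 +/- 4.
print(overlap_rule_rejects(74.0, 15.0, 58.0, 4.0))  # False: the intervals overlap
print(t_test_rejects(74.0, 15.0, 58.0))             # True: 16 exceeds 15
```

The same data are thus judged compatible by the overlap rule and incompatible by the t-test, which is the loss of power described above.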
The larger $\Delta_c$, the smaller is the sample size $n$ needed to attain a given type 2 error, $\beta$, at $\Delta_c$. The balance between $\alpha$, $\beta$, $n$ and $\Delta_c$ can be achieved only if there is some information about the unknown $\tau$. Indeed in our problem for a fixed $\alpha$ the probability of the type 2 error is a function of the noncentrality parameter $\sqrt{n}\Delta/\tau$. If $\tau$ were known, one could solve for $n$ in the equation, type 2 error $= \beta$, to get the needed sample size $n$, $n \approx (z_{\alpha/2} + z_{\beta})^2\tau^2/\Delta_c^2$. If $\tau$ is given, one can even construct a coverage interval of any width $2h$ by taking $n \approx z_{\alpha/2}^2\tau^2/h^2$. A lab may want to consider its interval having the width proportional to that of the certificate. The ratio of (expected) widths of these intervals, $C_m$ = width(CRM interval)/width(user interval), is known in quality control problems as the measurement capability index. See [25] for a discussion of other capability characteristics.
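The two known-$\tau$ sample size formulae can be evaluated with a short Python sketch. The function names and the input values ($u = 1$, $B = \tau/u = 2$, $C_m = 1$) are hypothetical and serve only as illustration. The half-width $h = z_{\alpha/2}u/C_m$ and the critical bias $\Delta_c = (z_{\alpha/2} + z_{\beta})u/C_m$ are chosen here so that the two formulae agree, which follows from equating them.

```python
from statistics import NormalDist


def z(p):
    """Upper-tail quantile z_p of the standard normal: P(Z > z_p) = p."""
    return NormalDist().inv_cdf(1.0 - p)


def n_for_power(tau, delta_c, alpha=0.05, beta=0.1):
    """Known-tau approximation n = (z_{alpha/2} + z_beta)^2 tau^2 / delta_c^2."""
    return (z(alpha / 2) + z(beta)) ** 2 * tau ** 2 / delta_c ** 2


def n_for_half_width(tau, h, alpha=0.05):
    """Known-tau approximation n = z_{alpha/2}^2 tau^2 / h^2."""
    return z(alpha / 2) ** 2 * tau ** 2 / h ** 2


# Hypothetical inputs: u = 1, tau = B*u with B = 2, C_m = 1.
u, B, C_m = 1.0, 2.0, 1.0
tau = B * u
h = z(0.025) * u / C_m                    # half-width matched to the certificate
delta_c = (z(0.025) + z(0.1)) * u / C_m   # bias at which both formulae agree

# Both values equal (B * C_m)^2 = 4 up to rounding error.
print(n_for_power(tau, delta_c), n_for_half_width(tau, h))
```

Under these assumptions both requirements reduce to $n \approx (BC_m)^2$, so a lab whose standard deviation is twice the certificate's standard uncertainty would need about four replicates if $\tau$ were known.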
With $u$ representing the standard uncertainty, we suggest taking $h = z_{\alpha/2}uC_m^{-1}$ for a given value of $C_m$. Then, whatever the value of $\tau$, both of the above formulae for the sample size $n$ coincide if
$$\Delta_c = \left(1 + \frac{z_{\beta}}{z_{\alpha/2}}\right) z_{\alpha/2}\,u\,C_m^{-1}. \qquad (3)$$
Since $\tau$ is unknown in the formulae for the necessary sample size, it must be estimated. For this purpose the comparison of $\tau$ and $u$ is helpful. Of course one should anticipate that $\tau$ is larger than $u$. Take $\tau = Bu$, where the corresponding factor $B$, say, $1 < B \leq 5$, may be determined from the lab's preparatory work. To put it in a somewhat different way, let $Bu$ be the lab's best guess about $\tau$.
The factor $B$ used in [30] can be described via the measurement capability index $C_m$ mentioned above, $B = [\sqrt{n}\,z_{\alpha/2}/t_{\alpha/2}(\nu)]\,C_m^{-1}$. Thus $B$ is merely a multiple of $C_m^{-1}$ which takes into account the error probability $\alpha$ and is adjusted for different sample sizes. Indeed the user's interval for very large values of $\tau/\sigma_{\mathrm{crm}}$ is practically useless. According to the rule of thumb, $B$ should be about 2 [15]. The smallest recommended value of $C_m$ in problems involving compliance testing via tolerance zones is 1.3 [1,3,17], which requires fairly large sample sizes $n$. In our situation one may expect that $C_m \leq 1$. The lab is on equal footing with the CRM in terms of its interval width when $C_m = 1$.
Returning to the issue of controlling the type 2 error, by taking $\tau = Bu$ one gets the estimated value of the noncentrality parameter, $\sqrt{n}\Delta_c/(Bu)$, so that the numerical evaluation of the smallest $n$ satisfying the inequality, type 2 error $\leq \beta$, becomes feasible. If $\Delta_c$ is chosen as in (3), the noncentrality parameter becomes $t_{\alpha/2}(\nu)(1 + z_{\beta}/z_{\alpha/2})$, with typical values between 4 and 6.
Denote by $n_m(\alpha, \beta, d)$ the minimal sample size $n$ such that the test (2) of significance level $\alpha$ has power at least $1-\beta$ at $d = \Delta_c/\tau$. In biostatistics $d$ is called the effect size. Modern statistical software, in particular the publicly available R language, offers several routines to determine $n_m(\alpha, \beta, d)$ numerically for any given values of $\alpha$, $\beta$ and $d$.
For example, when sig.level $= \alpha = 0.05$, power $= 1-\beta = 0.9$ and $d = \Delta_c/\tau = 2$, such an R routine gives $n_m = 5$. A simple approximation is
$$n_m \approx \left\lceil \frac{(z_{\alpha/2} + z_{\beta})^2}{d^2} + \frac{z_{\alpha/2}^2}{2} \right\rceil, \qquad (4)$$
where $\lceil a \rceil$ denotes the least integer that is greater than or equal to $a$ [7]. The origin of (4) is the asymptotic expansion of the noncentral t-distribution function in powers of $\nu^{-1}$ [16, chapter 31, section 5]. In the example above (4) gives the correct answer, $n_m = 5$. See also table 1. Excel users may want to use RExcel, an add-in for MS Excel which allows R functions to be called as worksheet functions [8]. There are several web sites (http://hedwig.mgh.harvard.edu/sample_size, http://homepage.stat.uiowa.edu/~rlenth/Power, http://calculators.stat.ucla.edu/powercalc) allowing necessary sample size calculations intended mainly for clinical trials. The procedure which tests the equality of means by checking the overlap between two intervals in such studies is criticized in [31].
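A closed-form approximation of this kind is easy to evaluate without specialized software. The Python sketch below assumes that the approximation (4) takes the form $\lceil (z_{\alpha/2} + z_{\beta})^2/d^2 + z_{\alpha/2}^2/2 \rceil$. This form is an assumption, adopted because it reproduces the value $n_m = 5$ quoted for $d = 2$ and the 'Equation (4)' entries of table 1 at the values of $d$ named in the text. The function name is hypothetical, and the sketch does not replace an exact noncentral t computation.

```python
import math
from statistics import NormalDist


def approx_sample_size(d, alpha=0.05, beta=0.1):
    """Approximate minimal n for a two-sided one-sample t-test of level alpha
    with power 1 - beta at effect size d (assumed form of approximation (4))."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1.0 - alpha / 2)  # z_{alpha/2}
    z_beta = nd.inv_cdf(1.0 - beta)        # z_beta
    return math.ceil((z_alpha + z_beta) ** 2 / d ** 2 + z_alpha ** 2 / 2)


print(approx_sample_size(2.0))  # 5, the value quoted for d = 2
print(approx_sample_size(0.5))  # 44
```

The second term, $z_{\alpha/2}^2/2$, is a small-sample correction; without it the normal approximation alone would give 3 rather than 5 at $d = 2$.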
In the next section we will see how the user can derive a 95% confidence interval of any given length when $\tau$ is unknown. For this purpose one can employ the sample size formula (5), in which $\alpha$ is the error probability [19]. Thus, under the traditional $\alpha = 0.05$ error, $n \approx 1.55BC_m$. Equation (5) provides the optimal choice of the sample size $n$ for $h = z_{\alpha/2}uC_m^{-1}$ with a given value of $C_m$, as explained in [19]. This formula gives the same answer, $n = 5$, in the example above when $BC_m = 3.21$.

Two-stage procedure
If the CRM interval and the user interval do not overlap or, say, the compatibility hypothesis is rejected by (2), the lab may decide to follow a sequential two-stage approach to its testing recommended by [2,10,13]. However, none of these references specifies the necessary sample sizes. The lab's motivation may be the fact that the test (2) does not have good power unless $\Delta_c/\tau$ is fairly large [30], or it may feel that its confidence interval is very wide.
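In mathematical statistics there is a method (Stein's two-stage procedure [20, p 198]) to choose the second (random) sample size $m$ so that, when $\tau$ is unknown, the interval $\bar{X} \pm h$, $h > 0$, has a guaranteed coverage probability which is at least $1-\alpha$, say, 95%. Here $\bar{X}$ is the total sample mean (based on both the $n$ first-stage observations and the $m$ second-stage observations). The formula for $N = n + m$ is
$$N = \max\left\{n, \left\lceil \frac{t_{\alpha/2}^2(\nu)\,s^2}{h^2} \right\rceil\right\}. \qquad (6)$$
If, with a given measurement capability index $C_m$, $h = z_{\alpha/2}uC_m^{-1}$, the lab will have its interval's width proportional to that of the CRM's interval at the expense of $m$ additional measurements,
$$m = \max\left\{0, \left\lceil \frac{C_m^2\,t_{\alpha/2}^2(\nu)\,s^2}{z_{\alpha/2}^2\,u^2} \right\rceil - n\right\}. \qquad (7)$$
The second sample is not needed at all when $s^2 \leq n z_{\alpha/2}^2 u^2/[C_m^2 t_{\alpha/2}^2(\nu)]$.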
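The second-stage sample size for the coverage interval can be sketched in Python as follows. The sketch assumes that the Stein total sample size has the form $N = \max\{n, \lceil t_{\alpha/2}^2(\nu)s^2/h^2 \rceil\}$, an assumption that reproduces the second-stage sizes quoted in the examples of section 6. The function name is hypothetical, and the t critical value is passed in as an input (2.571 for $\nu = 5$, 4.303 for $\nu = 2$) so that only the standard library is needed.

```python
import math
from statistics import NormalDist


def stein_second_stage(n, s, h, t_crit):
    """Number of additional measurements m so that the interval with
    half-width h has the desired coverage, assuming the total size is
    N = max(n, ceil(t_crit^2 * s^2 / h^2)) and m = N - n."""
    total = max(n, math.ceil(t_crit ** 2 * s ** 2 / h ** 2))
    return total - n


# Gallium example: n = 6, s = 6, u = 2, C_m = 1, so h = z_{0.025} * u / C_m.
h_gallium = NormalDist().inv_cdf(0.975) * 2.0 / 1.0
print(stein_second_stage(6, 6.0, h_gallium, 2.571))  # 10

# PCB 153, lab 16: n = 3, s = 15.26, target half-width 7.6.
print(stein_second_stage(3, 15.26, 7.6, 4.303))      # 72
```

A lab whose first-stage standard deviation is already small enough gets $m = 0$ from this rule, which is the case where the second sample is not needed.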
Thus by claiming a small uncertainty, the lab deprives itself of the chance to re-examine its coverage interval. If the new user interval and the CRM interval still do not overlap, the lab may choose to declare its lack of compatibility or to attempt a bias correction by performing further measurement runs. A motivated laboratory could perform a fully sequential sampling scheme by making measurements one at a time until, for the current value of $s^2$, $s^2 \leq n z_{\alpha/2}^2 u^2/[C_m^2 t_{\alpha/2}^2(\nu)]$.
Stein's two-stage procedure can also be used in the hypothesis testing context, so that for a particular bias $\Delta_c$ one can construct a test whose power for all $\tau$ is at least $1-\beta$ at this critical value. The two-stage t-test rejects the compatibility hypothesis when $\sqrt{N}\,|\bar{X} - \mu_{\mathrm{crm}}|/s \geq t_{\alpha/2}(\nu)$. If the total sample size $N$ is determined from $s$ through a constant $d$, this procedure has the significance level $\alpha$. Its smallest power, $1 - \mathrm{tcdf}(d + t_{\alpha/2}(\nu), \nu) + \mathrm{tcdf}(d - t_{\alpha/2}(\nu), \nu)$, corresponds to $\tau \to \infty$. If it is $1-\beta$, then the test has the power at least $1-\beta$ for all $\tau$. A natural choice is
$$N_c = \max\{n, n_m(\alpha, \beta, \Delta_c/s)\}, \qquad (8)$$
with $n_m$ from (4). Then additional observations are needed when and only when the first sample size $n$ is smaller than $n_m(\alpha, \beta, \Delta_c/s)$, which is an estimator of the desirable theoretical quantity $n_m(\alpha, \beta, \Delta_c/\tau)$.

Table 1. Sample sizes for α = 0.05, β = 0.1 from table 1 of [23], the exact n_m and equation (4), as d increases from 0.5 to 3.
[23]           -  30  22  17  13  11   8   6   5   4   3   2   2
n_m           44  32  24  19  16  13  10   8   7   6   5   5   4
Equation (4)  44  32  24  19  15  13  10   8   7   6   5   4   4
Another approximation via a modification of (4) suggests taking the total sample size $N_a$ of (9), with the second-stage sample size $m = N_a - n$ given by (10). Equations (7) and (10) are approximately equal when the critical bias $\Delta_c$ takes the value (11), which is possible only if $s^2 > 0.5\,z_{\alpha/2}^2u^2/C_m^2$. Then about the same number of additional measurements $m$ is required for the lab's interval to have the half-width $z_{\alpha/2}uC_m^{-1}$ as for the guaranteed power test of compatibility.
Simulations show that in terms of power the Stein procedure with $N_a$ performs better than with $N_c$, especially for small and medium values of $n$. The total sample size (9) is therefore recommended. The smallest power of this test, $1 - \mathrm{tcdf}(z_{\alpha/2} + z_{\beta} + t_{\alpha/2}(\nu), \nu) + \mathrm{tcdf}(z_{\alpha/2} + z_{\beta} - t_{\alpha/2}(\nu), \nu)$, is very close to $1-\beta$ and considerably exceeds that of $N_c$ in (8) for $\nu \leq 6$.

Bias uncertainty interval
The NIST Special Publication 829 [23] addresses the same issue as the two previous sections, namely the design of a CRM experiment, using the approximate formula for the necessary sample size,
$$n_m = \frac{[t_{\alpha/2}(n_m - 1) + t_{\beta}(n_m - 1)]^2}{d^2}.$$
Here as before, $\beta$ is the desired type 2 error at $\Delta_c$, and $d = \Delta_c/\tau$. This formula is obtained from the approximation of the noncentral t-random variable by the sum of a central t-random variable and the noncentrality parameter. Since $n_m$ enters on both sides, an iterative process is required to determine its value. As the following example shows, this process may not be very accurate.
A part of table 1 in [23] for $\alpha = 0.05$, $\beta = 0.1$, when $d = \Delta_c/\tau$ varies from 0.5 to 3, along with exact answers obtained from the R-code and the approximate formula (4), is reproduced here. All numbers in the original table present insufficient sample sizes, while formula (4) is remarkably accurate, differing by one from the exact $n_m$ just for two values of $d$ ($d = 0.9$ and $d = 2.5$). The exact $n_m$ value when $d = 2$ is 5, while the table's value 3 is far from $[t_{0.025}(2) + t_{0.1}(2)]^2/4 = 5.8$. Thus this part of [23] should not be used in practice.
Reference [23] suggests treating $u_{\mathrm{crm}}$ as a fixed offset. A bias uncertainty interval for $\Delta$ is then derived from the user interval by increasing its half-width by this amount. The probability that the unknown bias $\Delta$ is within these two limits is at least $1-\alpha$, but the interval may be excessively wide for this purpose. It is closely related to the conservative test (1), as that test accepts the compatibility hypothesis if and only if the interval contains the origin. The bias-corrected interval suggested in [28] is asymmetric and was found to be one of the best bias removal procedures reviewed in [22]. However, this interval may not correspond to an interval defined by a coverage factor and was criticized for this reason [21].
In our problem $\Delta$ represents the short-term bias, which can be estimated by $\bar{x} - \mu_{\mathrm{crm}}$ with the corresponding uncertainty $\sigma_{\mathrm{crm}}^2 + s^2/n$ [26]. However, for justifiable bias removal a typically missing independent estimate of this uncertainty is required. If compliance is rejected, a lab may want to pursue the bias correction by taking more measurements involving the same or a different CRM and/or using other available precision information mentioned in section 2, including reproducibility conditions.
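As a numerical illustration, the bias estimate and its uncertainty can be computed with the short Python sketch below. It assumes that $\sigma_{\mathrm{crm}}^2 + s^2/n$ is the variance of the bias estimate and that $\sigma_{\mathrm{crm}}$ can be identified with the certificate's standard uncertainty; both are assumptions made for illustration. The function name is hypothetical, and the inputs are the gallium figures quoted in section 6.

```python
import math


def bias_estimate(x_bar, s, n, mu_crm, sigma_crm):
    """Return the estimated short-term bias x_bar - mu_crm and its standard
    uncertainty, assuming the variance is sigma_crm^2 + s^2 / n."""
    bias = x_bar - mu_crm
    variance = sigma_crm ** 2 + s ** 2 / n
    return bias, math.sqrt(variance)


# Gallium example: lab mean 74, s = 6 on n = 6 measurements,
# certified value 58 with standard uncertainty 2 (taken as sigma_crm).
bias, u_bias = bias_estimate(74.0, 6.0, 6, 58.0, 2.0)
print(bias, round(u_bias, 2))  # 16.0 3.16
```

Such a figure describes one CRM comparison only; as noted above, it does not supply the independent uncertainty estimate needed to justify a bias correction.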

Two examples
We start with the following illustrative example. A laboratory purchased a CRM consisting of West Virginia coal ash, whose certified mass fraction of gallium is 58 mg kg$^{-1}$, with the standard measurement uncertainty 2 mg kg$^{-1}$ evaluated on 95 degrees of freedom.
The laboratory measured this reference material as a means to verify its measurement protocol, and obtained 74 mg kg$^{-1}$ with the standard deviation 6 mg kg$^{-1}$ evaluated on five degrees of freedom. The user 95% confidence interval, which ranges from 59 mg kg$^{-1}$ to 89 mg kg$^{-1}$, has a small overlap with the CRM interval, 58 ± 4.
All other compatibility hypothesis tests considered in [30] also reject. According to (7), if $C_m = 1$, the total sample size is $N = 16$. Thus by taking $m = 16 - 6 = 10$ additional observations, the user would get a 95% coverage interval $\bar{X} \pm 4$. If $\bar{X} \geq 58 + 2.57 \times 6/\sqrt{16} \approx 61.85$, this lab may try to seek a bias correction or just state its incompatibility with this SRM.
The lab's choice for $n$ should have been $n = 1.55B$, which suggests that in this example $B = \tau/u > 3.8$. In this situation, if $\Delta_c = 2u = 4$ is the critical bias, then according to (7), when this bias is present, $m = 20$ additional measurements for the Stein test would allow $H_0$ to be rejected correctly with probability $1-\beta = 0.8$. The value from (11) is $\Delta_c = 6$, and the second sample size is $m = 10$, as above. If $\Delta_c = 4u = 8$, then the second sample size is reduced to $m = 5$.
We use the data for environmental SRM 1974a Organics in Mussel Tissue [29] as the second example. These data come from 16 laboratories participating in a performance-based study over a period of several years. All these labs have used $n \equiv 3$, so that $\nu \equiv 2$.
Out of 222 cases (14 compounds), the t-test rejected the compatibility hypothesis 91 times. In contrast, 41 user intervals did not overlap with the certificate interval. The three main causes, pyrene, PCB 118 and PCB 153, each contributing five instances of non-intersection, were followed by fluoranthene and 4,4′-DDT with four cases each.
As an example, consider PCB 153 with the corresponding CRM interval (145.2 ± 7.6) µg kg$^{-1}$. Since $t_{\alpha/2}(2) = 4.303$, the lab 10 interval, (189.0 ± 10.89) µg kg$^{-1}$, does not overlap with the CRM interval, and the t-test rejects the compatibility hypothesis (table 2). This lab would need four additional measurements to establish a coverage interval of the same width as the CRM interval (i.e. to reduce its half-width from 10.89 to no more than 7.6). However, this shorter interval has a poor chance of overlapping with the certificate interval. The same lab would require the second sample size 10 to obtain power 0.9 when $\Delta_c = 2 \times 7.6 = 15.4$. Similarly, lab 14 has the half-width of its reported interval, 7.20, smaller than that of the CRM, so that its interval cannot be altered by a two-stage procedure. The fact that lab 16 has a large sample standard deviation, $s = 15.26$, leads to a very large additional sample size, 45, for the guaranteed power test at $\Delta_c = 15.4$, and to a completely unrealistic 72 new measurements for the coverage interval of half-width $2u = 7.6$. This example shows that the two-stage procedure cannot be useful in the extreme cases when the first sample standard deviation is very small or very large.

Conclusions
The Stein procedure offers the promise of shorter uncertainty intervals and of compatibility verification which is simultaneously more powerful and more fair to the user. This promise cannot be fulfilled by any fixed sample size statistical method. The calculations involving the necessary second sample size after (6) or (8) are sufficiently simple that end users of CRM certificates could employ them when designing a two-stage validation procedure. These formulae serve two different goals: one to attain a confidence interval of a width comparable in terms of the measurement capability index to that of the certificate, another to derive a test rejecting conformity with a high probability for the prescribed bias value. The two objectives are not incompatible; in contrast, they coincide when the critical bias value is given by (11) which can be taken as the default bias.
However, our approach needs some information about the lab's relative uncertainty with regard to CRMs. It cannot be helpful if this uncertainty is very large or very small. Neither the formula (5) nor the exact R-language calculations can be employed without certain distributional assumptions which may or may not be met in a particular situation. Effectively combining the message of the CRM certificate with the lab's data is possible only if the lab's measurement process is under statistical control. Indeed good repeatability is a precondition for any contemplated bias correction which by itself requires much more information.

Acknowledgments
References and discussion were provided by the author's NIST colleagues D L Duewer and S Leigh. Useful insight by J Sieber into laboratory preparation and critical remarks of two referees are also acknowledged. Contribution of National Institute of Standards and Technology, not subject to copyright in the United States.