Evaluation of CRM homogeneity in cases of insufficient method repeatability: Comparison of Bayesian analysis with substitutes for ANOVA based estimates

Insufficient method repeatability is a recurring problem in the evaluation of certified reference materials (CRMs). An investigation of the homogeneity studies of 216 certified parameters from 36 CRMs released by the European Commission’s Joint Research Centre (JRC) over the last four years found that in 1/3 of the cases, the method repeatability (sr) was too high to calculate the standard deviation between units (sbb) by classical analysis of variance (ANOVA). It was also found that the application of the repeatability requirement stated in ISO Guide 35:2017 is not feasible, since it would require unrealistically low repeatability standard deviations or an impossibly high number of replicates per unit. Several estimates of the uncertainty of homogeneity (ubb) were compared using simulated homogeneity studies with repeatabilities of 1–8% and sbb between 0.2 and 2.8%: the maximum of sbb (as evaluated by ANOVA) and 0; the maximum of sbb and u*bb, the uncertainty hidden by method repeatability; the maximum of sbb and sr/√n, the standard deviation of the unit means of a perfectly homogeneous material; and Bayesian analysis using both informative and diffuse priors. It was found that using the maximum of sbb and sr/√n as an estimate of ubb guards against severe underestimation but usually results in a severe overestimation of the between-unit variation. Using max(sbb, 0) shows the least average bias but results in a severe underestimation of ubb in a high fraction of cases. Using max(sbb, u*bb) limits, but does not completely eliminate, cases of severe underestimation; it also leads to average results biased towards high values. For the range of sbb and sr investigated, Bayesian analysis performed worse than max(sbb, u*bb) in limiting severe underestimation of ubb, but limited the average bias towards high results. 
A risk-based approach to cases of insufficient method repeatability is proposed: after assessing the other contributions to the uncertainty of certified values, the effect of severe over- and underestimation of ubb is evaluated, and an appropriate approach is chosen based on this analysis.


Introduction
Homogeneity of selected parameters within a batch of material is one of the two defining properties of a reference material (RM) [1]. Although ISO Guide 30 does not stipulate any specific degree of homogeneity, it states that the degree of homogeneity must be "fit for its intended use in a measurement process". For certified reference materials (CRMs), ISO 17034 [2] requires the inclusion of an uncertainty contribution from between-unit homogeneity, except when it is negligible compared to the other uncertainty contributions [3]. This between-unit variation, together with the uncertainty contributions from characterisation and stability, leads to the assigned uncertainty of the certified value. Therefore, a quantitative assessment of the variation between different units of a CRM is necessary, either to determine the variation directly or to demonstrate that the variation is indeed negligible compared to other uncertainty contributions.
A homogeneity study aims to determine the variation of the certified property between the different units of a CRM batch. A quantitative assessment of homogeneity usually comprises one or more replicate measurements on a representative number of units of a material. The material tested is usually the candidate material itself but can also be a very similar material which can, depending on the validated processing procedures and material properties, serve as a proxy for the homogeneity of the candidate material. The variation of the certified property is a property of the CRM and independent of the chosen measurement method. The variation of the certified property can have different forms, like a trend over the whole batch, a few units with outlying values, a sudden change in the certified property due to some change in the manufacturing process etc. However, in many cases, the variation of the average value of each item across the batch can be modelled as a Gaussian distribution around the certified value (mCRM). In this model, the average properties of the different units mi follow a normal distribution around mCRM with the variance s²bb (Equation (1)):

mi ~ N(mCRM, s²bb)  (Equation 1)

The aim of the homogeneity assessment is to estimate the true between-unit variation s²bb. (Note that ISO Guide 35 [3] refers to the between-unit variation as "between-bottle variation", thus using the index "bb". For consistency reasons, this notation is used in this manuscript.)
A typical homogeneity study consists of performing n replicate measurements of the certified property on N units of the CRM. Any individual measurement xij on unit i of the CRM will give a result that depends on the true average of the batch (mCRM), a potential constant method bias (εm), the deviation of the value of the unit tested from the average of the batch (εi) and a random error reflecting method repeatability (εij) (Equation (2)):

xij = mCRM + εm + εi + εij  (Equation 2)

Usually, only one method is used for all measurements in a homogeneity study, which makes εm constant for the given study. The grand mean of each homogeneity study (m) can therefore be written as mCRM + εm, which simplifies Equation (2) to Equation (3):

xij = m + εi + εij  (Equation 3)
It is often assumed that the random error εij is normally distributed with a mean of zero and the repeatability variance s²r [3]. In many cases, εi is also assumed to follow a normal distribution with a mean of zero and a variance of s²bb. The variation between units is usually small, so the repeatability variance is the same for all units, making εij statistically independent of εi. The units tested are regarded as a typical sample of the whole batch, and the between-unit effect εi is modelled as a random factor, which means that the basic model given in Equation (3) is a random effects model. In this case, the observed variation of the unit means (u²c,bb) comprises the variation between units (s²bb) and the measurement variation of the means (s²r/n), as shown in Equation (4) [4]:

u²c,bb = s²bb + s²r/n  (Equation 4)

From the data of the homogeneity study, sbb is calculated and then used as an estimate of the uncertainty due to between-unit variation (ubb). It should be noted that a constant method bias is irrelevant, as it affects all measurements in the same way and will therefore not contribute to the observed variation of results.
Traditionally, sbb is calculated from the observed variation of all results using analysis of variance (ANOVA) [3], as shown in Equation (5). The basic assumptions of ANOVA are homoscedasticity (homogeneity of variances in each group), normality (values within each group are drawn from a normal distribution) and randomness and independence (the results are random and the units are independent of each other).

sbb = √((MSbetween − MSwithin)/n)  (Equation 5)

with MSbetween and MSwithin being the mean squares between and within groups from ANOVA and n the number of replicate measurements performed per unit; sr is calculated as √MSwithin. Equation (5) is derived by substituting the estimates MSbetween and MSwithin for the expected values of these mean squares in the equation that relates the variances s²bb and s²r to the mean squares. This generally works well as long as s²bb is large compared to s²r/n. For very homogeneous materials (s²bb very small) and/or methods with a high repeatability variance (s²r/n large), the argument of the square root can become negative, which makes a calculation of sbb impossible. Several solutions have been suggested for this case: replacing the estimate for sbb from Equation (5) with a non-negative value, selecting a method with a sufficiently small s²r, performing a sufficiently high number of replicate measurements n per unit to ensure that the argument of the square root does not become negative in the first place or, recently, using Bayesian analysis.
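The ANOVA evaluation of Equation (5) can be sketched in a few lines of Python (a minimal sketch; the function name and array layout are illustrative, not from the original study). A negative argument of the square root is reported as None rather than a number:

```python
import numpy as np

def anova_homogeneity(data):
    """One-way ANOVA evaluation of a homogeneity study (Equation (5)).

    data: array of shape (N, n) -- N units, n replicates per unit.
    Returns (s_r, s_bb); s_bb is None when MS_between < MS_within,
    i.e. when the argument of the square root is negative.
    """
    N, n = data.shape
    unit_means = data.mean(axis=1)
    grand_mean = data.mean()
    ms_between = n * ((unit_means - grand_mean) ** 2).sum() / (N - 1)
    ms_within = ((data - unit_means[:, None]) ** 2).sum() / (N * (n - 1))
    s_r = np.sqrt(ms_within)            # s_r = sqrt(MS_within)
    if ms_between < ms_within:
        return s_r, None                # s_bb cannot be calculated
    return s_r, np.sqrt((ms_between - ms_within) / n)
```

Returning None instead of 0 keeps the decision of how to replace a non-calculable sbb (the subject of the approaches below) separate from the ANOVA itself.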

Approach 1: choosing a non-negative value
The classic response in frequentist statistics to a negative argument in Equation (5) is to set sbb to zero. However, if the measurement method lacked sufficient precision, this can lead to a severe underestimation of the between-unit variation. It has, therefore, been argued that the relation between s²bb and s²wb is of no interest in reference material production [4]. Non-negative estimators for the between-group variance have been proposed by Federer [5]. Alternatively, an attempt has been made to find an approximate estimate for the inhomogeneity that could be hidden by the method repeatability (named u*bb). This led to an equation based entirely on the estimated method repeatability and its degrees of freedom, which in turn depend on the total number of units tested (N) and the number of replicates performed per unit (n) [6] (Equation (6)):

u*bb = (sr/√n)·(2/νMSwithin)^(1/4), with νMSwithin = N(n − 1)  (Equation 6)

It was suggested to use the larger value of sbb and u*bb as an estimate of the uncertainty of homogeneity [6]. This approach is given as an option in ISO Guide 35 [3] and is followed by several reference material producers. The rationale for the procedure is that, if no inhomogeneity is detected, one should use the inhomogeneity that could remain undetected as a conservative estimate. This approach has recently been criticised not only on fundamental terms but also as underestimating the inhomogeneity that could really be hidden [7]. As an alternative and more conservative approach, using the larger of sbb and sr/√n as an estimate of the between-unit variation has been proposed [8]. In the case of a perfectly homogeneous material (s²bb = 0), the variation between unit means will be determined entirely by the random variation sr/√n. In the absence of any detected variation, the variation of the unit means of a perfectly homogeneous material is taken as an estimate of the between-unit variation.
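The two non-negative substitutes can be written out as follows (a sketch; the formula for u*bb follows the expression given in Ref. [6] and reproduced in ISO Guide 35, and the function names are illustrative):

```python
import numpy as np

def u_bb_star(s_r, N, n):
    """u*_bb, the inhomogeneity that can be hidden by method repeatability:
    (s_r / sqrt(n)) * (2 / nu)**0.25, with nu = N*(n-1) the degrees of
    freedom of MS_within (formula as given in ref. [6])."""
    nu = N * (n - 1)
    return s_r / np.sqrt(n) * (2.0 / nu) ** 0.25

def u_bb_substitute(s_bb, s_r, N, n, rule):
    """Non-negative estimates for u_bb; s_bb is None when not calculable."""
    s_bb = 0.0 if s_bb is None else s_bb
    if rule == "zero":                  # A-max(s_bb, 0)
        return max(s_bb, 0.0)
    if rule == "u_star":                # A-max(s_bb, u*_bb)
        return max(s_bb, u_bb_star(s_r, N, n))
    if rule == "sr_sqrt_n":             # A-max(s_bb, s_r/sqrt(n))
        return max(s_bb, s_r / np.sqrt(n))
    raise ValueError(rule)
```

Because of the fourth-root term, u*bb is always smaller than sr/√n for the study sizes considered here, which is why max(sbb, u*bb) is the less conservative of the two substitutes.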
It should be noted that, regardless of the value chosen, replacing the nominal value for sbb with a non-negative value introduces a bias. The rationale for nevertheless using these estimates is that such a positive bias is preferable to the possibility of a severe underestimation of the heterogeneity.
Approach 2: ensuring that the argument does not become negative

ISO Guide 35:2017 [3] and ISO 13528 [9] suggest a different solution to this problem by setting minimum requirements for the repeatability standard deviation (sr) of the methods used for homogeneity assessment. If these requirements are met and no variation can be calculated, sbb is set to zero. The rationale is that the minimum requirements should ensure that MSbetween − MSwithin in Equation (5) does not become smaller than zero. If MSbetween − MSwithin nevertheless becomes smaller than zero, then sbb is so small that the very fact of a negative argument of the square root proves that the uncertainty contribution from homogeneity is negligible compared to other uncertainty contributions. To this end, ISO Guide 35 recommends that sr/√n should be smaller than 1/3 of the target uncertainty of the certified value. ISO 13528 has an equivalent criterion, albeit calculated for n = 1. This approach is much more established in proficiency testing, where already the 2005 edition of ISO 13528 stipulated the same criterion, but it is new for reference materials, where it was first formalised in the 2017 version of ISO Guide 35. The reason might be that the target is easier to reach in proficiency testing than for CRMs: for proficiency tests, the target uncertainty is usually the size of the reproducibility standard deviation of the results submitted by the participants or larger, which makes it relatively easy to find methods that fulfil this criterion.
In contrast, for CRMs, in an ideal case, the uncertainty of the certified value is negligible compared to the measurement uncertainty of the measurement result obtained by the user of the CRM. This leads to much smaller target uncertainties than in proficiency testing. The applicability of this approach for CRMs has, therefore, not yet been tested.
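The criterion translates directly into a minimum number of replicates: rearranging sr/√n ≤ utarget/3 gives n ≥ (3·sr/utarget)². A one-line sketch (the function name is illustrative):

```python
import math

def replicates_needed(s_r, u_target):
    """Smallest n satisfying the ISO Guide 35 criterion
    s_r/sqrt(n) <= u_target/3, i.e. n >= (3*s_r/u_target)**2."""
    return math.ceil((3.0 * s_r / u_target) ** 2)
```

For example, a method with sr = 3% and a target uncertainty of 1.5% would already require 36 replicates per unit, which illustrates why the criterion is hard to attain for matrix CRMs.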

Approach 3: Bayesian analysis
A third and completely different approach to solving the problem of negative arguments in the square root of ANOVA is to abandon the calculation algorithm of the classical ANOVA and use Bayesian analysis instead. It should be emphasised that the basic model remains the same, namely a hierarchical random-effects model. The only difference is the way in which the model parameters are estimated (prior probability density functions updated by the data in the Bayesian analysis versus sole reliance on the measured data in the frequentist model). This approach has the advantage that s²bb can be defined as non-negative, hence eliminating the problem from the onset. The result of a Bayesian analysis depends on the choice of priors; therefore, a careful choice is necessary. Van der Veen was the first to apply Bayesian analysis to the evaluation of homogeneity studies of reference materials [10]. His analysis of the homogeneity studies of gas mixtures showed the applicability of Bayesian analysis to the evaluation of homogeneity studies. In the examples he investigated, sbb tended to be larger than the u*bb calculated according to Ref. [6], and he also suggested that u*bb might not be a good measure for hidden inhomogeneity.
One advantage of using Bayesian analysis is that, unlike the point estimates for sbb from ANOVA or approaches 1 and 2, it automatically gives a probability density function of sbb, which may reflect the poor information about sbb better than a single number. This could lead to a more realistic assessment of the uncertainty of the certified value, especially in cases where the uncertainty contributions for characterisation and stability are also available as probability density functions. In this case, the distribution functions rather than point estimates could be propagated as described in Supplement 1 to ISO Guide 98-3 [11]. An evaluation of this is, however, beyond the scope of this manuscript.
When assessing the merits of each of these approaches, a balance between two conflicting goals needs to be reached. On the one hand, an ideal approach would never significantly underestimate ubb, as this would lead to unrealistically small uncertainties of certified values. On the other hand, ubb should also not be significantly overestimated, to avoid inflating the uncertainties of certified values needlessly. However, the average performance of an approach, i.e. whether it consistently over- or underestimates ubb, is of little or even no importance to the user of a CRM: each user uses only a relatively small number of CRMs. Consequently, a consistent but slight overestimation of the uncertainty is irrelevant compared to results that are correct on average but deviate widely in individual cases. There is another reason for the irrelevance of small differences: it is recognised that uncertainties are not perfectly known. Therefore, few CRM producers state more than two significant figures for the uncertainties of the certified values. For non-nuclear CRMs produced by the European Commission's Joint Research Centre (JRC), uncertainties are rounded in a way that ensures that the error introduced by the rounding is between 3 and 30% of the final expanded uncertainty. Slight deviations would therefore be covered by the rounding error and are of no consequence to the user. Individually widely deviating results are thus of higher importance than the average performance of an approach.
For real materials, as sbb is unknown, one can only compare whether one approach gives a higher estimate for sbb than another. The situation is even more complicated in cases where no inhomogeneity is detected and where it is not clear whether this is because one approach underestimates the inhomogeneity or the other overestimates it. Simulated data, where the artificially introduced variation between units is known, are an effective way to circumvent this problem: apart from a mere comparison of the results of the different approaches relative to each other, comparison of the estimates with the true values also allows a more solid statement on whether an approach over- or underestimates the uncertainty.
In this manuscript, two questions in connection with homogeneity studies of certified reference materials are investigated. The first part uses real data to check whether there is a significant number of homogeneity studies where MSbetween − MSwithin is negative and whether a repeatability limit as proposed by ISO Guide 35 is realistic and implementable. The second part compares different approaches to dealing with cases where sbb is equal to or smaller than sr/√n, i.e. cases where it might be impossible to estimate sbb by classical ANOVA because the data contain little information on the between-unit variation. To this end, three non-negative replacements for sbb and the results of Bayesian analysis with informative and diffuse priors are compared using simulated homogeneity studies with known between-unit variations.

Need for an uncertainty estimation for alternative homogeneity evaluations and practicality of a repeatability limit
Homogeneity studies of 36 different certified reference materials comprising 216 certified parameters released by the JRC between 2014 and 2018 were investigated. These materials cover a wide range of matrix/measurand combinations. They include mass fractions of trace elements in chocolate, algae and copper, particle sizes of nano- and microparticles, mass fractions of polycyclic aromatic hydrocarbons and polybrominated flame retardants in water and sediment, enzyme activities of pure enzyme preparations and protein mass fractions in serum. A complete list of the CRMs used, the measurands investigated, and a link to the certification reports containing the data is given in the supplementary electronic material. Most of these materials are matrix CRMs, and the measurement of the analytes in question usually involves more or less complex sample preparation. Each study consists of two to four determinations of the measurand in question in 10–20 units. 117 of these studies concerned element analysis, 56 small organic molecules, 19 physical properties, 9 operationally defined properties, 11 proteins and 4 measurands based on polymerase chain reaction (PCR). In the first step, 38 studies that could not be evaluated by standard ANOVA because they contained outliers or trends in the filling sequence were excluded. While trends in the filling sequence could be included in both frequentist and Bayesian models, an investigation of such models is beyond the scope of this manuscript. This exclusion resulted in 181 studies that could be evaluated by standard ANOVA. The distribution of the selected studies was virtually unchanged by this exclusion: 101 elements, 43 small organic molecules, 13 physical properties, 8 operationally defined properties, 10 proteins and 4 PCR-based measurands.
For each of the studies selected, sbb and sr as calculated by ANOVA were taken from the certification reports, as were u*bb and uchar. The number of cases where sbb could not be calculated because MSbetween was smaller than MSwithin was counted, as was the number of replicates that would have to be performed to fulfil the criterion sr/√n ≤ uchar/3, with uchar being the uncertainty of characterisation. Subsequently, these data were summarised as median values to minimise the effect of single extreme values.

Data generation for simulated homogeneity studies
Sets of 20 homogeneity studies, each comprising 15 units analysed in triplicate, were generated. The simulated sr were 1%, 3%, 5% and 8%, and the applied sbb were 0.2%, 0.5%, 1% and 2.8%. Sets of 900 normally distributed random numbers with an average of 100 and a standard deviation of 1%, 3%, 5% or 8% were generated using the website www.random.org. These 900 numbers were grouped into 300 sets of three results, each set representing the three replicates of one unit. In addition, 300 random numbers with an average of 1 and standard deviations of 0.002, 0.005, 0.010 or 0.028 were generated using the same website. These random numbers represent the deviation of each unit from the average measurand level of 100. Each of the 900 random numbers of each set was then multiplied by the variation factor of the corresponding unit to simulate the measurement results on this unit (Equation (7)).
xij = zij · yi  (Equation 7)

with xij the simulated result, zij the individual random number (i = 1 to 300, unit number; j = 1 to 3, replicate number) and yi (i = 1 to 300) the fraction of the unit mean of the average of the study. The results for i = 1 to 15 form homogeneity study 1, i = 16 to 30 form homogeneity study 2, etc.
As seen in Equation (7), a multiplicative variation is applied rather than an additive one (the factor for the between-unit variation is multiplied with the random numbers for the replicates). This was done in line with the observation that, for many methods, the relative variations are constant. Due to the relatively small variations (0.2–2.8% between-unit variation), the differences between multiplication and addition are negligible.
The complete set of random numbers is given in the supplementary electronic material, and an overview of the calculation is given in Fig. 1. The combination of s r and s bb used are shown in Table 1.
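The data generation can be reproduced in a few lines (a sketch; a seeded pseudo-random generator stands in for the random.org numbers used in the study, and the function name is illustrative):

```python
import numpy as np

def simulate_studies(s_r_pct, s_bb_pct, n_studies=20, N=15, n=3, seed=1):
    """Generate simulated homogeneity studies following Equation (7):
    z_ij ~ N(100, s_r), y_i ~ N(1, s_bb/100), x_ij = z_ij * y_i.
    Returns an array of shape (n_studies, N, n)."""
    rng = np.random.default_rng(seed)
    # replicate results around the measurand level of 100
    z = rng.normal(100.0, s_r_pct, size=(n_studies, N, n))
    # per-unit variation factors around 1 (broadcast over replicates)
    y = rng.normal(1.0, s_bb_pct / 100.0, size=(n_studies, N, 1))
    return z * y
```

Broadcasting the unit factors y over the replicate axis implements the multiplicative between-unit variation described above.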

ANOVA and non-negative estimates of s bb
The simulated data were evaluated using classical one-way ANOVA, calculating sr, sbb and u*bb. Three approaches to dealing with cases where MSbetween < MSwithin were applied: in the first case, ubb was estimated as the maximum of sbb and 0, i.e. setting ubb equal to zero in case MSbetween < MSwithin; the second case was to set ubb as the maximum of sbb and u*bb, i.e. setting a lower limit; and the third case was to set ubb as the maximum of sbb and sr/√n.
Throughout the manuscript, the following codes are used: "A-max(s bb ,0)" for evaluation by one-way ANOVA where s bb was set to zero for all cases where MS between < MS within . "A-max(s bb , u* bb )" for evaluation by one-way ANOVA where s bb was set to the maximum of s bb and u* bb as proposed in Ref. [6]. "A-max(s bb , s r /√n)" for evaluation by one-way ANOVA where s bb was set to the maximum of s bb and s r /√n as proposed in Ref. [8].  (1)). The individual measurement results (z[i] in the Winbugs code) follow a normal distribution around the respective unit mean with a standard deviation s r . As Winbugs has no integrated function for the Cauchy distribution, the fact that if x i~N (0, B) and k~G(0.5, 0.5), then x i /√k follows a Cauchy distribution with the scale parameter B, which was exploited as was done in Ref. [12].
The following priors were chosen:

Grand mean (gm in the code): normal distribution with μ = 100 and σ = 31. The rationale for this prior is that the data of the studies were chosen such that the grand mean should be 100; σ = 31 is, in view of this, a rather diffuse prior and was only chosen to ensure that the calculated grand mean lies within this range.

Unit means (group.mu[i] in the code): normal distribution with μ = grand mean and σ = sbb. This prior directly follows from the model, where the unit means are assumed to be normally distributed around the grand mean.

sr: uniform prior between 1 and 10% for repeatabilities of 1%, 3% and 5%, and a uniform prior between 1% and 20% for the tests with a repeatability of 8%. The goal was to have a mildly informative prior for sr. The choice of this prior reflects real life, where one usually has a reasonable idea about the repeatability standard deviation of the method from the method validation study. However, due to changes in instrument performance etc., the repeatability standard deviation on a single day may be worse, so the limit was increased. The prior itself is of little importance, as sufficient data points are available to provide a good estimate of sr.

sbb: half-Cauchy distributions were used as priors for sbb, as suggested in Refs. [12,13] and applied in Ref. [10]. Three variants of this prior were applied. The first was a diffuse prior with a scale parameter of 25. This prior allows virtually all values for sbb, even very high ones. Furthermore, two informative priors were applied: for nominal sbb of 0.2%, 0.5% and 1%, a scale parameter of 1 was applied, which severely limits the probability of high sbb; for a nominal sbb of 2.8%, a scale parameter of 5 was applied, which means that most values chosen for sbb are <5%. The relevant parts of the probability density functions of half-Cauchy distributions with scale parameters of 1, 5 and 25 are shown in Fig. 2.
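The half-Cauchy construction exploited in the WinBUGS code can be verified numerically (a sketch in Python; it uses the fact that a Gamma(0.5, rate 0.5) variable is chi-squared with one degree of freedom, so N(0, B)/√(χ²₁) is Cauchy with scale B):

```python
import numpy as np

def half_cauchy_samples(scale, size=100_000, seed=0):
    """Sample |x|/sqrt(k) with x ~ N(0, scale) and k ~ Gamma(0.5, rate=0.5).
    The ratio x/sqrt(k) follows a Cauchy distribution with the given scale
    parameter, so its absolute value is half-Cauchy distributed."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, scale, size)
    # numpy parameterises Gamma by shape and scale; rate 0.5 -> scale 2.0
    k = rng.gamma(shape=0.5, scale=2.0, size=size)
    return np.abs(x) / np.sqrt(k)
```

The median of a half-Cauchy distribution equals its scale parameter, which gives a quick sanity check on the construction.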
For each study, 100 000 iterations were performed. A thinning of 20 was applied to reduce the autocorrelation in sbb, resulting in 5000 valid iterations per study. In all cases, the median of the 5000 sbb values was used as the estimate of ubb. The choice of the median rather than the average as an estimate for sbb results in slightly lower values, especially in cases where sbb is equal to or smaller than sr, because in these cases skewed posterior distributions are obtained. It was found that the difference is about 3 relative % in the case sbb = sr = 3% and about 20 relative % in the most extreme case, where sbb = 0.2% and sr = 8%.
Throughout the manuscript, the following codes are used: "B-uninf" for the results of the Bayesian analysis applying a scale parameter of 25 for the half-Cauchy distribution of s bb "B-inf" for the results of the Bayesian analysis with an informative prior for s bb using scale parameters of 1 (for nominal s bb of 0.2%, 0.5% and 1%) and 5 (for nominal s bb of 2.8%), respectively.

Validation of the model
The basic model was validated by comparing the data for sbb = 2.8% and sr = 1% to the results from classical ANOVA. This is a situation where, for all data used, sbb can be evaluated by ANOVA, and the Bayesian analysis should give the same result. For each of the 20 tests, the ratio between the sr obtained by Bayesian analysis and by ANOVA was calculated. This ratio was, on average, 1.03, reflecting the difference between median and mean. Similarly, for each of the tests, the ratio between the sbb obtained by Bayesian analysis and sbb,real was calculated. This ratio was, on average, 0.98 for the diffuse prior and 1.01 for the informative prior. This comparison shows that the model and its implementation appropriately reproduce the hierarchical model chosen.

Investigation of the influence of the priors
The influence of the scale parameter of the half-Cauchy distribution for sbb is investigated as part of the manuscript, as one of the evaluations uses a diffuse prior with a scale parameter of 25 and one an informative prior with scale parameters of 1 and 5, respectively.
The goal of this manuscript is to investigate cases where the repeatability standard deviation is larger than sbb. The influence of the other priors was therefore investigated on a dataset with sbb = 2.6% and sr = 5%, as well as on a dataset with sbb = 1% and sr = 5%, for standard deviations of the grand mean of 10 and 3.1, including a less informative prior for sr, namely a uniform distribution between 0 and 20%. The results are shown in Table 3.
The results show that a less informative prior for sr has no influence on the result, neither for sr nor for sbb. This is an expected outcome, since the number of individual results is high enough to dominate the posterior. Also, a reduction of the standard deviation of the grand mean does not change the result, as long as the standard deviation chosen is larger than the combination of sbb and sr/√n (= uc,bb of Equation (4)): a standard deviation of 10 does not change the results, but a standard deviation of 3.1, in combination with an sbb of 2.6 and an sr of 5, would result in a lower estimate for sbb. Based on this, it is concluded that the priors are adequate for the purpose.

Comparison of approaches
Real between-unit standard deviation (sbb,real): for each simulated study, the real between-unit standard deviation, defined as the standard deviation of the 15 yi of the respective study, was calculated. This standard deviation was multiplied by 1.018, the correction factor for 15 data points, to obtain an unbiased estimate of the standard deviation [14].
For each study, the ratio ubb/sbb,real was calculated, and the following parameters were derived:

Average estimate: for each combination of sbb,real and sr, the average of the 20 ratios ubb/sbb,real was plotted against the ratio sr/sbb,real. For a perfect evaluation method, this ratio would be one for all ratios sr/sbb,real. In practice, even for a completely unbiased evaluation, this ratio will scatter around 1 due to the effect of sr on the estimate. Systematic deviations from unity indicate a bias in the evaluation approach.

Maximum overestimation: for each combination of sbb,real and sr, the upper single-sided 95% confidence limit was calculated from the 20 ratios ubb/sbb,real. This parameter accounts for the fact that users are affected by an unnecessarily large uncertainty of the CRM they purchased; whether an evaluation performs well on average is of minor importance. The maximum overestimation was plotted against the ratio sbb,real/sr.

Maximum underestimation: for each combination of sbb,real and sr, the lower single-sided 95% confidence limit was calculated from the 20 ratios ubb/sbb,real. As nobody will ever introduce a negative between-unit standard deviation, the lower confidence limit was capped at zero, meaning that values < 0 were replaced by 0. The rationale for this parameter is the same as for the maximum overestimation, namely that a user is affected by an underestimation of the uncertainty of a specific CRM bought and not by an average bias. The maximum underestimation was plotted against the ratio sbb,real/sr.

Fraction of significant underestimations: for each combination of sbb,real and sr, the fraction of the studies with ubb < 0.7·sbb,real was calculated. Whereas the maximum overestimation and maximum underestimation investigate extreme events, this parameter quantifies the frequency with which ubb is severely (here defined as by more than 30%) underestimated.

(Table 2: WinBUGS® program for the evaluation of homogeneity studies.)
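The four comparison parameters can be computed as follows (a sketch; empirical quantiles of the 20 ratios are used here as a stand-in for the single-sided 95% confidence limits, whose exact method of calculation is not specified above, and the function name is illustrative):

```python
import numpy as np

def comparison_parameters(u_bb, s_bb_real):
    """Summary parameters for one combination of s_bb,real and s_r,
    computed from the ratios u_bb / s_bb,real of the individual studies."""
    r = np.asarray(u_bb, dtype=float) / np.asarray(s_bb_real, dtype=float)
    return {
        "average": float(r.mean()),
        "max_over": float(np.quantile(r, 0.95)),             # upper 95 % limit
        "max_under": max(float(np.quantile(r, 0.05)), 0.0),  # capped at zero
        "frac_severe_under": float(np.mean(r < 0.7)),        # u_bb < 0.7*s_bb,real
    }
```

Capping the lower limit at zero mirrors the reasoning above that a negative between-unit standard deviation will never be introduced.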

Need for an uncertainty estimation for alternative homogeneity evaluations and practicality of a repeatability limit
The results of the compilation of homogeneity studies are given in Table 4. In 13–46% of the cases, it was impossible to calculate sbb because MSbetween < MSwithin. For method-defined properties and PCR-based methods, sbb could not be calculated in 13% and 17% of the cases, respectively. For all other fields, the fraction of studies where sbb could not be calculated ranged from 30 to 46%. This is not because of the selection of inferior methods that show high repeatability standard deviations: the median repeatability standard deviation is 3.4% for elements and 5.0% for small organic molecules, which is close to the optimum that good laboratories can achieve for a larger number of samples. Smaller repeatability standard deviations for elements can be achieved by, e.g., thermal ionisation isotope-dilution mass spectrometry. However, this method is generally not suitable for the determination of the 20–40 samples required for homogeneity studies. The reason for not being able to calculate sbb is rather the good homogeneity of the materials, with median sbb ranging from 0.5 to 1.8%. This means that the problem of negative arguments under the square root in ANOVA is real and present in a large fraction of homogeneity studies of matrix materials.
Despite the good repeatabilities of the methods used in the studies, very few of them would have met the criterion for method repeatability stipulated in ISO Guide 35. The median number of replicates required to fulfil this requirement for chemical analyses ranges from 11 to 40. Physical properties form an outlier on the low side, as the methods often are exceptionally repeatable, and PCR-based measurands form an outlier on the high side, as here the assigned values often are not based on PCR itself; consequently, the requirement combines rather small u char with the large standard deviations from PCR. Clearly, performing 10 to 40 replicate measurements on each unit in a homogeneity study is not practical and often impossible due to the limited amount of sample per unit. Even if the amount of sample per unit were sufficient for this number of measurements, and even if one could commit the resources needed (several hundred measurements per study), it is easily possible that the criterion would still not be fulfilled, as instrument drift, the time required for this number of sample preparations, etc. could increase the observed repeatability standard deviation.
These data show that the repeatability criterion stated in ISO Guide 35 is not achievable for a wide range of matrix CRMs, as either unrealistically small repeatability standard deviations would have to be achieved or an impractically high number of replicates would have to be performed.

Average estimate
The plot of the average estimates of the five evaluation approaches is shown in Fig. 3. The graphs show that A-max(s bb , s r /√n) is significantly biased towards high results, especially for high s r /s bb,real ratios. This is expected and known, as for higher s r /s bb,real ratios the standard deviation of the means mainly reflects the random variation of the averages due to repeatability. Also, as expected, A-max(s bb ,0) shows a very low average bias. This is because for a perfectly homogeneous material (s bb,real = 0), MS between < MS within in half of the cases, and the value 0 is in these cases the correct estimate. Using Bayesian statistics with an informative prior (B-inf) performs, on average, similarly well as A-max(s bb ,0). The on-average high positive bias of Bayesian analysis with a diffuse prior is somewhat surprising. On average, the positive bias is comparable to that of A-max(s bb , u* bb ), which only reflects measurement repeatability and the study setup. As is evident from the equations, the average positive bias for A-max(s bb , u* bb ) is smaller than for A-max(s bb , s r /√n) due to the inclusion of the term reflecting the degrees of freedom of s r .
The finding that the results from Bayesian analysis (both using diffuse and informative priors) are lower than those obtained using the maximum of s bb and u* bb seemingly contradicts Ref. [10]. That publication stated, for the examples investigated, that "Bayesian analysis gives a value for the between-unit standard deviation, which is not even close to u* bb ". Several factors explain this apparent contradiction: On the one hand, the ratios of s r /s bb explored in Ref. [10] range from 0.4 to 3.4, which is significantly lower than the ranges explored here, and in the majority (7 of 9) of cases, s bb was lower than s r . In addition, 6 replicates per unit were performed. As shown by the fact that s bb could be evaluated by ANOVA in all cases, there was not a single case where u* bb was larger than s bb . A comparison of the estimate for s bb and u* bb is therefore not relevant, as u* bb should not be used as an estimate of u bb in such cases.
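For reference, the three ANOVA-based estimators compared above can be written down compactly. The expression used here for the uncertainty hidden by method repeatability, u* bb = √(MS within /n)·(2/ν)^¼ with ν = N(n−1) degrees of freedom of MS within, follows its usual definition; the study setup (10 units, 3 replicates) is illustrative:

```python
import numpy as np

def homogeneity_estimates(data):
    """ANOVA-based u_bb candidates for one study; data shape: (N_units, n_repl).

    u*_bb = sqrt(MS_within/n) * (2/nu)**0.25 with nu = N*(n-1) is the
    uncertainty that can be hidden by method repeatability.
    """
    N, n = data.shape
    ms_between = n * data.mean(axis=1).var(ddof=1)
    ms_within = data.var(axis=1, ddof=1).mean()
    sbb = np.sqrt(max(ms_between - ms_within, 0.0) / n)  # 0 if MS_between < MS_within
    u_star = np.sqrt(ms_within / n) * (2.0 / (N * (n - 1))) ** 0.25
    s_means = np.sqrt(ms_within) / np.sqrt(n)            # s_r / sqrt(n)
    return {"A-max(sbb,0)": sbb,
            "A-max(sbb,u*bb)": max(sbb, u_star),
            "A-max(sbb,sr/sqrt(n))": max(sbb, s_means)}

# One simulated study on a perfectly homogeneous material (s_bb,real = 0, s_r = 2):
rng = np.random.default_rng(3)
est = homogeneity_estimates(rng.normal(100.0, 2.0, size=(10, 3)))
```

Because (2/ν)^¼ < 1 for ν > 2, the three estimates are always ordered: A-max(s bb ,0) ≤ A-max(s bb , u* bb ) ≤ A-max(s bb , s r /√n), which is the ordering of the average biases discussed above.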

Maximum overestimation
In general, both CRM producers and users are less interested in the average bias than in the bias introduced for a specific CRM. The evaluation of the maximum overestimation by the five approaches is shown in Fig. 4. The general trend for the maximum overestimation is the same as for the average estimate discussed before: A-max(s bb , s r /√n) results in the highest and B-inf in the lowest maximum overestimation. However, it is noticeable that the differences between the various approaches are less pronounced than for the average estimate. Especially for s r /s bb ratios below 6, the results from the various approaches are rather close together. This can be explained by the fact that s r also generates a random variation between units. Whenever this variation coincides with the real between-unit variation (i.e. the average of a unit with a higher true mean is increased even further by random fluctuations), a too-high estimate for s bb will result and cannot be reduced by any evaluation approach. The severity can only be mitigated by performing more measurements per unit or by measuring more units: more measurements per unit make the measured unit mean a better estimate of the real unit mean, while measuring more units reduces the probability that this enhancement occurs for all units.
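This mitigation can be illustrated numerically. The sketch below (illustrative parameters, not the paper's study designs) estimates the upper 95th percentile of the ANOVA s bb estimate over many simulated studies and shows that both more replicates per unit and more units tighten the upper tail:

```python
import numpy as np

def sbb_upper_tail(N, n, sbb_real=1.0, sr=3.0, trials=2000, seed=0):
    """Upper 95th percentile of ANOVA s_bb estimates over simulated studies."""
    rng = np.random.default_rng(seed)
    est = []
    for _ in range(trials):
        mu = rng.normal(0.0, sbb_real, size=N)               # true unit means
        data = rng.normal(mu[:, None], sr, size=(N, n))      # replicate results
        msb = n * data.mean(axis=1).var(ddof=1)
        msw = data.var(axis=1, ddof=1).mean()
        est.append(np.sqrt(max(msb - msw, 0.0) / n))
    return np.percentile(est, 95)

# More replicates per unit and more units both reduce the maximum overestimation:
base = sbb_upper_tail(N=10, n=3)
more_repl = sbb_upper_tail(N=10, n=12)
more_units = sbb_upper_tail(N=40, n=3)
```

The first variant reduces the repeatability contribution s r ²/n to the variance of the unit means; the second reduces the sampling variability of that variance estimate.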

Maximum underestimation
The maximum underestimation observed for the homogeneity studies investigated is shown in Fig. 5. In this graph, values below 1 show that a particular approach is expected to underestimate s bb,real for a certain s r /s bb,real combination in 5% of the cases. Values above 1 show that for a given s r /s bb,real combination, u bb should overestimate s bb,real in 95% of the cases. The graph shows that A-max(s bb , s r /√n) is, of all the investigated approaches, the least prone to severe underestimations of s bb,real . The minimum ratio u bb /s bb,real was 0.61, i.e. an underestimation by only 39%. This safe estimate, of course, comes at a price: not only is the average result biased (see above), but from s r /s bb,real ratios of 5 upwards, even the results closest to the real value overestimate s bb,real by a factor of about two or more.
The other extreme is, as expected, A-max(s bb ,0): only for s r /s bb,real ratios of 1 or below was MS between larger than MS within in all 20 simulated studies, leading to small to modest underestimations. At higher s r /s bb,real ratios, MS between was smaller than MS within in at least one of the 20 simulated studies, which resulted in an estimate of u bb of 0 and hence a complete underestimation of s bb,real . This results in lower confidence limits of 0 for all s r /s bb,real ratios above 1.7.
The lower confidence limit for A-max(s bb ,u* bb ) is between 0.4 and 0.6 for s r /s bb,real ratios below 5, indicating that underestimations by 40–60% are possible. The notable exception is the combination s bb = 3 and s r = 8, where the lower confidence limit is only 0.07, corresponding to an underestimation of 93%. However, it should be pointed out that this low confidence limit is caused by a single overestimation of s bb by a factor of 3, which increased the standard deviation of the ratios and hence resulted in broad confidence bands; the lowest ratio actually observed was 0.64 (corresponding to an underestimation of 36%). In most of the s r /s bb,real combinations, the calculated s bb was still larger than u* bb . This means the underestimation was caused by the random fluctuations due to measurement repeatability compensating the real between-unit variation. In all of these cases, the underestimation was still slightly less severe than when applying Bayesian statistics (see below). The finding that u* bb sometimes underestimates s bb,real partly supports the criticism of the approach raised in Ref. [7] that u* bb may underestimate the potentially hidden inhomogeneity. While these simulated data confirm this worry, they also indicate that this is only a problem for low s r /s bb,real combinations, i.e. for either very inhomogeneous materials or very repeatable methods. The maximum underestimation observed in these cases was 50%, which can have a substantial impact on the assigned uncertainty (see section "Potential impact of the under/overestimation of u bb "). On the other hand, for all s r /s bb,real ratios greater than or equal to 8, even the lower confidence limit of A-max(s bb ,u* bb ) is an overestimation of s bb,real , driven by u* bb . This indicates that for higher ratios of s r /s bb,real , u* bb is a conservative estimate of s bb,real .
As the between-unit variation is in many of the investigated scenarios hidden by the random variation, the performance of Bayesian analysis with respect to the maximum underestimation is mixed. At s r /s bb,real ratios above 20, using the diffuse prior results in all cases in moderate overestimations, whereas the use of the informative prior does not lead to an overestimation. On the other hand, using a point estimate for s bb from Bayesian analysis cannot prevent significant underestimation of u bb either. In all cases, the maximum underestimations of u bb by the Bayesian approaches were worse than those of A-max(s bb ,u* bb ): u bb was underestimated by up to about 65% with both informative and diffuse priors. The reason here is the opposite of the one discussed under "Maximum overestimation": whenever the random fluctuation of unit means, caused by method repeatability, compensates for the real variation of unit means, u bb will be underestimated. As in the case of maximum overestimation, the occurrence of underestimation can also only be reduced by performing more measurements per unit or by performing measurements on more units.
As noted above, in cases where s r /s bb >1, the posterior distributions from Bayesian analysis are very broad, and it could be argued that using the median is a poor point estimate for s bb . However, investigation of other point estimates is beyond the scope of this manuscript.
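For readers without WinBUGS, a deliberately simplified grid-based stand-in for the Bayesian evaluation is sketched below. It differs from the paper's model in that s r and the mean are plugged in from the data rather than given their own priors, and the "informative" prior is an arbitrary half-Normal; it only illustrates how a posterior median for s bb can be extracted:

```python
import numpy as np
from scipy import stats

def bayes_sbb_median(data, prior_scale=None, grid_max=10.0, m=2000):
    """Posterior median of s_bb on a grid (simplified stand-in for the
    WinBUGS model; s_r and the grand mean are plug-in estimates).

    prior_scale=None uses a flat (diffuse) prior on s_bb; otherwise a
    half-Normal(prior_scale) plays the role of an informative prior.
    """
    N, n = data.shape
    unit_means = data.mean(axis=1)
    sr2 = data.var(axis=1, ddof=1).mean()      # plug-in repeatability variance
    mu = unit_means.mean()
    grid = np.linspace(0.0, grid_max, m)
    sd_means = np.sqrt(grid**2 + sr2 / n)      # Var(unit mean) = s_bb^2 + s_r^2/n
    loglik = np.sum(stats.norm.logpdf(unit_means[:, None], mu,
                                      sd_means[None, :]), axis=0)
    if prior_scale is not None:
        loglik += stats.halfnorm.logpdf(grid, scale=prior_scale)
    post = np.exp(loglik - loglik.max())
    cdf = np.cumsum(post) / post.sum()
    return grid[np.searchsorted(cdf, 0.5)]

# Simulated study: s_bb,real = 2, s_r = 1, 10 units with 3 replicates each
rng = np.random.default_rng(7)
mu_units = rng.normal(10.0, 2.0, size=10)
data = rng.normal(mu_units[:, None], 1.0, size=(10, 3))
med_diffuse = bayes_sbb_median(data)
med_inf = bayes_sbb_median(data, prior_scale=1.0)
```

Because the half-Normal prior is monotonically decreasing, the informative-prior median can never exceed the diffuse-prior median for the same data; other posterior point estimates (mean, mode) could be read off the same grid.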

Fraction of significant underestimations
The fractions of cases where u bb was underestimated by more than 30% are shown in Fig. 6. As expected from the discussion above, the extremes are again formed by A-max(s bb ,0) and A-max(s bb , s r /√n): using A-max(s bb , s r /√n) did not result in an underestimation of u bb by more than 30% in any of the 320 simulated homogeneity studies. This confirms the suggestion made in Ref. [8] that s means is a safe estimate of u bb . On the other hand, using max(s bb ,0) as an estimate for u bb resulted in underestimations of at least 30% in half of the cases for s bb,real of 0.2 and 0.5%. The outcome was slightly better for s bb,real = 1, but even there, 35–45% of u bb were underestimated by more than 30% when s r was ≥ 3%. This shows that using max(s bb ,0) as an estimate of u bb results, in many cases, in significant underestimations of the uncertainty unless the requirement for method repeatability of ISO Guide 35 is met. It should be noted that the respective section of ISO Guide 35 (section 7.7.3) refers to this requirement, albeit only in a note. Given the high probability of underestimating the assigned uncertainty, making this requirement more prominent would seem a reasonable suggestion for a future revision of this guide.
A-max(s bb ,u* bb ) resulted in underestimations of u bb by at least 30% in 11 of the 320 simulated studies. This low rate is because the studies investigated have rather high ratios of s r /s bb,real , where at high s r , u* bb exceeds the assumed s bb,real . The situation would have been less positive for lower ratios of s r /s bb,real , where s bb can usually be calculated but u* bb becomes too small to account for the potentially hidden inhomogeneity. This rather low fraction of severe underestimations puts into perspective the concern of u* bb not being a sufficiently conservative estimate of the potentially hidden inhomogeneity raised in Ref. [7]: while the data confirm that u* bb may underestimate the true hidden inhomogeneity, they also indicate that the underestimation is limited and rather rare for the conditions studied.
Using Bayesian analysis does not mitigate the risk of severely underestimating u bb . There was hardly any difference in the percentage of significant underestimation between the use of informative and diffuse priors. For some combinations of s r /s bb,real , u bb was underestimated by more than 30% in 40% or more of the studies. For all combinations of s r /s bb,real investigated, Bayesian analysis performed worse than A-max(s bb ,u* bb ) by this measure, with underestimations in 26 (diffuse prior) and 54 (informative prior) of the 320 simulated studies.

Potential impact of the under/overestimation of u bb
The discussion above focussed on over- and underestimation of u bb . However, u bb in isolation is only relevant for non-certified reference materials, where the material information sheet must indicate the degree of homogeneity. For certified reference materials, u bb is always combined with u char and the uncertainties of long-term and short-term stability (u lts and u sts , estimated either from previous knowledge or from dedicated stability studies, as described for example in Ref. [15]) to obtain u CRM , which is calculated as shown in Equation (8).
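Assuming Equation (8) is the usual root-sum-of-squares combination of uncorrelated contributions (the standard form; the numbers below are illustrative, not taken from the paper), u CRM and its sensitivity to a misestimated u bb can be sketched as:

```python
import math

def u_crm(u_char, u_bb, u_lts=0.0, u_sts=0.0):
    """Combined standard uncertainty of a CRM, assumed root-sum-of-squares."""
    return math.sqrt(u_char**2 + u_bb**2 + u_lts**2 + u_sts**2)

# When u_bb is small relative to u_char, even a 50% change in u_bb barely moves u_CRM:
base = u_crm(u_char=2.0, u_bb=0.8)
inflated = u_crm(u_char=2.0, u_bb=0.8 * 1.5)
rel_change = inflated / base - 1.0   # relative change of u_CRM
```

In this example a 50% overestimation of u bb changes u CRM by well under 10%, mirroring the behaviour described for the majority of the certified values investigated.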
Depending on the sizes of u char , u bb , u lts and u sts relative to each other, an over- or underestimation of u bb may have a significant effect or no effect at all on u CRM . To investigate the effect of a false estimation of u bb on certified uncertainties in a real-life situation, the effect of an increase/decrease by 30% and 50% of the u bb used for the certified values investigated in the section "Need for an uncertainty estimation" was evaluated. In this case, u lts and u sts were not included in the calculation, as different CRM producers use widely different approaches to account for these uncertainties. The results, depicted in Fig. 7, seem at first glance to indicate that the effects of an overestimation of u bb are more pronounced than those of an underestimation. This is because in most cases, u bb is smaller than u char , so a further reduction has hardly any effect. For the certified values investigated here, even changes of u bb by ±50% have a rather small effect: for more than 75% of the values, the change of u CRM is below 10%. In reality, the effect is even smaller, as the JRC has, in line with the requirements of ISO 17034, the policy to include quantitative estimates of u lts and u sts in u CRM . This seemingly higher importance of overestimation, however, needs to be contrasted with the rounding of certified uncertainties. The policy for CRMs produced at the JRC is never to round down uncertainties (e.g. 0.112 is rounded up to 0.12) and to round uncertainties such that the rounding error is between 3 and 30% of the uncertainty. This means that for the cases studied, an overestimation of u bb by 30% would virtually never have resulted in an overestimation of u CRM by more than the accepted rounding error. Also, an overestimation of u bb by 40% would only in a few cases have led to such an overestimation. On the other hand, an underestimation of u bb , even by 30%, would in 12% of the cases have resulted in underestimations of u CRM by more than 15%.

Fig. 6. Fraction of cases where u bb was underestimated by more than 30%.

Fig. 7 shows the aggregate result over all analytes investigated, but the result is the same when limiting the investigation to certified element mass fractions (data not shown).

Conclusions
Given the results of the evaluation of real and simulated homogeneity studies, the following conclusions are drawn: Even when using state-of-the-art methods, the actual homogeneity of matrix reference materials, combined with the limitations on the amount of material per unit and the resources that can be put into a single homogeneity study, means that estimation of u bb by one-way ANOVA will in many cases not be possible. Also, the limit for method repeatability as given in ISO Guide 35:2017 is in many cases impossible to meet. There is, therefore, a real need for reliable approaches to deal with cases where MS between < MS within . While all the approaches investigated in this work have their specific strengths and shortcomings, it must be borne in mind that this study deliberately investigated cases where the analytical noise overpowers the signal from the between-unit variation, hence examining cases where there was not enough "data" in the numbers to perform good statistics.
Setting u bb to zero when MS between < MS within results in many cases in severe underestimations of u bb . It seems appropriate that future revisions of ISO Guide 35 should strongly emphasise that this estimate is only valid if the repeatability limit is fulfilled; otherwise, the maximum of (s bb ,0) is not a suitable estimate for the between-unit variation.
Estimating u bb as A-max(s bb , s r /√n) was found, in all cases investigated here, to guard against severe underestimations of u bb . This, however, comes at the cost of severe overestimations already at rather moderate ratios of s r /s bb , which diminishes the value of the CRMs for the users.
Estimating u bb as the maximum of (s bb , u* bb ) was found in some cases, especially at low s r /s bb ratios, to underestimate s bb , hence confirming the suspicion raised in Ref. [7]. On the other hand, it mitigated the occurrence and extent of severe underestimations. This advantage, however, comes at the price of on-average positive biases towards too-high values, albeit smaller ones than seen for A-max(s bb , s r /√n).
Using the median of the posterior distribution from Bayesian analysis as a point estimate is not a universal solution to the problem of estimating s bb when s r » s bb , because neither informative nor uninformative priors prevented severe underestimations of u bb . It might be that other point estimates derived from the posterior distribution could mitigate the risk of underestimation; this is an interesting option, which should be investigated further. The use of informative priors, however, limits the positive bias of the estimation.
A solution for cases where s r » s bb could be implementing a risk-based approach as advocated in ISO 17034: after estimating u bb (by whatever approach), u lts , u sts and u char , it can be determined what effect a severe over- or underestimation of u bb would have on u CRM . If u bb is a minor component, a large relative bias of u bb will not have a significant effect on u CRM . If u bb is significant, the reference material producer can (preferably considering past experience) decide which risk of over-/underestimating the actual uncertainty is acceptable and choose an evaluation approach that limits this effect.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.