The study of statistical methods for evaluating the comparability of routine chemistry analytes among 3 routine laboratory measurement systems in China

Background Clinical laboratory tests are important for clinicians making diagnostic decisions, and discrepancies between results may lead directly to incorrect diagnoses. We introduce several statistical methods for evaluating the comparability of chemistry analytes and for comparing the performance of different measurement systems.
Methods We used a panel of 10 fresh-frozen single-donation serum samples to assess assays for the measurement of glucose and 13 other analytes. The statistical methods comprised traditional statistical analysis, robust statistics, regression analysis and differences at medical decision levels (MDLs), and the results of each were evaluated. Twenty Chinese tertiary hospitals accredited to ISO 15189 took part in this work. The commercial random-access platforms were Olympus (8 labs), Hitachi (6 labs) and Roche (6 labs). The chi-square test was used to compare acceptable rates.
Results The statistical analysis results were as follows: (1) The coefficients of variation were between 2.8 and 3.9 %, and the slopes and intercepts of the regression functions ranged from 0.928 to 1.064 and from −0.174 to 0.630, respectively. (2) The percentage of robust z-scores between −2 and 2 was at least 90 % in every lot except lot 1 (80 %). (3) Across all MDLs, 31.7 % (19/60) of the differences were less than the optimal limit, 60.0 % (36/60) were less than the desirable limit and 65.0 % (39/60) were less than the minimum limit; 35.0 % (21/60) exceeded the minimum limit. Two laboratories (Nos. 8 and 16) were considered to show poor performance on the basis of z-scores. Ten laboratories (Nos. 4, 5, 7, 8, 9, 10, 11, 14, 16 and 19) had unacceptable measurement errors at the MDLs. The other ten laboratories (Nos. 1, 2, 3, 6, 12, 13, 15, 17, 18 and 20) can achieve mutual recognition of serum glucose results: 5 of 8 Olympus, 2 of 6 Hitachi and 3 of 6 Roche laboratories. There was no significant difference among the acceptable rates of the three measurement systems for the serum glucose assay.
Conclusions Traditional statistical analysis, robust statistics and robust z-scores, fitted linear regression equations and differences at the MDLs can be used to study the comparability and mutual recognition of clinical chemistry analytes among hospitals or laboratories in China. The mutual recognition and interchangeability of results remain in jeopardy even among tertiary hospitals in China. More work is needed to improve the interchangeability of results among clinical laboratories in China.


Background
The results of clinical laboratory tests reflect the health status of patients and provide critical diagnostic evidence for clinicians. Performing accurate and precise measurements that are comparable over time and location and across assays is essential for ensuring appropriate clinical and public health practice (Stepman et al. 2014). In the clinical laboratory field, mutual recognition is a regional agreement by which two or more laboratories agree to recognize one another's test results for the same patient within a relatively short period; it is an aim of the health system. Before mutual recognition of clinical test analytes was implemented in China, patients were subjected to the same measurements repeatedly in different hospitals within a short period. These redundant measurements not only wasted time and medical expense but also stressed patients through repeated collection of blood or other samples. If the test results of different laboratories could achieve mutual recognition, analytes would not need to be measured again within a reasonably short period. Twenty Chinese tertiary hospitals participated in this study, and 14 clinical chemistry analytes were included. These analytes (all in serum) were alanine aminotransferase, aspartate aminotransferase, alkaline phosphatase, glutamyltransferase, lactate dehydrogenase, creatine kinase, urea nitrogen, creatinine, uric acid, glucose, total protein, albumin, cholesterol and triglycerides. Serum glucose was used as the example analyte for the statistical methodology study.
In this article, the statistical methods and parameters that may be useful for inter-laboratory comparison in China were calculated and analyzed: traditional statistical analysis of raw data, robust statistics and robust z-scores, fitted linear regression equations and differences at medical decision levels (MDLs). The interchangeability of serum glucose results was evaluated as an example.

Ethics statement
The study used leftover patient samples, all of which were de-identified during collection. An appropriate amount of serum was collected from each patient sample so that sufficient volume remained for possible repeat measurements. The use of patient samples in the present study was reviewed and approved by the Ethics Committee of Beijing Hospital and Shanghai Zhongshan Hospital. The authors and the staff of the participating laboratories confirmed that all subjects had given their consent to participate in this study, even though the samples were leftover samples from the outpatient department. Our study adhered to the ethical guidelines set out by the Committee on Publication Ethics (COPE, http://publicationethics.org/).

Laboratories and samples
We performed this study with 10 fresh-frozen single-donation serum samples obtained from Shanghai Zhongshan Hospital. Serum was collected according to the CLSI protocol C37-A without filtration and with 2 U/mL human thrombin (Sigma-Aldrich) added to facilitate clotting at room temperature (Wayne 1999). The individual blood donations were tested and found negative for anti-HIV I/II, anti-hepatitis C virus and hepatitis B surface antigen. The sera were aliquoted in 2-mL portions into polypropylene vials, immediately stored at −70 °C and kept under these conditions until shipment on dry ice to the participating laboratories. The samples were required to be kept frozen until analysis. Each participant (20 tertiary hospitals) received 1 aliquot of each of the 10 samples, which was sufficient to analyze the 14 analytes twice. The manufacturers/test systems used by the participants were Olympus AU (laboratory Nos. 1, 2, 11, 12, 15, 16, 17 and 19; n = 8), Hitachi (Nos. 3, 4, 5, 7, 8 and 18; n = 6) and Roche Cobas (Nos. 6, 9, 10, 13, 14 and 20; n = 6). The homogeneity and stability of the samples were guaranteed by the National Center for Clinical Laboratories (NCCL), which prepared the fresh-frozen samples and had been approved by the China National Accreditation Service for Conformity Assessment (CNAS) for ISO 17043. The chi-square (χ²) test was used to compare the acceptable rates among measurement systems. A p value < 0.05 was considered significant.
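The comparison of acceptable rates can be illustrated with a short sketch. It applies a Pearson chi-square test of independence to the numbers of laboratories with acceptable and unacceptable serum glucose results per measurement system, as reported in this article. The function name `chi_square_independence` is illustrative, and the closed-form p value used here holds only for 2 degrees of freedom. With expected counts this small the chi-square approximation is rough, so the output should be read as indicative only.

```python
import math

# (acceptable, unacceptable) laboratories per measurement system for serum
# glucose, as reported in this article.
counts = {
    "Olympus": (5, 3),
    "Hitachi": (2, 4),
    "Roche": (3, 3),
}

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return chi2, dof

chi2, dof = chi_square_independence(list(counts.values()))
# With exactly 2 degrees of freedom the chi-square upper-tail probability
# has the closed form exp(-chi2 / 2); other cases are not handled here.
p_value = math.exp(-chi2 / 2) if dof == 2 else float("nan")
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")
```

For these counts the p value is well above 0.05, in line with the reported absence of a significant difference among the three systems.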

Traditional statistics and data treatment
All numerical results were converted to SI units. After the median, arithmetic mean, standard deviation (s), coefficient of variation (CV), minimum and maximum had been calculated, test results outside the range of arithmetic mean ± 3 s were considered outliers and eliminated.
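The data treatment described above can be sketched as follows. The example assumes a single screening pass and the sample standard deviation (n − 1 denominator), neither of which is stated explicitly in the text. The function names and the glucose values are hypothetical.

```python
import statistics

def describe(results):
    """Descriptive statistics for the results of one sample lot."""
    mean = statistics.mean(results)
    s = statistics.stdev(results)  # sample standard deviation (assumption)
    return {
        "median": statistics.median(results),
        "mean": mean,
        "s": s,
        "cv_percent": 100 * s / mean,
        "min": min(results),
        "max": max(results),
    }

def split_outliers(results, k=3.0):
    """Split results into those inside and those outside mean ± k*s."""
    stats = describe(results)
    low = stats["mean"] - k * stats["s"]
    high = stats["mean"] + k * stats["s"]
    kept = [x for x in results if low <= x <= high]
    outliers = [x for x in results if not (low <= x <= high)]
    return kept, outliers

# Hypothetical glucose results (mmol/L) from 20 laboratories for one lot.
lot = [5.0, 5.1, 5.2, 5.1, 5.0, 5.2, 5.1, 5.1, 5.0, 5.2,
       5.1, 5.0, 5.2, 5.1, 5.1, 5.0, 5.2, 5.1, 5.1, 7.5]
kept, outliers = split_outliers(lot)
print("outliers:", outliers)
print("summary after elimination:", describe(kept))
```

In this made-up lot only the value 7.5 falls outside mean ± 3 s, so the summary statistics are then recomputed on the remaining 19 results.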

Robust statistics analysis
Robust statistics (International Standard Organization 2005) emulate popular statistical methods but are not unduly affected by outliers or other small departures from model assumptions. Robustness is a property of the estimation algorithm, not of the estimates it produces; it is therefore not strictly correct to call the averages and s calculated by such an algorithm robust. To avoid excessively cumbersome terminology, the terms "robust average" and "robust s" should be understood, as in ISO 13528, to mean estimates of the population mean or of the population standard deviation calculated using a robust algorithm. The robust average and robust s were derived by iterative calculation, that is, by updating their values several times from the modified data until the process converged.
The algorithm for the robust average and robust s can be described concisely as follows (International Standard Organization 2005). Denote the submitted results of one lot, sorted into increasing order, by $x_1, x_2, \ldots, x_i, \ldots, x_p$, and denote the robust average and robust s of these data by $x^*$ and $s^*$.
Calculate initial values for $x^*$ and $s^*$ as:
$$x^* = \text{median of } x_i \quad (i = 1, 2, \ldots, p)$$
$$s^* = 1.483 \times \text{median of } |x_i - x^*| \quad (i = 1, 2, \ldots, p)$$
Update the values of $x^*$ and $s^*$ as follows. Calculate:
$$\delta = 1.5\, s^*$$
For each $x_i$ ($i = 1, 2, \ldots, p$), calculate:
$$x_i^* = \begin{cases} x^* - \delta, & \text{if } x_i < x^* - \delta \\ x^* + \delta, & \text{if } x_i > x^* + \delta \\ x_i, & \text{otherwise} \end{cases}$$
Calculate the new values of $x^*$ and $s^*$ from:
$$x^* = \frac{\sum x_i^*}{p}$$
$$s^* = 1.134 \sqrt{\frac{\sum (x_i^* - x^*)^2}{p - 1}}$$
where the summation is over $i$. The robust estimates $x^*$ and $s^*$ may be derived by an iterative calculation, i.e. by updating the values of $x^*$ and $s^*$ several times using the modified data, until the process converges. Convergence may be assumed when there is no change from one iteration to the next in the third significant figure of the robust s and of the equivalent figure in the robust average.
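The iterative procedure can be sketched in a few lines. This example assumes the constants of Algorithm A in ISO 13528 (1.483 for the initial scale estimate, 1.5 for the clipping distance and 1.134 for the updated scale estimate). As a simplification, it replaces the "third significant figure" convergence rule with a small absolute tolerance. The function name `algorithm_a`, its parameters and the data are hypothetical.

```python
import statistics

def algorithm_a(results, tol=1e-6, max_iter=100):
    """Robust average x* and robust s* by iterative clipping (Algorithm A sketch)."""
    # Initial values: the median and the scaled median absolute deviation.
    x_star = statistics.median(results)
    s_star = 1.483 * statistics.median([abs(x - x_star) for x in results])
    for _ in range(max_iter):
        delta = 1.5 * s_star
        # Pull results lying more than delta from x* back to x* ± delta.
        clipped = [min(max(x, x_star - delta), x_star + delta) for x in results]
        new_x = statistics.mean(clipped)
        new_s = 1.134 * statistics.stdev(clipped)  # stdev uses the p - 1 denominator
        converged = abs(new_x - x_star) < tol and abs(new_s - s_star) < tol
        x_star, s_star = new_x, new_s
        if converged:
            break
    return x_star, s_star

# Hypothetical glucose results (mmol/L) for one lot, with one discrepant value.
lot = [5.0, 5.1, 5.2, 5.1, 5.0, 5.2, 5.1, 5.1, 5.0, 5.2,
       5.1, 5.0, 5.2, 5.1, 5.1, 5.0, 5.2, 5.1, 5.1, 7.5]
x_star, s_star = algorithm_a(lot)
print(f"robust average = {x_star:.3f}, robust s = {s_star:.3f}")
print(f"mean = {statistics.mean(lot):.3f}, s = {statistics.stdev(lot):.3f}")
```

The discrepant value is clipped rather than discarded, so it still contributes to the estimates but cannot dominate them. This is why the robust average stays close to the median while the arithmetic mean is pulled upward.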

Creating regression equation
The median of the results for each sample lot was treated as the independent variable (X), and the arithmetic mean of a laboratory's two test results for that lot was treated as the dependent variable (Y). A regression equation was fitted for each laboratory from the lot medians and that laboratory's results: Y = kX + b, where k is the slope and b is the intercept.
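As an illustration, the fit for one laboratory could be computed as below. The text does not name the regression technique, so ordinary least squares is assumed here. The function `fit_line` and all data values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = k*x + b; returns (k, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    k = sxy / sxx
    b = mean_y - k * mean_x
    return k, b

# Hypothetical all-laboratory medians (mmol/L) for the 10 lots (X).
medians = [5.1, 6.0, 7.2, 8.1, 9.0, 10.3, 11.1, 12.0, 13.2, 14.4]
# Hypothetical duplicate results of one laboratory for the same lots.
duplicates = [(1.02 * m - 0.06, 1.02 * m - 0.04) for m in medians]
# Y is the arithmetic mean of the two results for each lot.
lab_means = [(first + second) / 2 for first, second in duplicates]

k, b = fit_line(medians, lab_means)
print(f"Y = {k:.3f} * X + {b:.3f}")
```

Because the made-up duplicates are centred on the line Y = 1.02X − 0.05, the fit recovers a slope of about 1.02 and an intercept of about −0.05.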

The differences of each MDL
The differences were calculated from the regression equation and the MDLs of serum glucose, namely 2.50, 6.67 and 10.00 mmol/L (Statland 1987). Each MDL was substituted into the regression equation as the independent variable (X) to calculate the dependent variable (Y), and the difference (%) = (Y − MDL)/MDL × 100 %. The differences at the MDLs, and hence the comparability of inter-laboratory test results, were evaluated against the optimal, desirable and minimum allowable differences derived from biological variation data (Joana et al. 2014).
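The calculation at each MDL can be sketched as follows. The allowable limits (1.17, 2.34 and 3.51 %) are the serum glucose values reported in the results of this article. Comparing the absolute value of the difference with each limit is an assumption, and the slope, intercept and function names are hypothetical.

```python
MDLS = (2.50, 6.67, 10.00)  # medical decision levels for serum glucose, mmol/L
LIMITS = (("optimal", 1.17), ("desirable", 2.34), ("minimum", 3.51))  # percent

def difference_percent(k, b, mdl):
    """Relative difference (%) between the regression prediction and the MDL."""
    y = k * mdl + b
    return (y - mdl) / mdl * 100

def grade(diff_percent):
    """Tightest allowable limit that the absolute difference stays below."""
    for label, limit in LIMITS:
        if abs(diff_percent) < limit:
            return label
    return "beyond minimum"

# Hypothetical regression coefficients for one laboratory.
k, b = 1.05, -0.10
grades = []
for mdl in MDLS:
    diff = difference_percent(k, b, mdl)
    grades.append(grade(diff))
    print(f"MDL {mdl:5.2f} mmol/L: difference = {diff:+.2f} % ({grades[-1]})")
```

With these coefficients the difference grows with concentration, so the same laboratory meets the optimal limit at the lowest MDL but exceeds the minimum limit at the highest. This shows why slope and intercept have to be judged together at each decision level.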

Calculating robust z-scores for each laboratory
The z-score is a standardized measure of laboratory bias, calculated from the assigned value and the standard deviation for proficiency assessment. In this article, the robust z-score (International Standard Organization 2005) was derived from the robust average and robust s, so the formula is $z = (x - X)/\sigma = (x - x^*)/s^*$, where $x$ is the test result, $X$ is the average of $x$, $\sigma$ is the s, $x^*$ is the robust average and $s^*$ is the robust s. A laboratory with two or more robust z-scores above 2 or below −2 among the 10 lots was considered to show poor performance, and its results could not be mutually recognized.
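A minimal sketch of this rule is given below. It assumes that the robust average and robust s of every lot are already available. The function names and all numbers are hypothetical, and only three lots are shown for brevity.

```python
def robust_z(x, x_star, s_star):
    """Robust z-score of one result against the lot's robust average and s."""
    return (x - x_star) / s_star

def warning_count(results, robust_stats, limit=2.0):
    """Number of lots in which the z-score is above +limit or below -limit."""
    return sum(
        1
        for x, (x_star, s_star) in zip(results, robust_stats)
        if abs(robust_z(x, x_star, s_star)) > limit
    )

def poor_performance(results, robust_stats):
    """Two or more warning signals mark the laboratory as a poor performer."""
    return warning_count(results, robust_stats) >= 2

# Hypothetical (robust average, robust s) pairs for three lots, mmol/L.
robust_stats = [(5.13, 0.18), (8.40, 0.25), (14.43, 0.43)]
lab_a = [5.20, 8.35, 14.60]  # hypothetical results, close to the consensus
lab_b = [5.60, 9.00, 14.50]  # hypothetical results, high in two lots

for name, results in (("A", lab_a), ("B", lab_b)):
    scores = [round(robust_z(x, xs, ss), 2) for x, (xs, ss) in zip(results, robust_stats)]
    print(name, scores, "poor performance:", poor_performance(results, robust_stats))
```

Laboratory B exceeds the limit in two lots and is therefore flagged, whereas a single excursion would only count as one warning signal.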

Traditional statistics
Only two outliers were identified, both in lot 10; all other results were within the range of arithmetic mean ± 3 s (see Table 1 for details).

The regression equations and differences of each MDL
After the regression equations had been created and each MDL substituted into them, the differences at each MDL were calculated for each of the 20 laboratories. The optimal, desirable and minimum allowable differences for serum glucose were 1.17, 2.34 and 3.51 %, respectively, calculated from the 2014 values of CVw (within-subject biological variation) and CVg (between-subject biological variation) (Joana et al. 2014). Only 31.7 % (19/60) of the differences were less than the optimal allowable difference, 60.0 % (36/60) were less than the desirable allowable difference and 65.0 % (39/60) were less than the minimum allowable difference, so more than one third (21/60) of the differences failed to meet the minimum requirement (see Table 2 and Fig. 1 for details). Ten laboratories (Nos. 4, 5, 7, 8, 9, 10, 11, 14, 16 and 19) had unacceptable measurement errors at the MDLs.
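The three allowable differences can be reproduced with a short calculation. The sketch assumes the commonly used bias specifications derived from biological variation, namely 0.125, 0.250 and 0.375 times the square root of CVw² + CVg². It also assumes CVw = 5.6 % and CVg = 7.5 % for serum glucose. Neither the factors nor the two CV values are stated in this article; they are taken as assumptions because they reproduce the reported limits.

```python
import math

def allowable_bias(cv_w, cv_g):
    """Optimal, desirable and minimum allowable bias (%) from biological variation."""
    combined = math.sqrt(cv_w ** 2 + cv_g ** 2)
    return {
        "optimal": 0.125 * combined,
        "desirable": 0.250 * combined,
        "minimum": 0.375 * combined,
    }

# Assumed biological variation of serum glucose (percent); not given in the text.
limits = allowable_bias(cv_w=5.6, cv_g=7.5)
for name, value in limits.items():
    print(f"{name}: {value:.2f} %")
```

Rounded to two decimals, these assumptions give 1.17, 2.34 and 3.51 %, matching the limits used above.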

Robust statistical results and robust Z-scores
The robust statistical results and the ranges of the robust z-scores are listed in Table 3. Across all sample lots, the robust averages ranged from 5.126 to 14.434 mmol/L, the robust standard deviations ranged from 0.179 to 0.433 and varied with the robust averages, and the z-scores ranged from −2.895 to 5.356. In lot 1 only 80 % (16/20) of the robust z-scores were within the range of −2 to 2; in every other lot the proportion was no less than 90 % (18/20). Laboratories Nos. 8 and 16 each had two or more robust z-scores outside the range of −2 to 2.

Discussion
All hospitals and laboratories are eager to perform well. One way to achieve accurate and precise measurements is to use assays that are metrologically traceable to a higher-order reference measurement system or harmonized by use of internationally recognized procedures (Vesper and Thienpont 2009; Miller et al. 2011). In Europe, the European Union Directive on in vitro diagnostic medical devices   analyte-serum total calcium, the additional cost of calibration error was 0.06-0.199 billion US dollars (Michael). Duplicate clinical laboratory tests within a short period increase patients' dissatisfaction with healthcare providers, and discrepancies in test results between different hospitals or instruments may lead directly to incorrect diagnoses or medical disputes. The CV is a normalized measure of the dispersion of quantitative data. In this study the CVs were quite small, which means that the dispersion of, and the differences among, the test results of all the laboratories were small. However, there is currently no standard for evaluating the CVs of test results from more than one laboratory; the smaller the CVs, the better the consistency of the test results. The slopes (the closer to 1 the better) were used to evaluate the proportional errors of a measurement system, and the intercepts (the closer to 0 the better) to evaluate the constant systematic errors. The performance of a laboratory cannot be evaluated by slope and intercept alone; the difference at each MDL, which can be calculated from the regression equation, is a useful indicator of laboratory performance. In this study, 10 laboratories (Nos. 4, 5, 7, 8, 9, 10, 11, 14, 16 and 19) had unacceptable measurement errors at the MDLs: 3 used Olympus AU, 4 Hitachi and 3 Roche Cobas. There was no significant difference among the acceptable rates of the three measurement systems for the serum glucose assay.
This differs from the findings of a previous study in a developed country (Stepman et al. 2014). Ten laboratories (Nos. 1, 2, 3, 6, 12, 13, 15, 17, 18 and 20) can achieve mutual recognition of serum glucose results, and their differences at the MDLs are acceptable.
In this study, the arithmetic mean, median and robust average differed little from one another in every sample lot, which indicates that the dispersion of the results was small and that their distribution was relatively concentrated.
When a participant reports a result that gives rise to a z-score above 2 or below −2, the result is considered to give a "warning signal". In this study, two or more warning signals in one laboratory were taken as evidence that an anomaly had occurred that requires investigation. On this basis, 2 laboratories (Nos. 8 and 16) were considered to show poor performance.
To achieve mutual recognition of clinical test results, clinical laboratories should conduct their daily work and management according to ISO 15189 as far as possible, perform internal quality control and external quality assessment in a timely manner, and use the same reference intervals (Wang et al. 2011). On this basis, fresh clinical samples can be used to study the comparability of results among the laboratories covered by mutual recognition. In this article, we introduced statistical methods and analyses, including traditional statistics, robust statistics and robust z-scores, and used them to create regression equations and calculate the differences of each laboratory at the MDLs. The traditional and robust statistics described the basic features of the raw data, the latter being hardly affected by outliers; the robust z-scores described the location of each test result within the overall data; and the regression equations were used to calculate the systematic errors of every laboratory at the MDLs (the medical decision levels were substituted into the regression equations to evaluate the differences at the MDLs). The methods described here are relatively intuitive and simple, and are easy to apply in office or statistical software.