Performance testing of dimensional X-ray computed tomography systems

X-ray Computed Tomography (XCT) has become a common tool for dimensional analysis as it allows for non-destructive internal and external measurements. Manufacturers often specify the accuracy of dedicated metrology XCT systems as a maximum permissible error (MPE) statement determined using workpieces, test lengths and positions that differ significantly between manufacturers. The VDI/VDE 2630-1.3:2011 guideline provides specifications and respective test methods to verify performance against these MPE statements, and has been applied in this study to evaluate four different commercial XCT systems in a uniform way. For this examination a multi-sphere test object was developed that explicitly complies with these guidelines and maximises the use of the whole measurement volume, and which has been scaled to test both high and low magnifications. The study consisted of two parts: scans according to a protocol to allow a fair comparison between all the systems, and free scans where manufacturers could show the best capabilities of their system. With these particular objects no system complied with its own MPE statement; however, sub-voxel accuracies were found. The maximum error in terms of voxels ranged from 0.16 to 0.43 voxel for low magnification (voxel size of 100 μm) and from 0.49 to 0.82 voxel for high magnification (voxel size of 14 μm) between systems. This indicates the need for greater standardisation and transparency in how accuracy statements are determined, and for more directed protocols for testing the performance of a system. In particular, due to the extreme range of measurement volumes and voxel sizes/resolutions XCT systems are capable of, it is demonstrated that it may be reasonable to consider an MPE dependent on voxel size.


Introduction
X-ray Computed Tomography (XCT) has become a common imaging tool in different research fields, from materials science and geosciences to biology, due to its unique ability to inspect the internal geometry of an object without disassembly or destruction. Measurements on these volumes have been performed since its inception as a medical tool to quantify the size of tumours [1]. In the last few decades the accuracy and uncertainty of measurements have been thoroughly investigated in attempts to establish XCT as a metrological tool [2][3][4][5][6]. Since their introduction, the dedicated metrology industrial XCT systems on the market have improved their precision by, for example, adding temperature control to the cabinets, high-precision manipulators and calibration processes of varying complexity [7]. This broadens industrial XCT applications from visual inspection to dimensional measurements, ranging from fitting geometries and measuring distances and alignment to more complex computer-aided design (CAD) surface comparisons [8].
There is a lack of standardisation in how to use XCT, as noted by several authors [9,10], and this is made abundantly clear in inter-laboratory comparisons. An early study showed that sub-voxel accuracy as low as 0.05 voxel for uni- and bi-directional length measurements is possible with X-ray CT systems, but this was only achieved by a few participants, with some as poor as 5 voxels at the other end of the scale [11]. Beyond this study there have been many more round robin tests with specialised test objects; examples include industrial items [12], texture and additively manufactured (AM) parts [13], virtual assembly [14] and metal AM objects [15]. All show the operator influences the measurement results through the choice of acquisition parameters. Beyond operator error, the accuracy and alignment of the CT system components have a significant impact, and a sizeable research effort is being made to reduce and correct these errors. It is important to know the source-object-distance (SOD) and source-detector-distance (SDD) accurately to calculate the correct magnification and voxel size, and to have an accurate reconstruction model. This means alignment of all components is key to reducing geometrical errors, for example: the source and detector being aligned perpendicularly [16,17], the motion of the object stage [18] and its rotation axis perpendicularity and wobble [19], detector alignment in roll, pitch and yaw [20], detector flatness/waviness [21], focal spot drift [22,23] and temperature induced target displacement [24]. All these studies are aimed at reducing and compensating for errors to improve the accuracy of XCT systems. This research is often performed on commercial systems, but there are also examples of home-built systems such as METAS-CT [25], a system developed for sub-micrometre precision and high long term stability.
These improvements and corrections will most likely be implemented in future commercial XCT systems or software updates to increase the accuracy.
A metric for the metrological performance is a Maximum Permissible Error (MPE) statement, which differs from the evaluation of the measurement itself [26]. ISO 14253-1:2017 [27] and the International Vocabulary of Metrology (VIM) [28] define the MPE as the "extreme value of measurement error, with respect to a known reference quantity value, permitted by specifications or regulations for a given measurement, measuring instrument, or measuring system". Manufacturers provide such statements to give users a specification for their measurement. These statements can be verified using the VDI/VDE 2630-1.3:2011 guideline for the application of ISO 10360 to CMMs with CT sensors [29]. An ISO 10360 standard for CT is under development (currently ISO 10360-11 DIS [30]). However, the standards allow for broad interpretation and clearly do not take into account the XCT capability of internal measurements [31].
When considering an XCT metrology system it would be natural to immediately compare the stated MPEs. With this said, users should be wary that there is no common standard that prescribes how manufacturers have to determine the metrological performance of their XCT systems. This makes comparing machines from different manufacturers harder, as the critical reader would assume that a particular system is tuned/calibrated to provide the best MPE result for a workpiece that complies with VDI/VDE 2630-1.3:2011 [29]. Further, the level of detail in the MPE statements, particularly the operating conditions for which they are valid, varies significantly, as this is also not prescribed.
There are a few publications that show MPE test results [32,33], but they do not verify the performance of multiple systems tested with the same object, and the objects used are designed by the systems' manufacturers, with a selection shown in Fig. 1. In both these publications the systems use a configuration of spheres shown in Fig. 1b and d, and they pass the verification. Moroni and Petrò [34] perform a verification test on two different systems, but the metrology system does not have an official MPE statement and the other system is not a metrology grade system. The international comparison by Carmignato [11] compares a few measurements against provided MPEs: only 2 out of 6 participants were within the margin, but no full verification test was performed and not all MPE statements were defined according to a standard.
This study quantitatively compares the performance of four different "metrology grade" XCT systems according to VDI/VDE 2630-1.3:2011 [29] using the same workpieces. The aim is not to single out a specific supplier, but to inform the community of the current capability when all systems are held to the same standard with the tests performed in the same way. It is quickly identified that none of the systems meets its MPE according to this particular VDI/VDE 2630-1.3:2011 [29] compliant setup. What is important, however, is that the results are good, with measurements within fractions of a voxel. Following an interrogation of the results the authors suggest routes forward for manufacturers and users in the future development of these particular systems.

Method
In this study four XCT systems had their MPE statements tested according to the VDI/VDE 2630-1.3:2011 guideline [29] using the same reference object to compare their performance. This section details the object design, the tested systems and the test procedure used.

Performance testing
An MPE according to VDI/VDE 2630-1.3:2011 [29] is stated in the following format:

E_MPE = ±(A + L/B) μm    (1)

where A and B are constants to be determined for the system, and L is the measured length in mm. Manufacturers can specify this further for specific measurement tasks, for example by limiting the dimensions or the material of the object. Formally the E_MPE is based on the length measurement error, which holds for bidirectional measurements. According to VDI/VDE 2630-1.3:2011 [29] the length measurement error is calculated by:

E = (L_ka − L_kr) + P_S + P_F    (2)

with L_ka the measured length, L_kr the calibrated length, P_S the probing error size and P_F the probing error form. The probing error size is the difference between the calibrated and measured diameter, and the probing error form is the range of distances between points on the surface and the center of the fitted sphere. The length measurement error should be determined using bidirectional test lengths, but sphere distance measurements are unidirectional, as center-to-center measurements between spheres do not consider probing errors. This results in the following formula for the sphere distance error (VDI/VDE 2634-2:2012 [36]):

S_D = L_ka − L_kr    (3)

An MPE for S_D is stated as SD_MPE. Although unidirectional measurements do not take into account the probing error, they have the advantage of being insensitive to material influences [2].
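As a minimal sketch, the sphere distance error of equation (3) and a check against an SD_MPE statement of the form A + L/B can be expressed as follows; the constants A = 4.5 μm and B = 75 are hypothetical placeholders, not values from any manufacturer in this study:

```python
# Sketch of the sphere distance error (Eq. 3) and an SD_MPE check.
# The constants a_um and b are hypothetical placeholders.

def sphere_distance_error(measured_mm, calibrated_mm):
    """S_D = L_ka - L_kr (unidirectional, center-to-center), in mm."""
    return measured_mm - calibrated_mm

def sd_mpe_um(length_mm, a_um=4.5, b=75.0):
    """SD_MPE = A + L/B in micrometres, with L in mm."""
    return a_um + length_mm / b

def complies(measured_mm, calibrated_mm, a_um=4.5, b=75.0):
    """True if |S_D| is within the SD_MPE for this length."""
    error_um = abs(sphere_distance_error(measured_mm, calibrated_mm)) * 1000.0
    return error_um <= sd_mpe_um(calibrated_mm, a_um, b)

# A 100 mm calibrated length measured as 100.004 mm (4 um error)
# against a limit of 4.5 + 100/75, roughly 5.8 um:
print(complies(100.004, 100.0))  # True
```

The same comparison, repeated for every test length and every averaged repeat, is what the verification test below amounts to.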
VDI/VDE 2617 Part 13 and VDI/VDE 2630 Part 1.3 are guidelines for the application of the ISO 10360 standards to CMMs with CT sensors [29]. They describe the requirements for an acceptance test to determine whether a machine meets its specified characteristics. An ISO 10360 standard for CT is currently in development (ISO 10360-11 DIS [30]). The VDI/VDE 2630-1.3:2011 guideline describes the specifications with which the 35 test lengths required by DIN EN ISO 10360 have to comply. In seven spatial directions, five (approximately) uniformly spaced test lengths have to be measured three times each, resulting in 105 measurements. The longest length should be at least 66% of the space diagonal and the shortest length should be below 30 mm. This should be performed at at least two distinctly different magnifications, preferably at the extremes of system operation.
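The counts and limits above can be checked programmatically; the sketch below encodes the VDI/VDE 2630-1.3:2011 requirements as described (7 directions, 5 lengths each, longest at least 66% of the space diagonal, shortest below 30 mm), with hypothetical example lengths:

```python
# Hypothetical compliance check for a candidate set of test lengths against
# the VDI/VDE 2630-1.3:2011 counts and limits described above.

def check_vdi_lengths(lengths_by_direction_mm, volume_diagonal_mm):
    """Return (ok, messages) for 7 directions x 5 lengths, with the longest
    length >= 66% of the space diagonal and the shortest below 30 mm."""
    msgs = []
    if len(lengths_by_direction_mm) != 7:
        msgs.append("need 7 spatial directions")
    if any(len(ls) != 5 for ls in lengths_by_direction_mm):
        msgs.append("need 5 test lengths per direction")
    all_lengths = [l for ls in lengths_by_direction_mm for l in ls]
    if max(all_lengths) < 0.66 * volume_diagonal_mm:
        msgs.append("longest length below 66% of the space diagonal")
    if min(all_lengths) >= 30.0:
        msgs.append("shortest length not below 30 mm")
    return (not msgs, msgs)

# Hypothetical lengths for a volume with a 250 mm space diagonal; measuring
# each length three times gives the 105 required measurements.
ok, msgs = check_vdi_lengths([[10, 60, 110, 160, 210]] * 7, 250.0)
print(ok)  # True
```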
The ISO 10360-11 standard for XCT verification has been in development for nearly a decade, and at the time of writing the most recent draft (ISO 10360-11 DIS [30]) contains similar ideas in terms of the number of measurement lengths, repeats and magnifications. The requirements on the workpiece for the required lengths and arrangements have changed a number of times over the years, so they will not be discussed in detail here, as they will likely change again prior to publication. There are, however, a few mainstay conditions that are more stringent than VDI/VDE 2630-1.3:2011 [29]: the longest test length is ≥ 85% of the diagonal of the measurement volume, the shortest test length ≤ 20% of the diagonal of the measurement volume, and the height of the 3D reference standard should be ≥ 1/3 of the measurement volume diameter. It is not certain these requirements will remain in the ISO 10360-11 FDIS.

Object design
A broad range of test objects have been used as reference standards; an overview can be found in Obaton et al. [37]. It shows reference objects ranging from step cylinders and multi-sphere standards to cylindrical assemblies and hole plates, developed for different aims, but not all can be used alongside the VDI/VDE 2630-1.3:2011 standard [29]. For example, a simplistic non-compliant standard worth highlighting for its minimal construction and cost is a tetrahedral design with four aluminium spheres by Léonard et al. [38], where the center-to-center distances are known as the spheres sit in contact. Typically, the objects are some derivative of ball plates [34], hole plates [2] or 3D multi-sphere standards [5], as has been chosen for this study.
To test and compare the performance of the different systems a new test object was designed and manufactured following the VDI/VDE 2630-1.3:2011 guideline [29], consisting of spheres mounted on ceramic rods. This arrangement can be scaled to create objects specifically for different measurement volumes, where it is intended that the test lengths are contained in a region that occupies the whole diameter of the volume and 80% of a square detector height. Note that the measurement volume is not the whole detector height: the cone beam results in missing information in the corners of the detector/volume, with the size dependent on the source detector distance and the opening angle of the source. Although this varies per system, the height of the measurement volume is often approximately 80% of the detector height. Under this assumption (and all scalings) the object occupies the entire measurement volume height, and the shortest and longest test lengths are approximately 10% and 85% of the measurement volume diagonal respectively. In contrast to previously published work [32,33,35], this object covers a larger measurement volume and has both shorter and longer test lengths than other workpieces (for example Fig. 1c and d). However, the use of longer sphere stem lengths could result in worse calibration uncertainties with a CMM probe. Therefore in this design ceramic rods were chosen over carbon fibre rods, as their higher Young's modulus (a measure of stiffness) minimises bending during the CMM measurement with a low-force probe.
This object was manufactured in two sizes to be able to perform tests at two extreme magnifications and therefore measurement volumes. The small object is intended for a 28 mm diameter measurement volume (sphere diameter 2 mm) and the large object for a 200 mm diameter measurement volume (sphere diameter 15 mm). Both are sealed with a PMMA enclosure, as seen in Fig. 1d, to prevent damage and maintain calibration. The spheres are mounted at a sufficient height such that the aluminium base is not in the field-of-view. Both objects adhere to the VDI/VDE 2630-1.3:2011 [29] requirements for the maximum and minimum test lengths, and also to the draft of ISO 10360-11 at the time of writing (DIS) [30]; the shortest and longest lengths are 4.5 mm and 28 mm for the small object and 28 mm and 203 mm for the large object. The VDI/VDE 2630-1.3:2011 requirement of the test lengths covering the measurement line in approximately uniform steps is stretched with this design, with steps ranging from 4 mm to 6 mm and from 30 mm to 60 mm for the small and large objects respectively. The thermal expansion coefficient of aluminium is relatively high at 23.4 × 10⁻⁶ °C⁻¹ [5], but it only has an effect in the horizontal plane; for the longest lengths this results in an increase of 0.45 μm °C⁻¹ and 3.36 μm °C⁻¹, and for the shortest lengths 0.04 μm °C⁻¹ and 0.34 μm °C⁻¹, for the small and large objects respectively. Both objects were measured using a CMM at 20 °C to provide reference length measurements, so the uncertainty in the reference lengths is an order of magnitude smaller than the voxel sizes resulting from the CT scans. The CMM measuring accuracy was verified in accordance with ISO 10360-2:2009 [39], with an expanded measurement uncertainty (k = 2) of 1.0 μm + (L/mm)/625 μm as stated by the manufacturer.
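The thermal sensitivity figures above follow from ΔL = α·L per °C applied to the horizontal component of each test length; a sketch, where the horizontal lengths to pass in depend on the object geometry and are not reproduced here:

```python
# Linear thermal expansion per degC of the horizontal component of a test
# length, as used for the sensitivity figures above.

ALPHA_AL = 23.4e-6  # thermal expansion coefficient of aluminium, 1/degC [5]

def expansion_um_per_degC(horizontal_length_mm):
    """Delta-L = alpha * L per degC, converted from mm to micrometres.
    Only the aluminium base expands, so the vertical (ceramic rod)
    component of a test length is excluded."""
    return ALPHA_AL * horizontal_length_mm * 1000.0

print(round(expansion_um_per_degC(100.0), 2))  # 2.34 um/degC for 100 mm
```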

Systems
This study includes five different dedicated metrology XCT systems, all of which have a source that can reach at least 225 kV. The systems are manufactured by Diondo, Nikon, Waygate Technologies, YXLON and Zeiss. To concentrate on how the statements relate to the measured performance and to remove any manufacturer bias, the systems are anonymised for the remainder of the paper.
The detector width and whether the SDD is variable or fixed can be found in Table 1, with the main difference being the detector width. Two of the systems come with a calibration tool which can be used to automatically calibrate the geometry with a pre-scan: System C has a ball bar consisting of two ruby spheres on a carbon rod and System D has a VDI/VDE 2630-1.3:2011 [29] compliant ruby ball plate object. Table 1 also shows the specifications in terms of MPE values for the bidirectional length measurement error E (according to VDI/VDE 2630-1.3:2011) and the sphere distance error S_D. Observe that the SD_MPE specification is not a VDI/VDE 2630-1.3:2011 compliant statement, but it is frequently used by manufacturers. In this study the SD_MPE specifications have been tested; the E_MPE has not been evaluated. Only one manufacturer limited the operation of the system for which these MPEs are valid, restricting the largest measured lengths.
Table 1: Accuracy statements of the tested systems; System E was excluded from the data analysis.

Test procedure
A single copy of both objects was sent to the different manufacturers in a round-robin style with instructions on how to perform the scans. They were asked to perform two tests on each object: test 1 with approximately prescribed parameters for a fair comparison between different systems and (the optional) test 2 with full freedom to demonstrate the capability of their system. Within each test, each object is scanned three times as per the VDI/VDE 2630-1.3:2011 guidelines [29] and all motor axes must be moved to the extremes between acquisitions. The motor movement between acquisitions is a specific requirement of ISO 10360-11 DIS [30]. Not all systems have an air-conditioned enclosure, but all machines were operated in a temperature controlled room (20 °C ± 1 °C) and the objects were kept for at least 24 h in these rooms before scanning to avoid thermal expansion due to temperature differences during transport. The protocol for test 1 had the following main requirements:
• The object was positioned and aligned in the scanner in the same way.
• The detector had to be truncated to 2000 × 2000 pixel mode if it is larger than this, with no pixel binning, resulting in voxel sizes of approximately 14 μm and 100 μm for the small and large object respectively.
• Calibration tools were allowed, but only if their use is part of the normal process for each scan; e.g. a full calibration that is only performed periodically was not allowed.
• Scan parameters such as voltage, power, exposure and filtration were guided by a standard scanning protocol that optimises the image quality and acquisition time [40]. This avoids unnecessarily long scans and is more representative of regular scanning in industry.
• The acquisition mode had to be continuous and not step and shoot.
• Only purely axial scans were allowed, no helical scanning for example.
In test 2 manufacturers could, for example, use a 3000 × 3000 pixel detector mode, increase the number of averages and set acquisition parameters such as voltage, power and exposure according to their own protocol. Every time the samples were returned the CMM reference measurements were verified before sending to the next manufacturer. After verification, test lengths relating to one sphere of the large object and three spheres of the small object were disregarded in the further analysis. Lengths involving these spheres had S_D's that indicated significant movement compared to the reference measurements, with differences at least an order of magnitude greater than the S_D's of similar measurement lengths. While this is not ideal, and is arguably no longer compliant with VDI/VDE 2630-1.3:2011 [29], it is the nature of shipping the objects internationally and the best possible within time constraints.

Data analysis
The reconstructed data was analysed with the Avizo for Industrial Inspection 2020.2 XMetrology Extension (Thermo Fisher Scientific, MA, USA), software for visualisation of and computation on 3D data sets, without any additional pre-processing. The surface was determined using the adaptive sub-voxel algorithm, then sphere geometries were fitted to the 22 spheres using the least-squares method. All 231 center-to-center distances between spheres were measured rather than just the 35 required for the performance test, with distances relating to spheres that failed re-verification ignored as discussed in section 2.4. With this considered, the small object scans resulted in 28 selected test lengths and a total of 172 lengths, and the large object in 33 selected test lengths and a total of 210. The three repeated measurements for each length were averaged and compared against the CMM measurement, and the error was determined according to equation (3). The test 2 results of System C are included, but both objects were only scanned once instead of the prescribed three times.
An expanded uncertainty with a confidence level of 95% (k = 2) has been calculated for every averaged length measurement. Following Villarraga et al. [5] the combined standard uncertainty is given by:

u_c = √(u_p² + u_w² + u_b²)    (4)

where u_p is the repeatability, u_w the uncertainty related to the workpiece (here thermal expansion due to temperature variations) and u_b the standard uncertainty of the bias. The repeatability is calculated by:

u_p = t_{1−a,ν} · s_x/√n    (5)

with t_{1−a,ν} the coverage factor of 1.32 for a coverage level of 68.3% of the Student's t-distribution with 2 degrees of freedom, s_x the standard deviation of the length measurements and n the number of repeats.
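A sketch of this uncertainty budget, assuming three repeats and the stated coverage factor; u_w and u_b are supplied by the caller, as their evaluation depends on the specific workpiece and reference data:

```python
import math

# Sketch of the expanded uncertainty budget following Villarraga et al. [5].
# u_w (workpiece/thermal) and u_b (bias) must be supplied by the caller.

T_COVERAGE = 1.32  # Student's t, 68.3% coverage, 2 degrees of freedom
K = 2.0            # expansion factor for ~95% confidence

def repeatability(s_x_um, n=3):
    """u_p = t * s_x / sqrt(n) for n repeated length measurements."""
    return T_COVERAGE * s_x_um / math.sqrt(n)

def expanded_uncertainty(s_x_um, u_w_um, u_b_um, n=3):
    """U = k * sqrt(u_p^2 + u_w^2 + u_b^2), all inputs in micrometres."""
    u_p = repeatability(s_x_um, n)
    return K * math.sqrt(u_p ** 2 + u_w_um ** 2 + u_b_um ** 2)
```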

Results
The full scan parameters for each system per test and object can be found in Table 2. The results of test 1, with the prescribed parameters, are shown in Fig. 3 and of test 2, with full freedom, in Fig. 4. Within these figures the plots of error against test length are on the left, with lines for the SD_MPE as provided by the manufacturers. To the right is a histogram of the measurement error distributions of all measured lengths. Fig. A1 in the appendix provides additional graphs with only the small object so the errors can be seen more clearly. Filled data points represent one of the 35 test lengths required in the VDI/VDE 2630-1.3:2011 [29] MPE verification test and unfilled data points represent an additional test length on the object not required for verification. The compliance rates are shown in Table 3 and Table 4, separated for the two test objects, where the latter takes into account the calculated expanded uncertainty, resulting in higher compliance rates. It was not possible to perform remeasurements for data points outside the specification.
To compare the verification results, a "best fit" MPE line excluding test value uncertainty has been plotted, calculated by a modified method of least squares in which the distances of the points to this line were minimised with the restrictions that the line encompasses all points relevant to the verification test and has a gradient equal to or larger than zero. In Table 5 these fitted SD_MPE's can be found together with an alternate line of fixed gradient 1/50 that again encompasses all points, chosen as it is the least strict gradient in the specifications provided by manufacturers. Both fits are based on the absolute values of the averaged results of the selected test lengths of the large object; test value uncertainty was not included.
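The constrained fit can be sketched as a search over non-negative gradients, where for each gradient the smallest intercept that still encompasses all points is taken; the grid search below is an arbitrary implementation choice, not the modified least-squares solver used in the study, and the data points are hypothetical:

```python
# Sketch of the constrained MPE line fit: the line a + g*L must lie on or
# above every |error| point, with gradient g >= 0.

def covering_line(lengths_mm, abs_errors_um, gradient):
    """Smallest intercept such that intercept + gradient*L >= |error|
    for all points."""
    return max(e - gradient * l for l, e in zip(lengths_mm, abs_errors_um))

def best_fit_mpe(lengths_mm, abs_errors_um, max_gradient=0.2, steps=2001):
    """Grid search over non-negative gradients; each candidate line is
    scored by the summed vertical distance from the points to the line."""
    best = None
    for i in range(steps):
        g = max_gradient * i / (steps - 1)
        a = covering_line(lengths_mm, abs_errors_um, g)
        score = sum(a + g * l - e for l, e in zip(lengths_mm, abs_errors_um))
        if best is None or score < best[0]:
            best = (score, a, g)
    return best[1], best[2]  # intercept (um), gradient (um per mm)

# Hypothetical points lying exactly on the line 2.0 + L/20:
intercept, gradient = best_fit_mpe([10, 50, 100], [2.5, 4.5, 7.0])
print(round(intercept, 3), round(gradient, 3))  # 2.0 0.05
```

The fixed-gradient variant reduces to a single `covering_line` call with gradient 1/50.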
The results of System E were excessively poor, with measurement errors exceeding the voxel sizes, and it was not possible to obtain replacement scans. These results are likely due to poor alignment of the system geometry and highlight the importance of maintaining and confirming the system calibration. The results of System E are therefore discounted from the analysis, as they would clearly be rejected in any basic quality check and do not add to the outlook of current, commercially available metrology capabilities. The results of System E can be found in Fig. A2 in the appendix.

Performance test
As stated, none of the systems passed the VDI/VDE 2630-1.3:2011 [29] performance test for both objects. When considering the small object at high magnification, Systems B and C are within specification, and while Systems A and D do not conform they are close when the test value uncertainty is taken into account. The differences between systems are more pronounced for the test with the large object, where they all failed verification; System D has the best result with a compliance of 91% and System C the worst with 36%.
It has to be noted, especially with the disparity in results between both workpieces, that the large object measurements have the greater influence on compliance when taking into account all measurement lengths. Systems A, B and D have a specification about two times tighter than System C; however, while the compliance of System C is high for the small object, it ranks last for the large object. This is likely because the specified S_D gives an accuracy of 9 μm at best, as per the manufacturer stated S_D intercept, which is more than 60% of the voxel size compared to 36% or less for the other systems.
Another option to compare the systems is the maximum absolute error. The largest absolute error of a selected test length is 43 μm at a test length of 202 mm (System C), followed by 22 μm at 202 mm (System B) and 19 μm at 155 mm (System A). System D performs the best in this context, with a maximum absolute error of 16 μm at a test length of 67 mm over all lengths and 14 μm at a test length of 88 mm for the selected verification test lengths. All these lengths are from the large test object and the errors are all below half a voxel. Fig. 3 visualises the results of the different systems. It shows most systems have some bias, but it is arguably minimal in System D, where the histograms for the small and large object are centred around 2 μm and 4 μm respectively. System C shows a significant bias that is, remarkably, positive for the small object and negative for the large object. System B shows a bias for the large object, but nowhere near as consequential as the spread of System C, and the data of the small object does not indicate any bias. The histograms of System A have a lot of overlap, but the peaks have minute opposite biases and a greater spread of values than the similarly well centred histograms of System D. This would indicate that the calibration tool and procedure used in the standard process of System D is very effective and that System C was far from optimally calibrated at the time of this test.
To compare the verification results of the systems, two fits of the large object data in Fig. 3 were made: one that is a best fit and one that has a fixed gradient of 1/50, both encompassing all points and minimising the distance from the data points to the line. The resulting linear functions are given in Table 5 for comparison against the manufacturer specified MPE. The test value uncertainty has not been taken into account for the fitting. Most noticeable for the best fit S_D's is the range of gradients, as one would predict from the varying degrees of measurement bias previously identified. The best fit of System D is arguably independent of the length in the measurement volume, whereas the largest gradient, from System C, gives an increase of 18 μm per mm of the test length. When comparing the results based on the fixed gradient fit, System D still performed the best, although the intercept, and therefore the minimum error, is still a factor 3 higher than specified by the manufacturer. The results imply a gradient of 1/50 or lower, as stated by manufacturers, is only feasible when the intercept value is increased. This may be viewed as poor when considering the smaller resolutions of a system, since voxel sizes could be smaller than the intercept, but further discussion on this interpretation is left for post-analysis. The acquisition settings show only a few differences between the systems outside of the prescribed settings, the main one being the SDD. Systems A and C use a comparable SDD of about 1200 mm, System B a longer SDD of 1448 mm and System D a shorter distance of 782 mm.
Table 2: An overview of the scan settings used for each system per test (T1 or T2) and each object (large or small). Data of rejected System E not shown. System C performed a calibration (10 min) before test 2 and System D used an automated voxel calibration.
This has two implications. First, a shorter SDD could increase cone beam artefacts and thus decrease the measurement volume as a percentage of the detector height. This would be an advantage to System B compared to the other systems and a disadvantage to System D, but it is not likely to have had any significant effect on the results. Second, the photon intensity at the detector is inversely proportional to the square of the SDD, which impacts acquisition times. System D has the shortest acquisition time, 6 min for the large object and 11 min for the small object, and is interestingly the most accurate when considering the best fit SD_MPE. Note this excludes the automated voxel calibration, which takes approximately 12 min but is only performed once before each test. The scans of System B took the longest in test 1, with 1 h 52 min for the large object and 3 h 34 min for the small object, 1.75 times longer than the next nearest time (System A). The SDD of course only partially accounts for the time differences, but it is the most immediately obvious reason from the information provided. The important takeaway for users is that there does not seem to be a linear relation between acquisition time and performance.
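The inverse square relation can be illustrated with the SDD values quoted above; the function name is an illustrative choice:

```python
# Inverse square law for detector flux versus source-detector distance
# (SDD): flux scales as 1/SDD^2, so halving the SDD quadruples the flux.

def flux_ratio(sdd_a_mm, sdd_b_mm):
    """Detector flux at SDD = sdd_a relative to the flux at SDD = sdd_b."""
    return (sdd_b_mm / sdd_a_mm) ** 2

# System D (782 mm) receives roughly 3.4 times the flux of System B
# (1448 mm) for the same source power, shortening exposures accordingly:
print(round(flux_ratio(782, 1448), 2))  # 3.43
```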

Operator influence
All systems followed a protocol for test 1 to allow a fair comparison while maintaining a normative scanning process that should be valid for their S_D statements (since no restrictions were stated). For all four systems the measurement errors derived in test 1 exceed their specified accuracy, with these differences increasing in test 2, as seen in Fig. 4. One would expect that a manufacturer operator specifying the parameters in test 2 would increase accuracy, or at least maintain it, but it appears the issues of inter-operator repeatability mentioned earlier still exist. This is considered in some detail below.
The results of test 2 for System A are shown in Fig. 4a and do not show an improvement compared to test 1. The main difference between them is the use of the 3000 × 3000 pixel detector mode, decreasing the voxel sizes to 12 μm and 69 μm for the small and large object respectively. Greater compliance has been achieved for the large object (from 70% to 76%), but the bias for the smaller object has been exacerbated, as shown by the histogram.
The operator of System B chose to use the 3000 × 3000 pixel detector mode in test 2 but (surprisingly) also 2 × 2 binning, effectively a 1500 × 1500 pixel detector mode, which explains the lower number of projections. This leads to lower resolutions, with voxel sizes of 18.6 μm and 135 μm. Again, the manufacturer selected parameters have led to a notable decrease in accuracy: errors approach the voxel size for the longest lengths of the small object. The test value uncertainty was larger than for the previous test and the other systems due to more variation between the measured values of the three repeats.
Table 3: Fraction of the selected test lengths within the specification; in brackets the fraction of all measured lengths within the manufacturer specification. The test value uncertainty was not considered.
Table 5: SD_MPE's fitted to the S_D's of the selected test lengths of the large object; first a best fit, second with a fixed 1/50 gradient. System C test 2 based on a single scan.
For System C the results of test 2 are based on one dataset for each object instead of three; this was a choice of the operator even though three scans per object were requested. The calibration tool that is supplied with this system was not used for test 1 for reasons unknown, although it is part of the standard operating procedure; potentially its use is not fully automated, as was required for use in test 1. After some of the least compliant results in test 1, one would expect a significant improvement. Unfortunately, in test 2, where the calibration tool was used, the bias has not decreased, nor has the gradient of the fitted S_D lines, and again the overall results are worse. The operator of System D is the only one that chose to switch from continuous scanning mode to step and shoot mode for test 2.
Combined with an increased number of averages, this results in a factor-five increase in acquisition time (32 min and 53 min for the large and small object respectively), which is still fast compared to the other systems. The calibration tool was used once before performing all scans, as in test 1. The extra time spent acquiring the projections did not result in the system now passing the performance test.

Errors as a function of voxel size
MPE statements in millimetres are used when assessing these metrology CT systems to make them easily comparable and compatible with the language of CMMs in ISO 10360. The reality, however, is more complex, as the voxel size, and therefore the geometric resolution of CT as a measurement system, can vary significantly compared to CMMs: for a fixed detector size, the voxel size is directly proportional to the measurement volume diameter. While it remains true that shorter lengths in general have smaller absolute errors, a length that is a "short" test length in one measurement volume will likely have an even smaller absolute error in a more compact measurement volume, because in that volume, where it is a comparatively "long" test length, the voxel size is smaller. This assumes the positions of the two measurement volumes are equally well calibrated, which is arguably unlikely, as small measurement volumes are more sensitive to alignment errors. Further, because large measurement volumes use a lower resolution but allow a greater range of measurements, the magnitude of the errors on an MPE graph is visually skewed such that the largest measurement volume dominates the results. This also holds for the E MPE and SD MPE calculations, with the larger measurement volumes largely controlling the bounds. By considering measurement errors and test lengths in terms of the relative error, in voxels, this geometric resolution bias is removed.

Fig. 5 shows the previous results scaled by voxel size. The first positive observation is that all systems can give sub-voxel results. Where previously System D performed best in terms of the largest absolute error for the selected test lengths, this analysis shows that System B had the lowest relative error. Before, both in terms of compliance and absolute error, the small object gave the best results; in the context of voxel size it is the large object.
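This rescaling is a simple normalisation, sketched below; the two voxel sizes match the low and high magnifications used in this study, while the 10 μm error is purely illustrative and not taken from any of the systems tested:

```python
def error_in_voxels(error_um, voxel_size_um):
    # Express an absolute length-measurement error relative to the voxel
    # size, removing the geometric resolution bias between volumes
    return error_um / voxel_size_um

# The same 10 um absolute error is modest at low magnification (100 um
# voxels) but dominant at high magnification (14 um voxels)
print(error_in_voxels(10.0, 100.0))  # 0.1 voxel
print(error_in_voxels(10.0, 14.0))   # ~0.71 voxel
```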
Comparing the maximum relative errors for the large object, System D has the lowest at 0.16 voxel, followed by System A and System B at 0.25 voxel and 0.27 voxel respectively; System C has the highest at 0.43 voxel, more than twice the best result. The smaller object showed much greater maximum errors, the smallest being System B at 0.49 voxel. The other maximum relative errors are 0.66 voxel for System D, 0.7 voxel for System A and 0.82 voxel for System C.
The conversion to voxel size especially allows a fairer comparison of test 1 and test 2, where some manufacturers used larger detector areas and different voxel sizes. The voxel size decreased for System A and increased for System B, with effective detector widths of 3000 and 1500 pixels respectively, which results in the elongated and compressed test length axes in the figure. In both cases the relative error for the large object remained similar: System A increased slightly from 0.25 voxel to 0.35 voxel and System B decreased slightly from 0.27 voxel to 0.23 voxel. For the small object, however, the relative error increased notably in both cases, to almost a voxel: System A from 0.68 voxel to 0.97 voxel and System B from 0.49 voxel to 0.96 voxel.

Discussion
All systems failed to pass the performance tests for both objects in this study, despite the manufacturers similarly testing their SD MPE according to VDI/VDE 2630-1.3:2011 [29] with their own objects. It should, however, be taken into account that the systems had not undergone their verification and periodic calibration shortly before the test. The ZEISS object (Fig. 1d) is similar to the design used in this study: it has the same number of spheres (22), arranged in 3 concentric circles around a centre sphere, but the relative height differences are smaller and thus the object covers a lower proportion of the measurement volume height at low magnification. The same holds for the YXLON design (Fig. 1c); this object consists of 16 spheres, with 1 central sphere and the other spheres arranged around it at two different heights. The Nikon object (Fig. 1b) most likely covers the whole measurement volume at low magnification. However, when scanning these objects at higher magnifications, only part of the object will be in the field of view, reducing the number of possible measurements and creating further issues. For the ZEISS and YXLON objects, while the proportion of the measurement height covered will increase, the shortest test length will be a higher proportion of the measurement volume diagonal. In the case of the Nikon object, all spheres will be in a single plane at high magnification, as only spheres on the top plate will be in view. The plate of Waygate Technologies (Fig. 1a) has the same pattern of 11 ruby spheres at two different scales. This allows for the same coverage of the measurement volume at both low and high magnification, but the plate design means all measured lengths will be in the same plane. The test object of this study was manufactured at two different scales to cover the full measurement volume at both high and low magnification, which could explain why the systems failed to pass the verification test for this particular set of test objects.
A reason why the ZEISS and YXLON designs do not have greater height differences between spheres could be that the resulting worse calibration uncertainties do not outweigh the advantages of covering a larger proportion of the measurement volume.
All provided MPE statements set high expectations. Consider a measurement volume of 200 mm × 170 mm in combination with a 2000 pixel detector, giving a 100 μm voxel size. Following their SD MPEs, a 210 mm sphere distance measurement in this volume (80% of the diagonal) should have an error smaller than 7.1 μm, 8.7 μm, 13.2 μm and 5.9 μm across systems A-D respectively. This is 6-13% of a voxel, which is an extremely small bound, achieved only in the very best case scenarios of the multitude of studies covered in the introduction. It is easy to argue that such a bound does not allow for any uncertainty of measurement. The fits with a 1/50 gradient determined from our experiments suggest these values should be 20.6 μm, 21.9 μm, 42.8 μm and 15.8 μm: 2-3 times greater. Conversely, smaller measurement volumes are held to a less strict standard in proportion to the voxel size. In a measurement volume ten times smaller (20 mm × 17 mm) with a voxel size of 10 μm, a 21 mm sphere distance measurement should be within an error of 5.4 μm, 4.9 μm, 9.4 μm and 4.0 μm: 40-94% of the voxel size. This likely accounts for the higher compliance rate of the small object.
It can also be observed that, for all systems and both tests, the higher magnification always resulted in a worse error as a proportion of the voxel size. A potential reason is that the magnification, and therefore the voxel size, changes faster as the object moves closer to the source, since the magnification is SDD/SOD. This means the precision, repeatability and calibration of the manipulator become even more critical at high magnifications. While the precision and repeatability of the manipulator are out of the control of the operator, non-uniform sampling points could be used in its calibration, with more points closer to the source due to the inverse relationship. It is not clear whether manufacturers take this into account in their calibration processes.
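This sensitivity can be illustrated numerically; the 1000 mm source-detector distance (SDD) and the 1 mm positioning error below are hypothetical values chosen only to show the inverse relationship:

```python
def magnification(sod_mm, sdd_mm=1000.0):
    # Cone-beam geometric magnification: M = SDD / SOD
    return sdd_mm / sod_mm

# The same 1 mm SOD error changes the magnification (and hence the
# effective voxel size) far more when the object sits close to the source
for sod in (100.0, 500.0):
    delta_m = magnification(sod - 1.0) - magnification(sod)
    print(f"SOD {sod:.0f} mm: M = {magnification(sod):.2f}, "
          f"1 mm closer gives dM = +{delta_m:.4f}")
```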
Systems C and D use pre-scan calibration tools to try to improve the known SOD. While there are likely also other issues, the outcomes of the two systems are quite different. This could be related to the earlier point that both the calibration and the verification measurements should use as much of the measurement volume as possible. Dewulf et al. [41] used a calibration tube with 49 spheres that covers the full measurement volume, with the calibration performed at the same magnification as the other measurements; this reduced their S_D errors significantly. From the descriptions of the tools it is reasonable to assume that the tool of System D covers the measurement volume better than the tool of System C, which could (partly) explain the disparity between the two systems.

Conclusion
In this evaluation, the performance of four metrological CT systems from different manufacturers (with a fifth disregarded) has been compared against their stated SD MPE using two very differently sized reference standards of the same design, following the same test protocol. All systems failed to pass this evaluation at both magnifications, but when the error is considered as a proportion of the voxel size, it is clear the results can still be very good. The best fixed-gradient fit found, without expanded test value uncertainty or a security margin taken into account, was (11.1 + (L/mm)/50) μm, and the best maximum error in terms of voxel size was 0.49 voxels when considering both objects (the large object performed better, with 0.16 voxels).
The easiest direct comparison between different dimensional XCT systems is comparing the MPE statements, since these use a language that is already common and well understood. It is therefore commercially attractive to have the best E MPE/SD MPE and to set it very tight. With no standard process or common object to test against, manufacturers have each created their own objects, and (cynically) system improvements aimed solely at improving the specifications against their own object. Comparing the provided accuracy statements with the best fits and the 1/50-gradient fits, it is clear that both the intercept and the gradient could, and probably should, be increased. At the time of writing, a first step towards standardisation has been made by Thompson et al. [26], who show a statistical method to determine the MPE in accordance with ISO 10360-13:2021 [42]. Together with ensuring an object that covers the entire measurement volume, a truer outlook on MPE could be achieved.
Manufacturers do not restrict their accuracy statements in terms of magnification, and only one of those evaluated restricted the maximum measurement length. This study concludes that they should consider adding such restrictions, particularly if their testing is constrained to a proportion of the measurement volume. Further, it could be reasonable to consider a resolution-dependent intercept (and gradient?), as in the discussion above. Manufacturers should publish how they verify their systems and the test object(s) used, so users can make better judgements on compliance; they should provide information in terms of magnification/voxel size, measurement volume and the dimensions measured. If any further reason for the above is needed, it has become clear from this work that the freedom in test object design allowed by VDI/VDE permits objects that can lead to failure of verification, although that was not the original intention of its design.
In XCT the scan parameters are still very operator dependent and, as shown in this study, can have an influence on the acceptance test results. Test 2 was intended to let manufacturers show the best capabilities of their systems, but all their different approaches failed to improve on the results of our normatively directed parameters (test 1), arising from the process described in Ref. [40]. On the contrary, most systems showed increased errors, emphasising the current operator influence in performance testing. As far as the authors are aware, manufacturers do not provide strict protocols on how to scan in order to achieve results within their stated accuracy. Generally, the scan protocol used to determine acquisition parameters needs more standardisation for the purpose of acceptance testing, as VDI/VDE 2630-1.3:2011 [29] places no requirement on how the scan is performed.
Currently, VDI/VDE 2630-1.3:2011 [29] is the go-to guideline for verifying metrological specifications of XCT systems, but with the ASME standard from 2020 and the upcoming ISO 10360-11 [30], currently being drafted, this will likely change. ASME B89.4.23-2020 [43] was published while this study was ongoing and was not taken into account in the object design. This standard is not compatible with the ISO and VDI/VDE 2630-1.3:2011 guidelines for performance testing [29], taking a quite different and stricter approach, both in test lengths and in materials. ASME requires the test object to have 8 coplanar spheres made from plastic, aluminium or steel, arranged to give 28 test lengths in six independent measurement lines. This object has to be scanned in three measurement planes: horizontal, vertical and inclined approximately along the diagonal of the measurement volume. These new standards introduce the possibility of manufacturers following even more divergent approaches in the determination of their system accuracy.
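As a side note, the 28 test lengths are consistent with taking every pairwise centre-to-centre distance between the 8 spheres, which can be checked directly:

```python
from itertools import combinations

# 8 sphere centres give C(8, 2) = 28 unique centre-to-centre distances;
# the six independent measurement lines constrain the arrangement of
# the spheres, not the number of test lengths
pairs = list(combinations(range(8), 2))
print(len(pairs))  # 28
```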
What is clearly different about this object design and the test lengths in this work is that they typically cover a much greater area of the detector than others provided by manufacturers, which may be the reason why no system was compliant. Overall, this study indicates that the SD MPE statements provided by the manufacturers are too tight (incorrect?) and/or should include restrictions on how they are to be verified. It is worth reminding the reader that a manufacturer's SD MPE should include a margin such that all measurements are easily within bounds, but it was difficult to observe evidence of this consideration. Finally, we emphasise that one should consider whether an error is reasonable for the voxel size arising from the size of the measurement volume. Expecting all measurements to be within 0.06 voxels, as per the large object for a particular system's SD MPE, is likely unreasonable, but we encourage manufacturers and operators to define these expectations in the form of explicit and realistic specifications.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.