Comparison of form measurement results for optical aspheres and freeform surfaces

Comparing form measurement data for aspheres and freeform surfaces is an important tool for ensuring the quality and functionality of the devices used to take such measurements and may also allow the underlying measurement methods to be evaluated. However, comparing the highly accurate form measurements of such complex surfaces is a demanding task. It is difficult to analyze measurement results whose accuracies are in the range of several tens of nanometers root-mean-square, especially when comparing data with different and anisotropic distributions of the 3D measurement points on the surface under test. In this paper, we investigate eight different 3D measurement point distributions that are typical of highly accurate measurement systems currently in use and demonstrate the effects of these distributions on the comparison results by using virtually generated data and applying different evaluation strategies. The results show that, for the examples investigated, the different 3D measurement point distributions can yield different levels of accuracy for the comparison. Furthermore, an improved evaluation procedure is proposed, and recommendations are given on how to significantly reduce the influence of the different 3D measurement point distributions on the comparison result. A method of employing virtually generated test data is presented that may be generalized in order to further improve and validate future comparison methods.


Introduction
Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
The manufacturing of optical surfaces has reached an advanced stage, allowing even complex surface forms such as asphere and freeform surfaces to be realized with high accuracy. Parallel to this development, form metrology for these surfaces has also improved greatly, with various measurement systems available that utilize different measurement principles [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15]. Comparing the measurements of selected asphere and freeform surfaces is important in order to allow the performance of these measuring systems to be assessed. Measurement comparisons at the highest available level of accuracy with resolutions in the range of 10 nm and below have been organized by the Competence Center for Ultraprecise Surface Manufacturing (CC UPOB e.V. [16]). The measurement results of these round robin comparisons have been evaluated by the Physikalisch-Technische Bundesanstalt (PTB) [17][18][19][20][21]. This paper presents the findings and insights gained from the evaluation of these measurement comparisons.
If the asphere/freeform-surface measurement results have an accuracy level of several tens of nanometers root-mean-square (RMS) (up to 100 nm, depending on the quality of the specimen), and if the required accuracy of the comparison result is correspondingly high, then comparing the measurement results is challenging in several respects: the different instruments deliver different (and, most importantly, anisotropic) 3D measurement point distributions; some measurement systems have different spatial resolutions (or integration areas on the surface that differ in size and shape [22]); the areas measured on the samples can differ in size; and the measurement systems may have differently oriented coordinate systems if their sample orientations are different. In addition, because robust measurement uncertainties for the data sets are often unavailable, information about the (location-dependent) uncertainties of the 3D measurement data cannot be used for tasks such as weighting the measurement data. Another difficulty is that no absolutely measured and very well characterized standard surfaces are available at the accuracy level mentioned here, so it is sometimes difficult to distinguish whether an effect in the measurement data is attributable to the measuring instrument or to the sample.
In this work, we first present the evaluation procedure used in recent comparisons to account for the aspects mentioned above. Using this procedure, we then investigate how the comparison results are influenced by the different 3D measurement point distributions generated by different measurement systems, all of which claim to describe the same surface shape.
As mentioned above, many measurement systems exist that use different measurement principles. For example, certain instruments apply optical [6][7][8][9] or tactile [4][5][6] point sensors that allow different scanning paths on the surface to be used. Often, spiral scans [7], circular scans [2] or line scans [4,5] are used, depending on the construction of the machine and on the sensor paths chosen by the customer. Areal interferometers are also used [10][11][12][13][14] that usually acquire the measurement data by means of high-resolution cameras that generate measurement data on either equidistant or non-equidistant grids, e.g. when using stitching methods [11][12][13].
For an adequate and fair measurement comparison of such data, possible systematic errors of the comparison results caused by an insufficient analysis of different 3D measurement point distributions should be avoided or at least reduced.
Therefore, in this work, we discuss how comparison results are influenced by evaluation strategies that have different levels of sophistication by investigating virtually generated data sets representing the 3D measurement point distributions found in typical measurement approaches. We demonstrate our findings for two selected specimen shapes. Furthermore, we give recommendations for reducing the influence of different 3D measurement point distributions on the comparison results and for improving and validating future comparison methods by employing virtually generated test data.
The paper is organized as follows: in section 2, the basic procedure for evaluating comparisons of asphere and freeform surface measurements is explained, the challenges of comparing high-accuracy measurements are detailed and a procedure for overcoming these challenges is proposed. Section 3 defines the setting used to virtually generate test data in order to represent the 3D point distributions of typical measurement systems and to demonstrate their influence on the accuracy of the comparison results for selected example specimen forms. The outcome of this investigation is presented in section 4. Finally, section 5 discusses the findings of the investigation and presents conclusions.

Comparison concept for aspheres and freeform measurements and challenges entailed by this concept
While the comparison of the results of form measurements for flat or even spherical surfaces is rather straightforward, several challenges arise for a highly accurate comparison of the results of form measurements for asphere or freeform surfaces: due to significant variability in the local slope of the specimen, it is very important to accurately align the data sets generated in such measurements to each other. Furthermore, because no highly accurate traceable reference specimens or measurement systems exist, the true form of the specimen is not known. To make the differences between the different measurement results more visible, the design topography is usually removed after the measurements have been aligned with the design. Additionally, the different measurement systems usually have large uncertainty contributions for the measurement of the spherical components of the surface such as the best-fit sphere (BFS) [23]. For some optical applications, the spherical contribution can be compensated if an optical system is mounted or adjusted in a certain way. Therefore, it is also useful to compare the non-spherical components of the different measurement systems; here, the individual spherical contributions must be removed. Especially with regard to the removal of the design function and the spherical contributions from each single data set, challenges for an adequate comparison arise, as the 3D measurement point distributions of the different measurement methods must be taken into account.

Comparison procedure
In the following, the concept of data comparison within the scope of a typical round robin comparison of measurements of aspheres and freeform surfaces is briefly described. For this purpose, we limit ourselves to procedures that have recently been used [18][19][20][21][23]. In most cases, the absolute form measurement data sets (design and deviation from the design) are analyzed in terms of Cartesian coordinates. The data sets of each individual measurement are preprocessed to achieve optimal data alignment and to calculate the characteristic data that will be compared later for each data set. A flowchart of the comparison procedure is shown in figure 1.
In the first step, the data sets are aligned to each other and transformed into the same common coordinate system. For this purpose, each measurement data set (the absolute measurement data of each contributing partner organization) is aligned with the design topography. For the alignment, the difference between the design topography and the measured data points is minimized in a least-squares sense (describing the design deviations in z-direction), allowing shifts and rotations of the measurement point cloud along the three Cartesian coordinate axes. The permissible degrees of freedom depend on the symmetry of the specimen's design. After the alignment, the design topography is removed to make the differences between the measurement systems more visible. The residual data sets after removal of the design topography are called residual data in the following.
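The least-squares alignment to the design can be illustrated with a minimal sketch. It assumes a purely spherical stand-in for the design topography and allows only three shifts as degrees of freedom; the names `design_z` and `align_shifts`, the Gauss-Newton solver and all numerical values are illustrative assumptions and not the implementation used in the comparisons.

```python
import numpy as np

R_DESIGN = 20.2  # mm, spherical stand-in for the design (assumption for illustration)

def design_z(x, y):
    """Sag of the stand-in design surface at (x, y), in mm."""
    return R_DESIGN - np.sqrt(R_DESIGN**2 - x**2 - y**2)

def residuals(p, x, y, z):
    """Design deviations in z after shifting the point cloud by p = (dx, dy, dz)."""
    dx, dy, dz = p
    return (z + dz) - design_z(x + dx, y + dy)

def align_shifts(x, y, z, n_iter=20, step=1e-6):
    """Gauss-Newton search for the shifts minimizing the sum of squared z-deviations."""
    p = np.zeros(3)
    for _ in range(n_iter):
        r0 = residuals(p, x, y, z)
        # Central-difference Jacobian of the residuals with respect to the shifts.
        jac = np.empty((r0.size, 3))
        for k in range(3):
            dp = np.zeros(3)
            dp[k] = step
            jac[:, k] = (residuals(p + dp, x, y, z)
                         - residuals(p - dp, x, y, z)) / (2.0 * step)
        delta, *_ = np.linalg.lstsq(jac, -r0, rcond=None)
        p = p + delta
        if np.linalg.norm(delta) < 1e-12:
            break
    return p

# Synthetic "measurement": the design sampled in a displaced coordinate system.
rng = np.random.default_rng(0)
x = rng.uniform(-8.0, 8.0, 2000)
y = rng.uniform(-8.0, 8.0, 2000)
true_shift = np.array([0.010, -0.020, 0.005])  # mm, hypothetical misalignment
z = design_z(x + true_shift[0], y + true_shift[1]) - true_shift[2]

shift = align_shifts(x, y, z)
rms_after = np.sqrt(np.mean(residuals(shift, x, y, z) ** 2))
```

For a real specimen, rotations would be added to the parameter vector where the symmetry of the design permits them, and the residuals after alignment would form the residual data.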
In the second step, i.e. after the removal of the design topography, an individual BFS is removed from each residual data set because the residual data sets usually differ significantly in terms of the spherical contributions [18,23]. In addition to these different spherical contributions, the comparison is also designed to reveal the non-spherical differences between the different residual data sets measured. The BFS is determined using the Levenberg-Marquardt algorithm [24, chapter 5.2] as implemented in MATLAB® [25]. The resulting data is called non-spherical residual data.
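As a rough illustration of the BFS step, the following sketch fits a sphere to a point cloud and subtracts it. Note that it swaps in a linear (algebraic) least-squares fit in place of the Levenberg-Marquardt algorithm mentioned above; the function names and the synthetic data are assumptions made for this example only.

```python
import numpy as np

def fit_sphere(x, y, z):
    """Algebraic least-squares sphere fit; returns (center, radius)."""
    # A sphere satisfies x^2 + y^2 + z^2 = 2*a*x + 2*b*y + 2*c*z + d with
    # center (a, b, c) and d = R^2 - a^2 - b^2 - c^2, which is linear in (a, b, c, d).
    design = np.column_stack([2.0 * x, 2.0 * y, 2.0 * z, np.ones_like(x)])
    rhs = x**2 + y**2 + z**2
    (a, b, c, d), *_ = np.linalg.lstsq(design, rhs, rcond=None)
    radius = np.sqrt(d + a**2 + b**2 + c**2)
    return np.array([a, b, c]), radius

def remove_sphere(x, y, z, center, radius):
    """Subtract the lower cap of the fitted sphere (center above the data) from z."""
    cap = center[2] - np.sqrt(radius**2 - (x - center[0])**2 - (y - center[1])**2)
    return z - cap

# Synthetic spherical cap with a slightly decentered sphere (hypothetical values, mm).
axis = np.linspace(-10.0, 10.0, 81)
xg, yg = np.meshgrid(axis, axis)
inside = np.hypot(xg, yg) <= 10.0
x, y = xg[inside], yg[inside]
true_center, true_radius = np.array([0.1, -0.2, 1000.0]), 1000.0
z = true_center[2] - np.sqrt(true_radius**2
                             - (x - true_center[0])**2 - (y - true_center[1])**2)

center, radius = fit_sphere(x, y, z)
non_spherical = remove_sphere(x, y, z, center, radius)
```

The algebraic fit minimizes a different cost than a fit of the z-deviations, so for data with non-spherical content the two approaches can return slightly different spheres.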
Depending on whether the design specimen is symmetrical, these non-spherical residual data sets may need to be aligned with each other by rotating them around the z-axis. This is achieved by determining the maximum correlation to an arbitrarily chosen reference non-spherical residual data set when rotating the individual non-spherical residual data sets around the z-axis. Depending on which low- and mid-spatial frequencies are present on the specimen's surface, the frequency components with the most significant structures should be used for this alignment step. In this paper, the interpolated mid-spatial frequency structures of the non-spherical residual data (after subtracting Zernike polynomial functions of up to order n = 18) are used.
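The rotational alignment by maximum correlation can be sketched as follows. The sketch assumes that both maps have already been interpolated to the same polar grid, so that a rotation around the z-axis becomes a cyclic shift along the angle axis; the grid size, the random stand-in map and the function name `best_rotation` are illustrative assumptions.

```python
import numpy as np

def best_rotation(ref, data):
    """Return (angle in degrees, shift in samples) maximizing the correlation to ref.

    Both maps have shape (n_radii, n_angles) on the same polar grid.
    """
    n_angles = ref.shape[1]
    ref0 = ref - ref.mean()
    data0 = data - data.mean()
    # Brute-force search over all cyclic shifts along the angle axis.
    scores = [np.sum(ref0 * np.roll(data0, k, axis=1)) for k in range(n_angles)]
    k_best = int(np.argmax(scores))
    return k_best * 360.0 / n_angles, k_best

rng = np.random.default_rng(1)
reference = rng.normal(size=(20, 360))       # stand-in mid-spatial frequency map
rotated = np.roll(reference, -25, axis=1)    # same map, rotated by -25 degrees
angle_deg, k = best_rotation(reference, rotated)
aligned = np.roll(rotated, k, axis=1)
```

The angular resolution of this search equals the angular grid spacing; a finer result would require interpolation between shifts, and the rings are not area-weighted here.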
After this preprocessing of the single-measurement data sets, the resulting reduced data sets are compared. For this purpose, all reduced data sets are interpolated to a common regular grid on a chosen aperture. Then, for each grid point, the median value of all available reduced interpolated data sets is calculated. This pointwise median topography will then act as a reference. Finally, the pointwise difference between each reduced interpolated data set and this so-called virtual reference topography (VRT) is determined and some statistical values (e.g. the RMS value and the median absolute deviation (MAD)) of these difference topographies are considered. This type of reference topography is used because, in the comparisons, the uncertainties of the individual instruments are typically not given by all participants. The effectiveness of this concept has been demonstrated in [18].
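The pointwise median reference and the statistics of the difference topographies can be sketched in a few lines. The sketch assumes that the reduced data sets are already interpolated to the common grid and restricted to the aperture; the synthetic data and the function name `compare_to_vrt` are assumptions, and the MAD is taken here as the median of the absolute deviations from the median.

```python
import numpy as np

def compare_to_vrt(stack):
    """stack: (n_sets, n_points) reduced data on the common grid, aperture points only."""
    vrt = np.median(stack, axis=0)                  # virtual reference topography
    diffs = stack - vrt                             # one difference topography per set
    rms = np.sqrt(np.mean(diffs**2, axis=1))
    med = np.median(diffs, axis=1, keepdims=True)
    mad = np.median(np.abs(diffs - med), axis=1)    # median absolute deviation
    return vrt, rms, mad

# Three synthetic reduced data sets (nm) on a common 101 x 101 grid.
axis = np.linspace(-1.0, 1.0, 101)
xg, yg = np.meshgrid(axis, axis)
mask = np.hypot(xg, yg) <= 1.0                      # circular evaluation aperture
base = 10.0 * np.sin(3.0 * xg) * np.cos(2.0 * yg)
stack = np.stack([base[mask], base[mask] + 2.0, base[mask] - 2.0])

vrt, rms, mad = compare_to_vrt(stack)
```

In this toy case the median recovers the middle data set, so a constant offset of one participant shows up in the RMS value but not in the MAD.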
In the past, other groups have also conducted comparisons of different form measurement results for aspheres and freeform surfaces [1,2,6]. In these cases, the different measurement results were compared by plotting the results of each device individually and calculating some characteristic values (e.g. the RMS value, the peak-to-valley value, or Zernike coefficients for the complete topographies) instead of calculating a pointwise difference between the results or the pointwise difference between each result and a reference topography. Nevertheless, a pointwise comparison is a much better method for revealing the local differences between the different measurement results. To our knowledge, the effects discussed in this paper have not been addressed before.

Challenges and proposals
In the procedure described above, it is very important that the individual preprocessing steps be performed for each data set in the same way and on the same aperture of the specimen. However, this condition is not easy to fulfill, since the measurement results are initially reported in different coordinate systems and the form of the specimen is sampled at very different 2D measurement data points. Furthermore, in such a comparison, not all measurement systems are able to measure the form of the specimen over its complete aperture; even if all measurement systems were able to do so, the alignment of the data sets to each other would lead to small shifts between them. Therefore, a proper evaluation aperture radius $r_\mathrm{eval}$ over which the final comparison is performed must be chosen. For the preprocessing, a larger aperture of all data sets must be available to account for shifts during the alignment procedures.
Nevertheless, the necessary alignment of the data sets with the design topography, as well as the individual BFSes during the preprocessing, should be determined using the same aperture radius for all available measurement data. If, for this purpose, only the original x-, y-measurement data points contained within a certain aperture were chosen, the maximum x- and y-values of the original measurement data points would still be different between the different measurements, since different sampling schemes and point distances exist for each data set. Furthermore, both the 2D measurement point distributions and the sampling scheme itself influence the alignment and fitting procedures used in the comparison procedure. Here, especially for the least-squares fitting procedures, certain regions of the specimen have a greater influence on the fits if the local data-point density of these regions is larger than the data-point density of other regions of the specimen. For example, this is the case for circular and spiral scans, whose local 2D measurement data-point density is usually larger in the center of the specimen.
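The influence of the local point density on a least-squares fit can be made concrete with a small toy calculation. It assumes a residual z = r^4 on a unit aperture (arbitrary units) and fits only a parabolic term c·r^2 as a simplified stand-in for the BFS; both the residual and the one-parameter model are assumptions chosen so that the result can be checked by hand. In the continuum limit, a uniform areal density gives c = 3/4, whereas circular scans with the same number of points on every circle give c = 5/7.

```python
import math

def fit_curvature(points):
    """Least-squares coefficient c of the model z = c * r^2 for points given as (r, z)."""
    num = sum(r**2 * z for r, z in points)
    den = sum(r**4 for r, _ in points)
    return num / den

def residual(r):
    """Non-spherical stand-in residual on a unit aperture (arbitrary units)."""
    return r**4

# Equidistant Cartesian grid restricted to the unit aperture.
n = 201
axis = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
radii_grid = [math.hypot(x, y) for x in axis for y in axis]
grid = [(r, residual(r)) for r in radii_grid if r <= 1.0]

# Circular scans: equally spaced radii, same number of points on every circle,
# so the point density per unit area increases towards the center.
n_circles, n_per_circle = 100, 360
circles = [(j / n_circles, residual(j / n_circles))
           for j in range(1, n_circles + 1)
           for _ in range(n_per_circle)]

c_grid = fit_curvature(grid)
c_circles = fit_curvature(circles)
```

Although both point sets sample exactly the same surface, the fitted curvature differs by a few percent, which is the mechanism behind the sampling-dependent BFS radii discussed in this paper.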
To analyze these different data sets, we decided to use data interpolated to a common regular grid (the same grid as that used for the final comparison of the data) to determine the necessary alignment to the design topography and to determine the individual BFSes. Note that, for the removal of the design topography and for the removal of the BFS, the original 3D measurement data points were used.
In this paper, we analyze the effects of using interpolated data for these steps and compare these effects to a situation in which interpolated data is not used. For this purpose, we present comparison results for virtually generated test data by applying four different evaluation strategies.

Study of effects using virtually generated data sets
One can assume that the effects described depend to a great extent on the design of the specimen and on the deviation from this design, since the fitting processes are influenced by the design form and by the size and structure of the design deviations. In this paper, the influence of different 3D measurement point distributions on the comparison result is demonstrated using virtually generated data sets for two different example specimens that are similar to those used in the CC UPOB e.V. comparisons. For this purpose, four different 2D measurement point patterns (called sampling schemes in the following) are investigated with two different total numbers of data points (leading to different data-point densities) for each scheme to allow each specimen to be compared with eight different virtually generated data sets; the VRT calculated from each comparison is based on these eight different data sets. The typical sampling schemes chosen are produced by the measurement instruments available, e.g. spiral scans, line scans, circular scans and equidistant scans.

Specimen designs
The first specimen is a convex asphere, which is a typical, commercially available asphere. The design function of the asphere is based on the definition of a standard asphere [26]:

$$z(r) = \frac{r^2}{R\left(1 + \sqrt{1 - (1+\kappa)\,\frac{r^2}{R^2}}\right)} + \sum_i A_i\, r^i ,$$

where $r = \sqrt{x^2 + y^2}$ is the distance from the center of the specimen, $R$ is the vertex radius of curvature, $\kappa$ is the conic constant, and $A_i$ are further coefficients describing the asphericity. The following parameters have been used: aspherical coefficients $A_i$ (values not reproduced here) and vertex radius $R$ = 20.2 mm. The asphere has a diameter of 30 mm.
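A design function of this kind can be evaluated with a short helper. The sketch below assumes the common form of the standard asphere (conic term plus a polynomial in r) with the symbols r, R, κ and A_i defined above; only R = 20.2 mm is taken from the text, while the conic constant and the coefficients in the example call are placeholders and not the values of the specimen.

```python
import math

def asphere_sag(r, R, kappa, coeffs):
    """Sag z(r) of a standard asphere (assumed common form).

    r      : distance from the center of the specimen (mm)
    R      : vertex radius of curvature (mm)
    kappa  : conic constant
    coeffs : mapping {i: A_i}; each A_i multiplies r**i
    """
    conic = r**2 / (R * (1.0 + math.sqrt(1.0 - (1.0 + kappa) * r**2 / R**2)))
    poly = sum(a_i * r**i for i, a_i in coeffs.items())
    return conic + poly

# Placeholder parameters for illustration only (not the values of the specimen).
z_edge = asphere_sag(15.0, 20.2, kappa=-1.0, coeffs={4: 1.0e-6, 6: -1.0e-9})

# Sanity check: for kappa = 0 and no polynomial terms the surface is a sphere.
z_sphere = asphere_sag(15.0, 20.2, kappa=0.0, coeffs={})
```

For κ = 0 and vanishing coefficients the expression reduces to the sag of a sphere with radius R, which is a convenient check of any implementation.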
The second specimen is a convex toroidal surface that has two different radii along the two orthogonal axes. These design radii are $r_v$ = 40 mm and $r_h$ = 42 mm; the specimen has a diameter (clear aperture) of 50 mm. The mathematical description of the toroidal surface is given by

Design deviations
The realistic design deviations investigated are of the same type as the specimens used in the CC UPOB e.V. comparisons. The design deviations chosen are similar to real measurements because Zernike polynomial functions of up to order n = 20 based on real measurement data are selected. The Zernike representation is necessary to generate a smooth, clearly defined surface that can be sampled at any point. In this paper, we want to demonstrate the effects that occur solely due to different sampling schemes and data-point densities. Therefore, we want to exclude additional effects such as measurement errors and positioning errors. For this purpose, the Zernike coefficients for the 'piston' and 'tilt' parameters are set to zero. The design deviation for the asphere is defined using an aperture radius $r_\mathrm{seg}$ of 13.5 mm and is shown in figure 2 (left). The asphere corresponds to one that is manufactured with very high accuracy; the RMS value of the design deviation is only 59 nm. For the toroidal surface, the design deviation is defined using an aperture radius $r_\mathrm{seg}$ of 22.0 mm. Figure 2 (right) shows the design deviation investigated for this specimen type. In this case, the design deviation is very large and the RMS value amounts to 2318 nm.
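A design deviation defined by Zernike coefficients can be evaluated at arbitrary points, which is what makes the representation suitable for sampling with different schemes. The sketch below assumes unnormalized Zernike polynomials indexed by (n, m), with cosine terms for m ≥ 0 and sine terms for m < 0; the normalization, the indexing and the coefficient values are assumptions, since the text does not specify them.

```python
import math

def zernike_radial(n, m, rho):
    """Radial Zernike polynomial R_n^m(rho) for n - |m| even and |m| <= n."""
    m = abs(m)
    if m > n or (n - m) % 2:
        raise ValueError("require |m| <= n and n - |m| even")
    total = 0.0
    for k in range((n - m) // 2 + 1):
        coeff = ((-1) ** k * math.factorial(n - k)
                 / (math.factorial(k)
                    * math.factorial((n + m) // 2 - k)
                    * math.factorial((n - m) // 2 - k)))
        total += coeff * rho ** (n - 2 * k)
    return total

def zernike(n, m, rho, phi):
    """Unnormalized Zernike polynomial: cosine term for m >= 0, sine term for m < 0."""
    angular = math.cos(m * phi) if m >= 0 else math.sin(-m * phi)
    return zernike_radial(n, m, rho) * angular

def design_deviation(x, y, coeffs, r_seg):
    """Deviation at (x, y) for coefficients {(n, m): value}, defined on radius r_seg."""
    rho, phi = math.hypot(x, y) / r_seg, math.atan2(y, x)
    return sum(c * zernike(n, m, rho, phi) for (n, m), c in coeffs.items())

# Hypothetical coefficients in nm; piston (0, 0) and tilt (1, +-1) are left at zero.
coeffs = {(2, 0): 30.0, (2, 2): -12.0, (3, -1): 8.0, (4, 0): 5.0}
dev_center = design_deviation(0.0, 0.0, coeffs, r_seg=13.5)
```

With these hypothetical coefficients only the rotationally symmetric terms contribute at the center of the aperture.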

Sampling schemes
In the following, the data is generated using four different sampling schemes as well as two different total numbers of data points (related to data-point densities) for each sampling scheme.
For each of the x-, y-sampling point pairs, the z-data for the two specimens is calculated from the definition of the specimens (design function and design deviation described by the Zernike polynomial functions). Note that only perfectly sampled data sets are investigated here. Other typical measurement errors such as positioning errors, effects of the surface registration, drifts during pointwise measurement procedures, effects of the spatial bandwidth of an instrument, and measurement noise are not included to allow solely the effect of the different sampling methods on the comparison result to be investigated. Other errors of concrete instruments can be studied by virtual measurement instruments, as shown for example in [27][28][29].
The example grid patterns investigated and their relation to typical measurement approaches are defined in detail below. The subscript L stands for a low total number of data points in the following, while the subscript H stands for a high total number of data points. Note that other sampling schemes also exist. For example, in areal measurement systems based on stitching methods [11][12][13] or on illumination by means of several light sources (e.g. tilted-wave interferometry [10]), unequally distributed 3D measurement data point distributions also frequently exist. These distributions depend to a large extent on the surface form and are not explicitly realized in this study. Nevertheless, because the goal of this study is to show how the accuracy of the comparison results is affected when form measurement results with different 3D measurement point distributions are compared, typical distributions are represented in this study and the basic results will also hold true for data point distributions other than the ones used here. The total numbers of data points resulting from the different sampling schemes used for this study are shown in table 1 and are representative of such numbers for form measuring instruments.

Equidistant grids.
Equidistant grids that have equally distributed data points (in a Cartesian or angular sense) exist for Fizeau interferometers, for example in combination with a computer-generated hologram (CGH) [30], or for other techniques that directly acquire the data by means of a camera chip. The following grids were considered for each specimen:
• 600 points along the x- and y-axes of the given aperture of the specimen. This configuration is called EQUI_L.
• 1000 points along the x- and y-axes of the given aperture of the specimen. This configuration is called EQUI_H.

Line scans.
Line scans in x- and y-orientation are commonly used by measurement systems based on point sensors [4,5]. In contrast to an equidistant grid, the point density differs between the x- and y-directions. The following grids were considered for each specimen:
• 100 points along the x-axis and 3600 points along the y-axis of the given aperture of the specimen. This configuration is called LINES_L.
• 200 points along the x-axis and 5000 points along the y-axis of the given aperture of the specimen. This configuration is called LINES_H.

Circular scans.
Circular scans are commonly used by measurement systems based on point sensors [2]. Depending on which parameters are chosen, the point density along a circle is often much higher than the point density along the radius of the circles. Furthermore, the point density along the circle usually decreases with increasing distance from the center. The following grids were considered for each specimen:
• 100 complete circles (num_circles) on the given aperture of the specimen, 3000 points along a single circle. This configuration is called CIRCLES_L.
• 200 complete circles (num_circles) on the given aperture of the specimen, 4000 points along a single circle. This configuration is called CIRCLES_H.

Spiral scans.
Spiral scans are commonly used by measurement systems based on point sensors [7]. Depending on the parameters chosen, the point density along a spiral is often much higher than the point density in the perpendicular (radial) direction. Furthermore, the point density along the spiral usually decreases with increasing distance from the center. The following grids were considered for each specimen:
• 200 complete spirals (num_spiral), 300 000 points across the given aperture of the specimen. This configuration is called SPIRAL_L.
• 400 complete spirals (num_spiral), 780 000 points across the given aperture of the specimen. This configuration is called SPIRAL_H.
In the formulas used to generate the data, ϕ takes 300 000 (or 780 000) equally distributed values between 0 and num_spiral · 2π.
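The four sampling schemes can be generated along the following lines. This is a sketch under stated assumptions: the Cartesian grids are restricted to the circular aperture, the circles are equally spaced in radius, and the spiral is taken to be Archimedean (radius proportional to ϕ), which is consistent with the description above but is not confirmed as the form used for the study. The function names are hypothetical; the point numbers are those of the low-density configurations for the asphere.

```python
import numpy as np

def equidistant(n, r_max):
    """n x n Cartesian grid over the aperture, restricted to the circular aperture."""
    axis = np.linspace(-r_max, r_max, n)
    x, y = np.meshgrid(axis, axis)
    keep = np.hypot(x, y) <= r_max
    return x[keep], y[keep]

def line_scans(n_x, n_y, r_max):
    """Scan lines with n_x points along the x-axis and n_y points along the y-axis."""
    x, y = np.meshgrid(np.linspace(-r_max, r_max, n_x),
                       np.linspace(-r_max, r_max, n_y))
    keep = np.hypot(x, y) <= r_max
    return x[keep], y[keep]

def circular_scans(num_circles, n_per_circle, r_max):
    """Concentric circles with equal radial spacing (assumed) and equal point count."""
    radii = np.linspace(r_max / num_circles, r_max, num_circles)
    phi = np.linspace(0.0, 2.0 * np.pi, n_per_circle, endpoint=False)
    r, p = np.meshgrid(radii, phi)
    return (r * np.cos(p)).ravel(), (r * np.sin(p)).ravel()

def spiral_scan(num_spiral, n_points, r_max):
    """Archimedean spiral (assumed form) with equally distributed angle values."""
    phi = np.linspace(0.0, num_spiral * 2.0 * np.pi, n_points)
    r = r_max * phi / phi[-1]
    return r * np.cos(phi), r * np.sin(phi)

R_SEG = 13.5  # mm, aperture radius of the asphere data
schemes = {
    "EQUI_L": equidistant(600, R_SEG),
    "LINES_L": line_scans(100, 3600, R_SEG),
    "CIRCLES_L": circular_scans(100, 3000, R_SEG),
    "SPIRAL_L": spiral_scan(200, 300_000, R_SEG),
}
counts = {name: xy[0].size for name, xy in schemes.items()}
```

The z-values for each scheme would then follow from evaluating the design function and the design deviation at these x-, y-pairs.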

Evaluation procedure for the virtually generated data
The eight virtually generated data sets for each specimen were then fed into a typical evaluation program, which is described in section 2.1. For each specimen, the evaluation process was repeated four times. During each of the four evaluation processes, a slightly different approach was used to determine the necessary alignment to the design and/or to determine the BFSes.
It is important to mention that, in each case, the design topography as well as the BFSes were removed only from the available original data points (and not from the interpolated data or the data restricted to an aperture). To determine the necessary alignment with the design topography, and to determine the BFSes (but not for other preprocessing steps), either a restricted number of the original data points (i.e. restricted to a certain area of the specimen) or other (i.e. interpolated) data points are used. The option chosen depends on the evaluation strategy. The exact procedures used during the four evaluation strategies are described below.
Case 1: To determine the necessary alignment of the data sets with the design function and to determine the BFS of each data set, the data points of the original data sets are used, but are restricted to the x- and y-values contained within a circular aperture with aperture radius $r_\mathrm{eval}$. Since the data points are restricted to $r_\mathrm{eval}$ but have been generated (i.e. measured) on a larger aperture, slightly different apertures are used for the two fitting steps depending on the sampling scheme. Furthermore, depending on the local point density of the sampling scheme, different parts of the specimen can influence the least-squares fitting procedures to a greater or lesser extent.
Case 2: To determine the necessary alignment of the data set to the design function, the data points are interpolated to a regular common grid with an aperture radius of $r_\mathrm{eval}$. In this step, the same equally distributed grid is used for each data set. To determine the individual BFS of each data set, the original data points restricted to the aperture radius of $r_\mathrm{eval}$ are used.
Case 3: To determine the individual BFS of each data set, the data points are interpolated to a regular common grid with an aperture radius of $r_\mathrm{eval}$. In this step, the same equally distributed grid is used for each data set. To determine the necessary alignment to the design function, the original data points restricted to the aperture radius of $r_\mathrm{eval}$ are used.
Case 4: To determine the necessary alignment of the data sets to the design function and to determine the individual BFS of each data set, the data points are interpolated to a regular common grid with an aperture radius of $r_\mathrm{eval}$. Therefore, in both steps, the same equally distributed grid is used for each data set.
These four different evaluation strategies are summarized in table 2.
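The four cases differ only in which data are used to determine the alignment and the BFS, which can be written down compactly. The class and field names below are hypothetical and serve only to restate the case definitions given above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Strategy:
    align_on_common_grid: bool  # alignment to the design determined from interpolated data
    bfs_on_common_grid: bool    # BFS determined from interpolated data

STRATEGIES = {
    1: Strategy(align_on_common_grid=False, bfs_on_common_grid=False),
    2: Strategy(align_on_common_grid=True, bfs_on_common_grid=False),
    3: Strategy(align_on_common_grid=False, bfs_on_common_grid=True),
    4: Strategy(align_on_common_grid=True, bfs_on_common_grid=True),
}

def data_source(on_common_grid):
    """Describe the data used for a fitting step."""
    if on_common_grid:
        return "interpolated to the common regular grid"
    return "original points restricted to r_eval"

summary = {
    case: (data_source(s.align_on_common_grid), data_source(s.bfs_on_common_grid))
    for case, s in STRATEGIES.items()
}
```

In every case, the design topography and the BFS themselves are still removed from the original data points; the flags only select the data used to determine the fit parameters.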

Results
In the following, the results of the comparison of the eight data sets in the four different evaluation strategies are presented.
The final comparison of the asphere is performed on an aperture radius of $r_\mathrm{eval}$ = 12.4 mm and the final comparison of the toroidal surface is performed on an aperture radius of $r_\mathrm{eval}$ = 20.0 mm. These aperture radii were recently used in measurement comparisons initiated by CC UPOB e.V. and are adopted here not only for the sake of simplicity but also to model a realistic scenario wherein the aperture radius used for the evaluations is smaller than the aperture radius of the specimen's design and that of the original 3D measurement data point clouds.

Asphere
In figure 3, the VRTs resulting from the four different evaluation strategies are shown for the asphere. The eight different data sets contributed to each VRT after the evaluation steps described in section 2.1 had been conducted. Initially, there seems to be hardly any visible difference. In figures 4-7, the resulting differences between each reduced data set and the given VRT are shown. Here, some differences between the eight different sampling schemes are visible, especially for cases 1 and 2. These differences mainly consist of slightly decentered spherical differences, where the EQUI and LINES sampling schemes seem to produce the same results and differ from the CIRCLES and SPIRAL sampling schemes.
This suggests that the two groups yield different BFSes that have been removed from the residual data. To obtain more insight into this matter, the radii of the BFSes calculated from the residual data are shown in table 3 for all four cases. This table shows that the EQUI and LINES sampling schemes yield almost the same BFS in all four cases, and that this result does not depend on the total number of data points. In contrast, the CIRCLES and SPIRAL sampling schemes show different radii of the BFSes depending on the evaluation strategy. Only if interpolated data sets are used to determine the BFSes (cases 3 and 4) will the radii be close to the radii of the other sampling schemes. One reason for this behavior is that, for the CIRCLES and SPIRAL sampling schemes, the data points are not equally distributed over the complete sample size, but are more prevalent in the center of the specimen. This leads to a different calculation of the BFS. To demonstrate the impact of the different radii on the topography height measured over the given aperture radius, table 4 also shows the peak-to-valley (PV) values resulting from the BFSes on the given aperture radius. The maximum difference due to the different sampling schemes amounts to 30 nm if the data points used to determine the BFSes are not interpolated to a common, regular grid.
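The relation between a BFS radius and the resulting PV value on the aperture can be sketched as the sag of a sphere over a centered circular aperture. The two radii in the example are hypothetical and are not taken from the tables; the function names are likewise assumptions.

```python
import math

def sphere_pv(radius, aperture_radius):
    """PV (sag) of a sphere over a centered circular aperture, same unit as the inputs."""
    a = aperture_radius
    # Written as a^2 / (R + sqrt(R^2 - a^2)) to avoid cancellation for very large radii.
    return a**2 / (abs(radius) + math.sqrt(radius**2 - a**2))

def pv_difference_nm(radius_1_mm, radius_2_mm, aperture_radius_mm):
    """Difference of the PV values of two spheres on the same aperture, in nm."""
    pv_1 = sphere_pv(radius_1_mm, aperture_radius_mm)
    pv_2 = sphere_pv(radius_2_mm, aperture_radius_mm)
    return abs(pv_1 - pv_2) * 1.0e6

# Hypothetical BFS radii of two residual data sets on the asphere aperture (12.4 mm).
pv_nm = sphere_pv(1.0e6, 12.4) * 1.0e6
delta_nm = pv_difference_nm(1.0e6, 1.1e6, 12.4)
```

Because the residual data are nearly flat, the BFS radii are very large and the PV value is well approximated by a²/(2R), so a relative change of the radius translates into roughly the same relative change of the PV value.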
When using interpolated data to determine the BFSes, the difference between the different sampling schemes decreases significantly (cases 3 and 4, see figures 6 and 7). While there are still some alignment differences when interpolated data sets are not used to determine the necessary alignment to the design (case 3), the differences between the sampling schemes are reduced to the sub-nanometer range when interpolated data sets are also used to determine the necessary alignment to the design (case 4). Nevertheless, especially for the LINES_L sampling scheme, small differences are still visible that may be caused by interpolation errors, sub-sampling effects or alignment errors.

Figure 4. Differences between the reduced data sets and the VRT for the asphere when using data points of the original data sets (restricted to $r_\mathrm{eval}$) to determine the necessary alignment to the design and to determine the individual BFSes (case 1).

Figure 5. Differences between the reduced data sets and the VRT for the asphere when using interpolated data to determine the necessary alignment to the design and data points of the original data sets (restricted to $r_\mathrm{eval}$) to determine the individual BFSes (case 2).
To compare the overall effect of the sampling schemes on the differences between the VRT and the reduced data sets, figure 8 shows the RMS and MAD values of the differences between the VRTs and the reduced data sets for all four cases. If no interpolated data sets are used to determine the necessary alignment to the design and the BFSes (case 1), the RMS values amount to about 5 nm. Using interpolated data only to determine the necessary alignment to the design but not to determine the BFSes (case 2) does not lead to a significant improvement. When using interpolated data to determine the BFSes (case 3), the RMS values decrease to only about 1.5 nm. This value decreases to the sub-nanometer range when interpolated data are also used to determine the necessary alignment to the design (case 4).

Toroid
In figure 9, the VRTs resulting from the four different evaluation strategies are shown for the toroidal surface. The eight different data sets contributed to each VRT after the evaluation steps described in section 2.1 had been conducted. Initially, there seems to be hardly any difference visible. In figures 10-13, the resulting differences between each reduced data set and the given VRT are shown. Here, some differences between the eight different sampling schemes are visible, especially for cases 1 and 2. These differences mainly consist of slightly decentered spherical differences, where the EQUI and LINES sampling schemes seem to produce the same results and differ from the CIRCLES and SPIRAL sampling schemes.

Figure 6. Differences between the reduced data sets and the VRT for the asphere when using interpolated data to determine the individual BFSes and data points of the original data sets (restricted to $r_\mathrm{eval}$) to determine the necessary alignment to the design (case 3).

Figure 7. Differences between the reduced data sets and the VRT for the asphere when using interpolated data to determine the individual BFSes and to determine the necessary alignment to the design (case 4).
This again suggests that the two groups yield different BFSes that have been removed from the residual data. To obtain more insight into this matter, the radii of the BFSes calculated from the residual data are shown in table 5 for all four cases. This table shows that the EQUI and LINES sampling schemes yield almost the same BFSes in all four cases, and that this result does not depend on the total number of data points. In contrast, the CIRCLES and SPIRAL sampling schemes show different radii of the BFSes depending on the evaluation strategy. Only if interpolated data sets are    Differences between the reduced data sets and the VRT for the toroidal surface when using data points of the original data sets (restricted to reval) to determine the necessary alignment to the design and to determine the individual BFSes (case 1). Figure 11. Differences between the reduced data sets and the VRT for the toroidal surface when using interpolated data to determine the necessary alignment to the design and data points of the original data sets (restricted to reval) to determine the individual BFSes (case 2). used to determine the BFSes (cases 3 and 4) will the radii be the same as the radii of the other sampling schemes. To again demonstrate the impact of the different radii on the topography height measured over the given aperture radius, table 6 also shows the peak-to-valley (PV) values resulting from the BFSes over the given aperture radius. The maximum difference due to the different sampling schemes is 188 nm if the data points used to determine the BFSes are not interpolated to a common, regular grid. When using interpolated data to determine the BFSes, the difference between the different sampling schemes decreases significantly (cases 3 and 4, see figures 12 and 13). While Figure 13. 
Differences between the reduced data sets and the VRT for the toroidal surface when using interpolated data to determine the individual BFSes and to determine the necessary alignment to the design (case 4).  there are still some alignment differences when interpolated data sets are not used to determine the necessary alignment to the design (case 3), the differences between the sampling schemes are reduced to the sub-nanometer range when interpolated data sets are also used to determine the necessary alignment to the design (case 4). Nevertheless, for this specimen as well, and specifically for the LINES L sampling scheme, small differences are still visible that may be caused by interpolation errors, sub-sampling effects or alignment errors.
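To illustrate why small changes in the BFS radius matter at this level of accuracy, the following Python sketch evaluates the PV value of a centered sphere over a circular aperture, i.e. its sag at the aperture edge, for two slightly different radii. All numbers (radii of 40.000 mm and 40.001 mm, aperture radius of 10 mm) and the helper name `sphere_sag` are hypothetical and are not taken from the tables of this paper.

```python
import math


def sphere_sag(radius, aperture_radius):
    """Sag of a sphere with the given radius of curvature at the aperture edge.

    Algebraically equal to R - sqrt(R^2 - a^2), written as
    a^2 / (R + sqrt(R^2 - a^2)) to avoid cancellation for large R.
    """
    if not 0.0 <= aperture_radius < radius:
        raise ValueError("aperture radius must be in [0, radius)")
    root = math.sqrt(radius**2 - aperture_radius**2)
    return aperture_radius**2 / (radius + root)


# Hypothetical values in mm, not taken from the tables of this paper.
aperture_radius = 10.0
radius_a = 40.000
radius_b = 40.001  # BFS radius larger by 1 micrometer

pv_a = sphere_sag(radius_a, aperture_radius)
pv_b = sphere_sag(radius_b, aperture_radius)
delta_nm = (pv_b - pv_a) * 1.0e6  # convert mm to nm

print(f"PV A = {pv_a:.6f} mm, PV B = {pv_b:.6f} mm, difference = {delta_nm:.1f} nm")
```

For these assumed values, a radius change of 1 µm changes the PV value by roughly 33 nm, which is of the same order as the accuracies discussed in this paper.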
To compare the overall effect of the sampling schemes on the differences between the VRT and the reduced data sets, figure 14 shows the RMS and MAD values of the differences between the VRTs and the reduced data sets for all four cases. If no interpolated data are used to determine the necessary alignment or to determine the BFSes (case 1), the RMS values are around 30 nm. Using interpolated data to determine the necessary alignment to the design but not to determine the BFSes (case 2) does not lead to a significant improvement. When using interpolated data to determine the BFSes (case 3), the RMS values decrease to between about 1.5 nm and 6 nm. They decrease further to the sub-nanometer range when interpolated data are also used to determine the necessary alignment to the design (case 4).
Figure 14. RMS and MAD values of the differences between the VRT and the different reduced data sets for the four different evaluation strategies for the toroidal surface: using data points of the original data sets (restricted to reval) (upper row, left), using interpolated data to determine the necessary alignment to the design (upper row, right), using interpolated data to determine the individual BFSes (lower row, left) and using interpolated data to determine the necessary alignment to the design and to determine the individual BFSes (lower row, right).

Discussion and conclusions
The results of this paper show that, when comparing highly accurate form measurements of optical aspheres and freeform surfaces, several influencing factors must be considered and special care is needed in data evaluation.
Evaluation procedures for a pointwise comparison of such measurement data must be able to process different, anisotropic 3D measurement point distributions for which different sampling schemes and resolutions have been used. Furthermore, the measured areas of the surface can be different in size, while the data sets are generated in different coordinate systems that use different sample orientations. Additionally, in most cases, uncertainties are not yet available for all measurement data, preventing weighted reference topographies from being calculated. Another challenge is that the true form of the specimen is usually unknown.
In this paper, we have presented an evaluation procedure that accounts for these influencing factors. Furthermore, using virtually generated test data, we have investigated the influence of the different 3D measurement point distributions on the comparison results for evaluation procedures that have different levels of sophistication.
We have demonstrated that caution is required when determining the best-fit spheres (BFSes) of the individual residual data sets as well as when determining the necessary alignment of the data sets with the design topography in order to compare highly accurate form measurements with accuracies in the range of several tens of nanometers RMS.
The two example surfaces studied in this paper (asphere and toroid) clearly show that the use of interpolated data to determine individual BFSes significantly reduces the root-mean-square differences between the measurement data and the VRT (pointwise median topography of the reduced data sets). A further reduction to the sub-nanometer level can be achieved when interpolated data sets are also used to determine the necessary alignment of the measurement data to the design. This study further shows that different sampling schemes may have a much larger impact on the comparison results than the total number of data points. However, for point densities lower than the ones investigated here, as well as for specimens with higher frequency structures, the effect of the total number of data points may be larger than that of the sampling schemes. To summarize, both effects may impact the comparison result and must be considered when comparing very high-accuracy form measurements of aspheres and freeform surfaces. Since the effects depend to a great extent on the specimen's design and on deviations from this design, quantitative statements for other surface forms cannot be made without corresponding evaluations. Nevertheless, the two examples presented here (which represent typical specimens in recent measurement comparisons) give an idea of the extent of the effects investigated. Based on our findings, we recommend interpolating all measurement data to the same common regular grid to determine the necessary alignment to the design and to determine the individual BFSes. We also recommend using this same grid for comparing the reduced data and for calculating the VRT.
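The effect behind this recommendation can be reproduced with a strongly simplified Python sketch. It is a toy model under several assumptions, not the evaluation procedure of this paper: a hypothetical rotationally symmetric surface z(r) = c r^2 + k r^4 is used, a one-parameter least-squares fit of z ≈ c_fit r^2 stands in for the BFS fit, a ring sampling with equally many points per circle loosely resembles a CIRCLES-type scheme, a regular Cartesian grid loosely resembles an EQUI-type scheme, and linear interpolation along the radius stands in for a full 2D interpolation. All names and numbers are illustrative.

```python
import bisect
import math

APERTURE = 1.0  # aperture radius, arbitrary units


def surface(r, c=1.0e-3, k=2.0e-4):
    """Hypothetical rotationally symmetric test surface z(r) = c r^2 + k r^4."""
    return c * r**2 + k * r**4


def fit_curvature(points):
    """Least-squares fit of z ~ c_fit * r^2 to (r, z) pairs.

    One-parameter stand-in for a best-fit sphere: c_fit = sum(z r^2) / sum(r^4).
    """
    numerator = sum(z * r**2 for r, z in points)
    denominator = sum(r**4 for r, _ in points)
    return numerator / denominator


# Ring sampling: every circle carries the same number of points, so one
# (r, z) pair per circle yields the same fit as the full point set.
ring_radii = [APERTURE * i / 50 for i in range(51)]
ring_heights = [surface(r) for r in ring_radii]
ring_points = list(zip(ring_radii, ring_heights))

# Common regular Cartesian grid, restricted to the aperture.
N = 101
axis = [-APERTURE + 2.0 * APERTURE * i / (N - 1) for i in range(N)]
grid_radii = [math.hypot(x, y) for x in axis for y in axis
              if math.hypot(x, y) <= APERTURE]
grid_points = [(r, surface(r)) for r in grid_radii]


def interpolate_rings(r):
    """Linear interpolation of the ring heights at radius r (0 <= r <= APERTURE)."""
    j = min(bisect.bisect_right(ring_radii, r), len(ring_radii) - 1)
    r0, r1 = ring_radii[j - 1], ring_radii[j]
    t = (r - r0) / (r1 - r0)
    return (1.0 - t) * ring_heights[j - 1] + t * ring_heights[j]


c_rings = fit_curvature(ring_points)   # fit on the original ring sampling
c_grid = fit_curvature(grid_points)    # fit on the regular grid
c_rings_on_grid = fit_curvature(       # ring data interpolated to the grid
    [(r, interpolate_rings(r)) for r in grid_radii])

print(f"rings:            {c_rings:.6e}")
print(f"grid:             {c_grid:.6e}")
print(f"rings -> grid:    {c_rings_on_grid:.6e}")
```

In this toy setting, the fit on the original ring sampling differs from the fit on the regular grid because the two schemes weight the inner and outer zones of the aperture differently, while the fit on the ring data interpolated to the common grid agrees with the grid result much more closely.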
Furthermore, the method of employing virtually generated test data presented here may be generalized to further improve and validate future comparison methods. This procedure allows comparison methods to be validated and the influence of typical measurement errors (e.g. positioning errors, noise or outliers) on the comparison results to be investigated. It also allows new evaluation procedures (e.g. alignment procedures, interpolation methods or fitting algorithms) to be adapted and improved, and new sampling schemes or the data point densities necessary to measure the form of a certain specimen to be tested and developed.
Future work must address the development of methods for metrological compatibility as defined in [31], taking into account measurement uncertainties of the 3D measurement points (where these uncertainties are available).

Data availability statement
All data that support the findings of this study are included within the article (and any supplementary files).