Algorithms for using silicon steps for scanning probe microscope evaluation

The 2019 update to the Mise en Pratique for the metre adopted the lattice parameter of silicon as a secondary realisation of the metre for dimensional nanometrology. One route for this realisation is the use of amphitheatre-like monoatomic steps of silicon. In this paper we present new algorithms for one- and two-dimensional analysis of atomic force microscope images of these large-area atomic terraces on the surface of silicon. These algorithms can be used to determine the spacing between the steps and to identify errors in AFM scanning systems. Since the vertical separation of the steps is of the same order of magnitude as many errors associated with AFMs, great care is needed in processing AFM measurements of the steps. However, using the algorithms presented in this paper, corrections may be made for AFM scanner bow and waviness, and edge effects on the silicon steps can be taken into account. Applicability of the data processing methods is demonstrated on data sets obtained from various instruments. Aspects of the step arrangement on the surface and their impact on uncertainties are also discussed.


Introduction
The advent of the scanning probe microscope (SPM), in particular the atomic force microscope (AFM), opened a window into the nanoscale world and has helped accelerate the growth of nanotechnology, a multidisciplinary science that has found applications in many areas including healthcare, materials science, the semiconductor industry, optics and agriculture. Dimensional metrology has risen to meet this challenge [1]. Measurements using SPMs, the workhorse of nanotechnology, are made traceable via calibration standards: gratings and height standards that are calibrated using metrological atomic force microscopes developed by national metrology institutes; details of many of these microscopes are given in the review article [2], and a review of the application of metrological AFM to critical dimension (CD) metrology is given in [3]. Height standards for scanning probe microscopes have been developed and, in keeping with height standards for the calibration of surface texture instrumentation, they usually comprise a series of steps of known height. Measurements are made according to either the ISO 5436 standard [4] or a histogram method. The ISO method fits a profile to the base on either side of the step so that a correction for any tilt of the sample can be made. The histogram method collects the heights of all data points in the scan and plots them as a histogram with two peaks: one for the base and one for the step.
The smallest step height standards available (from Nanosensors) have a nominal height of 8 nm. With the growth in nanotechnology and requirements for nanoscale metrology, broadly in line with Taniguchi's predictions [5], there have been various efforts to develop bottom-up calibration standards based on crystal lattice spacings [6,7]. The disadvantage of these standards has been their asymmetry, i.e. the inability to apply the ISO 5436 method [4] for evaluating the step height. Work reported by Yacoot et al [8] showed the potential to produce monoatomic amphitheatre-like terraces of silicon and their potential for use as a calibration standard. This led to a major European research project [9], in which considerable effort has been directed toward producing such amphitheatre and staircase structures in a repeatable fashion so that they can be used as calibration standards. Within this Crystal Project, a large effort was directed towards creating a standard procedure for producing silicon amphitheatre structures in a repeatable and controllable manner. An example of these structures is shown in figure 1(a). Measurements from these novel samples form the basis for evaluating the data analysis algorithms demonstrated in this paper.
In parallel to these activities, the Consultative Committee for Length's Working Group on Nanometrology identified the need for secondary length standards for dimensional nanometrology, as the current method of optical fringe division is prone to non-linearity at the sub-nanometre level [10]. The lattice spacing of silicon was identified as a potential candidate. It has been measured extensively using x-ray interferometry in support of the work to measure the Avogadro constant [11] and in preparation for the 2018 revision of the SI. The Committee on Data for Science and Technology (CODATA) [12] list of recommended values for fundamental constants includes an agreed value for the silicon {220} lattice spacing: d_220 = 192.015 571 4 × 10⁻¹² m, with a standard uncertainty of 0.000 003 2 × 10⁻¹² m (i.e. a relative uncertainty of 1.67 × 10⁻⁸), at a temperature of 22.5 °C in vacuum. This is the lattice spacing of an ideal single crystal of pure silicon with natural isotopic distribution. It should be noted that for the length scales over which we measure (∼10 atomic steps) the impurity content is not significant [13,14]. To obtain the lattice spacing of any other planes within the silicon crystal, the d_220 value is first multiplied by the square root of the sum of the squares of the Miller indices (220) to obtain the crystal lattice parameter a_0; the spacing of any other planes is then calculated by dividing a_0 by the square root of the sum of the squares of their Miller indices. Hence d_111 = d_220 √8/√3 ≈ 313.56 pm.
The 2019 Mise en Pratique [14] for the metre includes three practical realisations of the lattice parameter of silicon as a secondary length standard: x-ray interferometry, where the crystal lattice spacing acts as a ruler [15]; transmission electron microscopy (TEM) calibration, where the individual atoms are imaged [16]; and monoatomic steps on silicon surfaces to be used for the calibration of the vertical axis of microscopes, e.g. scanning probe microscopes (SPMs) [17]. This paper deals with the last approach, focusing on data processing methods that can be used for obtaining reliable data from SPM measurements of monoatomic steps on silicon surfaces. Based on the Mise en Pratique document, we use the step height value of 314 pm throughout this work.
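The conversion from the CODATA d_220 value to the (111) step height is a few lines of arithmetic; the following sketch (plain Python, no dependencies) reproduces the 314 pm value used throughout this paper:

```python
import math

# CODATA recommended Si {220} lattice spacing (22.5 degC, vacuum), in metres.
d220 = 192.0155714e-12

# Lattice parameter: multiply d220 by sqrt(2^2 + 2^2 + 0^2) = sqrt(8).
a0 = d220 * math.sqrt(2**2 + 2**2 + 0**2)

# Spacing of any (h k l) planes: divide a0 by sqrt(h^2 + k^2 + l^2).
def plane_spacing(h, k, l):
    return a0 / math.sqrt(h**2 + k**2 + l**2)

d111 = plane_spacing(1, 1, 1)
print(round(d111 * 1e12, 1))  # ≈ 313.6 pm, i.e. the 314 pm step height
```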
The analysis in earlier work was typically based on fitting two parallel straight lines to profile segments near the same edge, leading to uncertainties between 1% and 4% [18]. However, many effects can influence the surface profile near a step edge, leading to a fitted step height s that differs from the bulk spacing of the corresponding lattice planes. These effects include a possible change in thickness of the native oxide covering the surface near the edge, changes in mechanical properties such as stiffness and friction, contamination at the edge, and uneven thickness of water layers adsorbed onto the surface near the edge. Tip shape and instrument settings such as scan speed and feedback gain will also influence the measured surface profile unevenly around the edge. The guiding system and distance sensors in both state-of-the-art commercial AFMs and tailor-made metrological atomic force microscopes (MAFMs) implemented at a few national metrology institutes are influenced by thermal drift and other non-linear errors [19,20], indicated as δz_e(x) in figure 1(b). Consequently, fitting two parallel straight lines near the same edge will yield a step height that differs from the spacing between the lattice planes.
Typical analysis in earlier work is based on line profiles along the fast scanning direction and has not fully assessed the quality of the 3D images. However, many applications now require the analysis of 3D images, including the accurate measurement of surface roughness, where, for example, the first comparison of measurements is in progress (EURAMET project No. 1239, 'Measurement of surface roughness by AFM').
To circumvent the influence of the profile section around the edge and of non-linear errors from the guiding system, in this paper we implement a universal and robust strategy for the measurement and analysis of full 3D images which:
• removes the influence of the non-linear errors of the surface profile δz_e(x) (see figure 1(b)) from the fitting process;
• excludes an appropriate section of width w_e of the profile near the edge from the analysis, using medium-wide terraces w_a and w_s for amphitheatres and steps, respectively (see figure 1(b)).
As mentioned above, two possible arrangements of monoatomic steps on silicon surfaces have been developed in the past: the staircase, where parallel monoatomic steps stack upward or downward from one side of the sample to the other, and more recently, the amphitheatre, where a set of concentric monoatomic steps are arranged downward from the periphery to the centre. The goal of the symmetrical amphitheatre arrangement developed in the Crystal project [9] was to improve the calibration uncertainty. In this study, we demonstrate the benefits of the amphitheatre arrangement from the standpoint of separating the steps from a polynomial background (both 1D and 2D), addressing aspects of different step lengths, different numbers of steps and different surface roughness. Due to the small, sub-nanometre (0.314 nm) height of the monoatomic steps, the SPM data processing procedures of line levelling, background removal and statistical analysis become critical for obtaining accurate results. This is illustrated in figure 1(c), where a high-degree polynomial fit was used to subtract non-linear background variations due to instrument errors such as piezo bow. In addition, advanced data analysis routines are needed to further improve the outcome of the fitting, and optimised segmentation of curves is needed to avoid edges and their adverse effects on measurements of step height.
During data fitting, the step height should be kept constant over multiple steps, i.e. treated as a single shared parameter.
Methodically, this paper focuses on two main approaches for the evaluation of silicon monoatomic steps from SPM data:
• Evaluation of individual 1D profiles, where measurements along the fast-scanning axis of the instrument are used to minimise time-dependent artefacts along the slow-scanning axis, such as thermal or mechanical drift.
• Evaluation of 2D data sets, where additional information about the scanner motion can be obtained and more data can be used for the statistical evaluation. Compared to a profile measurement, this has the additional advantage that the data are effectively averaged over a larger area.
The algorithms presented here are publicly available in the Gwyddion open source software [21,22]. The presented data fitting methodology can be applied in any data processing software.

Silicon steps and measurement procedure
Silicon steps with an amphitheatre arrangement were produced at the Physikalisch-Technische Bundesanstalt, Germany, from single crystals of natural silicon (111) wafers that were annealed in ultra-high vacuum to invoke a diffusion-based self-assembly of the steps or amphitheatres. The general preparation of monoatomic steps and their metrological characteristics are described in the Mise en Pratique [16]. The Si surface lattice is reconstructed into a 7 × 7 structure, but areas of 1 × 1 reconstruction are sometimes present. A native oxide layer with a thickness of (1–2) nm covers the surface. Further description of the production steps, examination of the monoatomic steps by interference microscopy and evaluation of stability over time will be presented in a separate paper. Many good-quality terraces with terrace widths from < 1 µm to > 20 µm were identified on more than five different samples examined months after production and weeks after exposure to ambient air.
AFM measurements were performed using two high-resolution AFMs to assess the general usefulness of the algorithms developed and implemented. One microscope was a metrology AFM NX20 (Park Systems, South Korea). This microscope is designed for large samples, with a scanning range of 100 µm × 100 µm × 15 µm in x, y and z, respectively. The AFM is equipped with optical sensors in x and y, and strain gauge sensors in z. Scanning was conducted in intermittent contact mode with PPP-NCH probes and in contact mode with PPP-CONT probes (Nanosensors, Switzerland). The other system used was a bespoke AFM, the Tip Sample Interaction AFM (TSI) [8], with a high-resolution plane mirror differential interferometer to traceably measure the displacement of the cantilever with respect to the sample. Lateral scanning of the sample was performed using a precision stage with a scanning range of 100 µm × 100 µm in x and y that was servo controlled via feedback from capacitance sensors. The stage had been error mapped using optical interferometry. Scanning was conducted in intermittent contact mode with PPP-NCH probes.

Data pre-processing
Data preprocessing is typically used as the first step of AFM data evaluation in order to correct instrument-dependent systematic errors, such as mismatch of individual profiles or sample tilt. Here, the required data preprocessing depends on the type of measurement and evaluation:
• individual profiles, which are suitable only for 1D evaluation;
• images averaged along the slow axis to obtain profiles with reduced noise for 1D evaluation;
• images evaluated using the full 2D approach.
However, there are some common requirements for data irrespective of which method is being used for evaluation of the steps. In particular, human influence should be minimised and unnecessary data modifications avoided [23].
The 1D evaluation may not require any preprocessing. For single-profile analysis, a profile without defects should be selected. If several profiles in an image are averaged to obtain a 1D average profile, potential scan defects or particles on the surface should be marked; the average profile is then computed with these regions excluded. Line (1D) or plane (2D) levelling can provide a useful visual aid and facilitate visual inspection. However, it is not strictly necessary, because the step height evaluation includes fitting of polynomials, and line or plane levelling is simply fitting a polynomial of degree one.
The full 2D evaluation almost always requires the correction of misaligned scan line heights. Most conventional correction methods, such as median, mean or tilted line subtraction, deform the surface in the direction of the slow scanning axis. This is because they make the average height (or some other characteristic height) of all scan lines the same. However, an amphitheatre does not have a constant average height for all scan lines, and therefore becomes deformed. The deformation has a slowly varying component, closely following a parabolic shape, which is fitted well by a quadratic polynomial. It also has a high-frequency part appearing around the scan lines where a new terrace level is encountered. This part is not fitted well by polynomials and can cause underestimation of the fitted step height. Median value subtraction frequently produces sudden jumps when the median changes from one level to another and should not be used.
The 'median of differences' (MoD) line offset correction method, implemented in Gwyddion [21], preserves the overall amphitheatre shape. The method calculates the height differences of corresponding pixels in each pair of neighbouring scan lines. For each pair the median of the differences is then found, and all lines are shifted in z to make all the medians zero. For data with a relatively high level of noise, which are common here, the correction can be improved by excluding regions close to terrace edges from the processing. This can be conveniently done using the same edge detection function described in section 4.4. When MoD alone is insufficient because tilt has to be removed from each line individually, the tilt removal is done first. The correct sequence of operations, if all steps are included, is therefore: per-line tilt removal, masking of terrace edges, and masked MoD offset correction. The rest of the preprocessing for the full 2D evaluation consists of marking of defects, possibly plane levelling, and manual marking of edge pieces that are not correctly detected automatically.
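As an illustration, the MoD offset correction can be sketched in a few lines. This is a minimal Python/NumPy version, assuming the image is a 2D array with one scan line per row and an optional boolean mask of pixels (e.g. terrace edges) to exclude; the Gwyddion implementation differs in detail:

```python
import numpy as np

def median_of_differences(image, exclude=None):
    """Align scan lines by the median-of-differences (MoD) offset method.

    image:   2D array, one scan line per row.
    exclude: optional boolean mask of pixels to ignore (e.g. terrace edges).
    Returns a corrected copy. The overall amphitheatre shape is preserved
    because only a per-line offset is subtracted, never a per-line mean.
    """
    img = np.asarray(image, dtype=float)
    valid = ~exclude if exclude is not None else np.ones_like(img, dtype=bool)
    offsets = np.zeros(img.shape[0])
    for row in range(1, img.shape[0]):
        both = valid[row] & valid[row - 1]
        diff = img[row, both] - img[row - 1, both]
        # Shift this line so the median difference to the previous line is zero;
        # the offsets accumulate (integrate) along the slow scanning axis.
        offsets[row] = offsets[row - 1] + np.median(diff)
    return img - offsets[:, None]
```

When each line also needs individual tilt removal, that is done before this step, as described above.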
The performance of different scan line correction methods is illustrated in figure 2 (the same effect can be seen in the images in figure 9). The profiles were taken at the image centre along the slow scanning axis in order to show how the individual methods deform the image (curves are vertically offset for visualisation; the absolute height is irrelevant). For real measured data, which exhibit scanner bow, it can be seen that the full procedure (tilt correction + masked MoD) more or less restores the undeformed amphitheatre shape by line correction alone. Other methods change the deformation, but it is not clear how, since the measured profiles were deformed to begin with.
The effect is thus more clearly illustrated using artificial data, generated using the Pattern synthetic data module in Gwyddion [21] to match real measurements, including point and line noise and scan line tilt. In this case the 'as generated' data are noisy but not deformed. All the scan line correction methods introduce deformations, except the full procedure, which again preserves the amphitheatre shape. The deformation is plotted as line residuals, i.e. differences of the scan line mean value from the mean value of the ideal shape (before noise was added). The low- and high-frequency components of the deformation are most easily distinguished in the mean and tilted-line subtraction curves (observe the regular cusps in the artificial data residuals). The residual curve for median correction also has sharp steps, i.e. a high-frequency component of deformation, only at different locations. For the full procedure the mean square difference is almost as small as for the 'as generated' surface, even though it has a different frequency content.
Remarkably, 'tilt + masked MoD' recovers the amphitheatre shape exactly, completely removing scan line tilt and offset if they are the only errors present. The residual differences therefore originate from point noise and background, not from the tilt and offset of scan lines. When two neighbouring lines are subtracted to obtain the median of height differences, the contribution of noise is non-zero: it is zero only in expectation, but each median is still a random variable, adding a small random error. The procedure then accumulates (integrates) these errors along the slow scanning axis. The slowly varying background adds a small systematic contribution. Together they result in the residual profile in figure 2. It is fortunate that the integration suppresses high frequencies in the point noise contributions, because the slowly varying components of the residuals can be fitted by polynomials.

Step height evaluation
The step height is evaluated by fitting the surface with a piecewise function combining the steps and a smooth polynomial background. We will start with 2D (image) data evaluation. Methods formulated for the processing of 1D data (profiles) can often be difficult to generalise to higher dimensions, whereas dimensionality reduction is usually straightforward. Therefore, in order to treat 1D and 2D data consistently, the 2D fitting procedure is developed first; its translation to 1D is described in section 4.5.
The procedure has two basic parts: segmentation of the image into individual terraces, and least-squares (LSQ) fitting by a model of the surface. Segmentation can be accomplished by many different means, from fully automated to manual tracing of the steps; it will be discussed in section 4.4. Here we assume individual terraces have been suitably marked and proceed with the fitting. The method needs to simultaneously find a global polynomial background and the height of each terrace, and achieves this in three steps:
(i) Preliminary fit without assuming anything about individual terrace segment heights.
(ii) Assignment of integer levels l_g to terrace segments.
(iii) Final fit with known terrace levels, which yields the step height s.
The last fit is analogous to the method described in the Mise en Pratique [17]; the first two steps can be considered bootstrapping. An overview of the entire procedure is shown in figure 3 (for generated data with slightly exaggerated artefacts). We note that the procedure is not inherently tied to silicon steps and could also be used for other regular step structures.

Preliminary (independent) least-squares fit
Denote by N_total the number of measurements, i.e. image pixels indexed by i = 1, 2, …, N_total, with lateral coordinates x_i and y_i and heights z_i. They are divided into G terrace segments (for details see section 4.4), indexed by g = 1, 2, …, G, plus the remaining image area, which consists of edges or defects and does not enter the fitting. A single terrace can be split into several segments by image boundaries or defects, as illustrated in figure 4(a). In this step each segment is considered separately.
A terrace segment is defined by its set of indices T_g, so that i ∈ T_g iff the ith pixel belongs to segment g. For image data the coordinates x_i and y_i form a subset of a regular grid. Nevertheless, this is not a requirement: the following procedure can be applied unchanged to arbitrary XYZ data collected in a non-raster scan, using for example the routines in the Gwyscan library [24] or [25].
We would like to assign an integer level l_g to each segment, equal to the number of steps to climb up (or descend) from a base level to reach the terrace. At the beginning, however, we do not know how to do that. Instead, each terrace is described by its height h_g, a real number to be determined. We then minimise the objective function S equal to the sum of squared differences

S = Σ_{g=1}^{G} Σ_{i∈T_g} [z_i − h_g − b(x_i, y_i)]²,   (1)

where

N = Σ_{g=1}^{G} N_g,   (2)

with N_g the number of pixels in segment g, is the number of values fitted (normally N < N_total because some pixels belong to edges) and b(x, y) is a background function describing tilt, bow and other errors associated with the scanning system. It should be noted that bow is the major error in non-metrological AFMs using a PZT tube scanner. For metrological AFMs that use a precision stage for scanning, however, the error is unlikely to be a bow and is orders of magnitude smaller [26]. In both cases the error can be taken to be a polynomial with K terms

b(x, y) = Σ_{k=1}^{K} a_k x^{p_k} y^{q_k}.   (3)

Here a_k are the unknown coefficients; p_k and q_k are powers of the coordinates. When the background characteristics in x and y are similar, the set of powers is usually chosen symmetrically by limiting the total degree of the terms, p_k + q_k ≤ P, where P is the (total) degree of the polynomial. The number of terms is then K = (P + 1)(P + 2)/2 − 1. The −1 appears because the constant term with p_k = q_k = 0 is replaced by the heights h_g and thus excluded.
The linear least-squares problem of minimising S is solved by setting ∂S/∂h_g = 0 and ∂S/∂a_k = 0. This gives the normal equations for the G + K unknown parameters (heights h_g and coefficients a_k), which can be written in block form

( H   M ) (h)   ( r^(h) )
( Mᵀ  T ) (a) = ( r^(a) ),   (4)

where h = (h_1, …, h_G)ᵀ and a = (a_1, …, a_K)ᵀ are the unknown parameters. Matrices H, M and T are G × G, G × K and K × K, respectively, with elements

H_{gg′} = δ_{gg′} N_g,   M_{gk} = Σ_{i∈T_g} x_i^{p_k} y_i^{q_k},   T_{kk′} = Σ_i x_i^{p_k + p_{k′}} y_i^{q_k + q_{k′}},   (5)

i.e. matrix H is diagonal, and sums over i without a segment subscript run over all fitted pixels. Finally, the two vectors forming the right-hand side of (4) are

r^(h)_g = Σ_{i∈T_g} z_i,   r^(a)_k = Σ_i z_i x_i^{p_k} y_i^{q_k}.   (6)

The normal equations (4) are solved easily using Cholesky factorisation, obtaining the heights of all terrace segments. The Gwyddion implementation transforms x and y to the range [−1, 1] to improve numerical stability.
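Since the preliminary fit is an ordinary linear least-squares problem, it can equally be solved by building the design matrix directly, which is sometimes more convenient than forming the normal equations explicitly. A minimal Python/NumPy sketch (function and argument names are our own, not Gwyddion's):

```python
import numpy as np

def preliminary_fit(x, y, z, labels, P):
    """Independent terrace-height fit (step (i) of the procedure).

    x, y, z: 1D arrays of pixel coordinates and heights (edge pixels removed).
    labels:  integer segment label (0..G-1) for each pixel.
    P:       total degree of the 2D background polynomial; the constant term
             is excluded because the per-segment heights h_g play that role.
    Returns (h, coeffs, powers): segment heights, polynomial coefficients
    and the (p, q) power pairs they belong to.
    """
    G = labels.max() + 1
    # Powers with 1 <= p + q <= P; there are (P+1)(P+2)/2 - 1 of them.
    powers = [(p, q) for p in range(P + 1) for q in range(P + 1)
              if 1 <= p + q <= P]
    # Rescale coordinates to [-1, 1] for numerical stability (as in Gwyddion).
    xs = 2 * (x - x.min()) / (x.max() - x.min()) - 1
    ys = 2 * (y - y.min()) / (y.max() - y.min()) - 1
    # Design matrix: one indicator column per segment, then the K monomials.
    A = np.zeros((len(z), G + len(powers)))
    A[np.arange(len(z)), labels] = 1.0
    for k, (p, q) in enumerate(powers):
        A[:, G + k] = xs**p * ys**q
    params, *_ = np.linalg.lstsq(A, z, rcond=None)
    return params[:G], params[G:], powers
```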

Integer level assignment
Minimising S uniquely determines the heights h_g, which now need to be translated to integer levels l_g. If we consider two
neighbouring terrace segments, only two possibilities exist. Usually they are separated by a single step and their levels differ by ±1, as depicted in figure 4(b). Occasionally the segment levels can differ by 0, i.e. the segments actually belong to a single terrace. Such oversegmentation occurs, for instance, if scratches and similar defects on the surface split a terrace into multiple pieces. The absolute height differences of neighbouring terraces therefore form two clusters, one around the correct step height s and one around zero. This allows classifying the differences into groups −1, 0 and 1, and constructing a neighbour graph similar to figure 4(b). To obtain absolute levels, we choose an arbitrary terrace, assign absolute level 0 to it and propagate the pairwise level differences. A unique reconstruction requires the graph to be connected and the pairwise level differences to be consistent. When these conditions are not satisfied, it indicates a segmentation failure or the presence of large defects on the surface. The correct course of action in such a case is to abort the procedure, as the Gwyddion implementation does. We note that this approach does not rely on a precise estimate of the step height s. If the neighbour terrace level differences are correctly sorted into ±1 and 0, the graph is built correctly and leads to the correct absolute integer level assignment.
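The level assignment can be sketched as a small graph traversal. This is a hypothetical Python helper, not Gwyddion's code; the neighbour list and rough step estimate are assumed to come from the segmentation and preliminary fit:

```python
from collections import deque

def assign_levels(heights, neighbours, step_estimate):
    """Translate fitted real heights h_g into integer levels l_g.

    heights:       sequence of per-segment heights from the preliminary fit.
    neighbours:    iterable of (g1, g2) pairs of neighbouring segments.
    step_estimate: rough step height used to classify differences into -1/0/+1.
    Returns integer levels (segment 0 is assigned level 0). Raises ValueError
    if the graph is disconnected or the differences are inconsistent, which
    indicates a segmentation failure (the procedure should then be aborted).
    """
    G = len(heights)
    # Classify each neighbour height difference into -1, 0 or +1 steps.
    edges = {}
    for g1, g2 in neighbours:
        edges[(g1, g2)] = int(round((heights[g2] - heights[g1]) / step_estimate))
    levels = [None] * G
    levels[0] = 0
    queue = deque([0])
    while queue:
        g = queue.popleft()
        for (g1, g2), dl in edges.items():
            for a, b, s in ((g1, g2, dl), (g2, g1, -dl)):
                if a == g:
                    if levels[b] is None:
                        levels[b] = levels[g] + s
                        queue.append(b)
                    elif levels[b] != levels[g] + s:
                        raise ValueError("inconsistent level differences")
    if any(l is None for l in levels):
        raise ValueError("neighbour graph is not connected")
    return levels
```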
The construction of the neighbour graph requires deciding which terrace segments are neighbours. The segmentation method in section 4.4 produces a natural segment separation equal to the sum of the kernel size and the edge widening. Therefore, a simple criterion can be used: if the minimum distance between two segments is not larger than twice this value, they are considered neighbours; otherwise they are considered disconnected.

Final step height fit
Once each terrace g has an assigned integer level l_g, the objective function (1) becomes

S = Σ_{g=1}^{G} Σ_{i∈T_g} [z_i − s l_g − a_0 − b(x_i, y_i)]²,   (7)

where s is a fitting parameter, the sought step height (the division into segments T_g remains the same). The unknown parameter a_0 represents a constant offset and could be included in the polynomial b(x, y) as the constant term (p_k = q_k = 0). However, the presented form preserves more closely the block structure of the normal equations, which now become

( U   V ) (u)   ( r^(u) )
( Vᵀ  T ) (a) = ( r^(a) ),   (8)

where the 2 × K mixed block V has elements

V_{1k} = Σ_g l_g Σ_{i∈T_g} x_i^{p_k} y_i^{q_k},   V_{2k} = Σ_i x_i^{p_k} y_i^{q_k}.   (9)
Matrix T and vector r^(a) remain unchanged from definitions (5) and (6). Vector a also retains its meaning, although the coefficients will be determined anew. Vector u has only two elements, u = (s, a_0)ᵀ, and the two elements of r^(u) are

r^(u)_1 = Σ_g l_g Σ_{i∈T_g} z_i,   r^(u)_2 = Σ_i z_i.   (10)

Matrix U is only 2 × 2 and has the elements

U_11 = Σ_g N_g l_g²,   U_12 = U_21 = Σ_g N_g l_g,   U_22 = N.   (11)

Solution of the normal equations (8) provides the step height s (and the LSQ estimate of its standard deviation).
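The final fit is again a linear least-squares problem, now with only 2 + K parameters. It can be sketched with a design matrix instead of the explicit normal equations (a Python/NumPy illustration with our own naming; solving the normal equations by Cholesky factorisation, as Gwyddion does, gives the same result):

```python
import numpy as np

def final_fit(x, y, z, labels, levels, P):
    """Final LSQ fit: one shared step height s, offset a0, and background.

    labels: segment label (0..G-1) for each pixel.
    levels: integer terrace level l_g for each segment label.
    Returns (s, a0, coeffs).
    """
    l = np.asarray(levels)[labels]          # per-pixel integer level
    powers = [(p, q) for p in range(P + 1) for q in range(P + 1)
              if 1 <= p + q <= P]
    xs = 2 * (x - x.min()) / (x.max() - x.min()) - 1
    ys = 2 * (y - y.min()) / (y.max() - y.min()) - 1
    # Columns: levels (fitted by s), constant (a0), then the K monomials.
    A = np.column_stack([l, np.ones_like(z)] +
                        [xs**p * ys**q for p, q in powers])
    params, *_ = np.linalg.lstsq(A, z, rcond=None)
    return params[0], params[1], params[2:]
```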

Segmentation and preprocessing
There are many possible methods for segmenting the image, for instance active contour methods, often based conceptually on the Mumford–Shah approach [27], such as the Chan–Vese method [28]. In principle these could allow simultaneous segmentation and determination of the smooth polynomial background. However, operations such as the exclusion of regions close to step edges would still have to be incorporated by post-processing. A simpler automated segmentation method based on elementary morphological operations has been implemented in Gwyddion. It works as follows:
(i) Application of an edge detection filter, realised as convolution with a set of Gaussians (of adjustable kernel size) multiplied by sgn(x) functions rotated in the xy plane.
(ii) Thresholding with a user-adjustable threshold.
(iii) A morphological filter bridging small gaps in the detected edges.
(iv) Widening of the edges (i.e. exclusion of regions close to edges from the fitting) using the Euclidean distance transform with adjustable width.
(v) Optionally, combination with an arbitrarily shaped user-supplied mask to enforce the exclusion of some pixels, making them edges (or, alternatively, joining separated segments by forcing the inclusion of some pixels in the fitted area).
(vi) Segment labelling and removal of segments that are too small.
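A simplified stand-in for this pipeline can be assembled from standard image processing primitives. The following Python sketch uses NumPy and SciPy; it replaces the Gaussian edge filter of step (i) with a plain gradient magnitude and omits the gap bridging of step (iii), so it is an illustration of the idea rather than the Gwyddion implementation:

```python
import numpy as np
from scipy import ndimage

def segment_terraces(image, threshold, widen=2, min_pixels=50):
    """Simplified terrace segmentation by edge masking and labelling.

    Detects edges from the local gradient magnitude, widens them, and labels
    the remaining connected flat regions as terrace segments.
    Returns (labels, edge_mask); labels is 0 on edges, 1..G on segments.
    """
    gy, gx = np.gradient(image.astype(float))
    edges = np.hypot(gx, gy) > threshold                      # detect + threshold
    edges = ndimage.binary_dilation(edges, iterations=widen)  # widen edges
    labels, n = ndimage.label(~edges)                         # label segments
    for g in range(1, n + 1):                                 # drop tiny segments
        if np.sum(labels == g) < min_pixels:
            edges |= labels == g
    labels, _ = ndimage.label(~edges)
    return labels, edges
```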
Alternatively, the segmentation might be done manually or with the help of an external tool as Gwyddion can directly take a user-supplied mask image and use it as the marked terraces. The mask can be produced by any combination of Gwyddion functions, other software or even tracing edges by hand.

Reduction to one dimension
For the analysis of one-dimensional data, i.e. line profiles, the background polynomial b, introduced in equation (3), becomes a function of x only:

b(x) = Σ_{k=1}^{K} a_k x^k.   (12)

Since K = P and the powers are now just the numbers 1, 2, …, K, we can write directly k instead of p_k. All the matrices shrink accordingly and contain no powers of y (formally q_k ≡ 0). This is the only change in the least-squares fitting. The segmentation method used for images also reduces to one dimension in a straightforward manner. Since the topology of the problem is much simpler, integer level assignment utilises the fact that each terrace (except the first and the last) has exactly two neighbours, one to the left and one to the right. Estimated mutual differences between neighbouring terraces can thus always be resolved to global integer levels.
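In one dimension the whole final fit collapses to a small design matrix. A minimal Python/NumPy sketch, assuming the segment index ranges and their integer levels are given (e.g. from a manual selection):

```python
import numpy as np

def fit_profile(x, z, segments, levels, P):
    """1D step-height fit: shared step height s plus polynomial background.

    segments: list of (start, end) index ranges of terraces (edges excluded).
    levels:   integer level of each terrace.
    P:        degree of the 1D background polynomial (K = P terms, no constant).
    Returns (s, a0, coeffs).
    """
    idx = np.concatenate([np.arange(a, b) for a, b in segments])
    l = np.concatenate([np.full(b - a, lv)
                        for (a, b), lv in zip(segments, levels)])
    xs = 2 * (x[idx] - x.min()) / (x.max() - x.min()) - 1
    # Columns: levels (fitted by s), constant (a0), then powers of x.
    A = np.column_stack([l, np.ones(len(idx))] +
                        [xs**k for k in range(1, P + 1)])
    params, *_ = np.linalg.lstsq(A, z[idx], rcond=None)
    return params[0], params[1], params[2:]
```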
In one dimension it is much easier to specify the segments to fit manually, so the Gwyddion implementation allows manual selection of the regions to fit. An advantage of 1D scan line processing is that line misalignment does not need to be corrected, which means fewer data modifications. Automated processing could in the future enable the evaluation of many scan lines and statistical characterisation of the results. This could be an interesting alternative to 2D processing and could also provide more useful uncertainty estimates than the confidence intervals from LSQ.

Assessment of fit quality
The noise in typical measurements is relatively large compared to the step height (as discussed in section 5.2). Therefore, the residual sums of squares S in (1) and (7) tend to be dominated by noise, and disagreement between the model and the data represents only a small fraction of their values. This has several consequences.
First, even for completely wrong fits, for instance with a step height s only 2/3 of the true value (and correspondingly wrong integer levels), S may not be even 20% larger than for a good fit. Such a wrong fit can occur if the polynomial degree is too low and is usually very easy to spot. In (semi-)automated processing, however, a quantity more sensitive to misfit would be useful for checking whether the fit is reasonable.
We looked into the Akaike information criterion (AIC) [29], the Bayesian information criterion (BIC) [30] and similar goodness-of-fit characteristics, in particular in relation to the choice of polynomial degree. However, here N is very large (N ≈ 10⁶ to 10⁷) and the number of fitting parameters moderate (in the 5–100 range). The criterion value is always dominated by the term N log S, which penalises the fit residuum and is common to all the criteria. We extrapolate that it would reach its minimum for P in the order of hundreds. It is not even feasible to fit 2D polynomials of such high degrees, so that would be an unsatisfactory recommendation; the same holds for always using as high a polynomial degree as technically feasible, considering the observed behaviour of the fits with increasing degree (see section 5).
A quantity which seems to characterise the misfit nicely is the 'terrace discrepancy' ∆_g. For each terrace we calculate the mean difference between the fitted height and the actual height

∆_g = (1/N_g) Σ_{i∈T_g} [z_i − s l_g − a_0 − b(x_i, y_i)].   (13)

Note that there is no squaring in (13), unlike in (7). The overall discrepancy ∆ is then defined as the mean absolute terrace discrepancy

∆ = (1/G) Σ_{g=1}^{G} |∆_g|.   (14)

This quantity penalises mismatches between the mean heights of the fit and the actual surface, yet it is insensitive to noise because noise sums to zero (more precisely, it is suppressed by factors 1/√N_g). Therefore, the Gwyddion implementation displays it alongside the residual sum of squares (see also figure 7) as a quantity changing more noticeably with overall misfit.

The second consequence of large noise is that the estimated standard deviation of the step height ε_s is smallest for P = 0, i.e. no polynomial at all, even though the fit is quite poor and the value of s clearly wrong. Why? The number of data points N is huge, so the number of free parameters is insignificant and we can consider Student's coefficients in the limit N → ∞. The estimated confidence interval width for s (at one standard deviation) is then simply

ε̂_s = σ̂ √Q_11,   (15)

where Q is the cofactor matrix, i.e. the inverse of the normal matrix in (8), and σ̂² is, in the variance–covariance estimate, simply the mean square difference σ̂² = S/N. As noted above, S decreases somewhat with increasing P, but relatively slowly, whereas Q_11 increases more rapidly (we will examine the behaviour of Q_11 in more detail in section 5.3). The net result is that ε̂_s never attains a smaller value than for P = 0. Of course, for P = 0 the differences between the data and the fit are not random, and the estimated standard deviation for P = 0 is not valid.
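The terrace discrepancy is cheap to compute from the fit residuals. A Python/NumPy sketch, assuming (as above) that the overall ∆ aggregates the per-terrace means by their mean absolute value:

```python
import numpy as np

def terrace_discrepancy(z, fit, labels):
    """Per-terrace mean residual Delta_g and overall discrepancy Delta.

    z, fit:  measured heights and fitted model values for each pixel.
    labels:  segment label per pixel (0..G-1).
    Unlike the residual sum of squares, the means are not squared, so random
    noise largely cancels (suppressed by 1/sqrt(N_g)) while a systematic
    mismatch of terrace mean heights survives.
    """
    G = labels.max() + 1
    delta_g = np.array([np.mean(z[labels == g] - fit[labels == g])
                        for g in range(G)])
    return delta_g, np.mean(np.abs(delta_g))
```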

1D data evaluation
1D data evaluation has benefits because the fast scan axis of the AFM usually has less thermal and mechanical drift. Moreover, obtaining full 2D data of good quality can be more difficult. To acquire data for 1D evaluation, a feasible measurement strategy is to locate the amphitheatre and to record multiple line profiles with the slow-scanning axis disabled. We start with the 1D data evaluation corresponding to this strategy. An example of the 1D fitting result is illustrated in figure 1(c) for a measurement using the NX20. The profile was fitted with a polynomial model function of increasing degree at equidistant levels until a good fit was found. While the fit result seems satisfactory, it is necessary to investigate multiple independent data sets to draw a general conclusion. Data from scans at different locations and on different samples, with different tips, scan angles and instruments, were fitted with polynomial model functions of degree P = 0 to 18 (figure 5). As the instruments used were either calibrated or had on-board traceable metrology, the results from the different instruments should be consistent. When the polynomial model function has a degree P ≥ 4, the fitted step height s is constant to within a few pm. The average of the 5 fitted step heights is 319 pm with an observed standard deviation of approximately 5 pm, equivalent to approximately 1.5% of the fitted step height. The measurements were recorded on different samples, at different times, with different instruments and with different operating parameters. Other contributions to the spread in results include, for example, variation in thermal drift, stage guiding errors, and different surface profiles at the positions measured. These results show that despite these error sources, the algorithms are robust to experimental deviations and useful for the estimation of step heights with a precision better than 5 pm on the systems investigated.
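Once the integer terrace level of each point is known, fitting a common step height together with a polynomial baseline is an ordinary linear least-squares problem. The sketch below illustrates the idea on a synthetic amphitheatre profile; the function name, the synthetic data and the chosen degree are ours, not the Gwyddion implementation.

```python
import numpy as np

def fit_step_height(x, z, levels, degree):
    """Fit z ~ s*levels + polynomial(x) of the given degree by linear
    least squares; return the step height s and the poly coefficients."""
    # Design matrix: step-level column first, then 1, x, x^2, ...
    cols = [levels.astype(float)] + [x**k for k in range(degree + 1)]
    A = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(A, z, rcond=None)
    return beta[0], beta[1:]

# Synthetic amphitheatre: terrace level proportional to |x|, a gentle
# parabolic bow, a tilt and some noise (all values illustrative).
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 4001)
levels = np.floor(6 * np.abs(x)).astype(int)    # amphitheatre levels
true_s = 0.314                                  # nm, Si monoatomic step
z = true_s * levels + 0.5 * x**2 - 0.2 * x + rng.normal(0, 0.05, x.size)
s, coeffs = fit_step_height(x, z, levels, degree=4)
```

Because the step-level column enters the design matrix like any other basis function, the tilt and bow are removed in the same solve that estimates s, which is exactly why no separate levelling of the profile is needed once the terraces are identified.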
Estimating the combined calibration uncertainty would require a model function representing the complete measurement procedure and the evaluation method, including these algorithms. The model function must have as input all significant parameters influencing the measurement. As the goal of this experiment was not to perform an interlaboratory comparison, but to show the applicability of the algorithms, we did not perform a full uncertainty analysis.
The convergence of the fitted step height s with polynomial degree gives a strong indication of good fitting quality. The polynomial fluctuation of the baseline is caused by the imperfect response of the translation stage and/or sensors. The result is image bow or waviness of < 30 pm, which ultimately defines the accuracy of the measurements. The data fitting shown above was performed with Gwyddion. The same method can easily be implemented on other software platforms. To demonstrate this, we implemented the same data fitting independently in Gwyddion and as an LSQ routine in Microsoft Excel. Only small differences (less than 2%) are observed between the independent implementations. The differences can be attributed to variation in the identification of the segments used for the fitting and of the excluded segments around the edges. This illustrates the impact of software implementation details that tend to be disregarded. The profile from the TSI AFM (not shown) is somewhat noisier due to the larger measurement loop, as the signal for the Z axis metrology comes via the interferometers. Nevertheless, it has the advantage of direct traceability.
While the aforementioned single-profile acquisition strategy is relatively simple and straightforward, it can sometimes be severely influenced by outliers caused by contamination, friction and drift. One approach to mitigating this problem is to acquire images for profile analysis. First, imaging is conducted to find the centre of the amphitheatre. An image is then recorded at the centre, consisting of a high resolution scan in the fast axis and a small range (< 5 µm) scan in the slow axis. The image is inspected for outliers and defects before an average profile is obtained by averaging along the slow axis with outliers excluded.
To compare the performance of the amphitheatre arrangement with the staircase arrangement, a Monte Carlo simulation of two equivalent measurements was made. For the simulated measurements to be equivalent they need to have the same lateral scan range and the same image distortion, chosen to be δz_e(x) = a₃x³ + a₂x² (see figures 1(b) and (d)), where the 3rd and 2nd order distortion coefficients a₃ and a₂ were chosen to give a maximal distortion of 1 nm, that is, |δz_e(x)| ≤ 1 nm over the full scan range of 100 µm, consistent with the performance of state-of-the-art AFMs. The number of samples per unit length also has to be the same. A profile from a single scan line can be approximated by a medium sampling distance of 30 nm, and a high-quality profile averaged over many scan lines can be approximated by a short sampling distance of 10 nm. A smaller sampling distance would probably violate the assumption that the data points are independent.
For the samples to be equivalent under the same measurement conditions, the width of a staircase terrace shall be equal to the width of two terraces on the amphitheatre, that is, w_s = 2w_a (see figure 1(b)). This ensures that the same number of step levels is assessed over the same scan range. Furthermore, the total width w_e of the edge region not used in the fitting is chosen to be 0.4 µm for both scenarios. This is a reasonable length to ensure that no effects near the edge influence the fitted step height.
To compare the performance, a small and a large step width were assessed (1 µm and 4 µm), with heights in the range from one step (0.314 nm) to nine steps (2.82 nm) and sampling densities of 10 pixel µm⁻¹ and 33 pixel µm⁻¹. For a simulated surface roughness on the plateaus (that is, the root mean square deviation of the profile) equal to 10 pm, the results of the simulation are given in figure 6. Further simulations have shown that the standard deviation of the fitted average height is proportional to the surface roughness.
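A simplified version of such a Monte Carlo comparison can be sketched as follows. This is not the simulation used for figure 6: the distortion coefficients, the fitting polynomial degree, the absence of edge exclusion and the run count are illustrative assumptions, but it reproduces the qualitative outcome for a profile with nine step levels.

```python
import numpy as np

def simulate(arrangement, n_levels, step=0.314, width_um=1.0,
             px_per_um=33, roughness=0.01, n_runs=200, seed=2):
    """Monte Carlo spread of the fitted step height for a staircase or
    an amphitheatre profile (simplified sketch: cubic + quadratic bow,
    Gaussian plateau roughness, degree-4 fit, no edge exclusion)."""
    rng = np.random.default_rng(seed)
    if arrangement == "staircase":
        # Terraces twice as wide, so both cover the same scan range.
        lv = np.repeat(np.arange(n_levels), int(2 * width_um * px_per_um))
    else:
        # Amphitheatre: levels rise symmetrically from the centre.
        half = np.repeat(np.arange(n_levels), int(width_um * px_per_um))
        lv = np.concatenate([half[::-1], half])
    x = np.linspace(-1.0, 1.0, lv.size)
    bow = 0.5 * x**3 + 0.5 * x**2            # |bow| <= 1 nm distortion
    A = np.column_stack([lv.astype(float)] + [x**k for k in range(5)])
    s_hat = []
    for _ in range(n_runs):
        z = step * lv + bow + rng.normal(0.0, roughness, lv.size)
        beta, *_ = np.linalg.lstsq(A, z, rcond=None)
        s_hat.append(beta[0])
    return np.std(s_hat)

sd_stair = simulate("staircase", 9)
sd_amph = simulate("amphitheatre", 9)
```

For nine levels the amphitheatre arrangement yields a noticeably smaller spread of the fitted height, in line with the trend reported here, because its overall shape is poorly absorbed by the background polynomial.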
From figure 6 it is seen that the amphitheatre gives the smallest standard deviation in most, but not all, cases. The amphitheatre performs much better for many steps; the standard deviation is a factor of three smaller for, e.g., nine steps (2.82 nm). However, for a single step height and a short step width the staircase arrangement is marginally better; this is, however, the measurement scenario which gives the highest standard deviation. This is caused by the wide edge zone w_e not used in the fitting. In the amphitheatre arrangement there are nearly twice as many steps present in a profile, and the number of points used in the fitting is therefore about 20% smaller than for the staircase arrangement.

2D data evaluation
The benefit of 2D data evaluation is that, in addition to yielding the step height, it can be used for characterisation of the AFM scanner. The scanner typically has some background out-of-plane movement related to its construction (e.g. parabolic movement for tube-type scanners), which is superimposed onto the measured data. The background is already present in the 1D data and is one of the reasons why we have to perform the levelling. However, a 1D profile of the background, taken as a cross-section somewhere in the scanner plane, can hardly be used as a characterisation of the scanner. With 2D analysis we can cover the scanner's whole 2D out-of-plane movement (within the suitable sample area), so the results of the analysis are more easily applicable to the analysis of uncertainties related to the scanner.
An example of the 2D data evaluation, including the Gwyddion user interface for running it, is shown in figure 7.
The key aspect is that it is possible to cross-check the correct separation of the different planes and to obtain the fit results, the polynomial background and the fit residuals. Similarly to the 1D case, we can also study the impact of the polynomial degree on the fitted height.
The noise in typical measurements is relatively large; we usually observe a mean square difference of around (60-180) pm. It is not even possible to distinguish individual terraces in the height distribution of a levelled surface, as shown in figure 8 for as-measured data (using the NX20). Scan line correction improves the histogram. However, it would clearly still be incorrect to attempt step height evaluation using the histogram: the peaks are deformed and irregularly spaced, if they are resolved at all.
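In its simplest form, the scan line correction mentioned above is a per-row alignment of the image; Gwyddion offers several variants of this. The minimal sketch below subtracts each row's median and ignores refinements such as masking the step edges, so it is an illustration of the principle rather than the exact procedure used here.

```python
import numpy as np

def align_rows_median(image):
    """Remove per-row offsets by subtracting each row's median,
    keeping the overall mean level of the image unchanged."""
    med = np.median(image, axis=1, keepdims=True)
    return image - med + np.mean(med)

# Rows with random offsets smear the height histogram; aligning them
# restores the line-to-line consistency of the fast (row) axis.
rng = np.random.default_rng(3)
img = np.tile(np.repeat([0.0, 0.314], 64), (128, 1))  # two terraces
img = img + rng.normal(0.0, 0.05, (128, 1))           # per-row offsets
out = align_rows_median(img)
```

On terraced surfaces a plain median can be biased when rows cross different fractions of high and low terraces, which is one reason masked or model-aware variants are preferable in practice.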
As the 2D analysis provides the 2D polynomial background, we can also compare it to the original data, as shown in figure 9 (this is the same measurement as in figure 1). Moreover, we can also separate the third component present in the data: values that are part of neither the polynomial background nor the fitted model of the steps. These data are related to random imperfections in the scan. As can be seen from figure 9(b), this random part is relatively pronounced and is related to a typical imperfection of AFM microscopes, the mismatch between individual lines. While the fast axis data (individual profiles, typically shown horizontally) are usually highly consistent, there are often slight shifts between the individual profiles, related to mechanical and thermal drifts, tip shape changes or changes in feedback loop performance. Such effects have to be compensated in the data pre-processing phase, as discussed in section 3. Separating them from the scanner background can be useful for the uncertainty analysis.
It should be noted that the background cannot be separated into a polynomial 'systematic' and a residual 'random' part in all cases. In some of the measurements we observed interference effects between the cantilever and the sample plane that distorted the data in a way that could not be compensated by the proposed levelling. The only remedy is to prevent such measurements, e.g. by slightly re-aligning the laser beam on the cantilever. This would in principle be a problem for the 1D data sets as well; a further benefit of using 2D data is that we can easily see the problem and avoid using such data sets for AFM calibration.
Similarly to the 1D case, we can also evaluate the dependence of the fitted step height on the polynomial degree, as shown in figure 10. Compared to the 1D data in figure 5, the heights for degree > 5 fluctuate less (by approximately an order of magnitude), thanks to the higher number of data points used for the evaluation.

Impact of overall geometry
Numerical simulations in section 5.1 demonstrated that the overall amphitheatre shape is preferable to the staircase, as it leads to a smaller variance of the fitted step height. We also observed in section 4.6 a peculiar behaviour of the LSQ estimate of the standard deviation as a function of the polynomial degree P. Here we explain both observations and show that they are closely related.
Consider two measurements with the same number of points and terrace levels, the same noise and bow, and fitted equally well. If the assumptions of LSQ are satisfied, the standard deviation is estimated by (15). Since σ̂ and N are the same for both measurements, any difference in variances is due to different cofactor matrix elements Q₁₁, which enter the LSQ standard deviation estimate through the square root term √Q₁₁. The matrix in (8) does not depend on any measured values. It depends only on the functions we fit and where we evaluate them. The same holds for its inverse Q. Therefore, √Q₁₁ decides how well, or how poorly, measurements in a given configuration can be used to determine the step height, and different measurement configurations can be compared by comparing √Q₁₁. The result of such a comparison of amphitheatre and staircase is shown in figure 11 for 1D fitting. In order to illustrate the fundamental differences clearly, the matrices were calculated for ideal geometries. Specifically, G = 6, 12 or 35 ideal terraces of equal width were assumed on the interval x ∈ [0, 1]. No regions were excluded around the edges since the geometry was ideal. Furthermore, the LSQ problem was considered in the limit N → ∞. Therefore, √(Q₁₁N), which does not depend on N, is plotted instead of raw √Q₁₁. Figure 11 confirms that the standard deviation is smaller for the amphitheatre than for the staircase, although the difference varies. In the case of many terraces and a moderate polynomial degree the difference can be large, whereas if there are only a few terraces the differences can be relatively small.
There is a remarkable qualitative difference in the behaviour of √Q₁₁ for the two geometries. Although it appears to converge to similar values with increasing P, for the staircase it jumps to this value quickly at degree P = 1, whereas for the amphitheatre it increases much more gradually. This difference becomes more pronounced as the number of terrace levels G increases, which suggests looking at the problem in the high-G limit. For this we replace the level l_g with a function l(x) in (7), writing s l(x_i) in the sum instead of s l_g. This is just a change of notation, because the levels l_g were determined in previous steps of the procedure and l(x) is a known fixed function which expresses the level as a function of position x. For the staircase geometry, l(x) resembles a linear function l(x) ∼ x, whereas for the amphitheatre l(x) ∼ |x| (see also figure 11). The steps superimposed on the overall shape become smaller and smaller with increasing G, and l(x) thus becomes closer to x or |x|.

Figure 11. Behaviour of the square root of the cofactor matrix element Q₁₁, entering the LSQ standard deviation estimate for the step height. 'Terrace levels' means the number of different levels. For amphitheatres, where each level is split into two segments, the two segments count as a single level.
For the staircase this means that l(x) becomes almost linearly dependent on the background polynomials, more precisely on the linear polynomial representing tilt. This explains why Q₁₁ increases sharply at P = 1, when the linear term is added to the polynomial background: the step height s becomes highly correlated with another fitting parameter, the polynomial coefficient a₁. On the other hand, the absolute value |x|, which describes the overall shape of the amphitheatre, is difficult to approximate by polynomials. Therefore, the correlation between s and the polynomial coefficients increases only slowly.
This explanation can be verified by considering a different type of amphitheatre geometry, one with an overall shape which can be approximated well by polynomials. The simplest possible example is the parabolic shape l(x) ∼ x². It is still an amphitheatre, albeit no longer with terraces of equal widths: they become narrower with increasing distance from the centre. The results for this shape are plotted in figure 11 as 'Parabolic'. It behaves as predicted, exhibiting a sudden increase of √Q₁₁ at P = 2, i.e. when the quadratic term is added.
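The behaviour shown in figure 11 is easy to reproduce numerically, since Q depends only on the fitted functions and the sampling positions. The sketch below evaluates √(Q₁₁N) for ideal staircase, amphitheatre and parabolic level functions by densely sampling x ∈ [0, 1]; the discretisation and function names are our own.

```python
import numpy as np

def sqrt_q11n(level_fn, degree, n=20001):
    """sqrt(Q11*N) for the model s*l(x) + polynomial of 'degree',
    approximating the N -> infinity limit by dense sampling of [0, 1]."""
    x = np.linspace(0.0, 1.0, n)
    A = np.column_stack([level_fn(x)] + [x**k for k in range(degree + 1)])
    Q = np.linalg.inv(A.T @ A / n)   # cofactor matrix scaled by N
    return np.sqrt(Q[0, 0])

G = 12          # number of terrace levels
eps = 1e-9      # keep floor() from producing an extra level at x = 1
stair = lambda x: np.floor(G * np.clip(x, 0, 1 - eps))
amph = lambda x: np.floor(G * np.clip(np.abs(2*x - 1), 0, 1 - eps))
parab = lambda x: np.floor(G * np.clip((2*x - 1)**2, 0, 1 - eps))

# Staircase: sqrt(Q11*N) jumps as soon as the linear term (P = 1) can
# absorb the overall shape; the equal-width amphitheatre degrades only
# gradually; the parabolic amphitheatre jumps once P = 2 enters.
q_stair = [sqrt_q11n(stair, p) for p in range(5)]
q_amph = [sqrt_q11n(amph, p) for p in range(5)]
q_parab = [sqrt_q11n(parab, p) for p in range(5)]
```

The three lists directly mirror the three curves of figure 11: a sharp rise at P = 1 for the staircase, at P = 2 for the parabolic shape, and a gradual growth for the equal-width amphitheatre.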
In 2D the situation is analogous: the overall shapes of the staircase and the amphitheatre are a plane l(x, y) ∼ x cos φ + y sin φ and a cone l(x, y) ∼ √(x² + y²). Again, one can be approximated well by a linear polynomial, whereas the other is difficult to approximate by polynomials. There are geometrical shapes even more difficult to approximate by polynomials than a cone. Nonetheless, considering the overall geometries we are actually able to prepare, the amphitheatre is clearly preferred.
The final remark following from figure 11 is that in the amphitheatre geometry it is clearly beneficial to evaluate profiles containing a large number of terrace levels, while keeping the background polynomial degree as low as possible. However, measurement of larger areas, which contain more terraces, can require higher background polynomial degrees. Furthermore, the necessity to exclude edges from the fitting, which was not considered in the above discussion, limits how narrow the terraces can usefully be. These recommendations therefore have to be balanced against each other and against practical measurement constraints.

Conclusions
The results from this paper directly support the use of silicon steps as a secondary length standard for nanometrology and build on the work presented in the Mise en Pratique for the metre. As the predictions of Taniguchi [5] have largely been realised, and as the growth of nanotechnology and its applications continues, so will the need for sub-nanometre metrology over larger ranges. SPM users are already turning to nature, using graphite steps to calibrate their SPMs. However, the advantages of the steps presented in this paper are clear: a link to the metre, multiple steps for long range calibration, and amphitheatre-like structures that both allow an ISO 5436-type step height calibration to be performed (i.e. the effect of sample tilt can easily be removed) and give more accurate results than those obtained using staircase-type structures. Given the almost universal application of SPMs in the field of nanoscience and the need for their calibration, the requirement for reliable calibration using the silicon steps will grow, as will the need for reliable data analysis of the steps. Since the size of the steps is comparable with error sources in SPMs, great care is needed in data processing, and this paper has demonstrated how this can be achieved. Robust 1D and 2D methods for fitting the monoatomic steps on silicon, and for separating them from scanner imperfections, have been presented together with guidance on their usage. The 1D method is well suited to calibration purposes, as it can reduce the possible errors related to a more complex scanner background. On the other hand, AFM measurements are two-dimensional, and in order to say anything about the capabilities of a particular microscope, the 2D method is a better reflection of the real instrument. Moreover, using the 2D method, interference effects and other measurement imperfections can be seen much more clearly.
In comparison with the one dimensional fitting, the two-dimensional fitting uses data from a much larger region of the sample rather than using only multiple scanned lines from a narrow region to build up a one dimensional profile.
Using simulated data and a theoretical analysis of the step evaluation procedure, we have demonstrated that amphitheatre structures are better for calibration purposes than asymmetric staircase steps. The methods presented here can be used for the secondary realisation of the metre, and as the discussed algorithms are freely available in the open source software Gwyddion, they can easily be applied to any user's data. The presented approach can also be used to obtain data for estimating the uncertainty component related to the data processing itself. The uncertainty itself depends strongly on the data set. However, for calibration measurements similar to the presented examples, it can be estimated from the dependence on the data processing parameters to be at least 10 pm for the 1D data evaluation and at least 2 pm for the 2D data evaluation. An important contribution to this uncertainty, which is not often encountered, is the impact of the polynomial degree used for the levelling. It is important to note that the goal of the paper was not to present a full uncertainty budget for the AFMs used in this study, or full uncertainty budgets for measurements of the particular silicon step samples used. We concentrated only on the uncertainties associated with the identification and fitting procedures for the steps. During this study it has become clear that other aspects of measurements of silicon step samples, such as the sampling interval, surface roughness, edge effects and other instrumental effects, require further investigation; they will be the subject of a separate paper, in combination with the results of a planned AFM comparison using monoatomic amphitheatre structures. The algorithms presented in this paper will be used for processing data from the comparison, and the uncertainty associated with the algorithms will form one component of the overall uncertainty associated with the measurement of silicon steps.

Acknowledgment
This work was supported by the EMRP project CRYSTAL (Project Number: SIB61), jointly funded by the European Commission and the Euramet participating countries. The silicon steps with the amphitheatre arrangement were prepared by Ludger Koenders, Oliver Lencke and Ingo Busch at PTB as part of the above project, and the authors also acknowledge useful discussions. The research was also supported by the project CEITEC 2020 under No. LQ1601, by funds from the Danish Agency for Institutions and Educational Grants, and by the Engineering Measurement Programme of the National Measurement System in the UK.