TopoStats – a program for automated tracing of biomolecules from AFM images

We present TopoStats, a Python toolkit for automated editing and analysis of Atomic Force Microscopy images. The program automates identification and tracing of individual molecules in circular and linear conformations without user input. TopoStats was able to identify and trace a range of molecules within AFM images, finding, on average, 90% of all individual molecules and molecular assemblies within a wide field of view, and without the need for prior processing. DNA minicircles of varying size, DNA origami rings and pore forming proteins were identified and accurately traced with contour lengths of traces typically within 10 nm of the predicted contour length. TopoStats was also able to reliably identify and trace linear and enclosed circular molecules within a mixed population. The program is freely available via GitHub (https://github.com/afm-spm/TopoStats) and is intended to be modified and adapted for use if required.


Introduction
The use of Atomic Force Microscopy (AFM) in structural biology has been increasing over the past 30 years; AFM is now a versatile and accessible technique for directly imaging single biomolecules. This has led to its adoption by a wide community of biomaterials scientists, biophysicists and structural biologists. The advances in the field were facilitated in large part by hardware development: A 100X increase in image acquisition times has allowed the visualisation of dynamic biological processes [1] [2]. This has been coupled with the development of more sensitive imaging modes and probes that can resolve the double-helix of DNA [3], or the subunits of a macromolecular protein complex [4], using commercially available equipment. These complement what is perhaps the defining feature of the AFM, unique among other structural tools operating at sub-nanometre resolution (cryo-EM, X-ray crystallography): its capacity for imaging in liquid at physiological temperatures, where imaged (bio)molecules are active and free to explore their native conformational space, with the caveats that molecules need to be adsorbed on a solid substrate and that the AFM probe exerts a small force (often ~0.1 nN) on the sample. The technique has facilitated studies in which biological processes are watched as "molecular movies": examples of which include the observation of myosin walking along an actin filament [5], observing the structural changes in bacteriorhodopsin upon light exposure [6] and visualising the assembly of centromeres [1]. In addition to seeing these changes in molecular structure, direct imaging with the AFM facilitates the observation of rare molecular states and conformations within a snapshot of a heterogeneous population, for example visualising deviations in the DNA double-helix induced by supercoiling [7]. These unique features of the AFM make it a versatile structural biology tool that can operate either standalone and/or complementing other techniques such as cryo EM and Xray crystallography, where rare conformations of molecules are obscured by averaging.
However, bio-AFM has arguably suffered from a lack of the kind community-led investment in image processing and analytical capability seen for other techniques, most recently in the cryo EM "resolution revolution" [8] [9]. Contrary to cryo EM or X-ray Crystallography, there are relatively few free and open source softwares available for automated analysis, despite the importance of automated analysis for minimising selection bias and facilitating statistical analysis. This puts a restraint on the use of AFM as a quantitative imaging technique. When image processing tools are used in AFM studies, analysis is commonly manually repeated for each individual molecule within images. Tools that facilitate this include the Bruker Nanoscope analysis, ImageJ [10] or the open source AFM imaging software, Gwyddion [11]. Automation with these software is possible but can be restricted to image correction (Nanoscope) or require writing home-made scripts (Gwyddion and ImageJ). An additional complication is the variable quality of AFM images, which can significantly impact image analysis, as molecules that are aggregated, in close proximity, or poorly resolved may be difficult to separate and have their conformation partly obscured. This lack of available software, combined with the specific problems with AFM sample preparation, is highlighted by a number of AFM studies which have required development of home-built image processing softwares, often developed simultaneously by separate labs to address practically the same samples and problems [7][12] [13].
To directly address these issues, and to nucleate a virtual area of shared analytical infrastructure within the bioAFM community, we have developed TopoStats -an open-source Python utility that combines AFM image correction, molecule identification and tracing into a single automated protocol. We use a Python implementation of Gwyddion (pygwy) [11] for rapid image correction, which we feed directly into our own Python modules for automated tracing and analysis of biomolecules, described here step-by-step. We use multiple DNA minicircle samples to demonstrate TopoStats is a reliable and accurate tool for automated single molecule identification and tracing, before demonstrating its versatility when applied to biological and biomimetic pores. We encourage the community to contribute to these tools (available at Github https://github.com/afmstats/TopoStats), and hope that this can be a starting step to link AFM image analysis to the growing tools freely available through Python distributions.

Purpose of TopoStats
TopoStats was developed to be a simple, easy to use and open-source program intended to function as both a fully operational pipeline for generalised AFM image processing and molecular tracing as well as a platform for the development of more complex and specialised image processing routines. TopoStats is implemented in Python 2.7 and makes use of the freely available Gwyddion [11], NumPy [14] and SciPy [15] Python libraries. Using Gwyddion functions, TopoStats supports all commercially used file formats making its use unrestricted for labs running commercial, and most homebuilt AFMs. We actively encourage and welcome community development of the TopoStats functions and libraries, the source code, installation instructions and a tutorial dataset are all freely available at: https://github.com/afmstats/TopoStats.

Overview of TopoStats program
TopoStats takes raw AFM data as input, performs basic editing of the images to remove typical imaging artefacts ( Figure 1A, B, C), and identifies individual molecules ( Figure 1D) using Gwyddion functions. TopoStats then automatically generates backbone traces for each identified molecule ( Figure 1E) and computes the contour length of circular or linear molecules without any user input. TopoStats generates length distributions for all identified molecules, and outputs this information as text files (.json files) and plots ( Figure 1F) which we have used to analyse conformation of a range of biomolecules. Using our setup, TopoStats automated processing is reasonably fast: for a typical 512x512 pixel image, TopoStats corrected the artefacts and identified molecules within the image in 0.5 s and traced the identified (n = 16) molecules in 3.3 s (figure 1) on a commercially available laptop.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 5, 2021. ; https://doi.org/10.1101/2020.09.23.309609 doi: bioRxiv preprint

AFM image correction
Distortions in raw AFM images were corrected using functions from the Gwyddion Python library 'pygwy'. First, we used first order polynomial subtraction (i.e., plane subtraction) to remove image tilt with the Gwyddion 'level' function ( figure 1A, B). Secondly, artefactual height (z) variations between fast scan (x-axis) line profiles were corrected by median background subtraction for each line using the Gwyddion function 'align rows', essentially ensuring that adjacent scan lines have matching heights (figure 1B, C). Remaining image corrections were removed using the automated Gwyddion function 'flatten base', which uses a combination of facet and polynomial levelling with automated masking ( figure 1D). Finally, we offset the height values in the image such that the mean pixel value (corresponding to the average height value of the surface) was equal to zero. High frequency noise was removed from images using a gaussian filter (σ = 1 pixel). We found this approach sufficient for all images shown in this study, however challenging, complex or unusual samples may require additional corrections.

Molecule Identification
TopoStats uses pygwy's automated masking functions to identify molecules on the sample surface. In this approach, each molecule is identified using a uniquely labelled mask (grain). The positions of these grains are defined by identifying clusters of pixels by height values that deviate from the mean by a user defined value, using pygwy's 'datafield.mask_outliers' function. We found a height threshold value of 0.75 -1σ to be optimal for most samples (with 3σ corresponding to a standard gaussian). This approach initially identifies all features with heights that deviate sufficiently far from the mean surface: single molecules, clusters of molecules or aggregates and arbitrary surface contaminants. For some samples, this threshold value needs to be carefully tuned by the user, as described for a range of biomolecules in section 3.3.
To refine our grain selection to include only single molecules we employed a simple approach to remove both clusters/aggregates (large objects) and surface contaminants (typically small objects). The median area for all grains is determined and grains that have an area +/-30% of this median value are removed. An additional pruning step removes grains that contain pixels that lie on the image borders.

Saving grain information
We save out the grain statistic information obtained using Gwyddion's pygwy functions to a ".json" file, situated in the root folder and named as the root folder i.e. "myfolderofdata.json". The grain statistic information is as follows: projected area, maximum height, mean height, minimum height, pixel area, area above half height, boundary length, minimum bounding size, maximum bounding size, centre x and y coordinates, curvature, mean radius, and ellipse angles.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 5, 2021. ; https://doi.org/10.1101/2020.09.23.309609 doi: bioRxiv preprint

TopoStats Tracing
To implement molecule tracing in TopoStats we developed our own Python tracing library for generating smooth traces of each molecule identified as a Gwyddion grain ( figure 1D, figure 2A, B). We also implemented functions in TopoStats for basic analysis of the traces (e.g. computing molecular contour length) and for visualising traces. These traces can be saved as text files, to facilitate visualisation, analysis and processing using a given user's preferred software packages or home-written scripts.
TopoStats tracing is implemented using the NumPy [14] and SciPy [15] Python libraries. The tracing process is composed of 5 basic steps: firstly, the Gwyddion grain (figure 2B) is "skeletonised" into a single pixel wide binary representation of the geometric centre along the molecular backbone ( figure 2C). Secondly, the positions of each pixel in the binary skeleton are extracted as cartesian coordinates ( figure 2D). This initial coordinate array must be reordered such that the coordinates follow the path of the traced molecule ( figure 2E). These trace coordinates are then adjusted such that they follow the highest path along the backbone of the underlying molecule. This adjusted trace is then smoothed by splining ( figure 2F) to produce the final molecular trace which can be saved as a text file.

Producing a rough binary skeleton
We used a modified version of the established "Zhang and Shuen" skeletonisation algorithm [16] to transform each Gwyddion grain (figure 3A) into a single pixel wide skeleton (figure 3B, C). Our adapted skeletonisation algorithm initially follows exactly the Zhang and Shuen approach: each grain is iteratively thinned by evaluating the local environment (a 3 x 3 grid) for each pixel (figure 3D), those pixels identified to be at the grain boundary are deleted whilst those at skeleton ends or required to maintain connectivity are not. We extended this process by including two additional "pruning" steps after initial skeletonisation: firstly to delete "redundant" pixels in the skeleton and secondly to remove branches that emanate from the skeleton (figure 3C). The method for identifying and removing these redundant pixels and skeleton branches is described in detail below. We defined redundant pixels within the single pixel wide trace as points that were not absolutely required to maintain the connectivity and overall shape of the skeleton (figure 3B, white points), typically arising at corners in the trace. Specifically, pixels which had 2 or more direct (vertical and/or horizontal) neighbours, i.e., neighbours in P2, P4, P6 or P8 positions, were identified as redundant. Deleting these points does not break the trace connectivity, as the trace connects between the remaining points in the P2, P4, P6 or P8 positions through a diagonal connection. As this evaluation is done on the skeletonised trace, only a specific set of rules is required to identify these hanging pixels. Firstly, hanging points were identified and deleted if they satisfied the condition 1: where A(P1) is the number of [0, 1] neighbours in the (P2, P3), (P3, P4) … (P9, P2) sequence (as defined in figure  3E) and any of the following conditions 2 -5: Additional redundant pixels were identified and deleted if they satisfied condition 6: CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 5, 2021. ; https://doi.org/10.1101/2020.09.23.309609 doi: bioRxiv preprint and any of the following conditions 7 -10: 7. P2 * P4 * P6 = 1 8. P4 * P6 * P8 = 1 9. P6 * P8 * P2 = 1 10. P8 * P2 * P4 = 1 After redundant pixels were removed, branches from the central trace were identified and deleted (figure 3B, blue points). The Zhang Shuen skeletonisation algorithm is known to produce anomalous skeleton branches and we thus judged any short branches from the central body of the skeleton to be artifactual and removed them. We identified potential branches by locating pixels with only one neighbour within a 3x3 local environment, i.e. any pixel that satisfied condition 11: where B(P1) is the sum of all pixel values within the local 3x3 pixel environment ( figure 3D). These coordinates are used to define the start of potential branches from which neighbouring pixels are sequentially added to the potential branch if they satisfy condition 12:

B(P1) = 2
Potential branches were deleted from the skeleton if a pixel was encountered along the potential branch that satisfied condition 13, i.e. if these branches were found to rejoin the main body of the trace:

B(P1) > 2
If pixels were found in potential branches that satisfied condition 11 these potential branches were judged to be linear molecules and were not deleted. This branch searching function is iterated until no branches are identified or deleted.

Determination of linear and circular molecules
We used a simple approach to determine if traces were of open-ended ("linear" in DNA terminology) or closed ("closed" circular, in DNA terminology): the local 3x3 neighbour array (figure 3D) was evaluated (using condition 11) for each pixel and those with only a single neighbouring coordinate were recorded. For a closed circular trace, there will be zero coordinates with a single neighbour, whereas a linear trace will have 2 coordinates with a single neighbour (i.e. both ends of the trace).

Producing an ordered trace
We extracted the cartesian coordinates of each molecule from binary skeletons (figure 4A) as a 2D NumPy array. In this procedure the coordinates are identified in ascending order along the x-axis and thus their sequence did not follow the trajectory of the underlying molecule and instead produced a nonsensical trace ( figure 4B). As such, we reordered the coordinates, to obtain a valid representation of the traced molecule, by implementing a localneighbour search algorithm. This algorithm iteratively identifies neighbouring coordinates from the list of "disordered" skeleton points, places the identified neighbour in the array of "ordered" coordinates and deletes this point from the list of disordered points. This approach maintains the direction of the traced molecule such that all coordinates from the skeleton are listed in a sequence that follows the trajectory of the traced molecule ( figure  4C).
The local-neighbour search function is initiated with a sensible coordinate to start the tracing process. For linear molecules, tracing starts from one of the skeleton ends, which are identified as coordinates with only one direct neighbour (as assessed using condition 11). For circular molecules, the starting coordinate is essentially arbitrarily assigned as any of the coordinates with 2-local neighbours, ensuring that tracing does not start at a crossing of the molecule over itself. These coordinates are the first points in the "ordered" coordinate array and, crucially, are . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 5, 2021. ; https://doi.org/10.1101/2020.09.23.309609 doi: bioRxiv preprint removed from the list of "disordered" skeleton points. For circular molecules, one of the 2 neighbours of the starting coordinate are arbitrarily chosen as the next point in the trace and appended to the ordered coordinate array and removed from the list of disordered points.
This starting coordinate is the first reference point (Pi) from which the tracing algorithm identifies the next point in the trace. This next point is identified by searching the list of disordered points for neighbouring coordinates of Pi, i.e., do any coordinates lie within the 3x3 neighbourhood of Pi ( figure 4D). From the first point in a linear trace, and indeed from most reference points within linear and circular traces, only one neighbouring coordinate will be present in the disordered list, which can thus be appended to the array of ordered points and removed from the disordered list. This identified coordinate then becomes the reference point for the next iteration of the tracing process. For most molecules, this simple, and fast, approach is sufficient to identify and append all points from the disordered list to the ordered array. However, a more complex method is needed to deal with reference points with multiple neighbours, which can occur when a molecule winds over itself or has a more complicated shape. At such points, the search algorithm aims to maintain the direction of the traced molecule by identifying the candidate point which deviates least from the trajectory of the coordinates in the ordered array. This is achieved by first determining the angle i between the reference point Pi and the coordinate 3 points behind Pi (Pi-3 ) in the ordered array. Then, the angles i+n between each candidate point and the coordinate 2 points behind the reference coordinate (Pi-2) are calculated. The candidate point whose angle i+n is closest to the reference angle, i, is chosen as the next point in the trace, and is appended to the ordered array and removed from the disordered list. (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 5, 2021. ; https://doi.org/10.1101/2020.09.23.309609 doi: bioRxiv preprint calculated. The candidate point with the vector angle most similar to that between Pi and Pi-4 is accepted as the next point in the trace.
The tracing process continues until either all the points from the disordered list are moved to the ordered array, (when the first point in the ordered array is identified as a potential next point, indicating that a circular molecule has been successfully traced), or until the reference point reaches the other skeleton end in a linear trace.

Producing a fitted trace
The single pixel wide trace generated by skeletonization approximates the geometric center of the molecular backbone generated from a binary "mask" of the underlying molecule. As such, the topology of the imaged molecule has little influence on the skeleton position which can thus be an inaccurate representation of the traced molecule, particularly at sharp turns or kinks. We addressed this problem by implementing a function to adjust the trace coordinates such that they traverse a path along the highest points along the molecule ( figure 5A). This function evaluates the local height profile of each trace coordinate, perpendicular to the trace direction, and adjusts the positions of each coordinate such that they lie at the highest point on the height profile ( figure 5B). To avoid fitting the trace to peaks arising due to noise, the topographical image is first gaussian filtered (2 nm full-width half maxima). This improves the fit of the trace to the underlying molecule, but highly curved segments of molecules remain challenging to accurately trace.

Splining Coordinates
Traces generated from images with a large (>1 nm) pixel size are not sufficiently sampled to smoothly trace the underlying molecule ( figure 6A). We solved this issue using parametric splining of the coordinates, to generate an interpolated trace that smoothly follows the contours of the underlying molecules. We used the SciPy interpolate functions to calculate splines. For the data presented here, the spline knots used to interpolate the traces were separated by 40Å, as an estimation of local bending. This value is defined by the user, and its value should be carefully considered based on the structural properties of the sample being investigated. To represent all points in the initial trace in the splined trace, an average of multiple independent splines is recorded ( Figure 6B). For . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 5, 2021. ; https://doi.org/10.1101/2020.09.23.309609 doi: bioRxiv preprint molecules with highly kinked backbones, this splining approach can underestimate local curvature. This effect can be compensated by reducing the spline knot distance or analysing the non-splined traces. It is important to note that non-splined and splined traces will have distinct contour lengths and as such may not be directly compared.

Calculating contour length
The contour length for each trace is calculated as the sum of the vectors between all neighbouring points in the splined trace, using the following equation: Where n equals the number of points in the splined trace and vi equals the vector between cartesian coordinates i and i+1.

Saving trace information
The calculated contour length, conformation, molecule number for each traced minicircle is saved using the pandas library filename using the tracestats object to a ".json" file.

TopoStats Plotting
TopoStats contains a 'traceplotting' script which uses the seaborn and matplotlib python modules [17] for data plotting. This script uses the "tracestats.json" output from the tracing as input. Data is grouped based on directory basename and is plotted as histograms, kernel density estimate (KDE) plots, and combinations of the two, in addition to violin plots.

Acquiring AFM images to be evaluated by TopoStats
TopoStats was designed as an effective tool for analysing molecular conformations within AFM images. It is however most effective when best practices are followed, which are explained in detail elsewhere [18]. The preparation of MAC, NuPOD and NPC samples are described in detail respectively in the literature [19] [20][21] [22]. As the accuracy of TopoStats is affected by the resolution of AFM imaging, we recommend following best practices for AFM imaging of soft biomaterials in solution using PeakForce Tapping mode [23][18], although sample preparation and imaging parameters may require optimisation for different samples.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 5, 2021. ; https://doi.org/10.1101/2020.09.23.309609 doi: bioRxiv preprint

AFM Imaging
All AFM measurements were performed in liquid in PeakForce Tapping imaging on a FastScan Bio AFM system using FastScan D cantilevers (Bruker). Imaging was carried out with a PeakForce Tapping amplitude of 10 nm, at a PeakForce frequency of 8 kHz, at PeakForce setpoints of 5-20 mV, (peak forces of <100 pN). Images were recorded at 512 × 512 pixels to ensure resolution ≥ 1 nm/pixel at line rates of 3.5 Hz.

Sample Preparation
DNA minicircles (sequences described in Appendix A) were adsorbed onto freshly cleaved mica specimen disks (diameter 6 mm, Agar Scientific, UK) at room temperature, using Ni 2+ divalent cations. 20 μL of 3 mM NiCl2, 20 mM HEPES, pH 7.4 buffer solution was added to a freshly cleaved mica disk. 5-10 ng of DNA minicircles were added to the solution and adsorbed for 30 minutes. To remove any unbound DNA, the sample was washed four times using the same buffer solution.

Results and Discussion
We designed TopoStats for fast and automated structure analysis of biomolecules from AFM images. Key to this is accurate backbone tracing of polymers and oligomers, and subsequent contour length measurement and conformation determination. We used four conditions to evaluate TopoStats function, each of which we deemed essential for its widespread use. Firstly, we aimed to successfully identify the vast majority (~90%) of available molecules that appeared isolated in the AFM images, including those from suboptimal images containing surface contaminants and aggregates. Secondly, we aimed to produce accurate traces. Thirdly, we aimed to distinguish between distinct conformations within a mixed population and, finally, we aimed to have TopoStats be versatile enough to identify and trace a range of biomolecules, without extensive optimisation and specialisation for distinct samples.

TopoStats for image processing and contour length determination
A key functionality of TopoStats is accurate identification and tracing of molecules from suboptimal images (those containing aggregates or surface contaminants). This facilitates faster data processing for the user as reliable molecule identification and tracing, including from poor images, reduces the need for manual inspection of each processed image. Additionally, optimising a sample to perfect homogeneity is not trivial and is often time consuming, and for some samples is not possible (Figure 7). Being able to extract useful information from suboptimal images thus facilitates AFM studies of more complex (and potentially interesting) samples and could save valuable lab time spent on sample optimisation. Here, we use two DNA minicircle samples (256 bp and 339 bp in length) to demonstrate that TopoStats can successfully identify and trace molecules from "ideal" images (339 bp sample) and from poorer images, containing aggregates and small surface contaminants (251 bp sample).
To check the completeness of molecule identification in TopoStats, we also manually counted the number of isolated, non-touching DNA molecules in the images to compare to the number identified by TopoStats.
Circular 339 bp DNA molecules were prepared by collaborators (appendix 1), immobilised on a mica surface, imaged with the AFM and the output raw data was analysed by TopoStats. Processed images showed a very clean sample, with essentially no aggregates or surface contaminants (figure 7Ai) facilitating excellent molecule identification: 99% of all single molecules were identified (415 of 419 molecules) and traced (figure 7Ai). The contour length histogram for 339 bp minicircles showed a well defined peak centred on the expected contour length of 115 nm (figure 7Aii). Despite the abundant presence of significant surface contaminants in the 251 bp . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 5, 2021. ; https://doi.org/10.1101/2020.09.23.309609 doi: bioRxiv preprint sample, evident in the images (Figure 7Bi), this dynamic of successful molecule identification and tracing was repeated. Given the difficulty in visually distinguishing between small DNA fragments and linear DNA molecules in this sample, we only counted and compared the number of circular molecules, to minimise human bias. By this metric, TopoStats successfully identified and traced 84% of all visible molecules (figure 7Bi): 111 of 132 complete molecules. Plotting the measured contour lengths measured from these traces as a histogram showed virtually all traced molecules were full DNA minicircles: the histogram has a well-defined peak at the position of the expected contour length (85 nm for a 210 bp molecule), whilst there are comparatively few traces with shorter contour lengths (Figure 7Bii).

Figure 7: TopoStats tracing of a mixed set of images. For each dataset an (i) example AFM image is shown DNA traces overlaid in cyan and (ii) a histogram of the contour lengths. (A) 339 bp minicircles. (B) 251 bp minicircles. Blue stars represent the predicted contour lengths for each sample. Scale bars: 100 nm, vertical colour scale (inset colour bar in A): 3 nm.
Given this apparent accuracy in contour length measurement for 339 and 251 bp minicircles, we further explored TopoStats tracing and contour length measurement using an expanded range of DNA minicircles samples: specifically, 116, 194, 251, 339, 357 and 398 bp. These DNA minicircles are ideal for testing TopoStats tracing accuracy as their tunable length (defined by the number of base pairs) gives a theoretical contour length (0.34 nm/bp), which can be compared to the measured contour length produced by TopoStats. The 116, 194, and 357 bp minicircles were prepared by annealing oligomers of ssDNA whilst the 251, 339 and 398 bp minicircles were prepared in bacteria by -integrase recombination (251, 339) and xer recombination (398). The 398 bp minicircles are natively negatively supercoiled, all other species are relaxed or nicked.
The DNA minicircles were prepared by collaborators (appendix 1) and immobilised on mica (as described in section 2.2.2), imaged with the AFM and the output raw AFM data was analysed with TopoStats. Examining images from each sample with overlaid traces showed that TopoStats was able to generate good traces for each construct, using default parameters. These traces followed the distinct geometries of each sample, arising from . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 5, 2021. ; https://doi.org/10.1101/2020.09.23.309609 doi: bioRxiv preprint their specific lengths and production methods. For example, the shorter DNA minicircles are highly constrained by their length, which is close to the DNA persistence length (50 nm) for 194 bp (66 nm theoretical length) minicircles, and below the persistence length for 116 bp (39 nm theoretical length) minicircles. These samples were visualised as tightly compact circular conformations (Figure 8Ai-ii). This contrasts with the longer DNA minicircles (339 bp and above), which are not restricted by the persistence length and can form more complex conformations with fluctuating local curvature, whose contours are followed by the TopoStats traces (Figure 8Aivvi).
We used these TopoStats traces to calculate contour lengths for each molecule and visualised the distributions from each construct as a KDE plot ( figure 8B). This distribution shows clear peaks for each species whose position increases in line with the increasing length of the DNA minicircles, and thus the theoretical contour length. We then used violin plots to better visualise the measured contour length distributions within each minicircle population ( Figure 8C). These plots showed broader contour length distributions for longer constructs (339, 357 and 398 bp samples) compared to the shorter minicircles (116, 194 and 251bp) with the 357 and 398 bp samples having particularly broad distributions. The 357 bp distribution appears bimodal, with the main peak centred at ~120 nm with a second population at ~100 nm ( figure 8A, B). We hypothesise that this minor peak is caused by an artefact in the tracing process, which produced shorter traces around highly kinked sites ( Figure 8D). At highly kinked sites, the skeletonization algorithm produces a branch-like linear trace emanating from the main body, resulting in a tennis racket shaped trace. This linear branch is removed by TopoStats after skeletonization (as described in section 2.1.5.5), as it is not representative of the underlying structure. A similar broader distribution of contour lengths arises from tracing errors in the 398 bp sample. Examining the traces revealed that these errors are caused by the complexity of the minicircle conformations: the longer 398 minicircles are natively negatively supercoiled, which can lead to more compact structures that writhe (fold over on themselves) [6]. These conformations are inherently more difficult to trace, as the path of the DNA polymer is much less clear, leading to some incorrect or incomplete traces ( Figure 8B), which causes a broadening of the contour length distribution. Reliably tracing these writhed (crossed) and more complex minicircle conformations should be feasible within our TopoStats framework but will likely require additional functions within the tracing modules that are specialised to deal with these complicated shapes. This is an area of current development.
. CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 5, 2021. ; https://doi.org/10.1101/2020.09.23.309609 doi: bioRxiv preprint With the trend established between contour length distribution and minicircle base pair length ( Figure 8A, B), we next calculated the "average" (peak) measured contour length for each sample. We used the maxima of the probability distribution for each species to calculate this "average" value, as shorter DNA fragments bias the mean and median value. The measured contour lengths are listed in table 1, alongside the expected contour length (calculated based on the length in bp) and number of identified molecules. For all minicircles, there was good agreement between the peak measured contour length and the theoretical contour length: the expected length was within the noise range of the measured average for each sample. Indeed, the peak measured contour length deviated . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 5, 2021. ; https://doi.org/10.1101/2020.09.23.309609 doi: bioRxiv preprint by a maximum of 6 nm from the expected value for all samples, excluding the 398 bp minicircles whose tracing was inhibited by their complex shape. Overall, this analysis demonstrates TopoStats' capability for fully automated image correction, molecule identification and tracing from AFM images, of varying quality. For each sample, a high proportion of all molecules were identified and successfully traced (>85% of all isolated single molecules). These traces were generally very accurate, as shown by the similarity between the peak measured contour length and expected contour length, defined by the length of the minicircles in base pairs. The exception to this was the natively negatively supercoiled 398 bp sample, whose more complex shape did prove challenging for TopoStats tracing.

TopoStats automated determination of conformational state
Having established that TopoStats accurately measured DNA minicircle contour lengths, we next showed that TopoStats could accurately identify distinct conformations (linear and circular) within a mixed population. To do this, we used TopoStats to determine the success of a DNA annealing reaction for 194 bp minicircle construct. AFM images of an annealed DNA minicircle sample were analysed with TopoStats to determine the proportion of successfully annealed (circular) DNA molecules compared to those that did not anneal (linear molecules).
Circular 194 bp DNA molecules were prepared by collaborators, immobilised on a mica surface and imaged with the AFM (Figure 9A). Using TopoStats, we identified and traced 127 DNA molecules from 19 AFM images. Of these, 41% of DNA molecules were successfully annealed (circular) whilst 59% remained linear. Manual inspection of these images revealed a further 6 DNA molecules that had not been identified by TopoStats, 4 circular and 2 linear molecules. To further explore the differences between the linear and circular molecules within the sample, we calculated the contour lengths for each circular and linear molecules and plotted their respective distributions independently as a violin plot ( Figure 9B). This showed a markedly broader contour length distribution for the linear molecules compared with the annealed circular molecules. This was reflected in the standard deviation around the mean contour lengths. Here, we used the mean contour length as we did not observe bimodal distribution for either population. The mean contour lengths and standard deviations were 55 ± 14 nm (N = 51) for linear molecules and 58 ± 6 nm (N = 76) for circular molecules. The average contour length for all traced molecules was 56 ± 12 nm. The distribution in the circular sample is narrower compared with the linear molecules as only correctly annealed and assembled molecules can form the closed circular conformations. In contrast, the . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 5, 2021. ; https://doi.org/10.1101/2020.09.23.309609 doi: bioRxiv preprint linear population includes all fragmented and incorrectly annealed molecules, or those degraded by some means. It is also possible that some of this broader distribution arises due to tracing errors, like those described above ( Figure 8D). Through this simple example, we show the accuracy of molecular conformation identification in TopoStats and its potential for more detailed analysis of the separated populations. We envisage this capability to be useful for more complex analysis, for example in exploring and visualising the activity of DNA nicking enzymes.

Assessing TopoStats tracing of other Biological Molecules
Having demonstrated TopoStats' effectiveness for identifying, tracing and reporting on the conformation of individual DNA molecules, we next explored its versatility, by tracing three distinct molecular assemblies. These were: the membrane attack complex (MAC), a hetero-oligomeric pore forming protein complex that forms circular pore assemblies in bacterial membranes. A DNA-origami biomimetic ring, NuPOD (NucleoPorins Organised on DNA), which was designed as a small synthetic mimic of the nuclear pore complex (NPC) as well as the NPC itself, a massive ring-like protein complex embedded in the nuclear membrane. These three assemblies encompass native purified protein assemblies (MAC), synthetic DNA assemblies (NuPOD) and native biological membranes extracted from cells (NPC embedded in nuclear envelope). We applied TopoStats to automatically identify individual MAC, NuPOD and NPC complexes from representative images, to assess its usefulness for these samples. For each sample, the only TopoStats parameter that needed to be optimised was the height threshold used to identify particles (section 2.1.4), as well as the size of the box used to crop individual molecules (Figure 10 A, B, C respectively).
As with DNA minicircles, TopoStats showed excellent identification rates for the NuPOD sample, in which 97% (858 of 879 identified) of all molecules were identified, and the NPC, in which 96% were identified (24 of 25). The identification rate was poorer for the MAC where just 68% of MAC pores (13 of 19) were identified. This could be attributed to the higher height thresholding required to facilitate successful tracing of the MAC pore, and the fact that these molecular assemblies are prone to clustering. As the MAC has a very small lumen, if the entire pore is selected using a lower height threshold, the ring appears as a circle without a lumen. These circular masks are skeletonized into a single point. These single point traces are identified and discarded by TopoStats. The measured contour lengths of the assemblies were: 60 ± 8 nm for the MAC, 158 ± 8 nm for the NuPODs, and 287 ± 21 nm for the NPC (N = 13, 858 and 24 respectively). We calculated theoretical contour lengths for each sample using known pore diameters from previous studies [21][22] [24], which were 63 nm (MAC), 170 nm (NuPOD) and 267 nm (NPC). In each case, the measured contour length from TopoStats showed good agreement to those from literature, demonstrating TopoStats is a versatile tool capable of producing accurate traces from a range of samples and substrates.

Conclusions
In this study, we have demonstrated the power of TopoStats, our software package for automated AFM image correction, molecule identification and tracing. Using simple examples, such as DNA minicircles at a range of lengths, we have shown that TopoStats can identify and trace isolated molecules, providing precise measures of contour length. We have also shown that TopoStats can distinguish distinct molecular conformations (circular and linear) within a mixed population. Finally, we have demonstrated that TopoStats can be applied to a range of biomolecular assemblies, including pore-forming proteins, DNA origami rings, and large protein complexes embedded in native cellular membrane, with minimal parameter changes between these different samples.
TopoStats can be used as a platform to allow processing and analysis of AFM images across a range of samples, and environments. We expect TopoStats to be a useful tool for accelerating and simplifying image processing for . CC-BY 4.0 International license available under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprint this version posted January 5, 2021. ; https://doi.org/10.1101/2020.09.23.309609 doi: bioRxiv preprint many working in biological AFM. As an open-source package, we hope it will be useful as a platform to facilitate the building of more complex image processing or identification routines. TopoStats has been integrated into the "AFM-SPM" community hub for AFM and SPM software development (https://github.com/AFM-SPM/home), and we actively encourage community discussion, participation, and development. We are actively developing TopoStats' capabilities, with a focus on integrating automated polymer statistics calculation (e.g., persistence length), and to expand tracing to more complex samples e.g., DNA-protein complexes.