Nanoscale Pattern Extraction from Relative Positions of Sparse 3D Localizations

Inferring the organization of fluorescently labeled nanosized structures from single molecule localization microscopy (SMLM) data, typically obscured by stochastic noise and background, remains challenging. To overcome this, we developed a method to extract high-resolution ordered features from SMLM data that requires only a low fraction of targets to be localized with high precision. First, experimentally measured localizations are analyzed to produce relative position distributions (RPDs). Next, model RPDs are constructed using hypotheses of how the molecule is organized. Finally, a statistical comparison is used to select the most likely model. This approach allows pattern recognition at sub-1% detection efficiencies for target molecules, in large and heterogeneous samples and in 2D and 3D data sets. As a proof-of-concept, we infer ultrastructure of Nup107 within the nuclear pore, DNA origami structures, and α-actinin-2 within the cardiomyocyte Z-disc and assess the quality of images of centrioles to improve the averaged single-particle reconstruction.

Single molecule localization microscopy (SMLM) approaches, which can achieve localization precisions below 20 nm laterally and 50 nm axially, at or near the molecular scale, 1 can reveal the organization of nanostructures such as supramolecular complexes and DNA assemblies. However, interpreting image reconstructions (localization maps) generated by SMLM is not trivial, as intrinsic noise arising from the stochastic switching of fluorophores can obscure underlying molecular order. This is most challenging when the fraction of target molecules localized with high precision is low, a common result of a low labeling density, low switching efficiency, or high background signal in 3D imaging. 2 Single-particle averaging and reconstruction techniques can enhance the signal-to-noise ratio and reveal underlying patterns of organization from SMLM data. 3−8 Similarly, 1D and 2D autocorrelation (e.g., Fourier-domain processing) of SMLM reconstructions can reveal periodicity in a biological structure. 9,10 However, these methods require the target molecule to be efficiently labeled and detected with a high signal-to-noise ratio, and the complex to be very highly ordered. To perform averaging, they also require either consistent orientation of the biological complex or classification and alignment of the segmented regions of interest (ROIs), which presents further challenges. 2 This restricts the applicability of existing methods to a small subset of nanostructures. Data pixelation in these techniques is also an extra processing step that degrades the precision of the molecular localization coordinates from SMLM.
To overcome these limitations, we developed a new approach for pattern analysis and recognition of order in SMLM data. It can assess any 2D or 3D SMLM data set for regular structures through the analysis of relative positions (RPs) of localizations. This technique (pattern extraction from relative positions of localizations, or PERPL) builds on previous work using pair correlation 11−13 by extending the analysis into 3D. Further, it compares experimental intermolecular distances against models of ordered and disordered macromolecular geometry and uses appropriate statistical methods to determine the most probable model. It can be applied to large, 3D data sets in their entirety, on a standard laptop, making it a valuable addition to the SMLM analysis toolkit.
Results and Discussion. PERPL calculates the relative positions (RPs) between single molecule localization coordinates (XYZ) in 2D or 3D, calculates their distribution (RPD), and interprets this distribution using model distributions (Figure 1). The experimental RPD is generated from the 3D relative positions (ΔX, ΔY, ΔZ) between each localization and all nearby localizations, within a chosen distance, for the entire FOV. The localization data does not need to be prealigned. The experimental RPD can be calculated for all simple combinations of these dimensions (Figure S1, Supporting Information). Peaks in the experimental RPD indicate the presence of underlying molecular organization and characteristic length scales across 1D, 2D, or 3D space (Figure 1C). The SMLM image reconstruction (Figure 1B) or lower-resolution features of the sample can be used to choose suitable dimensions to use in generating the RPD and subsequent analysis.
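As an illustration, the RPD calculation described above can be sketched in Python. This is a minimal sketch rather than the PERPL implementation: the function names `relative_positions` and `histogram` are hypothetical, the cutoff is assumed to apply to each axis separately, and the brute-force pair loop is chosen for clarity (a full field of view would need a spatial index).

```python
import math

def relative_positions(locs, max_dist):
    """Return (dx, dy, dz) for each pair of localizations whose separation
    is at most max_dist along every axis (assumed cutoff convention).

    locs: list of (x, y, z) tuples in nm. Each pair is counted once.
    """
    rps = []
    for i in range(len(locs)):
        xi, yi, zi = locs[i]
        for j in range(i + 1, len(locs)):
            dx = locs[j][0] - xi
            dy = locs[j][1] - yi
            dz = locs[j][2] - zi
            if abs(dx) <= max_dist and abs(dy) <= max_dist and abs(dz) <= max_dist:
                rps.append((dx, dy, dz))
    return rps

def histogram(values, bin_width, max_value):
    """Count values into equal-width bins covering [0, max_value)."""
    counts = [0] * int(math.ceil(max_value / bin_width))
    for v in values:
        if 0 <= v < max_value:
            counts[int(v // bin_width)] += 1
    return counts

# Toy data (hypothetical coordinates in nm), not taken from the paper.
locs = [(0.0, 0.0, 0.0), (30.0, 40.0, 0.0), (0.0, 0.0, 50.0), (500.0, 0.0, 0.0)]
rps = relative_positions(locs, max_dist=200.0)
delta_xy = [math.hypot(dx, dy) for dx, dy, dz in rps]  # in-plane distances
delta_z = [abs(dz) for dx, dy, dz in rps]              # axial distances
rpd_xy = histogram(delta_xy, bin_width=10.0, max_value=200.0)
```

No prealignment is needed because only differences between coordinates enter the distribution, so the same relative positions can be projected onto ΔXY, ΔZ, or ΔXYZ as required.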
Nano Letters pubs.acs.org/NanoLett Letter
To identify which patterns of molecular organization may be present, in silico models of candidate structures are constructed, and the RPD for these is calculated and then compared to the experimental RPD. Starting models can be generated using one or more of the following. The appearance of molecules in the reconstructed FOV can provide information about possible underlying symmetry (e.g., the rotational symmetry for Nup107 in nuclear pores). The shape of the experimental RPD and the positions of any peaks can indicate the presence of characteristic features, including possible repeated patterns and their length scale. Additional experimental or published information on structural features of a complex that contains the molecule of interest (e.g., electron microscopy data) can also suggest starting models. A parametric method of generating these candidate geometries in silico allows for model fitting in subsequent steps (Supporting Information). The model RPD (a set of discrete 3D (or 2D) RPs) is generated from a list of localization coordinates in the model structure, in the same way as for experimental localizations. To reflect experimental noise and biological variability, discrete model distances are broadened using the theoretical distance distribution between two Gaussian sets of localizations. 14 This includes broadening on a zero-distance term for RPs of repeated localizations of the same molecule or localizations of nearby unresolvable molecules. A background of disordered localizations may also be included in the model, and examples are provided (Supporting Information).
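The broadening of a discrete model distance can be illustrated for the isotropic 3D case. The sketch below uses the standard result for the distance between two points whose relative coordinates each carry Gaussian noise of standard deviation sigma (a noncentral chi distribution with three degrees of freedom). It is assumed here, not confirmed by the text, that this is the form of the distribution in ref 14 used for 3D distances; the 2D case involves a modified Bessel function and is not shown. The function name and parameters are hypothetical.

```python
import math

def broadened_distance_pdf_3d(r, mu, sigma):
    """Density of the measured 3D distance r (nm) between two localizations.

    mu:    true separation (nm); mu = 0 gives the zero-distance term.
    sigma: standard deviation (nm) of each relative coordinate, i.e. the
           combined spread of both localizations (assumed isotropic).
    """
    if r <= 0.0:
        return 0.0
    norm = sigma * math.sqrt(2.0 * math.pi)
    if mu == 0.0:
        # Limit mu -> 0: Maxwell-type distribution for repeated
        # localizations of the same (or unresolvable) molecules.
        return 2.0 * r * r / (sigma ** 2 * norm) * math.exp(-r * r / (2.0 * sigma ** 2))
    # Difference of two Gaussians; equivalent to the sinh form of the
    # noncentral chi density but numerically stable for large r * mu.
    near = math.exp(-((r - mu) ** 2) / (2.0 * sigma ** 2))
    far = math.exp(-((r + mu) ** 2) / (2.0 * sigma ** 2))
    return r / (mu * norm) * (near - far)

# Hypothetical example: a 50 nm model distance broadened by sigma = 5 nm.
curve = [broadened_distance_pdf_3d(0.5 * i, 50.0, 5.0) for i in range(200)]
```

A model RPD would then be a weighted sum of such terms, one per characteristic distance, plus the zero-distance and background contributions described above.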
Akaike's Information Criterion (AIC) 15,16 is used to determine which hypothesis (in silico model) best describes the real structure, or disorder, underlying the data (experimental RPD). AIC is a quantitative measure of information loss when approximating real data with a model (Supporting Information). The difference in AIC values between models is related to the relative likelihood (Akaike weight) that each model captures the reality underlying the experimental data. 16,17 We use the corrected AIC (AICc), 18 which improves on the accuracy of AIC for more complex models evaluated from fewer data points. Using PERPL, we obtain fits of model against experimental RPDs, relative likelihoods for the selection of the most likely model, schematic plots of the structural models, and fitted model parameter data (Figure S2, Supporting Information).
Note that the AICc identifies which candidate structural model best explains the data, but this model may still be inaccurate if all of the candidate models are poor. However, it disfavors overfitting by more complex models, and where necessary, the structural models can also be compared by AICc against simple models of disordered molecular positions. This yields a level of confidence that the best structural model is more likely than a random arrangement, or vice versa (Supporting Information).
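The model selection step can be illustrated with the textbook definitions of AICc and Akaike weights. This is a minimal sketch: the function names are hypothetical, and how PERPL obtains the likelihood from a fitted RPD is described in the Supporting Information and is not reproduced here.

```python
import math

def aicc(log_likelihood, k, n):
    """Corrected AIC for a model with k fitted parameters and n data points."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    aic = 2.0 * k - 2.0 * log_likelihood
    return aic + 2.0 * k * (k + 1) / (n - k - 1)

def akaike_weights(scores):
    """Convert AIC or AICc scores into relative likelihoods that sum to 1."""
    best = min(scores)
    rel = [math.exp(-0.5 * (s - best)) for s in scores]
    total = sum(rel)
    return [x / total for x in rel]

# Hypothetical scores for three candidate models (not values from the paper).
weights = akaike_weights([1000.0, 1002.8, 1015.0])
ratio_best_to_next = weights[0] / weights[1]  # equals exp(2.8 / 2)
```

Because the ratio of two weights depends only on the difference in scores, a statement such as "model A is about 4× more likely than model B" corresponds to a score difference of about 2 ln 4 ≈ 2.8.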
To demonstrate the method, we used PERPL to reveal the underlying nanostructure of Nup107 (labeled at the C-terminus 21) in nuclear pores, from an experimental 3D SMLM data set (Figure 2, Supporting Information). The reconstructed image in XY shows multiple ring-like structures oriented with their axis of symmetry nearly aligned with the Z-direction (Figure 2A), together with structures separated in Z (Figure 2H). Therefore, we investigated the distance distributions in XY (ΔXY) (Figure 2B) and in Z (ΔZ) (Figure 2H). The analysis used a maximum pairwise distance of 200 nm (in X, Y, and Z), a distance slightly larger than the ring-like structures. The entire 16 × 17 × 0.9 μm 3D FOV, containing 36k localizations, was analyzed in a few minutes on a standard laptop. Both XY and Z distributions contain multiple peaks (Figure 2B,I) that imply underlying sets of characteristic separations.
[Figure 2 caption, partial: … 19 EM maps of the nuclear pore shown in (G) and (N) generated from PDB 5A9Q using UCSF Chimera. 20]
To determine the underlying organization, PERPL was used to construct candidate model structures with rotational symmetry in ΔXY (Figure 2C), based on the Nup107 image reconstruction (Figure 2A). Models were parametrized for diameter, degree of symmetry (5−11-fold, Figure S2), and localization precision. A substructure term (a, circled in Figure 2C, and related peak in Figure 2D) was included to correct for molecules too close to resolve or a spread of localizations resulting from overlapping images of molecules. 22 A background term was included to account for localizations predominantly outside of a single ring-like structure.
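The characteristic distances of a rotationally symmetric candidate model follow directly from its geometry: for n vertices equally spaced on a ring of diameter D, the unique intervertex distances are D sin(πk/n) for k = 1 to ⌊n/2⌋. The sketch below illustrates this parametrization; the function names are hypothetical, and the substructure, background, and broadening terms of the full model are omitted.

```python
import math

def ring_vertices(diameter, n_fold):
    """XY coordinates of n_fold vertices equally spaced on a ring."""
    radius = diameter / 2.0
    return [
        (radius * math.cos(2.0 * math.pi * k / n_fold),
         radius * math.sin(2.0 * math.pi * k / n_fold))
        for k in range(n_fold)
    ]

def intervertex_distances(diameter, n_fold):
    """Unique chord lengths D * sin(pi * k / n) for k = 1 .. floor(n / 2)."""
    return [
        diameter * math.sin(math.pi * k / n_fold)
        for k in range(1, n_fold // 2 + 1)
    ]

# Candidate models spanning 5- to 11-fold symmetry for one trial diameter (nm).
candidates = {n: intervertex_distances(95.4, n) for n in range(5, 12)}
```

Each candidate symmetry predicts a different set of peak positions in the ΔXY RPD, which is what allows the statistical comparison to discriminate between them.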
A statistical comparison of model and experimental RPDs (Figure 2E, Table 1) showed the model with 8-fold symmetry was ∼4× more likely than the next most likely model (9-fold symmetry). The 8-fold symmetric model, with a diameter of 95.4(1) nm (1 s.d. on values of the fitted parameters given as variation in the last significant digit; Figure 2F, Table S1), agrees well with particle average EM and SMLM data (Figure 2G). 8,23 The XZ reconstruction (Figure 2H) together with the two lobes in the ΔZ RPD (Figure 2I) suggested a two-layer structure, where each layer has a thickness in the Z-direction (Figure 2J). A model of this kind (Figure 2K), which included a Gaussian spread within each layer, fit the experimental RPD well (Figure 2L) and indicated that Nup107 (C-terminus) would be found in layers separated by 58.2(1) nm (Figure 2M, Table S2). Again, this model and the inferred structure agree well with EM data (Figure 2N).
Returning to our analysis of ΔXY RPDs, we next restricted it to pairs of localizations within a single layer, by limiting ΔZ to less than 20 nm. Using these within-layer RPs, the relative likelihood of the 8-fold model increased to >10^10× greater than the next most likely (9-fold, Tables 1 and S3). Thus, iterative model fitting and refinement can improve the interpretation of localization data sets. Restricting ΔZ to less than 20 nm likely removes uncertainty arising from the known angular offset between the two layers. 19
We next tested the ability of PERPL to reveal the underlying DNA origami nanostructure from a 3D DNA-PAINT 25 SMLM data set (Figure 3A−D, Supporting Information), for which we had no prior knowledge. The image reconstruction (Figure 3A) revealed geometric structures in different orientations on an approximately square lattice. Therefore, we constructed in silico models of simple geometric nanostructures and included features reflecting the presence of localizations at nearby grid points. Models included a triangular prism on a square lattice (all sides equal), a triangular prism on a square lattice (unequal sides), a cuboid on a square lattice, and a tetrahedron on a square lattice. Since the geometric structures were not all coaligned on any axis, we compared the Euclidean distances in 3D (ΔXYZ) between the experimental and in silico model localizations, using a maximum pairwise distance of 250 nm (in X, Y, and Z), just larger than the repeating feature size.
Comparing the model RPD with the experimental RPD (Figure 3B, Table S4, Figures S3 and S4) suggests that the most probable model to explain the data is a triangular prism structure with sides of equal length on a square grid (Figure 3B−D, Figure S4). The providers of the experimental data set confirmed that we had found the correct solution, constructed similarly to a previously published DNA origami design. 26 The estimated side length of 105.5(4) nm (Figure 3C, Table S5) was slightly larger than the design length of 100 nm (Figure 3D). The discrepancy in side length could arise from several factors. First, we used an isotropic localization precision in the model, because it allowed us to conveniently average over all the orientations. However, localization precision is likely worse in Z than in XY. Second, we expect that the proximity of adjacent DNA origami structures is likely to result in an extra distribution of characteristic distances within the 250 nm cutoff for XYZ pairwise distances. Accounting for both of these would require a more complex model and would be possible with further development of our approach.
We then used PERPL on a more challenging 3D nanostructure within a biological sample, the Z-disc of cardiomyocytes. This contains a tetragonal lattice arrangement of actin filaments and ACTN2 (α-actinin-2) with characteristic distances under 20 nm. 27,28 ACTN2 in the Z-disc was labeled with an Affimer, 29,30 which binds to its second calponin homology (CH) domain as demonstrated by X-ray crystallography (Figure S5, Table S6). The resulting 3D dSTORM data set contained 1.1 × 10^5 localizations, equivalent to detection of ∼1.5% of ACTN2 molecules in the Z-discs or ∼2.9% of Z-disc lattice points (Supporting Information). The low fraction of localizations per target molecule results from both the limited labeling density and high background in such a thick, dense structure, and as a result the 3D reconstruction did not show an obvious underlying pattern (Figure 4A).
Using the cell axis as a reference direction (X) (Figure 4A), distributions in ΔX (cell-axial, Figure 4B) and ΔYZ (cell-transverse, Figure 4C) were analyzed (Figure 4B,D). For ΔX, we used only RPs with ΔYZ < 10 nm, and conversely for analysis of ΔYZ, based on known dimensions of the Z-disc lattice. The low signal-to-noise ratio of the experimental ΔX RPD made it challenging to find characteristic distances. We investigated a kernel density estimate of the ΔX values (Figure S6), using the broadening function for distances in an in silico model RPD. 14 This smoothed RPD showed regular variations approximately 20 nm apart and suggested that the structure may contain a repeating distance.
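The filtering and smoothing steps can be sketched as follows. Note the substitution: the text uses the model broadening function as the kernel, whereas this sketch swaps in a plain Gaussian kernel for simplicity, so it illustrates the idea rather than the exact estimator. Function names, the bandwidth, and the toy data are hypothetical.

```python
import math

def axial_separations(rps, max_transverse):
    """Keep |dx| for relative positions whose transverse (YZ) separation
    is below max_transverse. rps: list of (dx, dy, dz) in nm."""
    return [abs(dx) for dx, dy, dz in rps if math.hypot(dy, dz) < max_transverse]

def gaussian_kde(values, grid, bandwidth):
    """Gaussian kernel density estimate of values, evaluated on grid."""
    if not values:
        return [0.0] * len(grid)
    norm = len(values) * bandwidth * math.sqrt(2.0 * math.pi)
    return [
        sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in values) / norm
        for g in grid
    ]

# Toy relative positions (nm); the third is rejected by the transverse filter.
rps = [(20.0, 3.0, 4.0), (-18.0, 0.0, 0.0), (19.0, 30.0, 0.0)]
delta_x = axial_separations(rps, max_transverse=10.0)
grid = [0.5 * i for i in range(81)]  # 0 to 40 nm in 0.5 nm steps
density = gaussian_kde(delta_x, grid, bandwidth=2.0)
```

Unlike a histogram, the kernel estimate does not depend on bin placement, which helps when looking for weak periodic variations in a noisy RPD.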
To test this, we constructed models for localizations found at multiples of a unit distance along the cell-axis (Figure S6), including a background term generated from a random uniform distribution of localizations across the Z-disc. As controls, we included a model containing only the background uniform distribution and one containing repeated localizations of the same molecule. In the best model, the ACTN2-CH2 domain occurs every 18.5(1.0) nm along the cell-axis, with 98% confidence that this repeating pattern is a better model than a random distribution for the true molecular arrangement (Figure 4B, Tables S7 and S8). This compares well with the 19.2 nm periodicity of ACTN2 binding sites obtained from EM (Figure 4E). 28 Our analysis also demonstrated that the Z-disc is most likely to contain five or six layers of ACTN2, similar to previous EM results, 27 although the differences between these models were not great enough to robustly select one model over the others (Table S7). This may be due to natural variability of the Z-disc but also the quality of the data; at greater distances across the finite (∼100 nm) Z-disc, the number of RPs obtainable, and therefore the signal-to-noise ratio, is reduced.
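A repeating-distance model of this kind can be sketched as a sum of broadened peaks at multiples of a unit distance plus a background. Several choices below are assumptions made for illustration and are not taken from the paper: the peaks are Gaussian, the peak at k unit distances is weighted by the number of layer pairs (n_layers − k), and the background is the linearly decreasing separation distribution expected for points placed uniformly across a slab of finite width. The function name and all parameter values are hypothetical.

```python
import math

def repeat_model_rpd(dx, unit, n_layers, sigma, peak_amp, bg_amp, width):
    """Unnormalized model density at axial separation dx (nm).

    unit:     repeat distance between layers (nm)
    n_layers: number of layers across the structure
    sigma:    Gaussian broadening of each peak (nm)
    peak_amp: overall amplitude of the ordered contribution
    bg_amp:   amplitude of the disordered background at dx = 0
    width:    slab width over which background points are uniform (nm)
    """
    peaks = 0.0
    for k in range(1, n_layers):
        pair_count = n_layers - k  # layer pairs separated by k repeats
        peaks += pair_count * math.exp(-0.5 * ((dx - k * unit) / sigma) ** 2)
    # Separation distribution for uniform points in [0, width] falls
    # linearly to zero at dx = width.
    background = max(0.0, 1.0 - dx / width)
    return peak_amp * peaks + bg_amp * background

# Evaluate a five-layer model on a grid of separations (hypothetical values).
model = [repeat_model_rpd(0.5 * i, 18.5, 5, 3.0, 1.0, 2.0, 100.0) for i in range(200)]
```

Setting `peak_amp` to zero recovers the background-only control, so the ordered and disordered hypotheses can be fitted with the same function and compared by AICc.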
Data in the YZ plane (Figure 4C,D) was more challenging to analyze, because localization precision is worse in Z than in X and Y, and the lattice contains multiple local discontinuities (Figure 4E). 27 However, using a standardized kernel density estimate of the RPD (Figure S7), the distributions expected for broadened characteristic YZ distances fit the first two peaks well (Figure 4D, Figure S7), and we inferred distances of 24.08(5) nm and 11.17(1) nm (Table S9). The expected distance in YZ between parallel actin filaments is 24 nm, 31 and we interpret 11 nm as the distance between a pair of ACTN2-CH2-Affimer labels, either side of an actin filament (Figure 4E).
Finally, we tested the ability of our software to define the relative quality of particles to be selected for other analysis methods (Figure 5), using SMLM data for the centriolar protein Cep152. The particles (centrioles) had already been segmented from an SMLM image (Figure 5A,B; Supporting Information) and classified as "top view" according to their orientation. 7 We calculated the ΔXY distribution per segmented particle, including all pairwise distances, and fitted the distribution of ΔXY in an in silico, 9-fold rotationally symmetric model to this. This model (Figure 5C,D) did not require a background term to fit RPs within the segmented ROIs, and the amplitudes of the contributions of the intervertex distances were allowed to vary independently, since some vertices were obscured or missing in the single particles.
Each experimental top-view particle was scored for uncertainty in fitted amplitudes of the contributions of intervertex distances. Where the uncertainty in two of the four intervertex distance contributions was greater than 0.1× their amplitude, the particle was discarded. Averaging the remaining particles showed distinct clusters at each vertex and a rounder structure for the complex (Figure 5E,F).
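The selection rule can be written as a short filter. This sketch assumes that "two of the four" means two or more contributions exceeding the relative-uncertainty threshold; the function name and the example values are hypothetical.

```python
def keep_particle(amplitudes, uncertainties, rel_tol=0.1, max_bad=2):
    """Return True if fewer than max_bad fitted amplitudes have an
    uncertainty greater than rel_tol times the amplitude."""
    bad = sum(
        1 for a, u in zip(amplitudes, uncertainties) if u > rel_tol * abs(a)
    )
    return bad < max_bad

# Hypothetical fit results for two particles: four intervertex amplitudes each.
particles = {
    "p1": ([1.0, 0.8, 0.9, 1.1], [0.05, 0.04, 0.20, 0.06]),
    "p2": ([1.0, 0.8, 0.9, 1.1], [0.30, 0.04, 0.20, 0.06]),
}
selected = [name for name, (amp, unc) in particles.items() if keep_particle(amp, unc)]
```

Only the particles that pass this filter would then be passed on to the averaging step.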
Conclusions. Here we have shown that PERPL is a useful tool for understanding the underlying organization of nanostructures in cells and in vitro, in multiple types of data sets, using 2D and 3D RPs. It does not require a high labeling density or a high detection rate, and data can be combined from multiple images to provide sufficient data to generate the RPD. Its ability to analyze the arrangement of sparsely localized molecules makes it distinct from and entirely complementary to particle averaging techniques. PERPL can be used to analyze organization of any data where localization coordinates are obtained after image acquisition and processing techniques and is not restricted to SMLM data. It may also be used to determine multiple characteristic distances between localizations of two or more targets and can be developed further to infer their spatial relationship within a macromolecular complex. We further anticipate its use in analyzing noise distributions in SMLM data sets, to aid quantitative analysis of experimental localizations.

Notes
The authors declare no competing financial interest.

■ ACKNOWLEDGMENTS
Isolated cardiomyocytes were a kind gift from the Steele Group, University of Leeds. We thank Ulf Matti and Philipp Hoess for sample preparation and imaging of the Nup107 cells and Niccolò Banterle for the centriole sample preparation. We would also like to acknowledge Michael W. Davidson for his contributions to the development of the constructs used in early stages of this work.