Chop-n-drop: in silico assessment of a novel single-molecule protein fingerprinting method employing fragmentation and nanopore detection

The identification of proteins at the single-molecule level would open exciting new avenues in biological research and disease diagnostics. Previously we proposed a nanopore-based method for protein identification called chop-n-drop fingerprinting, in which the fragmentation pattern induced and measured by a proteasome-nanopore construct is used to identify single proteins. However, whether such fragmentation patterns are sufficiently characteristic of proteins to identify them in complex samples remained unclear. In the simulation study presented here, we show that 97.9% of human proteome constituents are uniquely identified under close-to-ideal measurement conditions, using a simple alignment-based classification method. We show that our method is robust against experimental error, as 78.8% can still be identified if the resolution is twice as low as currently attainable and 10% of proteasome restriction sites and protein fragments are randomly ignored. Based on these results and our experimental proof-of-concept, we argue that chop-n-drop fingerprinting has the potential to make cost-effective single-molecule protein identification feasible in the near future.


Introduction
Over the past decades, mass spectrometry (MS) has allowed for ground-breaking discoveries in proteomics, enabling such impressive feats as the definition of a human protein atlas [1] and large-scale screening for protein disease biomarkers [2]. However, not all protein-related research questions may be addressed by MS.
Examples are found in the nascent field of single-cell proteomics which, following the example of single-cell transcriptomics, is expected to give unprecedented insight into cell functioning and pathology [3]. While MS has already made strides in this field by enabling the detection of proteins present at thousands of copies per cell [4], some important and clinically relevant proteins such as signaling molecules and transcription factors are expected to be present in the range of dozens of copies [5]. The development of novel single-molecule protein identification methods is therefore necessary to unlock the true potential of single-cell proteomics.
In the search for single-molecule alternatives to MS, two main avenues are currently being explored. On the one hand, conceptual methods utilizing the read-out of fluorescent dyes attached to a subset of residue types have shown promising results [6,7,8]. However, methods using fluorescence-based readout strategies require efficient and specific labeling of residues. Optimizing labeling strategies is non-trivial [e.g. 9, 10] and less-than-perfect labeling may decrease accuracy; a label-free method would therefore be preferred. On the other hand, electrical readout of protein properties, either folded or unfolded, may be generated by feeding the protein through a nanopore, over which an electrical potential is applied [11,12]. Similar to nanopore sequencing of DNA, changes in the pore's electrical resistance while a protein is passing may give information on its properties.
In prior work, we showed that engineered complexes of FraC nanopores and proteasomes can be readily assembled without loss of proteasome activity or electrical conductance of the pore [13]. Furthermore, we have shown that a linear relation exists between residual current through FraC pores and the molecular weight of passing protein fragments [14]. We thus proposed that proteasome-nanopore constructs can be used to identify proteins, in a conceptual method dubbed chop-n-drop fingerprinting [13]. An unknown protein can be processed terminal-to-terminal by the construct, cleaving it at proteasome target sites, after which the molecular weight of sequentially released fragments can be estimated based on the residual electrical current as they pass through the nanopore. The sequence of measured fragment weights can then serve as a characteristic signature, a fingerprint, of the protein. Once proven, this fingerprinting method can easily be implemented in a highly parallel fashion by adapting existing hardware that was developed for nucleic acid sequencing.
Compared to both MS and existing fluorescence-based measurement equipment, this hardware is inexpensive and has a small benchtop footprint, thus opening up opportunities for field diagnosis and in-house analysis for even small laboratories. However, it is as yet unclear whether chop-n-drop fingerprints are sufficiently characteristic to identify a single protein in highly complex mixtures.
Here we present a computational analysis of the chop-n-drop method, in which we show that simulated fingerprints of all proteins in the UniProt human proteome can be accurately classified using a simple alignment-based method. Considering these and previously published experimental results, we argue that chop-n-drop fingerprinting is a promising concept for cost-effective single-molecule protein identification.
Figure 1: Schematic overview of the chop-n-drop fingerprinting method. (A) A protein is fragmented by a proteasome directly introduced above a nanopore. The protease is engineered to lyse proteins at particular residues. (B) As the fragments pass the pore, a change in electrical current through the pore is measured. (C) The molecular weights of the fragments are estimated from the magnitudes of the current changes. (D) Finally, the produced sequence of fragment weights is aligned to database fingerprints of known proteins, to identify the protein.

Simulation and classification method
To estimate the performance of the chop-n-drop fingerprinting method on a highly complex protein identification task, we developed a simulation pipeline mimicking the experimental procedure, including several sources of biological and technical noise that we expect to encounter.
In essence, the chop-n-drop fingerprint of a protein only consists of a sequence of weights, which are deduced from pore current blockades caused by sequentially cleaved-off fragments passing through the nanopore. The simulation of this process follows a straightforward two-step process. First, akin to the proteasome cleaving a protein into fragments, we divide a given protein sequence into sub-sequences by splitting it at the proteasome's target sites. Although engineering proteasome specificity in our system is still a work in progress, we assume here that we can force it to exhibit only trypsin-like behavior; thus we split sequences after arginines and lysines, unless followed by a proline. To account for the fact that the proteasome will likely fail to cleave at a fraction of target sites, we only cleave each target site with a certain probability, which we refer to as the proteasome efficiency (e_p).
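The cleavage step described above can be sketched in a few lines of Python. This is a minimal illustration and not the code from the published repository: the function name `digest`, its arguments and the use of Python's `random` module are assumptions made for this example. Each lysine or arginine that is not followed by a proline is treated as a target site and is cleaved with probability e_p.

```python
import random


def digest(sequence, efficiency=0.99, rng=None):
    """Split a protein sequence after K/R (unless followed by P).

    Each target site is cleaved with probability `efficiency` (e_p);
    a missed site leaves the two neighbouring fragments fused.
    The function name and signature are hypothetical.
    """
    if not sequence:
        return []
    rng = rng or random.Random()
    fragments = []
    start = 0
    # The last residue is skipped: there is nothing to cleave off after it.
    for i in range(len(sequence) - 1):
        is_target_site = sequence[i] in "KR" and sequence[i + 1] != "P"
        if is_target_site and rng.random() < efficiency:
            fragments.append(sequence[start:i + 1])
            start = i + 1
    fragments.append(sequence[start:])
    return fragments


if __name__ == "__main__":
    # With perfect efficiency, every target site is cleaved.
    print(digest("AAKGGRPCCRDD", efficiency=1.0))
```

With `efficiency=1.0` the example sequence is cut after the first lysine and the last arginine, while the arginine followed by proline is left intact.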
Subsequently we mimic the passing of fragments through a heptameric FraC pore, the readout of the current blockade and the estimation of the fragment weight, by simply translating the sub-sequences into corresponding fragment weights. Although weights can be determined with high precision from sequences, the measurements in experiments may be less accurate and marked by a given resolution (r), the smallest detectable weight difference. In experimental setups, this parameter is dependent on pore and measuring equipment properties. To account for this in simulations, Gaussian noise is added to fragment weights, where the standard deviation of the noise is related to r (see Methods). Fragments weighing less than 500 Da are removed, as they typically escape detection of heptameric FraC nanopores [14]. Furthermore, as we previously found that the relation between weights above 2 kDa and current blockades is non-linear [14], all fragment weights larger than this value are reduced to 2 kDa. Lastly, although we expect the seal between proteasome and pore to be extremely tight based on molecular dynamics simulations [13], fragments may fail to enter the pore after cleavage. We account for this by only retaining each fragment with a certain probability, which we refer to as the capture rate (C). Although C is likely dependent on the size and charge of individual fragments, the relationship between these factors is unclear, thus we assume C to be constant. The resulting sequence of fragment weights returned by this process constitutes the fingerprint for a protein.
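The second step can be sketched as follows. This is an illustrative sketch under several assumptions that the text does not specify: average residue masses are used (standard values, plus one water per fragment), the 500 Da detection limit is applied to the noise-free weight, the 2 kDa cap is applied after noise is added, and the noise standard deviation `sigma` is passed in directly rather than derived from r. The names `fragment_weight` and `fingerprint` are hypothetical.

```python
import random

# Average residue masses in Da (assumption: average, not monoisotopic).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594,
    "N": 114.1038, "D": 115.0886, "Q": 128.1307, "K": 128.1741,
    "E": 129.1155, "M": 131.1926, "H": 137.1411, "F": 147.1766,
    "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.015  # one water per free peptide


def fragment_weight(fragment):
    """Weight of a peptide fragment in Da, computed from its sequence."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in fragment) + WATER_MASS


def fingerprint(fragments, sigma, capture_rate=0.99,
                lower=500.0, upper=2000.0, rng=None):
    """Turn fragment sequences into a noisy list of fragment weights."""
    rng = rng or random.Random()
    weights = []
    for fragment in fragments:
        weight = fragment_weight(fragment)
        if weight < lower:
            continue  # too small to be detected by the pore
        if rng.random() >= capture_rate:
            continue  # fragment failed to enter the pore
        noisy = rng.gauss(weight, sigma)  # measurement noise
        weights.append(min(noisy, upper))  # cap at the linear-range limit
    return weights


if __name__ == "__main__":
    print(fingerprint(["GGGGG", "WWWWW", "W" * 20], sigma=0.0, capture_rate=1.0))
```

In the noise-free example, the pentaglycine fragment falls below the detection limit and is dropped, while the 20-residue fragment is reduced to 2 kDa.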
We used fingerprints generated using our pipeline to develop a classification method, which assigns a protein identity to a given fingerprint. We follow an alignment-based approach, where a query fingerprint is aligned to a database of previously generated fingerprints, using a custom dynamic programming implementation (Supplementary figure S1, see Methods). The database fingerprint that is most similar to the query fingerprint is assumed to have come from the same protein. We ran our simulation pipeline and classification method on all sequences in the UniProt human proteome (n = 20,395). Under close to ideal simulated noise parameters (e_p = 0.99, r = 5.0 Da, C = 0.99) we find that our alignment-based approach retrieves the correct identity for 97.9% of fingerprints (Figure 2). Inspection of the resulting alignments shows that our algorithm correctly handles missing and fused fragments (Supplementary figure S2A). The majority of misclassifications occur for shorter proteins, under 250 residues in length.

Simulations under optimal conditions
Of misclassified fingerprints, 48% show more than 80% amino acid sequence identity to the protein as which they were wrongly identified, indicating that the resolution of 5 Da assumed here is insufficient to consistently separate such similar entities (Supplementary figure S3). Upon inspection of these cases, we find that many misclassifications were in fact mix-ups between paralogous sequences.
The remaining misclassifications are caused by chance alignments with different fingerprints (Supplementary figure S2B). This is expected to occur more often if a protein is shorter, as it will generally produce a fingerprint of fewer elements, which is less likely to yield a unique pattern.

Simulations under suboptimal conditions
We subsequently probed how resistant chop-n-drop fingerprinting is to higher levels of experimental noise, by varying one noise parameter at a time while keeping all others near their optimal values (e_p = 0.99, r = 5.0 Da, C = 0.99).
In each case we find that accuracy deteriorates gracefully with parameter value (Figure 3A). Interestingly, we still attain an accuracy of 92.6% at a resolution of 50 Da, which is worse than the 40 Da resolution we reported previously [14] and more than tenfold worse than our current-best resolution of 4 Da (GM, unpublished results). Similarly, we find that a lower proteasome efficiency or capture rate of 90% still results in 93.6% and 90.7% accuracy on average, respectively. Finally, we repeated a simulation on the entire dataset with all noise parameters at sub-optimal values (e_p = 0.90, r = 10.0 Da, C = 0.90). Even under these circumstances, we find that 78.8% of proteins are correctly classified (Figure 3B). Here too, it should be noted that most incorrectly classified proteins were of lower sequence length.
Figure 3: (A) Fingerprint classification accuracy over a range of sub-optimal noise parameter values; resolution (left), capture rate (middle) and proteasome efficiency (right). For each case the unvaried noise parameters are set to near-optimal values (capture rate C = 0.99, resolution r = 5.0 Da and proteasome efficiency e_p = 0.99). Five replicates were generated for each parameter combination. (B) Cumulative histogram of correct and incorrect classifications of simulated chop-n-drop protein fingerprints for all human proteome constituents, assuming more realistic noise parameters; r = 10 Da, C = 0.90 and e_p = 0.90. Numbers are shown distributed over sequence length (bars), and relative to the total number of proteins (pie chart).

Discussion
Single-molecule (SM) protein fingerprinting holds great promise to revolutionize biological research and diagnostics [15]. We have previously proposed that this may be accomplished using a novel proteasome-nanopore construct, which cleaves a target protein into fragments and subsequently reads out the fragment weights [13]. Here we present simulation results indicating that the produced sequence of fragment weights contains sufficient information to identify a protein.
In the presented simulations, we included sources of noise that may hamper fingerprint measurements in practice. We assumed that the proteasome may not cleave each target site, that weight measurements may be inaccurate up to a given weight resolution and that not all cleaved-off fragments may be caught in the nanopore. Assuming higher noise settings for each of these noise sources (a fragment capture rate and proteasome efficiency of 90%, with a measurement resolution of 10 Da), we find that overall accuracy remains sufficiently high at 78.8%. As accuracy increases with protein length, we find that chop-n-drop fingerprinting should be particularly suitable to identify larger proteins.
Our simulation builds on the assumption that fragment weight is correlated to the residual current measured while the fragment passes the nanopore. Indeed, we have previously shown that this is the case for selected fragments weighing between 500 and 2000 Da [14]. However, it should be noted that rather than the peptide's weight, its volume, charge and shape influence residual current.
Once the experimental methodology has been further developed and protein fingerprints can be measured more routinely, we can define the relation between these properties and the residual current in more detail to predict fingerprints in a more robust manner.
The existence of different proteoforms, which was not accounted for in this simulation, presents both an opportunity and a challenge to chop-n-drop fingerprinting. Through alternative splicing and post-translational modification (PTM), multiple proteoforms with different functions may be generated from the same gene [16]. Depending on the spliceoform or the PTM types present, different proteoforms may generate distinct fingerprints. This allows their individual identification at SM resolution, which is an important potential application of SM analysis, but also adds tens of thousands of potential fingerprint patterns, which further complicates the task of fingerprint classification. A solution may be to fractionate samples prior to chop-n-drop analysis, after which each fraction may be analysed using a dedicated classifier which only considers the proteoforms that could be present in a given fraction.
Over the past years, the obstacles on the road toward SM protein fingerprinting have been attacked vigorously from multiple angles, with several groups showing promising initial results and proofs-of-concept. While each proposed method has shown particular strengths, we argue that chop-n-drop combines several properties not found together in other methods. First, unlike fluorescence-based methods [6,7,8], it does not require the implementation of any labeling chemistries, as properties of the target protein are read out directly, thus evading issues with erroneous labeling and simplifying sample preparation.
As a trade-off, fluorescence-based methods are more sensitive to differences between proteoforms as long as the difference involves the position or presence of a targeted residue type. As we show here that even at high resolution our method misclassifies proteins with high sequence similarity to other entries, it is likely that differences between highly similar proteoforms may also remain unnoticed.
Different methods based on the readout of folded proteins by electrical current blockage of a nanopore have been proposed as well [12,17,18]. However, these were unable to analyse a wide range of protein sizes: as the pore lumen needs to be of an appropriate volume for the analysis of a given protein size, a single nanopore is not able to detect minute differences in both small and large proteins. Here this problem is mitigated by the fragmentation step.
Most importantly, however, the hardware required to implement chop-n-drop fingerprinting in a highly parallelized setting can be readily borrowed from commercial platforms for DNA sequencing using nanopores, which are inexpensive and have already been miniaturized to a handheld format. As such, we envision that our method could soon fill a niche that no other method currently can: that of small-scale, in-house single-molecule protein identification.
In conclusion, we provide evidence that chop-n-drop fingerprints can provide sufficient information to identify proteins in complex samples, and present a suitable alignment-based classification method. Upon optimization of the fingerprinting procedure, we envision that our method may see practical implementation in the near future.

Methods
Code for in silico fingerprint generation and classification was written in Python 3.8 (Python Software Foundation, www.python.org), and is freely available at https://github.com/cvdelannoy/chop_n_drop_simulation.

In silico fingerprint generation
We generate in silico chop-n-drop fingerprints by splitting protein sequences at protease target sites and calculating the weights of the resulting fragments from their sequences. We assume that fragments of a weight lower than 500 Da are undetectable, thus these fragments are removed from fingerprints. Fragments of a weight larger than 2 kDa are set to 2 kDa, as prior investigations showed that the relationship between weight and current blockage is non-linear above this weight [14].
Three parameters are set to represent different noise sources: capture rate C, proteasome efficiency e_p and resolution r. The capture rate denotes the fraction of fragments that enters the pore after lysis and is measured. In our simulations each fragment is retained with a probability of C. The proteasome efficiency denotes the fraction of target sites at which the proteasome cleaves. In simulations, each target site has a probability of e_p of being cleaved. Note that a failure to cleave will result in two fragments being fused together, after which they remain represented in the fingerprint as the sum of their weights. Finally, the resolution denotes the minimum difference in fragment weight that can still be detected by current blockage, expressed in Da. In our simulations, the resolution is represented by the magnitude of Gaussian noise added to fingerprint weights. Specifically, the standard deviation of the noise, σ_r, is defined from the resolution r using Φ⁻¹, the inverse cumulative distribution function of the standard normal distribution, such that the probability of a fragment size measurement deviating r from its actual size is fifty percent (equation 1).
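The relation between r and the noise standard deviation can be worked out from the verbal definition. The derivation below is one possible reading, not the original equation: it assumes that "deviating r" means a two-sided deviation of at least r, and that the noise ε is zero-mean Gaussian with standard deviation σ_r.

```latex
\begin{align}
P\left(|\varepsilon| \geq r\right) &= 0.5,
  \qquad \varepsilon \sim \mathcal{N}\left(0, \sigma_r^{2}\right) \\
2\left(1 - \Phi\!\left(\frac{r}{\sigma_r}\right)\right) &= 0.5
  \;\Longrightarrow\; \Phi\!\left(\frac{r}{\sigma_r}\right) = 0.75 \\
\sigma_r &= \frac{r}{\Phi^{-1}(0.75)} \approx 1.483\, r
\end{align}
```

Under this reading, a resolution of r = 5 Da corresponds to a noise standard deviation of roughly 7.4 Da, and r = 10 Da to roughly 14.8 Da. A one-sided interpretation would give a different constant.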
N_Y are the numbers of fragments in X and Y respectively. At each step in the alignment one of three actions may be taken. First, a single fragment of each fingerprint may be aligned, in which case the absolute difference of their weights is added to the total score. Second, two fragments of X may be aligned to one fragment of Y, corresponding to a missed proteasome target site. This action increases the score by the difference between the summed weight of the former and the single weight of the latter. Third, a gap may be introduced in either X or Y at the cost of a penalty. The gap penalty G depends on the resolution used during digestion through σ_r, the resolution-dependent standard deviation of Gaussian noise added to fragment sizes during in silico digestion (equation 1), and on the lower detection limit L (L = 500 Da). It is set such that introducing a gap is preferred over matching fragments if the difference between fragment weights exceeds the difference expected in 95 percent of correct matches. The addition of L is required to ensure that a match is still preferred if a normally undetected fragment (i.e. of which the weight is under L) is fused to another fragment due to a missed proteasome target site.
A query fingerprint is classified by aligning it to all fingerprints in the database and assigning it the identity of the database fingerprint to which the distance is smallest.
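The alignment and classification procedure can be illustrated with a short dynamic programming sketch. This is not the custom implementation from the repository: the function names `align` and `classify` are hypothetical, a global alignment with gap-penalised boundaries is assumed, X is taken to be the database fingerprint and Y the query, and the gap penalty is passed in as an argument because its defining expression is not reproduced here.

```python
def align(x, y, gap_penalty):
    """Distance between fingerprints x and y (lists of weights in Da).

    Allowed steps: match one fragment of each fingerprint, match two
    consecutive fragments of x to one fragment of y (missed cleavage
    site), or introduce a gap in either fingerprint.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    score = [[inf] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = inf
            if i >= 1 and j >= 1:  # one-to-one match
                best = min(best, score[i - 1][j - 1] + abs(x[i - 1] - y[j - 1]))
            if i >= 2 and j >= 1:  # two fragments of x fused in y
                fused = x[i - 2] + x[i - 1]
                best = min(best, score[i - 2][j - 1] + abs(fused - y[j - 1]))
            if i >= 1:  # fragment of x left unmatched
                best = min(best, score[i - 1][j] + gap_penalty)
            if j >= 1:  # fragment of y left unmatched
                best = min(best, score[i][j - 1] + gap_penalty)
            score[i][j] = best
    return score[n][m]


def classify(query, database, gap_penalty):
    """Return the name of the database fingerprint closest to the query."""
    return min(database, key=lambda name: align(database[name], query, gap_penalty))


if __name__ == "__main__":
    # Toy database with made-up fingerprints; the gap penalty is illustrative.
    database = {
        "P1": [600.0, 700.0, 900.0],
        "P2": [1500.0, 800.0],
        "P3": [650.0, 1200.0, 1900.0],
    }
    query = [1302.0, 898.0]  # resembles P1 with its first two fragments fused
    print(classify(query, database, gap_penalty=520.0))
```

In the toy example the query is assigned to P1, because fusing the first two P1 fragments gives a total distance of only 4 Da. The sketch does not account for the 2 kDa cap when summing fused fragments, which a fuller implementation would likely need to handle.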
Figure S3: Distribution of sequence identities between misclassified proteins and the proteins for which they were mistaken based on their chop-n-drop fingerprint.

Figure 2: Cumulative histogram of correct and incorrect classifications of simulated chop-n-drop protein fingerprints for all human proteome constituents, assuming low noise parameters; resolution r = 5 Da, capture rate C = 0.99 and proteasome cleaving efficiency e_p = 0.99. Numbers are shown distributed over sequence length (bars), and relative to the total number of proteins (pie chart).