A generalized protein identification method for novel and diverse sequencing technologies

Abstract Protein sequencing is a rapidly evolving field with much progress towards the realization of a new generation of protein sequencers. The early devices, however, may not be able to reliably discriminate all 20 amino acids, resulting in a partial, noisy and possibly error-prone signature of a protein. Rather than achieving de novo sequencing, these devices may aim to identify target proteins by comparing such signatures to databases of known proteins. However, there are no broadly applicable methods for this identification problem. Here, we devise a hidden Markov model method to study the generalized problem of protein identification from noisy signature data. Based on a hypothetical sequencing device that can simulate several novel technologies, we show that on the human protein database (N = 20 181) our method has a good performance under many different operating conditions such as various levels of signal resolvability, different numbers of discriminated amino acids, sequence fragments, and insertion and deletion error rates. Our results demonstrate the possibility of protein identification with high accuracy on many early experimental devices. We anticipate our method to be applicable for a wide range of protein sequencing devices in the future.


Introduction
There have been significant advances in nucleotide sequencing technologies in the past decade (1)(2)(3)(4)(5).Nanotechnological techniques are getting closer to single-molecule resolution and high readout accuracy ( 6 ,7 ).In particular, current nanopore sequencing methods for nucleotides rely on electrical signals from small k -mers (e.g.k ≤ 7), which results in 4 k variations of the superimposed signals ( 5 ,8 ).For small k -mers, the decoding problem is tractable ( 5 ).In parallel, there has also been exciting progress in the protein sequencing space such as prototype devices and experiments for single-molecule sensing ( 9 ,10 ) and the development of nanopores suitable for protein translocation ( 11 ,12 ).However, compared to DNA sequencing, the progress in protein sequencing has lagged (13)(14)(15).This is not surprising because protein sequencing involves the discrimination of 20 different residues with complex structures and charge distributions, compared to just four nucleotides in nucleotide sequencing ( 16 ,17 ).The possible variations in the signal from k -mers in the case of nanopore sequencing of proteins quickly become huge (20 k variations), making decoding difficult.
Mass spectrometry is a widely used method for protein identification that compares the mass spectra of the unknown protein to a database of known spectra of proteins.For de novo sequencing, Edman degradation has been largely replaced by tandem mass spectrometry ( 18 ).However, much proteomics is done by protein identification rather than de novo sequencing because in many cases partial sequence information is enough to identify a sequence ( 14 ).Despite being the gold standard of proteomics, the limited dynamic range of mass spectrometry makes it unusable to identify very low concentration peptides ( 16 ).Therefore, efforts have been made to develop alternative methods with higher sensitivity.Methods employing fluorescent tagging of a subset of amino acids (AAs) to generate a fingerprint (19)(20)(21) or using other properties of AAs in addition such as binding kinematics ( 10 ) are being investigated.Recent works suggest that up to six AAs could be labelled without spectral overlapping, which may be enough to identify the sequence from a database ( 16 ,22 ).
Nanopore methods seem more promising because of their potentially higher dynamic range and the possibility of longread sequencing ( 16 ).Progress towards this goal includes the development of engineered biological nanopores, steady translocation of peptides ( 12 ,23-27 ) and acquisition of electrical signals from the translocated peptide to discriminate 13 of the 20 AAs ( 28 ).Approaches based on solid-state nanopores are also being explored (29)(30)(31).Further improvements in single-molecule sensing, e.g. through optical signals from plasmonic resonance and surface-enhanced Raman spectroscopy, are promising in discriminating all 20 AAs ( 17 , 30 , 32-34 ).These advances, and possibly newer engineering techniques and experimental protocols with single AAlevel resolution, are required for successful de novo protein sequencing ( 19 ,35 ).However, improvements in current methods may be enough to develop a sequencing machine that can generate partial signals that can be used to identify proteins from a known database.
Despite these promising steps towards a successful protein sequencing device, considerations such as fluorescent overlap among the labelled AAs, sample preparation, cost, device engineering, ease of operation, nanophysics, nanophotonics and nanotribological principles mean that early devices will likely employ several strategies to make protein identification easier, while still providing a base for the future de novo devices.Well-known strategies include discrimination of a subset of AAs and using protein fragments (19)(20)(21).In addition, these devices will potentially generate error-prone readings and the signal from these devices may not be fully resolvable for an exact identification of the AAs.Therefore, the decoding algorithms to identify the AAs and proteins will likely output a probabilistic readout over all AAs for each segment of the signal.
One of the probabilistic approaches may be to start with a prior probability distribution of each AA.The prior distribution could be uniform (1 / 20 for all AAs), to reflect the absence of any prior information about the AA distribution on the proteins being sequenced, or adjusted based on some prior knowledge.For example, if it is known that the query proteins or fragments are cysteine-rich, the prior probability for cysteine can be set higher.Upon receiving the signals from the sequencing device, the decoding algorithm would update the priors to give the posterior probability distribution of the AAs such that the correct AA, hopefully, will now have the maximum posterior .In particular , the posterior probability for the correct AA should be close to 1 for a well-resolved signal.Depending on the noise and decodability of the signal, however, they will generally be < 1 for the correct AA and > 0 for one or more incorrect AAs.For the positions where the signal is unavailable or unresolved, the posteriors will be the same as priors.
In general, for a peptide of length L (and assuming, for now, the absence of insertion and deletion errors) the decoded output from these sequencing techniques can be written as a 20 × L probability matrix, where each column i contains the posterior probabilities assigned to each of the 20 AAs possible at position i in the sequence.Due to the error-prone nature of the signal acquisition and decoding, a direct reconstruction of each position of the protein sequence using just its posterior probabilities would likely indicate a non-existent or incorrect protein.Therefore, these posterior readouts must be queried jointly against some protein database to identify the protein that they were derived from.
Despite the diverse working principles of many such imaginable devices, they can be reduced to a single hypothetical device that can produce the posterior probability readouts corresponding to the different strategies.Additionally, the hypothetical device is also generalizable to devices without probabilistic decoding algorithms such as those based on just the fluorescent tagging, fingerprinting and possibly Edman degradation.In this case, observing one particular wavelength might suffice to reliably identify an AA, which can be represented by a probability vector such that the probability of the correct AA is close to 1.
Therefore, to study the problem of protein identification, here we propose such a hypothetical protein sequencing device and couple it with a database search method using its probabilistic readouts.In common with others, e.g. ( 36 ), we set as our initial target the identification of human proteins, using ∼20 000 distinct canonical human proteins from UniProt ( 37 ).We study several operating conditions of the device, such as different levels of signal resolution leading to posterior probabilities between unresolved and well resolved, different lengths of the input proteins or fragments, different error rates and various subsets of perfectly detectable AAs, using the readouts from the device to try to identify the correct originating protein.Hidden Markov models (HMMs) and more recently different neural network architectures have been suc-cessfully used to sequence DNA ( 8 ,38-45 ).These methods can also be extended to protein sequencing.Despite their potential for higher accuracy, neural network models are often difficult to interpret in a meaningful way.On the contrary, HMMs are simpler probabilistic models that score a sequence based on the matches, insertions and deletions required to match the probabilistic readouts to a sequence ( 46 ).More recently, HMMs and beam search algorithms have also been used to identify proteins from fluorosequencing ( 36 ,47 ).In this work, we use an HMM for protein identification, first showing that probabilistic, full-length and error-free readouts can be used to identify almost all proteins.Then, we show that the reduction of readout length to as short as 10 AAs can be enough to identify > 95% of the sequences from a single fragment.We further evaluate our protein identification method with simulated fluorescent tagging and show that readouts from just two AAs, leucine (L) and serine (S), on a full-length protein, lead to the identification of 95% of the proteins.In the same case, using one fragment with 100 AAs instead is sufficient to identify at least 76% of the proteins, and if the readout from a third AA, glutamate (E), is added, we can recover at least 94% of the proteins from such shorter fragments.Furthermore, we also test our method on error-prone readings based on discrimination of all 20 AAs or on reduced sets of AAs and show that it is quite robust to a wide range of errors.Our study highlights the potential for success under many different strategies, while also providing useful insights for designing a next-generation single-molecule protein sequencer.

Data
We retrieved all reviewed canonical human protein sequences from the UniProt database (UniProt release 2022_04) ( 37 ).We did not consider the non-canonical isoforms, in order to reduce the database size and computation time.To further reduce the computation time during our many sequence identification experiments, we discarded proteins above the 99th percentile of the length distribution.This resulted in the exclusion of extremely long sequences such as titin (34 350 residues).We were left with 20 181 sequences with a median length of 411 residues ( Supplementary Figure S1 A).These sequences were used as the database to study the problem of identification of human proteins.

Hypothetical single-molecule sequencer
For any input sequence, our hypothetical device sequentially reads each AA and outputs a posterior probability vector over 20 AAs (Figure 1 ).This posterior takes the value p max for the correct AA.The parameter p max gives us the control to simulate performance ranging from perfect determination of every AA ( p max = 1), through devices with good discrimination (e.g.p max = 0.9), as far as p max = 0.05, when all AAs are completely indistinguishable.In practice, for a given sequence, the signal quality could vary along the sequence, leading to different p max values along the sequence.The remaining probability (1 − p max ) might also have some non-uniform distribution over the remaining 19 AAs.However, we simplified our simulations by eliminating these degrees of freedom by assuming that p max is a constant in each simulation and (1 − p max ) is equally divided among the 19 incorrect AAs.If the device is free from sequencing errors such as insertions and deletions A B Figure 1.Hypothetical sequencing device used in this study and data analysis workflow.The sequencing device reads an input protein (in this example, a peptide QFEGSAL) and outputs a posterior probability o v er all 20 AAs for possibly each AA in the input sequence.Two example operating conditions are shown.In case (A), the sequencing device can identify all AAs with p max = 1 / 2 and the corresponding readouts for the input peptide are shown as a heatmap of the posterior probabilities.In case (B), the sequencer is set to output high posterior probabilities ( p max = 9 / 10) to the AA subset L, S and E, which could be because either these AAs are tagged or the device is able to generate good quality signals for only these AAs.Hypothetically, the device is configurable to extend this subset from none to all 20 AAs.If the input AA is not in this subset, the posterior probability is uniform (1 / 20).In this readout, deletion error occurred after the first AA, Q. T heref ore, the uniform posteriors for F are absent.Similarly, an insertion error occurred after the next AA, E, leading to a spurious probability vector suggesting the presence of L. The rates of these insertion and deletion errors are configurable in our simulations.These posterior probabilities are used as emission probabilities of the match state of an HMM.The HMM scores every sequence in the human protein database ( N = 20 181).For illustration, five top example hits and their corresponding scores from the HMM using posteriors from case (B) are shown.The sequence with the highest score (in this example, protein P02647 with score 372.38) is the inferred protein from the posterior probability readout of the input sequence.
(indels), the decoded readouts from a protein of length L are a 20 × L posterior probability matrix.

Inputs to the sequencer
Developments in nanopore methods have led to successful translocation of full-length proteins, e.g.mediated by the enzyme unfoldase ClpX ( 48 ) and more recently enzyme-free electroosmotic methods ( 26 ).These experiments suggest the possibility of sequencing devices that take full-length sequences.We therefore started investigating our protein identification method with single full-length protein molecules as inputs to the sequencer.Our method is equally applicable to any technology that generates posterior probability matrices as described earlier, including any derived from the joint decoding of multiple protein molecules.
In some of the protein sequencing and identification techniques, using a protein fragment is desirable.For example, in fluorescent labelling and subsequent Edman degradation, peptides of length ∼30 AAs or less are preferred due to the limitations of the complete cleavage cycles ( 49 ).Similarly, for nanopore-based devices, some AAs may clog the pore due to their volume, hydrophobicity and charge ( 50 ).Using protein fragments may be one possible strategy to mitigate this prob-lem.Therefore, in addition to using full-length sequences, we also used sequence fragments of various lengths (100, 50, 25, 15, 10 and 5 AAs) as input to the sequencer.First, we again used a single random fragment per protein.However, as many proteins may contain regions such as repeats and conserved domains in common, the location of a single random sequence fragment per sequence might introduce bias in the protein identification process.To investigate and possibly eliminate such systematic biases, while limiting the computational time, we repeated this analysis 10 times per protein, each time using a different random fragment.
Second, since a digested sample could contain many fragments from the same protein, multiple readouts could be used to improve protein identification.Therefore, in cases where performance of our method was not strong, we investigated combining the results from each of the 10 fragment readouts for possible improvements in protein identification.

Amino acids identified by the sequencer
Recent experiments have demonstrated the possibility of generating unique signals from all 20 AAs ( 27 ,32 ).This might lead to the development of a device that can discriminate all AAs on a full-length sequence in the near future.Hence, we simulated the case where the device can attempt the identification of all 20 AAs on a full-length sequence.Additionally, using our parameter p max for the posterior of the correct AA at each sequence position, we simulated different working conditions from a well-resolved signal ( p max = 0.9) to an unresolved signal ( p max = 0.05).
Despite the promising early experimental results in single AA identification, prototype devices are likely to reliably identify only a reduced set of AAs.Additionally, fluorescence methods generally label only a small set of AAs ( 20 ,22 ).To analyse such scenarios, we also simulated the identification of reduced sets of AAs.Generally, the choice of the set of AAs will be dependent on experimental factors such as ease of labelling and spectral overlap.In our simulations, we used three such sets.The first set consists of the five most abundant AAs (L, S, E, A and G) in the human protein database ( Supplementary Figure S1 B).In contrast, our second set consists of the five least abundant AAs (W, M, C, H and Y), and thus is not very informative and likely represents a worst case.With both of these sets, we varied the number of AAs distinguished from one to all five.In addition, we also used a third set containing three AAs (C, Y and K) that are convenient to chemically label ( 13 ,22 ).For this set, we considered all single and double AA combinations.In these simulations, the combinatorial space becomes huge.Therefore, we limited ourselves to the full-length, 100-and 50-AA fragments, and p max of 0.8 and 0.2, which is adequate to study our method's protein identification from better and modest signal quality.

Incorporating errors
Any sequencing method is prone to errors.For devices based on fluorescence techniques, these errors might arise due to inefficient labelling and detection ( 16 ,51 ).For devices dependent upon the translocation of a protein, errors may arise due to the non-uniform speed of the sequence and possibly during the decoding of the signals.In nanopore sequencing of DNA, such errors are also called skip (due to fast translocation) and stay (due to slow translocation) errors ( 8 , 40 , 52 ).Therefore, it is likely that protein sequencing techniques may also suffer from these types of errors.To model these errors, we simulated different levels of insertions and deletions in the signal.For example, for a sequence of length 100 AAs at an insertion rate of 10%, we insert exactly 10 random posterior probability vectors.This simulates the misidentification of AAs, which is similar to substitution errors due to incorrect decoding, and possibly a slow translocation of the peptide.Similarly, for a deletion rate of 10%, we delete exactly 10% of the posterior probability vectors.This simulates a faster translocation or a missed detection of signals.
Translocation speed is likely dependent on the shape, size, charge distribution and other physical properties of the AAs and their response to the translocation mechanism used.Therefore, some positions of the protein may be more prone to errors.However, to simplify our analysis, we assumed these errors to be uniformly distributed along the sequence.In cases where a sequence has both insertion and deletion errors, we simulated deletion first followed by insertions.This ensures that the insertion errors do not undergo subsequent deletion.Furthermore, after performing an experiment, the location of these errors may not be easily detectable, although, theoretically speaking, it might be possible to infer the position of the errors through analysis of the raw signal and some measure of translocation time.While such additional information might be useful in signal decoding, in our study we assume that this information is unavailable: instead of raw signals, our device already gives the decoded readouts as posterior probabilities.
We therefore discard all positional information regarding the errors in our simulations and do not use the error rate information for the subsequent protein identification step.

Identification of proteins using decoded readouts
The readouts from our device are the posterior probabilities of AAs along the sequence reflecting the uncertainties of AAs identified from the signal (Figure 1 ).The core of our protein identification method is an analysis that uses these probabilities to infer the originating protein.For a single readout from a protein or a protein fragment, if we consider the AAs from the unknown protein as match states of an HMM, the posterior probabilities can be treated as the emission probabilities of these states.Additionally, since there might be sequencing errors in the form of indels, we can also assign some probability of transitioning from any of the match states to the insertion and deletion states.Thus, the readouts of the sequencing device can be described using a classic HMM [see e.g. ( 53)].Our goal of identifying a protein sequence from a database using the readouts then becomes very similar to the problem of searching a database of target HMMs using a query sequence.This is a well-studied problem in bioinformatics, HMMER ( 54 ) being one of the most widely used tools for this purpose.In our case, we have one HMM query from each readout of a protein, which has to be queried against a database of known target sequences.Therefore, to identify the protein, we can reverse the classical HMM search problem.Using HMMER and the HMM constructed from the readouts, we score all the sequences in the database.We regard the hit with the highest score as the inferred protein.
In the classical HMM search problem, the probabilities of emission from the match state and various transition probabilities of each target HMM are typically calculated from the occurrences of AAs in a multiple sequence alignment of evolutionarily related sequences.However, the HMMs in our study use posteriors from the sequencing device as the match emission probabilities.To accommodate the potential sequencing errors (indels), we allowed transitions to the indel states by setting the transition probability from each match state to the insert and delete states to 0.1.In contrast, the transition probability from match to match state was 0.8.These transition probabilities were chosen a priori and are flexible enough to deal with indels while still preferring match-to-match transition between AAs in the proteins of the database.Remaining transition probabilities (e.g. from insertion to match, deletion to match) and all other parameters were the default values in HMMER3 (v3.3.2).These parameters can be further adjusted if the error rates of the sequencing device are available.To make the generated HMMs accessible to HM-MER3, the HMM model fitting parameters for Viterbi, multiple ungapped segment Viterbi and forward log-odds likelihood scores were calculated using hmmsim from HMMER3.To query the human protein database, we used PyHMMER v0.6.3 ( 55 ), a Python library binding to HMMER ( 54 ).By default, HMMER3 checks and filters sequences with a composition bias in the AA distribution, e.g.tandem repeats, and large hydrophobic regions.However, we turned off the composition bias filter in HMMER3 for greater sensitivity ( 56 ).

Assessment of results
Using all proteins ( N = 20 181) in our database, we simulated readouts for different cases.To identify the protein, we first constructed query HMMs based on these readouts and then scored each target protein sequence in the database.We took the protein with the highest score as the inferred protein for the given readout.For analyses using multiple fragments (readouts) per protein, we summed the HMMER scores for each target protein over all query fragments and took the highest total score to indicate the inferred protein.We used the accuracy, defined as the fraction of proteins that were correctly identified, to evaluate the performance of our method.

Effect of signal quality
We first studied the effect of signal quality on protein identification by varying the posterior of the correct AA ( p max ) in full-length proteins.We varied p max from 0.05, where the AAs are completely indistinguishable, to 0.9, where AAs are extremely distinguishable (Figure 2 ).The accuracy was 0.96 (19 373 proteins correctly identified out of 20 181) for a low p max of 0.08, and was close to 1 for p max > 0.09.This suggests that our method might be useful even on extremely noisy signals where there is little discrimination of individual AAs.

Effect of sequence length
To evaluate the performance on different length protein fragments, we generated random fragments of different lengths (5-100 AAs).Since the position of a random fragment might lead to varied results, we generated 10 random fragments per sequence.In these simulations, we investigated six values for p max (0.2, 0.3, 0.4, 0.6, 0.8 and 0.9) to represent different working conditions of the sequencing device ranging from low-quality to best-quality signal outputs.The mean accuracy was > 0.94 for fragment lengths from 25 up to 100 AAs for all p max values used (Figure 3 , solid lines).For fragments of these lengths, the choice of p max had very little effect on the accuracy.This suggests that protein identification is possible with small fragments (25-100 AAs) using devices that generate modest quality signals.
On fragments shorter than 25 AAs, a higher p max was required to achieve comparable accuracy.In particular, p max above ∼0.5 is needed for accuracy > 0.8 for fragments with as few as 10 AAs.Fragments of length 5 AAs were inadequate for protein identification, even with the highest p max values.
Additionally, in all of the above simulations, we found very little uncertainty around the mean accuracy and the error bars [95% confidence intervals estimated using bootstrap, as implemented in Seaborn v0.12.2 ( 57 )] are extremely small (Figure 3 ).This implies that the choice of a random fragment has negligible impact on the accuracy of our method.To confirm this finding, we also assessed the proportion of these fragments that led to a correct protein identification ( Supplementary Figure S2 ).We find that for fragments with 25 AAs or more, irrespective of our chosen p max , the majority of sequences are identified from every one of the random fragments.This further supports our finding that each random fragment is informative, and the location bias, if any, is minimal.This also means that using a single random fragment of length 25 AAs or more in our simulations is generally sufficient for protein identification.
Next, we investigated whether the results from multiple fragments could be utilized to increase the accuracy, particularly relevant for the cases with low accuracy, i.e. with small fragment length and poorly resolved signal ( < 25 AAs and p max < 0.5).Using this technique, we find an increase in accuracy in almost all cases (Figure 3 , dotted lines).For the longer fragments, the accuracies were already > 96% and improvements are not visible in the plot.Importantly, the accuracy using small fragments (e.g.10-AA fragments) was now > 96% in almost all of the cases.
For 5-AA fragments, however, combining results from 10 fragments was not sufficient to identify proteins (Figure 3 ).We investigated whether further increasing the number of 5-AA fragments would lead to an increase in accuracy.For this, we set p max = 0.9 and increased the fragment number from 10 to 1000.Using the sum of scores across 1000 fragments to infer each protein, we were able to detect a modest increase in accuracy, from 0.02 to 0.10 ( Supplementary Figure S3 ).Thus, our method is also applicable for real-world sequencing cases where typically a sample of digested proteins would contain numerous small peptides.

Effect of reduced AA sets
Fluorescent labelling techniques usually label a subset of AAs to generate unique fingerprints, which can be used to identify a protein ( 13 ,20 ).Identification of a few AAs in this way, or using label-free methods with a hypothetical device that cannot observe all 20 AAs reliably ( 17 ), is relevant to new nanopore devices because it provides a way to work towards the identification of all AAs in the future while still being useful to generate useful protein signatures.Therefore, we investigated the effectiveness of our search method combined with such fingerprinting techniques.In particular, we are interested in identifying the AAs that are more informative and can provide non-ambiguous fingerprints.We first fix the working conditions of the device by setting p max to 0.8, representing a good quality signal, or 0.2, representing a low-quality signal from the AAs.In both of these cases, we consider both the full-length and fragment (50 and 100 AAs) strategies.Based on the results already presented, these combinations provide us with settings appropriate to examine better and worse sequencing conditions, while limiting the combinatorial space.
Starting with the five most abundant AAs (L, S, E, A and G), we gradually increase the number of these 'labelled' AAs from a single AA (L) to all five (LSEAG).Even utilizing just the single AA L on full-length proteins and a good signal ( p max = 0.8), our HMM search was able to identify 75% of the proteins (Figure 4 A).Adding further AAs of this high-abundance set, the accuracy was > 95%.This is in sharp contrast to the results from the reduced set consisting of the least abundant AAs (W, M, C, H and Y), where we required all five AAs to achieve an accuracy of 80%.We also used a third set of AAs (C, Y and K) that are often used in fluorescent labelling.C and Y are rare AAs in our database, whereas K is somewhat abundant ( Supplementary Figure S1 B).Compared to the previous two sets where all AAs had either high or low abundance, this set represents the labelling of AAs that both have high and low abundances.The best accuracy from this set was 80% when we used all three AAs (CYK).To simulate the practical use case where the experimenter may choose to label some of these three AAs, we ran our simulation on all possible  subsets of (C, Y, K).The combinations with K had a higher performance, presumably because K is a high-frequency AA.This suggests that labelling at least one high-frequency AA might be a good strategy.
We repeated the above simulation for the remaining combinations of p max and sequence length (Figure 4 B-F).Similar to the above result, the set LSEAG still had the highest performance followed by CYK and WMCHY.As expected, the accuracy of all three sets decreases with the decrease in sequence length (Figure 4 , comparison of panels A-C and of D-F) and p max (Figure 4 , compare panels A and D, B and E, and C and F).In particular, accuracy drops sharply for the fragments using the sets CYK and WMCHY, where we can identify < 40% of the sequences in all the remaining cases.However, we were still able to identify 94% of the proteins using the set LSEAG on 50-AA fragments with a p max of 0.8 (Figure 4 C).On further reduction of p max to 0.2, the accuracy was 0.20 (Figure 4 F).In addition, this low p max was not sufficient to identify sequences from the fragments using CYK and WMCHY.These results show that given a higher quality signal, fingerprinting using more abundant AAs might still be reliable for fragments as short as 50 AAs.
We note that the utility of each individual labelled AA seems correlated with its abundance in the human proteome, as might be expected in terms of the total amount of information derived from each choice.We considered the possibility that some less abundant AAs might provide disproportionately high discriminating power, perhaps because of, for example, their (co-)location in certain proteins or because of individual proteins' deviations from average abundances, but we have observed no evidence for this during our experiments involving human reference proteins.and fragments (100 and 50 AAs; centre and right, respectively), and with p max of 0.8 and 0.2 (top and bottom rows, respectively).For each AA set, the number of reduced AAs was increased from one up to all set members.The legend indicates resulting set memberships; the AAs are labelled in panel ( A ) and omitted in panels ( B )-( F ) for clarity.For comparison, the accuracy when all 20 AAs are used for each case (as in Figure 3 ) is also shown (hexagon marker).For the set CYK, we also used all combinations of two AAs.

Effect of errors
To investigate the effect of errors in protein identification, we first set the working condition of the sequencing device by fixing p max to 0.8.This simulates a good signal and makes it easier to study such effects.For worse signals, the device may be already less accurate, therefore making it harder to ascertain whether the performance is reduced due to errors or the signal quality.For the input sequence, we use fulllength (Figure 5 ), 100-( Supplementary Figure S4 ) and 50-AA ( Supplementary Figure S5 ) fragments.For the possible AA discrimination from the device, we examine discrimination of all 20 AAs, and AAs from the reduced sets (LSEAG, WMCHY and CYK).To reduce the combinatorial space, we used all AAs from each reduced set for this analysis.On a full-length sequence, with identification of all AAs, our protein identification method is extremely robust to errors (Figure 5 , top left).Indel rates as high as 50% still resulted in a maximum accuracy of 0.99.However, upon further increasing the indels to 60%, the method starts to break down as reflected by the sharp decrease in accuracy (0.40).The reduced AA sets, however, are less resilient to errors.The set with the most frequent AAs (LSEAG) was the most promising and had a maximum accuracy of 0.95 at error rates of 30% (Figure 5 , top right).An additional increase in error rates has a more pronounced effect on accuracy.In particular, the method starts to break down after 40% indels and the accuracy drops to nearly zero for higher error rates.In contrast, the sets with lower frequency AAs (WMCHY and CYK) had very poor tol-erance to errors (Figure 5 , bottom).In these sets, error rates up to 10% gave an accuracy of ∼0.67, which may be considered acceptable.
On the other hand, fragments generally are less tolerant to errors and they have lower performances compared to the cases with full sequence.For example, using 100-AA fragments ( Supplementary Figure S4 ) with all AAs identifiable, we can identify 80-95% of the proteins for error rates up to 40%.With an unequal amount of insertions and deletions in readings, 60% of either of these errors still gave a maximum accuracy of 0.97.However, simultaneous increases in the insertion and deletion rates generally lead to extremely poor performance with accuracies up to ∼0.2.Similar to the full-length case, the high-abundance reduced set LSEAG has a better performance compared to the sets WMCHY and CYK.This set had a maximum accuracy of 0.95 for error rates of 20%.Error rates of 30% may be considered acceptable, where the maximum accuracy was ∼0.63; for error rates of 50% and higher, our method was able to identify < 20% of the proteins.The low-abundance sets WMCHY and CYK were not useful in protein identification.They already had a very low accuracy even without any errors in the signal, and further reduction of sequence length to 50-AA fragments causes less error tolerance ( Supplementary Figure S5 ).For example, whereas when all AAs are identifiable the accuracies at 50% error rates can be as high as 0.95, for the reduced set LSEAG, the maximum accuracy is ∼0.73 for error rates up to 20%, after which the method seems to break down.Effects of error-prone readouts on full-length sequences.Heatmaps show the accuracy under different rates of insertion ( y -axis) and deletion errors ( x -axis) on the readouts.The panel titles indicate the AAs identified by the sequencer, i.e. identification of all 20 AAs (top left) and the three reduced sets of AAs (remaining panels).For these results, we used full-length proteins from the human protein database ( N = 20 181).The effects of errors on protein identification using 100-and 50-AA fragments are shown in Supplementary Figures S4 and S5 , respectively.

Discussion
Protein sequencing technologies are evolving at a fast pace with many new devices and methods.To explore the potential success of these different technologies in a scenario where protein sequences are determined only with uncertainty or errors, and potentially from fragments rather than full-length proteins, we proposed a hypothetical black-box sequencing device.The readouts from our hypothetical sequencer consist of position-wise uncertainties in the AA as a distribution of posterior probabilities, with additional errors (insertions and deletions) also possible.This format is likely to encompass the types of readouts available from early single-molecule protein sequencing devices, and is comparable to position-specific probabilistic representations of protein families and domains such as profiles and motifs.Profiles, in particular, are more generalized representations because indels are allowed at every position ( 46 ,58 ).In the case of protein sequencing, indels in the signals originate from errors in the sequencing device or decoding algorithm.Therefore, HMMs are a natural way to model our probabilistic readouts.
We devised a method for protein identification that, for a given readout, builds an HMM using the posteriors as emission probabilities on a match state.Analogously to existing homology search techniques, we then query the human protein database to find the most similar sequence for a given readout.Our hypothetical device is thus suitable to study the protein identification problem by first generating data and then using them to query a database and infer the likely originating protein.Several additional improvements on this pipeline are possible.First, depending upon the experimental settings and knowledge of error rates of the sequencing device, the transition probabilities of the HMM can be suitably configured to further increase the accuracy.Second, in this work we attempted an exact identification of the proteins.In some experimental settings, it may suffice to know just the protein families, which is also possible within our method.Third, our assumption throughout this work was that the readouts will always originate from proteins in the database.In practice, sequencing samples may often be contaminated with spurious peptides from unknown sources.In such cases, a suitable Evalue or bit-score threshold may also be applied to filter the spurious sequences.
A further feature of our device is a tunable parameter, p max , corresponding to the posterior probability for the correct AA.We use this to simulate variable signal quality and decodability, irrespective of the working principles of the device.However, this is a rather simplified version of the actual operating characteristics of a real device, where signal quality could follow some complex distribution along the sequence.We also simplified indel error processes by assuming uniformly random errors along the sequence.Despite these simplifications, such a model helped us evaluate the performance of our protein identification method under diverse conditions.More complex models of sequencing device outputs could be adopted in future iterations of this research.
Successful experiments for the discrimination of all 20 AAs at a single-molecule level ( 32 ) and the translocation of long peptides ( 26 ) are promising for the development of a protein sequencing device in the future that can read all AAs in fulllength proteins.Simulating these conditions, we found that, in the absence of indel errors, p max as low as 0.08-an exceedingly weak signal-was sufficient to identify > 95% of the proteins from a single sequence (read).This suggests that protein identification using our method is possible despite a marginally resolvable signal.In addition, the method can also tolerate indel rates at least as high as 50-60% if the signal quality is good (e.g.p max = 0.8).
Early devices will likely use protein fragments to mitigate problems such as nanopore clogging and limited Edman cycles.Further simplification for the device development would be to attempt the identification of a reduced set of AAs.We simulated both of these scenarios and studied the accuracy of our protein identification method.We first used random fragments of different lengths and different p max values.We were able to identify at least 94% of the sequences using fragments as short as 25 AAs.Shorter fragments, containing 10-15 AAs, also provide a good accuracy (0.8) if the device generates good signals ( p max = 0.8).Interestingly, our results also indicated that the location of a random fragment from within a protein has a negligible effect on protein identification.However, this could also be a consequence of our choice of protein database because we used canonical human proteins, where the sequences are generally unique.
Our method is easily extendable to devices that operate on multiple sequence fragments.We found that by combining scores from multiple fragments, we could significantly improve the accuracy in marginal cases.
We also studied the performance when a reduced set of identifiable AAs is used.This simulates situations such as where some AAs are fluorescently labelled so that they are identified through the reduction of intensity after each Edman cycle, whereas the unlabelled AAs do not cause a reduction ( 16 ,51 ).Previous studies have utilized such techniques to generate protein fingerprints ( 13 , 19 , 20 ), although none of them proposed a general method for identifying the originating protein from a large database.In our study, we investigated the use of three different subsets of AAs (LSEAG, WMCHY and CYK).Using our HMM protein identification method, the set with high-frequency AAs (LSEAG) had an extremely good performance: indeed, detection of just two AAs (LS) was sufficient to identify almost 96% of the sequences.In contrast, we need all five of the low-frequency AA set (WMCHY) to identify 80% of the sequences.The set CYK also had similar accuracy, probably because despite C and Y being rare, K is a more frequent AA in the human proteins used as our test set.For protein fragments, these sets gave a lower performance depending on p max and the length; however, the set LSEAG still had the highest accuracy.Thus, we conclude that if labelling is to be used, labelling more frequent AAs would be advantageous in our method for protein identification and could give excellent results.
We further investigated the impact of possible errors from the sequencing device by generating indel-prone readings.We found that on devices where there is discrimination of all 20 AAs, even with indel rates as high as 40-50% our method is successful irrespective of whether full length or fragments were used.In many cases studied, accuracy was > 0.90.However, reduced AA sets were not as robust to errors.The set (LSEAG) with highly abundant AAs was the most promising and had acceptable performances for error rates up to 20-30% depending upon the length of the input sequence.Therefore, our method can be used on devices with low to moderate error rates.
In conclusion, we used a hypothetical sequencing device to explore the ability of a novel, HMM-based method to identify proteins different types of possible readouts.In many of the cases, we had good accuracy, which suggests that our method could be successfully applied to different sequencing devices in the future.Our results could also help develop new strategies for protein sequencing devices.

Figure 2 .
Figure2.Effect of signal quality on protein identification for full-length proteins.Note the sharp increase in accuracy in the low signal quality region ( p max ∼ 0.06-0.08).The accuracy is > 0.95 for p max > 0.08.Note the logarithmic scale on the x -axis.

Figure 3 .
Figure 3.Effect of sequence length under different values of p max .The solid lines represent the mean accuracy of protein identification from single fragments, a v eraged o v er 10 random repetitions per protein.Dif ferent lines indicates dif ferent lengths of the fragments.The error bars (95% confidence intervals) are smaller than the marker size.The dotted lines represent the accuracy when the sum of scores from 10 random fragments is used to infer the protein, only making a visible difference for 10-and 15-AA fragments.

Figure 4 .
Figure 4. Effect of using reduced sets of AAs.Three sets of reduced AAs (LSEAG, CYK, WMCHY) were used with full-length sequences (left column) and fragments (100 and 50 AAs; centre and right, respectively), and with p max of 0.8 and 0.2 (top and bottom rows, respectively).For each AA set, the number of reduced AAs was increased from one up to all set members.The legend indicates resulting set memberships; the AAs are labelled in panel ( A ) and omitted in panels ( B )-( F ) for clarity.For comparison, the accuracy when all 20 AAs are used for each case (as in Figure3) is also shown (hexagon marker).For the set CYK, we also used all combinations of two AAs.

Figure 5 .
Figure5.Effects of error-prone readouts on full-length sequences.Heatmaps show the accuracy under different rates of insertion ( y -axis) and deletion errors ( x -axis) on the readouts.The panel titles indicate the AAs identified by the sequencer, i.e. identification of all 20 AAs (top left) and the three reduced sets of AAs (remaining panels).For these results, we used full-length proteins from the human protein database ( N = 20 181).The effects of errors on protein identification using 100-and 50-AA fragments are shown in Supplementary FiguresS4 and S5, respectively.