QuipuNet: Convolutional Neural Network for Single-Molecule Nanopore Sensing

Nanopore sensing is a versatile technique for the analysis of molecules on the single-molecule level. However, extracting information from data with established algorithms usually requires time-consuming checks by an experienced researcher due to inherent variability of solid-state nanopores. Here, we develop a convolutional neural network (CNN) for the fully automated extraction of information from the time-series signals obtained by nanopore sensors. In our demonstration, we use a previously published data set on multiplexed single-molecule protein sensing. The neural network learns to classify translocation events with greater accuracy than previously possible, while also increasing the number of analyzable events by a factor of 5. Our results demonstrate that deep learning can achieve significant improvements in single molecule nanopore detection with potential applications in rapid diagnostics.


Introduction
Nanopores have emerged as powerful sensing devices for single molecules (2,3), with applications in DNA sequencing (4), protein detection (1,(5)(6)(7)(8)(9), the study of protein folding (10), SNP genotyping (11), data storage (8), and DNA computing (12).A typical setup consists of two liquid filled reservoirs connected by a nanopore with diameters down to a few nanometres.An external electric field drives charged molecules through the nanopore, as shown in Figure 1a.The passage of molecules modulates the current, producing a characteristic signal that contains information about the shape of the molecule.
The readout is a time-series current trace corresponding to the shape of the molecule, usually called an event.Detection of such events can be achieved using simple current thresholds, but the subsequent analysis of features within each identified event is often made difficult by a poor signal-to-noise ratio, varying conformation of the molecule, and non-specific interactions with the nanopore surface.For example, Figure 1b shows two events from a multiplexed protein sensing technique published in (1).The authors used a DNA molecule as a carrier for a protein target.Modifications along the DNA molecule and bound targets produce secondary current drops during the translocation event, as shown in the two traces.In the first half of the structure, DNA hairpin loops at defined positions and their corresponding secondary drops were used to encode a digital barcode.This barcode uniquely identified a binding site in the other half of the DNA molecule.The presence of a target at the binding site could be inferred from a single secondary drop in the second half.This approach allows the simultaneous detection of a large number of targets, only limited by the number of distinct barcodes.The information is encoded in additional current drops during the event, much like the knots on a string used in the Inca Quipu system (13).
Analysis of the event data requires accurate detection and subsequent interpretation of secondary current drops (1).However, simple peak finding algorithms often fail at reliably classifying large parts of the data.Common causes of errors are a varying peak magnitude, noise (6), fluctuating velocities (14), overlapping peaks, DNA knots (15) and folded molecules.To mitigate these effects the nanopore community has developed sophisticated algorithms (16)(17)(18)(19).However, they frequently require manual parameter tuning for each dataset and supervision of algorithms (1,9).In the worst case scenario, researchers have to manually interpret the data, leading to small sample sizes, possible confirmation bias, or data analysis duration exceeding measurement time.
In this paper, we show that deep learning is ideally suited for automating the analysis of nanopore sensing data.For our study, we use the previously mentioned multiplexed protein sensing dataset from (1).The dataset contains separate control measurements for each specific barcode, without other bit permutations present in the solution.This automatically provides labelled data to train the supervised learning model.At the same time, the data is sufficiently complex to require an elaborate algorithm for the classification of events.In (1) a twelve step approach was used to identify the bit sequence and presence or absence of a target on each DNA construct.That method relied on more than a dozen manually adjusted parameters that were carefully optimised, but still it could only use a small fraction (∼ 20%) of events, discarding up to 80% of the difficult-to-interpret events that failed some predefined set of criteria.Here, we show that machine learning models are able to interpret and classify data without the need for manual tuning and the development of complex algorithms while increasing the number of usable events by a factor of five.Our implementation is open-source and available online to enable the adaptation of deep learning to other nanopore sensing problems (20).

Methods
We chose convolutional neural networks (CNN) as the machine learning approach because of their suitability for detecting local patterns (21,22).A recent study showed that CNNs perform well on simulated current traces from an STM tunnel junction (23).For comparison, DNA bases can be accurately determined from current levels using recurrent neural networks (24).However, our goal is fundamentally different, as we are trying to identify the pattern encoded on the DNA secondary structure from a variable nanopore system.Therefore, we chose to use the CNN architecture.A typical CNN consists of two parts, as shown in Figure 1c.First, a series of convolutions are applied to the raw input data.Then a dense neural network learns to interpret the processed signal.The output is a prediction about which class a particular input belongs to.In our case, the prediction is a barcode on the DNA constructs and whether a target has bound to it.
Before feeding the data into the neural network, we perform two preparation steps.First, the raw dataset contains erroneous detections, caused by contaminations, incomplete DNA fragments, and non-specific interactions with the pore walls.We use standard filtration methods to remove these detections (25): we exclude events whose area under the current trace (electronic charge deficit) lies outside two standard deviations of the mean, as well as those with current drops larger than 3.2× the unfolded event current level.Details are available in the supplementary material (20).This filtration removes up to 30% of the detections recorded with the measurement setup.After filtration we still observe some events with errors, such as a missing bit in the barcode structure.Therefore, perfect accuracy is unattainable using realistic datasets.
Secondly, we want the model to identify a molecule, but not the experiment.The problem arises because nanopores vary in shape and conductivity, leading to a correlation between events measured with the same nanopore.It is possible for the neural network to overfit to these variations, thereby learning to identify a nanopore instead of the barcode on a molecule.To reduce such over-fitting, we normalise the events from each nanopore to have the same unfolded current level (arbitrarily set to −1).In addition, we test the model using independent experiments to reduce the chances of spurious correlations.Table I shows the number of events in the training and test sets.
As mentioned above, our predictor model is based on a machine learning technique called neural networks.The architecture of such a network specifies how the network nodes are connected and what operations are applied.In order to find a suitable architecture for nanopore data, we investigated different alternatives by educated trial and error.The model presented here is inspired by the image classification network in (26), which we modified to perform 1D convolutions.Figure 2 shows the architecture.
We optimised the (hyper-)parameters to work well for nanopore data by trial and error.A typical procedure is to pick one hyper-parameter, such as the number of convolution layers, then increase the number and measure the resulting accuracy.If the accuracy increases we stick with the new number, but if it decreases or does not change we stick with the old number.We then pick a different hyper-parameter and repeat the procedure.To avoid over-fitting to the test, we measured the accuracy gains using a development set, which is independent from the test set and 20 times smaller than the training set.The reported numbers in Figure 2 are the result of our optimisation.
The input for the neural network is a current trace from a measurement event.The data from (1) produced events with an average length of 402 data points.This includes short stretches of current recording before and after the event.As the maximum length of the event never exceeded 700 points, we use a 700-element vector as the input.The shorter events are padded at the end with Gaussian noise ( = 0,  = 0.072, corresponding to average noise levels).
Each box in Figure 2 corresponds to a so-called 'hidden layer' that performs a specific task and passes on the information.Here, we give a brief description of each component; we refer interested readers to the machine learning literature for more details (21,22).
Convolution layers extract features with local structure, such as peaks or steps.These layers perform a discrete convolution on a segment of the input by multiplying it with a small window, called kernel, and moving along to the next segment (stepping by a single vector element).The output is large if the input features match the kernel, where its weights are learned from the training data.For example, Figure 1c shows the output after the first convolution layer, where the orange line corresponds to a kernel that detects peaks.Other kernels detect other features in the input data, which are often difficult to interpret, as seen by two grey lines that correspond to different kernels.After each convolution, we apply a batch normalisation (BN) layer that normalises  the data to have zero mean and unit variance (27).These layers improve our network training convergence.Finally, an activation function is applied -a piecewise function called rectified linear unit (ReLU),  () = max(0, ).This non-linear function is necessary for learning nonlinear relationships between features (22).The activation function completes one row in the diagram, its output goes into the next convolution layer.Roughly speaking, the deeper layers capture more abstract and complex features.We follow the common practice of increasing the number of kernels for deeper layers (21): from 64 to 128, then to 256.Each step doubles the amount of information passed to the next layer.For every two convolutions, we have a 'Max Pool' layer to reduce the amount of information by down-sampling FIG. 2.
The architecture of the neural network, where each element is briefly described in the methods section.Acronyms: BN is a batch normalisation layer; ReLU is a rectified linear unit and is shown on top.Numbers in the brackets correspond to the matrix sizes encoding a single event at that point in the network.The model has 3 995 920 trainable parameters.spacial dimensions.A Max Pool layer splits an input vector into segments of three numbers and returns only the maximum values within the segment.This arrangement is believed to improve spacial invariance for feature extraction (22).
The dropout layer reduces over-fitting by randomly switching off a fraction of nodes in the layer above.This encourages the network to learn more robust features that do not depend on a single node (28).Note that the dropout is only applied during training, because we want maximum accuracy while using the algorithm.
The second half of the network is a densely connected neural network with two hidden layers and a ReLU activation function.In a dense network, the nodes between adjacent layers are fully connected, as illustrated in Fig- ure 1c.The weights for these connections are learned from the training data.
The output layer is adjusted depending on the task.In our case, we have two outputs: the barcode and sensing region.The barcode output is a vector with 8 elements and a softmax activation function.The softmax normalises the output vector to have a sum of one such that each element is a proxy for the probability for a different barcode.We take the maximal value to be the predicted barcode.For the sensing region, the output is a single number with a sigmoid activation function.This number is a proxy for the probability of having a target bound to the sensing region.Note that these are two networks that are trained separately and give independent outputs.
The model is trained for 200 epochs on a GPU (Nvidia GeForce GTX 1080 TI).The aim of the training is to find the weights that maximise accuracy, which corresponds to minimising a loss function.For barcodes, the loss function is categorical cross-entropy, while for the sensing region it is binary cross-entropy.To minimise the loss function we use the Adam optimisation algorithm (29) (LR=0.001;decay=0.97;batch size of 32).Typical training takes 200 min, while evaluation is much faster at 1600 events/s, making QuipuNet suitable for real-time classification.

Results
QuipuNet correctly identifies almost all events even with highly complex shapes, as shown in Figure 3.For example, the first event in column one enters the nanopore with the barcode first, while the second and third examples enter with the sensing region first.QuipuNet can interpret both directions.Columns two and three show that it learns to identify folded DNA events which occur when a nanopore captures the DNA molecule somewhere along its length.These events are particularly difficult to interpret because there are many possible outcomes and peaks tend to be less pronounced.For comparison, the method from (1) discarded folded events so that only the  1).(b) These 9 example events present typical results from the data set in (1).The original algorithm only identified the two blue events: 111, unbound and 001 bound; while QuipuNet correctly identified all these events.It is important to note here that QuipuNet increased the number of usable events by a factor of ∼ 5.
events shown in blue could be identified.Precision is the fraction of correctly identified samples out of attempted guesses while recall gives the fraction of correctly identified samples out of all the events.Data utilised is a fraction of events that the algorithm attempted to identify.
Table II presents a quantitative comparison of accuracy.The first metric for accuracy is precision, which gives the fraction of correctly identified events out of at- tempted guesses.Precision can be boosted by refusing to label difficult events.On the other hand, the recall metric gives the fraction of correctly identified events out of all the events (after filtration).For example, the Bell & Keyser method (1) and human experts achieve high accuracy but have a low recall because events with ambiguous barcode patterns are discarded.
QuipuNet achieves a precision of 0.946 for barcodes and 0.971 for the sensing regions.This is 1.0% and 3.4% higher than the Bell & Keyser method.A much bigger difference can be seen in the recall metric because QuipuNet classifies all the data.The recall is five times larger than the original method (1) for both the barcode and sensing region.These results suggest that QuipuNet accurately classifies the nanopore event data, including folded events.As a result, QuipuNet outputs 5 times more data than the previous method for the same experiments.
To measure human expert performance, one of the authors labelled 500 randomly chosen events and compared them with the true labels (it took around one hour).Only 45% of events could be labelled reliably because of the ambiguity introduced by folds or overlapping peaks.Compared with human performance, QuipuNet is 3.3% less precise at reading the barcode and 4.4% better at reading the sensing region.In both cases, the recall metric is more than twice that of a human expert.
To optimise for accuracy, we can discard low confidence predictions to increase the precision.Practically, it makes sense to discard events where a barcode is simply missing or otherwise impossible to identify.To achieve this, we estimate the confidence using the maximal value of the softmax output vector and then discard events with the lowest confidence.We use a 'data utilised' fraction to show how much data remains after discarding low confidence predictions.
Figure 4a shows the accuracy as a function of data utilised for the barcode predictions (evaluated on the test set).The accuracy increases with the amount of discarded data, suggesting that the confidence estimator correctly identifies poor predictions.The accuracy curve is significantly above manual labelling and the Bell & Keyser method, suggesting that QuipuNet outperforms both.For illustrative purposes, at 80% utilised data QuipuNet precision is 0.987, which is higher than the human performance.Figure 4b shows an equivalent plot for the sensing region predictions.Here, QuipuNet achieves a nearly perfect precision of 0.997 for 80% utilised data.In both cases, discarding low confidence predictions increases the accuracy of the QuipuNet algorithm.
The predictions for the sensing region have a higher accuracy than those for the barcodes.We attribute this to two effects.First, the sensing region typically has a higher signal-to-noise ratio, i.e. larger current drops.Secondly, the barcode prediction is an intrinsically harder problem, because the algorithm must distinguish between 8 different classes, instead of two.
Figure 4c shows where the errors are made for the barcode predictions.The matrix suggests that QuipuNet makes more mistakes for certain barcodes.For example, the prediction for barcode '100' has a precision of only 0.86, which can be attributed to the small training set.It only has 876 events measured by two experiments while the third experiment was used for the test set.A larger training set is expected to improve the accuracy.
The error matrix also provides insights for designing more robust barcodes.The barcodes '000', '001', '101', '110', and '111' all have a similar amount of training data, but the symmetric barcodes have a higher accuracy.Here, symmetric barcodes are '000', '101', and '111' ('010' has smaller amount of training data).This observation suggests that using only symmetric patterns for barcodes might improve the overall accuracy.
Finally, we trained QuipuNet on a reduced training set to assess the relationship between accuracy and training set size, as shown in Figure 5.For the sensing region, we randomly picked the same number of events for bound and unbound states.For the barcode, we randomly reduced the training set size of the '011' barcode to a number specified on the x axis, while the other barcodes had the same number of events as specified in Table 1.The resulting recall metric reaches 80% at 2000 training events, 90% at 8000 and then slowly increases to > 90% for more than 8000 training events.The increase in accuracy beyond 90% appears to be asymptotic and would require even larger training sets.The classification of the sensing region (blue data in Figure 5) reaches higher accuracies for smaller training sets as it only has two classes and signal-to-noise for protein signals is higher than for barcodes.

Discussion
We have shown that convolutional neural networks can accurately classify events from nanopore data.Our network achieves better accuracy than the previous algorithm (1) or manual classification, while at the same time classifying events that were impossible to interpret before.As a result, five times more data can be analysed from the same experiments.Furthermore, the machine learning approach simplifies the analysis by eliminating manual parameter tuning and algorithm development.Instead, we rely on experiments to generate the labels that are used to train the neural network.
In the supplementary material, we use QuipuNet to analyse raw data from other nanopore experiments (20).In (11) the authors detected single-nucleotide polymorphisms from the presence of a single binding target.Their designed DNA molecules contain only the sensing region with no barcode.We successfully reproduce results from their analysis using QuipuNet.Despite a significantly lower signal-to-noise level for this data set we obtain accuracy of up to 72% when including folded events.If only the unfolded events are analysed, the accuracy is 0.91.This shows that QuipuNet can be readily applied to other nanopore sensing datasets.When designing a nanopore experiment, others should consider the relationship between the desired accuracy and the number of training events.
Our work suggests that deep learning is particularly suitable for nanopore sensing because the experiments can generate large amounts of training data; often with predefined labels.A similar conclusion was reached for nanopore-based DNA sequencing, where a recurrent neural network improves the precision of DNA sequencing (24).Future work may address other difficult problems in the nanopore field.Specifically, peak localisation in noisy datasets (6) can be trained using DNA with known modification positions.Also, running QuipuNet against simulated datasets (generated classically or with generative adversarial networks) could guide the design of the DNA structures in order to maximise the information density or readout accuracy.Both are critically important for information storage on DNA and hold the promise of highly multiplexed protein sensing for medical applications.
FIG. 1. Convolutional neural networks for the analysis of nanopore data.(a) The shape of molecules is contained in the timedependent ionic current signal from passing the molecules through a nanopore.For example, a molecule with three protrusions passing through the nanopore leads to a current event with three secondary current drops as indicated by the red arrow.(b) Current traces associated with the modified DNA molecules from (1).The first half of the molecule encodes a unique barcode: the first peak marks the start; three bits uniquely identify molecule design; and the last peak signifies the end.The two events have barcodes '111' and '001'.The second half has a binding site for a specific molecular target.(c) Data analysis using deep learning methods: convolution layers extract local features such as current drops of different width (shown in orange).The features are interpreted by a fully connected neural network, which outputs a prediction for the barcode and the target binding state.

FIG. 3 .
FIG.3.Example events identified by semi-automated algorithm (1) and QuipuNet.The sketches in (a) show some of the possible DNA configurations during the passage through a nanopore.The shape of the molecule complicates the semiautomated analysis(1).(b) These 9 example events present typical results from the data set in(1).The original algorithm only identified the two blue events: 111, unbound and 001 bound; while QuipuNet correctly identified all these events.It is important to note here that QuipuNet increased the number of usable events by a factor of ∼ 5.

FIG. 4 .
FIG. 4. Evaluating the performance of QuipuNet.(a) Barcode prediction accuracy (precision) as a function of data utilised.The accuracy increases when the least confident predictions are removed.(b) Sensing region prediction accuracy as a function of data utilised.(c) Error matrix: rows represent true barcodes from the test set, while columns are the barcodes that QuipuNet assigned them to.In an ideal case, it would be a diagonal matrix.The matrix was evaluated using the entire test set.On the right, bars show the number of events in the training set for each barcode.Accuracy correlates with the size of the training set.

FIG. 5 .
FIG. 5. Recall accuracy as a function of training set size.The number of events are shown for bound/unbound states and the '011' barcode.

TABLE I
. The number of events in the training and testing sets.The last two columns show the number of independent experiments without protein (unbound state) and with protein (bound state).