CONNET: Accurate Genome Consensus in Assembling Nanopore Sequencing Data via Deep Learning

SUMMARY Single-molecule sequencing technologies produce much longer reads compared with next-generation sequencing, greatly improving the contiguity of de novo assembly of genomes. However, the relatively high error rates in long reads make it challenging to obtain high-quality assemblies. A computationally intensive consensus step is needed to resolve the discrepancies in the reads. Efficient consensus tools have emerged in the recent past, based on partial-order alignment. In this study, we discovered that the spatial relationship of alignment pileup is crucial to high-quality consensus and developed a deep learning-based consensus tool, CONNET, which outperforms the fastest tools in terms of both accuracy and speed. We tested CONNET using a 90× dataset of E. coli and a 37× human dataset. In addition to achieving high-quality consensus results, CONNET is capable of delivering phased diploid genome consensus. Diploid consensus on the above-mentioned human assembly further reduced 12% of the consensus errors made in the haploid results.


INTRODUCTION
Single-molecule sequencing (SMS) technologies produce much longer reads compared with next-generation sequencing, greatly improving the contiguity of de novo assembly of genomes. However, the relatively high error rates in long reads make it challenging to obtain high-quality assemblies. A computationally intensive consensus step is therefore required to resolve the discrepancies in the reads and improve assembly accuracy.
To resolve their discrepancies, reads are usually aligned to the draft assembly. A trivial consensus step could be implemented by taking base-by-base majority votes at each position of the alignment pileup. Such a method is succinct but vulnerable to systematic bias and high indel rates. As the techniques in sequence alignment advance, the partial order alignment graph has been found useful in improving consensus (Lee et al., 2002). There are several efficient consensus tools (Vaser et al., 2017; Koren et al., 2017; Ruan and Li, 2020) based on this idea, and Racon (Vaser et al., 2017) is generally considered to be the fastest. When paired with the fast assembler miniasm (Li, 2016), miniasm + Racon is the most efficient assembly pipeline. It has also been observed that Racon(4) (i.e., Racon iterated 4 times) often achieves the highest accuracy; in the worst case, however, it even underperformed Canu. CONNET is designed to achieve an absolute improvement in accuracy: its network can handle input assemblies ranging from low accuracy to high accuracy and polish them, giving a consistently high-quality result.
Another challenge in Oxford Nanopore consensus is the non-uniform sequencing errors in the reads. The deletion rate (5.74%) is more than twice as high as the insertion rate (2.48%), according to the error profile generated using Nanosim (Yang et al., 2017) on human chromosome 20 (Jain et al., 2018), making the recovery of missing bases in consensus more challenging. We address this problem by introducing an extra base-recovery phase in our consensus pipeline. In medaka's E. coli consensus, deletions (missing bases in the assembly) account for more than 75% of the total errors; compared with medaka's result, CONNET halved the deletion rate.
In this study, we present CONNET, an accurate and flexible consensus tool. CONNET showed the highest accuracy of any existing method and ran faster than the tools with comparable accuracy. For E. coli, CONNET improved the accuracy of an E. coli de novo assembly from miniasm from 88.65% to 99.92% in 0.52 CPU h, better than Racon(4) + medaka, which achieved 99.85% accuracy in 1.30 CPU h. On a 24-CPU-core machine, CONNET achieved 99.80% accuracy on human chromosome 1 in 1.55 h, whereas Racon(4) + medaka achieved 99.77% accuracy in 3.27 h. For comparison, our trivial consensus tool achieved 98.68% accuracy on E. coli and 98.76% accuracy on the human assembly.
Finally, CONNET can generate phased diploid genome consensus for Oxford Nanopore data, further boosting the accuracy of the resulting assembly in diploid organisms such as Homo sapiens.

RESULTS
To demonstrate the high-quality consensus produced by CONNET, we performed experiments on E. coli and the human genome. In all experiments, CONNET achieved the highest accuracy while being the fastest of the tools with comparable accuracy. We also extended CONNET into a diploid genome consensus tool and benchmarked it on several chromosomes of the human genome assembly. The datasets used in this article and their links are summarized in both the ''Data and Code Availability'' section and Table 5.

Consensus on E. coli Genome
We benchmarked CONNET against other tools, including medaka, Racon, Canu, and wtdbg2, on two Oxford Nanopore E. coli datasets. Among the tools benchmarked, Canu and wtdbg2 are complete genome assemblers that contain a built-in consensus step. CONNET, as well as medaka and Racon, are consensus tools that take a draft assembly instead of raw reads as input, and therefore need to be coupled with an assembler in this experiment. Miniasm (Li, 2016) exactly suits this purpose, as it is a long-read de novo assembler that does not contain a consensus step. Moreover, the miniasm + Racon pipeline is the recommended usage of Racon. For medaka, the default usage is pomoxis + medaka, where pomoxis is a wrapper of miniasm + Racon.
We obtained a publicly available 90× E. coli SCS110 dataset generated with the R9.4.1 sequencing chemistry. Table 1 summarizes the assembly and consensus results. Identities were evaluated against the E. coli SCS110 reference genome using QUAST (Gurevich et al., 2013); the number after an iterable tool indicates how many iterations the tool was run. CONNET was trained on the same dataset owing to the scarcity of publicly available R9.4.1 E. coli datasets. In Table 2 and Figure 3, we show results using different datasets or chromosomes for training and testing. Moreover, Figure 1B shows that CONNET has learnt a general model for E. coli assembly and is agnostic of the algorithm used for generating its input: CONNET was trained on the draft assembly generated by miniasm and performed equally well using wtdbg2, Racon, and Canu as its input.
CONNET achieved the highest accuracy, 99.92%, followed by medaka's 99.85% accuracy. Figure 1A compares the performance of the iterable tools: CONNET, wtdbg2, and Racon. The accuracy of all tools increased and converged with iterations; however, the other tools could not reach the accuracy of CONNET even after a sufficient number of iterations.
We analyzed the errors in consensus with Figures 2A and 2B. Owing to its unique base-recovery phase, CONNET was the best tool for reducing deletion errors. CONNET and medaka were the most effective tools for resolving homopolymer errors. Medaka has proposed a specialized ''homopolymer compression'' algorithm, whereas CONNET's strength with homopolymer accuracy comes from our novel spatial-aware input tensor. We further studied the impact of sequencing coverage on consensus accuracy in Figure 3. We downsampled the dataset used in Table 1 to different coverages ranging from 27× to 81×. As expected, consensus accuracy falls as coverage decreases. However, at 63×, or 70% of the coverage of the original dataset, CONNET achieves a higher accuracy than other tools on the original 90× dataset. At 36×, or as low as 40% of the original coverage, CONNET still outperforms partial order alignment-based tools at 90×. CONNET enables accurate consensus at a lower coverage, resulting in lower sequencing costs.
As deep learning-based tools were shown to be more accurate, we further polished the results of all tools using CONNET and medaka in Figure 1B. The performance of CONNET was stable and consistent: after two iterations, CONNET polished all results to over 99.90% accuracy. On the other hand, when a noisy input assembly (of 88.65% accuracy) was provided, medaka was not able to compute a high-quality consensus, as it underperformed the graph-based methods, achieving only 99.37% accuracy. Although all accuracies improved after being polished by medaka, the consensus accuracy of medaka was largely determined by the accuracy of the input draft assembly. When given its own result as input, medaka failed to improve the accuracy further, whereas CONNET was able to continue to improve medaka's results. We also observed that CONNET can benefit from pairing with other consensus tools, as its accuracy improved to 99.94%, whereas the accuracy of medaka was capped at 99.85%.

To demonstrate that CONNET was not overfitted to its training data, we also benchmarked another publicly available E. coli K-12 dataset released in 2015. The dataset is based on an earlier (and now obsolete) R9 sequencing chemistry, and the same dataset has been used in previous studies (Loman et al., 2015; Vaser et al., 2017). We subsampled the dataset into three datasets with similar coverage (54×, 60×, and 60×, respectively) for performance cross-validation. Using the CONNET model trained with one of the subsampled datasets (54×), our results showed that CONNET outperformed all other tools.

Consensus on the Human Genome
To extend our discussion to larger and more complicated genomes, we benchmarked CONNET, Racon, and medaka on data from the Whole Human Genome Sequencing Project (Jain et al., 2018). Our reference genome for evaluation was obtained by applying the known variants of sample NA12878 (Zook et al., 2014) to the human reference genome GRCh37. Table 3 summarizes the performance of different tools on chromosome 1. All tools were benchmarked on a 24-core Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz workstation. Assembly time was excluded from the running time, as all pipelines started from the same miniasm draft assembly. As shown in Table 1, medaka spent most of its time on the four iterations of Racon. It might be asked whether the running time could be optimized by reducing the number of Racon iterations while keeping comparable performance; we therefore benchmarked three versions of the medaka pipeline, with the number of Racon iterations being 4 (the default), 1, and 0, respectively. Other than miniasm + CONNET, we also included a miniasm + Racon + CONNET pipeline in the benchmark, as CONNET can benefit from being paired with other consensus tools. The model of CONNET was trained on chromosome 1, and Figure 4 shows the accuracy on all other autosomes (chromosomes 2 to 22). As shown, CONNET consistently outperformed the others, achieving over 99.80% accuracy on average.

To test the generality of the model that CONNET has learned, we performed a cross-species benchmark by applying the model learned from human chromosome 1 (the same as the one used in Table 3) to the E. coli dataset.

Diploid Consensus on the Human Genome
We noticed that the performance of all tools dropped on the human genome assembly compared with the E. coli assembly. One possible reason is the diploidy of the human genome: existing consensus tools may be confused in diploid regions if they are programmed to output one best consensus sequence. Based on our haploid genome consensus component, we extended CONNET into a phased diploid genome consensus tool with the help of a read-based variant phasing tool, WhatsHap (Martin et al., 2016). We have proposed a new measurement of accuracy (in Methods) for diploid genome assembly.
To determine the extent to which a diploid consensus is better than its haploid counterpart, we used the following statistics. We defined L_h as the aligned length and A_h as the accuracy of a haploid consensus. For a diploid consensus, two haploids are generated; thus, we defined its aligned length L_d as the arithmetic mean of the aligned lengths of the two haploids. Its accuracy A_d was calculated as described in Methods. We assumed all bases are independent, so that the comparison of A_h and A_d can be viewed as a two-proportion z-test. The pooled proportion was calculated as p̂ = (A_h·L_h + A_d·L_d) / (L_h + L_d). The lower the p value, the more significantly the performance of the diploid consensus deviates from the haploid one. Figure 5 shows the accuracy of each contig when diploid genome consensus was applied to human chromosomes 18 and 19. As shown, diploid contigs are more accurate than the corresponding haploid contigs in all but one contig (the second shortest contig in chromosome 19, where the p value is 0.40: diploid consensus got 7 incorrect bases out of 159 bp, whereas haploid consensus got 6 out of 160 bp). In other contigs, the p value can be as low as 1e-300 (the first contig of chromosome 18 has a length of 15.1 Mbp; diploid consensus got 24.9 kbp of incorrect bases and haploid got 33.8 kbp), indicating that diploid consensus is significantly more accurate than haploid consensus. On the two chromosomes we tested, diploid consensus further reduced the total errors in haploid consensus by about 12%.
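As a worked example of this comparison, the sketch below computes the pooled proportion, z statistic, and a two-sided normal p value from error counts and aligned lengths (a standard two-proportion z-test; the function name and argument layout are illustrative, not part of CONNET):

```python
import math

def two_proportion_z(errors_1, length_1, errors_2, length_2):
    """Two-proportion z-test comparing two consensus accuracies."""
    # accuracies = proportions of correctly called bases
    p1 = 1 - errors_1 / length_1
    p2 = 1 - errors_2 / length_2
    # pooled proportion over all bases from both consensuses
    pooled = ((length_1 - errors_1) + (length_2 - errors_2)) / (length_1 + length_2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / length_1 + 1 / length_2))
    z = (p1 - p2) / se
    # two-sided p value from the standard normal tail
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

Applied to the exceptional chromosome 19 contig (7 errors over 159 bp diploid versus 6 over 160 bp haploid), this yields a clearly non-significant p value; the exact number depends on the test convention used, so it need not match the reported 0.40 exactly.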

DISCUSSION
We have described CONNET, a deep-learning based diploid consensus tool. We have shown that CONNET is more accurate than other state-of-the-art consensus tools and is capable of achieving an absolute improvement in accuracy that does not depend on the quality of input draft assembly.
We demonstrate that deep neural network methods have the potential to better solve the consensus problem. Compared with partial order alignment-based methods, deep neural networks are more flexible as they do not rely on heuristic rules to determine the most probable sequence. Deep neural networks are also less vulnerable to systematic bias in sequencing, as it is feasible to train multiple models for various sequencing methods.
CONNET is the first deep learning-based method to make use of spatial relationships in the alignment pileup. A sliding window of size 3, instead of size 1, is used for input tensor construction. Compared with the other deep learning-based method, medaka, CONNET has one fewer BRNN layer yet manages to achieve better accuracy. Our input tensor efficiently captures interesting features in the alignment pileup, which makes them easier for the neural network to learn. Furthermore, our experiments show that CONNET has learnt a model that generalizes across everything except sequencing chemistry: CONNET is agnostic of the algorithm used for generating its input, and the model trained on the human genome is applicable to the simpler E. coli genome. CONNET has achieved the highest consensus accuracy, and it is able to further polish existing results from other assembly or consensus tools. A more accurate genome assembly would be useful for other bioinformatics problems such as variant calling. We have also studied the role that sequencing coverage plays in consensus accuracy: a lower coverage is sufficient for CONNET to maintain the same level of accuracy as other state-of-the-art tools, reducing the cost involved in sequencing.
Currently, CONNET is designed for Nanopore sequencing data. In the future, we may extend its application to sequencing data from other platforms such as PacBio.

Limitations of the Study
We designed CONNET to deliver better performance in consensus accuracy. In future work, we intend to focus not only on accuracy but also on other aspects of genome assembly, including better contiguity.
In our implementation, we chose a sliding window of size 3 for input tensor construction. Ideally, a larger sliding window, which might capture more spatial relationships, would be preferred. However, considering the memory limit of a typical Graphics Processing Unit used for training, a better encoding of the input tensor is required to enable a larger sliding window.

Supplementary Figure 3. Illustration of the input tensors. (a) An example of alignment pileup and its corresponding input tensor at the correction phase. (b) An example of alignment pileup and its corresponding input tensor (partial) at the recovery phase. Spaces were added for better visualization. Insertions are in red. Related to "Transparent Methods - Spatial relationship of alignment pileup".
Supplementary Figure 4. A schematic illustration of the diploid consensus module. The left part of the figure shows the haploid consensus workflow. The "phasing" is achieved by applying WhatsHap to the raw reads and the haploid consensuses at the positions where variants were found. The right part of the figure shows that raw reads are partitioned into two groups by WhatsHap. Consensuses will then be computed from the two groups of reads, respectively. Related to "Transparent Methods -Diploid consensus module".

Workflow
There are three types of consensus errors: mismatch (incorrect nucleotide), insertion (extra nucleotide), and deletion (missing nucleotide). We discovered that errors from existing tools are mainly in the form of deletions. Therefore, we designed a pipeline to reduce deletions in our consensus. We separated our consensus module into two phases. The first phase, correction, aims at correcting the mismatching bases in the assembly and removing extra bases; the second phase, recovery, aims at recovering the bases that are lost during assembly construction. In the recovery phase, we also invented a method to capture spatial relationships in inconsistent regions of the alignment pileup, where there are many discrepancies in the reads.
The workflow is described in Supplementary Figure 1. CONNET is an iterative consensus tool: it takes raw reads and a draft assembly as input, and outputs a polished assembly. Each iteration consists of two phases, correction and recovery, and each phase consists of one alignment step and one consensus step. In both alignment steps, raw reads are aligned to the current draft assembly sequences to provide the pileup information needed in the subsequent consensus step. Both consensus steps make use of neural networks, which take as input the encoded spatial relationship in the alignment pileup. The correction network predicts the nucleotide (A, C, G, T, or X, which denotes a deletion) at each genomic position of the current draft assembly. The recovery network predicts the number of missing bases at each position; lost bases are then recovered using pileup information.
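The structure of one iteration can be sketched as follows (a minimal Python outline; `align`, `correction_phase`, and `recovery_phase` are toy placeholders standing in for the alignment and network steps, not CONNET's actual API):

```python
def align(reads, assembly):
    """Placeholder: align raw reads to the current assembly, return a pileup."""
    return {"reads": reads, "assembly": assembly}

def correction_phase(pileup):
    """Placeholder for the correction network (predicts A/C/G/T/X per position).
    As a toy stand-in, it just drops 'N' characters from the assembly."""
    return pileup["assembly"].replace("N", "")

def recovery_phase(pileup, corrected):
    """Placeholder for the recovery network (predicts missing-base counts).
    The toy stand-in recovers nothing."""
    return corrected

def connet_iteration(reads, draft):
    pileup = align(reads, draft)               # alignment step, correction phase
    corrected = correction_phase(pileup)       # consensus step, correction phase
    pileup2 = align(reads, corrected)          # alignment step, recovery phase
    return recovery_phase(pileup2, corrected)  # consensus step, recovery phase
```

The point of the outline is only the control flow: two phases per iteration, each re-aligning reads to the current sequence before its consensus step.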

Neural network architecture
CONNET adopts a bidirectional recurrent neural network (BRNN) (Schuster and Paliwal, 1997) architecture for the alignment pileup. As assembly sequences come in variable lengths, we chose a recurrent neural network as the starting point. Similar to a time series, genomic sequences have a direction as well. Owing to possible strand bias in the sequencing technology, we may expect the characteristics of forward strands and reverse strands to be non-identical. A BRNN, which processes the input data in both directions, is used to capture strand-specific information.
Our neural network architecture for the correction and the recovery phase is identical except for parameters like output dimension. Supplementary Figure 2 shows settings used in the correction phase. Each genomic position in the draft assembly corresponds to one neuron in the input layer and then corresponds to one node in the BRNN layer. We choose bidirectional long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) as the BRNN layer for its capability of remembering long sequences. The output layer of either phase is designed as one neuron per genomic position. As each node in LSTM corresponds to one output neuron, to make our architecture concise, we directly connect them using a single layer. We chose to use a fully connected layer that captures all information from the previous layer. In order to prevent over-fitting, we inserted a dropout layer after each LSTM node, randomly ignoring some of the hidden units during training.
In the correction phase, output tensor represents the probability distribution of the correct nucleotide, A/C/G/T/X, at each genomic position. In the recovery phase, output tensor represents the probability distribution of the number of lost bases, 0/1/2/3/4/≥ 5, at each genomic position.
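The forward pass of such a per-position classifier can be illustrated with a minimal NumPy sketch. This is not CONNET's implementation: it uses a vanilla bidirectional RNN in place of LSTM, random untrained weights, and omits the dropout layer; the dimensions are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_pass(xs, W, U, b):
    """One directional vanilla-RNN pass; returns one hidden state per position."""
    h = np.zeros(U.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
        states.append(h)
    return np.array(states)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def brnn_classifier(xs, params):
    """Per-position class probabilities over {A, C, G, T, X}."""
    Wf, Uf, bf, Wb, Ub, bb, Wo, bo = params
    fwd = rnn_pass(xs, Wf, Uf, bf)              # forward direction
    bwd = rnn_pass(xs[::-1], Wb, Ub, bb)[::-1]  # backward direction, realigned
    h = np.concatenate([fwd, bwd], axis=1)      # one BRNN node per position
    return softmax(h @ Wo + bo)                 # 5-way softmax per position

# Toy dimensions: 125-dim input (one strand's 3-mer counts), 8 hidden units.
d_in, d_h, n_cls, seq_len = 125, 8, 5, 10
params = (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h),
          rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h),
          rng.normal(size=(2 * d_h, n_cls)), np.zeros(n_cls))
probs = brnn_classifier(rng.normal(size=(seq_len, d_in)), params)
```

Each position's hidden states from both directions are concatenated and fed through a single fully connected layer, mirroring the one-output-neuron-per-position design described above.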
Both phases perform classification tasks. In testing, the class label with the highest probability is predicted. In training, categorical cross-entropy is used as the loss function. The ground truth is obtained by aligning the reference genome to the draft assembly. For each genomic position in the draft assembly, the ground truth of the correction phase is the aligned nucleotide (or "X" in case of a deletion) in the reference genome; the ground truth of the recovery phase is set to 0, except at the places where bases of the reference genome are inserted.

Spatial relationship of alignment pileup
Existing machine learning-based methods (Luo et al., 2019;Luo et al., 2020;Poplin et al., 2018) typically consider columns of alignment pileup independently when constructing the input tensor. The number of occurrences of different nucleotides or probability distribution of nucleotides at a single column is a common approach. This results in a small tensor, and thus little information is encoded in each genomic position. Such representation loses all spatial relationships in the pileup as k-mers cannot be reconstructed from nucleotide counts.
For example, suppose there are 4 A's and 3 C's at a genomic position in the pileup, and 4 G's and 3 T's at the next position. One scenario is that there are 4 reads with "AG" and 3 reads with "CT", which may indicate two read groups coming from different haplotypes of a diploid genome. Another scenario would be 2 reads with "AT", 2 reads with "AG", 2 reads with "CG", and 1 read with "CT", which may indicate misalignment or a high sequencing error rate in the case of a diploid genome. If we use a simple counting method to represent the alignment pileup, our network cannot differentiate between these two scenarios.
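This ambiguity is easy to demonstrate: the two scenarios below have identical per-column base counts but different 2-mer spectra (a small illustrative check, not CONNET code):

```python
from collections import Counter

# Two pileup scenarios over two adjacent columns with identical per-column
# base counts (4 A / 3 C, then 4 G / 3 T) but different read-level structure.
scenario1 = ["AG"] * 4 + ["CT"] * 3                        # two clean haplotypes
scenario2 = ["AT"] * 2 + ["AG"] * 2 + ["CG"] * 2 + ["CT"]  # noisy mixture

def column_counts(reads):
    """Column-wise base counts: the information a size-1 window sees."""
    return [Counter(r[i] for r in reads) for i in range(2)]

def kmer_counts(reads):
    """2-mer counts spanning both columns: what a wider window preserves."""
    return Counter(reads)
```

A column-wise encoder sees the same input in both scenarios, whereas the k-mer counts separate them.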
To our knowledge, we are the first to use a deep learning-based method that considers the spatial relationship in the alignment pileup. We used a sliding window of size 3, instead of size 1, for input tensor construction.
Let A be the alignment of the query reads Q to the reference sequence R. In our case, Q is the raw reads and R is the current draft assembly sequence. For each genomic position i ∈ R, let r_i be the nucleotide in the reference sequence and P_i be the alignment pileup at i. We encode information in P_{i-1}, P_i, and P_{i+1}, instead of only P_i, when constructing the input neuron x_i for each genomic position i. In our design, the network is expected to output a prediction y_i corresponding to each x_i.

The correction phase
The input neuron x_i consists of two 125-dimensional vectors, u_i^+ and u_i^-, for each genomic position i. Each of the 125 values in u_i^+ (resp. u_i^-) corresponds to the count of a particular 3-mer centred at i in the alignment pileup, considering only reads aligned to the forward (resp. reverse) strand (Supplementary Figure 3a). In this way, we manage to encode the spatial information stored in P_{i-1}, P_i, P_{i+1} in the input neuron x_i. Here our alphabet consists of five letters: Σ = {A, C, G, T, X}, where the first four correspond to nucleotides and the last one, "X", represents a gap in the alignment pileup. Therefore, we have a total of |Σ|^3 = 125 possible 3-mers, and each value in u_i^+ (resp. u_i^-) corresponds to one such 3-mer.
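A sketch of this encoding for one strand, assuming a fixed lexicographic ordering of the 125 3-mers (the actual ordering used by CONNET is not specified in the text), might look like:

```python
from itertools import product

ALPHABET = "ACGTX"  # "X" denotes a gap in the alignment pileup

# Hypothetical fixed (lexicographic) ordering of the 5^3 = 125 possible 3-mers.
KMER_INDEX = {"".join(k): i for i, k in enumerate(product(ALPHABET, repeat=3))}

def threemer_vector(pileup_rows, i):
    """125-dim count vector of 3-mers centred at position i, for the rows of
    one strand. pileup_rows: gap-padded read rows (strings over ACGTX)."""
    v = [0] * 125
    for row in pileup_rows:
        kmer = row[i - 1:i + 2]  # window covering positions i-1, i, i+1
        if len(kmer) == 3:
            v[KMER_INDEX[kmer]] += 1
    return v
```

Running the same function over reverse-strand rows would give the second 125-dimensional vector for the position.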
The recovery phase
In addition to u_i^+ and u_i^- defined in the correction phase, the input neuron x_i includes two more 150-dimensional vectors, w_i^+ and w_i^-, for each genomic position i. In the same manner as above, the superscript "+" indicates information from the forward-strand reads and "-" indicates information from the reverse-strand reads. Each value in w_i^+ (resp. w_i^-) corresponds to the count of a particular 3-mer that overlaps with an insertion in the pileup at i, considering only reads aligned to the forward (resp. reverse) strand (Supplementary Figure 3b). This explains |Σ|^3 = 125 out of the 150 values in either vector. The remaining 25 values come from another rule: if there is an insertion in the pileup, a special symbol "$" is appended to the 3-mer centred at i. By our design, the total number of 3-mers ending in "$" equals the total number of reads that contain an insertion in the pileup. Intuitively, the 3-mers ending in "$" help us distinguish the scenario of many reads with short insertions from that of few reads with long insertions.

Diploid consensus module
CONNET relies on phasing information to generate a diploid genome consensus (Supplementary Figure 4). We start from the haploid consensus result from CONNET. We treat the initial draft assembly as a "reference genome" and treat the modifications our haploid consensus introduced to the draft assembly as "variants". In this setting, we have used WhatsHap for phasing the "variants" in order to separate the raw reads into two groups corresponding to two haplotypes. A diploid consensus can be obtained by applying a haploid consensus to each group.

Implementation of trivial consensus
For each column of the pileup, output = argmax_{x ∈ Σ} pileup.count(x), where Σ = {A, C, G, T, X} and "X" denotes a deletion (if "X" wins the vote, no base is emitted). In case the insertion allele frequency (AF) exceeds 20%, the insertion pattern with the highest frequency is inserted into the consensus.
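A minimal implementation of this trivial consensus, under the assumption that per-column bases and insertion patterns have already been extracted from the alignment (the data layout here is illustrative, not the tool's actual interface):

```python
from collections import Counter

def trivial_consensus(columns, insertions, depth):
    """columns: per-position lists of observed bases ("X" marks a deletion).
    insertions: per-position lists of insertion patterns observed in reads.
    depth: read depth, used for the 20% insertion-frequency threshold."""
    out = []
    for bases, ins in zip(columns, insertions):
        winner = Counter(bases).most_common(1)[0][0]  # majority vote incl. "X"
        if winner != "X":                             # "X" wins -> emit no base
            out.append(winner)
        if ins and len(ins) / depth > 0.2:            # insertion AF above 20%
            out.append(Counter(ins).most_common(1)[0][0])
    return "".join(out)
```

For example, a column where "X" wins the vote contributes no base, while a frequent insertion pattern after that column is still emitted.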