Dynamic Pooling Improves Nanopore Base Calling Accuracy

In nanopore sequencing, electrical signal is measured as DNA molecules pass through the sequencing pores. Translating these signals into DNA bases (base calling) is a highly non-trivial task, and its quality has a large impact on the sequencing accuracy. The most successful nanopore base callers to date use convolutional neural networks (CNN) to accomplish the task. Convolutional layers in CNNs are typically composed of filters with constant window size, performing best in analysis of signals with uniform speed. However, the speed of nanopore sequencing varies greatly both within reads and between sequencing runs. Here, we present dynamic pooling, a novel neural network component, which addresses this problem by adaptively adjusting the pooling ratio. To demonstrate the usefulness of dynamic pooling, we developed two base callers: Heron and Osprey. Heron improves the accuracy beyond the experimental high-accuracy base caller Bonito developed by Oxford Nanopore. Osprey is a fast base caller that can compete in accuracy with Guppy in high-accuracy mode, but does not require GPU acceleration and achieves near real-time speed on common desktop CPUs.

Availability: https://github.com/fmfi-compbio/osprey, https://github.com/fmfi-compbio/heron

Keywords: nanopore sequencing, base calling, convolutional neural networks, pooling


Introduction
The MinION by Oxford Nanopore Technologies (ONT) is a portable DNA sequencer, which is capable of producing very long reads [21,5]. The main drawback of this sequencing technology is its high error rate: the median read accuracy barely reaches 96% (Table 3; see also [33]). The source of sequencing errors is the fact that the MinION measures a tiny electrical current influenced by the DNA passing through the pore. This noisy signal is then translated into DNA bases via base caller software.
The sequence of signal readouts can be split into events, each event corresponding to a passage of a single DNA base through the pore. A single event consists of roughly 8-10 signal readouts on average. However, local variations in the speed of DNA passing through the pore cause the event length to vary widely, some events consisting of a single readout, while others spanning tens of readouts (see Figure 1). Moreover, the speed may also vary globally between the runs, making some runs on average slower than others. Segmentation of the signal into events is a difficult problem, and most base callers avoid performing explicit event segmentation.
Recurrent neural networks (RNNs) are designed specifically for processing sequence data, and thus may seem a natural choice for nanopore base calling. However, convolutional neural networks (CNNs) offer several benefits. First, each CNN layer can process the whole sequence in parallel, compared to sequential processing in RNNs; thus, CNNs can achieve high throughput. Second, CNNs offer many possibilities for architecture optimization. Typical RNNs can be tuned mainly by changing the layer type and the number and sizes of layers, and very few successful improvements have been documented in the literature. In contrast, CNNs offer a much larger variety of choices, including skip connections [14,16], normalization [18], activation functions [30], convolution factorization [34,15], etc. A similar shift from RNNs to CNNs can be observed in natural language processing and speech recognition, both of which were initially dominated by RNNs [9,11,29].
A basic building block of CNNs is a convolutional layer, which in the one-dimensional case takes as input a tensor X of dimensions (T, C_in), representing a data stream of length T in which each data point contains C_in values called channels, and outputs a tensor Y of dimensions (T, C_out). The convolution with kernel size D is described by kernel weights W of dimensions (C_out, D, C_in) and a bias vector B of length C_out, both of which are trained. The convolution is computed as follows:

Y_{t,j} = Σ_{0≤d<D, 0≤i<C_in} X_{t+d,i} · W_{j,d,i} + B_j

(to handle boundaries, the input tensor X is first padded with ⌊D/2⌋ zeros at both ends). The length of the output stream can be reduced by applying stride s, where only every s-th element of the output is computed. A simpler version of striding is pooling, where a fixed operation, such as maximum or mean, is applied instead of learned kernel weights on windows of D points with stride s.
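As a concrete illustration, the convolution formula above can be written directly in code. This is a minimal numpy sketch of the standard operation, not the authors' implementation:

```python
import numpy as np

def conv1d(X, W, B, stride=1):
    """Naive 1D convolution matching Y[t,j] = sum_{d,i} X[t+d,i] * W[j,d,i] + B[j].

    X: (T, C_in) input, W: (C_out, D, C_in) kernel, B: (C_out,) bias.
    The input is zero-padded with D // 2 points at both ends; with stride s,
    only every s-th output point is computed.
    """
    T, C_in = X.shape
    C_out, D, _ = W.shape
    Xp = np.pad(X, ((D // 2, D // 2), (0, 0)))
    out = []
    for t in range(0, T, stride):
        window = Xp[t:t + D]                       # (D, C_in)
        out.append(np.einsum('di,jdi->j', window, W) + B)
    return np.stack(out)                           # (ceil(T / stride), C_out)
```

With stride 3 or 5, the output stream shrinks accordingly, which is exactly the subsampling step discussed below for Bonito's first layer.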
To make base calling faster and reduce signal redundancy, CNN base callers often subsample the input signal. For example, the first layer of the Bonito network uses stride of size 3 in version 0.2 and size 5 in version 0.3. Besides increasing the time and memory efficiency, this step allows the neural network layers to increase their effective receptive field (i.e., the part of the signal they see) without increasing kernel sizes of internal components. However, subsampling with a fixed-sized striding step may completely discard very short events, while longer events are still overrepresented. In addition, the widely variable event lengths make base calling hard for CNNs because each convolutional layer considers fixed-size windows. Under these conditions it may be difficult for the network to capture dependencies between signal readouts at specific distances in the DNA, as these have variable distance in the actual signal processed at individual layers of the network. In particular, long events decrease the effective receptive field in terms of the actual DNA context observed, while very short events may suffer from reduced representation after downsampling.
To address these problems and provide convolutional layers with a more stable context width, we introduce a novel component, which we call dynamic pooling. This component can adaptively subsample the signal by variable rate depending on the input. Intuitively, we want the neural network to predict the current speed of the DNA at each position and then downsample signal readouts based on this factor. Such adaptability can enable the network to keep short events, and downsample long events.
Somewhat similar approaches are deformable convolutions [10] and dynamic convolutions [38]. Neither of these approaches changes the number of timesteps in the sequence; instead of applying the same convolution at each timestep, they use a different convolution at every timestep. Deformable convolutions use the same convolution weights but alter the input positions, while dynamic convolutions compute different convolution weights for every timestep.
Thus, these approaches would keep long events unnecessarily long, whereas our approach shortens long events to a reasonable size. In addition, these approaches need custom CUDA kernels to be efficient on current hardware, whereas our approach relies only on simple primitives.
A very straightforward approach to adaptive subsampling would be to predict for each input data point, whether the network should use it or ignore it. This is similar to conditional computation [6] and leads to a non-continuous loss function. Such an approach therefore cannot be trained by standard gradient descent algorithms. In contrast, our dynamic pooling approach is continuous and differentiable almost everywhere.
To demonstrate benefits of dynamic pooling, we apply it to two strong baseline base caller networks. The first (named Heron) is an improved version of the high-accuracy Bonito v0.2 architecture, an experimental CNN base caller from ONT. Our improvements, including dynamic pooling, boost the accuracy by almost 2 percentage points on a rice dataset, decreasing the error rate by approximately 33%.
Secondly, we develop a fast base caller, Osprey, which on a common desktop CPU can process 1.7 million signals per second, nearly matching the sequencing speed of MinION. Osprey has a better accuracy than Guppy 3.4 in high accuracy mode, but unlike Guppy, Osprey can be run efficiently on systems lacking high-performance GPUs.

Methods
In this section, we first describe dynamic pooling, our novel method for subsampling the signal to adapt to its varying speed and to provide subsequent layers with a more stable context within their receptive field. In the second part, we summarize technical details of application of this approach to the base calling problem.

Dynamic Pooling
Dynamic pooling, our approach to adaptive subsampling, works as follows. Given the input sequence x_1, x_2, ..., x_n, we first calculate three new sequences:

• feature vector: f_i = f(x_i) ∈ (0, 1)^C, where C is the number of output channels,
• point importance: w_i = w(x_i) ∈ (0, 1),
• length factor: m_i = m(x_i) ∈ (0, 1).

Each of the functions f, w, m can be any differentiable function, typically a single convolution or a network with several convolutional layers.
The main idea is that we time-warp the signal using length factors m_i representing (fractional) distances of features in the pooling space (Figure 2). Each input point x_i at position i is thus mapped to a new position p_i in the pooling space, computed as the cumulative sum of the length factors up to index i:

p_i = Σ_{k≤i} m_k.

To produce the final output sequence of length p_n, we discretize the fractional signal positions and combine feature vectors that map to nearby positions by weighted nearest-neighbour resampling. That is, to compute the output r_j at position j, we compute a weighted sum of the feature vectors mapped to pooling-space positions between j − 1 and j + 1. The weights are given by the distance of p_i from j, multiplied by the importance w_i. The resulting output vector r_j is thus computed as follows:

r_j = Σ_{i : |p_i − j| < 1} (1 − |p_i − j|) · w_i · f_i.

Note that our dynamic pooling layer is continuous and almost everywhere differentiable through m, w, and f. Our formulation is also a generalization of both mean pooling and striding. To obtain mean pooling with size s, we set w_i = 1/s, and m_i = 1 if i = ks and m_i = 0 otherwise. To obtain striding with step s, we set m_i = w_i = 1 if i = ks and m_i = w_i = 0 otherwise.
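The forward pass above can be sketched in a few lines. This is our numpy reading of the description (in particular, the exact handling of the output length and boundaries is our assumption), not the authors' code:

```python
import numpy as np

def dynamic_pool(f, w, m):
    """Sketch of the dynamic pooling forward pass.

    f: (n, C) feature vectors, w: (n,) point importances, m: (n,) length factors.
    Each input point i maps to position p_i = m_1 + ... + m_i; output r_j sums
    features with |p_i - j| < 1, weighted by (1 - |p_i - j|) * w_i.
    """
    p = np.cumsum(m)                       # fractional positions in pooling space
    n_out = int(np.floor(p[-1])) + 1       # positions 0 .. floor(p_n)
    r = np.zeros((n_out, f.shape[1]))
    for i in range(len(p)):
        lo = max(0, int(np.ceil(p[i] - 1)))
        hi = min(n_out, int(np.floor(p[i] + 1)) + 1)
        for j in range(lo, hi):            # all integer j with |p_i - j| < 1
            r[j] += (1.0 - abs(p[i] - j)) * w[i] * f[i]
    return r
```

Setting all m_i = w_i = 1 recovers the identity mapping (each input point lands exactly on one output position), matching the striding special case with s = 1.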
Our experience indicates that several practical concerns need to be addressed when applying dynamic pooling. First, the average pooling factor can easily diverge during training if no bound is set on its size. Even though we saw a correlation between the output size after pooling and the actual number of sequenced bases at some point of the training, the training often degenerated into an average pooling factor of 1 (i.e., no pooling at all) or a pooling factor so high that the output size was smaller than the number of sequenced bases.
To address this issue, as well as to roughly fix the target pooling factor (and thus the computational budget), we borrow an idea from batch normalization [18]. Ideally, we would like the average pooling factor over the whole dataset to be equal to some fixed constant S. Since this is hard to achieve, we approximate it by keeping the average pooling factor fixed for each batch during training. We first calculate length factors m_i^{(s)} for each sample s and sequence point i, then we calculate the average length factor for the batch, M = (1/(bn)) Σ_{s,i} m_i^{(s)} (where b is the batch size and n the sequence length), and finally, we renormalize each length factor as m̂_i^{(s)} = m_i^{(s)} / (S · M). It is possible for the renormalized length factor to be greater than one, but we have not found any issues with that. Note that the pooling factor within each sequence may vary, but the average pooling factor per batch is fixed. We also keep an exponential moving average of the normalization factors S · M, and use it during inference, similarly to batch normalization.
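The renormalization step is small enough to state in code. This sketch reflects our reading of the scheme (the exact formula is reconstructed from the surrounding text; a batch-average length factor of 1/S corresponds to an average pooling factor of S):

```python
import numpy as np

def renormalize_lengths(m_batch, S):
    """Renormalize length factors so that the batch-average pooling factor is S.

    m_batch: (b, n) length factors for a batch of b sequences of length n.
    Returns the renormalized factors and the normalization factor S * M, whose
    exponential moving average would replace M at inference time.
    """
    M = m_batch.mean()          # M = (1 / (b * n)) * sum over all m_i^(s)
    norm = S * M
    return m_batch / norm, norm
```

After renormalization, the mean length factor of the batch is exactly 1/S, so on average S input points map onto one output point, regardless of how the factors vary within individual sequences.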
We also saw issues with divergence during training, caused by very high gradients in the length factors. Note that since p is a prefix sum, the gradient at point j is the sum of the gradients at positions between j and n:

∂L/∂m_j = Σ_{k=j}^{n} ∂L/∂p_k.

This means that any noise in the gradients accumulates: all of it concentrates in ∂L/∂m_1, which therefore has a very high variance. Note also that the variance of the gradient differs for every ∂L/∂m_j. To fix these issues, we limit the gradient to pass through only 20 elements, approximating

∂L/∂m_j ≈ Σ_{k=j}^{j+20} ∂L/∂p_k.
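One way to implement this truncation with standard autodiff primitives is a stop-gradient construction whose forward value is the plain prefix sum, while the backward pass only sees a sliding window of terms. The construction below is our suggestion, not necessarily the authors' implementation; `stopgrad` stands in for `detach()`/`stop_gradient` and is the identity in plain numpy:

```python
import numpy as np

def stopgrad(x):
    # stand-in for x.detach() / stop_gradient in an autodiff framework;
    # in plain numpy it is just the identity
    return x

def truncated_cumsum(m, window=20):
    """Prefix sum whose gradient would flow only through the last `window` terms.

    The forward value equals np.cumsum(m). Under autodiff, only the final
    differentiable `wsum` term carries gradient, so dL/dm_j becomes the sum of
    dL/dp_k for k in [j, j + window - 1], approximating the truncation above.
    """
    cs = np.cumsum(m)
    if len(m) > window:
        cs_shift = np.concatenate([np.zeros(window), cs[:-window]])
    else:
        cs_shift = np.zeros_like(cs)
    wsum = cs - cs_shift   # sum of the last `window` entries ending at each position
    # forward: stopgrad(cs) - stopgrad(wsum) + wsum == cs exactly
    return stopgrad(cs) - stopgrad(wsum) + wsum
```

Because d(wsum_k)/d(m_j) is 1 exactly when k − window < j ≤ k, the gradient reaching m_j is the windowed sum described in the text, while the forward computation is unchanged.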

Application to Base Calling
We developed two base callers, which use dynamic pooling to improve the speed vs. accuracy tradeoff. Heron is a high-accuracy base caller, while Osprey is a base caller that runs in real time. Both base callers use deep convolutional neural networks inspired by Bonito v0.2, which is in turn based on the Quartznet speech recognition architecture [20]. The architecture overview is shown in Figures 3 (Heron) and 4 (Osprey); in Figure 4, the block with the gradient background is the one replaced by dynamic pooling. The basic building unit of these networks is the convolutional layer, described in the introduction. The networks use several special cases of convolutions. A pointwise convolution is a convolution with kernel size one (there is no interaction between neighbouring time points). A depthwise convolution is a convolution with no interaction between channels (the relevant parts of the weight tensor are diagonal matrices); thus, a depthwise convolution needs fewer parameters and fewer multiplications to compute. A depthwise convolution followed by a pointwise convolution is called a separable convolution. Convolutional layers are organized into blocks. One block consists of a repeated application of a convolutional layer, batch normalization [18], and a non-linearity. To allow smooth flow of the gradient, blocks also employ skip connections, where the input of the block is processed by a single pointwise convolution followed by batch normalization and is then added to the output.
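The depthwise/pointwise factorization can be made concrete with a small numpy sketch (illustrative only; the weight shapes are assumptions chosen to match the definitions above):

```python
import numpy as np

def depthwise_conv1d(X, Wd):
    """Depthwise 1D convolution: no mixing across channels.

    X: (T, C), Wd: (C, D) -- one length-D kernel per channel,
    with zero padding of D // 2 at both ends.
    """
    T, C = X.shape
    D = Wd.shape[1]
    Xp = np.pad(X, ((D // 2, D // 2), (0, 0)))
    Y = np.zeros_like(X)
    for t in range(T):
        Y[t] = np.einsum('dc,cd->c', Xp[t:t + D], Wd)
    return Y

def separable_conv1d(X, Wd, Wp):
    """Separable convolution: depthwise (C*D weights) followed by pointwise
    (C_in*C_out weights), instead of C_out*D*C_in weights for a full conv."""
    return depthwise_conv1d(X, Wd) @ Wp   # Wp: (C_in, C_out)
```

For C = C_in = C_out = 256 and D = 15, the full convolution needs roughly 983k weights, while the separable version needs about 69k, which is where the speed advantage comes from.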
In the rest of this section, we describe several improvements added to our networks; further details and ablation results can be found in the Supplementary material.
Space-to-depth compression in blocks (both Heron and Osprey). Peresini et al. [28] have recently introduced this technique, where in each block, we compress the number of sequence points by a factor of three and enlarge the number of channels by a factor of two. This increases the number of trainable parameters, while only modestly increasing the total number of operations and the running time. In Osprey, we limit the number of operations during compression and decompression by replacing the original convolution/transposed convolution with mean pooling during compression and three smaller convolutions during decompression.
Gated linear units (GLU) (Heron). We change the original Swish nonlinearity [30] to GLU [11]. GLU first splits the channels into two halves (x_1, x_2) and then computes g(x_1, x_2) = x_1 · σ(x_2). Note that Swish is a special case of the GLU non-linearity with x_1 = x_2. Even though GLU comes at the cost of doubling the computation budget, in our context it works much better than increasing the number of channels in the network.
Variable receptive fields in depthwise convolutions (Osprey). Increasing the receptive field of convolution is often beneficial, but it comes at an increased computational cost. As a tradeoff, we split the channels into four groups in ratio 2:2:1:1 and apply a different receptive field in each group (3, 7, 15, and 31, respectively).
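The channel split can be sketched as follows. This is an illustrative numpy version (the learned depthwise kernels are replaced by a simple moving average of the matching width, which is our simplification, not the trained operation):

```python
import numpy as np

def multi_receptive_depthwise(X, kernels=(3, 7, 15, 31), ratio=(2, 2, 1, 1)):
    """Apply a different receptive field to each channel group.

    X: (T, C). Channels are split in the given ratio; as a stand-in for a
    learned depthwise kernel, each group gets a centered moving average of
    its own width (zero padding at the boundaries).
    """
    T, C = X.shape
    unit = C // sum(ratio)
    sizes = [r * unit for r in ratio]
    sizes[-1] += C - sum(sizes)            # absorb any remainder channels
    out, start = [], 0
    for width, D in zip(sizes, kernels):
        grp = X[:, start:start + width]
        Xp = np.pad(grp, ((D // 2, D // 2), (0, 0)))
        out.append(np.stack([Xp[t:t + D].mean(axis=0) for t in range(T)]))
        start += width
    return np.concatenate(out, axis=1)
```

Only one sixth of the channels pay for the expensive width-31 kernel, while the cheap width-3 kernel covers a third of them, which is the cost/receptive-field tradeoff described above.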
Recurrent neural aligner as the output layer (Heron and Osprey). Existing base callers (such as Chiron, Deepnano-blitz, and Bonito 0.2) produce an intermediate output with a simple distribution over five possibilities (A, C, G, T, or blank) for each sequence point. The subsequent connectionist temporal classification layer [13] produces the final sequence without blanks, selecting the one with the highest likelihood under the intermediate probability distribution. However, the intermediate distribution assumes that point predictions are independent of each other. The recurrent neural aligner [31] introduces dependencies between predictions by propagating state information. We describe this technique and its application to base calling in more detail in the Supplementary material, as we believe it may also be useful in other bioinformatics contexts.
Details of dynamic pooling (Heron and Osprey). We replace one of the strided convolutions with a dynamic pooling module (the exact application differs in Heron and Osprey, see Figures 4 and 3). We calculate feature vector f i using a one-layer convolution with the same number of channels and kernel as in the layer replaced by dynamic pooling (we set striding to one). For calculating moves and weights, we use a three-layer convolutional network in Heron and a simple one-layer pointwise convolution in Osprey. We use a deeper network in Heron, since we are operating on the raw signal.

Datasets
For training and testing of our base callers, we use a mix of samples from various datasets listed in Tables 1 and 2.

Experiments
We evaluate our base callers on several testing data sets. Each base called read is aligned to the corresponding reference by minimap2 [22], and the accuracy is computed as the fraction of correctly called bases in the aligned portion of the read. The median accuracy over the whole testing set is reported in the corresponding table. In the case of our base callers, the testing data sets do not overlap with the training data sets; this cannot be guaranteed for other base callers.

High Accuracy Results
Heron is designed as a high accuracy base caller. As such, we compare its results to the series of experimental high accuracy base callers by Oxford Nanopore Technologies (Bonito 0.2, Bonito 0.3, and Bonito 0.3.1). Note that Bonito typically has a higher accuracy than the standard production base caller, Guppy in the high accuracy mode. Table 3 and Figure 5 show that Heron is faster than Bonito and achieves comparable accuracy, sometimes outperforming Bonito significantly. On the rice dataset, the error rate is reduced by 40% compared to Bonito 0.3.1.

Fast Base Calling Results
Fast base callers are important for live base calling, which is necessary for applications such as run monitoring and selective sequencing. To achieve live base calling for MinION, one has to process approximately 1.5 million signals per second. Osprey has been designed to achieve this goal on a common four-core desktop CPU. Note that the only other base caller that can achieve the same goal on a CPU is Deepnano-blitz [7].

Table 4: Median read accuracy of fast base callers. The speed is measured on a desktop CPU i7-7700K using four cores. The speed with dynamic pooling varies due to changes in the average pooling factor.

Table 4 shows that Osprey is both more accurate and faster than the fast version of Guppy and is even comparable in accuracy to a one-year-old high-accuracy version of Guppy.
We have also investigated the high performance of Heron and Osprey on the rice dataset. We noticed that in this dataset, the speed of DNA passing through the pore is slightly slower (Figure 7). We speculate that our new methods, coupled with a more diverse training set (see the training set composition figure), help our tools outperform ONT base callers in such cases. Figure 7 shows that the dynamic pooling scheme indeed helps the network adapt to the varying sequencing speeds present in our data sets. In particular, there is a strong correlation between the average pooling factor and the observed DNA speed (computed as the ratio of the aligned sequence length to the sequencing time in seconds), with r² ranging from 0.41 for the mouse dataset to 0.82 for the zymoset and Klebsiella datasets. Figure 8 illustrates that dynamic pooling can compensate in part for changes in sequencing speed within a read; Figure 9 illustrates that long events can be shortened and short events lengthened.

Dynamic Pooling Effects
Overall we see that dynamic pooling results in a much more uniform number of signal steps per called base.

Conclusions and Future Work
We presented dynamic pooling, a novel component for neural networks, and applied it to the problem of nanopore base calling. Dynamic pooling allows the network to better accommodate speed variation in the input signal and thus improve accuracy without increasing computational costs.
To demonstrate the effectiveness of dynamic pooling, we have developed two base callers for nanopore data. Osprey is a fast base caller, which allows near real-time base calling without the need for a GPU, while keeping the accuracy high enough for most sequence analysis workflows: its accuracy exceeds that of the standard Guppy high-accuracy base caller from a year ago (which is not real-time without a powerful GPU). Real-time base calling enables sequencing run monitoring and selective sequencing [27].
The second base caller, Heron, pushes the boundary of base calling quality. Heron can deliver consistently high accuracy on a variety of datasets, in some cases improving on the most accurate base caller to date.
While we developed dynamic pooling with nanopore base calling in mind, it has a potential to improve various other sequence related tasks, where the input signal has variable density of information. Example domains include speech recognition and video stream processing. Another intriguing goal is to extend dynamic pooling to multiple dimensions to be used in the context of image processing. However, this is not a straightforward task, since calculating input point positions in multiple dimensions is much more complicated than the prefix sums.
Acknowledgements. This research was supported by grants from the Slovak Research Grant Agency VEGA 1/0458/18 (TV) and 1/0463/20 (BB), Slovak Research and Development Agency APVV-18-0239, Horizon 2020 research and innovation programme No 872539 (PANGAIA), integrated infrastructure programme ITMS2014:313011ATL7, and by an academic research grant from Google Cloud Platform (VB). We also gratefully acknowledge a hardware grant from NVIDIA.

S1 Depth-to-Space Compression
In our previous work [28], we devised an alternative block for neural networks, which trades a lower number of sequence points for a higher number of channels. This is achieved by compressing the signal with a strided convolution at the start of a convolutional block and decompressing it via a strided transposed convolution at the end of the block. In our work, we increase the number of channels by a factor of two and compress the sequence length by a factor of three. This substantially increases the number of parameters (by a factor of 2), while increasing the number of multiplications only modestly (by a factor of 4/3) and, surprisingly, decreasing the actual computation time. Our design in [28] was specifically tailored for inference on a rather restrictive neural network accelerator, but we found that its benefits are also visible on a GPU after small adjustments. Specifically, because depthwise convolutions are faster on GPUs, we keep the first depthwise convolution after compression, which we originally omitted from the block.
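The core of the compression step is a reshape followed by a projection; a strided convolution with kernel 3 and stride 3 computes exactly this. A minimal numpy sketch (weight shapes are our illustrative choice):

```python
import numpy as np

def compress(X, W):
    """Space-to-depth compression: fold 3 consecutive points into the channel
    dimension, then project 3*C -> 2*C channels.

    X: (T, C) with T divisible by 3; W: (3*C, 2*C). Equivalent to a
    convolution with kernel size 3 and stride 3.
    """
    T, C = X.shape
    assert T % 3 == 0
    folded = X.reshape(T // 3, 3 * C)      # (T/3, 3C)
    return folded @ W                      # (T/3, 2C)

def decompress(Y, W):
    """Inverse direction: project 2*C -> 3*C channels and unfold back to
    length T. Equivalent to a transposed convolution with kernel 3, stride 3."""
    Tc = Y.shape[0]
    out = Y @ W                            # W: (2*C, 3*C)
    C = out.shape[1] // 3
    return out.reshape(Tc * 3, C)
```

Inner layers of the block then operate on sequences a third as long with twice as many channels, which is why the parameter count roughly doubles while the multiplication count grows only by about 4/3.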
In the Heron high-accuracy network, we also introduce a novel "channel-cross-shift" operation before the compression, which is designed to increase communication between compressed parts of the sequence. We split the channels in each sequence point into two halves A and B, and shift A by one point to the right and B by one point to the left, with empty slots padded with zeros (Figure S10). In practice, while the cross-shift does not seem to significantly improve the overall accuracy after convergence, it improves the training curve (Figure S13).
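The cross-shift itself is a few lines of array manipulation; a numpy sketch of the operation as described above:

```python
import numpy as np

def channel_cross_shift(X):
    """Channel-cross-shift: split channels into halves A and B, shift A one
    point to the right and B one point to the left, zero-padding the slots
    that become empty at the sequence boundaries.

    X: (T, C) with C even.
    """
    T, C = X.shape
    A, B = X[:, :C // 2], X[:, C // 2:]
    A_shift = np.vstack([np.zeros((1, C // 2)), A[:-1]])   # shift right by one
    B_shift = np.vstack([B[1:], np.zeros((1, C - C // 2))])  # shift left by one
    return np.concatenate([A_shift, B_shift], axis=1)
```

After the subsequent factor-3 compression, each compressed point thus sees one channel-half from each neighbouring group, which is the extra communication the operation is meant to provide.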
The number of operations in compression and decompression is still high for the fast CPU implementation of Osprey (both operations are equivalent to two pointwise convolutions per sequence point). Surprisingly, we can simplify compression to average pooling followed by a pointwise convolution, which decreases the number of operations during compression by two thirds and does not hurt the accuracy much (when keeping the same number of floating point operations, this design is actually slightly better).

Figure S10: Visualization of the cross-shift operation followed by compression.

Simplifying decompression is not so easy. Here we opt for a rather arbitrary design where we split the channels into overlapping groups, and each group is fed into one pointwise convolution, which produces one third of the outputs.

S2 Gated linear units
Many modern convolutional neural networks, including Bonito v0.2, use the Swish activation function f(x) = x·σ(x) [30]. An interesting extension, which comes at the cost of doubling the amount of computation, is to use a gated linear unit (GLU) [11], which first splits the channels into two halves (x_1, x_2) and then applies the transformation g(x_1, x_2) = x_1 · σ(x_2). Note that Swish is a special case of GLU with x_1 = x_2.
We use GLUs in the Heron high-accuracy network, doubling the output size of all pointwise transformations to compensate for GLUs cutting the number of channels in half. A FLOPs equivalent of this new GLU architecture would be the original Swish architecture with the number of filters in the whole network increased by a factor of 1.41. Interestingly, when we tried to train such a large Swish network, it did not bring any increase in accuracy (see Section S5). We therefore think that either the GLU network benefits from the more complex channel interactions, or the large Swish network fails to train properly. Also note that the introduction of GLUs does not lead to the expected doubling of the running time on a GPU, perhaps because more time is spent in depthwise convolutions, and increasing the size of pointwise convolutions does not hurt the speed that much.
Finally, we observed that the standard initialization from PyTorch did not work for our network. Instead, following the methodology from [26], chapter 6.4.1, we devised a custom initialization scheme. We split all weight matrices before a GLU nonlinearity in half (in the output dimension), initialize only the first half of the matrix randomly, and set the second half to the same values as the first half. Due to this initialization, the initial state of the network mimics a network using the Swish nonlinearity.
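The mirrored initialization can be sketched as follows (numpy; the scale factor is an illustrative assumption, not the paper's exact choice):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def glu(x):
    """Gated linear unit: split channels into halves (x1, x2), return x1 * sigmoid(x2)."""
    x1, x2 = np.split(x, 2, axis=-1)
    return x1 * sigmoid(x2)

def mirrored_glu_init(c_in, c_out, rng):
    """Mirrored initialization for a matrix feeding a GLU: the matrix has
    2*c_out outputs; draw the first half randomly and copy it into the second
    half, so that x1 == x2 at initialization and the GLU reduces to Swish."""
    half = rng.normal(scale=1.0 / np.sqrt(c_in), size=(c_in, c_out))
    return np.concatenate([half, half], axis=1)   # (c_in, 2*c_out)
```

At initialization the two halves produce identical pre-activations, so the layer computes exactly a·σ(a), i.e., Swish; training is then free to let the gate half diverge from the value half.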

S3 Recurrent Neural Aligner
The last stage of base calling, called the decoder, is responsible for converting the sequence of high-level features predicted by the CNN into the final DNA sequence. Because typically multiple signal points correspond to the same DNA base, many base callers (e.g., Deepnano-blitz, Chiron, Bonito v0.2) use the connectionist temporal classification (CTC) decoder [13]. The network outputs a probability distribution over the five-value alphabet Σ_ε = {A, C, G, T, ε} at each signal point. This defines the probability p_Y of observing a sequence Y = y_1, ..., y_n over the alphabet Σ_ε as the product of the point probabilities of the individual characters y_i.
Let R(Y) be a function that reduces such a sequence Y to the final output by omitting blank symbols ε (this function can alternatively be defined to perform additional operations, such as replacing runs of identical symbols by a single symbol). Note that several different sequences Y may yield the same reduced sequence Z. The CTC decoder aims to find the output Z with the highest overall probability:

arg max_Z Σ_{Y : R(Y)=Z} p_Y.

Such a layer is differentiable and can be trained by standard algorithms based on gradient descent. However, inference of the best sequence Z in a trained network is hard, and consequently a beam search heuristic is typically used.
The underlying probabilistic model of CTC assumes that point predictions y i are independent of each other. This simplifying assumption enables further applications such as pair-decoding [33]. However, a more complex model with dependencies between positions may simplify the necessary computations within the main CNN. For this purpose we use a recurrent neural aligner framework [31].
In a recurrent neural aligner, the probability distribution at sequence point i depends not only on the input but also on the previous predictions (Figure S11). More precisely, the probability q_Y of Y = y_1, ..., y_n over the alphabet Σ_ε is the product of the conditional distributions Pr(y_i | h_i, R(y_1, ..., y_{i−1})), where h_i is the i-th output of the CNN. The conditional probability distribution Pr(y_i | h_i, R(y_1, ..., y_{i−1})) for all five values y_i ∈ Σ_ε is computed by some mapping Q(h_i, z_1, ..., z_j), which takes as input the vector h_i and the string z_1, ..., z_j = R(y_1, ..., y_{i−1}). This mapping can be an arbitrary differentiable function producing valid probability distributions.

Figure S11: To sample a sequence y_1, ..., y_n of intermediate predictions from the alphabet Σ_ε, in CTC each y_i would be generated independently (top), while in the recurrent neural aligner, the probability of y_i would depend on the previous non-blank symbols (bottom). Note that the actual prediction of the network is a sequence of non-blank symbols with the highest probability, so sampling is not used in the training and inference algorithms.
As in CTC, the probability of the output sequence Z is simply Pr(Z | h_1, ..., h_n) = Σ_{Y : R(Y)=Z} q_Y. The model can then be trained by backpropagation with a log-likelihood loss. To compute the likelihood of a known ground-truth output Z = z_1, z_2, ..., z_m, we first compute the conditional probability distributions Q(h_i, z_1, ..., z_j) for every pair i, j (i ≥ j). The final likelihood of Z can then be computed by dynamic programming using the forward algorithm. During inference, we use a beam search where we keep the top k candidates for each prefix of the predicted sequence. Compared to CTC, where beam sizes of 10 are usually sufficient, we need to use a much higher beam size with recurrent neural aligners (we use k = 50).
Finally, we describe the design of the conditional probability distribution Q(h_i, z_1, ..., z_j). We split its calculation into two parts: we first embed the previous non-blank predictions z_1, ..., z_j into a fixed-size vector via a prediction encoder G, and then mix h_i with the output of G via a function H, so that Q(h_i, z_1, ..., z_j) = H(h_i, G(z_1, ..., z_j)).
We tested several variants for the design of the functions G and H. One simple option is to consider only the last k predictions z_{j−k+1}, ..., z_j; in our work, we use k = 6. Using just the last several predictions is motivated by the fact that the signal current depends mostly on the small context around the current base in the pore. Then G can be a table containing, for each of the 4^k input strings, a small matrix W of trained weights and a bias vector b. The function H performs a vector-matrix multiplication between h_i and the result of G, computing W·h_i + b. The resulting vector of length five is transformed by the softmax function to produce a proper probability distribution.
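The table-based variant can be sketched compactly. This is a hypothetical minimal implementation of the idea (class and parameter names are ours; handling of contexts shorter than k, which would need padding, is omitted):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

class ContextTableQ:
    """Table-based conditional distribution Q(h, z_{j-k+1}, ..., z_j).

    For each of the 4**k possible k-mer contexts, the table stores a
    (5, h_dim) matrix W and a length-5 bias b; the output is
    softmax(W @ h + b) over {A, C, G, T, blank}.
    """
    def __init__(self, k, h_dim, rng):
        self.k = k
        self.W = rng.normal(scale=0.1, size=(4 ** k, 5, h_dim))
        self.b = rng.normal(scale=0.1, size=(4 ** k, 5))

    def __call__(self, h, context):
        # context: sequence of base indices in 0..3; use the last k of them,
        # encoded as a base-4 number indexing into the table
        idx = 0
        for z in context[-self.k:]:
            idx = idx * 4 + z
        return softmax(self.W[idx] @ h + self.b[idx])
```

For k = 6 the table has 4^6 = 4096 entries, small enough to keep in memory while making each decoding step a single matrix-vector product.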
Another option is to use an RNN for G, which encodes the sequence of bases into a fixed-size vector at each position, and then use, e.g., H(h, g) = W_2 (h ∗ ReLU(W_1 g + b_1)) + b_2. We found that this design does not lead to a significant increase in base calling accuracy and makes decoding much slower. Also, if the dimension of h is small, training is sometimes sensitive to the random initialization.

S4 Training methodology
We train all of our high-accuracy networks with the AdamW algorithm [24]. We use a cyclic cosine learning rate scheduler with warm restarts (Figure S12) [23]. We anneal the learning rate from a maximum of 1e-3 toward zero, and we exponentially increase the length of the cycles by a factor of 2 for each successive cycle. This results in multiple possible stopping points for training, at each of which the network has converged with a small learning rate. We used a fixed budget of 7 cycles (127 multiples of the first cycle in total). The first cycle takes 8000 batches. We start training with a quick warmup period of 1000 batches.
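The schedule above can be sketched as a pure function of the batch index. The linear warmup shape is our assumption; the rest follows the description (maximum 1e-3, first cycle 8000 batches, each cycle twice as long as the previous one):

```python
import math

def cyclic_cosine_lr(step, lr_max=1e-3, first_cycle=8000, warmup=1000, mult=2):
    """Cyclic cosine learning rate with warm restarts and doubling cycles.

    Linear warmup for `warmup` batches, then within each cycle the rate
    anneals from lr_max toward zero as 0.5 * (1 + cos(pi * t / cycle_len)),
    restarting at lr_max when the cycle ends.
    """
    if step < warmup:
        return lr_max * step / warmup
    step -= warmup
    cycle_len = first_cycle
    while step >= cycle_len:        # find the current cycle
        step -= cycle_len
        cycle_len *= mult
    return lr_max * 0.5 * (1.0 + math.cos(math.pi * step / cycle_len))
```

With 7 cycles the total length is (2^7 − 1) = 127 multiples of the first cycle, matching the fixed budget stated above.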
Based on the convergence plot (Figure S13), we can see that our models have almost converged, but there is still some room for further optimization. However, the last cycle of training took around 4-6 days (depending on the network) on 2-4 GPUs, and we decided that doubling that computational budget was not worth the small increase in performance.
We use the same setup for fast models, but we increase the batch size by a factor of 5 so that we can utilize the GPUs more efficiently. We also use 5 times fewer steps in each cycle and a 5 times higher learning rate.

S5 Ablation Results
We also measured the effect of the individual improvements in the Heron base caller. We started by retraining the Bonito 0.2 architecture on our training dataset with our optimizer settings and iteratively added improvements to it. Table S5 shows that each improvement individually brings only a modest gain, but together the gain is quite substantial. Surprisingly, increasing the number of channels (to match the FLOPs equivalent of enlarging the network with the GLU activation) leads to a degradation of performance. This was not due to overfitting, but rather due to worse training performance; we hypothesize that this was caused by bad initialization of the network.

Figure S13: Convergence of various models over time.