Turbo autoencoders for the DNA data storage channel with Autoturbo-DNA

Summary
DNA, with its high storage density and long-term stability, is a potential candidate for a next-generation storage medium. The DNA data storage channel, composed of synthesis, amplification, storage, and sequencing, exhibits error probabilities and error profiles specific to the components of the channel. Here, we present Autoturbo-DNA, a PyTorch framework for training error-correcting, overcomplete autoencoders specifically tailored to the DNA data storage channel. It allows training different architecture combinations and using a wide variety of channel component models for noise generation during training. It further supports training the encoder to generate DNA sequences that adhere to user-defined constraints. Autoturbo-DNA exhibits error-correction capabilities close to non-neural-network state-of-the-art error-correction and constrained codes for DNA data storage. Our results indicate that neural-network-based codes can be a viable alternative to traditionally designed codes for the DNA data storage channel.


Figure 1:
Figure 1: Example of generating a configuration file that can be used to train models with Autoturbo-DNA, related to Figure 1. Top: the MESA interface for designing a new rule, showing the ability to name the rule (here "Autoturbo-DNA test"), to define the raw error rate, the distribution of errors between deletions, insertions, and substitutions/mismatches, and the distribution and positions for each error type. For example, an insertion occurs at a random position 80% of the time and within a homopolymer 20% of the time, and the inserted base is one of the four bases with equal probability. Bottom: the simulation interface of MESA, with the new rule chosen as the sequencing method. Near the bottom of the image, the "Download current config" button allows downloading a JSON file containing all parameters. The JSON file data can then be used for training by adding it to the corresponding config/error sources file of Autoturbo-DNA. The new error profile can then be selected by its ID using the related hyperparameter (either -sequencing, -synthesis, -pcr, or -storage).
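The workflow above (download the MESA JSON, add it to the error-sources config file, then select it by ID) can be sketched in Python. The file layout and the function name are assumptions for illustration, not the exact Autoturbo-DNA format:

```python
import json
from pathlib import Path

def add_error_profile(mesa_json_path, config_path, profile_id):
    """Append a MESA-exported error profile to an error-sources config
    file so it can later be selected by its ID (e.g. via -sequencing).

    The keyed-by-ID JSON layout is illustrative; consult the actual
    config/error sources file of Autoturbo-DNA for the real structure.
    """
    profile = json.loads(Path(mesa_json_path).read_text())
    cfg_file = Path(config_path)
    config = json.loads(cfg_file.read_text()) if cfg_file.exists() else {}
    config[str(profile_id)] = profile        # register profile under its ID
    cfg_file.write_text(json.dumps(config, indent=2))
    return config
```

After the profile is registered, its ID would be passed on the command line through the related hyperparameter (for a sequencing profile, the -sequencing flag).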

Figure 2:
Figure 2: Boxplot of the accuracy of the trained models, separated by the encoder, transcoder, or decoder architecture used, related to Figures 1 to 3. A red line represents the median, a green triangle represents the mean, and outliers are represented by green dots.

Figure 3:
Figure 3: Stability score over 1000 epochs as a 10-epoch rolling average for a latent redundancy of 2 bits (left), 4 bits (middle), and 8 bits (right), related to Figure 7. On the top, the models were trained using a block size of 3 · 8; on the bottom, training was carried out with a block size of 3 · 16. The legend labels are structured in the form constraint adherence training, encoder, decoder, transcoder, latent redundancy, block size.
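The 10-epoch rolling average used to smooth the per-epoch stability score can be computed as a simple trailing mean. This is a generic sketch of the smoothing, not code taken from the framework:

```python
def rolling_average(values, window=10):
    """Trailing rolling average over `window` epochs.

    For the first window-1 epochs, the average is taken over the
    epochs seen so far, so the output has the same length as the input.
    """
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```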

Figure 4: Figure 5:
Figure 4: Boxplots of the reconstruction accuracy score (left) and the stability score (right) of models trained without (left) and with (right) the stability score as a training metric, related to Figure 7. The models were further trained with either 2, 4, or 8 bits of latent redundancy and a block size of either 3 · 8 or 3 · 16 bits. A red line represents the median, a green triangle represents the mean, and outliers are represented by green dots.

Figure 6:
Figure 6: Boxplot of the reconstruction accuracy score of models trained before (left) and after (right) fine-tuning utilizing the stability score as an additional training metric, related to Figure 7. The models were further trained with either 2, 4, or 8 bits of latent redundancy and a block size of either 3 · 8 or 3 · 16 bits. A red line represents the median, a green triangle represents the mean, and outliers are represented by green dots.

1 Configuration parameters
-dec-optimizer (str, default: adam): Choose which optimizer to use for the decoder: Adam, SGD, or Adagrad.
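The string-to-optimizer dispatch behind such a flag can be sketched as follows. In the actual PyTorch framework the names would map to torch.optim.Adam, torch.optim.SGD, and torch.optim.Adagrad; this standalone sketch only resolves and validates the name:

```python
# Illustrative registry for the -dec-optimizer flag values; in
# Autoturbo-DNA itself these would map to torch.optim classes.
OPTIMIZERS = {"adam": "Adam", "sgd": "SGD", "adagrad": "Adagrad"}

def resolve_optimizer(name):
    """Map a (case-insensitive) flag value to a supported optimizer,
    rejecting unsupported values early with a clear error."""
    key = name.lower()
    if key not in OPTIMIZERS:
        raise ValueError(f"unsupported optimizer: {name!r}")
    return OPTIMIZERS[key]
```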

Table 1:
Supported parameters of Autoturbo-DNA, related to Figure 1.
cnn: Using a CNN as the base neural network.
cnn_kernel_inc: CNN with an increasing kernel size; the first pass is through a CNN with half the kernel size of the parameter value, the second with the full kernel size.
resnet_1d: One-dimensional residual neural network.

Table 5 :
Out-of-the-box supported error rates for the channel simulation, related to Figure 1

Table 6 :
Hyperparameters used in the evaluations that deviate from the default Autoturbo-DNA hyperparameters, related to Figures 4 to 7

Table 7 :
Slopes of the last 200 epochs for the evaluated models using a block size of 3 · 8, related to Figure 5. The naming convention is encoder, decoder, transcoder, latent redundancy, block length.

Table 8 :
Slopes of the last 200 epochs for the evaluated models using a block size of 3 · 16, related to Figure 6. The naming convention is encoder, decoder, transcoder, latent redundancy, block length.
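A slope over the last 200 epochs of a metric series can be obtained with an ordinary least-squares fit; a slope near zero indicates that training has plateaued. The following is an illustrative sketch of such a computation, not the exact evaluation code behind Tables 7 and 8:

```python
def trailing_slope(scores, window=200):
    """Ordinary least-squares slope over the last `window` entries
    of a per-epoch metric series, used as a convergence indicator."""
    ys = scores[-window:]
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form OLS slope: cov(x, y) / var(x)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den
```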