Reducing cost in DNA-based data storage by sequence analysis-aided soft information decoding of variable-length reads

Abstract Motivation DNA-based data storage is one of the most attractive research areas for future archival storage. However, it faces the problems of high writing and reading costs for practical use. There have been many efforts to resolve this problem, but existing schemes are not fully suitable for DNA-based data storage, and more cost reduction is needed. Results We propose whole encoding and decoding procedures for DNA storage. The encoding procedure consists of a carefully designed single low-density parity-check code as an inter-oligo code, which corrects errors and dropouts efficiently. We apply new clustering and alignment methods that operate on variable-length reads to aid the decoding performance. We use edit distance and quality scores during the sequence analysis-aided decoding procedure, which can discard abnormal reads and utilize high-quality soft information. We store 548.83 KB of an image file in DNA oligos and achieve a writing cost reduction of 7.46% and a significant reading cost reduction of 26.57% and 19.41% compared with the two previous works. Availability and implementation Data and codes for all the algorithms proposed in this study are available at: https://github.com/sjpark0905/DNA-LDPC-codes.

6. Convert the appended binary sequences to DNA oligo sequences.

Fig. S2-1 shows the overall encoding algorithm of the proposed method.

Input data and CRC-32
In this work, we encoded an image file of Gyeongbokgung Palace (Supplementary Fig. S1) of size 548.83 kB. We converted it to a binary sequence and scrambled the data to avoid overly long homopolymer runs or extreme GC-content. Then, we attached the cyclic redundancy check-32 (CRC-32) checksum as the suffix of the sequence. We used a CRC-32 polynomial from "Best CRC Polynomials" by Philip Koopman (https://users.ece.cmu.edu/~koopman/crc/index.html). Then, we partitioned the sequence into 16,572 short binary sequences of length 272 bits and applied the encoding of Reed-Solomon-based low-density parity-check (RS-LDPC) codes in the inter-oligo direction.
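The scramble-then-frame step above can be sketched as follows. The paper's actual scrambler and the specific Koopman CRC-32 polynomial are not reproduced here, so this sketch substitutes a simple XOR whitening mask and the standard CRC-32 from Python's zlib; both are illustrative assumptions.

```python
import zlib

def scramble(data: bytes, seed: int = 0x1D) -> bytes:
    """Whiten the data with an LCG-derived XOR mask to break up long
    homopolymer runs / extreme GC-content (illustrative stand-in for the
    paper's unspecified scrambler). XOR with the same mask descrambles."""
    out, state = bytearray(), seed
    for b in data:
        state = (state * 1103515245 + 12345) & 0xFF
        out.append(b ^ state)
    return bytes(out)

def attach_crc32(data: bytes) -> bytes:
    """Append a CRC-32 checksum (4 bytes, big-endian) as the suffix."""
    return data + zlib.crc32(data).to_bytes(4, "big")

def check_crc32(framed: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the suffix."""
    payload, crc = framed[:-4], framed[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == crc

framed = attach_crc32(scramble(b"example image bytes"))
assert check_crc32(framed)
```

Because the mask depends only on the seed, descrambling is the same XOR operation applied again.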

RS LDPC codes
LDPC codes are capacity-approaching codes that nearly achieve the Shannon capacity with low complexity [1]. Thus, employing LDPC codes can achieve high error-correction performance with low computational complexity compared to other error-correcting codes. RS-LDPC codes are a class of LDPC codes: they are (γ, ρ)-regular codes whose parity-check matrix over GF(2) is constructed from Reed-Solomon codes.
Thus, the proposed RS-LDPC code is a (64, 72)-regular code with a codeword length of 18,432, a dimension of 16,572, a rate of 0.90, and a minimum distance of at least 66.
To encode the message bits into a codeword, we first reduce the parity-check matrix to a full-rank matrix. Let H_f be this full-rank matrix, written in the systematic form H_f = [A | B], where B is an invertible square matrix. Then, we obtain the generator matrix G as G = [I | (B^(-1)A)^T], where I is the identity matrix. For a message m, we encode it and obtain the codeword c by c = mG. If c is a valid codeword, the equation H_f c^T = 0 holds.
In this work, the rank of the original parity-check matrix H was 1,860, so H_f had size 1,860 × 18,432 and G had size 16,572 × 18,432. The message length is 16,572 bits and the codeword length is 18,432 bits. We encoded 272 messages in the inter-oligo direction and, reading the resulting codewords across oligos, obtained 18,432 binary sequences of length 272 bits.
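The systematic encoding step above can be sketched on a toy scale. The 3 × 6 parity-check matrix below stands in for the actual 1,860 × 18,432 matrix, and the sketch assumes the last r columns of H_f are invertible so that Gauss-Jordan elimination over GF(2) brings it to the form [A | I].

```python
def gf2_systematic(H, n, r):
    """Gauss-Jordan over GF(2): bring the last r columns of the r x n
    matrix H to the identity (assumes they are invertible)."""
    H = [row[:] for row in H]
    for i in range(r):
        col = n - r + i
        piv = next(j for j in range(i, r) if H[j][col] == 1)
        H[i], H[piv] = H[piv], H[i]
        for j in range(r):
            if j != i and H[j][col] == 1:
                H[j] = [a ^ b for a, b in zip(H[j], H[i])]
    return H

def generator_from_H(Hf, n, r):
    """Hf = [A | I_r]  ->  G = [I_k | A^T], with k = n - r."""
    k = n - r
    A = [row[:k] for row in Hf]
    return [[1 if j == i else 0 for j in range(k)] + [A[j][i] for j in range(r)]
            for i in range(k)]

def encode(m, G):
    """c = mG over GF(2)."""
    return [sum(mi * gi for mi, gi in zip(m, col)) % 2 for col in zip(*G)]

def syndrome(H, c):
    """H c^T over GF(2); all-zero for a valid codeword."""
    return [sum(h * ci for h, ci in zip(row, c)) % 2 for row in H]

# toy full-rank 3 x 6 parity-check matrix (illustrative only)
H = [[1, 1, 0, 1, 0, 0],
     [0, 1, 1, 0, 1, 0],
     [1, 0, 1, 0, 0, 1]]
Hf = gf2_systematic(H, 6, 3)
G = generator_from_H(Hf, 6, 3)
c = encode([1, 0, 1], G)
assert syndrome(H, c) == [0, 0, 0]
```

Because G is systematic, the first k bits of c are the message itself, which is why the real encoder can store the 16,572 message bits unchanged.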

RS encoding of the index
After determining the 18,432 indices, we protected them using RS codes.

Merging, clustering, and alignment
For merging the forward and reverse reads, we used the paired-read merger FLASH [5] (version 1.2.11, from http://ccb.jhu.edu/software/FLASH/). Table S1 shows the percentages of merged reads according to the length of the merged reads. After the merging step, we chose reads with lengths of 150, 151, and 152 nt to be used for decoding. To perform index-based clustering, we detached the first 16 nt index from all chosen reads, corrected errors with RS decoding, and performed index- and edit distance-based clustering. For the multiple sequence alignment (MSA) algorithm, we employed MUSCLE [6] (version 5.1, from https://github.com/rcedgar/muscle/releases/tag/v5.1). All merging, clustering, and alignment steps were implemented in Python.
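The index- and edit distance-based clustering step can be sketched as below. The read layout (16 nt index prefix) matches the text, but the edit-distance threshold and the discard rule are illustrative assumptions, not the paper's exact parameters.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cluster_by_index(reads, index_len=16, max_ed=10):
    """Group reads by their (already RS-corrected) 16 nt index prefix,
    then discard reads whose edit distance to every other member of the
    cluster is abnormally large (threshold is an assumption)."""
    clusters = {}
    for r in reads:
        idx, payload = r[:index_len], r[index_len:]
        clusters.setdefault(idx, []).append(payload)
    for idx, members in clusters.items():
        if len(members) > 1:
            clusters[idx] = [m for i, m in enumerate(members)
                             if min(edit_distance(m, o)
                                    for j, o in enumerate(members)
                                    if j != i) <= max_ed]
    return clusters

reads = ["A" * 16 + "ACGTACGTACGT",
         "A" * 16 + "ACGTACGTACGA",   # one substitution away
         "A" * 16 + "T" * 40]         # abnormal read, large edit distance
print(cluster_by_index(reads))
```

Running the example keeps the two mutually close payloads and drops the abnormal one, mirroring the discard step described above.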
Using Q-score values and LDPC re-decoding

In this work, we use an LDPC soft-decision decoder to retrieve the original data from the sequenced reads. Unlike a hard-decision decoder, which decodes using definite values (e.g., 0 or 1 for binary codes), a soft-decision decoder considers a range of possible values for each bit position. A soft-decision decoder takes the reliability of each bit position into account, which gives the decoder additional information for better estimation. This characteristic enables us to assign a high input value to a reliable base or to discard oligo sequences that are likely to be erroneous. In this work, to improve the decoder's performance, we use Q-score information while computing the LLR. Since asymmetric errors are known to occur in DNA-based data storage, as shown in Fig. S3 [8], the last base of an oligo sequence is more prone to errors than other positions. Thus, in the practical experiment, the last codeword is the bottleneck of the decoding performance. Because the last LDPC codeword is likely to have more errors than the other codewords, we use the Q-score value during the decoding of the last LDPC codeword to cope with this problem.
We first used Q-score values during the clustering step. For clusters of size one containing a shortened read, we cannot determine where a deletion error occurred, since only one read exists in the cluster. However, unless a deletion or substitution error occurs in the last base, the last base stays intact. The next step is therefore to determine whether an error exists in the last base, and we use the Q-score value of the last base for this purpose. If the Q-score is higher than 30, we trust the last base; if not, we discard the oligo sequence.
We also used the Q-score values during the LLR calculation. The LLR calculation, taking the LDPC re-decoding scheme into account, is as follows:

LLR_i(n_0, n_1, n_e) = (n_0 − n_1) × ln((1 − p_i) / p_i),

where n_0, n_1, and n_e denote the counts of 0s, 1s, and erasures at the same position of the reads in the same cluster, n_0 + n_1 + n_e is the number of reads in the cluster, and i denotes the iteration index in the LDPC re-decoding. Note that erased bits do not affect the value of the LLR. When counting the numbers of 0s and 1s in the last codeword, we checked the Q-score value: if the Q-score is lower than 20, we do not trust the base and do not include it in the count. Since the last LDPC codeword is the bottleneck of the experiment, we were able to improve the decoding performance by discarding unreliable bases using the Q-score.
After the first decoding of the LDPC codes, we performed a re-decoding stage to maximize the decoding performance. According to the equation, we have to choose the value of the parameter p_i. The initial value is p_0 = 0.02, for which the initial LLR becomes LLR_0(n_0, n_1, n_e) = (n_0 − n_1) × ln(0.98/0.02) ≈ (n_0 − n_1) × 3.8918.
In the re-decoding step, we performed iterative decoding, decreasing p_i by Δ = 0.0005 at each iteration until the decoding succeeded or p_i reached 0.001. Since LDPC codes can determine whether a decoded codeword is correct by using the parity-check matrix, we performed the re-decoding stage only for decoded codewords containing errors. Fig. S4 shows a brief flow chart of the decoding procedure. A comparison of previous works and this work is shown in Table S2.
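The LLR scaling and the re-decoding sweep described above can be sketched as follows. Here `ldpc_decode_fn` is a placeholder for a real LDPC soft-decision decoder (the actual decoder is not reproduced); everything else follows the parameter values stated in the text.

```python
import math

def llr(n0: int, n1: int, p: float) -> float:
    """Cluster-wise LLR: each of the n0 + n1 non-erased reads votes with
    weight ln((1 - p) / p); erasures contribute nothing."""
    return (n0 - n1) * math.log((1 - p) / p)

def redecode(llr_inputs_fn, ldpc_decode_fn,
             p0=0.02, p_min=0.001, delta=0.0005):
    """Re-decoding sweep: lower the assumed error rate p until decoding
    succeeds (all parity checks satisfied) or p reaches p_min.
    `ldpc_decode_fn(llrs) -> (ok, decoded)` is a stand-in for a real
    soft-decision LDPC decoder."""
    p = p0
    while p >= p_min:
        ok, decoded = ldpc_decode_fn(llr_inputs_fn(p))
        if ok:
            return decoded
        p -= delta
    return None  # decoding failed over the whole sweep

# initial scaling factor matches the text: ln(0.98 / 0.02) ~ 3.8918
print(round(math.log(0.98 / 0.02), 4))
```

Because the parity-check matrix certifies success, the sweep stops as soon as any p_i yields a valid codeword, so only erroneous codewords pay the extra decoding cost.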

Synthesis
The DNA oligo pool, consisting of 18,432 oligo sequences of length 200 nt including primers, was synthesized by Twist Bioscience.

Sequencing
For more accurate and reliable results, we performed DNA sequencing three times for the same DNA oligo pool. To have the same sequencing environment as the previous works [3] and [4], we followed exactly the same sequencing procedure as those works. We use Q5

Information density
The information density is defined as the number of information bits stored per base after encoding. In this work, the size of the payload is 16,572 × 272 bits. These binary sequences are converted to 18,432 × 152 nt after attaching the LDPC parity, RS parity, and index. Let ρ_proposed be the information density of the proposed encoding algorithm; it is given by

ρ_proposed = (16,572 × 272 bits) / (18,432 × 152 nt) = 1.61 bits/nt.
In this work, we compare the information density with the experiment conducted in [4]. They applied the encoding method of DNA Fountain codes and measured the decoding performance using Erlich's decoder [3] and Jeong's decoder [4]. They used LT codes as inter-oligo codes and RS codes as intra-oligo codes. They partitioned 513.6 KB of the image into 16,050 binary sequences of length 256 bits. Then, they encoded the binary sequences into 18,000 DNA sequences of length 152 nt. Let ρ_[4] be the information density of the DNA Fountain encoding algorithm applied in [4]. Then, ρ_[4] is given by

ρ_[4] = (16,050 × 256 bits) / (18,000 × 152 nt) = 1.50 bits/nt.

Writing cost
As mentioned in the paper, the writing cost is defined as the number of synthesized bases divided by the number of information bits; that is, the writing cost is the inverse of the information density. In this work, 18,432 × 152 nt of DNA bases are used to store 16,572 × 272 bits of data. Let WC be the writing cost of this work. Then it can be written as

WC = (18,432 × 152) / (16,572 × 272) = 1 / ρ_proposed = 0.62 bases/bit.

In the same way, let WC_[4] be the writing cost of the work [4]. Then WC_[4] can be written as

WC_[4] = (18,000 × 152) / (16,050 × 256) = 1 / ρ_[4] = 0.67 bases/bit.

The writing cost reduction compared to [4] can then be calculated as

(WC_[4] − WC) / WC_[4] = (0.67 − 0.62) / 0.67 ≈ 0.0746.

Compared to the previous work [3], this work achieved a 7.46% writing cost reduction with the same synthesis provider (Twist Bioscience).
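The information-density and writing-cost arithmetic above can be checked directly. Note that, as in the text, the 7.46% reduction is computed from the rounded per-bit costs 0.67 and 0.62.

```python
# Payload bits and synthesized bases, taken from the text.
bits_proposed  = 16_572 * 272   # this work
bases_proposed = 18_432 * 152
bits_prev  = 16_050 * 256       # DNA Fountain encoding in [4]
bases_prev = 18_000 * 152

rho_proposed = bits_proposed / bases_proposed   # information density (bits/nt)
rho_prev     = bits_prev / bases_prev
wc_proposed  = 1 / rho_proposed                 # writing cost (bases/bit)
wc_prev      = 1 / rho_prev

print(f"{rho_proposed:.2f} bits/nt")            # 1.61
print(f"{rho_prev:.2f} bits/nt")                # 1.50
print(f"{wc_proposed:.2f} bases/bit")           # 0.62
print(f"{wc_prev:.2f} bases/bit")               # 0.67
print(round((0.67 - 0.62) / 0.67, 4))           # 0.0746, using the rounded costs
```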

Reading cost
The reading cost is defined as the number of sequenced bases divided by the number of information bits. In [4], they measured the decoding performance using Erlich's [3] and Jeong's [4] decoders for the same input data file. Let RC_1, RC_2, and RC_3 be the reading costs of this work in experiments #1, #2, and #3. In [7], the writing and reading costs are 0.67 bases/bit and 3.82 bases/bit for an LDPC redundancy of 10% to restore 224 KB of data.

Supplementary Notes 5 Additional discussion
The coverage of the DNA-based data storage experiment

In DNA-based data storage, the random sampling and merging steps can be modeled by a Poisson distribution. The coverage distribution can be modeled as

P(X = k) = (λ^k e^(−λ)) / k!,

where k is the coverage and λ is the mean of the Poisson distribution, i.e., the average number of used reads per oligo sequence. The percentages of merged reads are over 95% on average. Since the percentages of shortened reads are higher than those of reads with lengths over 152 nt, the proposed scheme uses additional shortened reads only. Thus, from the merged reads, the proposed scheme uses reads with lengths of 150, 151, and 152 nt, which constitute over 94% of the sequenced reads. In other words, the proposed scheme uses 1.3% additional reads compared to a scheme using only correct-length reads. Table S2 compares [3], [4], [7], [8], [9], [10], [11], and the proposed method. Our work achieves the best result in terms of writing cost and coverage.
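The Poisson coverage model can be sketched as follows. Using the mean coverages reported for the three experiments, it also gives the modeled probability that an oligo is sampled zero times (a dropout), which is what the inter-oligo LDPC code must recover.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """P(coverage = k) under the Poisson model: lam^k e^(-lam) / k!."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

# Mean coverages of the three experiments, as reported in Fig. S5.
for lam in (3.93, 3.88, 3.99):
    dropout = poisson_pmf(0, lam)  # oligos never sampled
    print(f"lambda = {lam}: modeled dropout probability = {dropout:.4f}")
```

The modeled dropout probabilities are on the order of 2% at these coverages, which is consistent with needing an erasure-capable inter-oligo code.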

Fig. S2-2:
Overall decoding algorithm of the proposed scheme. From the merged reads, reads with lengths of 150, 151, and 152 nt are selected for decoding. The index of each selected read is decoded first, and reads with the same index are grouped into the same cluster. Then, for all reads in each cluster, the edit distance (ED) between each pair of reads is calculated, and reads with large EDs are discarded. For clusters with variable-length (VL) reads, multiple sequence alignment is applied to convert the reads into binary sequences. Next, LLR values for each aligned read are calculated and become the inputs to the decoding of the LDPC codes. From the decoded output, the original data are retrieved after checking the CRC-32.

Fig. S2-2 shows the overall decoding algorithm of the proposed method.

Fig. S3 :
Fig. S3: The figure from Fig. 3(b) in [8]. Per-position average read error profile for the first 150 positions in DNA sequences. DNA-based data storage exhibits asymmetric errors, and the error rate of the last base is higher than that of the other positions.

Fig. S4 :
Fig. S4: A brief flow chart of the decoding procedure. The top red dotted box shows the clustering step, the middle red dotted box shows the MSA and LLR calculation steps, and the bottom red dotted box shows the iterative re-decoding of LDPC codes. The orange dotted box shows the edit distance-based clustering algorithm.

Fig. S5 :
Fig. S5: The coverage histogram for the ideal Poisson sampling and the three DNA-based data storage experiments.The mean coverages for the three experiments are 3.93, 3.88, and 3.99, respectively.
As will be mentioned later, since we perform index-based clustering during the decoding procedure, it is necessary to protect the index with strong error-correcting codes. RS codes are defined as [n, k, n − k + 1] codes over GF(q): the linear block code has length n, dimension k, and minimum Hamming distance n − k + 1.
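The [n, k, n − k + 1] parameters can be illustrated with a toy RS code built by polynomial evaluation over the small field GF(5). This is only an illustration of the minimum-distance property; the field size and code parameters here are assumptions for the example, not those of the code actually used to protect the 16 nt index.

```python
from itertools import product

P = 5  # toy prime field GF(5); the real index code uses a larger field

def rs_encode(msg, n):
    """Evaluate the degree < k message polynomial at n distinct field
    elements (evaluation points 0, 1, ..., n - 1)."""
    return [sum(m * pow(x, i, P) for i, m in enumerate(msg)) % P
            for x in range(n)]

n, k = 4, 2
codewords = [rs_encode(list(m), n) for m in product(range(P), repeat=k)]

# Brute-force the minimum Hamming distance over all codeword pairs.
min_dist = min(sum(a != b for a, b in zip(c1, c2))
               for i, c1 in enumerate(codewords)
               for c2 in codewords[i + 1:])
print(min_dist)  # n - k + 1 = 3
```

Two distinct polynomials of degree < k agree in at most k − 1 points, so every pair of codewords differs in at least n − k + 1 positions, which the brute-force check confirms.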