Sequencing Coverage Analysis for Combinatorial DNA-Based Storage Systems

This study introduces a novel model for analyzing and determining the required sequencing coverage in DNA-based data storage, focusing on combinatorial DNA encoding. We seek to characterize the distribution of the number of sequencing reads required for message reconstruction. We use a variant of the coupon collector distribution for this purpose. For any given number of observed reads, <inline-formula> <tex-math notation="LaTeX">$R\in \mathbb {N}$ </tex-math></inline-formula>, we use a Markov Chain representation of the process to compute the probability of error-free reconstruction. We develop theoretical bounds on the decoding probability and use empirical simulations to validate these bounds and assess tightness. This work contributes to understanding sequencing coverage in DNA-based data storage, offering insights into decoding complexity, error correction, and sequence reconstruction. We provide a Python package, with its input being the code design and other message parameters, all of which are denoted as <inline-formula> <tex-math notation="LaTeX">$\boldsymbol {\Theta }$ </tex-math></inline-formula>, and a desired confidence level <inline-formula> <tex-math notation="LaTeX">$1-\delta $ </tex-math></inline-formula>. This package computes the required read coverage, guaranteeing the message reconstruction <inline-formula> <tex-math notation="LaTeX">$R=R(\delta,\boldsymbol {\Theta })$ </tex-math></inline-formula>.

Abstract-This study introduces a novel model for analyzing and determining the required sequencing coverage in DNA-based data storage, focusing on combinatorial DNA encoding.We seek to characterize the distribution of the number of sequencing reads required for message reconstruction.We use a variant of the coupon collector distribution for this purpose.For any given number of observed reads, R ∈ N, we use a Markov Chain representation of the process to compute the probability of error-free reconstruction.We develop theoretical bounds on the decoding probability and use empirical simulations to validate these bounds and assess tightness.This work contributes to understanding sequencing coverage in DNA-based data storage, offering insights into decoding complexity, error correction, and sequence reconstruction.We provide a Python package, with its input being the code design and other message parameters, all of which are denoted as Θ, and a desired confidence level 1 − δ.This package computes the required read coverage, guaranteeing the message reconstruction R = R(δ, Θ).Index Terms-DNA, DNA-based data storage, synthethic biology, computational biology.

I. INTRODUCTION
T HE GROWING volume of the world's digital data and the limitations of existing storage technologies motivate the need for new and innovative storage solutions [1].DNAbased data storage (or DNA-based storage) emerges as a viable solution for some applications, offering unmatched density and durability.This novel approach involves the synthesis, storage, and sequencing of DNA molecules to encode, store, and retrieve information [2], [3].However, challenges such as short, error-prone strands and limitations of current synthesis technologies still remain [4], [5], [6], [7], [8].While DNAbased storage stands as a promising technology and the cost of DNA sequencing is decreasing, it remains significantly more expensive than reading from established archival storage solutions [9], [10], [11], [12].In the context of DNA sequencing costs and throughput, recent work [13] defined the DNA coverage depth problem, which considers the expected sample size, to guarantee the successful decoding of the information.A related concept was suggested by Chandak et al. [14], who explored the balance of writing and reading costs in DNAbased data storage, studying the LDPC-based coding schemes.
In recent years, several studies suggested the use of combinatorial DNA encoding and synthesis as an approach for increasing logical density while reducing overall cost in DNAbased storage systems [15], [16], [17].The combinatorial encoding approach uses a set of clearly distinguishable DNA shortmers to construct large combinatorial alphabets, where each letter is encoded by a subset of shortmers.This scheme is a novel approach for DNA-based data storage, offering an increase in logical density over standard DNA-based storage systems, reduced reconstruction error levels, and scalability.See Section II for more details about combinatorial DNA encoding.Combinatorial DNA encoding can be viewed as an extension of the composite DNA coding schemes (sometimes referred to as degenerate DNA encoding) [4], [18].
This work presents the first model for analyzing the sequencing coverage depth problem under combinatorial DNA encoding.The analysis is based on the general design scheme for combinatorial DNA storage systems.In brief, we assume a 2-dimensional (2D) MDS error correction scheme over a set of short combinatorial DNA sequences.Each combinatorial sequence is encoded using an inner code, to protect against symbol errors, while an outer code adds redundancy to a block of sequences, protecting against sequence-level errors (e.g., sequence dropout).The inner and outer code approach is common in DNA-based storage literature [4], [5], [19], [20].Our analysis follows the reconstruction steps associated with this approach.
We model the reconstruction of a single combinatorial letter as a variant of the coupon collector's problem [13], [21], [22], [23], [24].We use a Markov Chain (MC) representation of the collection process to characterize this distribution.Taking into account the error correction code parameters, we continue by analyzing the read depth requirement for a single combinatorial sequence and then for the entire message.We provide bounds on the decoding probability given the number of analyzed reads, and present an operational algorithm for determining the required coverage of reads.We explore our coverage depth model on various design parameters, and compare the results to Monte Carlo simulation experiments c 2024 The Authors.This work is licensed under a Creative Commons Attribution 4.0 License.
For more information, see https://creativecommons.org/licenses/by/4.0/Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
of combinatorial DNA reading.Lastly, we provide computer code implementing our coverage model that, given a sequence and message design, outputs the read coverage required for recovering the data with a user-defined confidence level.This work combines theoretical progress represented by studying the coverage depth problem for combinatorial DNA-based storage, and also the practical aspect supporting the design and implementation of such systems.This paper is constructed as follows.Section II includes an overview of combinatorial DNA encoding, providing important notations while defining the overall system design and the reconstruction flow.Section III defines the coverage depth problem for combinatorial DNA storage, highlighting the differences compared to the standard DNA coverage depth problem.Section IV describes the analysis of required read depth by breaking it down into three steps associated with the reconstruction process.Section IV-A presents the coupon collector's model for a single combinatorial letter.Sections IV-B and IV-C analyze the decoding of one combinatorial sequence and the complete message, respectively.Section IV-D describes the tool for determining the required coverage depth in combinatorial systems, and demonstrates it using different design parameters.Finally, Section VI discusses the broader implications of our work on DNA-based storage systems and related fields.

II. OVERVIEW OF COMBINATORIAL DNA ENCODING SCHEME
Utilizing combinatorial approaches for DNA synthesis and assembly was recently suggested by several studies as a way to increase logical density and reduce overall costs in DNA-based storage systems [15], [16], [17].While different studies suggest various molecular mechanisms for generating combinatorial sets of DNA sequences, they all share several important characteristics.This section describes these common components, using notations from [15].

A. Definitions
Let Ω be a set of N DNA k-mers.The k-mers in Ω are chosen such that mix-up errors between two k-mers are negligible (e.g., by setting a minimal distance d between each pair).A binomial combinatorial alphabet Σ is defined such that every letter σ ∈ Σ represents a subset of size K of k-mers from Ω.This subset is referred to as the member k-mers of σ.This defines an alphabet of |Σ| ≤ N K letters.Encoding a binary message of length B bits using this extended combinatorial alphabet is done by generating a sequence of M combinatorial letters σ 1 . . .σ M where M = B log 2 (|Σ|) .Fig. 1a presents a schematic view of the combinatorial encoding approach.
Fig. 1b describes the physical properties of a combinatorial DNA channel.First, the combinatorial sequences are written by generating a set of DNA molecules (using combinatorial synthesis or assembly).Each of the molecules includes a sequence of positions where each position represents a single combinatorial letter.In a given position, each sequence should be one of the member k-mers of the combinatorial letter encoded in that position.Next, a sample of the generated molecules are processed and sequenced generating a set of reads that are analyzed for the reconstruction of the combinatorial message.The reconstruction of the combinatorial sequence includes three main steps, as detailed below: 1) Grouping: The reads obtained from the sequencing output are grouped according to the combinatorial sequence they represent.This is often done using a barcode sequence at the beginning of every DNA molecule.2) Sequence reconstruction: The combinatorial sequences are reconstructed from the grouped DNA sequences.
• Often, each sequence is reconstructed separately from the set of reads assigned to it.• Each sequence is treated as a set of independent combinatorial letters and therefore reconstructed one position at a time.• A combinatorial letter is reconstructed by identifying K unique k-members that are observed in the relevant position of the analyzed reads.3) Message decoding: Error correction codes are used to decode the original message.

B. Error Correction and Sequencing Coverage
Standard DNA-based data storage systems incorporate error correction schemes to mitigate common errors in DNA synthesis and sequencing.Symbol-level errors, such as single-base substitutions, insertions, and deletions are the most common errors.Most studies use error correction codes and constrained coding over the DNA alphabet to overcome these errors.Sequence-levels errors are another common type of error in DNA-based storage systems.A common sequence-level error is a sequence dropout, where certain sequences are not observed in the output at all.Sequence dropout occurs mostly due to the sampling step taken before sequencing, which may result in some sequences not being chosen.The molecular biology steps used for processing the DNA molecules may also be biased in a manner not yet fully characterized, which increases the chances of sequence dropouts.To overcome sequence dropouts, a second layer of error correction is applied on the sequence level.The combination of symbol level error correction (inner code) and the sequence level error correction (outer code) is sometimes referred to as a 2-dimensional (2D) error correction scheme.
Another common approach for eliminating both symbol and sequence-level error, is using the inherent multiplicity in DNA synthesis and sequencing technologies.Increasing the sampling rate yields higher sequence coverage, reduces the chances of sequence dropouts, and helps correct symbol level errors using various consensus-based methods.This makes studying the optimal sampling rate (or sequence coverage) a promising research direction for improving DNA-based storage.
Combinatorial DNA-based data storage requires additional considerations when examining errors and designing error correction strategies.Mainly, in every position, the subset of k-mers representing the combinatorial letter must be identified correctly.To do so, at least one copy of each k-mer in the subset must be observed.It is therefore important to understand the effect of sequence coverage on composite DNA systems, and to analyze the effect of different parameters.In order to account for possible errors in identifying the observed k-mers, a suggested approach involves observing each k-mer multiple times before inferring the combinatorial letter based on the subset of K k-mers observed most frequently.The required multiplicity is a tunable parameter (t) in combinatorial DNA systems.To demonstrate the difference between coverage analysis of standard DNA and composite DNA, consider the following simplified example.
Example 1: To recover x standard DNA sequences, we are required to observe at least one copy of each sequence, and at least x × 1 reads in total.To recover x combinatorial DNA sequences with a combinatorial factor K, we need to observe at least K copies of each sequence and at least x × K reads in total.Adding the multiplicity parameter t to the combinatorial reconstruction process, makes the minimal number of reads to be analyzed x × K × t.Note that this is an unrealistic example.In reality, channel noise (i.e., sampling and errors) complicates the coverage requirements in both cases.
Analysis of the sequencing coverage of combinatorial DNA systems was briefly explored in [15] mostly using simulations.

A. Problem Definition
In this study, we address the challenge of determining the required sequencing coverage for DNA-based data storage systems that utilize combinatorial sequences.We assume a 2D MDS error correction scheme as presented in the top panel of Fig. 2. Namely, the message is encoded using a vector of l combinatorial sequences.Each sequence is of length m.Given R all analyzed reads, the decoding process includes the grouping, reconstruction, and decoding process as described in II and demonstrated in the bottom panel of Fig.

B. Comparison With Standard DNA Scheme
Sequencing coverage in standard DNA-based data storage systems was nicely analyzed in [13], where the main focus was on the outer code and the reconstruction of the complete message.Specifically, recovering a single sequence was considered to be a binary function of the number of copies observed for the sequence, t.In this work, we break down this problem, by: 1) Considering the inner code and modeling the probability of a combinatorial sequence to be recovered.2) Modeling the reconstruction of each combinatorial letter (a position in the sequence) using the coupon collector's problem.We also introduce a Markov Chain (MC) model-based calculation for the exact characterization of the coupon collector's distribution.We complete the analysis by considering the required coverage for achieving a desired decoding probability.Finally, we give a practical tool that can be used to design combinatorial DNA systems.

IV. RESULTS
The decoding complexity is analyzed here by breaking the process down into its basic components.First, the decoding probability of a single combinatorial letter is analyzed, considering various design parameters and decoding approaches.Next, this paper addresses the decoding of a single combinatorial sequence, while considering the use of error correction codes with varying redundancy levels.Finally, the decoding of a complete combinatorial DNA message is analyzed, considering a general 2D error correction MDS code (i.e., a code that protects against sequence dropouts as well as errors on each sequence).

A. Reconstruction of a Single Combinatorial Letter
Let Ω be a set of N valid k-mers used for a combinatorial DNA-based data storage system.Consider a binomial combinatorial alphabet Σ with |Σ| ≤ N K letters where each letter σ ∈ Σ consists of a subset of size K of k-mers from Ω.This subset is referred to as the member k-mers of σ.Let R be the number of analyzed reads of a given combinatorial letter.We define a decoding algorithm in which we accumulate reads until we observe at least t copies of K unique k-mers from Ω.These K k-mers are referred to as the inferred member k-mers, and are used to reconstruct a combinatorial letter σ (See Algorithm 1).
To analyze the probability of decoding a single combinatorial letter, we first assume that each read uniformly draws one of the K member k-mers.We note that the size of the k-mer set N does not play a role in this model.We also ignore invalid k-mers as we assume that the k-mers in Ω are selected such that mix-up errors are negligible (refer to Section II and [15] for details).Nonetheless, we allow detectable errors in the k-mer reading with some (low) probability , as described later in this section.Let T K ,t be a random variable representing the number of reads analyzed until the decoding algorithm successfully stops.Let π K ,t (R) be the probability of stopping with a successful inference after at most R reads.
For t = 1, the random variable T K ,t represents the classical coupon collector's model [25] and we get (See Appendix B): where i is the Kth harmonic number.We note that this result for t = 1 is also presented in [15].
For t > 1, we can obtain [24]: To calculate π K ,t (R) for t > 1, we use a Markov Chain (MC) formulation.Each state in the MC represents the status of the member k-mers in σ, in terms of the number of times each has been seen.Specifically, a state is represented by a vector: For 0 ≤ j < t, v(j) indicates the number of member kmers seen exactly j times, while v(t) indicates the number of member k-mers seen t times or more.Clearly, this vector satisfies: And, when v(t) = 0, the inequality in (7) holds as equality We also note that since there are t + 1 values in the vector (v (0), v (1), . . ., v (t)), there are a total of S = K +t t possible solutions to the equation, representing the number of states.
Example 2: Considering K = 10 member k-mers and a threshold t = 2, the following states can be defined: • (10, 0, 0): All 10 k-mers have not been observed yet.This is the case prior to analyzing the reads.• (8, 2, 0): After analyzing two reads, two unique k-mers have been observed exactly once while the remaining eight k-mers have not been observed yet.• (7, 2, 1): After analyzing at least four reads, two unique k-mers have been observed exactly once, one k-mer has been observed two times or more, and the remaining seven k-mers have not been observed yet.The following transition matrix A is defined with dimensions S × S, where each transition is defined by the observation of one read.

A[(v
This represents observing one of the v(i) k-mers that were observed i < t times. And: This represents observing one of the v(t) k-mers that were observed at least t times.
Example 3: Considering K = 10 member k-mers and a threshold t = 2.In the first transition, the first unique k-mer must be observed: For the second transition there are two options: • When one out of the nine yet unseen k-mers is drawn: • When the same k-mer is drawn again: To get to state s 2 = (8, 2, 0), this calculation takes place: To calculate π K ,t (R), we set the initial state to be: where P 0 = (P (s 0 ) = 1, 0, . . ., 0) is the state distribution vector.We derive the distribution vector over the states after R steps: Let s f = (v (0) = 0, v (1) = 0, . . ., v (t) = K ) be the desired state in which all K k-mers have been observed at least t times.Thus: Fig. 3 and Appendix A demonstrate the state distribution vector for several values of R using K = 5 member k-mers and a threshold of t = 1.Clearly, after analyzing the first read, Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.Finally, at R = 30, the probability of observing all coupons was: This algorithm ignores possible synthesis and sequencing errors as it assumes that all observed k-mers come from the set of K valid k-mers.Introducing an error probability of observing an invalid k-mer requires a modified transition matrix B: This represents observing one of the v(i) (valid) member kmers that were observed i < t times.And: This represents observing one of the v(t) k-mers that were observed at least t times, or observing an invalid k-mer.
Example 4: In the first transition, taking into account , there are two options: • When one out of the ten yet unseen k-mers is drawn: • When an invalid k-mer is drawn: For the second transition, there are three options: • When one out of the 9 yet unseen k-mers is drawn: • When the same k-mer is drawn again: • When an invalid k-mer is drawn: Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.Fig. 4 depicts the decoding probabilities for varying numbers of analyzed reads using different values for the threshold t (See Appendix E for different K and values).The calculated probabilities are compared to a simulation experiment.As expected, as t increases, more reads are required to reconstruct a combinatorial letter.Notably, when R reaches 100 or more, the probability effectively becomes 1, indicating full data recovery.This represents the balance between the threshold level required for achieving precise combinatorial reconstruction and the read depth complexity.
Note that throughout this section, we ignored the possibility of an error that results in k-mer mix-up (i.e., the output of the decoding algorithm is different from the original combinatorial letter, σ = σ).This is due to the assumption that the design parameters render this error type very unlikely (See Section II).We further discuss this issue in Section VI.

B. Reconstruction of a Combinatorial Sequence
Let s = σ (1) σ (2) . . .σ (m) be a sequence of length m over the same binomial alphabet defined in the previous section.Assuming the use of an MDS error correction code, we say that decoding only b ≤ m letters/positions is sufficient for decoding the complete sequence.Let R be the number of analyzed reads, fixing K and t, we denote π K ,t (R) as π(R).Let W be a random variable representing the number of letters in s that were decoded.Assuming independence between the letters in s, we get: We are interested in the probability of decoding the sequence s, P single : Remark: Since there are different technological/molecular approaches for generating combinatorial DNA sequences, no specific assumption can be made regarding the dependence of different positions in the sequence.Every technology aspires to generate sequences with independent positions and we therefore chose to assume independence.From a coding point of view, assuming independence makes the solution general as no constraints are forced on the sequence.
We note that in the case of b = m (i.e., no error correction code), we get the result presented in [15].
We can approximate this probability using the normal estimation (based on the Central Limit theorem): where Φ is the CDF of the standard normal distribution.Fig. 5 presents the decoding probabilities of a combinatorial sequence with length m = 100, examining how the number of Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
analyzed reads (R) affects the accuracy of sequence reconstruction across various redundancy levels (b = 100, 95, 90, 85) keeping other parameters constant (K = 7, t = 4).We observe that the probability of successful reconstruction varies significantly with different redundancy levels.Notably, higher redundancy levels (lower b values) enable accurate reconstruction using fewer reads.These results also align with the results obtained from the normal approximation (not shown).The results demonstrate the role of sequence-level redundancy in affecting the likelihood of accurate reconstruction, making it an important tunable parameter in the overall design.

C. Reconstruction of a Complete Combinatorial Message
Let M = {s i } l i=1 be a complete combinatorial message encoded using a binomial alphabet like in the previous sections.The message is encoded using l combinatorial sequences and, assuming an MDS error correction code, a ≤ l of which are sufficient for the decoding of M.
Let R all be the total number of analyzed reads over all sequences.We are interested in the probability of decoding at least a of l sequences using R all reads, P all (R all , l , a).
Fig. 6 presents an overview of the decoding process and the analysis steps for a complete combinatorial message.
First, the R all reads are distributed between the l sequences, using, for example, the barcodes.Then, the decoding probability of each of the l sequences is determined using the derivation from the previous section.The decoding probability of a single letter is analyzed using the coupon collector's model.We now formally define each of these steps and analyze the decoding probability P all (R all , l , a), or simply P all .
Given a specific distribution of the R reads (r 1 , . . ., r l ), r i = R all , to successfully decode the message at least a of the sequences must be decoded: where I i is an indicator of decoding sequence s i using r i reads.
Assuming an unrealistic case of distributing the reads evenly over the l sequences, each sequence is represented by exactly r mean reads as follows: The probability to decode each sequence is: where π rmean are obtained by using r mean in the coupon collector's model.We can define a new binomial random variable X that represents the number of decoded sequences: And: However, the R all reads are not evenly distributed across the l sequences and we therefore model this distribution using a multinomial distribution: Remark: We note that in reality, biases in the combinatorial DNA channel may result in different distributions of the reads across the sequences.Since these biases differ between the technologies used for generating combinatorial DNA molecules, we chose to use the uniform multinomial distribution that does not require specific characterization of the channel.We also note that this distribution was shown to be optimal for some cases [13].
Using the law of total probability and setting P (R 1 = r 1 , . . ., R l = r l ) = P (r 1 , . . ., r l ): Calculating P all directly becomes infeasible even for small values of R all , l and a.We therefore bound this probability.First, we note that for every sequence s i we have: where r min = min j =1,...,l r i and π r min are obtained by using r min in the coupon collector's model.We therefore define X to be: And: Yielding a lower bound on P all (R all , l , a): In the multinomial distribution for (R 1 , . . ., R l ), many possible read distributions are very unlikely.We can further bound P all by setting a constant value ρ and only considering read distributions for which min j =1,...,l (r j ) ≥ ρ.Let X ρ be a random variable representing the number of sequences decoded when the decoding probability of each sequence is calculated using ρ reads.That is, X ρ ∼ Binom(l , P single (ρ, m, b)).
We therefore have: Given a small δ > 0, we check whether R all reads are sufficient to decode the message with 1 − δ confidence level.
This can be achieved by choosing ρ such that: 1) Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

And: 2)
(r 1 ,...,r l ) r i =R all min j r j ≥ρ Since X ρ has a binomial distribution, we can find ρ for which condition (32) holds.For condition (33), we use Sanov's theorem on the multinomial distribution as follows.For more on Sanov's theorem and the behavior of multinomials, see [26].
Sanov's theorem bounds the probability that the distribution of the reads into barcodes significantly deviates from the expected uniform (( 1 l ) for each) distribution, particularly where at least one sequence gets fewer than ρ reads.Fig. 7 demonstrates this using a simulation of 100,000 instances, each drawn from the multinomial distribution with Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.) and R all = 4500 or R all = 5000.The plots show the distribution of the minimal values obtained.Clearly, increasing R all reduces the probability of the minimal value being lower than a fixed threshold ρ.Decreasing the threshold ρ yields a similar effect.
Let E (ρ) be the set of probability vectors equivalent to read distributions (r 1 , . . ., r l ), for which r j = R all , min j =1,...,l (r j ) < ρ: We define ζ(ρ) = min P ∈E (ρ) D(P U ), where D(P U ) is the Kullback-Leibler (KL) divergence: Next, we show that P * is the distribution of reads in which ρ − 1 reads are assigned to one sequence and the remaining R all −ρ+1 reads are uniformly distributed over the remaining l − 1 sequences.
Lemma 1: We have: The proof for this lemma is found in Appendix C. For intuition, this is simply the result of the symmetric nature of the KL divergence function and of U.
Sanov's theorem [26] provides a bound on the probability of observing any distribution within E (ρ). where: This bound implies that the likelihood of observing a significantly non-uniform distribution of reads decreases exponentially as the total number of reads R all increases.We recall that: (r 1 ,...,r l ) r i =R all min j r j ≥ρ And so we get: This gives us an operational algorithm for checking if R all reads are sufficient to ensure successful decoding with confidence 1−δ, as specified in Algorithm 2.
Fig. 8(a) demonstrates the approach, by presenting the probability of successful message decoding P (X ρ ≥ a) and the probability of considering "enough" of the read distribution (1 − P (E )) for a fixed number of overall reads R all = 3000 as a function of the threshold ρ.Clearly, P (X ρ > a) increases as ρ increases since each sequence s i is decoded using more reads.On the other hand, as is demonstrated in  Calculate probability P (E ) for current R all ; Break loop and finalize value of R all ; end end Fig. 5, increasing ρ decreases 1 − P (E (ρ)), since fewer read distributions with min j (r j ) ≥ ρ are expected.
We note that the bound achieved by using Sanov's theorem is not tight, and therefore presents an alternative approach for finding ρ using empirical simulations.Fig. 8(b) presents the probability (1 − P (E (ρ))), calculated as shown in Fig. 7 by 100,000 instances of simulating the multinomial distribution with R all = 1000.Clearly, this method yields a tighter bound on the decoding probability while also requiring the analysis of less reads overall.See Appendix D for details.

D. A Tool for Determining the Required Sequencing Coverage
We have developed a tool designed to calculate the necessary sequencing coverage for DNA-based data storage systems.
1) Design Parameters, Input, and Output: The tool gets as parameters the sequence design and coding schemes, and computes the required sequencing coverage for a desired confidence level.Specifically, Design parameters: • K -Total number of unique k-mers in each position.
• t -Required threshold on the number of observed occurrences of each of the k-mers.• m -Sequence length.
• b -Total number of letters required to be successfully decoded in each sequence.• l -Total number of sequences in the message.
• a -Total number of sequences required for successful decoding.
• -Error probability of observing an invalid k-mer.

Output:
• R all -Required sequencing coverage.
2) Description of Tool Run: Fig. 9 presents a high-level description of the tool workflow.Given the design parameters Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.K , t, , m, b, l , and a, the tool finds a threshold ρ and a total number of reads R all for which conditions (a) and (b) hold for the input confidence level 1 − δ (See Section IV-C).First, ρ is found such that the decoding of at least a sequences is ensured, P (X ρ ≥ a) ≥ √ 1 − δ (See Section IV-C).This calculation requires the probability to decode a single sequence, P single (ρ, m, b) (See Section IV-B), which uses the reconstruction probability of a single combinatorial letter π K ,t (ρ) (See Section IV-A).
Once ρ is determined, the algorithm searches for the required number of overall reads R all that ensures 1 − P (E (ρ)) ≥ √ 1 − δ.This can be achieved using either the bound from Sanov's theorem or the empirical estimation of P (E (ρ)).When a value for R all that satisfies the condition is found, then the tool run exits, outputting R all to the user.
3) Example Runs: To demonstrate the tool's functionality, we used it to determine the required sequencing coverage for different sets of design parameters, similar to those used in [15], and for various confidence levels.These results are presented in Table I.Clearly, increasing the desired confidence level (smaller values for δ) requires the increasing of the sequencing coverage.Scaling up the system's capacity by taking l to be 10 times larger results in a proportional increase in R all .Increasing the redundancy level (lower value for a) reduces the number of required reads to be analyzed.We note that the different design parameters influence both the threshold ρ and the sequencing coverage R all .While R all is affected by all the design parameters, ρ is primarily affected by m and b.These findings underscore the importance of carefully selecting system parameters to optimize the efficiency and reliability of DNA-based data storage systems.Future work may explore the boundaries of these parameters to further enhance system performance.
4) Runtime Analysis: In terms of runtime complexity of evaluating R all for any given set of parameters, we compared our tool to using Monte Carlo simulations.Table II presents the results for fixed values for R all and ρ.Clearly an increase in a or l leads to a drastic increase in the simulation runtime, indicating that our approach is significantly faster and scales better to larger systems.

A. Monte Carlo Simulations 1) Decoding Probability of a Single Combinatorial Letter:
The decoding probability of a single combinatorial letter was calculated by simulating the distribution of R reads over K elements with uniform probability.Where applicable, an error probability was used to discard reads representing invalid k-mers.Then, a successful decoding was considered if all K k-mers where observed at least t times each.The process was repeated Q times to calculate the success rate.The median and boxplot of 50 repeats is presented.See Algorithm 3.  2) Decoding Probability of a Combinatorial Sequence: The decoding probability of a combinatorial sequence of length m was calculated by repeating the simulation for a single letter m times.Then, a successful decoding was considered if at least b of the m letters where successfully decoded.The process was repeated Q times to calculate the success rate.The median and boxplot of 50 repeats is presented.See Algorithm 4.

3) Decoding Probability of a Complete Combinatorial Message:
The decoding probability of a complete combinatorial message with l sequences was calculated by first simulating the distribution of R all reads over the l sequences.Then, the simulation for a single combinatorial sequence was repeated for each sequence using the associated R i reads.A successful decoding of the complete message was considered if at least a of the l sequences where successfully decoded.
The process was repeated Q times to calculate the success rate.The median and boxplot of 50 repeats is presented.See Algorithm 5.

B. Runtime Analysis
The simulations and calculations were performed on a personal computing system equipped with an Intel R Core TM i5-8250U CPU, which has a base clock speed of 1.60 GHz and can boost up to 1.80 GHz.The system was configured with 8.00 GB of RAM (7.84 GB usable) to facilitate the computational demands of the simulation processes.It operated under a 64-bit Windows 11 Home edition, ensuring that the software utilized for simulations could leverage the x64-based processor architecture for optimal performance.

VI. CONCLUSION
Our study presents a novel model for analyzing coverage depth in DNA-based data storage, particularly focusing on combinatorial DNA encoding.We use the coupon collector's problem framework to model the reconstruction of combinatorial letters from sequencing data.We present a Markov Chain (MC) formulation for calculating the decoding probability and provide a tool for computing its probability.This solution is, however, limited in its scale due to the size of the state space.Further work can be done to allow this model to be scaled up, either by developing more efficient computation or by developing an approximation to the model.
One of the key aspects of the combinatorial approach is the strategic selection of Ω that consists of easily distinguishable k-mers.This, together with the use of a threshold t > 1 in the reconstruction algorithm (See Algorithm 1), effectively mitigates k-mer mix-up errors, as was demonstrated in [15].We therefore chose to ignore k-mer mix-up errors in the model used for the reconstruction probability.
We also present a unified model for analyzing coverage depth of a complete combinatorial storage system considering an inner-outer error correction model.We present theoretical bounds on the decoding probability using Sanov's theorem on the multinomial model for read distribution or using an empirical estimation.
We also provide a Python tool for determining the sequencing depth required to achieve a desired confidence level for a system, given design and encoding scheme.We demonstrate the tool's results on a selection of design parameter sets.
Future exploration in DNA-based data storage will significantly benefit from further understanding and optimizing coverage depth and from further improving efficient combinatorial coding.These elements are key to enhancing data storage capacity and reliability, promising exciting advancements in the field.

APPENDIX A EVOLUTION OF PROBABILITY IN THE COUPON COLLECTOR'S PROBLEM VIDEO
The coupon collector's parameters that are shown in the video, are: is the probability of collecting all n unique coupons within R trials.We will show that, The coupon collector's problem can be approached using the principle of inclusion-exclusion.The formula calculates the probability of collecting all n unique coupons within R trials.Let A i be the event that the i-th coupon is not collected in R trials.

P (A
Let K i=1 A i be the probability of not collecting at least one coupon in R trials.Note that we are interested in: And finally, , the total probability for target states, by summing probabilities of target states 13 return π K ,t (R) This follows from: where A I = i∈I A i .For j = 1: For j = 2: And generally: And clearly: Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.Expressing λ using the primal feasibility (68): Substituting p i and p 1 from (66) and (67): Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.Expressing p 1 with λ from (73): Expressing μ: • If p 1 = α, then μ = 0.
• If p 1 = α, then μ can be non-zero.If p 1 = α, we substitute α for p 1 in (74): Using the original expressions for p i from (66), and substituting μ we expressed in (75), we get: and recall that p 1 = α.

APPENDIX D TOOL IMPLEMENTATION
We provide a Python code for calculating the required sequencing depth given the system design parameter.Figure 9 presents an overview of the tool run.The implementation details of the different steps are outlined below.

A. Reconstruction Probability of a Single Combinatorial Letter
For the calculation of the reconstruction probability of a single combinatorial letter, the Markov Chain (MC) representation of the coupon collector's process is used.See Algorithm 6.

B. Reconstruction Probability of a Combinatorial Sequence
For the calculation of the reconstruction probability of a combinatorial sequence, first the reconstruction probability for a single letter, π K ,t (R), was calculated.Then, the reconstruction probability of the sequence is calculated using the binomial distribution or the normal approximation.See Algorithm 7.

C. Reconstruction Probability of a Complete Combinatorial Message
The calculation of the bound for the reconstruction probability of a complete combinatorial message is performed by splitting the calculation in two.First, a threshold ρ is found such that P (X ρ ≥ a) ≥ √ 1 − δ.This is done by an iterative search that uses the decoding probability calculation Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.for a single sequence and the binomial model.Next, the required number of total reads R all that ensures that 1 − P (E (ρ)) ≥ √ 1 − δ is found either by using Sanov's bound or by a numerical simulation of the uniform distribution.See Algorithm 8.

APPENDIX E EXAMPLES FOR RECONSTRUCTION OF A SINGLE COMBINATORIAL LETTER
More examples for decoding probability for varying number of analyzed reads (R) for different thresholds (t), with varying number of K and .

Fig. 1 .
Fig. 1.Schematic view of a combinatorial alphabet process.(a) The combinatorial letters are constructed from a set of N = 16 k-mers, Ω = {X 1 , . . ., X 16 }, creating |Σ| = 4096 letters.Each letter represents a subset of K = 5 k-mers, as seen on the bottom left and depicted in the grayed-out cells.(b) The combinatorial sequence of letters then undergoes a process of synthesis, sampling, and sequencing that represent the combinatorial DNA storage channel.

Algorithm 1 : 6 if
Decoding a Single Combinatorial Letter Input: A set R of R reads, a list of N k-mers from the set Ω, a robustness threshold t Output: A set of K inferred k-mers or FALSE if decoding fails 1 define the binomial combinatorial alphabet Σ = {σ 1 , ..., σ |Σ| } with |Σ| ≤ N K ; 2 initialize a counter for each k-mer in Ω; 3 while |R| > 0 do 4 extract next read r from R; 5 increment the counter for the k-mer observed in the read; the counters for K k-mers are larger or equal to t then 7 return these K k-mers as the inferred member k-mers of σ ; end end 8 return FALSE;

Fig. 2 .
Fig. 2. Combinatorial DNA system design.(a) Error correction scheme design.(b) Message decoding and reconstruction.i. Grouping.ii.Letter reconstruction.iii.Message decoding.(The pink square is the final message.).

Fig. 4 .
Fig. 4. Decoding probability for a varying number of analyzed reads (R) for different thresholds (t).Each subplot corresponds to a different threshold value (t).The analyses were conducted for K = 7 and = 0.01.(a) Results for t = 1, the blue line corresponds to the calculated probability based on the MC model, while the red line represents the median of 50 simulation runs, where each simulation calculates the success rate of a 100 uniform drawing of R reads across K member k-mers.The simulation results are also presented as boxplots.(b-d) Like (a), with t = 2, 3 and 4 respectively.

Fig. 5 .
Fig. 5. Decoding probability of a complete combinatorial sequence with varying redundancy levels.Results shown for a sequence of length m = 100, with K = 7, and requiring t = 4. (a) Calculated decoding probability (blue line) as a function of the number of analyzed reads for redundancy level of b = 85.Median results from 50 simulation runs are presented (red line) with boxplots representing the distribution of the simulation results.Each simulation run represents 100 uniformly drawn sets of R reads, each comprising m letters drawn from K = 7 member k-mers.(b-d) Like (a), with b = 90, 95, and 100, respectively.All analyses incorporate an error rate of = 0.01.(e-h) Like as (a-d), with t = 2.

Fig. 6 .
Fig. 6.Reconstructing a complete combinatorial message.(a) R all reads are distributed between l sequences, and at least a sequences must be decoded.(b) The decoding probability of each of the letters is analyzed using the coupon collector's model (blue bins indicate the member k-mers).(c) Each sequence requires b of the m combinatorial letters to be decoded.

Algorithm 2 :− δ then 4
Finding the Required Sequencing Depth R all for a Complete Message Data: Design parameters.Input: δ (Acceptable failure probability) Output: A value for R all ensuring decoding with probability 1 − δ 1 Initialize ρ to find threshold where P (X ρ ≥ a) ≥ √ 1 − δ; 2 for incrementing values of ρ do 3 if P (X ρ ≥ a) ≥ √ 1 Break loop and use found value of ρ; end end 5 Set R all = ρ × l ; 6 for incrementing values of R all do 7

Fig. 8 .
Fig. 8. Bounding the decoding probability.(a) Overall decoding probability P (Xρ ≥ a) (blue line) and the Sanov's bound on the probability of obtaining a read distribution across the sequences with min j (r j ) ≤ ρ, 1 − P (E (ρ)) (red line) as functions of the threshold T for a fixed number of analyzed reads R all = 3000.The threshold √ 1 − δ on the probability is marked with dotted lines for δ = 0.1 (pink dotted lines) and δ = 0.2 (green dotted lines).Setting ρ to any value between these lines ensures decoding with 1 − δ confidence.All values are calculated for K = 7, t = 4, = 0.01, R = 110, m = 10, b = 8, l = 10, a = 8.(b) Like (a), with P (E (ρ)) calculated using simulations instead of the Sanov's bound and where the total number of reads R all = 1000.

Fig. 9 .
Fig. 9. Complexity calculation tool workflow.(a) Overview of the tool's run including internal dependencies, input parameters, and outputs for each part.(b) Reconstruction probabilities of a single combinatorial position, π K ,t (ρ), calculated using the coupon collector's model (inset, like in Fig. 3) as a function of the threshold ρ.(c) Decoding probability for a full-length combinatorial sequence, P single (ρ, m, b), calculated using the binomial model with the probabilities from (a) as input.Plotted as a function of the threshold ρ.(d) Finding ρ.Full message decoding probability, P (Xρ ≥ a), calculated using the binomial model for Xρ obtained from (c).Plotted as a function of the threshold ρ.The target confidence level √ 1 − δ is presented in the red dotted line.(e) Finding R all given the selected ρ.The probability of considering enough read distributions (across the l sequences), 1 − P (E (ρ)), based on either theoretical bound or the empirical calculation.Plotted as a function of R all .The target confidence level √ 1 − δ is presented in the red dotted line.

Fig. 10 .
Fig. 10.Decoding probability for a varying number of analyzed reads (R) for different thresholds (t).Each subplot corresponds to a different threshold value (t).The analysis was conducted for K = 5 and = 0.01.(a) Results for t = 1, the blue line corresponds to the calculated probability based on the MC model while the red line represents the median of 50 simulation runs, where each simulation calculates the success rate of 100 uniform drawing of R reads across K member k-mers.The simulation results are also presented as boxplots.(b-d) Like (a), with t = 2, 3 and 4, respectively.

Fig. 11 .
Fig. 11.Decoding probability for a varying number of analyzed reads (R) for different thresholds (t).Each subplot corresponds to a different threshold value (t).The analysis was conducted for K = 10 and = 0.01.(a) Results for t = 1, the blue line corresponds to the calculated probability based on the MC model while the red line represents the median of 50 simulation runs, where each simulation calculates the success rate of 100 uniform drawing of R reads across K member k-mers.The simulation results are also presented as boxplots.(b-d) Like (a), with t = 2, 3 and 4, respectively.

Fig. 12 .
Fig. 12. Decoding probability for a varying number of analyzed reads (R) for different thresholds (t).Each subplot corresponds to a different threshold value (t).The analysis was conducted for K = 7 and = 0.1.(a) Results for t = 1, the blue line corresponds to the calculated probability based on the MC model while the red line represents the median of 50 simulation runs, where each simulation calculates the success rate of 100 uniform drawing of R reads across K member k-mers.The simulation results are also presented as boxplots.(b-d) Like (a), with t = 2, 3 and 4, respectively.

Fig. 13 .
Fig.13.Decoding probability for a varying number of analyzed reads (R) for different thresholds (t).Each subplot corresponds to a different threshold value (t).The analysis was conducted for K = 7 and = 0. (a) Results for t = 1, the blue line corresponds to the calculated probability based on the MC model while the red line represents the median of 50 simulation runs, where each simulation calculates the success rate of 100 uniform drawing of R reads across K member k-mers.The simulation results are also presented as boxplots.(b-d) Like (a), with t = 2, 3 and 4, respectively.
Numbers of KSee Fig.10, and Fig.11.B. Decoding Probability for Varying Number of Analyzed Reads (R) for Different Thresholds (t), and Varying Numbers ofSee Fig. 12, and Fig 13.
Sequencing Coverage Analysis for Combinatorial DNA-Based Storage Systems Inbal Preuss , Student Member, IEEE, Ben Galili , Zohar Yakhini , Member, IEEE, and Leon Anavy

and given values for m, b, l, and a, and a given number of analyzed reads R all , what is the probability of recovering the complete combinatorial message? Alternatively, what is the required number of reads R all that guarantees a desired confidence level δ?
2. The error correction scheme is characterized by two parameters, b and a.Every combinatorial sequence is assumed to be correctly recovered if at least b ≤ m letters/positions are correctly reconstructed (inner code).The message is fully recovered if at least a ≤ l sequences are successfully recovered (outer code).

TABLE I REQUIRED
SEQUENCING COVERAGE FOR DIFFERENT SETS OF DESIGN PARAMETERS AND CONFIDENCE LEVELS

then 6
30.See file A. Evolution of Probability in the Coupon Collector Problem Video K=5, t=2, R=30.gif.Number of unique items K , threshold t, number of reads R, number of simulations Q. Result: Success rate of achieving at least t occurrences of each item in R rounds over Q simulations.
Data: K , t, m, k , b, R, Q. Number of k-mers K, threshold t, sequence length m, number of selections n, required successful decodings b, number of reads per position R, number of simulations Q. Result: Success rate of reconstructing at least b positions from m in the sequence over Q simulations. 1 Initialize parameters K , t, m, n, b, R, Q 2 Initialize success_count ← 0 3 for _ ← 1 to Q do 5 if decoded _count ≥ b K , t, m, n, b, l , a, R all , Q. Number of unique k-mers K, threshold t, sequence length m, selections per sequence n, required successful decodings b, total number of sequences l, required sequences decoded a, total reads R all , number of simulations Q. Result: Success rate of decoding at least a sequences out of l in Q simulations. 1 Initialize parameters K , t, m, n, b, l , a, R all , Q Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

)
Algorithm 6: Calculated probability: Reconstruction of a Single Combinatorial Letter Data: K , t, , R. Number of unique k-mers K, t count of each k-mer, error probability , and number of reads R. Result: Probability distribution over states after R reads. 1 Initialize state space S ← K +t Initialize transition matrix A ← 0 (|S |×|S |) 3 for each state s i in S do Find index of new state s j in S Compute probability for transitioning to a new state, A[s i , s j ] ← P (s i , s j ), where P (s i , s j ) = (1 − ) × Compute probability for staying in the same state (without transitioning to a new state), A[s i , s i ] ← P (s i , s i ), where P (s i , s 2 7