Substring Density Estimation From Traces

In the trace reconstruction problem, one seeks to reconstruct a binary string s from a collection of traces, each of which is obtained by passing s through a deletion channel. It is known that <inline-formula> <tex-math notation="LaTeX">$\exp (\tilde {O}(n^{1/5}))$ </tex-math></inline-formula> traces suffice to reconstruct any length-n string with high probability. We consider a variant of the trace reconstruction problem where the goal is to recover a “density map” that indicates the locations of each length-k substring throughout s. We show that when <inline-formula> <tex-math notation="LaTeX">$k = c \log n$ </tex-math></inline-formula> where c is constant, <inline-formula> <tex-math notation="LaTeX">$\epsilon ^{-2}\cdot \text { poly} (n)$ </tex-math></inline-formula> traces suffice to recover the density map with error at most <inline-formula> <tex-math notation="LaTeX">$\epsilon $ </tex-math></inline-formula>. As a result, when restricted to a set of source strings whose minimum “density map distance” is at least <inline-formula> <tex-math notation="LaTeX">$1/\text {poly}(n)$ </tex-math></inline-formula>, the trace reconstruction problem can be solved with polynomially many traces.


I. INTRODUCTION
IN THE trace reconstruction problem, there is an unknown binary string s ∈ {0, 1}^n, which we wish to reconstruct based on T subsequences (or traces) of s. Each trace is obtained by passing the source string s through a deletion channel, which deletes each bit of s independently with probability p. The main question of interest is how many traces are needed in order to reconstruct s correctly.
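To make the channel model concrete, here is a minimal sketch of trace generation (the function name is ours, not from the paper):

```python
import random

def draw_trace(s: str, p: float) -> str:
    """Pass s through a deletion channel: each bit of s is deleted
    independently with probability p; the survivors are concatenated."""
    return "".join(bit for bit in s if random.random() >= p)
```

Each trace is therefore a random subsequence of s, and shorter traces become more likely as p grows.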
This problem was originally proposed by Batu et al. [1], motivated by problems in sequence alignment, phylogeny, and computational biology [2]. Most of the work on the trace reconstruction problem has focused on characterizing the minimum number of traces needed for reconstructing the source string s exactly. The most common formulation of the problem, known as worst-case trace reconstruction [3], requires the reconstruction algorithm to recover s ∈ {0, 1}^n exactly with high probability as n → ∞ for any string s ∈ {0, 1}^n. While this problem has received considerable attention, there is still a significant gap between upper and lower bounds on the number of traces needed. Currently, the best lower bound is Ω(n^{3/2}), while the best upper bound is exp(Õ(n^{1/5})), both due to Chase [4], [5].
The exponential gap between the best known lower and upper bounds has motivated the formulation of several variants of the trace reconstruction problem where tighter bounds can hopefully be obtained. For example, in the average-case trace reconstruction problem, s is assumed to be drawn uniformly at random from all {0, 1}^n strings. In this case, it is known that only T = exp(O(log^{1/3}(n))) traces are sufficient [6]. An approximate trace reconstruction problem, where a fraction of the recovered bits is allowed to be incorrect, has also been formulated [7], and the problem of finding the maximum likelihood sequence s from a small number of traces (possibly insufficient for exact reconstruction) has been recently studied [8]. We can also consider a more modest goal than the reconstruction of the source sequence s itself. One example that is particularly relevant to our discussion is the reconstruction of the k-subword deck of s [9], [10].
The k-subword deck of a binary sequence s ∈ {0, 1}^n is the multiset of all length-k substrings, i.e., {s[i : i + k − 1] : i = 1, . . . , n − k + 1}. Equivalently, the k-subword deck can be defined by the counts N_{s,x} of the number of times each x ∈ {0, 1}^k appears in s as a substring. As shown in [10], for k = O(log n), the k-subword deck S_k(s) can be recovered with poly(n) traces. The k-subword deck of a sequence is an important object in bioinformatics, with applications in error correction [11], [12], sequence assembly [13], [14], and genomic complexity analysis [15], [16]. In these contexts, the k-subword deck S_k(s) is often referred to as the k-spectrum, and each length-k substring is called a k-mer. Intuitively, as long as k is large enough, the k-subword deck can uniquely determine the source sequence s.
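As a concrete illustration (a sketch with our own function name), the k-subword deck is simply a multiset of substring counts:

```python
from collections import Counter

def subword_deck(s: str, k: int) -> Counter:
    """N_{s,x}: the number of times each length-k substring x occurs in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))
```

For example, subword_deck("0110", 2) counts "01", "11", and "10" once each; the counts always sum to n − k + 1.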
In fact, a classical result by Ukkonen [17] provides a necessary and sufficient condition for S_k(s) to uniquely determine s based on the length of the "interleaved repeats" in s [18]. In particular, if there are no repeats of length k − 1 in s, one can reconstruct s from S_k(s) by simply merging k-mers with a prefix-suffix match of length k − 1. More generally, given S_k(s), one can build the de Bruijn graph, where nodes correspond to (k − 1)-mers and edges correspond to k-mers, and s is guaranteed to be an Eulerian path in the graph [13], [19]. This is illustrated in Figure 1.
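This Eulerian-path reconstruction can be sketched as follows (a minimal Hierholzer-style implementation under the simplifying assumption that the deck admits an Eulerian path; names are ours):

```python
from collections import Counter, defaultdict

def string_from_deck(deck: Counter) -> str:
    """Rebuild a string from its k-subword deck by walking an Eulerian
    path in the de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(list)
    indeg, outdeg = Counter(), Counter()
    for kmer, count in deck.items():
        for _ in range(count):
            graph[kmer[:-1]].append(kmer[1:])
            outdeg[kmer[:-1]] += 1
            indeg[kmer[1:]] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((v for v in graph if outdeg[v] - indeg[v] == 1),
                 next(iter(graph)))
    stack, path = [start], []
    while stack:  # Hierholzer's algorithm
        if graph[stack[-1]]:
            stack.append(graph[stack[-1]].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])
```

For the deck of "01100" with k = 3 there is a unique Eulerian path and the string is recovered exactly; when interleaved repeats are present, several Eulerian paths (and hence several candidate reconstructions) may exist.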
While the k-subword deck is a natural intermediate goal towards the reconstruction of s (and can be recovered with only poly(n) traces), it does not capture all the information present in the traces. For example, the k-subword deck S_k(s) in Figure 1 also admits the reconstruction s′ = 000111000111000000000111, even though s′ should be easy to distinguish from s based on traces (by estimating the length of the second and third runs of zeros). Motivated by this shortcoming of the k-subword deck, we propose the idea of a k-mer density map, as a kind of localized k-subword deck where, in addition to knowing the number of times a given k-mer appears in s, we have some information about where it occurs.
For a k-mer x ∈ {0, 1}^k, let I_{s,x} ∈ {0, 1}^{n−k+1} be the indicator vector of the occurrences of x in s; i.e., I_{s,x}[j] = 1{s[j : j + k − 1] = x}, as illustrated in Figure 2. Notice that recovering the k-subword deck can be seen as recovering Σ_j I_{s,x}[j] for each x ∈ {0, 1}^k. Also notice that recovering s is equivalent to recovering I_{s,x} for all x ∈ {0, 1}^k. A k-mer density map can be obtained by computing

K_{s,x}[i] = Σ_j h(i, j) I_{s,x}[j]

for some "smoothing kernel" h(i, j), as illustrated in Figure 2.
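In code, the indicator vector looks as follows (a sketch; 0-indexed lists stand in for the paper's 1-indexed vectors):

```python
def indicator_vector(s: str, x: str) -> list:
    """I_{s,x}: entry j is 1 iff x occurs in s starting at position j."""
    k = len(x)
    return [1 if s[j:j + k] == x else 0 for j in range(len(s) - k + 1)]
```

Summing the entries gives N_{s,x}, the k-subword-deck count of x.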
Intuitively, for a given x, K_{s,x} gives a coarse indication of the occurrences of x in s. Moreover, if h is such that Σ_i h(i, j) = 1 for each j, it holds that

Σ_i K_{s,x}[i] = Σ_j I_{s,x}[j] = N_{s,x},

which means that the k-subword deck S_k(s) is a function of K_{s,x}, and the density map K_{s,x} can be thought of as a generalization of the k-subword deck that provides information about k-mer location.
We will focus on a specific choice of h(i, j) that will render K_{s,x} easier to estimate from the traces. We will let h(i, j) be the probability that a binomial random variable with j − 1 trials and probability parameter 1 − p is equal to i − 1; i.e., h(i, j) = \binom{j−1}{i−1}(1 − p)^{i−1} p^{j−i}, where p is the deletion probability. This is also the probability that the jth bit of s (if not deleted) ends up as the ith bit of a trace. Hence we have

K_{s,x}[i] = Σ_j \binom{j−1}{i−1}(1 − p)^{i−1} p^{j−i} I_{s,x}[j]

for i ∈ {1, . . . , n − k + 1}, where \binom{j−1}{i−1} = 0 if j < i. Notice that the maximum value of h(i, j) for a fixed j occurs when i ≈ j(1 − p), so the kernel h(·, j) has its peak shifted to the left, and K_{s,x} is a density map of occurrences of x in s shifted to the left. Operationally, (1 − p)^k K_{s,x}[i] is the probability that a fully preserved copy of x in s appears in position i of a given trace of s.
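The binomial kernel and the resulting density map can be computed directly from the definitions (a sketch; function names are ours):

```python
from math import comb

def h(i: int, j: int, p: float) -> float:
    """Probability that the j-th source bit, if not deleted, lands at
    position i of the trace: a Binomial(j-1, 1-p) variable equals i-1."""
    if j < i:
        return 0.0
    return comb(j - 1, i - 1) * (1 - p) ** (i - 1) * p ** (j - i)

def density_map(s: str, x: str, p: float) -> list:
    """K_{s,x}[i] = sum_j h(i, j) * I_{s,x}[j] (1-indexed i and j)."""
    k = len(x)
    m = len(s) - k + 1
    occurrences = [j for j in range(1, m + 1) if s[j - 1:j - 1 + k] == x]
    return [sum(h(i, j, p) for j in occurrences) for i in range(1, m + 1)]
```

With p = 0 the kernel is the identity and K_{s,x} = I_{s,x}; as p grows, the mass of each occurrence spreads out and shifts left toward i ≈ j(1 − p).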
We define the k-mer density map of s as K_s = [K_{s,x} : x ∈ {0, 1}^k] (the concatenation of all vectors K_{s,x}). If the k-mer density map K_s is known exactly, s can also be recovered exactly. This can be seen by noticing the invertibility of the upper-triangular matrix F that transforms the binary vector I_{s,x} into the vector K_{s,x} (for a fixed x). In particular, K_{s,x} = F I_{s,x}, where F[i, j] = h(i, j) = \binom{j−1}{i−1}(1 − p)^{i−1} p^{j−i}. The matrix F is upper triangular with non-zero entries on the main diagonal, which makes it invertible. While invertible, F can be ill-conditioned, since some of the entries on the diagonal can be close to 0, making the transformation from K_{s,x} to I_{s,x} potentially sensitive to noise in K_{s,x}. To summarize, we choose to focus on the kernel h(i, j) = \binom{j−1}{i−1}(1 − p)^{i−1} p^{j−i} for several reasons: it allows us to estimate K_s efficiently from traces; it leads to a K_s that determines s uniquely, which ultimately leads to a consistent estimator for s; it has an intuitive operational definition; it is a generalization of the k-subword deck, which is an important existing tool in the literature; and, as we will later see, it is strictly more powerful than the k-subword deck for distinguishing between source strings with poly(n) traces using currently known theorems.
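Since F is upper triangular, the relation K_{s,x} = F I_{s,x} can be inverted by back-substitution. The sketch below (our own helper names, pure Python) builds F and recovers an indicator vector exactly from a noiseless density map:

```python
from math import comb

def build_F(m: int, p: float) -> list:
    """F[i][j] = C(j-1, i-1) (1-p)^{i-1} p^{j-i} for j >= i (formula is
    1-indexed; the lists are 0-indexed).  Upper triangular, with diagonal
    entries (1-p)^{i-1} > 0, hence invertible."""
    return [[comb(j, i) * (1 - p) ** i * p ** (j - i) if j >= i else 0.0
             for j in range(m)] for i in range(m)]

def back_substitute(F: list, K: list) -> list:
    """Solve F v = K for upper-triangular invertible F."""
    m = len(K)
    v = [0.0] * m
    for i in reversed(range(m)):
        v[i] = (K[i] - sum(F[i][j] * v[j] for j in range(i + 1, m))) / F[i][i]
    return v
```

The diagonal entries (1 − p)^{i−1} decay geometrically, which is exactly why F, though invertible, becomes ill-conditioned for large m; with a noiseless K the recovery is still exact.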
We present an algorithm that, given T traces, constructs an estimate K̂_s of the k-mer density map. Our main result establishes that, for k = c log n where c is constant and for p < 0.5, we can achieve estimation error ∥K̂_s − K_s∥_∞ < ϵ using ϵ^{−2} · poly(n) traces. Hence, the density map K_s can be estimated with maximum error ϵ_n = 1/g(n) for g(n) ∈ poly(n) using polynomially many traces. In particular, given a set of candidate source strings A ⊂ {0, 1}^n such that ∥K_s − K_{s′}∥_∞ ≥ 1/poly(n) for any distinct s, s′ ∈ A, the true source sequence s ∈ A can be recovered with ϵ^{−2} · poly(n) traces. This adds to the existing literature on classes of strings recoverable/distinguishable with polynomially many traces [20], [21], [22].
We point out that typical examples of pairs of strings that are hard to distinguish from traces can be verified to have a k-mer density map distance that decays no faster than 1/poly(n). Consider, for example, the pair s = 0^{m−1}10^m and s′ = 0^m10^{m−1}, where n = 2m. The Hamming distance between s and s′ is 2 and the Levenshtein distance is 2. Note that the k-subword decks are identical for k ≤ m and, hence, recovering the k-subword deck from the traces is not sufficient to distinguish between s and s′. However, we can verify that ∥K_s − K_{s′}∥_∞ ≥ 1/g(n), where g(n) ∈ poly(n), which implies that s and s′ are distinguishable with poly(n) traces. To see that, notice that for k ∈ {1, 2, . . . , m}, the k-mer density maps of s and s′ differ for the k-mer x = 10^{k−1}.
We examine i = m(1 − p) because it is the most likely position for x to end up at in a trace of s if x itself is preserved. After simple manipulations, it can be seen that |K_{s,x}[i] − K_{s′,x}[i]| ≥ 1/g(n) for some g(n) ∈ poly(n) at this choice of i. One can similarly verify that other pairs of strings s and s′ that are considered hard to distinguish (such as alternating strings of 0s and 1s that differ in only one position) have a k-mer density map distance that satisfies ∥K_s − K_{s′}∥_∞ ≥ 1/g(n), where g(n) ∈ poly(n). In fact, the set of string pairs satisfying this condition appears to be very dense, and thus far, we have not been able to find a pair of strings s and s′ so that ∥K_s − K_{s′}∥_∞ = o(1/g(n)) for any g(n) ∈ poly(n). While we do not theoretically analyze the maximum cardinality of A, we plot in Figure 3 the minimum k-mer density map distance, i.e., min_{s ≠ s′ ∈ {0,1}^n} ∥K_s − K_{s′}∥_∞, for k = 3, n ∈ {4, 5, . . . , 15}, and a range of p values in [0, 0.5]. We choose k = 3 because it is closest to log(15) ≈ 2.71, and our results are focused on k = c log(n). Observe that for all values of n and p considered, the curves are above the plot of 1/n given by the black curve, which means that for every string pair considered, the 3-mer density map distance is above 1/n. Furthermore, while the curves for various p are not completely smooth, the log-log plot shown on the right suggests that, at least for small values of n, the minimum distance decays as 1/poly(n).
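The example above is easy to check numerically. The sketch below (our own helper names; a small fixed m rather than the asymptotic regime) computes the 1-mer density maps of s = 0^{m−1}10^m and s′ = 0^m10^{m−1} and confirms that their ℓ∞ distance is far from negligible even though the 1-subword decks agree:

```python
from math import comb

def h(i, j, p):
    """Binomial kernel: P(Binomial(j-1, 1-p) = i-1)."""
    if j < i:
        return 0.0
    return comb(j - 1, i - 1) * (1 - p) ** (i - 1) * p ** (j - i)

def density(s, x, p):
    """K_{s,x} computed directly from the definition (1-indexed)."""
    k = len(x)
    m = len(s) - k + 1
    return [sum(h(i, j, p) for j in range(1, m + 1)
                if s[j - 1:j - 1 + k] == x) for i in range(1, m + 1)]

m, p = 8, 0.25
s1 = "0" * (m - 1) + "1" + "0" * m    # 0^{m-1} 1 0^m
s2 = "0" * m + "1" + "0" * (m - 1)    # 0^m 1 0^{m-1}
dist = max(abs(a - b) for a, b in zip(density(s1, "1", p),
                                      density(s2, "1", p)))
# near i = m(1-p), the maps differ by roughly h(i, m) - h(i, m+1)
```

The 1-subword decks of s1 and s2 are identical (each string contains a single 1), yet dist is on the order of 1/m rather than exponentially small in m.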
Since I_{s,x} and K_{s,x} are related through an invertible linear transformation K_{s,x} = F I_{s,x}, a natural way to attempt to reconstruct the source string s is by first computing Î_{s,x} = F^{−1} K̂_{s,x} (or, if F is ill-conditioned, by considering a regularized least squares formulation). The solution Î_{s,x} can then be converted into a reconstructed string ŝ ∈ {0, 1}^n through an approach that greedily selects candidate k-mers for each position based on Î_{s,x}. We describe the details of a concrete approach to reconstruct s from K̂_{s,x}, prove the approach yields a consistent estimator of s, and provide some simulation results demonstrating good performance in Section IV. Hence, in contrast to much of the theoretical literature on the trace reconstruction problem, the k-mer density map leads to new reconstruction approaches.
Our main result relies on a nontrivial estimator for K_{s,x} that simultaneously uses count information for all binary strings y that are supersequences of x. The estimator is obtained by first deriving a recursive formula for K_{s,x}, then applying a known result in the combinatorics of strings to the expansion of the recursive formula to obtain a non-recursive formula. An application of McDiarmid's inequality is then used to prove the estimator is successful with high probability. To the best of our knowledge, these techniques have not appeared in the trace reconstruction literature, where most recent results have been based on complex analysis [3], [4], [5], [6], [23]. Our techniques also lead to an improvement on a previously known upper bound [10] on the number of traces needed for reconstructing the k-subword deck of s for p < 0.5.
In particular, we re-derive an estimator from [10] using proof techniques similar to those used in proving our main result, and use McDiarmid's inequality as opposed to the Chernoff bound used in [10] in order to obtain an improved upper bound on the number of traces needed by the estimator.

A. Related Work
The current best upper bounds for worst-case trace reconstruction [3], [5], [23] are obtained using algorithms that consider each pair of possible source strings y, z ∈ {0, 1}^n and decide whether the set of traces looks more like a set of traces from y, or more like a set of traces from z (we formalize this shortly). Then, if there are enough traces, with high probability the true source string s will be the unique string such that the set of traces looks more like it came from s than from any other string in {0, 1}^n. Therefore, the trace reconstruction problem is closely related to the trace-distinguishing problem [21], [22], where we want to decide from the set of traces whether the source string is either y or z with high probability. The best existing upper bound of exp(Õ(n^{1/5})) for worst-case trace reconstruction [5] is proved using the fact that any two strings can be distinguished using exp(Õ(n^{1/5})) traces. It has also been shown that string pairs at constant Hamming distance can be distinguished using poly(n) traces by Krishnamurthy et al. [20], and separately by Grigorescu et al. using different techniques [21]. It was recently shown that strings at constant Levenshtein distance can be distinguished using poly(n) traces by Sima and Bruck [22].
To distinguish between two strings y and z from a set of traces, current state-of-the-art algorithms identify a function f_{y,z} such that |E[f_{y,z}(Ỹ)] − E[f_{y,z}(Z̃)]| is sufficiently large, where Ỹ denotes a trace of y and Z̃ a trace of z. Given T traces S̃_1, . . . , S̃_T of a source string s, E[f_{y,z}(S̃)] can be estimated as (1/T) Σ_{i=1}^{T} f_{y,z}(S̃_i), and we say that y beats z if this empirical average is closer to E[f_{y,z}(Ỹ)] than to E[f_{y,z}(Z̃)]. If |E[f_{y,z}(Ỹ)] − E[f_{y,z}(Z̃)]| is large enough, then, assuming the source string is y or z, we can distinguish between the two cases given a reasonable number of traces. If there is a unique string u such that u beats all other strings, then u is output as the reconstruction.
The first such function f_{y,z}, introduced by Nazarov and Peres [23], and independently by De et al. [3], is simply the value of a single bit in the trace, i.e., f_{y,z}(S̃) = S̃[i] for some index i that depends on y and z ([9] also uses a single-bit approach). An argument based on an existing result on complex-valued polynomials was used to show that, for any pair of strings y, z, there is some index i_{y,z} at which the expected trace bits differ by at least exp(−O(n^{1/3})). Using this result along with a standard concentration inequality, the upper bound of exp(O(n^{1/3})) traces is obtained. This choice of f_{y,z} is known in the literature as a single-bit statistic. The current best upper bound by Chase [5] picks a more complicated function f_{y,z} that involves multiple bits, and proves a novel result on complex polynomials in order to prove a gap of exp(−Õ(n^{1/5})), which yields an upper bound of exp(Õ(n^{1/5})) traces. A choice of f_{y,z} that uses multiple bits in combination (e.g., that of Chase [5]) is known in the literature as a multi-bit statistic. There are similarities between the approach used in [5] and the approach used in this paper, and it may be possible to obtain results similar to those in this paper based on the complex analysis techniques explored in [3], [5], and [23].
Obtaining lower bounds on the number of traces needed for trace reconstruction amounts to proving a lower bound on the number of traces required to distinguish two strings (the trace-distinguishing problem), since any algorithm that solves the trace reconstruction problem can be used to solve the trace-distinguishing problem. For a string a, let a^i denote the string where a is repeated i times. The current best lower bound of Ω(n^{3/2}) for trace reconstruction, discovered by Chase [4], is obtained by analyzing the string pair (01)^m 1 (01)^{m+1} and (01)^{m+1} 1 (01)^m, where n = 4m + 3.
The trace reconstruction problem has also been considered in the smoothed complexity model by Chen et al. [10], in which a worst-case string is passed through a binary symmetric channel before trace generation, and the noise-corrupted string needs to be reconstructed with high probability (recall that a binary symmetric channel flips each bit in the string independently with some fixed probability). Chen et al. proved that in the smoothed complexity model, trace reconstruction requires poly(n) traces. This result relies on the simple fact that if there are no repeated substrings of length k − 1 or greater in the source string s, then the k-subword deck uniquely determines s. The authors prove that for an arbitrary string, the (log n)-subword deck can be reconstructed with high probability using poly(n) traces, and prove there will not be any repeats in the source string of length at least (log n − 1) with high probability after it is passed through a binary symmetric channel, thereby proving that poly(n) traces suffice for trace reconstruction in the smoothed complexity model.
The bulk of the work done in [10] is proving that the (log n)-subword deck can be reconstructed with high probability using poly(n) traces for any p < 1. In order to prove this, a formula for the number of times a substring x is present in the source string s is derived in terms of the expected number of times a trace contains x and the expected number of times a trace contains supersequences of x. The expected values of these statistics are then estimated from the set of traces in order to estimate the number of times x appears in s. Concentration inequalities are used to prove that the number of times x appears is estimated correctly with high probability. Narayanan and Ren [24] provide a similar result on reconstructing the (100 log n)-subword deck for circular strings in poly(n) time. Our result showing that the (log n)-subword deck can be computed using poly(n) traces with high probability for p < 0.5 was originally proved independently, without knowledge of [10] and [24].

B. Notation
Strings in this paper are binary and indexed starting from 1.
Let s ∈ {0, 1}^n be the length-n string we are trying to recover. The string s will be called the source string. A trace of s is denoted by S̃, and is generated by deleting each bit of s independently with probability p.
For a given string x, we let |x| denote the length of x. For a string a and integer r, a^r denotes the string formed by concatenating r copies of a. A subsequence of x is a string that can be formed by deleting elements from x, and a supersequence of x is a string that can be formed by inserting elements into x. This is in contrast to a substring of x, which is a string that appears contiguously in x. We let x[i : j] be the substring of x that begins at position i and ends at position j. For example, if s = 10101, then s[1 : 3] = 101 and s[2 : −2] = 010.
For two strings x and y, the number of ways we can make |y| − |x| deletions on y to form x is denoted by \binom{y}{x} [8], [25]. This is a generalization of the binomial coefficient to strings. Observe that \binom{y}{x} ≤ \binom{|y|}{|x|}. To simplify our notation, for strings x, y, we will also let \binom{y}{x}′ = \binom{y[2 : −2]}{x[2 : −2]}.
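The string binomial coefficient can be computed with a standard subsequence-counting dynamic program (a sketch; the function name is ours):

```python
def string_binom(y: str, x: str) -> int:
    """Number of ways to delete |y| - |x| symbols of y so that x remains,
    i.e. the number of embeddings of x in y as a subsequence."""
    k = len(x)
    # dp[j] = ways to produce x[:j] from the prefix of y scanned so far
    dp = [0] * (k + 1)
    dp[0] = 1
    for c in y:
        for j in range(k, 0, -1):  # descend so each symbol of y is used once
            if x[j - 1] == c:
                dp[j] += dp[j - 1]
    return dp[k]
```

As the text notes, string_binom(y, x) never exceeds the ordinary binomial coefficient C(|y|, |x|), with equality when y is a constant string.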

II. MAIN RESULTS
Let s ∈ {0, 1}^n denote the source string and x ∈ {0, 1}^k denote the target k-mer, whose density K_{s,x} we wish to estimate. To simplify the notation, we fix a constant c > 0 and define the quantity f_c(n), which is poly(n) for p < 0.5 (see Section III).

[Figure 4. (a) Plot of β_c(p); as c increases, more traces are needed, and for all values of c, the limit of β_c(p) as p → 0 is equal to 3. (b) Plot of the exponent in (6) versus the exponent in (7) for c = 1; our bound is significantly tighter than the existing bound.]

Theorem 1: For p < 0.5 and k = c log n, given log(2/δ) · ϵ_n^{−2} · f_c(n) traces, an estimator K̂_{s,x}[i] can be constructed so that |K̂_{s,x}[i] − K_{s,x}[i]| < ϵ_n with probability 1 − δ. Moreover, given log(2n^{1+c} · 2/δ) · ϵ_n^{−2} · f_c(n) traces, an estimator for the entire density map K̂_s can be constructed so that ∥K̂_s − K_s∥_∞ < ϵ_n with probability 1 − δ.
In particular, Theorem 1 implies that for p < 0.5 and ϵ_n = 1/g(n) where g(n) ∈ poly(n), all entries of the (c log n)-mer density map K_s can be estimated with error at most ϵ_n using poly(n) traces (and poly(n) time, as discussed in Section III). Theorem 1 also implies that the trace reconstruction problem restricted to a set of binary strings with a bounded minimum density map distance can be solved with poly(n) traces.
Corollary 1: Suppose p < 0.5 and k = c log n, and let A ⊂ {0, 1}^n be such that ∥K_s − K_{s′}∥_∞ ≥ 1/g(n) for some g(n) ∈ poly(n) and any distinct s, s′ ∈ A. Then, for any δ > 0, given polynomially many (in n and log(1/δ)) traces from some source string s ∈ A, s can be correctly identified with probability 1 − δ.
Consequently, the trace reconstruction problem restricted to a set of binary strings with minimum density map distance 1/g(n), where g(n) ∈ poly(n), can be solved with poly(n) traces. We have not been able to find a pair of strings s and s′ violating this condition, and to the best of our knowledge, no such example of s and s′ is known. Note that the restriction p < 0.5 in our results arises from a limitation in our application of McDiarmid's inequality to upper bounding the number of traces needed by our estimator for K_{s,x}. While our results are derived for k = c log(n), it is straightforward to show that the results still hold if k = O(log(n)) (e.g., if k is constant) using the same application of McDiarmid's inequality.
Recall that, from [10], one can recover the (c log n)-subword deck using poly(n) traces with high probability. A pair of strings s, s′ can be distinguished based on their (c log n)-subword decks alone if and only if their (c log n)-subword decks are distinct, which is equivalent to requiring N_{s,x} ≠ N_{s′,x} for some x ∈ {0, 1}^{c log n}. Our main result also implies that as long as |∥K_{s,x}∥_1 − ∥K_{s′,x}∥_1| ≥ 1 for some x, s and s′ can be distinguished with poly(n) traces, since, due to the equivalence of ℓ_∞ and ℓ_1 norms and the reverse triangle inequality,

∥K_{s,x} − K_{s′,x}∥_∞ ≥ (1/n) ∥K_{s,x} − K_{s′,x}∥_1 ≥ (1/n) |∥K_{s,x}∥_1 − ∥K_{s′,x}∥_1| ≥ 1/n.

In other words, whenever s and s′ are distinguishable based on their (c log n)-subword decks, the (c log n)-mer density maps of s and s′ have distance at least 1/n, which implies that s and s′ can be distinguished with poly(n) traces by Theorem 1.
On the other hand, there are examples where the (c log n)-mer density maps of s and s′ differ by at least 1/poly(n), implying that s and s′ can be distinguished with poly(n) traces, while the (c log n)-subword decks of s and s′ are identical. One such example is given in the introduction, where we show that s = 0^{m−1}10^m and s′ = 0^m10^{m−1} with n = 2m are such that ∥K_s − K_{s′}∥_∞ ≥ 1/poly(n). Thus, using currently known theorems, the k-mer density map is strictly more powerful for distinguishing strings with poly(n) traces than the k-subword deck.
A special case of Corollary 1 with an explicit condition is given below and proved in the appendix.
Corollary 2: Suppose p < 0.5 and k = c log n. If the strings s, s′ are such that x ∈ {0, 1}^k begins at position i in s and x does not appear in s′ at an index in a range around i parameterized by a constant a > 0.5, then s can be distinguished from s′ with high probability using poly(n) traces.
Corollaries 1 and 2 are an addition to the literature on conditions for distinguishing strings from traces, which includes the facts that strings at constant Hamming distance can be distinguished using poly(n) traces [20], [21], and strings at constant Levenshtein distance can be distinguished using poly(n) traces [22].
Observe that there are string pairs that our results immediately prove are poly-trace distinguishable, but that the results of [10], [20], [21], and [22] do not. Consider the string pair s = 0^m 1^{log(m)} 0^{1.1m} and s′ = 0^{1.1m} 1^{log(m)} 0^m, where n = 2.1m + log(m). Observe that s, s′ do not have constant Hamming or Levenshtein distance, and have the same (c log(n))-spectrum for any constant c, so previous results [10], [20], [21], [22] do not apply.
Improved upper bound for k-subword deck reconstruction: Using our proof technique for estimating K_{s,x}, we also give a novel proof that the (c log n)-subword deck can be reconstructed using poly(n) traces for p < 0.5, which yields an improved upper bound on the required number of traces compared to the analysis of the algorithm for p < 0.5 in [10].
Theorem 2: For p < 0.5, we can reconstruct the (c log n)-subword deck of any source string s ∈ {0, 1}^n from poly(n) traces in poly(n) time with high probability.
In contrast, the analysis in [10] proves that a larger number of traces is sufficient for this task. Comparing the exponents for p < 0.5, we prove in the appendix that our upper bound is asymptotically tighter for any c and any p < 0.5; see Figure 4(b) for a plot showing the comparison. In particular, for c close to zero, (6) is close to linear in n, nearly matching the following lower bound.

Theorem 3: For deletion probability p and any source string s ∈ {0, 1}^n, we have that Ω(np(1 − p)) traces are necessary for recovering the g(n)-subword deck for any function g such that g(n) ≥ 2.
Proof: Suppose we want to decide whether the k-subword deck of the source string is the k-subword deck of s_1 = 1^{n/2} 0^{n/2}, or whether it is the k-subword deck of s_2 = 1^{(n/2)+1} 0^{(n/2)−1}. Observe that for any k ≥ 2, the only length-n string that has the k-subword deck of s_1 is s_1 itself, and likewise, the only length-n string that has the k-subword deck of s_2 is s_2 itself. Thus, distinguishing between these two k-subword decks is equivalent to distinguishing between s_1 and s_2. From Section 4.2 of [1], we know that Ω(np(1 − p)) traces are necessary for distinguishing between s_1 and s_2.
We did not compare our upper bound on trace complexity to the analysis of the algorithm in [10] that reconstructs the (log n)-subword deck using poly(n) traces for p < 1, because an explicit upper bound for this algorithm has not appeared in the literature, to the best of our knowledge. As discussed in Section IV, we also prove in the appendix that the initial algorithm for k-subword deck reconstruction presented in [10] only needs poly(n) traces for reconstructing the (log n)-subword deck of a string for p < 0.5, which was previously unknown.

III. AN ESTIMATOR FOR THE k-MER DENSITY MAP
In this section, we describe our estimator for the k-mer density map, prove Theorem 1, and discuss the runtime of computing the estimator. We first introduce some additional notation. For a string x, let Y_ℓ(x) be the set of length-ℓ supersequences of x that have the same first and last bit as x. For example, if x = 101, then Y_4(x) = {1011, 1101, 1001}.
For source string s and k-mer x, let P_{s,x}[i] = Pr(S̃[i : i + |x| − 1] = x), i.e., the probability that x appears at position i in a trace S̃ of s. Notice that it is straightforward to estimate P_{s,x}[i] from the set of traces S̃_1, . . . , S̃_T as

P̂_{s,x}[i] = (1/T) Σ_{t=1}^{T} 1{S̃_t[i : i + |x| − 1] = x}.

Recall that the entry in the k-mer density map K_s corresponding to the substring x at index i of s is defined as K_{s,x}[i] = Σ_j h(i, j) I_{s,x}[j]. In order to estimate K_{s,x}[i], we first write it in terms of P_{s,x}[i] and P_{s,y}[i] for all y of length greater than k. This will then allow us to estimate K_{s,x}[i] using the estimates of P_{s,x}[i] and P_{s,y}[i], which can be obtained directly from the set of traces.
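The empirical estimate P̂_{s,x}[i] is simply a fraction of traces (a sketch with our own function name; 1-indexed i as in the text):

```python
def estimate_P(traces: list, x: str, i: int) -> float:
    """Fraction of traces whose substring starting at position i equals x:
    an unbiased estimate of P_{s,x}[i]."""
    k = len(x)
    return sum(t[i - 1:i - 1 + k] == x for t in traces) / len(traces)
```

By the law of large numbers, this converges to P_{s,x}[i] as the number of traces T grows.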
Lemma 1: For source string s and k-mer x, we have

K_{s,x}[i] = Σ_{ℓ=k}^{n} (−p)^{ℓ−k} (1 − p)^{−ℓ} Σ_{y ∈ Y_ℓ(x)} \binom{y}{x}′ P_{s,y}[i].

Proof: We begin by deriving a recursive formula for K_{s,x}[i]. We first notice that

P_{s,x}[i] = Σ_{ℓ=k}^{n} (1 − p)^k p^{ℓ−k} Σ_{y ∈ Y_ℓ(x)} \binom{y}{x}′ K_{s,y}[i].    (11)

This follows because, in order for x to appear at position i in a trace, for some supersequence y ∈ Y_ℓ(x) that appears at position j ≥ i in s, |y| − |x| bits from y must be deleted, and j − i bits in front of y must be deleted. Notice that \binom{y}{x}′ (1 − p)^k p^{ℓ−k} is the probability that a copy of y in s becomes x in S̃, and K_{s,y}[i] is the probability that the beginning of a copy of y in s is shifted to position i in S̃.
Notice that, for ℓ = k, the only term in the summation in (11) is (1 − p)^k K_{s,x}[i], since Y_k(x) = {x} and \binom{x}{x}′ = 1. This allows us to rewrite (11) as

K_{s,x}[i] = (1 − p)^{−k} P_{s,x}[i] − Σ_{ℓ=k+1}^{n} p^{ℓ−k} Σ_{y ∈ Y_ℓ(x)} \binom{y}{x}′ K_{s,y}[i].    (13)

By recursively applying (13) into itself, we write K_{s,x}[i] in terms of P_{s,y}[i] terms. This yields

K_{s,x}[i] = Σ_{ℓ=k}^{n} p^{ℓ−k} (1 − p)^{−ℓ} Σ_{y ∈ Y_ℓ(x)} a_{s,x,y} P_{s,y}[i],    (14)

where a_{s,x,y} ∈ Z is a constant that depends on s, x, y. Observe that a_{s,x,y} obeys the following recursion: every time a term involving P_{s,z}[i] appears in the expansion of (13) with coefficient a_{s,x,z}, the next step of the expansion produces, for each supersequence y ∈ Y_ℓ(z), a term with coefficient −a_{s,x,z} \binom{y}{z}′. One step of the expansion is shown in the appendix to illuminate this argument. We proceed to prove that for any s, x, and y ∈ Y_ℓ(x), we have that

a_{s,x,y} = (−1)^{ℓ−k} \binom{y}{x}′.

We will use the following lemma, which appears as Corollary (6.3.9) in [25].

Lemma 2: For any two strings f, g over an alphabet Σ,

Σ_{w} (−1)^{|w| − |g|} \binom{f}{w} \binom{w}{g} = δ_{f,g},

where δ_{f,g} = 1 if f = g and 0 otherwise. Applying Lemma 2 to the alternating sum over chains of intermediate supersequences in the expansion of (13) yields |a_{s,x,y}| = \binom{y}{x}′ with sign (−1)^{ℓ−k}. By plugging in this formula for a_{s,x,y} into (14), we obtain the formula in Lemma 1.

Lemma 1 allows us to obtain an unbiased, consistent estimator for K_{s,x}[i], given by

K̂_{s,x}[i] = Σ_{ℓ=k}^{n} (−p)^{ℓ−k} (1 − p)^{−ℓ} Σ_{y ∈ Y_ℓ(x)} \binom{y}{x}′ P̂_{s,y}[i],    (24)

where P̂_{s,y}[i] is defined analogously to P̂_{s,x}[i].
One way to analyze the performance of our estimator K̂_{s,x}[i] would be to apply a standard concentration inequality, such as the Chernoff bound, to each of the terms P̂_{s,y}[i] and use that to bound the error of K̂_{s,x}[i]. However, this yields a suboptimal analysis, as we do not need to guarantee the accuracy of each P̂_{s,y}[i] term. Directly analyzing the accuracy of K̂_{s,x}[i] is more subtle, as K̂_{s,x}[i] is not a sum of independent random variables. To that end, we apply McDiarmid's inequality to analyze the deviation of K̂_{s,x}[i] from K_{s,x}[i] directly.
Lemma 3: For source string s, a k-mer x with |x| = c log n, and a set of T traces, we have

Pr(|K̂_{s,x}[i] − K_{s,x}[i]| ≥ ϵ) ≤ 2 exp(−2ϵ²T / f_c(n)),

where f_c(n) is poly(n) for p < 0.5 and constant c.

Proof: In order to apply McDiarmid's inequality, we will view the estimator K̂_{s,x}[i] as a function of Tn independent random variables, namely the indicator random variables that indicate whether a particular bit is deleted in a particular trace. To apply McDiarmid's inequality, we have to upper bound how much the estimator can change by changing the value of one of these indicator random variables. Changing the value of the indicator random variable for a particular bit in the tth trace S̃_t changes the estimator by at most

b = (2/T) Σ_{ℓ=k}^{n} p^{ℓ−k} (1 − p)^{−ℓ} max_{y ∈ Y_ℓ(x)} \binom{y}{x}′,

where the factor 2/T follows because changing a single bit in a single trace can change P̂_{s,x}[i] by at most 1/T, and can change P̂_{s,y}[i] by at most 1/T for at most two values of y ∈ Y_ℓ(x) for each ℓ. To analyze b when k = c log(n) for c constant and p < 0.5, we bound \binom{y}{x}′ ≤ \binom{ℓ}{k} ≤ 2^{ℓH(k/ℓ)} and write q = p/(1 − p), so that

b ≤ (2n/T) (1 − p)^{−k} q^{−k} max_ℓ 2^{ℓH(c log(n)/ℓ)} q^{ℓ},

where the maximizing ℓ is given by ℓ* = log(n^c)/(1 − p/(1 − p)). This is because 2^{xH(c log(n)/x)} q^x is differentiable for x ∈ (c log(n) + 1, n), where q = p/(1 − p), and x = log(n^c)/(1 − q) is the only zero of its derivative and is a local maximum. It is easy to see that x = log(n^c)/(1 − q) is a local maximum by observing that the derivative is positive at x = log(n^c)/(1 − q) − ϵ and negative at x = log(n^c)/(1 − q) + ϵ for ϵ > 0 small enough. We therefore have that Tb is poly(n) when k = c log(n) for c constant and p < 0.5. Let f_c(n) := n(Tb)², which does not depend on T since b is proportional to 1/T. Applying McDiarmid's inequality with the Tn bounded differences b, we have

Pr(|K̂_{s,x}[i] − K_{s,x}[i]| ≥ ϵ) ≤ 2 exp(−2ϵ² / (Tnb²)) = 2 exp(−2ϵ²T / f_c(n)),

so that T = f_c(n) · log(2/δ) / (2ϵ²) traces suffice for recovering K_{s,x}[i] with error less than ϵ with probability at least 1 − δ.
Moreover, by the union bound over the 2^k(n − k + 1) ≤ 2n^{1+c} entries of K_s, we have that

Pr(∥K̂_s − K_s∥_∞ ≥ ϵ) ≤ 2n^{1+c} · 2 exp(−2ϵ²T / f_c(n)).

Setting Pr(∥K̂_s − K_s∥_∞ ≥ ϵ) ≤ δ, we conclude that T = f_c(n) · log(2n^{1+c} · 2/δ) / (2ϵ²) traces suffice for recovering K_s with infinity norm error less than ϵ with probability at least 1 − δ.
Observe that we can compute the estimate of Ks,x[i] in (24) by iterating through all T traces and constructing a linked list for each ℓ ∈ {k + 1, ..., n} that stores all the length-ℓ strings that start at position i in a trace, along with the number of times each is seen in the set of traces. This requires O(T^2 n^3) time and Õ(T n^3) space, because there are at most n substrings y starting at position i in each trace, and the string binomial coefficient \binom{y}{x'} can be computed in O(n^2) time using dynamic programming (see Proposition 6.3.2 in [25]). It is worth noting that the expected runtime would be faster in practice if a hash table is used to store each y seen for each ℓ.
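For concreteness, a minimal sketch of that dynamic program (the function name is illustrative):

```python
def subseq_count(y, x):
    """Number of ways x appears as a subsequence of y, i.e. the string
    binomial coefficient (y choose x), via the standard O(|y||x|) DP."""
    m = len(x)
    # dp[j] = number of ways x[:j] appears as a subsequence of the prefix
    # of y processed so far; dp[0] = 1 (the empty subsequence).
    dp = [1] + [0] * m
    for ch in y:
        # iterate j downward so each character of y is matched at most once
        for j in range(m, 0, -1):
            if x[j - 1] == ch:
                dp[j] += dp[j - 1]
    return dp[m]
```

For example, subseq_count("0101", "01") counts the three ways of picking a 0 followed by a 1.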
In conjunction with our trace complexity analysis, the approach described above for computing the estimator implies that for x of length c log(n), we can estimate Ks,x[i] with error at most 1/poly(n) with high probability in poly(n, 1/ϵ) time, and can therefore estimate the c log(n)-mer density map with maximum error 1/poly(n) with high probability in poly(n, 1/ϵ) time.

IV. STRING RECONSTRUCTION BASED ON k-MER DENSITY MAP ESTIMATION
In this section we show, through simulations, that estimating the k-mer density map using our estimator is indeed useful for reconstructing s. We implemented the estimator (24) used to derive our main result, including the optimization involving hash tables to speed up computation described at the end of Section III. Also, when computing the estimator, we used a hash table to store previously seen string binomial coefficients \binom{y}{x} in order to speed up computation and allow better scaling with n. We point out that, in addition to computing an estimate of Ks using the random traces, we also compute an estimate of Ks′, where s′ is the reverse of s, using the reverse of the traces, and the reconstruction ŝ is obtained from both K̂s and K̂s′ as we explain below. This is done because we have found that estimates of Ks,x[i] are more accurate for smaller values of i than for larger values of i.
Recall that Ks,x = F Is,x, where F ∈ R^{(n−k+1)×(n−k+1)} is an invertible upper triangular matrix with entries given by

We obtain an estimate of Is,x using two different approaches. In the first approach, we directly compute Îs,x = F^{-1} K̂s,x, since F is invertible. Since it is possible for F to be ill-conditioned depending on the choice of n, k, p, in the second approach we estimate Is,x given K̂s,x by solving

for some regularization parameter δ ≥ 0. This is a convex program if Îs,x is not restricted to be integer-valued. Observe that if δ = 0, then this approach reduces to the first approach, where we set Îs,x = F^{-1} K̂s,x. Note that in both approaches, we solve for Îs,x and Îs′,x using K̂s,x and K̂s′,x, respectively. Given our estimates Îs,x for x ∈ {0, 1}^k, we use a greedy approach for assigning k-mers to positions in our estimated string ŝ. In this approach, ŝ is initialized to the length-n string ∗∗ · · · ∗ that contains only the wildcard symbol ∗ (the wildcard symbol is used to represent any symbol). At each step of the greedy algorithm, we set ŝ[i∗ : i∗ + k] = x∗ where
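The two estimation approaches can be sketched as follows. The ridge-style penalty shown for δ > 0 is an assumption made for illustration (the paper's exact regularized program is not reproduced here), and F is taken as a given invertible upper triangular matrix:

```python
import numpy as np

def estimate_I(F, K_hat, delta=0.0):
    """Estimate I from noisy K_hat ≈ F I.
    delta == 0: direct inversion, I_hat = F^{-1} K_hat.
    delta > 0:  ridge-regularized least squares (illustrative penalty choice),
                I_hat = argmin ||F I - K_hat||^2 + delta ||I||^2."""
    if delta == 0:
        return np.linalg.solve(F, K_hat)
    m = F.shape[1]
    # normal equations of the regularized problem
    return np.linalg.solve(F.T @ F + delta * np.eye(m), F.T @ K_hat)
```

With delta = 0 and noiseless K_hat = F I, the estimate recovers I exactly; with delta > 0 the solution is biased but better conditioned.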

(i∗, x∗) = argmax_{(i,x)} Îs,x[i]
and (i, x) is restricted to choices that would make ŝ[i : i + k] = x consistent with the current estimated string ŝ. In other words, at each step of the greedy algorithm, the k-mer x is placed at position i, where i and x are chosen to maximize Îs,x[i] such that the placement of x at i fills in new entries of ŝ and agrees with any previously filled entries of ŝ. Notice that if for each position i we have Îs,s[i:i+k][i] > Îs,x[i] for all x ≠ s[i : i + k], then s is reconstructed correctly. Note that we create a reconstruction of both s and s′ using this approach. Specifically, we first construct an estimate of s′ using the greedy approach with Îs′. We then initialize ŝ to be the length-n string ∗∗ · · · ∗, and set ŝ[i] = ŝ′[n − i + 1] for i ∈ {⌈n/2⌉, ⌈n/2⌉ + 1, ..., n}. We then run the greedy algorithm to construct ŝ with Îs using this initialization of ŝ, which already has its second half filled in.
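The greedy assignment just described can be sketched as follows (function and variable names are illustrative; "consistent" means the placement agrees with every non-wildcard symbol already in ŝ, and "fills new" means it writes at least one wildcard):

```python
def greedy_reconstruct(n, k, I_hat, init=None):
    """Greedily place k-mers: repeatedly choose the (position, k-mer) pair
    with the largest estimated indicator I_hat[x][i] whose placement is
    consistent with, and extends, the current partial string of '*'s.
    I_hat: dict mapping each k-mer x to a list of n-k+1 scores."""
    s_hat = list(init) if init is not None else ['*'] * n
    while '*' in s_hat:
        best = None
        for x, scores in I_hat.items():
            for i, score in enumerate(scores):
                window = s_hat[i:i + k]
                consistent = all(a == '*' or a == b for a, b in zip(window, x))
                fills_new = any(a == '*' for a in window)
                if consistent and fills_new and (best is None or score > best[0]):
                    best = (score, i, x)
        if best is None:  # no consistent placement remains; stop
            break
        _, i, x = best
        s_hat[i:i + k] = list(x)
    return ''.join(s_hat)
```

The `init` argument supports the two-pass scheme above, where the second half of ŝ is pre-filled from the reconstruction of the reversed string.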
Observe that if Îs,x = F^{-1} K̂s,x is used, our approach for reconstructing s is such that the probability that s is correctly reconstructed approaches one as T → ∞. In other words, our approach is a consistent estimator of s. This is because K̂s,x is a consistent estimator of Ks,x, the transformation f(x) = F^{-1} x is continuous, and our greedy approach for constructing ŝ is guaranteed to succeed if Îs,s[i:i+k][i] > Îs,x[i] for every position i and every x ≠ s[i : i + k]. However, the exact number of samples needed for s to be reconstructed with probability 1 − ϵ does not appear to be easy to analyze, and we discuss the specific difficulties in Section VI.
In our first experiment, we use Îs,x = F^{-1} K̂s,x to estimate Is,x. We test the reconstruction pipeline on a string s1 with a single one in the middle, an alternating string s2, two strings s3, s4 each with a block of ones in the middle, a string s5 whose second half is all ones, along with five distinct strings r1, r2, r3, r4, r5 selected uniformly at random from the set of all strings of length n. The different strings are shown below:

We ran all tests with n = 42, k = 3, deletion probability p = 0.1, and number of traces T ∈ {5, 10, 20, 50, 100, 200, 500, 1000}. For each T, we generated 50 random sets of T traces. The results are shown in Figure 5.
Observe that for all strings, the performance improves as T increases, which is expected because our approach is a consistent estimator of s. Furthermore, observe that all strings have reasonable reconstructions on average even at T = 5.
Fig. 5. Plots showing performance of the reconstruction algorithm where Îs,x = F^{-1} K̂s,x as the number of traces T increases. The data was generated using a deletion probability of p = 0.1, k-mer length k = 3, and strings of length n = 42. We tested T ∈ {5, 10, 20, 50, 100, 200, 500, 1000}, and for each T, ran 50 trials for each string. The top plot shows the fraction of trials where s was reconstructed perfectly, while the bottom plot shows the average Levenshtein distance between the reconstructed string ŝ and s. The curves labeled "point," "alt," "block1," "block2," and "block3" correspond to the strings s1, s2, s3, s4, s5, respectively. The curves labeled "rand1," "rand2," "rand3," "rand4," and "rand5" correspond to the random strings r1, r2, r3, r4, r5, respectively.

Our second experiment is identical to our first experiment, except we use regularized least squares

with δ = 1 in order to observe the effect of regularization on ŝ. We plot our results in Figure 6. Observe that for small T, using the regularization appears to improve performance compared to the first experiment, where δ = 0. However, as T increases, the performance appears to stop improving after a certain point for all strings other than s3, s4, and s5. Interestingly, s3, s4, and s5 contain only long runs of 0s and 1s, whereas all the other strings tested also contain short runs. Therefore, it does not appear that using δ = 1 leads to a consistent estimator of s for general strings (recall that using δ = 0 leads to a consistent estimator of s, as argued formally above and demonstrated in Figure 5). This is expected because using a regularization term introduces approximation error that is not present when no regularization is added. In other words, by adding the regularization term, we change the optimization problem so that the optimal solution is no longer Is,x even when no noise is present in K̂s,x.
For reference, we give the approximation error and noise error precisely for regularized least squares [26]. Suppose that K̂s,x = Ks,x + e for some error vector e.

Fig. 6. Plots showing performance of the reconstruction algorithm where Îs,x is estimated using regularized least squares with δ = 1 as the number of traces T increases. The data was generated using a deletion probability of p = 0.1, k-mer length k = 3, and strings of length n = 42. We tested T ∈ {5, 10, 20, 50, 100, 200, 500, 1000}, and for each T, ran 50 trials for each string. The top plot shows the fraction of trials where s was reconstructed perfectly, while the bottom plot shows the average Levenshtein distance between the reconstructed string ŝ and s. The curves labeled "point," "alt," "block1," "block2," and "block3" correspond to the strings s1, s2, s3, s4, s5, respectively. The curves labeled "rand1," "rand2," "rand3," "rand4," and "rand5" correspond to the random strings r1, r2, r3, r4, r5, respectively.

Let F = UΣV^T be the singular value decomposition (SVD) of F, where U = [u1, u2, ..., u_{n−k+1}] is the matrix of left singular vectors of F, V = [v1, v2, ..., v_{n−k+1}] is the matrix of right singular vectors of F, and Σ is the diagonal matrix containing the singular values σi of F with entries Σ[i, i] = σi. It is known that for regularized least squares, the reconstruction error is given by

The first summation is known as the noise error, and the second summation is known as the approximation error, which depends on Is,x. Notice that if δ = 0, the approximation error is not present. As the formula above suggests, an understanding of the spectral properties of F could lead to insight into the reconstruction error for Is,x, which in turn could allow us to establish guarantees for the regularized estimator.
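As a sketch of that decomposition (assuming the estimator is the standard ridge solution Îs,x = (F^T F + δI)^{-1} F^T K̂s,x), the error splits as:

```latex
\hat{I}_{s,x} - I_{s,x}
  \;=\; \underbrace{\sum_{i=1}^{n-k+1} \frac{\sigma_i\,(u_i^\top e)}{\sigma_i^2 + \delta}\, v_i}_{\text{noise error}}
  \;-\; \underbrace{\sum_{i=1}^{n-k+1} \frac{\delta\,(v_i^\top I_{s,x})}{\sigma_i^2 + \delta}\, v_i}_{\text{approximation error}},
```

where the second term vanishes when δ = 0, matching the discussion above.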

V. A NEW BOUND FOR THE k-SUBWORD DECK RECONSTRUCTION PROBLEM
In this section, we derive an estimator for the k-subword deck using techniques similar to those used to derive the estimator for the k-mer density map, and prove Theorem 2. Let Ns,x denote the number of times the k-mer x appears in the source string s. When s is clear from context, we write Ns,x as Nx. In this section, let Es,x denote the expected number of times that the k-mer x appears in a trace S̃ of source string s. When s is clear from context, we write Es,x as Ex. Notice that we can easily estimate Ex from a set of T traces as Êx = (1/T) Σ_{i=1}^{T} (# of times x appears as a substring in trace S̃i). Similar to the k-mer density map estimator, we derive a formula for Nx in terms of Ex and Ey for all supersequences y of x. This allows us to estimate Nx directly from the set of traces. Our formula for Nx is given in the following lemma. Refer to the beginning of the previous section for the definition of Yi(x).
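The empirical estimate Êx is straightforward to compute; a minimal sketch (the function name is illustrative, and occurrences are counted with overlaps allowed):

```python
def estimate_Ex(traces, x):
    """Empirical estimate of E_x: the average number of (possibly
    overlapping) occurrences of the k-mer x as a substring per trace."""
    def count_occurrences(s, x):
        return sum(1 for i in range(len(s) - len(x) + 1) if s[i:i + len(x)] == x)
    return sum(count_occurrences(t, x) for t in traces) / len(traces)
```

For instance, over the traces "0101" (two occurrences of "01") and "1101" (one occurrence), the estimate is 1.5.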
Lemma 4: For source string s ∈ {0, 1}^n, deletion probability p ∈ [0, 1), and x ∈ {0, 1}^k, we have

by linearity of expectation applied to all ways x can be formed from a substring y of s. This can be rewritten as the following recursive formula for Nx:

This formula is identical to the recursive formula for the k-mer density map in (13), with Ks,x[i] replaced by Nx and Es,y[i] replaced by Ey. Therefore, the same steps can be taken to obtain a non-recursive formula for Nx, which is given by

Our estimator is given by
which we then round to the nearest integer. Observe that if the error in the estimate of Nx is less than 0.5, then we estimate Nx correctly after rounding. We can implement this estimator efficiently using an approach similar to the one described in Section III, allowing us to compute the k-subword deck in poly(n, 1/ϵ) time and space for k = c log(n).
Chen, De, Lee, Servedio, and Sinha [10] derive a formula for Nx that is equivalent to ours, but written in a different form. Their formula is given by

where ∗ is the wildcard symbol, #(x[1] ∗^{α[1]} x[2] ∗^{α[2]} · · · x[k]) is the number of times a substring of the form x[1] ∗^{α[1]} x[2] ∗^{α[2]} · · · x[k] appears in a trace S̃ of s, and the sum ranges over α with entries α[i] ≥ 0. In order to prove this formula, the authors define a complex polynomial that is a generalization of Nx, perform a Taylor expansion on the polynomial, and relate the partial derivatives of the polynomial to expected values of functions of traces of s. The formula above is then a special case of this more general result. Our significantly shorter proof is very different from theirs: we first derive a recursive estimator of Nx and then use structural knowledge of \binom{y}{x'} to prove its equivalence to the polynomial above.
The initial algorithm presented in [10] uses (50) directly for estimating Nx. For p < 0.5, the authors bound the error of the estimate of each E[#(x[1] ∗^{α[1]} x[2] ∗^{α[2]} · · · x[k])] term using a Chernoff bound and the fact that

This analysis of the algorithm yields an upper bound of n^{O(k)} · log(1/δ) traces. For k = c log(n), this is superpolynomial in n. In the appendix, we employ McDiarmid's inequality to prove that poly(n) traces suffice for reconstructing Nx from (50) with high probability.
In the improved algorithm of [10] for p < 0.5 that uses poly(n) traces, the authors estimate Nx using only estimates of E[#(x[1] ∗^{α[1]} x[2] ∗^{α[2]} · · · x[k])] for |α| ≤ d, where d is of order 1/(1/2 − p) + log(n). The truncated formula they use is

The authors then prove that the sum of the terms in (50) for strings y of length greater than d + k is small, and again bound the error of the estimate of each E[#(x[1] ∗^{α[1]} x[2] ∗^{α[2]} · · · x[k])] for |α| ≤ d using a Chernoff bound and the fact that

Specifically, they prove Lemma 5 to upper bound the magnitude of the sum of the truncated terms in (50), and prove that the truncated polynomial (51) can be estimated with error at most 0.2 with high probability. This leads to an estimator of (50) with error at most 0.3 with high probability, which is sufficient for estimating Nx correctly after rounding. The number of traces used by the truncated algorithm to reconstruct Nx with probability at least 1 − δ for p < 1/2 is proved to be O((M^2 2^k)^2 log(M/δ)) in Section 7.1 of [10], and it is proved that M ≤ n . It follows that the number of traces used in the analysis of [10] is

We show that by using McDiarmid's inequality to analyze the sample complexity, we can significantly improve the upper bound on the number of samples needed by the truncated estimator for reconstructing the k-subword deck. This result immediately yields Theorem 2.
Lemma 6: For deletion probability p < 0.5 and any source string s ∈ {0, 1}^n, we can reconstruct Nx for any k-mer x with k = c log n with probability 1 − δ using

traces.
Proof: Recall that the truncated estimator from [10], written in our form, is given by

To analyze b when k = c log(n) for constant c and p < 0.5, we have that

We have that

where the maximizer in (60) is given by i* = log(n^c)/(1 − p/(1 − p)), as proved in Lemma 3. We therefore have that

when k = c log(n) for constant c and p < 0.5. Let

Due to Lemma 5, it suffices to estimate Nx with error at most 1/5. Applying McDiarmid's inequality and setting ϵ = 1/5, we have

Setting δ = Pr(|N̂x − Nx| ≥ 1/5), we have

The number of traces needed to recover Nx with probability at least 1 − δ for p < 1/2 using the truncated
estimator is

traces, where

Thus, the number of traces needed by our algorithm is upper bounded by

Plugging in, the number of traces is upper bounded by

VI. CONCLUSION AND FUTURE WORK
In this paper, we introduced the k-mer density map, which generalizes the k-subword deck to give an indication of where k-mers appear in the source string. We proved that we can estimate the (c log n)-mer density map with error at most 1/poly(n) using poly(n) traces. Thus, the trace reconstruction problem restricted to sets of strings whose minimum (c log n)-mer density map distance is 1/poly(n) can be solved using poly(n) traces. Our results also imply that certain pairs of strings can be distinguished using poly(n) traces, which cannot be proved using existing theorems involving Hamming distance [20], [21], Levenshtein distance [22], or the k-subword deck [10]. We also designed a consistent estimator of the source string that makes use of the k-mer density map estimator, and showed that this estimator gives good performance in simulations. Finally, we derived an improved upper bound on the number of traces needed to reconstruct the (c log n)-subword deck using proof techniques similar to those used to establish results for the (c log n)-mer density map.

Some open problems include determining whether there are string pairs for which the (c log n)-mer density maps have distance that decays faster than 1/poly(n), and, more broadly, determining whether k-mer density maps (possibly with a different kernel h) can be used to improve upper bounds on the number of traces needed for the worst-case trace reconstruction problem. While making progress on the worst-case trace reconstruction problem is likely very challenging, k-mer density maps may also be useful in characterizing explicit classes of strings that can be reconstructed with poly(n) traces. In particular, it may be possible to establish that pairs of strings with small (say, constant) Hamming distance must have k-mer density map distance that decays as 1/poly(n). Such a result would allow us to recover the results of [20] and [21] using the framework of k-mer density maps.
Another direction for future work is to build on the consistent estimator for s (based on the k-mer density map estimator) discussed in Section IV. One interesting question is to characterize how many traces are needed for the output of this consistent estimator of the source string to be exactly correct (or within some small distance of s) with high probability. However, as mentioned in Section IV, it does not appear to be easy to analyze the number of traces needed by the consistent estimator to succeed with high probability. Recall that we estimate Is,x as Îs,x = F^{-1} K̂s,x. The difficulty appears to lie in bounding ∥Îs,x − Is,x∥∞ in terms of ∥K̂s,x − Ks,x∥∞. We know that

where σ_{n−k+1} is the smallest singular value of F [26]. This implies that

Therefore, a lower bound on σ_{n−k+1} would allow us to bound ∥Îs,x − Is,x∥∞. We also have the bound

which can be seen by observing that F^{-1} K̂s,x = F^{-1}(Ks,x + e) = Is,x + F^{-1} e, where e is the error vector contained in the estimate K̂s,x. Finding a general upper bound on max_{i,j} |F^{-1}[i, j]| also appears to be difficult. Obtaining an upper bound on the number of traces used by this algorithm using one of these two methods, or some other method, is an interesting direction for future work.
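Both quantities discussed above are easy to probe numerically for a candidate F. The kernel h used below is a hypothetical deletion-channel model, chosen for illustration only, and is not necessarily the paper's F: h(i, j) is modeled as the probability that the bit at position j of s survives and lands at position i of the trace.

```python
import numpy as np
from math import comb

def build_F(m, p):
    """Hypothetical upper triangular deletion-channel kernel (illustrative):
    F[i, j] = C(j-1, j-i) * p^(j-i) * (1-p)^i, nonzero only for j >= i."""
    F = np.zeros((m, m))
    for i in range(1, m + 1):
        for j in range(i, m + 1):
            F[i - 1, j - 1] = comb(j - 1, j - i) * (p ** (j - i)) * ((1 - p) ** i)
    return F

m, p = 40, 0.1
F = build_F(m, p)
sigma_min = np.linalg.svd(F, compute_uv=False).min()  # smallest singular value
Finv_max = np.abs(np.linalg.inv(F)).max()             # max |F^{-1}[i, j]|
# A lower bound on sigma_min, or an upper bound on Finv_max, would translate
# a bound on the K-estimate error into a bound on the I-estimate error.
```

Under this illustrative kernel, the diagonal entries decay as (1 − p)^i, which already suggests how conditioning can deteriorate as m grows.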
where Bin(n, q) is a binomial random variable with n trials and success probability q. Specifically, at i = ⌈(1 − p)κ(n)⌉, we have that

Then, for any t > 0, we have that

Applying this to our problem, we set f = N̂x, and we take each random variable Xi to be an indicator of whether a particular bit in s is deleted in a particular trace. Recall that our estimator is given by

Fig. 1. (a) Example of a source binary string s and its k-subword deck (or k-spectrum) S_k(s), for k = 4. (b) Given S_4(s) in (a), one can build a de Bruijn graph where the elements of S_4(s) correspond to edges (with multiplicities) and the nodes correspond to 3-mers. Notice that s corresponds to an Eulerian path on the de Bruijn graph, but such a path is not unique; for example, s′ = 000111000111000000000111 corresponds to another Eulerian path.
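The construction in Fig. 1 can be sketched in a few lines (function names are illustrative; S_k(s) is the multiset of length-k substrings of s):

```python
from collections import Counter

def subword_deck(s, k):
    """k-subword deck (k-spectrum): multiset of all length-k substrings of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def de_bruijn_edges(deck):
    """Edges of the de Bruijn multigraph: each k-mer x with multiplicity m
    contributes m parallel edges from its (k-1)-prefix to its (k-1)-suffix."""
    return [(x[:-1], x[1:]) for x, m in deck.items() for _ in range(m)]
```

The source string then corresponds to an Eulerian path through these edges, which, as Fig. 1(b) illustrates, need not be unique.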

Fig. 2. For each x ∈ {0, 1}^k, Is,x indicates the occurrences of x in s. The density map Ks,x can be obtained via Ks,x[i] = Σ_{j=1}^{n−k+1} h(i, j) Is,x[j].

Fig. 3. Plots of the minimum 3-mer density map distance among all length-n strings for n ∈ {4, 5, ..., 15} and various p in the range [0, 0.5]. The bottom plot is the log-log plot (base 10) of the data. Observe that for all values of n and p considered, the curves lie above the plot of 1/n given by the black curve. This means that for every string pair considered, the 3-mer density map distance is above 1/n. Furthermore, while the curves for various p are not completely smooth, the log-log plot suggests that the minimum distance scales as 1/poly(n).
where H(·) is the binary entropy function. The function f_c(n) can be upper bounded by a polynomial of degree β_c(p) = 2α_c(p) − 2c log(1 − p) + 1, which can be numerically computed as shown in Figure 4(a). The following theorem is proved in Section III.

Theorem 1: Suppose p < 0.5 and k

Fig. 4. (a) Plot of β_c(p) for various c. Observe that as p increases, the algorithm requires more traces to achieve the same level of performance. Similarly, as c increases, more traces are needed. For all values of c, the limit of β_c(p) as p → 0 is equal to 3. (b) Plot of the exponent in (6) versus the exponent in (7) for c = 1. Observe that our bound is significantly tighter than the existing bound.

For k = c log(n), the number of traces used is

The set of independent random variables we use in McDiarmid's inequality is the set of Tn indicator random variables that indicate whether a particular bit is deleted in a particular trace. To apply McDiarmid's inequality, we have to upper bound how much the estimator can change when the value of one of these indicator random variables is changed. Changing the value of the indicator random variable corresponding to a single bit in the jth trace S̃j changes the estimator by at most

b ≤ (1/T) [ (1/(1 − p)^k) min(k, n − k + 1) + Σ_{i=k+1}^{n} 2 min(i, n − i + 1) max_{y∈Y_i(x)} · · · ]

To analyze b when k = c log(n) for constant c and p < 0.5, we have that

b ≤ (1/T) [ (1/(1 − p))^{c log(n)} c log(n) + 2n^2 max_{i∈[c log(n)+1, n]} · · · (p/(1 − p))^{i−c log(n)} ]