Lower Density Selection Schemes via Small Universal Hitting Sets with Short Remaining Path Length

Universal hitting sets (UHS) are sets of words that are unavoidable: every long enough sequence is hit by the set (i.e., it contains a word from the set). There is a tight relationship between UHS and minimizer schemes, where minimizer schemes with low density (i.e., efficient schemes) correspond to UHS of small size. Local schemes are a generalization of minimizer schemes that can be used as a replacement for minimizer schemes, with the possibility of being much more efficient. We establish the link between efficient local schemes and the minimum length of a string that must be hit by a UHS. We give bounds for the remaining path length of the Mykkeltveit UHS. In addition, we create a local scheme with the lowest known density, which is only a log factor away from the theoretical lower bound.


INTRODUCTION
We study the problem of finding Universal Hitting Sets (UHS) (Orenstein et al., 2016). A UHS is a set of words, each of length k, such that every long enough string (say of length L or longer) contains as a substring an element from the set. We call such a set a UHS for parameters k and L. They are sets of unavoidable words, that is, words that must be contained in any long string, and we are interested in the relationship between the size of these sets and the length L.
More precisely, we say that a k-mer a (a string of length k) hits a string S if a appears as a substring of S. A set A of k-mers hits S if at least one k-mer of A hits S. A UHS for length L is a set of k-mers that hits every string of length L. Equivalently, the remaining path length of a UHS is the length of the longest string that is not hit by the set ($L - 1$ here).
The study of UHS is motivated, in part, by the link between UHS and the common method of minimizers (Schleimer et al., 2003; Roberts et al., 2004a,b). The minimizer method is a way to sample a string for representative k-mers in a deterministic way by breaking a string into windows, each window containing w k-mers, and selecting in each window a particular k-mer (the "minimum k-mer," as defined by a preset order on the k-mers). This method is used in many bioinformatic software programs (Ye et al., 2012; Grabowski and Raniszewski, 2013; Chikhi et al., 2015; Deorowicz et al., 2015; Jain et al., 2017) to reduce the amount of computation and improve run time (see Marçais et al., 2019 for usage examples). The minimizer method is a family of methods parameterized by the order on the k-mers used to find the minimum. The density is defined as the expected number of sampled k-mers per unit length of sequence. Depending on the order used, the density varies.
In general, a lower density (i.e., fewer sampled k-mers) leads to greater computational improvements and is therefore desirable. For example, a read aligner such as Minimap2 (Li and Birol, 2018) stores all the locations of minimizers in the reference sequence in a database. It then finds all the minimizers in a read and searches in the database for these minimizers. The locations of these minimizers are used as seeds for the alignment. Using a minimizer scheme with a reduced density leads to a smaller database and fewer locations to consider, hence an increased efficiency, while preserving the accuracy.
There is a two-way correspondence between minimizer methods and UHS: each minimizer method has a corresponding UHS, and a UHS defines a family of compatible minimizer methods (Marçais et al., 2017, 2018). This correspondence also links the remaining path length of a UHS and the window size of a compatible minimizer scheme: the remaining path length of the UHS is upper bounded by the number of bases in each window in the minimizer scheme ($L \le w + k - 1$).
Moreover, the relative size of the UHS, defined as the size of the UHS over the number of possible k-mers, provides an upper bound on the density of the corresponding minimizer methods: the density is no more than the relative size of the UHS. Precisely, $\frac{1}{w} \le d \le \frac{|U|}{\sigma^k}$, where d is the density, U is the UHS, $\sigma^k$ is the total number of k-mers on an alphabet of size $\sigma$, and w is the window length. In other words, the study of UHS with small size leads to the creation of minimizer methods with provably low density.
Local schemes (Mykkeltveit, 1972) and forward schemes are generalizations of minimizer schemes. These extensions are of interest because they can be used in place of minimizer schemes while sampling k-mers with lower density. In particular, minimizer schemes cannot have density close to the theoretical lower bound of $1/w$ when w becomes large, while local and forward schemes do not suffer from this limitation (Marçais et al., 2018). Understanding how to design local and forward schemes with low density will allow us to further improve the computational efficiency of many bioinformatic algorithms.
The previously known link between minimizer schemes and UHS relied on the definition of an ordering between k-mers, and therefore is not valid for local and forward schemes that are not based on any ordering. Nevertheless, UHS play a central role in understanding the density of local and forward schemes.
Our first contribution is to describe the connection between UHS, local and forward schemes. More precisely, there are two connections: first, between the density of the schemes and the relative size of the UHS, and second, between the window size w of the scheme and the remaining path length of the UHS (i.e., the maximum length L of a string that does not contain a word from the UHS). This motivates our study of the relationship between the size of a UHS U and the remaining path length of U.
There is a rich literature on unavoidable word sets (Lothaire, 2002). The setting for UHS is slightly different for two reasons. First, we impose that all the words in the set U have the same length k, as a k-mer is a natural unit in bioinformatic applications. Second, the set U must hit any string of a given finite length L, rather than being unavoidable only by infinitely long strings. Mykkeltveit (1972) answered the question of the size of a minimum unavoidable set of k-mers by giving an explicit construction for such a set. The k-mers in the Mykkeltveit set are guaranteed to be present in any infinitely long sequence, and the size of the Mykkeltveit set is minimum in the sense that for any set S with fewer k-mers, there is an infinitely long sequence that avoids S. However, the construction gives no indication of the remaining path length.
The DOCKS (Orenstein et al., 2016) and ReMuVal (DeBlasio et al., 2019) algorithms are heuristics to generate unavoidable sets for parameters k and L. Both of these algorithms use the Mykkeltveit set as a starting point. In many practical cases, the longest sequence that does not contain any k-mer from the Mykkeltveit set is much larger than the parameter L of interest (which for a compatible minimizer scheme corresponds to the window length). Therefore, the two heuristics extend the Mykkeltveit set to cover every L-long sequence. These greedy heuristics do not provide any guarantee on the size of the unavoidable set generated compared with the theoretical minimum size and are only computationally tractable for limited ranges of k and L.
Our second contribution is to give upper and lower bounds on the remaining path length of the Mykkeltveit sets. These are the first bounds on the remaining path length for minimum size sets of unavoidable k-mers.
Defining local or forward schemes with a density of $O(1/w)$ (i.e., within a constant factor of the theoretical lower bound) is not only of practical interest to improve the efficiency of existing algorithms, but it is also interesting for a historical reason. Both Roberts et al. (2004a) and Schleimer et al. (2003) used a probabilistic model to suggest that minimizer schemes have an expected density of $2/w$. Unfortunately, this simple probabilistic model does not correctly model the minimizer schemes outside of a small range of values for parameters k and w, and minimizers do not have an $O(1/w)$ density in general. Although the general question of whether a local scheme with $O(1/w)$ density exists is still open, our third contribution is an almost-optimal forward scheme with a density of $O(\ln(w)/w)$. This is the lowest known density for a forward scheme, beating the previous best density of $O(\sqrt{w}/w)$ (Marçais et al., 2018), and hinting that $O(1/w)$ might be achievable.
Understanding the properties of UHS and their many interactions with selection schemes (minimizer and forward and local schemes) is a crucial step toward designing schemes with lower density and improving the many algorithms using these schemes. In Section 2, we give an overview of the results, and in Section 3, we give detailed proofs. Further research directions are discussed in Section 4.

RESULTS
2.1. Notation
2.1.1. Universal hitting sets. Consider a finite alphabet $\Sigma = \{0, \ldots, \sigma - 1\}$ with $\sigma \ge 2$ elements. If $a \in \Sigma$, $a^k$ denotes the letter a repeated k times. We use $\Sigma^k$ to denote the set of strings of length k on alphabet $\Sigma$, and call them k-mers. If S is a string, $S[n, l]$ denotes the substring starting at position n and of length l. For a k-mer $a \in \Sigma^k$ and an l-long string $S \in \Sigma^l$, we say "a hits S" if a appears as a substring of S ($a = S[i, k]$ for some i). For a set of k-mers $A \subseteq \Sigma^k$ and $S \in \Sigma^l$, we say "A hits S" if there exists at least one k-mer in A that hits S. A set $A \subseteq \Sigma^k$ is a UHS for length L if A hits every string of length L.
2.1.2. de Bruijn graphs. Many questions regarding strings have an equivalent formulation with graph terminology using de Bruijn graphs. The de Bruijn graph $B_{\Sigma,k}$ on alphabet $\Sigma$ and of order k has a node for every k-mer, and an edge (u, v) for every string of length $k + 1$ with prefix u and suffix v. There are $\sigma^k$ vertices and $\sigma^{k+1}$ edges in the de Bruijn graph of order k.
There is a one-to-one correspondence between strings and paths in $B_{\Sigma,k}$: a path with w nodes corresponds to a string of $L = w + k - 1$ characters. A UHS A corresponds to a depathing set of the de Bruijn graph: a UHS for k and L intersects with every path in the de Bruijn graph with $w = L - k + 1$ vertices. We say "A is an $(\alpha, l)$-UHS" if A is a set of k-mers that is a UHS, with relative size $\alpha = |A|/\sigma^k$, and hits every walk of l vertices (and therefore every string of length $L = l + k - 1$).
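The definitions above can be checked by brute force on tiny instances. The following is a minimal sketch, not part of the original work: the function names, the binary alphabet, and the search cap `max_len` are illustrative assumptions, and the enumeration is exponential in L.

```python
from itertools import product

def hits(kmers, s, k):
    """Return True if some k-mer of `kmers` occurs as a substring of `s`."""
    return any(s[i:i + k] in kmers for i in range(len(s) - k + 1))

def is_uhs(kmers, k, L, alphabet="01"):
    """Brute-force check that `kmers` hits every string of length L."""
    return all(hits(kmers, "".join(t), k) for t in product(alphabet, repeat=L))

def remaining_path_length(kmers, k, alphabet="01", max_len=12):
    """Length of the longest string avoiding `kmers`.

    Tries L = k, k + 1, ... up to max_len (a hypothetical cap) and returns
    L - 1 for the first L at which the set is a UHS, or None if none is found.
    """
    for L in range(k, max_len + 1):
        if is_uhs(kmers, k, L, alphabet):
            return L - 1
    return None

# The only 2-mer missing from A is "10", so "10" itself is the longest avoiding string.
A = {"00", "01", "11"}
print(is_uhs(A, 2, 2), is_uhs(A, 2, 3), remaining_path_length(A, 2))  # False True 2
```

In graph terms, this A hits every path of $w = L - k + 1 = 2$ vertices in the de Bruijn graph of order 2.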
A de Bruijn sequence is a particular sequence of length $\sigma^k + k - 1$ that contains every possible k-mer once and only once. Every de Bruijn graph is Hamiltonian and the sequence spelled out by a Hamiltonian tour is a de Bruijn sequence.

Selection schemes.
A local scheme (Schleimer et al., 2003) is a method to select positions in a string. A local scheme is parameterized by a selection function f. It works by looking at every w-mer of the input sequence S: $S[0, w], S[1, w], \ldots$, and selecting in each window a position according to the selection function f. The selection function selects a position in a window of length w, that is, it is a function $f : \Sigma^w \to [0 : w - 1]$. The output of a local scheme is a set of selected positions: $\{i + f(S[i, w]) \mid 0 \le i \le |S| - w\}$. A forward scheme is a local scheme with a selection function such that the selected positions form a nondecreasing sequence. That is, if $x_1$ and $x_2$ are two consecutive windows in a sequence S, then $f(x_2) \ge f(x_1) - 1$.
A minimizer scheme is a scheme where the selection function takes in the sequence of w consecutive k-mers and returns the "minimum" k-mer in the window (hence the name minimizers). The minimum is defined by a predefined order on the k-mers (e.g., lexicographic order) and the selection function is $f : \Sigma^{w+k-1} \to [0 : w - 1]$, $f(\omega) = \arg\min_{i \in [0 : w-1]} \omega[i, k]$. See Figure 1 for examples of all three schemes. The local scheme concept is the most general as it imposes no constraint on the selection function, while a forward scheme must select positions in a nondecreasing way. A minimizer scheme is the least general and also selects positions in a nondecreasing way.
Local and forward schemes were originally defined with a function defined on a window of w k-mers, $f : \Sigma^{w+k-1} \to [0 : w - 1]$, similarly to minimizers. Selection schemes are schemes with k = 1, and have a single parameter w as the word length. While the notion of k-mer is central to the definition of the minimizer schemes, it has no particular meaning for a local or forward scheme: these schemes select positions within each window of a string S, and the sequence of the k-mers at these positions is no more relevant than a sequence elsewhere in the window to the selection function.
There are multiple reasons to consider selection schemes. First, they are slightly simpler as they have only one parameter, namely the window length w. Second, in our analysis, we consider the case where w is asymptotically large, therefore $w \gg k$ and the setting is similar to having k = 1. Finally, this simplified problem still provides information about the general problem of local schemes. Suppose that f is the selection function of a selection scheme; for any k > 1 we can define $g_k : \Sigma^{w+k-1} \to [0 : w - 1]$ by $g_k(\omega) = f(\omega[0, w])$. That is, $g_k$ is defined from the function f by ignoring the last $k - 1$ characters in a window. The functions $g_k$ define proper selection functions for local schemes with parameters w and k, and because exactly the same positions are selected, the density of $g_k$ is equal to the density of f. In the following sections, unless noted otherwise, we use forward and local schemes to denote forward and local selection schemes.
2.1.4. Density. Because a local scheme on string S may pick the same location in two different windows, the number of selected positions is usually less than $|S| - w + 1$. The particular density of a scheme is defined as the number of distinct selected positions divided by $|S| - w + 1$ (Fig. 1). The expected density, or simply the density, of a scheme is the expected density on an infinitely long random sequence. Alternatively, the expected density is computed exactly by computing the particular density on any de Bruijn sequence of order $\ge 2w - 1$. In other words, a de Bruijn sequence of large enough order "looks like" a random infinite sequence with respect to a local scheme (see Marçais et al., 2017 and Section 3.1).
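As an illustration of these definitions, the sketch below applies a selection scheme (k = 1) to a string and computes its particular density. The choice of the leftmost smallest character as selection function and all function names are assumptions made for this example only.

```python
def lex_min(window):
    """Example selection function f: offset of the leftmost smallest character."""
    return min(range(len(window)), key=lambda i: (window[i], i))

def selected_positions(s, w, f):
    """Set of positions i + f(s[i:i+w]) over all windows of length w in s."""
    return {i + f(s[i:i + w]) for i in range(len(s) - w + 1)}

def particular_density(s, w, f):
    """Number of distinct selected positions divided by the number of windows."""
    return len(selected_positions(s, w, f)) / (len(s) - w + 1)

def is_forward_on(s, w, f):
    """Check f(x2) >= f(x1) - 1 for every pair of consecutive windows of s."""
    offsets = [f(s[i:i + w]) for i in range(len(s) - w + 1)]
    return all(b >= a - 1 for a, b in zip(offsets, offsets[1:]))

s, w = "0101", 2
print(sorted(selected_positions(s, w, lex_min)))  # [0, 2]
print(particular_density(s, w, lex_min), is_forward_on(s, w, lex_min))
```

Here the three windows "01", "10", "01" select positions 0, 2, and 2, so two distinct positions are kept out of three windows.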

Main results
The density of a local scheme is in the range $[1/w, 1]$, as $1/w$ corresponds to selecting exactly one position per window, and 1 corresponds to selecting every position. Therefore, the density goes from a low value with a constant number of positions per window [density is $O(1/w)$, which goes to 0 when w gets large], to a high, constant value [density is $O(1)$] where the number of positions per window is proportional to w. When the minimizers and winnowing schemes were introduced, both articles used a simple probabilistic model to estimate the expected density as $2/(w + 1)$, or about 2 positions per window. Under this model, this estimate is within a constant factor of the optimal, $O(1/w)$.
Unfortunately, this simple model properly accounts for the minimizer behavior only when k and w are small. For large k (that is, $k \gg w$), it is possible to create an almost-optimal minimizer scheme with a density $\sim 1/w$. More problematic, for large w (that is, $w \gg k$) and for all minimizer schemes, the density becomes constant (Marçais et al., 2018). In other words, minimizer schemes cannot be optimal or within a constant factor of optimal for large w, and the estimate of $2/(w + 1)$ is very inaccurate in this regime. This motivates the study of forward schemes and local schemes. It is known that there exist forward schemes with a density of $O(1/\sqrt{w})$ (Marçais et al., 2018). This density is not within a constant factor of the optimal density but at least shows that forward and local schemes do not have constant density such as minimizer schemes for large w and that they can have much lower density.
2.2.1. Connection between UHS and selection schemes. In the study of selection schemes, as for minimizer schemes, UHS play a central role. We describe the link between selection schemes and UHS, and show that the existence of a selection scheme with low density implies the existence of a UHS with a small relative size.
2.2.2. Almost-optimal relative size UHS for linear path length. Conversely, because of their link to forward and local selection schemes, we are interested in UHS with remaining path length $O(w)$. Necessarily, a UHS hits every infinitely long sequence. On de Bruijn graphs, a set hitting every infinitely long sequence is a decycling set: a set that intersects with every cycle in the graph. In particular, a decycling set must contain an element in each of the cycles obtained by the rotation of the w-mers (e.g., a cycle of the type 001 → 010 → 100 → 001). The number of these rotation cycles is known as the "necklace number" $N_{\sigma,w}$. Consequently, the relative size of a UHS, which contains at least one element from each of these cycles, is lower bounded by $O(1/w)$. The smallest previously known UHS with $O(w)$ remaining path length has a relative size of $O(\sqrt{w}/w)$ (Marçais et al., 2018). We construct a smaller UHS with relative size $O(\ln(w)/w)$:
Theorem 2. For every sufficiently large w, there is a forward scheme with density of $O(\ln(w)/w)$ and a corresponding $(O(\ln(w)/w), w)$-UHS.
2.2.3. Remaining path length bounds for the Mykkeltveit sets. Mykkeltveit (1972) gave an explicit construction for a decycling set with exactly one element from each of the rotation cycles, and thereby proved a long-standing conjecture (Golomb, 2014) that the minimal size of decycling sets is equal to the necklace number. Under the UHS framework, it is natural to ask what the remaining path length for Mykkeltveit sets is. Given that the de Bruijn graph is Hamiltonian, there exist paths of length exponential in w: the Hamiltonian tours have $\sigma^w$ vertices. Nevertheless, we show that the remaining path length for Mykkeltveit sets is upper and lower bounded by polynomials of w:
Theorem 3. For sufficiently large w, the Mykkeltveit set is a $(N_{\sigma,w}/\sigma^w, g(w))$-UHS, having the same size as minimal decycling sets, while $c_1 w^2 \le g(w) \le c_2 w^3$ for some constants $c_1$ and $c_2$.

METHODS AND PROOFS
For a location i in sequence S, the context at this location is defined as $c_i = S[i - w + 1, 2w - 1]$, a $(2w - 1)$-mer whose last w-mer starts at i. Whether f picks a new position in $S[i, w]$ is entirely determined by its context, as the conditions only involve w-mers as far back as $S[i - w + 1, w]$, which are all included in the context. This means that instead of counting selected positions in S, we can count the contexts c satisfying $f(c[w - 1, w]) + w - 1 \ne f(c[j, w]) + j$ for all $0 \le j \le w - 2$, which are the contexts such that f on the last w-mer of c picks a new location. We denote by $C_f \subseteq \Sigma^{2w-1}$ the set of contexts that satisfy this condition.
Definition 1. For given w and local selection scheme $f : \Sigma^w \to [0 : w - 1]$, $C_f = \{c \in \Sigma^{2w-1} \mid f(c[w - 1, w]) + w - 1 \ne f(c[j, w]) + j \text{ for all } 0 \le j \le w - 2\}$.
The expected density of f is computed as the number of selected positions over the length of the sequence for a random sequence, as the sequence becomes infinitely long. For a sufficiently long random sequence ($|S| \gg w$), the distribution of its contexts converges to a uniform random distribution over $(2w - 1)$-mers. Because the distribution of these contexts is exactly equal to the uniform distribution on a circular de Bruijn sequence S of order at least $2w - 1$, we can calculate the expected density of f as the density of f on S, or as $|C_f|/\sigma^{2w-1}$.
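The context-counting formula can be evaluated directly for small w. The sketch below enumerates all $(2w - 1)$-mers and keeps those whose last w-mer selects a position that no earlier w-mer of the context selected. The selection function `lex_min` (leftmost smallest character) and the function names are assumptions chosen for illustration.

```python
from itertools import product

def lex_min(window):
    """Example selection function f: offset of the leftmost smallest character."""
    return min(range(len(window)), key=lambda i: (window[i], i))

def context_set(w, f, alphabet="01"):
    """Contexts c of length 2w-1 with f(c[w-1, w]) + w - 1 != f(c[j, w]) + j for all j <= w-2."""
    contexts = set()
    for t in product(alphabet, repeat=2 * w - 1):
        c = "".join(t)
        new_pos = f(c[w - 1:]) + w - 1          # position picked by the last w-mer
        if all(f(c[j:j + w]) + j != new_pos for j in range(w - 1)):
            contexts.add(c)
    return contexts

def expected_density(w, f, alphabet="01"):
    """Density computed as |C_f| / sigma^(2w-1)."""
    return len(context_set(w, f, alphabet)) / len(alphabet) ** (2 * w - 1)

print(expected_density(2, lex_min))  # 0.75
```

For w = 2 on the binary alphabet, the only contexts that do not select a new position are 100 and 101, which gives 6/8.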
3.1.2. UHS from selection schemes. The set $C_f$ over $(2w - 1)$-mers is the UHS needed for Theorem 1. Intuitively, it is a UHS with remaining path length of at most $w - 1$, because at least one new location must be picked in every w consecutive windows, meaning there is a window that picked a new location. The context that ends with this window is in $C_f$ by definition.
Lemma 1. C f is a UHS with remaining path length of at most w -1.
Proof. By contradiction, assume there is a path of length w in the de Bruijn graph of order $2w - 1$, say $\{c_0, c_1, \cdots, c_{w-1}\}$, that avoids $C_f$. We construct the sequence $S'$ corresponding to the path: $S' \in \Sigma^{3w-2}$ such that $S'[i, 2w - 1] = c_i$.
Since $c_{w-1} \notin C_f$ and $S'$ includes $c_{w-1}$, f on the last w-mer of $c_{w-1}$ (which is $S'[2w - 2, w]$) picks a location that has been picked before on $S'$. The coordinate l of this selection in $S'$ satisfies $l \ge 2w - 2$. As $0 \le f(x) \le w - 1$, the first w-mer $S'[m, w]$ in $S'$ such that f picks $S'[l]$ (i.e., $m + f(S'[m, w]) = l$) satisfies $m \ge w - 1$. The context $c_{m-w+1} = S'[m - (w - 1), 2w - 1]$ then satisfies that a new location l is picked when f is applied to its last w-mer, and by definition $c_{m-w+1} \in C_f$, a contradiction. □
This result is also a direct consequence of the definition of $C_f$. An alternative direct proof is available in Supplementary Section S1.
When f is a forward scheme, to determine if a new location is picked in a window, looking back one window is sufficient. This is because if we do not pick a new location, we have to pick the same location as in the last window. This means the context with two w-mers, or as a $(w + 1)$-mer, is sufficient, and our other arguments involving contexts still hold. Combining the pieces, we prove the following theorem:
Theorem 4. Given a local scheme f on w-mers with density $d_f$, we can construct a $(d_f, w)$-UHS on $(2w - 1)$-mers. If f is a forward scheme, we can construct a $(d_f, w)$-UHS on $(w + 1)$-mers.
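The first statement of the theorem can be verified exhaustively for very small w. The sketch below builds the context set of a local scheme and checks that every string of length $3w - 2$ (a path of w contexts) contains a context from the set. Both example selection functions, `lex_min` and the arbitrary non-forward `digit_sum_mod`, are hypothetical choices for illustration.

```python
from itertools import product

def lex_min(window):
    """Forward example: offset of the leftmost smallest character."""
    return min(range(len(window)), key=lambda i: (window[i], i))

def digit_sum_mod(window):
    """Arbitrary local (not necessarily forward) example: digit sum modulo w."""
    return sum(int(ch) for ch in window) % len(window)

def context_set(w, f, alphabet="01"):
    """(2w-1)-mers whose last w-mer selects a position not selected earlier in the context."""
    contexts = set()
    for t in product(alphabet, repeat=2 * w - 1):
        c = "".join(t)
        new_pos = f(c[w - 1:]) + w - 1
        if all(f(c[j:j + w]) + j != new_pos for j in range(w - 1)):
            contexts.add(c)
    return contexts

def hits_every_path(w, f, alphabet="01"):
    """True if every string of length 3w-2 contains one of the contexts."""
    contexts = context_set(w, f, alphabet)
    n = 2 * w - 1
    for t in product(alphabet, repeat=3 * w - 2):
        s = "".join(t)
        if not any(s[i:i + n] in contexts for i in range(w)):
            return False
    return True

print(hits_every_path(3, lex_min), hits_every_path(3, digit_sum_mod))
```

The check is exponential in w and is only meant as a sanity test of the construction on toy parameters.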

Forbidden word depathing set
3.2.1. Construction and path length. In this section, we construct a set that is an $(O(\ln(w)/w), w)$-UHS.
Definition 2 (Forbidden Word UHS). Let $d = \lfloor \log_\sigma(w/\ln(w)) \rfloor - 1$. Define $F_{\sigma,w}$ as the set of w-mers x that satisfy either of the following clauses: (1) $0^d$ is the prefix of x; (2) $0^d$ is not a substring of x.
We assume that w is sufficiently large such that $d \ge 1$.
Lemma 2. The longest remaining path in the de Bruijn graph of order w after removing $F_{\sigma,w}$ is $w - d$.
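Proof. Let $\{x_0, x_1, \cdots, x_{w-d}\}$ be a path of length $w - d + 1$ in the de Bruijn graph. If $x_0$ does not have a substring equal to $0^d$, it is in $F_{\sigma,w}$. Otherwise, let c be the index such that $x_0[c, d] = 0^d$. Since $c \le w - d$, the w-mer $x_c$ is on the path, and $0^d$ is a prefix of $x_c$, so $x_c \in F_{\sigma,w}$. Conversely, let $S = 1^{w-d} 0^d 1^{w-d-1} \in \Sigma^{2w-d-1}$ and $x_i = S[i, w]$ for $0 \le i < w - d$. None of $\{x_i\}$ is in $F_{\sigma,w}$, meaning there is a path of length $w - d$ in the remaining graph. □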
The number of w-mers satisfying clause 1 is $\sigma^{w-d} = O(\ln(w)\sigma^w/w)$. For the rest of this section, we focus on counting w-mers satisfying clause 2 in Definition 2, that is, the number of w-mers not containing $0^d$.
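The construction and the path-length lemma can be checked on small instances. The sketch below builds the forbidden word set for given w and d, then computes the longest path (counted in vertices) in the remaining de Bruijn graph with a memoized depth-first search. Function names are illustrative assumptions, and the exhaustive enumeration is only practical for small w.

```python
import math
from itertools import product

def forbidden_d(w, sigma):
    """d = floor(log_sigma(w / ln w)) - 1, as in the definition above."""
    return math.floor(math.log(w / math.log(w), sigma)) - 1

def forbidden_word_set(w, d, alphabet="01"):
    """w-mers having 0^d as a prefix (clause 1) or not containing 0^d at all (clause 2)."""
    zeros = alphabet[0] * d
    return {x for x in map("".join, product(alphabet, repeat=w))
            if x.startswith(zeros) or zeros not in x}

def longest_remaining_path(w, removed, alphabet="01"):
    """Number of vertices on the longest path avoiding `removed`; raises if a cycle remains."""
    state = {}  # vertex -> -1 while on the DFS stack, else longest path starting there

    def visit(x):
        if x in removed:
            return 0
        if state.get(x) == -1:
            raise ValueError("a cycle remains in the graph")
        if x in state:
            return state[x]
        state[x] = -1
        state[x] = 1 + max(visit(x[1:] + a) for a in alphabet)
        return state[x]

    return max(visit("".join(t)) for t in product(alphabet, repeat=w))

w, sigma = 9, 2
d = forbidden_d(w, sigma)
F = forbidden_word_set(w, d)
print(d, len(F) / sigma ** w, longest_remaining_path(w, F))
```

For these toy parameters the asymptotic bound on the relative size is not yet meaningful; the point of the example is the exact path length $w - d$.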

3.2.2. Number of w-mers not containing $0^d$. We construct a finite state machine (FSM) that recognizes $0^d$ as follows. The FSM consists of $d + 1$ states labeled "0" to "d," where "0" is the initial state and "d" is the terminal state. The state "i" with $0 \le i \le d - 1$ means that the last i characters were 0 and $d - i$ more zeroes are needed to match $0^d$. The terminal state "d" means that we have seen a substring of d consecutive zeroes. If the machine is at nonterminal state "i" and receives the character 0, it moves to state "i + 1," otherwise it moves to state "0"; once the machine reaches state "d," it remains in that state forever.
Now, assume we feed a random w-mer to the FSM. The probability that the machine does not reach state "d" for the input w-mer is the relative size of the set of w-mers satisfying clause 2. Denote by $p_k \in \mathbb{R}^d$ the vector such that $p_k(j)$ is the probability of feeding a random k-mer to the machine and ending up in state "j," for $0 \le j < d$ (note that the vector does not contain the probability for the terminal state "d"). The answer to our problem is then $\|p_w\|_1 = \sum_{i=0}^{d-1} p_w(i)$, that is, the sum of the probabilities of ending at a nonterminal state.
Define $\mu = 1/\sigma$. Given that a randomly chosen w-mer is fed into the FSM, that is, each base is chosen independently and uniformly from $\Sigma$, the probabilities of transition in the FSM are: "i" → "i + 1" with probability $\mu$, "i" → "0" with probability $1 - \mu$. The probability matrix to not recognize $0^d$ is a $d \times d$ matrix, as we discard the row and column associated with terminal state "d":
$$A_d = \begin{pmatrix} 1-\mu & 1-\mu & \cdots & 1-\mu & 1-\mu \\ \mu & 0 & \cdots & 0 & 0 \\ 0 & \mu & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & \mu & 0 \end{pmatrix}$$
Starting with $p_0 = (1, 0, \ldots, 0) \in \mathbb{R}^d$, as initially no sequence has been parsed and the machine is at state "0" with probability 1, we can compute the probability vector $p_w$ as $p_w = A_d p_{w-1} = A_d^w p_0$.
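The recurrence on the probability vector can be iterated exactly with rational arithmetic and compared against a direct count. This is a minimal sketch under the transition rules described above (a zero advances the state, any other letter resets it, and the step into the terminal state is dropped); the function names are illustrative.

```python
from fractions import Fraction
from itertools import product

def prob_avoid_zero_run(w, d, sigma):
    """Probability that a uniform random w-mer over sigma letters contains no run of d zeroes."""
    mu = Fraction(1, sigma)                      # probability of reading the letter 0
    p = [Fraction(1)] + [Fraction(0)] * (d - 1)  # initial vector: state "0" with probability 1
    for _ in range(w):
        total = sum(p)
        # First row of the matrix: any non-zero letter sends every state back to "0".
        # Sub-diagonal: the letter 0 moves state i to state i + 1 (for i + 1 < d).
        p = [(1 - mu) * total] + [mu * p[i] for i in range(d - 1)]
    return sum(p)                                # 1-norm of the final vector

def brute_force(w, d, sigma):
    """Same probability, by enumerating every w-mer."""
    letters = "0123456789"[:sigma]
    count = sum(1 for t in product(letters, repeat=w) if "0" * d not in "".join(t))
    return Fraction(count, sigma ** w)

print(prob_avoid_zero_run(4, 2, 2), brute_force(4, 2, 2))  # 1/2 1/2
```

For w = 4, d = 2 on the binary alphabet, 8 of the 16 strings avoid "00", which matches the iterated vector.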
3.2.3. Bounding $\|p_w\|_1$. We start by deriving the characteristic polynomial $p_{A_d}(\lambda)$ of $A_d$ and its roots, which are the eigenvalues of $A_d$: The characteristic polynomial of $A_d$ satisfies the following recursive formula, obtained by expanding the determinant over the first column and using the linearity of the determinant:

For d = 1, we have $p_{A_1}(\lambda) = 1 - \mu - \lambda$. Assuming $\lambda \ne \mu$ for now, we repeatedly expand the recursive formula to obtain a closed-form formula for $p_{A_d}(\lambda)$: The value for the characteristic polynomial when $\lambda = \mu$ can be derived by plugging $\lambda = \mu$ in the line marked with (*). □
Now we fix d and focus on the polynomial $f_d(\lambda) = \lambda^{d+1} - \lambda^d - \mu^{d+1} + \mu^d$. Since this is a polynomial of degree $d + 1$, it has $d + 1$ roots, and except for $\mu$, which is a root of $f_d$ but not of $p_{A_d}$, $f_d$ and $p_{A_d}$ have the same roots.
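The relation between the characteristic polynomial and $f_d$ can be recovered as follows. This is a sketch of one possible route, not necessarily the derivation used above: it expands the determinant along the last column (an assumption; a first-column expansion is also possible), writes $\mu = 1/\sigma$ and $\lambda$ for the variable of the polynomial, and assumes that $A_d$ has $1 - \mu$ along its first row and $\mu$ on its sub-diagonal, as the FSM transitions imply.

```latex
\begin{align*}
p_{A_d}(\lambda) &= \det(A_d - \lambda I)
   = -\lambda\, p_{A_{d-1}}(\lambda) + (-1)^{d+1}(1-\mu)\,\mu^{d-1}
   && (d \ge 2),\\
p_{A_1}(\lambda) &= 1 - \mu - \lambda,\\
(\lambda - \mu)\, p_{A_d}(\lambda)
  &= (-1)^d \left(\lambda^{d+1} - \lambda^{d} - \mu^{d+1} + \mu^{d}\right)
   = (-1)^d f_d(\lambda).
\end{align*}
% Induction step: multiply the recursion by (\lambda - \mu) and use the hypothesis for d - 1:
\begin{align*}
(\lambda-\mu)\,p_{A_d}(\lambda)
 &= (-1)^d \lambda\left(\lambda^{d}-\lambda^{d-1}-\mu^{d}+\mu^{d-1}\right)
    - (-1)^{d}(1-\mu)\mu^{d-1}(\lambda-\mu)\\
 &= (-1)^d\left(\lambda^{d+1}-\lambda^{d}+\mu^{d}-\mu^{d+1}\right).
\end{align*}
```

The base case d = 1 reads $(\lambda - \mu)(1 - \mu - \lambda) = -(\lambda^2 - \lambda - \mu^2 + \mu)$, which agrees with the last identity.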
Lemma 4. For sufficiently large d, $f_d(\lambda)$ has a real root $\lambda_0$ satisfying $1 - \mu^d < \lambda_0 < 1 - \mu^{d+1}$.
Proof. We show $f_d$ has opposite signs on the lower and upper bound of this inequality for sufficiently large d.
For the last line, if $\sigma = 2$ the first two terms cancel out and $d\mu^{2d+2}$ becomes dominant and positive, otherwise $\mu^d = \sigma\mu^{d+1} > 2\mu^{d+1}$. Since $f_d$ is a polynomial, $f_d$ is continuous and thus has a root between $1 - \mu^d$ and $1 - \mu^{d+1}$. □
Lemma 5. Let $s = \mu/\lambda_0$. $\nu_0 = (1, s, s^2, \cdots, s^{d-1})$ is the right eigenvector of $A_d$ corresponding to eigenvalue $\lambda_0$, and $\|\nu_0\|_1 < 3$ for sufficiently large d.
Proof. For the first part, we need to verify $A_d \nu_0 = \lambda_0 \nu_0$. For the first element in the vector, we have: This verifies $A_d \nu_0 = \lambda_0 \nu_0$. For the second part, note that for sufficiently large d we have $\lambda_0 > 1 - \mu^d > 0.9$ and since $\mu \le 0.5$, we have $s = \mu/\lambda_0 < 2/3$. Every element of $\nu_0$ is positive, so $\|\nu_0\|_1 = \sum_{i=0}^{d-1} s^i < 1/(1 - s) < 3$.

ZHENG ET AL.
Proof. Let $\gamma_0 = \nu_0 - p_0 = (0, s, s^2, \cdots, s^{d-1})$, where $s = \mu/\lambda_0$ from the last lemma. Because $\lambda_0 > 0$, the elements of $\gamma_0$ and $A_d$ are all non-negative, and so the elements of $A_d^w \gamma_0$ and $\lambda_0 \gamma_0$ are also non-negative. Now, recall that $d = \lfloor \log_\sigma(w/\ln(w)) \rfloor - 1$, which implies that $\mu^{d+1} \ge \ln(w)/w$.
This lemma implies that the relative size for the set $F_{\sigma,w}$ is dominated by the w-mers satisfying clause 1 of Definition 2 and $F_{\sigma,w}$ is of relative size $O(\ln(w)/w)$. This completes the proof that $F_{\sigma,w}$ is an $(O(\ln(w)/w), w)$-UHS.

Construction of the Mykkeltveit sets
In this section, we construct the Mykkeltveit set $M_{\sigma,w}$ and prove some important properties of the set. We start with the definition of the Mykkeltveit embedding of the de Bruijn graph.
Definition 3 (Modified Mykkeltveit Embedding). For a w-mer x, its embedding in the complex plane is defined as $P(x) = \sum_{i=0}^{w-1} x_i r_w^{i+1}$, where $r_w$ is a wth root of unity, $r_w = e^{2\pi i/w}$.
Intuitively, the position of a w-mer x is defined as the following center of mass. The w roots of unity form a circle in the complex plane, and a weight equal to the value of the base $x_i$ is set at the root $r_w^{i+1}$. The position of x is the center of mass of these w points and associated weights. Originally, Mykkeltveit defined the embedding with weight $r_w^i$ (Mykkeltveit, 1972). This extra factor of $r_w$ in our modified embedding rotates the coordinate and is instrumental in the proof.
Define the successor function $S_a(x) = x_1 x_2 \cdots x_{w-1} a$, where $a \in \Sigma$. The successor function gives all the neighbors of x in the de Bruijn graph. A pure rotation of x is the particular neighbor $R(x) = S_{x_0}(x)$, that is, the sequence of R(x) is a left rotation of x.
We focus on a particular kind of cycle in the de Bruijn graph. A pure cycle in the de Bruijn graph, also known as a conjugacy class, is the sequence of w-mers obtained by repeated rotation: $(x, R(x), R^2(x), \ldots)$. Each pure cycle consists of w distinct w-mers, unless $x_0 x_1 \cdots x_{w-1}$ is periodic, and in this case, the size of the cycle is equal to its shortest period.
The embeddings from pure rotations satisfy a curious property:
Lemma 7 (Rotations and Embeddings). P(R(x)) on the complex plane is P(x) rotated clockwise around the origin by $2\pi/w$. $P(S_a(x))$ is P(R(x)) shifted by $\delta = a - x_0$ on the real part, with the imaginary part unchanged.
Proof. By Definition 3 and the definition of the successor function $S_a(x)$: $P(S_a(x)) = \sum_{i=0}^{w-2} x_{i+1} r_w^{i+1} + a\, r_w^{w} = r_w^{-1}\left(P(x) - x_0 r_w\right) + a = r_w^{-1} P(x) + \delta$. Note that for pure rotations $\delta = 0$, and $r_w^{-1} P(x)$ is exactly P(x) rotated clockwise by $2\pi/w$. □
The range for $\delta$ is $[-\sigma + 1, \sigma - 1]$. In particular, $\delta$ can be negative. In a pure cycle, either all w-mers satisfy P(x) = 0, or they lie equidistant on a circle centered at the origin. Figure 2a shows the embeddings and pure cycles of 5-mers. It is known that we can partition the set of all w-mers into $N_{\sigma,w}$ disjoint pure cycles. This means any decycling set that breaks every cycle of the de Bruijn graph will be at least this large. We now construct our proposed depathing set with this idea in mind.
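The rotation property can be confirmed numerically on all w-mers for small parameters. The sketch below is an illustration only: w-mers are represented as tuples of integers, and the names `embed`, `successor`, and `check_rotation_lemma` as well as the tolerance are assumptions of this example.

```python
import cmath
from itertools import product

def embed(x):
    """Modified embedding P(x) = sum_i x_i * r_w^(i+1), with r_w = exp(2*pi*i/w)."""
    w = len(x)
    r = cmath.exp(2j * cmath.pi / w)
    return sum(xi * r ** (i + 1) for i, xi in enumerate(x))

def successor(x, a):
    """S_a(x): drop the first letter of x and append a."""
    return x[1:] + (a,)

def check_rotation_lemma(sigma, w, tol=1e-9):
    """Check P(S_a(x)) = r_w^{-1} P(x) + (a - x_0) for every w-mer x and letter a."""
    r_inv = cmath.exp(-2j * cmath.pi / w)
    for x in product(range(sigma), repeat=w):
        for a in range(sigma):
            expected = r_inv * embed(x) + (a - x[0])
            if abs(embed(successor(x, a)) - expected) > tol:
                return False
    return True

print(check_rotation_lemma(2, 5), check_rotation_lemma(3, 4))
```

Taking $a = x_0$ gives the pure rotation case, where the shift vanishes and only the clockwise rotation remains.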
Definition 4 (Mykkeltveit set). We construct the Mykkeltveit set $M_{\sigma,w}$ as follows. Consider each conjugacy class; we pick one w-mer from each of them by the following rule:
1. If every w-mer in the class embeds to the origin, pick an arbitrary one.
2. If there is one w-mer x in the class such that Re(P(x)) < 0 and Im(P(x)) = 0 (on the negative real axis), pick that one.
3. Otherwise, pick the unique w-mer x such that Im(P(x)) < 0 and Im(P(R(x))) > 0. Intuitively, this is the w-mer in the cycle right below the negative real axis.
This set breaks every pure cycle in the de Bruijn graph by its construction, with an interesting property:
Lemma 8. Let $\{x_i\}$ be a path on the de Bruijn graph that avoids $M_{\sigma,w}$. If $\mathrm{Im}(P(x_i)) \le 0$, then for all $j \ge i$, $\mathrm{Im}(P(x_j)) \le 0$.
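Proof. It suffices to show that in the remaining de Bruijn graph after removing $M_{\sigma,w}$, there are no edges x → y such that $\mathrm{Im}(P(x)) \le 0$ and $\mathrm{Im}(P(y)) > 0$. The edge x → y means that $y = S_a(x)$ for some a. By Lemma 7, $\mathrm{Im}(P(R(x))) = \mathrm{Im}(P(S_a(x))) = \mathrm{Im}(P(y)) > 0$. Together with $\mathrm{Im}(P(x)) \le 0$, this means x satisfies clause 2 or clause 3 of Definition 4, so $x \in M_{\sigma,w}$ and the edge is not in the remaining graph. □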
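The construction can be reproduced for small parameters. The sketch below is a floating-point illustration, not a reference implementation: it picks one w-mer per conjugacy class with the three rules of the definition, using a tolerance `EPS` (an assumption) to decide when a coordinate is zero, and then checks that the set has the necklace-number size, that no cycle remains, and that no remaining edge goes from the closed lower half-plane to the open upper half-plane.

```python
import cmath
from itertools import product
from math import gcd

EPS = 1e-9  # tolerance for floating-point zero tests (assumption of this sketch)

def embed(x):
    w = len(x)
    r = cmath.exp(2j * cmath.pi / w)
    return sum(xi * r ** (i + 1) for i, xi in enumerate(x))

def rotate(x):
    return x[1:] + x[:1]

def mykkeltveit_set(sigma, w):
    chosen, seen = set(), set()
    for x in product(range(sigma), repeat=w):
        if x in seen:
            continue
        cycle, y = [], x
        while y not in cycle:                      # enumerate the conjugacy class of x
            cycle.append(y)
            y = rotate(y)
        seen.update(cycle)
        pts = {y: embed(y) for y in cycle}
        if all(abs(p) < EPS for p in pts.values()):
            chosen.add(cycle[0])                   # rule 1: arbitrary pick
            continue
        axis = [y for y in cycle if abs(pts[y].imag) < EPS and pts[y].real < -EPS]
        if axis:
            chosen.add(axis[0])                    # rule 2: on the negative real axis
            continue
        below = [y for y in cycle
                 if pts[y].imag < -EPS and pts[rotate(y)].imag > EPS]
        chosen.add(below[0])                       # rule 3: right below the negative real axis
    return chosen

def necklace_count(sigma, w):
    """Number of conjugacy classes, by Burnside's lemma."""
    return sum(sigma ** gcd(j, w) for j in range(w)) // w

def longest_remaining_path(sigma, w, removed):
    """Vertices on the longest path avoiding `removed`; raises if a cycle remains."""
    state = {}
    def visit(x):
        if x in removed:
            return 0
        if state.get(x) == -1:
            raise ValueError("a cycle remains in the graph")
        if x in state:
            return state[x]
        state[x] = -1
        state[x] = 1 + max(visit(x[1:] + (a,)) for a in range(sigma))
        return state[x]
    return max(visit(x) for x in product(range(sigma), repeat=w))

def no_upward_edges(sigma, w, removed):
    """No remaining edge x -> y with Im(P(x)) <= 0 and Im(P(y)) > 0."""
    for x in product(range(sigma), repeat=w):
        if x in removed or embed(x).imag > EPS:
            continue
        for a in range(sigma):
            y = x[1:] + (a,)
            if y not in removed and embed(y).imag > EPS:
                return False
    return True

M = mykkeltveit_set(2, 5)
print(len(M), necklace_count(2, 5), longest_remaining_path(2, 5, M))
```

The reported path length is only an observation for this toy instance and says nothing about the asymptotic bounds.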

Upper bounding the remaining path length in Mykkeltveit sets
In this section, we show that the remaining path after removing $M_{\sigma,w}$ is at most $O(w^3)$ long. This polynomial bound is a stark contrast to the number of remaining vertices after removing the Mykkeltveit set, that is, $\sigma^w - N_{\sigma,w} \sim (1 - \frac{1}{w})\sigma^w$, which is exponential in w. Our main argument involves embedding a w-mer to a point in the complex plane, similar to Mykkeltveit's construction.
3.4.1. From w-mers to embeddings. In this section, we formulate a relaxation that converts paths of w-mers to trajectories in a geometric space. Precisely, we model $S_a$ in Lemma 7 as a rotation operating on a complex embedding with attached weights, where the weights restrict possible moves. Formally, given a pair (z, t) where z is a complex number and t an integer, define the family of operations $Z_\delta(z, t) = (r_w^{-1} z + \delta, t + \delta)$. When z = P(x) is the position of a w-mer x, $t = W(x) = \sum_{i=0}^{w-1} x_i$ is its weight, and when $0 \le \delta + x_0 < \sigma$, $Z_\delta(P(x), W(x)) = (P(S_{\delta + x_0}(x)), W(S_{\delta + x_0}(x)))$. This means $Z_\delta$ is equivalent to finding the position and weight of the successor $S_{\delta + x_0}$.
We are now looking for the length of the longest path by repeated application of $Z_\delta$ that satisfies $0 \le t \le W_{\max}$, where $W_{\max} = (\sigma - 1)w$ is the maximum weight of any w-mer. This is a relaxation of the original problem of finding a longest path, as some choices of $\delta$ and some pairs (z, t) on these paths may not correspond to an actual transition or w-mer in the de Bruijn graph (when $\delta + x_0$ is negative or greater than $\sigma - 1$, it is not a valid transition). In some sense, the pair (z, t) is a loose representation of a w-mer where the precise sequence of the w-mer is ignored and only its weight is considered. Conversely, every valid path in the de Bruijn graph corresponds to a path in this relaxation, and an upper bound on the relaxed problem is an upper bound of the original problem.
3.4.2. Weight-in embedding and relaxation. The weight-in embedding maps the pair $x = (z, t)$ to the complex plane. This transforms the original longest remaining path problem into a geometric problem of bounding the length of a path in the complex plane under the operations $Z_d$.

Definition 5 (Weight-In Embedding). The weight-in embedding of
The $Z_d$ operations in this embedding correspond to a rotation and, perhaps surprisingly, this rotation is independent of the value $d$.

Proof. By definition of the weight-in embedding and the operation $Z_d$: In the complex plane, the rotation around center $c$ by angle $\theta$ is given by $c + e^{i\theta}(z - c)$. Therefore, the operation $Z_d$ is a rotation around $c = (-t, 0)$ by angle $\theta = -2\pi/w$.
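The computation behind this proof can be written out explicitly. The derivation below is a sketch under two assumptions that are not spelled out in the surviving text: that the weight-in embedding has the form $Q(z, t) = z - t$ (the choice consistent with the rotation center $(-t, 0)$ stated in the proof), and that $r_w = e^{2\pi i/w}$.

```latex
\begin{align*}
Q(Z_d(z,t)) &= Q\bigl(r_w^{-1} z + d,\; t + d\bigr) \\
            &= r_w^{-1} z + d - (t + d) \\
            &= r_w^{-1} z - t \\
            &= -t + r_w^{-1}\bigl((z - t) + t\bigr) \\
            &= c + e^{i\theta}\bigl(Q(z,t) - c\bigr),
               \qquad c = -t,\quad \theta = -2\pi/w .
\end{align*}
```

Under these assumptions the shift $d$ cancels in the second line, which is why the resulting rotation does not depend on $d$.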
Figure 2b shows the weight-in embedding of a de Bruijn graph. The set $C_{\sigma,w} = \{(-j, 0) \mid 0 \le j \le W_{\max}\}$ is the set of all possible centers of rotation, shown as large gray dots on the x-axis of Figure 2b. Because all the $w$-mers in a given conjugacy class have the same weight, say $t_0$, each conjugacy class lies on a circle around a particular center $(-t_0, 0)$. The image after application of $Z_d$ is independent of the parameter $d$, but depends on the weight $t$ of the underlying pair $(z, t)$.
Multiple pairs $x = (z, t)$ can share the same weight-in embedding $Q(x)$. As seen in Figure 2b, every node belongs to two circles with different centers, meaning that there are two pairs with the same $Q(x)$ but different $t$.
Lemma 8 naturally divides any path in the de Bruijn graph avoiding $M_{\sigma,w}$ into two parts: the first part with $\operatorname{Im}(P(x)) > 0$, and the second part with $\operatorname{Im}(P(x)) \le 0$. Thanks to the symmetry of the problem, we focus on the upper half-plane, defined as the region with $\operatorname{Im}(P(x)) \ge 0$. With the weight-in embedding, as long as the path is contained in the upper half-plane, it always travels to the right (toward larger real values) or stays in place, as stated below.

Lemma 10 (Monotonicity of $\operatorname{Re}(Q(\cdot))$). Assume $Q(x)$ and $Q(Z_d(x))$ are both in the upper half-plane. If $Q(x)$ does not coincide with its associated rotation center $(-t, 0)$, then $\operatorname{Re}(Q(Z_d(x))) > \operatorname{Re}(Q(x))$; otherwise $Q(Z_d(x)) = Q(x)$.
Proof. The operation is a clockwise rotation whose center is on the x-axis, and both points are in the non-negative half-plane. Necessarily, the real part increases, unless the point is the fixed point of the rotation [which is when $Q(x) = (-t, 0)$].

We further relax the problem by allowing rotations around any of the centers in $C_{\sigma,w}$, not just around the center $(-t, 0)$ corresponding to the weight in the weight-in embedding. Lemma 10 still applies in this case, and the points in the upper half-plane move from left to right. We are now left with a purely geometric problem involving no $w$-mers or weights to track: what is the longest possible path $\{z_i\}$ where $z_{i+1}$ is obtained from $z_i$ by a clockwise rotation of $2\pi/w$ around a center from $C_{\sigma,w}$, while staying in the upper half-plane at all times ($\operatorname{Im}(z_i) \ge 0$ for all $i$)?
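The monotonicity property used here can be probed numerically. The following sketch samples random points in the upper half-plane, rotates each clockwise by $2\pi/w$ around a randomly chosen center $(-j, 0)$, and checks that the real part strictly increases whenever the image stays in the upper half-plane. It is an empirical sanity check, not a proof; the function names, sampling ranges, and trial count are illustrative choices.

```python
import cmath
import random


def rotate_clockwise(z, center, w):
    """Rotate z clockwise by 2*pi/w around the real point `center`."""
    return center + cmath.exp(-2j * cmath.pi / w) * (z - center)


def monotonicity_holds(w, w_max, trials=10000, seed=0):
    """Return False if a sampled rotation stays in the upper half-plane
    without strictly increasing the real part."""
    rng = random.Random(seed)
    for _ in range(trials):
        z = complex(rng.uniform(-w_max, w_max), rng.uniform(0, w_max))
        center = -rng.randint(0, w_max)  # a center (-j, 0) with 0 <= j <= w_max
        z_next = rotate_clockwise(z, center, w)
        if z_next.imag < 0:
            continue  # image leaves the upper half-plane: the lemma does not apply
        if z_next.real <= z.real:
            return False
    return True
```

The argument behind the check is that a point at angle $\phi \in [\theta, \pi]$ from its center moves to angle $\phi - \theta$, and cosine is decreasing on $[0, \pi]$.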
We now break the problem into smaller stages as the weight-in embedding passes through the rotation centers $C_{\sigma,w} = \{(-j, 0) \mid 0 \le j \le W_{\max}\}$, the set of points that $Q(x)$ could possibly rotate around regardless of $t$. As there are $W_{\max} + 1$ rotation centers and the maximum Re(Q(x)) = Re(P(x)) for any $w$-mer is also $W_{\max}$, we define $2W_{\max}$ subregions, two between each adjacent pair of centers. Formally:

Definition 6 (Half Subregions). A subregion is defined as the area $[-j, -j + 0.5) \times [0, W_{\max}]$, called a left subregion, or $[-j + 0.5, -j + 1) \times [0, W_{\max}]$, called a right subregion, for $0 < j \le W_{\max}$.
We now define the problem of finding the longest path, localized to one left subregion, as follows:

Definition 7 (Longest Local Trajectory Problem). Define the feasible region $(0, 0.5) \times [0, W_{\max}]$ and the relaxed rotation centers $C' = \{(j, 0) \mid -W_{\max} \le j \le W_{\max}\}$. A feasible trajectory is a list of points $\{z_i\}$ such that each point is in the feasible region, and $z_i$ can be obtained by rotating $z_{i-1}$ clockwise around some $c \in C'$ by an angle of $2\pi/w$. The solution is the longest feasible trajectory.
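To make the definition concrete, the sketch below builds one feasible trajectory with a simple greedy rule: at each step it tries every relaxed center and keeps the feasible image with the smallest real part. This greedy rule is an assumption made for illustration only; it produces a valid feasible trajectory but is not claimed to find the longest one, and the function names are hypothetical.

```python
import cmath


def rotate_clockwise(z, center, w):
    """Rotate z clockwise by 2*pi/w around the real point `center`."""
    return center + cmath.exp(-2j * cmath.pi / w) * (z - center)


def is_feasible(z, w_max):
    """Membership in the feasible region (0, 0.5) x [0, w_max]."""
    return 0 < z.real < 0.5 and 0 <= z.imag <= w_max


def greedy_trajectory(z0, w, w_max, max_steps=10000):
    """Build a feasible trajectory from z0, always taking the feasible
    rotation that advances the real part the least (a heuristic)."""
    if not is_feasible(z0, w_max):
        return []
    path = [z0]
    for _ in range(max_steps):
        # Try every relaxed center (j, 0) with -w_max <= j <= w_max.
        images = [rotate_clockwise(path[-1], j, w)
                  for j in range(-w_max, w_max + 1)]
        images = [z for z in images if is_feasible(z, w_max)]
        if not images:
            break  # no rotation keeps the point inside the feasible region
        path.append(min(images, key=lambda z: z.real))
    return path
```

For example, `greedy_trajectory(0.01 + 0.05j, 16, 16)` returns a list of points whose real parts increase at every step; its length is a lower bound on the longest feasible trajectory from that starting point, nothing more.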
Again, note that this new definition is a purely geometric problem involving no $w$-mers and no weights $W(x)$ to track. The point $z_i$ might stagnate if it coincides with one of the rotation centers, so we do not allow $\operatorname{Re}(z_i) = j$ in this geometric problem. Still, it suffices to solve this simpler problem, as indicated by the following lemma:

of contexts yielding new selections. However, it is not always possible to go the other way: there are UHS that are not equal to a set of contexts $C_f$ for any function $f$.
We are thus interested in the following questions. Given a UHS $U$ with relative size $d$, is it possible to create another UHS $U'$ from $U$ that has the same relative size $d$ and corresponds to a local scheme (i.e., there exists $f$ such that $U' = C_f$)? If not, what is the smallest price to pay (extra density compared with the relative size of the UHS) to derive a local scheme from the UHS $U$?

Existence of "perfect" selection schemes
One of the goals of this research is to confirm or deny the existence of asymptotically "perfect" selection schemes with a density of $1/w$, or at least $O(1/w)$. A study of UHS might shed light on this problem. If such a perfect selection scheme exists, then an asymptotically perfect UHS, defined as an $(O(1/w), w)$-UHS, would exist. Conversely, if we could rule out the existence of an asymptotically perfect UHS, this would imply the nonexistence of a "perfect" forward selection scheme with density $O(1/w)$.

Asymptotic results and practical uses of minimizer schemes
This line of research places more focus on the asymptotic densities of minimizers (and, naturally, the asymptotic densities of local schemes and the asymptotic relative proportion and path length of UHS). That is, we focus on characterizing these quantities in the limit where $w$ and $k$, the window length (number of k-mers in a window) and the length of a k-mer, go to infinity. In contrast, in current practice the values of $w$ and $k$ are relatively small, with $w$ below 100 and $k$ below 30 for the vast majority of use cases. Although our results are not immediately useful for analyzing these practical scenarios, because we do not attempt to determine the constants hidden by the big-O notation, we believe that further refinement of our approaches can close the gap between theory and practice and make rigorous analyses possible for practical minimizer and/or local schemes.

Remaining path length of minimum decycling sets
There is more than one decycling set of minimum size (MDS) for a given $w$. The Mykkeltveit set (Mykkeltveit, 1972) is one possible construction, and a construction based on very different ideas is given in Champarnaud et al. (2004). The number of MDSs is much larger than the two sets obtained by these two methods. Empirically, for small values of $w$, we can exhaustively enumerate all the MDSs on the binary alphabet: for $2 \le w \le 7$, the number of MDSs is, respectively, 2, 4, 30, 28, 68 288, and 18 432.
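A brute-force enumeration of this kind can be sketched as follows for very small $w$. The sketch relies on the known fact that the minimum size of a decycling set equals the number of conjugacy classes (Mykkeltveit, 1972), so a minimum set must contain exactly one $w$-mer from each class. It is not the authors' search code; the function names are hypothetical, and the running time grows too quickly for this approach to reach the larger values of $w$ listed above.

```python
from itertools import product


def conjugacy_classes(w, sigma=2):
    """Group all w-mers (as tuples) into classes of cyclic rotations."""
    seen, classes = set(), []
    for x in product(range(sigma), repeat=w):
        if x not in seen:
            cls = {x[i:] + x[:i] for i in range(w)}
            seen |= cls
            classes.append(sorted(cls))
    return classes


def is_acyclic(nodes, sigma=2):
    """Kahn-style check that the de Bruijn subgraph induced by `nodes` has no cycle."""
    succ = {x: [x[1:] + (a,) for a in range(sigma) if x[1:] + (a,) in nodes]
            for x in nodes}
    indegree = {x: 0 for x in nodes}
    for x in nodes:
        for y in succ[x]:
            indegree[y] += 1
    stack = [x for x in nodes if indegree[x] == 0]
    peeled = 0
    while stack:
        x = stack.pop()
        peeled += 1
        for y in succ[x]:
            indegree[y] -= 1
            if indegree[y] == 0:
                stack.append(y)
    return peeled == len(nodes)


def count_minimum_decycling_sets(w, sigma=2):
    """Count the choices of one w-mer per class whose removal leaves no cycle."""
    all_nodes = set(product(range(sigma), repeat=w))
    return sum(
        is_acyclic(all_nodes - set(choice), sigma)
        for choice in product(*conjugacy_classes(w, sigma))
    )
```

For $w = 2$ and $w = 3$ this enumeration can be confirmed by hand (the sets must contain both constant words plus a compatible choice from the remaining classes), which agrees with the first two counts quoted above.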
While experiments suggest that the longest remaining path in a Mykkeltveit depathing set, as defined in the original article, is around $\Theta(w^3)$, matching our upper bound, we do not know whether such a bound is tight across all possible minimum decycling sets. The Champarnaud set seems to have a longer remaining path than the Mykkeltveit set, although it is unknown whether it is within a constant factor, bounded by a polynomial in $w$ of a different degree, or exponential. More generally, we would like to know the range of possible remaining path lengths, as a function of $w$, over the set of all MDSs.
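For small $w$, the remaining path of any given decycling set can be measured directly, since the graph left after removal is acyclic. The sketch below counts the vertices on the longest remaining path by memoized recursion; it assumes the supplied set really is decycling (otherwise the recursion does not terminate), and the function name is hypothetical. A path with $p$ vertices spells a string of length $p + w - 1$, so the result should be converted to whichever length convention is in use.

```python
from functools import lru_cache
from itertools import product


def longest_remaining_path(removed, w, sigma=2):
    """Number of vertices on the longest path in the de Bruijn graph of
    order w after deleting `removed` (assumed to be a decycling set)."""
    remaining = set(product(range(sigma), repeat=w)) - set(removed)

    @lru_cache(maxsize=None)
    def longest_from(x):
        # Longest path starting at x, counted in vertices.
        best = 0
        for a in range(sigma):
            y = x[1:] + (a,)
            if y in remaining:
                best = max(best, longest_from(y))
        return 1 + best

    return max((longest_from(x) for x in remaining), default=0)
```

As a small worked case, removing 000, 111, 010, and 011 from the binary graph of order 3 leaves the path 110, 100, 001, so the function returns 3 for that set.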

AUTHOR DISCLOSURE STATEMENT
H.Z. declares no competing financial interests. C.K. is a cofounder of Ocean Genomics, Inc. G.M. is V.P. of software development at Ocean Genomics, Inc.

FUNDING INFORMATION
This work was supported, in part, by the Gordon and Betty Moore Foundation's Data-Driven Discovery Initiative through grant GBMF4554 to Carl Kingsford, by the U.S. National Science Foundation (CCF-1256087, CCF-1319998), and by the U.S. National Institutes of Health (R01GM122935).