WGT: Tools and algorithms for recognizing, visualizing, and generating Wheeler graphs

Summary
A Wheeler graph represents a collection of strings in a way that is particularly easy to index and query. Such a graph is a practical choice for representing a graph-shaped pangenome, and it is the foundation for current graph-based pangenome indexes. However, there are no practical tools to visualize or to check graphs that may have the Wheeler properties. Here, we present Wheelie, an algorithm that combines a renaming heuristic with a permutation solver (Wheelie-PR) or a Satisfiability Modulo Theory (SMT) solver (Wheelie-SMT) to check whether a given graph has the Wheeler properties, a problem that is NP-complete in general. Wheelie can check a variety of random and real-world graphs in far less time than any algorithm proposed to date. It can check a graph with thousands of nodes in seconds. We implement these algorithms together with complementary visualization tools in the WGT toolkit, available as open source software at https://github.com/Kuanhao-Chao/Wheeler_Graph_Toolkit.


INTRODUCTION
A Wheeler graph is a class of directed, edge-labeled graph that is particularly easy to index and query. It is a generalization of the Burrows-Wheeler Transform (BWT)-based FM index, 1 and partly forms the basis for existing pangenome alignment tools such as vg. 2,3 A graph is a Wheeler graph when its nodes can be totally ordered according to the co-lexicographical order of the sets of strings spelled out on all paths leading into the nodes. Formally: an edge-labeled, directed graph is a Wheeler graph if and only if there exists a total ordering over its nodes such that 0-indegree nodes come before all other nodes in the ordering, and for all pairs of edges, $(u, v)$ and $(u', v')$ labeled $a$ and $a'$, respectively: (i) $a \prec a' \Rightarrow v < v'$, and (ii) $(a = a') \wedge (u < u') \Rightarrow v \leq v'$.
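The definition above can be checked directly for a candidate ordering. The following is a minimal sketch, not WGT's implementation: the function name `is_wheeler_order`, the edge representation as `(u, v, label)` tuples, and the use of Python's string ordering for labels are all assumptions made for illustration.

```python
from itertools import combinations


def is_wheeler_order(order, edges):
    """Return True if `order` (a list of nodes, smallest first) satisfies the
    Wheeler properties for `edges`, a list of (u, v, label) tuples."""
    pos = {node: i for i, node in enumerate(order)}
    has_in = {v for _, v, _ in edges}

    # 0-indegree nodes must come before every node with an incoming edge.
    first_with_in = min((pos[v] for v in has_in), default=len(order))
    if any(pos[v] > first_with_in for v in order if v not in has_in):
        return False

    # Check properties (i) and (ii) for every pair of edges, in both directions.
    for e1, e2 in combinations(edges, 2):
        for (u, v, a), (u2, v2, a2) in ((e1, e2), (e2, e1)):
            if a < a2 and not pos[v] < pos[v2]:
                return False  # violates (i)
            if a == a2 and pos[u] < pos[u2] and not pos[v] <= pos[v2]:
                return False  # violates (ii)
    return True
```

The check is quadratic in the number of edges, which is acceptable for verifying one ordering; the hard part of recognition is finding the ordering in the first place.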
Many graph- and tree-shaped structures relevant to genomics either have the Wheeler graph properties or can be modified to have them. These include certain trees (via the XBWT), 4 De Bruijn graphs, 5 and reverse deterministic graphs derived from multiple alignments. 6 The discovery of this unified definition of Wheeler graphs will help researchers to develop still more BWT variants, e.g., for pangenome indexing. For instance, the GCSA study proposed building a reverse deterministic graph from a multiple alignment, then modifying the graph through a repeated path doubling procedure, ultimately yielding a Wheeler graph. 6 This approach can generate smaller graphs while preserving genome coordinates, compared to the more widely used De Bruijn graph, as demonstrated in Figure S3.
Despite the utility of the Wheeler graphs mentioned previously, 2,6-8 there are no tools or libraries that make it easy to use Wheeler graphs or to check if a particular graph has the requisite properties. The recognition problem is NP-complete in general and hard to approximate. 9 An exponential-time algorithm was proposed by Gibney & Thankachan, 9 but no implementation is available.
We present Wheeler graph toolkit (WGT), an open source suite for generating, recognizing, and visualizing Wheeler graphs. WGT includes functionality for generating graphs that do or do not have the Wheeler properties. Two generators produce De Bruijn graphs and tries derived from one or more input sequences provided as FASTA. Another generator produces reverse deterministic graphs 6 from multiple sequence alignments. A fourth generator produces random graphs parameterized by the desired

Figure 1. Illustration of the renaming heuristic
(A) An 8-node graph with nodes divided into four groups according to incoming edge label (with O representing 0-indegree nodes).
(B) The workflow of the renaming heuristic. The first table in (B) shows the initialized in-node lists for eight nodes. After initialization, the algorithm sorts and relabels nodes in each group until convergence. Then, it passes the range information to either Wheelie-Pr or Wheelie-SMT.

Comparing Wheelie with Gibney & Thankachan
Gibney and Thankachan's recognition algorithm 9 (henceforth ''G & T'') works by enumerating all possible values for the O, I, and L arrays making up the Wheeler graph structure as described by Gagie et al. 11 The O bitarray is a concatenation of unary codes describing the outdegrees of each node. I is a similar bitarray that does the same for indegrees. L is a sequence of characters labeling the edges in the order they appear in the O array. Further, the inner loop of the algorithm must check if a given assignment for I, O, and L is isomorphic to the input graph provided.
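To make the O, I, and L arrays concrete, the sketch below builds them for a graph whose node order is already known. It is illustrative only: the function name `oil_arrays` is hypothetical, the unary code for a degree k is assumed to be k zeros followed by a one, and outgoing edges of one node are assumed to be listed by label.

```python
def oil_arrays(order, edges):
    """Build the O, I, and L arrays for nodes listed in `order` and
    `edges` given as (u, v, label) tuples. Assumed convention: degree k
    is written in unary as k zeros followed by a single one."""
    pos = {node: i for i, node in enumerate(order)}
    outdeg = [0] * len(order)
    indeg = [0] * len(order)
    for u, v, _ in edges:
        outdeg[pos[u]] += 1
        indeg[pos[v]] += 1

    o_bits = "".join("0" * d + "1" for d in outdeg)
    i_bits = "".join("0" * d + "1" for d in indeg)

    # Labels in the order the edges appear in O: by source node, then
    # (as an assumption) by label and destination within one source.
    by_source = sorted(edges, key=lambda e: (pos[e[0]], e[2], pos[e[1]]))
    labels = "".join(a for _, _, a in by_source)
    return o_bits, i_bits, labels
```

Under this convention O and I each have e + n bits, which is where the 2(e + n) term in the size of the enumeration space comes from.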
While the G & T algorithm explores an exponential-sized space, Wheelie explores the factorial-sized space of node permutations. To consider how this affects the size of an exhaustive search, we compared the search space for the worst-case scenarios for both algorithms in Table S1. We hypothesized that Wheelie could be made faster with the help of strategies for pruning the search space. Wheelie prunes its search by assigning labels to nodes according to their rough positions in the order, a strategy we call the ''renaming heuristic''. This allows Wheelie to arrive rapidly at a rough ordering that either (a) reveals a conflict that prevents the graph from having the Wheeler properties or (b) reduces the problem size for the downstream solving algorithm. An 8-node example illustrating the renaming heuristic is shown in Figure 1, and the full algorithm is described in STAR Methods (Wheelie and the renaming heuristic; Satisfiability Modulo Theories (SMT) solver).
Here, we use a version of the algorithm called Wheelie-Pr, which begins with the renaming heuristic then tries all permutations of any nodes whose relative order remains tied. We conducted a 30-s timeout test on both algorithms using graphs generated from four generators (WGT's graph generating algorithms) including both Wheeler and non-Wheeler graphs. Rather than implement G & T's entire algorithm, we implemented the enumeration of the I, O, and L arrays but omitted the graph isomorphism check in the inner loop. We reasoned that if Wheelie-Pr was faster than the G & T algorithm without the (rather complex) isomorphism check, it would also be faster than the full G & T algorithm. To compare the algorithms, we configured both to perform an exhaustive search, without the possibility of early stopping if a solution is found. This differs from Wheelie-Pr's default behavior, which allows it to stop upon finding a node ordering for which the Wheeler properties are satisfied. Early stopping is still possible for Wheelie-Pr in these experiments, since it might identify a conflict that proves the graph is non-Wheeler.
We took 25 multiple ortholog alignments, both their DNA and amino acid (AA) sequences, and extracted the first 4 rows of each. To reduce graph size, we truncated the graphs with respect to the multiple-alignment columns. We tested three types of graphs that are known to be Wheeler graphs (De Bruijn graphs, tries, and random Wheeler graphs) and two types that are not guaranteed to be Wheeler graphs (pseudo-De Bruijn graphs and reverse deterministic graphs). Pseudo-De Bruijn graphs are graphs where the nodes correspond to (k − 1)-mers in the multiple alignment, but where we do not collapse identical (k − 1)-mers into single nodes.
For De Bruijn graphs and pseudo-De Bruijn graphs, we took columns 1 to 200 and set k to 3 to 9; for tries, we took columns 1 to 200; for reverse deterministic graphs, we took columns 2 to 41. We also benchmarked with a series of randomly generated graphs with n set to 3 to 33, e from 3 to n, and s from 1 to 21. The number of graphs of each type and their node and edge numbers are shown in Figure 2B, and the arguments of each generator can be found in the WGT GitHub repository: https://github.com/Kuanhao-Chao/Wheeler_Graph_Toolkit. Figure 2A shows that Wheelie-Pr is significantly faster than G & T, allowing it to recognize a range of Wheeler and non-Wheeler graphs. Wheelie-Pr runtimes generally range from 100 to 1,000 ms, with 784 random-graph inputs causing Wheelie-Pr to time out. In sum, the only type of graph that caused Wheelie-Pr to time out is the random graph, whereas 8,461 graphs distributed across all types of benchmarked graphs caused G & T to time out.

Visualizing and characterizing challenging graphs
We selected a De Bruijn graph, with k-mers as edges and (k − 1)-mers as nodes where k = 4, from the Figure 2 benchmarks. This graph was derived from the first four rows of the multiple alignment of STAU2 DNA orthologs with sequence length 4. We first visualized it using Graphviz 13 (Figure 3A). We ran Wheelie-Pr to find an ordering for which the Wheeler properties hold (Figure 3B). Finally, we visualized the graph using WGT's Python-based visualizer, which draws the ordered nodes in two replicas, with outgoing edges leaving one replica (Figure 4, top rows) and entering the other (bottom rows). For a valid Wheeler ordering, nodes with no incoming edges will appear leftmost, nodes with incoming edges of the smallest character will come next, nodes with incoming edges of the next-smallest character next, etc. Further, no two same-color edges will cross each other. In this way, the diagram, first described by Boucher et al., 14 makes it visually obvious when an ordering has yielded the Wheeler properties.
We sought to understand which graphs require the most time for recognition. After using these tools to investigate the ''outlier'' graphs on which Wheelie-Pr timed out, we found that the graphs requiring the most recognition time tended to have nodes with many outgoing same-label edges. Following Alanko et al., 15 we use the term d-NFA to describe a Wheeler graph where all nodes have ≤ d outgoing same-label edges, and at least one node has exactly d outgoing same-label edges. The De Bruijn graph shown in Figure 3 is a 1-NFA. Figures 4A-4C are 2-NFAs with s equal to 1, 2, and 3, respectively. Figure 4D is a 3-NFA with s = 1. Note that the d ≥ 5 case is the one proven to be NP-complete. 9 De Bruijn graphs and tries are 1-NFAs.
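The value of d for a given graph follows directly from this definition. A minimal sketch, assuming edges are distinct `(u, v, label)` tuples and using the hypothetical name `nfa_degree`:

```python
from collections import Counter


def nfa_degree(edges):
    """Return d, the largest number of outgoing edges that share one
    label at a single node. `edges` holds distinct (u, v, label) tuples."""
    same_label_out = Counter((u, label) for u, _, label in edges)
    return max(same_label_out.values(), default=0)
```

For example, a node with two outgoing edges labeled `a` and one labeled `b` contributes a count of 2, so a graph containing only that node's edges is a 2-NFA under this definition.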

Recognizing challenging graphs with Wheelie-SMT
Motivated by previous work that showed how Boolean satisfiability formulations can solve special cases of the recognition problem, 15 we hypothesized that SMT solvers 10 could solve all or part of the Wheeler graph recognition problem. SMT has found many uses in artificial intelligence and formal methods for hardware and software development. As a generalization of Boolean satisfiability, 16 SMT allows us to encode the Wheeler graph properties in a fairly straightforward way, building from the propositional logic formulas in the definition.
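One way to picture such an encoding is to give each node an integer position and assert the Wheeler properties over those positions. The sketch below emits a generic SMT-LIB 2 script that a solver such as Z3 or CVC5 could read. It is an assumption-laden illustration, not the encoding used by Wheelie-SMT: the function name `wheeler_smtlib`, the variable naming, and the choice of the `QF_LIA` logic are all hypothetical, and it omits the range constraints that the renaming heuristic would supply.

```python
from itertools import permutations


def wheeler_smtlib(nodes, edges):
    """Emit an SMT-LIB 2 script whose models are Wheeler orderings.
    `nodes` is a list; `edges` is a list of (u, v, label) tuples whose
    labels compare with Python's `<`."""
    name = {v: f"x{i}" for i, v in enumerate(nodes)}
    n = len(nodes)
    has_in = {v for _, v, _ in edges}
    out = ["(set-logic QF_LIA)"]

    # One integer position per node, all distinct, in the range 1..n.
    for v in nodes:
        out.append(f"(declare-const {name[v]} Int)")
        out.append(f"(assert (and (<= 1 {name[v]}) (<= {name[v]} {n})))")
    if n > 1:
        out.append("(assert (distinct " + " ".join(name[v] for v in nodes) + "))")

    # 0-indegree nodes precede all nodes with incoming edges.
    for u in nodes:
        for v in nodes:
            if u not in has_in and v in has_in:
                out.append(f"(assert (< {name[u]} {name[v]}))")

    # Properties (i) and (ii) for every ordered pair of edges.
    for (u, v, a), (u2, v2, a2) in permutations(edges, 2):
        if a < a2:
            out.append(f"(assert (< {name[v]} {name[v2]}))")
        elif a == a2:
            out.append(
                f"(assert (=> (< {name[u]} {name[u2]}) (<= {name[v]} {name[v2]})))"
            )

    out += ["(check-sat)", "(get-model)"]  # the model is meaningful only if sat
    return "\n".join(out)
```

The number of pairwise constraints grows quadratically with the number of edges, which gives one intuition for why narrowing the problem before handing it to the solver matters.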
We conducted two series of 1,000-s timeout tests using graphs generated from the random generator comparing (1) Wheelie-Pr (renaming heuristic plus permutation) versus (2) Wheelie-SMT (renaming heuristic plus SMT) on different types of d-NFA (Recognizing d-NFAs) and various sizes of random graphs (Recognizing random Wheeler graphs).

Recognizing d-NFAs
We fixed n = 1,000, e = 3,000, and s = 4 and randomly generated d-NFAs with d from 1 to 8, with 20 graphs in each group. Figure 5 shows that both solvers can solve graphs swiftly when d is 1 and 2; as d grows beyond 2, both solvers require much more time, demonstrating that d impacts the hardness of the recognition problem in practice. Wheelie-SMT outperforms Wheelie-Pr and avoids any timeouts; Wheelie-Pr has some timeouts starting at d = 3 (4 out of 20 graphs), and consistently times out when d ≥ 4.

iScience Article

Further, we observed that when d ≥ 6, the median curve for Wheelie-SMT plateaus. This is because n and e are too small for the d-NFA generator to produce uniformly distributed d-NFAs under the given parameters. More precisely, the hardness of the recognition problem is a function of the distribution of nodes having d − 1, d − 2, …, 1 outgoing edges with the same labels. As an example, take a d-NFA G that has one node with d same-label outgoing edges, and the rest of the nodes having at most one outgoing same-label edge. Recognizing G is not harder than recognizing a uniformly distributed (d − 1)-NFA. In short, we observed that higher values of d generally led to a harder recognition problem, but the true level of hardness was also a function of n, e, and s.

Recognizing random Wheeler graphs
We defined ''graph size'' as n and ''label density'' as e/s. We then benchmarked various sizes of random graphs while varying these parameters. We first fixed the number of edges (e = 8,000) and labels (s = 4) while scaling graph size n from 2,000 to 8,000. Figure 6A shows that as n grows, Wheelie-SMT outperforms Wheelie-Pr significantly. Wheelie-Pr starts to time out in some cases when n = 2,000, and in most cases when n ≥ 2,500. In contrast, Wheelie-SMT can solve all cases with n up to 4,000, and most cases when n = 4,500.
We then fixed the graph size (n = 4,000) and number of edges (e = 20,000) and varied the label density (e/s from 1,000 to 4,000). Figure 6B shows that as the label density increases, the graphs take more time to solve. Comparing Wheelie-Pr and Wheelie-SMT, we can see that Wheelie-Pr has more timeout cases at label densities from 1,200 to 4,000 (most cases time out when e/s ≥ 2,500), whereas Wheelie-SMT times out only when the label density is 4,000.
In a third experiment, we fixed the graph size (n = 2,000) and varied the number of edges (e) and labels (s) while fixing the label density ratio (e/s = 2,000). Figure 6C shows that as more edges and labels are added, the recognition problem becomes easier. In short, this is because adding more constraints to G breaks more of the ties that would otherwise obstruct Wheelie's renaming heuristic. Comparing Wheelie-Pr to Wheelie-SMT, Figure 6C shows that the solvers perform similarly, with Wheelie-Pr performing slightly better when the $s/s_{\min}$ ratio gets larger (≥ 5). These are likely cases where the graph is sufficiently easy to recognize that the overhead of setting up the SMT problem becomes harmful. When $s/s_{\min}$ gets smaller (3 and 4), Wheelie-SMT is able to solve all 15 cases, whereas Wheelie-Pr times out for about half the cases.
Benchmarking Wheelie-SMT alone
To isolate the effect of the Wheelie renaming heuristic, we conducted a 30-s timeout test with 60 s timeout penalties on (1) Wheelie-SMT (renaming heuristic plus SMT) and (2) a pure SMT solver starting from scratch, without the constraints it would otherwise receive from the renaming heuristic. We benchmarked these using two generators from DNA alignments: (1) De Bruijn graphs generated with options -k from 5 to 8, -l from 100 to 2,000, and -a from 6 to 10, 225 graphs in total, and (2) reverse deterministic graphs generated with options -l from 100 to 500 and -a from 4 to 6, 225 graphs in total. Figure 7 shows cactus plots on De Bruijn graphs and reverse deterministic graphs. A cactus plot is an aggregated sorted time plot widely used in solver competitions. It shows how many problems a solver can solve in a limited time period. In Figure 7A, Wheelie-SMT solved the whole De Bruijn graph set in around 6.5 s whereas the pure SMT approach solved it in around 820 s. For reverse deterministic graphs (Figure 7B), Wheelie-SMT solved the whole set in less than 9 s whereas the pure SMT approach solved it in around 10,170 s.
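The points of a cactus plot can be derived from per-instance runtimes as sketched below. This assumes the cumulative variant of the plot, in which the y value is the running total of sorted runtimes and a timed-out instance is charged a fixed penalty; the function name `cactus_points` and the use of `None` to mark a timeout are illustrative choices, not details taken from the benchmark scripts.

```python
from itertools import accumulate


def cactus_points(runtimes, timeout_penalty):
    """Return (instances_counted, cumulative_seconds) pairs.
    `runtimes` holds seconds per instance, with None marking a timeout,
    which is charged `timeout_penalty` seconds (an assumed scoring rule)."""
    charged = sorted(timeout_penalty if t is None else t for t in runtimes)
    return list(enumerate(accumulate(charged), start=1))
```

For instance, `cactus_points([0.5, 2.0, None, 1.0], 60.0)` sorts the charged times to 0.5, 1.0, 2.0, 60.0 and accumulates them, so a curve that stays low for many instances indicates a solver that handles most inputs quickly.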
We concluded that the renaming heuristic is a crucial step, since it greatly narrows the space of possible node orderings that must be resolved by the SMT solver. Wheelie-SMT can solve graphs several orders of magnitude larger than a pure SMT approach.

DISCUSSION
We demonstrated that Wheelie-SMT is the fastest and most robust algorithm available for the Wheeler graph recognition problem. We showed this across a variety of graph types, including large graphs (thousands of nodes and edges) and challenging graphs, such as those that are d-NFAs with values of d up to 8. We also demonstrated WGT's facilities for visualizing and understanding these graphs.
While current pangenome representations tend to be based on De Bruijn graphs, which are Wheeler graphs, other relevant pangenome graph representations are not necessarily Wheeler. For example, the reverse deterministic automata of the GCSA study 6 are not Wheeler, though they can be made Wheeler through a ''path doubling'' process. In the future, we expect that WGT and the Wheelie algorithm will be useful for studying alternative pangenome graph representations that might improve upon De Bruijn graphs in various ways. In Figure S3, we provide a simple illustration of why another Wheeler graph (besides a De Bruijn graph) might be better suited as a pangenome representation, since it (a) uses fewer nodes and edges than a corresponding De Bruijn graph, (b) does not require that we select a particular value for k (the k-mer length), and (c) avoids collapsing sequences that are distinct with respect to the coordinate system of a given genome.

We noted a relationship between the renaming heuristic proposed here and the ''forward algorithm'' of Alanko et al. 17 In the future, it will be important to clarify the relationship between these two algorithms, which have similar goals but take different approaches to partitioning and ordering the graph nodes. Appealingly, the forward algorithm has polynomial running time when the input is already a Wheeler graph; we do not have a similar guarantee for the renaming heuristic. However, the forward algorithm is not directly usable for the same purpose as the renaming heuristic, since it is possible for the forward algorithm to collapse a non-Wheeler input in a way that produces a Wheeler output. Another question for future work is whether the renaming heuristic could be combined with the forward algorithm to obtain an algorithm with strong guarantees (like the forward algorithm) but that is directly applicable to the recognition problem (like the renaming heuristic).
When Wheelie determines that a graph is a Wheeler graph, it is able to report a node ordering that can then be used to index the graph. In the future, it will be important to extend Wheelie to report other useful information, including when the graph is not a Wheeler graph. For instance, when Wheelie encounters a conflict that proves the graph to be non-Wheeler, Wheelie could supply the user with an explanation for why the graph cannot be Wheeler. Such an explanation could also allow Wheelie to suggest modifications to the graph that would make it a Wheeler graph, without changing which strings it encodes. A trivial example would be a node with two incoming edges having two distinct labels. This violates the Wheeler graph properties, but also suggests a potential solution: the node could be duplicated, with outgoing edges also duplicated. The initial inbound edges could be redrawn to point to the distinct duplicates, possibly restoring the Wheeler properties. A more general approach for understanding Wheeler violations could work by extracting conflicting sets of clauses from the SMT algorithm, and converting them into a human-understandable or other actionable form.
It may also be possible to encode the renaming heuristic as a set of clauses in the SMT solver, potentially allowing the entire algorithm to execute within the SMT solver. Finally, as different SMT solvers such as CVC5 18 or Z3 19 adopt different heuristics, they could potentially be substituted into WGT, or combined for increased efficiency. 20

Limitations of the study
A theoretical limitation is that we cannot claim to have improved on the worst-case bounds already established for the recognition problem. The problem is NP-complete in general, though our work suggests that the kinds of graphs that appear in pangenome applications (e.g., when they are d-NFAs with small d) may belong to a class that is easier to check in practice. A further practical limitation is the fact that Wheelie cannot yet scale to large pangenome graphs. Wheelie has been demonstrated to effectively identify graphs with up to 4,000 nodes, 20,000 edges, and 5 edge labels within a time frame of around 1,000 s. We are still far from being able to efficiently check pangenomes derived from, for example, whole eukaryotic genomes. A final limitation is the fact that our bipartite visualization approach works only when a particular ordering has been proposed. A more general approach would render a useful visualization when such an ordering is either not known or partially known.

METHOD DETAILS
The search space of Wheelie's permutation approach
While the G & T algorithm explores an exponential-sized space of possible array assignments, Wheelie explores a factorial-sized space of node permutations. This may or may not lead to a larger search space for Wheelie, depending on the graph's properties. To be specific, G & T's algorithm may have to consider all $2^{2(e+n) + e \log(s)}$ assignments for I, O, and L. Our approach might need to consider $n!$ node permutations in the worst case. We sought a rough comparison between the approaches in light of the fact that G & T's space depends not only on $n$ but also on $e$ and $s$.

Let $C := e(2 + \log s)$ (Derivation 1: the relationship between $C$ and $n$).

In Derivation 1, we fixed $n$ for both, defining a new variable $C$ as $e(2 + \log s)$. We then found values for $C$ that equalize the algorithms' search space sizes for various values of $n$ (Table S1). For instance, when $n = 100$, $C$ can be at most 324 for G & T's algorithm to have an equal or smaller search space than Wheelie-Pr, which is a strict threshold. Furthermore, this comparison is done with Wheelie-Pr skipping the renaming heuristic, which in reality makes Wheelie-Pr much faster (Results, Benchmarking Wheelie-SMT alone).
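The threshold quoted for $n = 100$ can be reproduced with a short calculation. Assuming the logarithm is base 2, the G & T space is $2^{2n + C}$, so equating it with $n!$ gives $C = \log_2(n!) - 2n$. The sketch below evaluates that expression; the function names are hypothetical and the derivation is a reconstruction under that base-2 assumption rather than a copy of Derivation 1.

```python
import math


def gt_space_log2(n, e, s):
    """log2 of the number of (I, O, L) assignments: 2(e + n) + e * log2(s)."""
    return 2 * (e + n) + e * math.log2(s)


def wheelie_space_log2(n):
    """log2(n!), computed with lgamma to avoid huge integers."""
    return math.lgamma(n + 1) / math.log(2)


def c_threshold(n):
    """Largest C = e * (2 + log2 s) for which 2^(2n + C) <= n!."""
    return wheelie_space_log2(n) - 2 * n
```

With these definitions, `math.floor(c_threshold(100))` evaluates to 324, matching the value stated above.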
To gain a further advantage over the G & T algorithm, Wheelie strives to prune the search space, using a renaming heuristic, an SMT solver, or both, as detailed in STAR Methods.
Wheelie and the renaming heuristic
Wheelie explores the space of possible node orderings until arriving either at a conflict (e.g., a node with distinctly labeled incoming edges) or an ordering for which the Wheeler properties hold. While this is a large ($n!$-sized) search space, Wheelie prunes the space by assigning labels to nodes according to their rough position in the overall order. Initially, a rough ordering is determined according to the labels of the immediate incoming edges for each node, following the Wheeler requirement that $a \prec a' \Rightarrow v < v'$ for all edge pairs. This rough ordering is refined over the course of a procedure that iterates either until the rough ordering becomes a total ordering, or until the rough ordering stabilizes. In the latter case, the remaining ambiguities are resolved by a non-heuristic solver. This procedure is detailed in Algorithm 1 and illustrated in Figure 1.
As the renaming heuristic iterates, it repeatedly visits the nodes in groupings according to the label of their incoming edge(s). For each of these groupings, it sorts the edges by sources and destinations in every label group, requiring $O\left(\sum_{g \in \Sigma} e(g) \log_2 e(g)\right)$ time, where $\Sigma$ is the set of edge labels and $e(g)$ is the number of edges labeled $g$. We observed that many non-Wheeler graphs can be recognized as such directly by the renaming heuristic, without requiring a downstream solver.
At each iteration, the algorithm gathers, for each node, a sorted list of the unique temporary orders of the nodes with edges going into it, which we term the ''in-node list.'' By the Wheeler graph property that requires all edge pairs to satisfy $(a = a') \wedge (u < u') \Rightarrow v \leq v'$, we can find rough orders by sorting the nodes by their in-node lists. Once this has been done for each node group, we reach the end of the current iteration and check whether the rough order has changed since the previous iteration. If not, then we say the algorithm has converged and forward any remaining ambiguities to the downstream solver as necessary.
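The iteration described above can be sketched as a simple refinement loop. This is a simplified illustration rather than Algorithm 1: the function names are hypothetical, labels are assumed to be non-empty strings, ties are represented as equal ranks, and the sort key includes the previous rank (an assumption made here so that each pass can only split groups and the loop is guaranteed to terminate).

```python
from collections import defaultdict


def renaming_heuristic(nodes, edges):
    """Return {node: rough_rank} (equal ranks are unresolved ties), or
    None if some node has incoming edges with two distinct labels."""
    in_labels = defaultdict(set)
    preds = defaultdict(set)
    for u, v, a in edges:
        in_labels[v].add(a)
        preds[v].add(u)
    if any(len(labels) > 1 for labels in in_labels.values()):
        return None  # conflict: the graph cannot be a Wheeler graph

    # Initial groups: "" for 0-indegree nodes sorts before any label.
    group = {v: (min(in_labels[v]) if in_labels[v] else "") for v in nodes}
    rank = _dense_rank(nodes, lambda v: (group[v],))
    while True:
        def key(v):
            # In-node list: sorted unique temporary ranks of predecessors.
            in_list = tuple(sorted({rank[u] for u in preds[v]}))
            return (group[v], rank[v], in_list)

        new_rank = _dense_rank(nodes, key)
        if new_rank == rank:
            return rank  # converged
        rank = new_rank


def _dense_rank(nodes, key):
    """Assign 0, 1, 2, ... to nodes in key order; equal keys share a rank."""
    ranks, previous, current = {}, None, -1
    for v in sorted(nodes, key=key):
        k = key(v)
        if k != previous:
            current += 1
            previous = k
        ranks[v] = current
    return ranks
```

On the path-like graph with edges `(0, 1, "a")`, `(1, 2, "a")`, and `(0, 3, "b")`, the loop separates nodes 1 and 2 after one pass because their in-node lists differ, leaving no ties for a downstream solver.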
We note that there are similarities between the renaming heuristic and the ''forward algorithm'' of Alanko et al. 17 (Algorithm 2 in that paper). While our renaming heuristic performs an explicit sort within each of its rough groupings, the forward algorithm of Alanko et al. uses a pair of nested loops over alphabet characters to visit and partition the nodes in a way that maintains their sorted order implicitly. We discuss this relationship further in the Discussion.

Wheelie contains two solvers: Wheelie-PR and Wheelie-SMT. Wheelie-PR takes the output from the renaming heuristic and, for any remaining ties in the ordering, simply tries all possible permutations among the tied nodes. The Wheelie-SMT solver is explained in the next subsection, Satisfiability Modulo Theories (SMT) solver.
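The permutation step can be illustrated as follows. This is a sketch under assumptions, not the Wheelie-PR code: ties are given as equal values in a hypothetical `rough_rank` mapping, every combination of within-group permutations is tried in turn, and each candidate is verified with a direct check of the Wheeler properties.

```python
from itertools import groupby, permutations, product


def satisfies_wheeler(order, edges):
    """Direct check of the Wheeler properties for one total order."""
    pos = {v: i for i, v in enumerate(order)}
    has_in = {v for _, v, _ in edges}
    zero_first = all(
        pos[u] < pos[v] for u in order if u not in has_in for v in has_in
    )
    pairs_ok = all(
        (not a < b or pos[v] < pos[y])
        and (a != b or not pos[u] < pos[x] or pos[v] <= pos[y])
        for (u, v, a), (x, y, b) in permutations(edges, 2)
    )
    return zero_first and pairs_ok


def resolve_ties(rough_rank, edges):
    """Try every permutation inside each group of tied nodes and return
    the first total order that is a Wheeler order, or None if none is."""
    nodes = sorted(rough_rank, key=rough_rank.get)
    groups = [list(g) for _, g in groupby(nodes, key=rough_rank.get)]
    for choice in product(*(permutations(g) for g in groups)):
        order = [v for block in choice for v in block]
        if satisfies_wheeler(order, edges):
            return order
    return None
```

The number of candidates is the product of the factorials of the tie-group sizes, which is why a rough order with small tie groups makes this step cheap and large tie groups make it time out.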

Satisfiability Modulo Theories (SMT) solver
Motivated by the use of Boolean satisfiability formulations to solve special cases of the recognition problem, 15 we hypothesized that Satisfiability Modulo Theory (SMT) solvers 10 could be used to solve all or part of the Wheeler-graph recognition problem. SMT has found many uses in artificial intelligence and formal methods for hardware and software development.

be traversed from the source. Edges are labeled according to the label of the parent. The De Bruijn graph generator constructs a dictionary of distinct $(k - 1)$-mers from the sequences. It connects each pair of adjacent nodes with an edge and labels the edge with the first character in the $(k - 1)$-mer of the child node. Reverse deterministic graphs are usually invalid Wheeler graphs but might be valid when the graphs are small, and once violations occur, adding more nodes and edges cannot turn them back into Wheeler graphs.
The reverse deterministic graph generator iterates through the columns of a multiple sequence alignment from right to left. At a column i, it creates distinct nodes for the characters found there, connecting each current node with the node of the previous ungapped character; the edge points toward the end of the alignment and carries the label of the previous ungapped character. This follows the procedure described in the GCSA study. 6 Lastly, the three generators initialize the names of nodes with their breadth-first search order and output the constructed graph in DOT format.
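As an illustration of the De Bruijn generator described above, the sketch below collapses identical $(k - 1)$-mers into single nodes, labels each edge with the first character of the child $(k - 1)$-mer as stated in the text, and writes a generic DOT rendering. The function names are hypothetical, nodes are numbered in order of first appearance rather than breadth-first search order, and the DOT attributes WGT actually emits are not assumed.

```python
def de_bruijn_graph(sequences, k):
    """Return ({(k-1)-mer: node_id}, sorted list of (u, v, label) edges).
    Assumes k >= 2. Identical (k-1)-mers share one node."""
    node_id = {}
    edges = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            src, dst = seq[i:i + k - 1], seq[i + 1:i + k]
            for mer in (src, dst):
                node_id.setdefault(mer, len(node_id))
            # Label: first character of the child (k-1)-mer.
            edges.add((node_id[src], node_id[dst], dst[0]))
    return node_id, sorted(edges)


def to_dot(edges):
    """Render edges as a minimal DOT digraph with edge labels."""
    body = "".join(f'  {u} -> {v} [label="{a}"];\n' for u, v, a in edges)
    return "digraph G {\n" + body + "}\n"
```

A pseudo-De Bruijn graph in the sense used in the benchmarks would differ only in skipping the collapse, that is, in creating a fresh node for every $(k - 1)$-mer occurrence.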
We also implemented two random generators, a complete Wheeler graph generator and a d-NFA Wheeler graph generator. We first fix the ordering of nodes and then try to select edges such that both user-specified constraints and Wheeler graph properties are satisfied. Let $N_i$ be the nodes with incoming edges labeled $i$ and $E_i$ be the edges labeled $i$, where $i = 1, 2, \ldots, s$, and let $n_i = |N_i|$ and $e_i = |E_i|$. In both generated graphs, we assume that $n_i \approx \frac{n - r}{s}$ and $e_i \approx \frac{e}{s}$, where $r$ is the number of nodes without incoming edges.
We say a Wheeler graph G is complete if no more edges can be added to G while maintaining the Wheeler graph properties.
Property 1. Given the number of nodes $n$ and the number of labels $s$, the number of edges of a Wheeler graph is upper bounded by $e_{\max}$, where

$$e_{\max} = n \times s + n - s - r = (n - 1)(s + 1) - r + 1 \quad \text{(Equation 5)}$$

Proof. Consider the bipartite representation of a Wheeler graph G with number of nodes $n$ and number of labels $s$. Note that

$$\sum_{i=1}^{s} n_i = n - r. \quad \text{(Equation 6)}$$
Observe that for each label $i$, the number of edges labeled $i$ is at most $n + n_i - 1$. Taking the sum of edges of each label and applying Equation 6, we have

$$e_{\max} = \sum_{i=1}^{s} (n + n_i - 1) = n \times s - s + \sum_{i=1}^{s} n_i = n \times s - s + n - r. \quad \text{(Equation 7)}$$

One way of generating complete Wheeler graphs is to have all nodes connect to the first node of $N_i$ and the last node additionally connect to the rest of the nodes in $N_i$ for each label $i$ (the last node has $n_i$ outgoing edges in total). By randomly selecting $n_i - 1$ nodes from $N$ and connecting the selected nodes to consecutive nodes in $N_i$, a new complete Wheeler graph can be generated by appropriately shifting the destination node of each edge such that the Wheeler graph property is maintained. Figure S2 shows an example of a complete Wheeler graph with $(n, e, s, r) = (7, 18, 2, 1)$. With a complete Wheeler graph of $n$ nodes and $s$ labels, we are able to generate random Wheeler graphs with $e < e_{\max}$ edges by sampling $e$ distinct edges from the complete Wheeler graph.
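The bound in Property 1 is easy to evaluate numerically. The sketch below computes it both in closed form (Equation 5) and as the per-label sum used in the proof (Equation 7); the function names are illustrative.

```python
def max_edges(n, s, r):
    """Closed form of Equation 5: e_max = n*s + n - s - r."""
    return n * s + n - s - r


def max_edges_by_label(group_sizes, n):
    """Per-label form of Equation 7: the sum of (n + n_i - 1) over labels,
    where group_sizes[i] is n_i, the number of nodes with incoming label i."""
    return sum(n + n_i - 1 for n_i in group_sizes)
```

For the Figure S2 parameters, `max_edges(7, 2, 1)` gives 18, and splitting the $n - r = 6$ nodes with incoming edges into two groups of three gives the same value from `max_edges_by_label([3, 3], 7)`.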
For generation of d-NFA Wheeler graphs, let $x_k$ be the number of nodes with $k$ outgoing edges of the same label $i$. Thus, given $e$ and $s$ we have

(Equations 8 and 9)

Note that any solution $x_k$ that satisfies Equations 8 and 9 for every edge label $i$ represents a set of valid Wheeler graphs. To see this, first notice that given $x_k$, we can always order the nodes such that the $x_1$ nodes with one outgoing edge are placed at the front, followed by the $x_2$ nodes with two outgoing edges, followed by the $x_3$ nodes with three outgoing edges, and so on. By construction, this gives a valid d-NFA Wheeler graph. Moreover, by swapping the nodes and reconnecting the corresponding edges accordingly, different d-NFA Wheeler graphs can be obtained. An example is shown in Figure S3.
To obtain a concrete instance, we find a valid solution for $x_k$ and then determine the node ordering. In our case, we set all $x_k$ to be the same and, if that is not possible, assign the residual to $x_1$ to satisfy Equation 8. We believe that this reflects the hardness of different benchmarks with different values of d. For node ordering, all the nodes are shuffled and edges are distributed such that each node in $N_i$ has at least one incoming edge while maintaining the Wheeler properties.
OPEN ACCESS