Optimal and Perfectly Parallel Algorithms for On-demand Data-Flow Analysis

Interprocedural data-flow analyses form an expressive and useful paradigm for numerous static analysis applications, such as live variables analysis, alias analysis and null pointers analysis. The most widely-used framework for interprocedural data-flow analysis is IFDS, which encompasses distributive data-flow functions over a finite domain. On-demand data-flow analyses restrict the focus of the analysis to specific program locations and data facts. This setting provides a natural split between (i) an offline (or preprocessing) phase, where the program is partially analyzed and analysis summaries are created, and (ii) an online (or query) phase, where analysis queries arrive on demand and the summaries are used to speed up answering queries. In this work, we consider on-demand IFDS analyses where the queries concern program locations of the same procedure (aka same-context queries). We exploit the fact that flow graphs of programs have low treewidth to develop faster algorithms that are space and time optimal for many common data-flow analyses, in both the preprocessing and the query phase. We also use treewidth to develop query solutions that are embarrassingly parallelizable, i.e. the total work for answering each query is split among a number of threads such that each thread performs only a constant amount of work. Finally, we implement a static analyzer based on our algorithms, and perform a series of on-demand analysis experiments on standard benchmarks. Our experimental results show a drastic speed-up of the queries after only a lightweight preprocessing phase, which significantly outperforms existing techniques.


Introduction
Static data-flow analysis. Static program analysis is a fundamental approach for both analyzing program correctness and performing compiler optimizations [25,39,44,64,30]. Static data-flow analyses associate with each program location a set of data-flow facts which are guaranteed to hold under all program executions, and these facts are then used to reason about program correctness, report erroneous behavior, and optimize program execution. Static data-flow analyses have numerous applications, such as in pointer analysis (e.g., points-to analysis and detection of null pointer dereferencing) [46,57,61,62,66,67,69], in detecting privacy and security issues (e.g., taint analysis, SQL injection analysis) [3,37,31,33,47,40], as well as in compiler optimizations (e.g., constant propagation, reaching definitions, register allocation) [50,32,55,13,2].
Interprocedural analysis and the IFDS framework. Data-flow analyses fall in two large classes: intraprocedural and interprocedural. In the former, each procedure of the program is analyzed in isolation, ignoring the interaction between procedures which occurs due to parameter passing/return. In the latter, all procedures of the program are analyzed together, accounting for such interactions, which leads to results of increased precision, and hence is often preferable to intraprocedural analysis [49,54,59,60]. To filter out false results, interprocedural analyses typically employ call-context sensitivity, which ensures that the underlying execution paths respect the calling context of procedure invocations. One of the most widely used frameworks for interprocedural data-flow analysis is the framework of Interprocedural Finite Distributive Subset (IFDS) problems [50], which offers a unified formulation of a wide class of interprocedural data-flow analyses as a reachability problem. This elegant algorithmic formulation of data-flow analysis has been a topic of active study, allowing various subsequent practical improvements [36,45,8,3,47,56] and implementations in prominent static analysis tools such as Soot [7] and WALA [1].
On-demand analysis. Exhaustive data-flow analysis is computationally expensive and often unnecessary. Hence, a topic of great interest in the community is that of on-demand data-flow analysis [4,27,36,51,48,68,45]. On-demand analyses have several applications, such as (quoting from [36,48]) (i) narrowing down the focus to specific points of interest, (ii) narrowing down the focus to specific data-flow facts of interest, (iii) reducing work in preliminary phases, (iv) sidestepping incremental updating problems, and (v) offering demand analysis as a user-level operation. On-demand analysis is also extremely useful for speculative optimizations in just-in-time compilers [24,43,5,29], where dynamic information can dramatically increase the precision of the analysis. In this setting, it is crucial that the on-demand analysis runs fast, to incur as little overhead as possible.
Example 1. As a toy motivating example, consider the partial program shown in Figure 1, compiled with a just-in-time compiler that uses speculative optimizations. Whether the compiler must compile the expensive function h depends on whether x is null in line 6. Performing a null-pointer analysis from the entry of f reveals that x might be null in line 6. Hence, if the decision to compile h relies only on an offline static analysis, h is always compiled, even when not needed. Now consider the case where the execution of the program is in line 4, and at this point the compiler decides on whether to compile h. It is clear that given this information, x cannot be null in line 6 and thus h does not have to be compiled. As we have seen above, this decision cannot be made based on offline analysis. On the other hand, an on-demand analysis starting from the current program location will correctly conclude that x is not null in line 6. Note however, that this decision is made by the compiler during runtime. Hence, such an on-demand analysis is useful only if it can be performed extremely fast. It is also highly desirable that the time for running this analysis is predictable, so that the compiler can decide whether to run the analysis or simply compile h proactively.

Fig. 1: Partial program of Example 1 (lines 2-8 of f).
  2   int *x = NULL, *y = NULL;
  3   if (b > 1)
  4     y = &b;
  5   g(x, y);
  6   if (x == NULL)
  7     h();
  8 }
The techniques we develop in this paper answer the above challenges rigorously. Our approach exploits a key structural property of flow graphs of programs, called treewidth.
Treewidth of programs. A very well-studied notion in graph theory is the concept of treewidth of a graph, which is a measure of how similar a graph is to a tree (a graph has treewidth 1 precisely if it is a tree) [52]. On the one hand, the treewidth property provides a mathematically elegant way to study graphs, and on the other hand, there are many classes of graphs which arise in practice and have constant treewidth. The most important example is that the flow graphs of goto-free programs in many classic programming languages have constant treewidth [63]. The low treewidth of flow graphs has also been confirmed experimentally for programs written in Java [34], C [38], Ada [12] and Solidity [15].
Treewidth has important algorithmic implications, as many graph problems that are hard to solve in general admit efficient solutions on graphs of low treewidth. In the context of program analysis, this property has been exploited to develop improvements for register allocation [63,9] (a technique implemented in the Small Device C Compiler [28]), cache management [18], on-demand algebraic path analysis [16], on-demand intraprocedural data-flow analysis of concurrent programs [20] and data-dependence analysis [14].
Problem statement. We focus on on-demand data-flow analysis in IFDS [50,36,48]. The input consists of a supergraph $G$ of $n$ vertices, a data-fact domain $D$ and a data-flow transformer function $M$. Edges of $G$ capture control-flow within each procedure, as well as procedure invocations and returns. The set $D$ defines the domain of the analysis, and contains the data facts to be discovered by the analysis for each program location. The function $M$ associates with every edge $(u, v)$ of $G$ a data-flow transformer $M(u, v) : 2^D \to 2^D$. In words, $M(u, v)$ defines the set of data facts that hold at $v$ in some execution that transitions from $u$ to $v$, given the set of data facts that hold at $u$.
On-demand analysis brings a natural separation between (i) an offline (or preprocessing) phase, where the program is partially analyzed, and (ii) an online (or query) phase, where on-demand queries are handled. The task is to preprocess the input in the offline phase, so that in the online phase, the following types of on-demand queries are answered efficiently:
1. A pair query has the form $(u, d_1, v, d_2)$, where $u, v$ are vertices of $G$ in the same procedure, and $d_1, d_2$ are data facts. The goal is to decide if there exists an execution that starts in $u$ and ends in $v$, and given that the data fact $d_1$ held at the beginning of the execution, the data fact $d_2$ holds at the end. These are known as same-context queries and are very common in data-flow analysis [23,50,16].
2. A single-source query has the form $(u, d_1)$, where $u$ is a vertex of $G$ and $d_1$ is a data fact. The goal is to compute for every vertex $v$ that belongs to the same procedure as $u$, all the data facts that might hold in $v$ as witnessed by executions that start in $u$ and assuming that $d_1$ holds at the beginning of each such execution.
Previous results. The on-demand analysis problem admits a number of solutions that lie in the preprocessing/query spectrum. On the one end, the preprocessing phase can be disregarded, and every on-demand query be treated anew. Since each query starts a separate instance of IFDS, the time to answer it is $O(n \cdot |D|^3)$, for both pair and single-source queries [50]. On the other end, all possible queries can be pre-computed and cached in the preprocessing phase in time $O(n^2 \cdot |D|^3)$, after which each query costs time proportional to the size of the output (i.e., $O(1)$ for pair queries and $O(n \cdot |D|)$ for single-source queries). Note that this full preprocessing also incurs a cost of $O(n^2 \cdot |D|^2)$ in space for storing the cache table, which is often prohibitive. On-demand analysis was more thoroughly studied in [36]. The main idea is that, instead of pre-computing the answer to all possible queries, the analysis results obtained by handling each query are memoized to a cache table, and are used for speeding up the computation of subsequent queries. This is a heuristic-based approach that often works well in practice. However, the only guarantee provided is that of same-worst-case-complexity, which states that in the worst case, the algorithm uses $O(n^2 \cdot |D|^3)$ time and $O(n^2 \cdot |D|^2)$ space, similarly to the complete preprocessing case. This guarantee is inadequate for runtime applications such as the example of Figure 1, as it would require either (i) to run a full analysis, or (ii) to run a partial analysis which might wrongly conclude that h is reachable, and thus compile it. Both cases incur a large runtime overhead, either because we run a full analysis, or because we compile an expensive function.
Our contributions. We develop algorithms for on-demand IFDS analyses that have strong worst-case time complexity guarantees and thus lead to more predictable performance than mere heuristics. The contributions of this work are as follows:
1. We develop an algorithm that, given a program represented as a supergraph of size $n$ and a data fact domain $D$, solves the on-demand same-context IFDS problem while spending (i) $O(n \cdot |D|^3)$ time in the preprocessing phase, and (ii) $O(\lceil |D| / \log n \rceil)$ time for a pair query and $O(n \cdot |D|^2 / \log n)$ time for a single-source query in the query phase. Observe that when $|D| = O(1)$, the preprocessing and query times are proportional to the size of the input and outputs, respectively, and are thus optimal. In addition, our algorithm uses $O(n \cdot |D|^2)$ space at all times, which is proportional to the size of the input, and is thus space optimal. Hence, our algorithm not only improves upon previous state-of-the-art solutions, but also ensures optimality in both time and space.
2. We also show that after our one-time preprocessing, each query is embarrassingly parallelizable, i.e., every bit of the output can be produced by a single thread in $O(1)$ time. This makes our techniques particularly useful to speculative optimizations, since the analysis is guaranteed to take constant time and thus incur little runtime overhead. Although the parallelization of data-flow analysis has been considered before [41,42,53], this is the first time to obtain solutions that span beyond heuristics and offer theoretical guarantees. Moreover, this is a rather surprising result, given that general IFDS is known to be P-complete.
3. We implement our algorithms on a static analyzer and experimentally evaluate their performance on various static analysis clients over a standard set of benchmarks.
Our experimental results show that after only a lightweight preprocessing, we obtain a significant speedup in the query phase compared to standard on-demand techniques in the literature. Also, our parallel implementation achieves a speedup close to the theoretical optimal, which illustrates that the perfect parallelization of the problem is realized by our approach in practice.

Recently, we exploited the low-treewidth property of programs to obtain faster algorithms for algebraic path analysis [16] and intraprocedural reachability [21]. Data-flow analysis can be reduced to these problems. Hence, the algorithms in [16,21] can also be applied to our setting. However, our new approach has two important advantages: (i) we show how to answer queries in a perfectly parallel manner, and (ii) reducing the problem to algebraic path properties and then applying the algorithms in [16,21] yields $O(n \cdot |D|^3)$ preprocessing time and $O(n \cdot \log n \cdot |D|^2)$ space, and has pair and single-source query times $O(|D|)$ and $O(n \cdot |D|^2)$, respectively. Hence, our space usage and query times are better by a factor of $\log n$. Moreover, when considering the complexity wrt $n$, i.e. considering $D$ to be a constant, these results are optimal wrt both time and space. Hence, no further improvement is possible.

Remark. Note that our approach does not apply to arbitrary CFL reachability in constant treewidth. In addition to the treewidth, our algorithms also exploit specific structural properties of IFDS. In general, small treewidth alone does not improve the complexity of CFL reachability [14].

Preliminaries
Model of computation. We consider the standard RAM model with word size $W = \Theta(\log n)$, where $n$ is the size of our input. In this model, one can store $W$ bits in one word (aka "word tricks") and arithmetic and bitwise operations between pairs of words can be performed in $O(1)$ time. In practice, word size is a property of the machine and not the analysis. Modern machines have words of size at least 64. Since the size of real-world input instances never exceeds $2^{64}$, the assumption of word size $W = \Theta(\log n)$ is well-realized in practice and no additional effort is required by the implementer to account for $W$ in the context of data flow analysis.

Graphs. We consider directed graphs $G = (V, E)$ where $V$ is a finite set of vertices and $E \subseteq V \times V$ is a set of directed edges. We use the term graph to refer to directed graphs and will explicitly mention if a graph is undirected. For two vertices $u, v \in V$, a path $P$ from $u$ to $v$ is a finite sequence of vertices $P = (w_i)_{i=0}^{k}$ such that $w_0 = u$, $w_k = v$ and for every $i < k$, there is an edge from $w_i$ to $w_{i+1}$ in $E$. The length $|P|$ of the path $P$ is equal to $k$. In particular, for every vertex $u$, there is a path of length 0 from $u$ to itself. We write $P : u \rightsquigarrow v$ to denote that $P$ is a path from $u$ to $v$ and $u \rightsquigarrow v$ to denote the existence of such a path, i.e. that $v$ is reachable from $u$. Given a set $V' \subseteq V$ of vertices, the induced subgraph of $G$ on $V'$ is defined as $G[V'] = (V', E \cap (V' \times V'))$. Finally, the graph $G$ is called bipartite if the set $V$ can be partitioned into two sets $V_1, V_2$, so that every edge has one end in $V_1$ and the other in $V_2$.
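To make the "word tricks" of the model of computation concrete, the following sketch packs a set of data facts into the bits of an integer, so that the union of two fact sets costs one bitwise OR per machine word. It is only an illustration: Python integers are arbitrary-precision and merely stand in for arrays of $W$-bit words, and the helper names `pack` and `unpack` are hypothetical, not taken from the paper.

```python
def pack(facts, index):
    """Encode a set of facts as an integer bitset, given a fact -> bit position map."""
    bits = 0
    for fact in facts:
        bits |= 1 << index[fact]
    return bits


def unpack(bits, names):
    """Decode an integer bitset back into the set of facts it represents."""
    return {names[i] for i in range(len(names)) if (bits >> i) & 1}


names = ["x", "y", "z"]                      # assumed toy domain of data facts
index = {name: i for i, name in enumerate(names)}

a = pack({"x"}, index)
b = pack({"y", "z"}, index)
union = a | b        # set union: one OR per word in the RAM model
common = a & b       # set intersection: one AND per word
```

With 64-bit words, a domain of up to 64 facts fits in a single word, which is why set operations on fact sets can be treated as constant-time steps.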

The IFDS Framework
IFDS [50] is a ubiquitous and general framework for interprocedural data-flow analyses that have finite domains and distributive flow functions. It encompasses a wide variety of analyses, including truly-live variables, copy constant propagation, possibly-uninitialized variables, secure information-flow, and gen/kill or bitvector problems such as reaching definitions, available expressions and live variables [50,7]. IFDS obtains interprocedurally precise solutions. In contrast to intraprocedural analysis, in which precise denotes "meet-over-all-paths", interprocedurally precise solutions only consider valid paths, i.e. paths in which when a function reaches its end, control returns back to the site of the most recent call [58].

Flow graphs and supergraphs. In IFDS, a program with $k$ procedures is specified by a supergraph, i.e. a graph $G = (V, E)$ consisting of $k$ flow graphs $G_1, \ldots, G_k$, one for each procedure, and extra edges modeling procedure-calls. Flow graphs represent procedures in the usual way, i.e. they contain one vertex $v_i$ for each statement $i$ and there is an edge from $v_i$ to $v_j$ if the statement $j$ may immediately follow the statement $i$ in an execution of the procedure. The only exception is that a procedure-call statement $i$ is represented by two vertices, a call vertex $c_i$ and a return-site vertex $r_i$. The vertex $c_i$ only has incoming edges, and the vertex $r_i$ only has outgoing edges. There is also a call-to-return-site edge from $c_i$ to $r_i$. The call-to-return-site edges are included for passing intraprocedural information, such as information about local variables, from $c_i$ to $r_i$. Moreover, each flow graph $G_l$ has a unique start vertex $s_l$ and a unique exit vertex $e_l$.
The supergraph $G$ also contains the following edges for each procedure-call $i$ with call vertex $c_i$ and return-site vertex $r_i$ that calls a procedure $l$: (i) an interprocedural call-to-start edge from $c_i$ to the start vertex of the called procedure, i.e. $s_l$, and (ii) an interprocedural exit-to-return-site edge from the exit vertex of the called procedure, i.e. $e_l$, to $r_i$.
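The construction above can be sketched in a few lines. The snippet assumes a hypothetical input encoding (edge lists, a list of call sites, and maps from procedure names to start and exit vertices); none of these names come from the paper.

```python
def build_supergraph(flow_edges, calls, start, exit_):
    """Return the edge set of the supergraph.

    flow_edges: intraprocedural edges, including call-to-return-site edges.
    calls:      list of (call_vertex, return_site_vertex, callee_name).
    start:      procedure name -> start vertex.
    exit_:      procedure name -> exit vertex.
    """
    edges = set(flow_edges)
    for call_vertex, return_site, callee in calls:
        edges.add((call_vertex, start[callee]))    # call-to-start edge
        edges.add((exit_[callee], return_site))    # exit-to-return-site edge
    return edges


# Toy program: main is v1 -> c2 -> r2 -> v3 and calls f, whose flow graph is s_f -> e_f.
flow_edges = [("v1", "c2"), ("c2", "r2"), ("r2", "v3"), ("s_f", "e_f")]
calls = [("c2", "r2", "f")]
start = {"main": "v1", "f": "s_f"}
exit_ = {"main": "v3", "f": "e_f"}
supergraph = build_supergraph(flow_edges, calls, start, exit_)
```

Each call site contributes exactly two interprocedural edges, so the supergraph is only a constant factor larger than the union of the flow graphs.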
Example 2. Figure 2 shows a simple C++ program on the left and its supergraph on the right. Each statement $i$ of the program has a corresponding vertex $v_i$ in the supergraph, except for statement 7, which is a procedure-call statement and hence has a corresponding call vertex $c_7$ and return-site vertex $r_7$.

Interprocedurally valid paths. Not every path in the supergraph $G$ can potentially be realized by an execution of the program. Consider a path $P$ in $G$ and let $\bar{P}$ be the sequence of vertices obtained by removing every $v_i$ from $P$, i.e. $\bar{P}$ only consists of $c_i$'s and $r_i$'s. Then, $P$ is called a same-context valid path if $\bar{P}$ can be generated from $S$ in this grammar:

$$S \;\to\; c_i \; S \; r_i \; S \;\mid\; \varepsilon \qquad \text{for a procedure-call statement } i.$$

Moreover, $P$ is called an interprocedurally valid path or simply valid if $\bar{P}$ can be generated from the nonterminal $S'$ in the following grammar:

$$S' \;\to\; S' \; c_i \; S \;\mid\; S \qquad \text{for a procedure-call statement } i.$$

For any two vertices $u, v$ of the supergraph $G$, we denote the set of all interprocedurally valid paths from $u$ to $v$ by $IVP(u, v)$ and the set of all same-context valid paths from $u$ to $v$ by $SCVP(u, v)$. Informally, a valid path starts from a statement in a procedure $p$ of the program and goes through a number of procedure-calls while respecting the rule that whenever a procedure ends, control should return to the return-site in its parent procedure. A same-context valid path is a valid path in which every procedure-call ends and hence control returns back to the initial procedure $p$ in the same context.

IFDS [50]. An IFDS problem instance is a tuple $I = (G, D, F, M, \sqcap)$ where:
- $G = (V, E)$ is a supergraph as above.
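- $D$ is a finite set, called the domain, and each $d \in D$ is called a data flow fact.
- The meet operator $\sqcap$ is either intersection or union.
- $F \subseteq 2^D \to 2^D$ is a set of distributive flow functions over $\sqcap$, i.e. for each function $f \in F$ and every two sets of facts $D_1, D_2 \subseteq D$, we have $f(D_1 \sqcap D_2) = f(D_1) \sqcap f(D_2)$.
- $M : E \to F$ is a map that assigns a distributive flow function to each edge of the supergraph.

Let $P = (w_i)_{i=0}^{k}$ be a path in $G$ and $m_i = M(w_{i-1}, w_i)$ for $1 \le i \le k$. The path function of $P$ is $pf_P := m_k \circ \cdots \circ m_2 \circ m_1$. The solution of $I$ is the collection of values $\{MVP_v\}_{v \in V}$ defined as:

$$MVP_v := \bigsqcap_{P \in IVP(s_{main}, v)} pf_P(D).$$

Intuitively, the solution is defined by taking meet-over-all-valid-paths. If the meet operator is union, then $MVP_v$ is the set of data flow facts that may hold at $v$, when $v$ is reached in some execution of the program. Conversely, if the meet operator is intersection, then $MVP_v$ consists of data flow facts that must hold at $v$ in every execution of the program that reaches $v$. Similarly, we define the same-context solution of $I$ as the collection of values $\{MSCP_v\}_{v \in V_{main}}$ defined as follows:

$$MSCP_v := \bigsqcap_{P \in SCVP(s_{main}, v)} pf_P(D). \qquad (1)$$

The intuition behind MSCP is similar to that of MVP, except that in $MSCP_v$ we consider meet-over-same-context-paths (corresponding to runs that return to the same stack state).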
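The two path notions behind MVP and MSCP can be illustrated with a call stack. The sketch below follows the informal rule stated earlier: a return-site must match the most recent unfinished call, and a same-context valid path additionally leaves no call unfinished. The encoding is an assumption made for illustration: call and return-site vertices are the tuples `("c", i)` and `("r", i)`, every other vertex is ignored, and the name `classify` is hypothetical.

```python
def classify(path):
    """Return (is_valid, is_same_context_valid) for a path given as a list of vertices."""
    pending_calls = []                      # indices of calls that have not returned yet
    for vertex in path:
        if isinstance(vertex, tuple) and vertex[0] == "c":
            pending_calls.append(vertex[1])
        elif isinstance(vertex, tuple) and vertex[0] == "r":
            # A return-site must close the most recent pending call.
            if not pending_calls or pending_calls.pop() != vertex[1]:
                return (False, False)
    # Valid paths may end inside a callee; same-context valid paths may not.
    return (True, not pending_calls)


full_call = classify(["v1", ("c", 7), "s", "e", ("r", 7), "v8"])
inside_callee = classify([("c", 7), "s"])
wrong_return = classify([("c", 7), "s", "e", ("r", 9)])
```

Here `full_call` is both valid and same-context valid, `inside_callee` is valid only, and `wrong_return` is neither.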

Remark 1.
We note two points about the IFDS framework:
- As in [50], we only consider IFDS instances in which the meet operator is union. Instances with intersection can be reduced to union instances by dualization [50].
- For brevity, we are considering a global domain $D$, while in many applications the domain is procedure-specific. This does not affect the generality of our approach and our algorithms remain correct for the general case where each procedure has its own dedicated domain. Indeed, our implementation supports the general case.
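Since only union instances are considered, distributivity of a flow function $f$ means $f(A \cup B) = f(A) \cup f(B)$ for all fact sets $A, B$. The following sketch checks this exhaustively on a small, assumed domain for a gen/kill function (which is distributive) and for a function that requires two facts simultaneously (which is not). All names are illustrative and not part of the paper.

```python
from itertools import combinations


def powerset(domain):
    """All subsets of the domain, as frozensets."""
    items = sorted(domain)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def gen_kill(gen, kill):
    """The classic bitvector transformer X -> (X - kill) | gen."""
    return lambda facts: (frozenset(facts) - frozenset(kill)) | frozenset(gen)


def is_distributive(f, domain):
    """Brute-force check of f(A | B) == f(A) | f(B) over all pairs of subsets."""
    subsets = powerset(domain)
    return all(f(a | b) == f(a) | f(b) for a in subsets for b in subsets)


def path_function(transformers):
    """Compose the transformers of consecutive edges; the first edge is applied first."""
    def pf(facts):
        for m in transformers:
            facts = m(facts)
        return facts
    return pf


domain = {"x", "y", "z"}
kill_y_gen_x = gen_kill({"x"}, {"y"})
# Produces z only when x and y hold together, which breaks distributivity.
needs_both = lambda facts: frozenset({"z"}) if {"x", "y"} <= facts else frozenset()
```

The brute-force check is exponential in $|D|$ and is meant only as a sanity test for small hand-written transformers.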
Bounded Bandwidth Assumption. Following [50], we assume that the bandwidth in function calls and returns is bounded by a constant. In other words, there is a small constant $b$, such that for every edge $e$ that is a call-to-start or exit-to-return-site edge, every vertex in the graph representation $H_{M(e)}$ has degree $b$ or less. This is a classical assumption in IFDS [50,7] and models the fact that every parameter in a called function is only dependent on a few variables in the caller (and conversely, every returned value is only dependent on a few variables in the called function).
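Composition of distributive functions. Let $f$ and $g$ be distributive functions and $R_f$ and $R_g$ their succinct representations. It is easy to verify that $g \circ f$ is also distributive, hence it has a succinct representation $R_{g \circ f}$. Moreover, we have $R_{g \circ f} = R_f ; R_g = \{(a, b) \mid \exists c \; (a, c) \in R_f \wedge (c, b) \in R_g\}$. Figure 4 shows contracting of corresponding vertices in $H_f$ and $H_g$ (left) and using reachability to obtain $H_{g \circ f}$ (right).

The vertices of the exploded supergraph $\bar{G}$ are pairs $(v, d)$ consisting of a supergraph vertex $v \in V$ and an element $d \in D^* = D \cup \{0\}$. A path $\bar{P}$ in $\bar{G}$ is (same-context) valid, if the path $P$ in $G$, obtained by ignoring the second component of every vertex in $\bar{P}$, is (same-context) valid. As shown in [50], for a data flow fact $d \in D$ and a vertex $v \in V$, we have $d \in MVP_v$ iff there is a valid path in $\bar{G}$ from $(s_{main}, d')$ to $(v, d)$ for some $d' \in D \cup \{0\}$. Hence, the IFDS problem is reduced to reachability by valid paths in $\bar{G}$. Similarly, the same-context IFDS problem is reduced to reachability by same-context valid paths in $\bar{G}$.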
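The composition of succinct representations can be tried out on a toy domain. The sketch below assumes the representation relation of [50] for union-distributive functions, $R_f = \{(0,0)\} \cup \{(0,b) \mid b \in f(\emptyset)\} \cup \{(a,b) \mid b \in f(\{a\}) - f(\emptyset)\}$, and uses the integer `0` for the special fact 0. The helper names are hypothetical.

```python
ZERO = 0  # stands for the special fact 0; assumed not to occur in the domain


def represent(f, domain):
    """Succinct relation over D and ZERO of a union-distributive function f."""
    base = f(frozenset())
    relation = {(ZERO, ZERO)} | {(ZERO, b) for b in base}
    relation |= {(a, b) for a in domain for b in f(frozenset({a})) - base}
    return relation


def apply_relation(relation, facts):
    """Interpret a relation as a function: follow edges out of the facts and ZERO."""
    sources = set(facts) | {ZERO}
    return frozenset(b for (a, b) in relation if a in sources and b != ZERO)


def compose(rel_f, rel_g):
    """Relational composition R_f ; R_g, which represents g after f."""
    return {(a, c) for (a, b1) in rel_f for (b2, c) in rel_g if b1 == b2}


domain = {"x", "y", "z"}
f = lambda facts: (frozenset(facts) - {"y"}) | {"x"}                     # kill y, gen x
g = lambda facts: frozenset(facts) | ({"z"} if "x" in facts else set())  # x also yields z
rel_gf = compose(represent(f, domain), represent(g, domain))
```

Applying `rel_gf` to a fact set gives the same result as evaluating `g(f(...))` directly, without ever calling `f` or `g` again; this is what makes the reduction to graph reachability possible.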
Example 5. Consider a null pointer analysis on the program in Figure 2. At each program point, we want to know which pointers can potentially be null. We first model this problem as an IFDS instance. Let $D = \{\bar{x}, \bar{y}\}$, where $\bar{x}$ is the data flow fact that x might be null and $\bar{y}$ is defined similarly. Figure 5 shows the same program and its exploded supergraph. At point 8, the values of both pointers x and y are used. Hence, if either of x or y is null at 8, a null pointer error will be raised. However, as evidenced by the two valid paths shown in red, both x and y might be null at 8. The pointer y might be null because it is passed to the function f by value (instead of by reference) and keeps its local value in the transition from $c_7$ to $r_7$, hence the edge $((c_7, \bar{y}), (r_7, \bar{y}))$ is in $\bar{G}$. On the other hand, the function f only initializes y, which is its own local variable, and does not change x (which is shared with main).

Trees and Tree Decompositions
Trees. A rooted tree $T = (V_T, E_T)$ is an undirected graph with a distinguished "root" vertex $r \in V_T$, in which there is a unique path $P^u_v$ between every pair $\{u, v\}$ of vertices. We refer to the number of vertices in $V_T$ as the size of $T$. For an arbitrary vertex $v \in V_T$, the depth of $v$, denoted by $d_v$, is defined as the length of the unique path $P^r_v : r \rightsquigarrow v$. The depth or height of $T$ is the maximum depth among its vertices. A vertex $u$ is called an ancestor of $v$ if $u$ appears in $P^r_v$. In this case, $v$ is called a descendant of $u$. In particular, $r$ is an ancestor of every vertex and each vertex is both an ancestor and a descendant of itself. We denote the set of ancestors of $v$ by $A^{\uparrow}_v$ and its descendants by $D^{\downarrow}_v$. It is straightforward to see that for every $0 \le d \le d_v$, the vertex $v$ has a unique ancestor with depth $d$. We denote this ancestor by $a^d_v$. The ancestor $a^{d_v - 1}_v$ of $v$ at depth $d_v - 1$ is called the parent of $v$, and $v$ is a child of its parent. The subtree corresponding to $v$ is the subgraph of $T$ induced by $D^{\downarrow}_v$, i.e. the part of $T$ that consists of $v$ and its descendants. Finally, a vertex $v \in V_T$ is called a leaf if it has no children. Given two vertices $u, v \in V_T$, the lowest common ancestor $lca(u, v)$ of $u$ and $v$ is defined as $\arg\max_{w \in A^{\uparrow}_u \cap A^{\uparrow}_v} d_w$. In other words, $lca(u, v)$ is the common ancestor of $u$ and $v$ with maximum depth, i.e. which is farthest from the root.

Lemma 1 ([35]). Given a rooted tree $T$ of size $n$, there is an algorithm that preprocesses $T$ in $O(n)$ and can then answer lowest common ancestor queries, i.e. queries that provide two vertices $u$ and $v$ and ask for $lca(u, v)$, in $O(1)$.
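For intuition about lowest common ancestor queries, here is a deliberately naive sketch that walks parent pointers. It takes time proportional to the height of the tree per query and is therefore not the $O(n)$-preprocessing, $O(1)$-query algorithm of Lemma 1; on the balanced decompositions of height $O(\log n)$ used later it would cost $O(\log n)$ per query. The tree encoding (a child-to-parent map) and function names are assumptions.

```python
def depths(parent, root):
    """Compute the depth of every vertex from a child -> parent map."""
    depth = {root: 0}

    def depth_of(v):
        if v not in depth:
            depth[v] = depth_of(parent[v]) + 1
        return depth[v]

    for v in parent:
        depth_of(v)
    return depth


def lca(u, v, parent, depth):
    """Repeatedly move the deeper of the two vertices to its parent until they meet."""
    while u != v:
        if depth[u] < depth[v]:
            u, v = v, u          # keep u as the deeper (or equally deep) vertex
        u = parent[u]
    return u


# Toy tree rooted at a: a has children b and c, and b has children d and e.
parent = {"b": "a", "c": "a", "d": "b", "e": "b"}
depth = depths(parent, "a")
```

For example, `lca("d", "e", parent, depth)` is `"b"`, while `lca("d", "c", parent, depth)` is the root `"a"`.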
Tree decompositions [52]. Given a graph $G = (V, E)$, a tree decomposition of $G$ is a rooted tree $T = (B, E_T)$ such that:
(i) Each vertex $b \in B$ of $T$ has an associated subset $V(b) \subseteq V$ of vertices of $G$ and $\bigcup_{b \in B} V(b) = V$. For clarity, we call each vertex of $T$ a "bag" and reserve the word vertex for $G$. Informally, each vertex must appear in some bag.
(ii) For all $(u, v) \in E$, there exists a bag $b \in B$ such that $u, v \in V(b)$, i.e. every edge should appear in some bag.
(iii) For any pair of bags $b_i, b_j \in B$ and any bag $b_k$ that appears in the path $P : b_i \rightsquigarrow b_j$, we have $V(b_i) \cap V(b_j) \subseteq V(b_k)$, i.e. each vertex should appear in a connected subtree of $T$.

The width of the tree decomposition $T = (B, E_T)$ is defined as the size of its largest bag minus 1. The treewidth $tw(G)$ of a graph $G$ is the minimal width among its tree decompositions. A vertex $v \in V$ appears in a connected subtree, so there is a unique bag $b$ with the smallest possible depth such that $v \in V(b)$. We call $b$ the root bag of $v$ and denote it by $rb(v)$.

Fig. 6: A Graph G (left) and its Tree Decomposition T (right).
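The three conditions of a tree decomposition can be checked mechanically. The sketch below does so for an assumed encoding in which `bags` maps a bag identifier to a set of graph vertices and `tree_edges` lists the edges of $T$ (taken on trust to form a tree); condition (iii) is tested in its equivalent form that the bags containing each vertex are connected in $T$. It validates a given decomposition and does not compute one.

```python
def is_tree_decomposition(vertices, edges, bags, tree_edges):
    """Check conditions (i)-(iii) for the given bags and tree edges."""
    if set().union(*bags.values()) != set(vertices):
        return False                                   # (i) some vertex is in no bag
    if not all(any({u, v} <= bag for bag in bags.values()) for u, v in edges):
        return False                                   # (ii) some edge is in no bag
    neighbours = {b: set() for b in bags}
    for x, y in tree_edges:
        neighbours[x].add(y)
        neighbours[y].add(x)
    for v in vertices:                                 # (iii) bags holding v are connected
        holding = {b for b, bag in bags.items() if v in bag}
        seen, stack = set(), [next(iter(holding))]
        while stack:
            b = stack.pop()
            if b not in seen:
                seen.add(b)
                stack.extend(neighbours[b] & holding)
        if seen != holding:
            return False
    return True


def width(bags):
    """Size of the largest bag minus one."""
    return max(len(bag) for bag in bags.values()) - 1


# A path 1 - 2 - 3 - 4 decomposed into three bags of two vertices each.
vertices = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 4)]
bags = {"A": {1, 2}, "B": {2, 3}, "C": {3, 4}}
good_tree = [("A", "B"), ("B", "C")]
bad_tree = [("A", "C"), ("C", "B")]    # vertex 2 lies in A and B, which are not adjacent
```

The same bags arranged as `bad_tree` violate condition (iii), which shows that the shape of $T$ matters and not just the contents of the bags.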
It is well-known that flow graphs of programs have typically small treewidth [63]. For example, programs written in Pascal, C, and Solidity have treewidth at most 3, 6 and 9, respectively. This property has also been confirmed experimentally for programs written in Java [34], C [38] and Ada [12]. The challenge is thus to exploit treewidth for faster interprocedural on-demand analyses. The first step in this approach is to compute tree decompositions of graphs. As the following lemma states, tree decompositions of low-treewidth graphs can be computed efficiently.

Lemma 2 ([11]). Given a graph G with constant treewidth t, a binary tree decomposition of size O(n) bags, height O(log n) and width O(t) can be computed in linear time.
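Separators [26]. The key structural property that we exploit in low-treewidth flow graphs is a separation property. Let $A, B \subseteq V$. The pair $(A, B)$ is called a separation of $G$ if (i) $A \cup B = V$, and (ii) no edge connects a vertex in $A - B$ to a vertex in $B - A$ or vice versa. If $(A, B)$ is a separation, the set $A \cap B$ is called a separator. The following lemma states such a separation property for low-treewidth graphs.

Lemma 3 (Cut Property [26]). Let $T = (B, E_T)$ be a tree decomposition of $G = (V, E)$ and $e = \{b, b'\} \in E_T$. If we remove $e$, the tree $T$ breaks into two connected components, $T^b$ and $T^{b'}$, respectively containing $b$ and $b'$. Let $A = \bigcup_{t \in T^b} V(t)$ and $B = \bigcup_{t \in T^{b'}} V(t)$. Then $(A, B)$ is a separation of $G$ and its corresponding separator is $A \cap B = V(b) \cap V(b')$.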
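The cut property can be observed on a small instance. The sketch below uses the standard notion of a separation from [26] (the two sides cover all vertices and no edge joins the two exclusive parts), removes one edge of a tree decomposition, and collects the vertices on each side. The encoding of bags and tree edges, and the function names, are assumptions made for illustration.

```python
def split_by_tree_edge(bags, tree_edges, removed):
    """Remove the tree edge `removed` and return the vertex sets (A, B) of both sides."""
    first, second = removed
    remaining = [e for e in tree_edges if set(e) != {first, second}]

    def side(start):
        seen, stack = set(), [start]
        while stack:
            bag = stack.pop()
            if bag in seen:
                continue
            seen.add(bag)
            for p, q in remaining:
                if p == bag:
                    stack.append(q)
                elif q == bag:
                    stack.append(p)
        return set().union(*(bags[b] for b in seen))

    return side(first), side(second)


def is_separation(vertices, edges, a, b):
    """(a, b) covers all vertices and no edge joins a - b with b - a."""
    only_a, only_b = a - b, b - a
    crossing = any((u in only_a and v in only_b) or (u in only_b and v in only_a)
                   for u, v in edges)
    return (a | b) == set(vertices) and not crossing


# The path 1 - 2 - 3 - 4 with bags A, B, C arranged as a path A - B - C.
vertices = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 4)]
bags = {"A": {1, 2}, "B": {2, 3}, "C": {3, 4}}
tree_edges = [("A", "B"), ("B", "C")]
side_a, side_b = split_by_tree_edge(bags, tree_edges, ("A", "B"))
```

In this instance the two sides are {1, 2} and {2, 3, 4}, and their intersection {2} coincides with the intersection of the two bags joined by the removed edge, as the lemma predicts.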
Example 6. Figure 6 shows a graph and one of its tree decompositions with width 2. In this example, we have rb( For the separator property of Lemma 3, consider the edge {b 2 , b 4 }. By removing it, T breaks into two parts, one containing the vertices

Problem definition
We consider same-context IFDS problems in which the flow graphs $G_i$ have a treewidth of at most $t$ for a fixed constant $t$. We extend the classical notion of same-context IFDS solution in two ways: (i) we allow arbitrary start points for the analysis, i.e. we do not limit our analyses to same-context valid paths that start at $s_{main}$; and (ii) instead of a one-shot algorithm, we consider a two-phase process in which the algorithm first preprocesses the input instance and is then provided with a series of queries to answer. We formalize these points below. We fix an IFDS instance $I = (G, D, F, M, \cup)$ with exploded supergraph $\bar{G} = (\bar{V}, \bar{E})$.

Meet over same-context valid paths. We extend the definition of MSCP by specifying a start vertex $u$ and an initial set $\Delta$ of data flow facts that hold at $u$. Formally, for any vertex $v$ that is in the same flow graph as $u$, we define:

$$MSCP_{u, \Delta, v} := \bigsqcap_{P \in SCVP(u, v)} pf_P(\Delta), \qquad (2)$$

where $pf_P$ is the composition of the data-flow transformers along the edges of $P$. The only difference between (2) and (1) is that in (1), the start vertex $u$ is fixed as $s_{main}$ and the initial data-fact set $\Delta$ is fixed as $D$, while in (2), they are free to be any vertex/set.

Reduction to reachability. As explained in Section 2.1, computing MSCP is reduced to reachability via same-context valid paths in the exploded supergraph $\bar{G}$. This reduction does not depend on the start vertex and initial data flow facts. Hence, for a data flow fact $d \in D$, we have $d \in MSCP_{u, \Delta, v}$ iff in the exploded supergraph $\bar{G}$ the vertex $(v, d)$ is reachable via same-context valid paths from a vertex $(u, \delta)$ for some $\delta \in \Delta \cup \{0\}$. Hence, we define the following types of queries:

Pair query. A pair query provides two vertices $(u, d_1)$ and $(v, d_2)$ of the exploded supergraph $\bar{G}$ and asks whether they are reachable by a same-context valid path.
Hence, the answer to a pair query is a single bit. Intuitively, if $d_2 = 0$, then the query is simply asking if $v$ is reachable from $u$ by a same-context valid path in $G$. Otherwise, $d_2$ is a data flow fact and the query is asking whether $d_2 \in MSCP_{u, \{d_1\} \cap D, v}$.

Single-source query. A single-source query provides a vertex $(u, d_1)$ and asks for all vertices $(v, d_2)$ that are reachable from $(u, d_1)$ by a same-context valid path. Assuming that $u$ is in the flow graph $G_i = (V_i, E_i)$, the answer to the single-source query is a sequence of $|V_i| \cdot |D^*|$ bits, one for each $(v, d_2) \in V_i \times D^*$.

Preprocessing
The original solution to the IFDS problem, as first presented in [50], reduces the problem to reachability over a newly constructed graph. We follow a similar approach, except that we exploit the low-treewidth property of our flow graphs at every step. Our preprocessing is described below. It starts with computing constant-width tree decompositions for each of the flow graphs. We then use standard techniques to make sure that our tree decompositions have a nice form, i.e. that they are balanced and binary. Then comes a reduction to reachability, which is similar to [50]. Finally, we precompute specific useful reachability information between vertices in each bag and its ancestors. As it turns out in the next section, this information is sufficient for computing reachability between any pair of vertices, and hence for answering IFDS queries.
Overview. Our preprocessing consists of the following steps:
(1) Finding Tree Decompositions. In this step, we compute a tree decomposition $T_i = (B_i, E_{T_i})$ of constant width $t$ for each flow graph $G_i$. This can either be done by applying the algorithm of [10] directly on $G_i$, or by using an algorithm due to Thorup [63] and parsing the program.
(2) Balancing and Binarizing. In this step, we balance the tree decompositions $T_i$ using the algorithm of Lemma 2 and make them binary using the standard process of [22].
(3) LCA Preprocessing. We preprocess the $T_i$'s for answering lowest common ancestor queries using Lemma 1.
(4) Reduction to Reachability. In this step, we modify the exploded supergraph $\bar{G} = (\bar{V}, \bar{E})$ to obtain a new graph $\hat{G} = (\bar{V}, \hat{E})$, such that for every pair of vertices $(u, d_1)$ and $(v, d_2)$, there is a path from $(u, d_1)$ to $(v, d_2)$ in $\hat{G}$ iff there is a same-context valid path from $(u, d_1)$ to $(v, d_2)$ in $\bar{G}$. So, this step reduces the problem of reachability via same-context valid paths in $\bar{G}$ to simple reachability in $\hat{G}$.
(5) Local Preprocessing. In this step, for each pair of vertices $(u, d_1)$ and $(v, d_2)$ for which there exists a bag $b$ such that both $u$ and $v$ appear in $b$, we compute and cache whether $(u, d_1) \rightsquigarrow (v, d_2)$ in $\hat{G}$. We write $(u, d_1) \rightsquigarrow_{local} (v, d_2)$ to denote a reachability established in this step.
(6) Ancestors Reachability Preprocessing. In this step, we compute reachability information between each vertex in a bag and vertices appearing in its ancestors in the tree decomposition. Concretely, for each pair of vertices $(u, d_1)$ and $(v, d_2)$ such that $u$ appears in a bag $b$ and $v$ appears in a bag $b'$ that is an ancestor of $b$, we establish and remember whether $(u, d_1) \rightsquigarrow (v, d_2)$ in $\hat{G}$ and whether $(v, d_2) \rightsquigarrow (u, d_1)$ in $\hat{G}$. As above, we use the notations $(u, d_1) \rightsquigarrow_{anc} (v, d_2)$ and $(v, d_2) \rightsquigarrow_{anc} (u, d_1)$.
Steps (1)-(3) above are standard and well-known processes. We now provide details of steps (4)-(6). To skip the details and read about the query phase, see Section 4.3 below.

Step (4): Reduction to Reachability
In this step, our goal is to compute a new graph $\hat{G}$ from the exploded supergraph $\overline{G}$ such that there is a path from $(u, d_1)$ to $(v, d_2)$ in $\hat{G}$ iff there is a same-context valid path from $(u, d_1)$ to $(v, d_2)$ in $\overline{G}$. The idea behind this step is the same as that of the tabulation algorithm in [50].
Summary edges. Consider a call vertex $c_l$ in $G$ and its corresponding return-site vertex $r_l$. For $d_1, d_2 \in D^*$, the edge $((c_l, d_1), (r_l, d_2))$ is called a summary edge if there is a same-context valid path from $(c_l, d_1)$ to $(r_l, d_2)$ in the exploded supergraph $\overline{G}$. Intuitively, a summary edge summarizes the effects of procedure calls (same-context interprocedural paths) on the reachability between $c_l$ and $r_l$. From the definition of summary edges, it is straightforward to verify that the graph $\hat{G}$ obtained from $\overline{G}$ by adding every summary edge and removing every interprocedural edge has the desired property, i.e. a pair of vertices is reachable in $\hat{G}$ iff it is reachable by a same-context valid path in $\overline{G}$. Hence, we first find all summary edges and then compute $\hat{G}$. This is shown in Algorithm 1.
We now describe what Algorithm 1 does. Let $s_p$ be the start point of a procedure $p$. A shortcut edge is an edge $((s_p, d_1), (v, d_2))$ such that $v$ is in the same procedure $p$ and there is a same-context valid path from $(s_p, d_1)$ to $(v, d_2)$ in $\overline{G}$. The algorithm creates an empty graph $H = (\overline{V}, E')$. Note that $H$ is implicitly represented by only saving $E'$. It also creates a queue $Q$ of edges to be added to $H$ (initially $Q = \overline{E}$) and an empty set $S$ which will store the summary edges. The goal is to construct $H$ such that it contains (i) intraprocedural edges of $\overline{G}$, (ii) summary edges, and (iii) shortcut edges.
It constructs $H$ one edge at a time. While there is an unprocessed intraprocedural edge $e = ((u, d_1), (v, d_2))$ in $Q$, it chooses one such $e$ and adds it to $H$ (lines 5-10). Then, if $(u, d_1)$ is reachable from $(s_p, d_3)$ via a same-context valid path, then by adding the edge $e$, the vertex $(v, d_2)$ also becomes accessible from $(s_p, d_3)$. Hence, it adds the shortcut edge $((s_p, d_3), (v, d_2))$ to $Q$, so that it is later added to the graph $H$. Moreover, if $u$ is the start $s_p$ of the procedure $p$ and $v$ is its end $e_p$, then for every call vertex $c_l$ calling the procedure $p$ and its respective return-site $r_l$, we can add summary edges that summarize the effect of calling $p$ (lines 14-19). Finally, lines 20-24 compute $\hat{G}$ as discussed above.
Algorithm 1: Computing $\hat{G}$ in Step (4). An edge taken from $Q$ that is interprocedural, i.e. a call-to-start or exit-to-return-site edge, is skipped (line 8). At the end, every edge $e$ whose endpoints $u$ and $v$ are not in the same procedure is removed from $\hat{G}$ (line 23), and the summary edges are added by $\hat{G} \leftarrow \hat{G} \cup S$ (line 24).
Correctness. As argued above, every edge that is added to $H$ is either intraprocedural, a summary edge or a shortcut edge. Moreover, all such edges are added to $H$, because $H$ is constructed one edge at a time and every time an edge $e$ is added to $H$, all the summary/shortcut edges that might occur as a result of adding $e$ to $H$ are added to the queue $Q$ and hence later to $H$. Therefore, Algorithm 1 correctly computes summary edges and the graph $\hat{G}$.
Complexity. The running time of Algorithm 1 is $O(n \cdot |D|^3)$. For a more detailed analysis, see [50, Appendix].
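To illustrate the worklist idea behind this step, the following Python sketch computes summary edges on a toy exploded supergraph. It is a simplified variant in the spirit of the tabulation algorithm of [50], not a transcription of Algorithm 1. The function name `summary_edges`, the input encoding (`procs`, `ret_site`) and the toy program are assumptions made for this example.

```python
from collections import defaultdict, deque

def summary_edges(intra, calls, returns, procs, ret_site, facts):
    """Return (S, G_hat) for a toy exploded supergraph (hypothetical encoding).

    Vertices are pairs (node, fact).
    intra   : intraprocedural edges
    calls   : call-to-start edges ((c, d), (s_p, d'))
    returns : exit-to-return-site edges ((e_p, d), (r, d'))
    procs   : {procedure name: (start node, exit node)}
    ret_site: {call node: its return-site node}
    """
    succ = defaultdict(set)            # intraprocedural and summary successors
    for a, b in intra:
        succ[a].add(b)
    callers = defaultdict(set)         # (s_p, d) -> call vertices entering it
    for a, b in calls:
        callers[b].add(a)
    exit_out = defaultdict(set)        # (e_p, d) -> return-site vertices
    for a, b in returns:
        exit_out[a].add(b)
    exit_of = {s: e for s, e in procs.values()}

    shortcuts = set()                  # ((s_p, d3), (v, d2)): same-context reachable
    sources = defaultdict(set)         # vertex -> start vertices that reach it
    summaries = set()
    work = deque()

    def add_shortcut(src, dst):
        if (src, dst) not in shortcuts:
            shortcuts.add((src, dst))
            sources[dst].add(src)
            work.append((src, dst))

    for start, _ in procs.values():    # every start vertex reaches itself
        for d in facts:
            add_shortcut((start, d), (start, d))

    while work:
        src, dst = work.popleft()
        for nxt in list(succ[dst]):    # extend the shortcut by one edge
            add_shortcut(src, nxt)
        if dst[0] == exit_of[src[0]]:  # reached the exit: summarize the call
            for c in callers[src]:
                for r in exit_out[dst]:
                    if r[0] == ret_site[c[0]] and (c, r) not in summaries:
                        summaries.add((c, r))
                        succ[c].add(r)
                        for s in list(sources[c]):
                            add_shortcut(s, r)
    # G_hat keeps intraprocedural edges and adds summary edges only.
    return summaries, set(intra) | summaries

# Toy program: main calls f at node c (return site r); f does not propagate "x".
facts = [0, "x"]
procs = {"main": ("m0", "m1"), "f": ("f0", "f1")}
intra = {(("m0", 0), ("c", 0)), (("m0", 0), ("c", "x")),
         (("r", 0), ("m1", 0)), (("r", "x"), ("m1", "x")),
         (("f0", 0), ("f1", 0))}
calls = {(("c", 0), ("f0", 0)), (("c", "x"), ("f0", "x"))}
returns = {(("f1", 0), ("r", 0)), (("f1", "x"), ("r", "x"))}
S, G_hat = summary_edges(intra, calls, returns, procs, {"c": "r"}, facts)
```

On this toy input the only summary edge is the one from `("c", 0)` to `("r", 0)`, because the fact `"x"` never reaches the exit of `f`.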

Step (5): Local Preprocessing
In this step, we compute the set $R_{local}$ of local reachability edges, i.e. edges of the form $((u, d_1), (v, d_2))$ such that $u$ and $v$ appear in the same bag $b$ of a tree decomposition $T_i$ and $(u, d_1) \rightsquigarrow (v, d_2)$ in $\hat{G}$. We write $(u, d_1) \rightsquigarrow_{local} (v, d_2)$ to denote $((u, d_1), (v, d_2)) \in R_{local}$. Note that $\hat{G}$ has no interprocedural edges. Hence, we can process each $T_i$ separately. We use a divide-and-conquer technique similar to the kernelization method used in [22] (Algorithm 2). Algorithm 2 processes a tree decomposition $T$ by choosing a leaf bag $b_l$ and performing a local all-pairs reachability computation on $H_l$, the part of $\hat{G}$ that corresponds to $b_l$. For every pair of vertices with $(u, d_1) \rightsquigarrow (v, d_2)$ in $H_l$, the algorithm adds the edge $((u, d_1), (v, d_2))$ to both $R_{local}$ and $\hat{G}$ (lines 7-9). Note that this does not change reachability relations in $\hat{G}$, given that the vertices connected by the new edge were reachable by a path before adding it. Then, if $b_l$ is not the only bag in $T$, the algorithm recursively calls itself over the tree decomposition $T - b_l$, i.e. the tree decomposition obtained by removing $b_l$ (lines 10-11). Finally, it repeats the reachability computation on $H_l$ (lines 12-14). The running time of the algorithm is $O(n \cdot |D^*|^3)$.
Example 7. Consider the graph $G$ and tree decomposition $T$ given in Figure 6 and let $D^* = \{0\}$, i.e. let $\hat{G}$ and $\overline{G}$ be isomorphic to $G$. Figure 7 illustrates the steps taken by Algorithm 2. In each step, a bag is chosen and a local all-pairs reachability computation is performed over the bag. Local reachability edges are added to $R_{local}$ and to $\hat{G}$ (if they are not already in $\hat{G}$).
We now prove the correctness and establish the complexity of Algorithm 2.
Correctness. We prove that when computeLocalReachability($T$) ends, the set $R_{local}$ contains all the local reachability edges between vertices that appear in the same bag in $T$. The proof is by induction on the size of $T$. If $T$ consists of a single bag, then the local reachability computation on $H_l$ (lines 7-9) fills $R_{local}$ correctly. Now assume that $T$ has $n$ bags. Let $H_{-l}$ be the part of $\hat{G}$ that corresponds to the other bags in $T$, i.e. every bag except the leaf bag $b_l$. After the local reachability computation at lines 7-9, $(v, d_2)$ is reachable from $(u, d_1)$ in $H_{-l}$ if and only if it is reachable in $\hat{G}$. This is because (i) the vertices of $H_l$ and $H_{-l}$ form a separation of $\hat{G}$ with separator $(V(b_l) \cap V(b_p)) \times D^*$ (Lemma 3) and (ii) all reachability information in $H_l$ is now replaced by direct edges (line 8). Hence, by induction hypothesis, line 11 finds all the local reachability edges for $T - b_l$ and adds them to both $R_{local}$ and $\hat{G}$. Therefore, after line 11, for every $u, v \in V(b_l)$, we have $(u, d_1) \rightsquigarrow (v, d_2)$ in $H_l$ iff $(u, d_1) \rightsquigarrow (v, d_2)$ in $\hat{G}$. Hence, the final all-pairs reachability computation of lines 12-14 adds all the local edges in $b_l$ to $R_{local}$.
Complexity. Algorithm 2 performs at most two local all-pairs reachability computations over the vertices appearing in each bag, i.e. $O(t \cdot |D^*|)$ vertices. Each such computation can be performed in $O(t^3 \cdot |D^*|^3)$ using standard reachability algorithms. Given that the $T_i$'s have $O(n)$ bags overall, the total runtime of Algorithm 2 is $O(n \cdot t^3 \cdot |D^*|^3) = O(n \cdot |D^*|^3)$. Note that the treewidth $t$ is a constant and hence the factor $t^3$ can be removed.
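The leaf-peeling scheme can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes $D^* = \{0\}$ so that vertices are plain identifiers, unrolls the recursion into a leaf-first pass followed by the reverse pass, and uses a Warshall-style closure inside each bag. The names `local_reachability` and `close_bag` and the toy graph are hypothetical.

```python
def local_reachability(bags, parent, edges):
    """bags: {bag id: set of vertices}; parent: {bag id: parent id or None}."""
    edges = set(edges)                     # working copy of the edges of G_hat
    r_local = set()

    def depth(b):
        return 0 if parent[b] is None else 1 + depth(parent[b])

    def close_bag(b):
        # All-pairs reachability on the subgraph induced by the vertices of b.
        vs = list(bags[b])
        reach = {(u, v) for (u, v) in edges if u in bags[b] and v in bags[b]}
        for k in vs:
            for i in vs:
                if (i, k) in reach:
                    for j in vs:
                        if (k, j) in reach:
                            reach.add((i, j))
        edges.update(reach)                # reachability becomes direct edges
        r_local.update(reach)

    # Peel bags leaf-first: deeper bags are handled before their ancestors.
    order = sorted(bags, key=depth, reverse=True)
    for b in order:                        # going down the recursion
        close_bag(b)
    for b in reversed(order):              # returning from the recursion
        close_bag(b)
    return r_local

# Toy input: the path 1 -> 2 -> 4 -> 5 -> 3 with a chain of three bags.
bags = {"B0": {1, 2, 3}, "B1": {2, 3, 4}, "B2": {3, 4, 5}}
parent = {"B0": None, "B1": "B0", "B2": "B1"}
path_edges = {(1, 2), (2, 4), (4, 5), (5, 3)}
r_path = local_reachability(bags, parent, path_edges)
r_cycle = local_reachability(bags, parent, path_edges | {(3, 1)})
```

In the first call, the pair (1, 3) is found although the only path from 1 to 3 leaves the bag B0: the shortcut edges added while peeling B2 and B1 carry that information up. In the second call the graph is a cycle, and the reverse pass is what propagates reachability back down to B1 and B2.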

Step (6): Ancestors Reachability Preprocessing
This step aims to find reachability relations between each vertex of a bag and vertices that appear in the ancestors of that bag. As in the previous case, we compute a set $R_{anc}$ and write $(u, d_1) \rightsquigarrow_{anc} (v, d_2)$ if $((u, d_1), (v, d_2)) \in R_{anc}$. This step is performed by Algorithm 3. For each bag $b$ and vertex $(u, d)$ such that $u \in V(b)$ and each $0 \le j < d_b$, where $d_b$ is the depth of $b$, we maintain two sets: $F(u, d, b, j)$ and $F'(u, d, b, j)$, each containing a set of vertices whose first coordinate is in the ancestor of $b$ at depth $j$. Intuitively, the vertices in $F(u, d, b, j)$ are reachable from $(u, d)$. Conversely, $(u, d)$ is reachable from the vertices in $F'(u, d, b, j)$. At first all $F$ and $F'$ sets are initialized as $\emptyset$. We process each tree decomposition $T_i$ in a top-down manner and perform the following actions at each bag:
- If a vertex $u$ appears in both $b$ and its parent $b_p$, then the reachability data computed for $(u, d)$ at $b_p$ can also be used in $b$. So, the algorithm copies this data (lines 4-7).
- If $(u, d_1) \rightsquigarrow_{local} (v, d_2)$, then this reachability relation is saved in $F$ and $F'$ (lines 10-11). Also, any vertex that is reachable from $(v, d_2)$ is reachable from $(u, d_1)$, too. So, the algorithm adds $F(v, d_2, b, j)$ to $F(u, d_1, b, j)$ (line 13). The converse happens to $F'$ (line 14).
After the execution of Algorithm 3, we have $(v, d_2) \in F(u, d_1, b, j)$ iff $v$ appears in the ancestor of $b$ at depth $j$ and $(v, d_2)$ is reachable from $(u, d_1)$ in $\hat{G}$; the sets $F'$ satisfy the converse property. Algorithm 3 has a runtime of $O(n \cdot |D|^3 \cdot \log n)$. See [17] for detailed proofs. In the next section, we show that this runtime can be reduced to $O(n \cdot |D|^3)$ using word tricks.

Word Tricks
We now show how to reduce the time complexity of Algorithm 3 from $O(n \cdot |D^*|^3 \cdot \log n)$ to $O(n \cdot |D^*|^3)$ using word tricks. The idea is to pack the $F$ and $F'$ sets of Algorithm 3 into words, i.e. represent them by a binary sequence.

Algorithm 3: Ancestors Preprocessing in Step (6)
Given a bag $b$, we define $\delta_b$ as the sum of sizes of all ancestors of $b$. The tree decompositions are balanced, so $b$ has $O(\log n)$ ancestors. Moreover, the width is $t$, hence $\delta_b = O(t \cdot \log n) = O(\log n)$ for every bag $b$. We perform a top-down pass of each tree decomposition $T_i$ and compute $\delta_b$ for each $b$.
For every bag $b$, $u \in V(b)$ and $d_1 \in D^*$, we store $F(u, d_1, b, -)$ as a binary sequence of length $\delta_b \cdot |D^*|$. The first $|V(b)| \cdot |D^*|$ bits of this sequence correspond to $F(u, d_1, b, d_b)$. The next $|V(b_p)| \cdot |D^*|$ bits correspond to $F(u, d_1, b, d_b - 1)$, and so on. We use a similar encoding for $F'$. Using this encoding, Algorithm 3 can be rewritten by word tricks and bitwise operations as follows:
- Lines 5-6 copy $F(u, d, b_p, -)$ into $F(u, d, b, -)$. However, we have to shift and align the bits, so these lines can be replaced by a single shift of $F(u, d, b_p, -)$ by $|V(b)| \cdot |D^*|$ positions.
- Line 10 sets a single bit to 1.
- Lines 12-13 perform a union, which can be replaced by the bitwise OR operation. Hence, these lines can be replaced by $F(u, d_1, b, -) \leftarrow F(u, d_1, b, -) \;\mathrm{OR}\; F(v, d_2, b, -)$.
- Computations on $F'$ can be handled similarly.
Note that we do not need to compute $R_{anc}$ explicitly given that our queries can be written in terms of the $F$ and $F'$ sets. It is easy to verify that using these word tricks, every $W$ operations in lines 6, 7, 13 and 14 are replaced by one or two bitwise operations on words. Hence, given a word size of $W = \Theta(\log n)$, the overall runtime of Algorithm 3 is reduced to $O\left(\frac{n \cdot |D^*|^3 \cdot \log n}{W}\right) = O(n \cdot |D^*|^3)$.
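The packed encoding can be sketched in Python, whose unbounded integers stand in for machine words. This is an illustrative sketch under several assumptions: $D^* = \{0\}$, bit 0 is treated as the "first" bit (so the alignment is a left shift here, while the direction depends on the bit-order convention), and $R_{local}$ is taken to be reflexive. The names `ancestor_bitsets` and `decode` and the toy input are hypothetical; only $F$ is computed, and $F'$ would be symmetric.

```python
def ancestor_bitsets(bags, parent, r_local):
    """bags: {bag id: list of vertices}; r_local: pairs (u, v) sharing a bag
    such that v is reachable from u (assumed reflexive in this sketch)."""
    def depth(b):
        return 0 if parent[b] is None else 1 + depth(parent[b])

    F = {}
    for b in sorted(bags, key=depth):          # top-down: parents first
        p, width = parent[b], len(bags[b])
        for u in bags[b]:
            # Lines 5-6: copy the parent's word, shifted to make room for b.
            F[u, b] = F[u, p] << width if p is not None and u in bags[p] else 0
        for u in bags[b]:
            for i, v in enumerate(bags[b]):
                if (u, v) in r_local:
                    F[u, b] |= 1 << i          # line 10: set a single bit
        for u in bags[b]:
            for v in bags[b]:
                if (u, v) in r_local:
                    F[u, b] |= F[v, b]         # lines 12-13: union as bitwise OR
    return F

def decode(word, b, bags, parent):
    """Unpack a word into (vertex, bag) pairs, walking from b up to the root."""
    out = set()
    while b is not None:
        out |= {(v, b) for i, v in enumerate(bags[b]) if (word >> i) & 1}
        word >>= len(bags[b])
        b = parent[b]
    return out

# Toy input: the path 1 -> 2 -> 4 -> 5 -> 3 with a chain of three bags.
bags = {"B0": [1, 2, 3], "B1": [2, 3, 4], "B2": [3, 4, 5]}
parent = {"B0": None, "B1": "B0", "B2": "B1"}
path = [1, 2, 4, 5, 3]
reach = {(path[i], path[j]) for i in range(5) for j in range(i, 5)}
r_local = {(u, v) for (u, v) in reach
           if any(u in vs and v in vs for vs in bags.values())}
F = ancestor_bitsets(bags, parent, r_local)
```

For instance, `decode(F[5, "B2"], "B2", bags, parent)` yields vertex 5 in B2 together with vertex 3 in each of B2, B1 and B0, since 3 is the only other vertex reachable from 5 and it appears in all three bags.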

Answering Queries
We now describe how to answer pair and single-source queries using the data saved in the preprocessing phase.
Answering a Pair Query. Our algorithm answers a pair query from a vertex $(u, d_1)$ to a vertex $(v, d_2)$ as follows:
(i) If $u$ and $v$ are not in the same flow graph, return 0 (no).
(ii) Otherwise, let $G_i$ be the flow graph containing both $u$ and $v$. Let $b_u = rb(u)$ and $b_v = rb(v)$ be the root bags of $u$ and $v$ in $T_i$ and let $b = lca(b_u, b_v)$.
(iii) If there exists a vertex $w \in V(b)$ and $d_3 \in D^*$ such that $(u, d_1) \rightsquigarrow_{anc} (w, d_3)$ and $(w, d_3) \rightsquigarrow_{anc} (v, d_2)$, return 1 (yes), otherwise return 0 (no).
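The query procedure can be illustrated with the following Python sketch. It is a simplified stand-in rather than the authors' implementation: it assumes $D^* = \{0\}$, builds the $F$ and $F'$ tables by brute force from a reflexive transitive closure instead of running Steps (5) and (6), and finds the lowest common ancestor by walking parent pointers instead of using the constant-time structure of Lemma 1. All function names and the toy input are hypothetical.

```python
from itertools import product

def transitive_closure(vertices, edges):
    # Reflexive closure; product() varies k slowest, as Warshall requires.
    reach = {(v, v) for v in vertices} | set(edges)
    for k, i, j in product(vertices, repeat=3):
        if (i, k) in reach and (k, j) in reach:
            reach.add((i, j))
    return reach

def ancestors(b, parent):
    chain = []
    while b is not None:
        chain.append(b)
        b = parent[b]
    return chain                               # b first, root last

def lca(b1, b2, parent):
    seen = set(ancestors(b1, parent))
    return next(a for a in ancestors(b2, parent) if a in seen)

def root_bag(u, bags, parent):
    # The bag containing u that is closest to the root.
    return min((b for b in bags if u in bags[b]),
               key=lambda b: len(ancestors(b, parent)))

def build_tables(bags, parent, reach):
    F, Fp = {}, {}
    for u in {v for vs in bags.values() for v in vs}:
        for a in ancestors(root_bag(u, bags, parent), parent):
            F[u, a] = {w for w in bags[a] if (u, w) in reach}    # u reaches w
            Fp[u, a] = {w for w in bags[a] if (w, u) in reach}   # w reaches u
    return F, Fp

def pair_query(u, v, bags, parent, F, Fp):
    b = lca(root_bag(u, bags, parent), root_bag(v, bags, parent), parent)
    return bool(F[u, b] & Fp[v, b])            # some w in V(b) lies on a path

# Toy flow graph with a branching tree decomposition.
bags = {"B0": {2, 3}, "B1": {1, 2}, "B2": {2, 4}, "B3": {4, 5}}
parent = {"B0": None, "B1": "B0", "B2": "B0", "B3": "B2"}
edges = {(1, 2), (2, 3), (3, 2), (2, 4), (4, 5)}
vertices = [1, 2, 3, 4, 5]
reach = transitive_closure(vertices, edges)
F, Fp = build_tables(bags, parent, reach)
```

With the packed encoding of the previous section, the set intersection in `pair_query` would become a bitwise AND over the words that store the bits of the bag $b$.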
Correctness. If there is a path $P: (u, d_1) \rightsquigarrow (v, d_2)$, then we claim $P$ must pass through a vertex $(w, d_3)$ with $w \in V(b)$. This follows from the separation property of tree decompositions (cf. Lemma 3).
Answering a Single-source Query. Consider a single-source query from a vertex $(u, d_1)$ with $u \in V_i$. We can answer this query by performing $|V_i| \times |D^*|$ pair queries, i.e. by performing one pair query from $(u, d_1)$ to $(v, d_2)$ for each $v \in V_i$ and $d_2 \in D^*$. Since $|D^*| = O(|D|)$, the total complexity is $O\left(|V_i| \cdot |D| \cdot \left\lceil \frac{|D|}{\log n} \right\rceil\right)$ for answering a single-source query. Using a more involved preprocessing method, we can slightly improve this time to $O\left(\frac{|V_i| \cdot |D|^2}{\log n}\right)$. See [17] for more details. Based on the results above, we now present our main theorem:

Parallelizability and Optimality
We now turn our attention to parallel versions of our query algorithms, as well as cases where the algorithms are optimal.
1. Given a pair query of the form $(u, d_1, v, d_2)$, let $b_u$ (resp. $b_v$) be the root bag of $u$ (resp. $v$), and $b = lca(b_u, b_v)$ the lowest common ancestor of $b_u$ and $b_v$. We partition the set
Optimality. Observe that when $|D| = O(1)$, i.e. when the domain is small, our algorithm is optimal: the preprocessing runs in $O(n)$, which is proportional to the size of the input, and the pair query and single-source query run in times $O(1)$ and $O(n / \log n)$, respectively, each case being proportional to the size of the output. Small domains arise often in practice, e.g. in dead-code elimination or null-pointer analysis.

Experimental Results
We report on an experimental evaluation of our techniques and compare their performance to standard alternatives in the literature.
Benchmarks. We used 5 classical data-flow analyses in our experiments, including reachability (for dead-code elimination), possibly-uninitialized variables analysis, simple uninitialized variables analysis, liveness analysis of the variables, and reaching-definitions analysis. We followed the specifications in [36] for modeling the analyses in IFDS. We used real-world Java programs from the DaCapo benchmark suite [6], obtained their flow graphs using Soot [65] and applied the JTDec tool [19] for computing balanced tree decompositions. Given that some of these benchmarks are prohibitively large, we only considered their main Java packages, i.e. packages containing the starting point of the programs. We experimented with a total of 22 benchmarks, which, together with the 5 analyses above, led to a total of 110 instances. Our instance sizes, i.e. number of vertices and edges in the exploded supergraph, range from 22 to 190,591. See [17] for details.
Implementation and comparison. We implemented both variants of our approach, i.e. sequential and parallel, in C++. We also implemented the parts of the classical IFDS algorithm [50] and its on-demand variant [36] responsible for same-context queries. All of our implementations closely follow the pseudocodes of our algorithms and the ones in [50,36], and no additional optimizations are applied. We compared the performance of the following algorithms for randomly generated queries:
- SEQ. The sequential variant of our algorithm.
- PAR. A variant of our algorithm in which the queries are answered using perfect parallelization and 12 threads.
- NOPP. The classical same-context IFDS algorithm of [50], with no preprocessing. NOPP performs a complete run of the classic IFDS algorithm for each query.
- CPP. The classical same-context IFDS algorithm of [50], with complete preprocessing. In this algorithm, all summary edges and reachability information are precomputed and the queries are simple table lookups.
- OD. The on-demand same-context IFDS algorithm of [36]. This algorithm does not preprocess the input. However, it remembers the information obtained in each query and uses it to speed up subsequent queries.
For each instance, we randomly generated 10,000 pair queries and 100 single-source queries. In case of single-source queries, source vertices were chosen uniformly at random. For pair queries, we first chose a source vertex uniformly at random, and then chose a target vertex in the same procedure, again uniformly at random.
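The sampling scheme for pair queries can be sketched as follows. This is only an illustration of the description above, not the script used in the experiments. The function name `generate_pair_queries`, the `procedure_of` mapping, the fixed seed, and the choice to draw the data facts uniformly and independently of the vertices are all assumptions.

```python
import random

def generate_pair_queries(procedure_of, facts, count, seed=0):
    """procedure_of: {vertex: procedure}; returns `count` queries (u, d1, v, d2)."""
    rng = random.Random(seed)                  # fixed seed for reproducibility
    vertices = sorted(procedure_of)
    by_proc = {}
    for v in vertices:
        by_proc.setdefault(procedure_of[v], []).append(v)
    queries = []
    for _ in range(count):
        u = rng.choice(vertices)                    # source: uniform over all vertices
        v = rng.choice(by_proc[procedure_of[u]])    # target: uniform in u's procedure
        queries.append((u, rng.choice(facts), v, rng.choice(facts)))
    return queries

# Hypothetical instance with two procedures.
procedure_of = {"m0": "main", "c": "main", "r": "main", "f0": "f", "f1": "f"}
queries = generate_pair_queries(procedure_of, [0, "x"], 1000)
```

Restricting the target to the procedure of the source ensures that every generated query is a same-context query.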
Experimental setting. The results were obtained on Debian using an Intel Xeon E5-1650 processor (3.2 GHz, 6 cores, 12 threads) with 128GB of RAM. The parallel results used all 12 threads.
Time limit. We enforced a preprocessing time limit of 5 minutes per instance. This is in line with the preprocessing times of state-of-the-art tools on benchmarks of this size, e.g. Soot takes 2-3 minutes to generate all flow graphs for each benchmark.
Results. We found that, except for the smallest instances, our algorithm consistently outperforms all previous approaches. Our results were as follows:
Treewidth. The maximum width amongst the obtained tree decompositions was 9, while the minimum was 1. Hence, our experiments confirm the results of [34,19] and show that real-world Java programs have small treewidth. See [17] for more details.
Preprocessing Time. As shown in Figure 8, our preprocessing is more lightweight and scalable than CPP. Note that CPP preprocessing times out at 25 of the 110 instances, starting with instances of size < 50,000, whereas our approach can comfortably handle instances of size 200,000. Although the theoretical worst-case complexity of CPP preprocessing is $O(n^2 \cdot |D|^3)$, we observed that its runtime over our benchmarks grows more slowly. We believe this is because our benchmark programs generally consist of a large number of small procedures. Hence, the worst-case behavior of CPP preprocessing, which happens on instances with large procedures, is not captured by the DaCapo benchmarks. In contrast, our preprocessing time is $O(n \cdot |D|^3)$ and having small or large procedures does not matter to our algorithms. Hence, we expect that our approach would outperform CPP preprocessing more significantly on instances containing large functions. However, as Figure 8 demonstrates, our approach is faster even on instances with small procedures.
Query Time. As expected, in terms of pair query time, NOPP is the worst performer by a large margin, followed by OD, which is in turn far less efficient than CPP, PAR and SEQ (Figure 9, top). This illustrates the underlying trade-off between preprocessing and query-time performance. Note that both CPP and our algorithms (SEQ and PAR) answer each pair query in $O(1)$. They all have pair-query times of less than a millisecond and are indistinguishable in this case. The same trade-off appears in single-source queries as well (Figure 9, bottom). Again, NOPP is the worst performer, followed by OD. SEQ and CPP have very similar runtimes, except that SEQ outperforms CPP in some cases, due to word tricks. However, PAR is much faster, which leads to the next point.
Parallelization. In Figure 9 (bottom right), we also observe that single-source queries are handled considerably faster by PAR in comparison with SEQ. Specifically, using 12 threads, the average single-source query time is reduced by a factor of 11.3. Hence, our experimental results achieve near-perfect parallelism and confirm that our algorithm is well-suited for parallel architectures.
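To make the notion of near-perfect parallelism concrete, the reported speed-up can be turned into a parallel efficiency, i.e. the fraction of the ideal 12-fold speed-up that is attained. The efficiency measure is a standard one and is added here for illustration; only the numbers 11.3 and 12 come from the experiments.

```latex
\[
  \text{efficiency}
  \;=\; \frac{\text{speed-up}}{\text{number of threads}}
  \;=\; \frac{11.3}{12}
  \;\approx\; 0.94
\]
```

In other words, roughly 94% of the ideal linear speed-up is realized on average for single-source queries.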
Note that Figure 9 combines the results of all five mentioned data-flow analyses. However, the observations above hold independently for every single analysis, as well. See [17] for analysis-specific figures. In Figure 9, each row starts with a global picture (left) and zooms into smaller time units (right) to differentiate between the algorithms.

Conclusion
We developed new techniques for on-demand data-flow analyses in IFDS, by exploiting the treewidth of flow graphs. Our complexity analysis shows that our techniques (i) have better worst-case complexity, (ii) offer certain optimality guarantees, and (iii) are embarrassingly parallelizable. Our experiments demonstrate these improvements in practice: after a lightweight one-time preprocessing, queries are answered as fast as the heavyweight complete preprocessing, and the parallel speed-up is close to its theoretical optimum. The main limitation of our approach is that it only handles same-context queries. Using treewidth to speed up non-same-context queries is a challenging direction of future work.