Innermost many-sorted term rewriting on GPUs

This article presents a way to implement many-sorted term rewriting on a GPU. This is done by letting the GPU repeatedly perform a massively parallel evaluation of all subterms. Innermost many-sorted term rewriting is experimentally compared with a relaxed form of innermost many-sorted term rewriting, and two different garbage collection mechanisms, to remove terms that are no longer needed, are discussed and experimentally compared. It is concluded that when the many-sorted term rewrite systems exhibit sufficient internal parallelism, GPU rewriting substantially outperforms the CPU. Both relaxed innermost many-sorted rewriting and garbage collection further improve this performance. Since the implementation can probably be even further optimised, and because in any case GPUs will become much more powerful in the future, this suggests that GPUs are an interesting platform for (many-sorted) term rewriting. As term rewriting can be viewed as a universal programming language, this also opens a route towards programming GPUs by term rewriting, especially for irregular computations. © 2022 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Graphics Processing Units (GPUs) have enormous computational power and performance-per-watt compared to (multicore) CPUs [1]. GPUs are optimised for the highly parallel and regular computations that occur in graphics processing, but they are becoming more and more interesting for general purpose computations (for instance, see [2][3][4][5][6][7]). It is not without reason that modern supercomputers have large banks of graphics processors installed in them [8]. GPU designers realise this and make GPUs increasingly suitable for irregular computations. For instance, they have added improved caches and atomic operations.
This raises the question to what extent the GPU can be used for more irregular computational tasks. The main limitation is that a highly parallel algorithm is needed to fully utilise the power of the GPU. For irregular problems, it is the programmer's task to recognise the regularities in problems over irregular data structures such as graphs.
The evaluation of term rewriting systems (TRSs) is an irregular problem that is interesting for the formal methods community. For example, term rewriting increases the expressiveness of models in the area of model checking [9], and the performance of term rewriting is a long-standing and important objective [10]. Many-sorted term rewriting systems involve sorts, i.e., data types, which provide a means to require that terms are well-sorted. A question that follows is whether this model of computation can be used to express programs for GPUs more easily. Besides their use in the formal methods community, a TRS is also a simple, yet universal mechanism for computation [11,12].
We recall that a term rewriting system that enjoys the Church-Rosser property is parallel in nature, in the sense that rewriting can take place at any point in the system and the order in which it takes place does not influence the outcome. This suggests a very simple model for parallel evaluation. The (sub)terms of the system can be distributed over the available processors, and each processor can independently work on its own terms and do its evaluation.
We designed experiments and compared GPU many-sorted term rewriting with CPU many-sorted term rewriting of the same terms. Furthermore, we experimented with two different mechanisms for garbage collection of deleted terms, and considered both innermost many-sorted term rewriting and a relaxed version of innermost many-sorted term rewriting. We find that our implementation manages to employ 80% of the bandwidth of the GPU for random accesses. Since random accesses are the performance bottleneck in our implementation, the GPU is used quite well. For intrinsically parallel rewrite tasks, the GPU outperforms a CPU by up to a factor of 10. The experiments also show that if the number of terms that can be evaluated in parallel is reduced, rewriting slows down quite dramatically. For this reason, relaxed innermost rewriting is often faster than innermost rewriting, as the former is more flexible in evaluating terms in parallel than the latter. The reason that the possibilities for parallel evaluation affect the performance so strongly is that individual GPU processors are much slower than a CPU processor, and GPU cycles are spent on non-reducible terms.
Garbage collection is not only necessary to keep sufficient memory available during rewriting; it also tends to speed up the rewriting itself, as the use of garbage collection tends to lead to better memory access patterns when terms are accessed. Our garbage collection experiments show that typically, maintaining a queue of free positions for new terms is more efficient than periodically moving terms to shift out the deleted ones, but our results suggest that as the number of rewritten terms increases, the latter mechanism becomes increasingly efficient.
This leads us to the following conclusion. Term rewriting, both many-sorted and single-sorted, on a GPU certainly has potential. Although our implementation performs close to the random access peak bandwidth, this does not mean that performance cannot be improved. It does mean that future optimisations need to focus on increasing regularity in the implementation, especially in memory access patterns, for example by grouping together similar terms, or by techniques such as kernel unrolling in combination with organising terms such that subterms are close to their parent terms, as proposed by Nasre et al. [13]. Furthermore, we expect that GPUs will quickly become faster, in particular for applications with random accesses.
However, we also observe that when the degree of parallelism in a term is reduced, it is better to let the CPU do the work. This calls for a hybrid approach in which it is dynamically decided whether a term is to be evaluated on the CPU or on the GPU, depending on the number of subterms that need to be rewritten. This is future work. We also see that designing inherently parallel rewriting systems is an important skill that we must learn to master.
Although much work lies ahead of us, we conclude that using GPUs to solve (many-sorted) term rewriting problems is promising. It allows for abstract programming independent of the hardware details of GPUs, in the sense that it is not needed to explicitly indicate which data should be processed by which threads, and what type of memory should be used. Also, it offers the potential of evaluating appropriate rewrite systems at least one order, and in the future several orders, of magnitude faster than on a CPU.

Contributions
In this article, we investigate whether and under which conditions many-sorted term rewriting systems can be evaluated effectively on GPUs. We experimented with different compilation schemes from rewrite systems to GPU code, and present here one where all processors evaluate all subterms in parallel. This has the drawback that terms that cannot be evaluated still require processing time. Terms can be discarded during evaluation, and therefore garbage collection is required. All processors are also involved in this.
The current work extends our previously published work on GPU many-sorted term rewriting [14]. Compared to that paper, we have added the following contributions:
1. The fact that our parallel term rewriting supports the use of sorts is made explicit in the definitions.
2. More aspects of the parallel term rewriting mechanism are explained algorithmically, and the examples have been expanded.
3. The data structures have been redesigned, to further optimise the code.
4. The garbage collector introduced in [14] periodically collects the indices of free term positions in a list. An alternative is to periodically sort the terms, thereby shifting the free positions out. In [6], two of the current authors investigated the latter approach in the context of SAT solving. In the current article, we use both approaches, and experimentally compare their impact on performance for term rewriting.
5. In [14], we used a rewriting strategy known as innermost rewriting, which has the nice property for our parallelisation that it guarantees thread-safety, i.e., the absence of data races. However, it can also be a limiting factor w.r.t. the amount of rewriting that can be done in parallel. In this article, we also consider a relaxed form of innermost rewriting, which can allow more rewriting to happen in parallel (this depends on the rewrite system), while still being thread-safe.

Related work
An earlier approach to inherently evaluate a program in parallel was made in the eighties. The Church-Rosser property for pure functional programs sparked interest from researchers, and the availability of cheap microprocessors made it possible to assemble multiple processors to work on the evaluation of one single functional program. Jones et al. proposed GRIP, a parallel reduction machine designed to execute functional programs on multiple microprocessors that communicate using an on-chip bus [15]. At the same time, Barendregt et al. proposed the Dutch Parallel Reduction Machine project, which follows a largely similar architecture of many microprocessors communicating over a shared memory bus [16]. Although technically feasible, the impact of these projects was limited, as the number of available processors was too small and the communication overhead too severe to become a serious contender of sequential programming. GPUs offer a different infrastructure, with on the order of a thousand times more processors and highly integrated on-chip communication. Therefore, GPUs are a new and possibly better candidate for parallel evaluation of TRSs.
One current approach to GPU programming is to write a program at a highly abstract level and transform it in a stepwise fashion into an optimised GPU program [17]. Other approaches extend languages with notation for array processing tasks that can be sparked off to the GPU. Examples in the functional programming world are Accelerate [18], an embedded array processing language for Haskell, and Futhark [19], a data parallel language which generates code for NVIDIA's Compute Unified Device Architecture (CUDA) interface. While Futhark and Accelerate make it easier to use the power of the GPU, both approaches are tailored to highly regular problems. Implementing irregular problems over more complicated data structures remains challenging and requires the programmer to translate the problem to the regular structures provided in the language, as seen in, for example, [20][21][22].
Related to this work is the work of Nasre et al. [23], in which parallel graph mutation and rewriting programs for both GPUs and CPUs are studied. In particular, they study Delaunay mesh refinement (DMR) and points-to analysis (PTA). PTA is related to term rewriting in the sense that nodes perform simple rule-based computations, but it is different in the sense that no new nodes are created. In DMR, new nodes and edges are created, but the calculations are done in a very different manner. The term rewriting in this work can be seen as a special case of graph rewriting, where symbols are seen as nodes and subterms as edges.
Finally, regarding garbage collection for GPUs, the authors of [24,25] investigated how to offload garbage collectors to an Accelerated Processing Unit (APU). A promising alternative for stream compaction [26] via parallel defragmentation has been proposed in [27]. The two garbage collectors that we consider in this article are simpler: the first, originally proposed in the conference version of this article [14], has been tailored to the setting of term rewriting, and the second, originally designed for SAT solving [6], can be adapted for term rewriting in a straightforward way. This allows both garbage collectors to be simple, yet effective.

Preliminaries
We introduce many-sorted term rewriting and what it means to apply rewrite rules, and give an overview of the CUDA GPU computing model.

Many-sorted term rewriting
Terms are constructed from a set of variables V and a set of function symbols F. A function symbol is applied to a predefined number of terms as arguments. We refer to this number as the arity of the function symbol, and denote the arity of a function symbol f by arity(f). If arity(f) = 0, we say f is a constant.
The set of subterms sub+(t) of a term t is inductively defined as follows: sub+(t) = {t} if t ∈ V, and sub+(t) = {t} ∪ sub+(t_1) ∪ . . . ∪ sub+(t_arity(f)) if t = f(t_1, . . ., t_arity(f)). With sub_i(t) (i ≥ 1), we refer to the i-th direct subterm of term t. This is defined as follows, with ⊥ denoting that sub_i(t) is undefined: sub_i(t) = t_i if t = f(t_1, . . ., t_arity(f)) and i ≤ arity(f), and sub_i(t) = ⊥ otherwise. In addition, there is a set of sorts S. Every variable and function symbol has a sort in S, and for function symbols, it is defined which sort each of its arguments has. With S^n, we refer to n-tuples over S, and S* denotes the set of all finite sequences over S. The following functions are used to map terms to sorts: st : V ∪ F → S and ar : F → S*. We require for all f ∈ F that ar(f) ∈ S^arity(f). The function st maps each variable in V and function symbol in F to a sort. The function ar defines the sort of each argument of a function symbol. To refer to the sort of the i-th argument of a function symbol f, with i ∈ {1, . . ., arity(f)}, we use the notation ar(f, i).
To refer to the sort of a term, we use a function sort. The sort of a term t is defined as the sort associated to its head symbol, in case t ∉ V. Otherwise, it is the sort associated to the variable t: sort(t) = st(f) if t = f(t_1, . . ., t_arity(f)), and sort(t) = st(t) if t ∈ V. Together, the sets V, F and S constitute a signature Σ = (F, V, S, arity, st, ar). The set of terms T over a signature Σ is inductively defined as the smallest set satisfying: (1) V ⊆ T, and (2) f(t_1, . . ., t_arity(f)) ∈ T whenever f ∈ F, each t_i ∈ T, and sort(t_i) = ar(f, i) for all i ∈ {1, . . ., arity(f)}. A Many-sorted Term Rewrite System (MTRS) over a signature Σ is a set of rules. Each rule is a pair of terms from T, namely a left-hand side and a right-hand side. Given an arbitrary term t and an MTRS R, rewriting means to replace occurrences in t of an instance of the left-hand side of a rule in R by the corresponding instance of its right-hand side, and then repeating the process on the result.
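These definitions can be illustrated with a small sketch. The C++ names below (Sort, FuncSym, Term, wellSorted) are illustrative and not taken from the article's implementation; variables are omitted for brevity, so every term here is built from function symbols only.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative sorts: here just Nat and Bool.
enum class Sort { Nat, Bool };

// A function symbol f with arity(f) = argSorts.size(),
// ar(f, i) = argSorts[i-1], and st(f) = resultSort.
struct FuncSym {
    std::string name;
    std::vector<Sort> argSorts;
    Sort resultSort;
    size_t arity() const { return argSorts.size(); }
};

// A (ground) term: a head symbol applied to direct subterms sub_i(t).
struct Term {
    const FuncSym* head;
    std::vector<Term> args;
    Sort sort() const { return head->resultSort; }
    // Well-sortedness: the sort of each direct subterm matches ar(f, i).
    bool wellSorted() const {
        if (args.size() != head->arity()) return false;
        for (size_t i = 0; i < args.size(); ++i)
            if (args[i].sort() != head->argSorts[i] || !args[i].wellSorted())
                return false;
        return true;
    }
};
```

For example, with Zero : Nat and S : Nat → Nat, the term S(Zero) is well-sorted and has sort Nat.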

Definition 1 (Many-sorted term rewrite system). An MTRS R over a signature Σ is a set of pairs of terms, i.e., R ⊆ T × T. Each pair (l, r) ∈ R is called a rule, and is typically denoted by l → r. Each rule (l, r) ∈ R satisfies two properties: (1) l ∉ V, and (2) Var(r) ⊆ Var(l).
Given a rule l → r, we refer to l as the left-hand-side (LHS) and to r as the right-hand-side (RHS).
In the remainder of this article, many-sorted term rewriting is simply referred to as term rewriting, i.e., we always consider term rewrite systems that are many-sorted.

Definition 2 (Substitution).
For an MTRS R over a signature Σ = (F, V, S, arity, st, ar), a substitution σ : V → T maps variables to terms, with for all v ∈ V, sort(v) = sort(σ(v)). We write tσ for a substitution σ applied to a term t ∈ T, defined as σ(t) if t ∈ V, and f(t_1σ, . . ., t_arity(f)σ) if t = f(t_1, . . ., t_arity(f)).
Substitutions allow for a match between a term t and a rule l → r. A rule l → r is said to match t iff a substitution σ exists such that lσ = t. If such a σ exists, then we say that t reduces to rσ. A match lσ of a rewrite rule l → r is also called a redex.
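Matching and applying a substitution can be sketched as follows, assuming left-linear patterns (each variable occurs at most once in the left-hand side, as assumed later in this article). The term representation and function names are illustrative, not the article's.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// A term is either a variable (isVar) or a function symbol with arguments.
struct Term {
    std::string head;
    bool isVar = false;
    std::vector<Term> args;
};

// Try to find sigma with l*sigma == t; returns false if no match exists.
// Left-linearity means each variable is bound at most once, so bindings
// never have to be compared against each other.
bool match(const Term& l, const Term& t, std::map<std::string, Term>& sigma) {
    if (l.isVar) { sigma[l.head] = t; return true; }
    if (t.isVar || l.head != t.head || l.args.size() != t.args.size())
        return false;
    for (size_t i = 0; i < l.args.size(); ++i)
        if (!match(l.args[i], t.args[i], sigma)) return false;
    return true;
}

// Apply sigma to r, yielding the reduct r*sigma.
Term subst(const Term& r, const std::map<std::string, Term>& sigma) {
    if (r.isVar) return sigma.at(r.head);
    Term out{r.head, false, {}};
    for (const Term& a : r.args) out.args.push_back(subst(a, sigma));
    return out;
}
```

For the rule Plus(Zero, X) → X and the term Plus(Zero, S(Zero)), matching binds X to S(Zero), and the reduct is S(Zero).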
A term t is in normal form, denoted by nf(t), iff its subterms are in normal form and there is no rule (l, r) ∈ R and substitution σ such that t = lσ .
A term can be a redex, but it can also contain a number of redexes.In a reduction step, one of these redexes is reduced.
For a given MTRS R over a signature Σ = (F, V, S, arity, st, ar), we define the one-step reduction relation →R. Definition 3 (One-step reduction). Given an MTRS R over a signature Σ = (F, V, S, arity, st, ar), the one-step reduction relation →R is defined on T inductively as follows: 1. For all t ∈ V, t is in normal form w.r.t. →R; 2. Assume that →R has already been defined for terms t_1, . . ., t_arity(f); then →R is defined for term f(t_1, . . ., t_arity(f)) as follows: f(t_1, . . ., t_arity(f)) →R f(t_1, . . ., t'_i, . . ., t_arity(f)) for each i with t_i →R t'_i, and f(t_1, . . ., t_arity(f)) →R rσ for each rule l → r ∈ R and substitution σ with lσ = f(t_1, . . ., t_arity(f)). Given an MTRS R over a signature Σ, the computation of a term t ∈ T consists of repeatedly reducing one of its redexes, or t itself, i.e., this computation is formalised by →R.
As an example MTRS, Listing 1 presents a merge sort rewrite system with an input tree of depth 2 consisting of lists.
After the sort keyword, a list is given of all sorts and function symbols. After the keyword eqn, rewrite rules are given in the form LHS = RHS. The set of variables is given as a list after the var keyword. The input section defines the input term. It is expected that for an input term t, it holds that Var(t) = ∅, i.e., that it is a ground term. For convenience, when we refer to an MTRS in this article, we sometimes actually refer to the combination of an MTRS and an input term on which this MTRS can be applied.
In the sort section, natural numbers, Booleans, lists and trees are given by the inductively defined sorts Nat, Bool, List, and Tree, respectively. For the natural numbers, we define the constant function Zero(), and the function S on natural numbers, which defines the successor of a number. With these two functions, Peano numbers can be constructed. The function Len on lists returns the length of a list. The rewrite rules for the latter function are given at lines 11-12. The Booleans are given by the two constant functions True() and False(), and the two functions on natural numbers, Lt and Gt, evaluate like the less-than and greater-than function, respectively. These two functions are defined by means of rewrite rules at lines 35-38 and 30-33, respectively. Note that these rules indeed define the less-than and greater-than functions.
A list is either empty (Nil()), or a natural number followed by a list (Cons). The functions Merge and Merge2 merge two sorted lists into one sorted list (the rewrite rules are given at lines 14-19). The functions Even(L) and Odd(L), with L a direct subterm of type List, will evaluate to a List that contains only the elements located at even and odd positions in L, respectively. The Sort function represents the implementation of merge sort and splits the list direct subterm until it consists of at most one term, using the Sort2 function (see lines 21-23). Finally, a tree consists of leaves that each contain a list (Leaf) and nodes with two subtrees (Node).
We have used a tree data structure to support input terms consisting of multiple lists. The MTRS rewrites all the Sort terms in the leaves of a given tree of lists in parallel. Other potential for parallel rewriting is implicit and can be seen, for instance, in the Sort2 rule. The two arguments of Merge in the RHS of Sort2 can be evaluated in parallel. Note that Nil(), Zero() and S(Nat) are in normal form, but other terms may not be.
The input term is given at lines 40-41. Note that it defines a tree of depth 2: the term is a Node, with each of its subtrees being a Node as well, and in turn, each subtree of those two subtrees is a Leaf.
An MTRS is terminating iff no infinite reductions are possible. For instance, the rule f(a) → f(f(a)), with sort(a) = sort(f), leads to an infinite reduction. Determining whether a given MTRS is terminating is an undecidable problem [32].
The computation of a term in a terminating MTRS is the repeated application of rewrite rules until the term is in normal form. Such a computation is also called a derivation. Note that the result of a derivation may be produced non-deterministically. Consider, for example, the rewrite rule ρ : f(f(x)) → a and the term t = f(f(f(a))), with sort(x) = sort(f). Applying ρ on t may result in either the normal form a or f(a), depending on the chosen reduction. To remove non-determinism arising between nested redexes, a rewrite strategy is needed. First, we focus on the innermost strategy, which gives priority to selecting redexes that do not contain other redexes. In the example, this means that the LHS of ρ is matched on the inner f(f(a)) of t, leading to f(a). In Section 4.4, we relax this, and consider what we call relaxed innermost rewriting.
Algorithmically, (innermost) rewriting is typically performed using recursion. Such an algorithm is presented in Listing 2.
With ptr to t, we refer to a pointer to the term t. As long as the term t is not in normal form (line 2), it is first checked whether all its direct subterms are in normal form (lines 3-4). For each direct subterm not in normal form, derive is called recursively (line 5), by which the innermost rewriting strategy is achieved. The derive function is given a pointer to a term, such that any rewriting of a term t is seen by all terms that have t as a direct subterm. If the direct subterms are checked sequentially from left to right, we have leftmost innermost rewriting. A parallel rewriter may check the direct subterms in parallel, since innermost redexes do not contain other redexes. Once all direct subterms are in normal form, the procedure rewrite_hs(t) is called (line 6).
For each head symbol of the MTRS, we have a dedicated rewrite procedure.The structure of these procedures is also given in Listing 2. The variable rewritten is used to keep track of whether a rewrite step has been performed (line 9).
For each rewrite rule l → r with hs(l) = f , it is checked whether a substitution σ exists such that lσ = t, and if so, l → r is applied on t (lines 10-12).If no rule was applicable, it is concluded that t is in normal form (line 13).
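The derive/rewrite scheme of Listing 2 can be sketched sequentially for a micro-MTRS with the two Peano addition rules Plus(Zero, X) = X and Plus(S(X), Y) = S(Plus(X, Y)). All names below are illustrative; the article's GPU version is iterative rather than recursive, so this is only the CPU-style recursion being described here.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Term {
    std::string head;
    std::vector<Term> args;
    bool nf = false;          // normal-form flag
};

// Dedicated rewrite procedure for head symbol Plus (cf. rewrite_hs in
// Listing 2); returns true iff a rewrite step was performed.
bool rewritePlus(Term& t) {
    Term& a = t.args[0];
    if (a.head == "Zero") {                       // Plus(Zero, X) = X
        Term tmp = t.args[1];                     // copy before overwriting t
        t = tmp;
        return true;
    }
    if (a.head == "S") {                          // Plus(S(X), Y) = S(Plus(X, Y))
        Term inner{"Plus", {a.args[0], t.args[1]}};
        t = Term{"S", {inner}};
        return true;
    }
    return false;
}

// derive(t): first normalise the direct subterms (innermost strategy),
// then rewrite the term itself, until no rule is applicable.
void derive(Term& t) {
    while (!t.nf) {
        for (Term& s : t.args)
            if (!s.nf) derive(s);
        if (t.head == "Plus") {
            if (!rewritePlus(t)) t.nf = true;     // no rule applicable
        } else {
            t.nf = true;                          // no rules for this head symbol
        }
    }
}
```

Running derive on Plus(S(S(Zero)), S(Zero)) normalises it to S(S(S(Zero))), i.e., 2 + 1 = 3 in Peano notation.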
Non-determinism can also arise in rewriting due to multiple rules matching the same term. For instance, if we have the rewrite rules ρ1 : f(x) → a and ρ2 : f(x) → b, then applying this MTRS on a term t = f(a), with sort(x) = sort(a), may result in a or b. Note in Listing 2 that at line 10, it is not specified that the rules have to be considered in a particular order. In our work, though, we enforce an order, defined by our implementation of the rewrite functions for the various function symbols. By doing so, we rule out this form of non-determinism.
Listing 2: A derivation procedure for term t, and a rewrite procedure for head symbol f.

Besides the two properties for each rule (l, r) ∈ R stated in Definition 1, we assume that each variable v ∈ V occurs at most once in l, i.e., l is linear, meaning that l → r is left-linear [11]. When innermost (and relaxed innermost) rewriting is used, an MTRS with non-left-linear rules can be rewritten to one that only contains left-linear rules [30,31]. To do so, an equality function needs to be introduced in the MTRS, to syntactically compare terms. By incorporating equality checks in the rules, the multiple occurrences of a variable can be removed. For instance, a rule f(X, X) = a() can be simulated by rules that first match the left-linear pattern f(X, Y) and then perform an explicit equality check on X and Y, using new function symbols f' and g, where f' is essentially a copy of f. Although not essential, the left-linearity property is convenient when developing a term rewriter, as it simplifies finding substitutions. For an MTRS with non-left-linear rules, the implicit equality conditions must be taken into account when checking for substitutions. The above rewriting of an MTRS with non-left-linear rules to an MTRS with only left-linear rules means that these conditions are made explicit, which makes it straightforward how checking the applicability of those rules should be implemented.
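The effect of making the implicit equality condition explicit can be sketched as follows: instead of matching the non-left-linear pattern f(X, X) directly, one matches the left-linear pattern f(X, Y) and then tests X and Y for syntactic equality. The term representation and function names are illustrative.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Term {
    std::string head;
    std::vector<Term> args;
};

// Syntactic equality of two terms (the "equality function" the
// transformation introduces into the MTRS).
bool equal(const Term& a, const Term& b) {
    if (a.head != b.head || a.args.size() != b.args.size()) return false;
    for (size_t i = 0; i < a.args.size(); ++i)
        if (!equal(a.args[i], b.args[i])) return false;
    return true;
}

// The rule f(X, X) = a() is applicable iff the left-linear pattern
// f(X, Y) matches and the explicit condition X == Y holds.
bool fRuleApplies(const Term& t) {
    return t.head == "f" && t.args.size() == 2 && equal(t.args[0], t.args[1]);
}
```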

GPU basics
In this article, we focus on NVIDIA GPU architectures and CUDA. However, our algorithms can be straightforwardly applied to any GPU architecture with a high degree of hardware multithreading and the SIMT (Single Instruction Multiple Threads) model.
CUDA is NVIDIA's interface for programming GPUs. It extends the C++ programming language. CUDA includes special declarations to explicitly place variables in either the main or the GPU memory, predefined keywords to refer to the IDs of individual threads and blocks of threads, synchronisation statements, a run-time API for memory management, and statements to define and launch GPU functions, known as kernels. In this section, we give a brief overview of CUDA. More details can be found in, for instance, [33].
GPU architecture A GPU contains a set of streaming multiprocessors (SMs), each containing a set of streaming processors (SPs). For our experiments, we used the NVIDIA Turing Titan RTX. It has 72 SMs with 64 SPs each, i.e., in total 4,608 SPs.
A CUDA program consists of a host program running on the CPU and a collection of CUDA kernels.Kernels describe the parallel parts of the program and are launched from the host to be executed many times in parallel by different threads on the GPU.
It is required to specify the number of threads on a kernel launch, and all threads execute the same kernel. Conceptually, each thread is executed by an SP. In general, GPU threads are grouped in blocks of a predefined size, usually a power of two. A block of threads is assigned to a multiprocessor.
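Since the block size is fixed, the host picks the number of blocks as the number of required threads divided by the block size, rounded up. This is the standard ceiling-division idiom; the helper name below is illustrative.

```cpp
#include <cassert>
#include <cstddef>

// Number of thread blocks needed so that numBlocks * blockSize >= n,
// i.e., at least one thread per data element.
std::size_t numBlocks(std::size_t n, std::size_t blockSize) {
    return (n + blockSize - 1) / blockSize;
}
```

For example, 1000 elements with 256 threads per block require 4 blocks, while 1025 elements require 5.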

CUDA memory model
Threads have access to different kinds of memory. Each thread has a number of on-chip registers to store thread-local data, which allow fast access. All the threads have access to the global memory, which is large (on the Titan RTX it is 24 GB), but slow, since it is off-chip. The host has read and write access to the global memory, which allows this memory to be used to provide the input for, and read the output of, a kernel execution. Furthermore, we use unified memory [33] to store unified variables that need to be regularly accessed by both the CPU and the GPU. Unified memory creates a pool of managed memory that is shared between the host and the device. This pool is accessible to both sides using the same addresses.
CUDA execution model Threads are executed using the SIMT model. This means that each thread is executed independently with its own local state (stored in its registers), but execution is organised in groups of 32 threads, called warps. The threads in a warp execute instructions in lock-step, i.e., they share a program counter. If the (global and unified) memory accesses of threads in a warp can be grouped together physically, i.e., if the accesses are coalesced, then the data can be obtained using a single fetch, which greatly improves the bandwidth compared to fetching physically separate data.

A GPU algorithm for term rewriting
In this section, we address how a GPU can perform innermost term rewriting to get the terms of a given MTRS in normal form. Due to the different strengths and weaknesses of GPUs compared to CPUs, this poses two main challenges: 1. On a GPU, many threads (in the order of thousands) should be able to contribute to the computation; 2. GPUs are not very suitable for recursive algorithms. It is strongly advised to avoid recursion, because each thread maintains its own stack, requiring a large amount of stack space that needs to be allocated in slow global memory.
We decided to develop a topology-driven algorithm [13], as opposed to a data-driven one. Unlike for CPUs, topology-driven algorithms are often developed for GPUs, in particular for irregular programs with complex data structures such as trees and graphs. In a topology-driven GPU algorithm, each GPU thread is assigned a particular data element, such as a graph node, and all threads repeatedly apply the same operator to their respective element. This is done until a fix-point has been reached, i.e., no thread can transform its element anymore using the operator. In many iterations of the computation, it is expected that the majority of threads will not be able to apply the operator, but on a GPU this is counterbalanced by the fact that many threads are running, making it relatively fast to check all elements in each iteration. In contrast, in a data-driven algorithm, typically used for CPUs, the elements that need processing are repeatedly collected in a queue before the operator is applied to them. Although this avoids checking all elements repeatedly, on a GPU, having thousands of threads together maintaining such a queue is typically a major source of memory contention.
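The topology-driven pattern can be sketched with a sequential stand-in: every "thread" (here a loop iteration) owns one element and applies the same operator to it, and the whole pass is repeated until no element changes. The operator below (decrement until zero) is purely illustrative.

```cpp
#include <cassert>
#include <vector>

// Illustrative operator; returns true iff the element changed.
bool applyOperator(int& x) {
    if (x > 0) { --x; return true; }
    return false;
}

// Topology-driven fix-point computation: one pass corresponds to one
// kernel launch in which each GPU thread handles one element.
void fixpoint(std::vector<int>& elems) {
    bool done = false;
    while (!done) {
        done = true;
        for (int& e : elems)           // on the GPU: one thread per element
            if (applyOperator(e)) done = false;
    }
}
```

Note that in most passes some elements are already at their fix-point and the operator does nothing for them, mirroring the threads that idle on non-reducible terms.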
In our algorithm, each thread is assigned a term, or more specifically a location where a term may be stored.As derivations are applied according to an MTRS, new terms may be created and some terms may be deleted.The algorithm needs to account for the number of terms dynamically changing between iterations.
In the upcoming sections, we explain the data structures and arrays that are required to implement MTRSs on GPUs. Then, we explain in detail the main procedure of parallel term rewriting.

Data structures and arrays
First, we discuss how MTRSs are represented on a GPU. Typically, GPU data structures, such as matrices and graphs, are array-based, and we also store an MTRS in a collection of arrays. Each term is associated with a unique index i, and its attributes can be retrieved by accessing the i-th element of one of the arrays. This encourages coalesced memory access for improved bandwidth: when all threads need to retrieve the normal form status of their term, for instance, they will access consecutive elements of the array that stores the normal form flags. We introduce the following GPU data structures that reside in global memory:
• An array term of objects of type Term, storing the following for each term term[i]: term[i].index points to the current location i. This information is used when terms are moved by the Stream Compaction Garbage Collector, see Section 5.
• Integer arrays refcounts and refcountsRead are used to write and read the number of references to each term, respectively. For a term with index i, its counters are stored at refcounts[i] and refcountsRead[i]. When a term is not referenced, it can be deleted. As with the normal form flags, we have two fields for reference counting per term, to avoid read/write conflicts.
• The constant maxarity refers to the highest arity among the function symbols in F .
• The integer variable n provides the current number of terms.It is stored in unified memory.
• The Boolean flag done indicates whether more rewriting iterations are needed.It is also stored in unified memory.
To avoid running out of memory, some form of garbage collection is necessary, to be able to reuse memory occupied by terms that are no longer needed in the term rewriting process. How the garbage collectors work is explained in Section 5. We have the following data structures for garbage collection:
• The Boolean flag garbageCollecting indicates whether garbage collecting is needed. It is stored in unified memory.
• The integer array keyIndices is used to store indices. Its function depends on which garbage collector is used (see Section 5).
• The integer variable nextFree points to the next free index, greater than the largest index at which term information is stored in the term arrays. There, a new term can be inserted. It is stored in unified memory.
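A CPU-side sketch of this structure-of-arrays layout is given below. The names term, refcounts, refcountsRead and nextFree follow the text; the field layout of Term (a head symbol id and two argument indices, i.e., maxarity = 2) and the create helper are illustrative assumptions, not the article's exact code. Storing each attribute in its own array means that consecutive threads reading element i, i+1, . . . of one array obtain coalesced accesses.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Term {
    uint32_t index;    // current location of the term (used when terms move)
    uint32_t head;     // head symbol id
    uint32_t arg[2];   // indices of the direct subterms (maxarity = 2 here)
};

struct TermStore {
    std::vector<Term> term;
    std::vector<int> refcounts;      // written during a rewrite iteration
    std::vector<int> refcountsRead;  // read during the same iteration
    uint32_t nextFree = 0;           // next free index for a new term

    // Insert a new term at the next free position and return its index.
    uint32_t create(uint32_t head, uint32_t a0, uint32_t a1) {
        uint32_t i = nextFree++;
        term.push_back({i, head, {a0, a1}});
        refcounts.push_back(0);
        refcountsRead.push_back(0);
        return i;
    }
};
```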

The main procedure for term rewriting
Listing 3 presents the main procedure of the algorithm, which is executed by the CPU. In it, two GPU kernels are repeatedly called (line 8 and either line 12 or 14, depending on which garbage collector is used) until a fix-point has been reached, indicated by done. Copying data between arrays or data structure fields is indicated by ←.
Initially, at line 2, nextFree is set to the current number of terms n. The nextFree variable is used to find a new index each time a term is created. While rewriting is not finished (line 17), the done flag is set to true (line 4), after which the number of thread blocks is determined. As the number of threads should be equal to the current number of terms, n is divided by the preset number of threads per block (blockSize), rounded up. After that, the nf fields are copied to the fields nfRead, and refcounts is copied to refcountsRead. Reading and writing of the reference counters and normal form states are performed on separate data fields, to avoid the situation that newly created terms are already being rewritten before they have been completely stored in memory. The derive kernel is launched for the selected number of blocks (line 8). This kernel, shown in Listing 4, is discussed later. In that kernel, the GPU threads perform one rewrite iteration and set done to false if and only if another iteration is needed. At line 9 in Listing 3, n is updated in case the number of terms has increased. Finally, if garbage collection is needed (line 10), one of the collectors is called at line 12 or 14, depending on the user-defined flag streamCompacting.
Once a term t with index tid has been rewritten, the flag flags[tid].nf is set to true iff {(l, r) ∈ R | hs(l) = hs(t)} = ∅, i.e., there are no rules in R with the head symbol of their LHS being equal to the head symbol of t, which implies that t cannot be rewritten anymore. In all other cases, rewriting may still be possible, and we cannot conclude that t is in normal form. In those cases, done is set to false, to ensure that another derive iteration will be performed (see Listing 3).
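This emptiness test amounts to scanning the rule set for a matching LHS head symbol. A minimal sketch, with a rule table of our own modelled on the Plus example (both r1 and r2 have LHS head symbol Plus):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the normal form test described above: a
 * term whose head symbol matches no rule's LHS head symbol can
 * never be rewritten again. The rule table is ours. */
enum { ZERO = 1, S = 2, PLUS = 3 };

static const int lhsHeadSymbols[] = { PLUS, PLUS };  /* hs(l) per rule */
#define NUM_RULES 2

/* true iff { (l, r) in R | hs(l) = hs } is empty */
bool definitelyNormalForm(int hs) {
    for (int i = 0; i < NUM_RULES; i++)
        if (lhsHeadSymbols[i] == hs) return false;
    return true;
}
```

In generated code this check would be resolved at code-generation time rather than by a runtime scan, since the rule set is fixed.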
In Listing 5, the head symbol of the term being rewritten is updated at line 6, and the index of the direct subterm referred to by X is retrieved at line 7. A fresh index is obtained at line 8 to create the new term Plus(X, Y). This index is set as argument 1 of the term being rewritten (line 9) and argument 2 is set to 0 (line 10). The new term is constructed at lines 11-15. The reference counters are updated at lines 16-18. The new term is referenced once, X gets one more reference due to the new term, and Y also gets one reference, but loses one because the rewritten term no longer directly references Y. Finally, S(X) loses a reference, as the rewritten term no longer refers to it. The done flag is set to false, as the resulting term may not be in normal form.
Alternatively, if the head symbol is Zero (line 22), the rule r2 is applicable, and the rewriting procedure should ensure that the term at position tid is replaced by X. In this case, when rewriting the term itself, we have to copy the attributes of X to the location tid of the various arrays, to ensure that all terms referencing term tid are correctly updated. This copying of terms is done by first copying the head symbol (lines 24-25), and then the indices of the direct subterms, which is done at line 26 by the function copyTermArgs; it copies the number of direct subterms relevant for a term with the given head symbol, and increments the reference counters of those direct subterms. Next, the reference counters of Zero and X are atomically decremented, since the term Plus(Zero, X) is removed (lines 27-28), and we know that the resulting term is in normal form, since X is in normal form (line 29).
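The rewrite steps described above, for r1: Plus(S(X), Y) → S(Plus(X, Y)) and r2: Plus(Zero, X) → X, might look as follows in a flat-array sketch. This is our reconstruction, not the paper's generated code: the fresh-index claim and reference count updates are shown as plain operations where the real kernel uses atomics, and the array layout is illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of the rewrite function for Plus terms. */
enum { ZERO = 1, S = 2, PLUS = 3 };
#define MAX_TERMS 64
#define MAX_ARITY 2

static int hsymbol[MAX_TERMS];
static int arg[MAX_TERMS][MAX_ARITY + 1];  /* arg[t][1..MAX_ARITY]; 0 = unused */
static int refcount[MAX_TERMS];
static bool nf[MAX_TERMS];
static int nextFree;
static bool done;

static int arity(int hs) { return hs == ZERO ? 0 : hs == S ? 1 : 2; }

static void copyTermArgs(int dst, int src) {
    for (int j = 1; j <= arity(hsymbol[src]); j++) {
        arg[dst][j] = arg[src][j];
        refcount[arg[src][j]]++;       /* dst now references them too */
    }
}

void rewritePlus(int tid) {
    int s0 = arg[tid][1];              /* first direct subterm */
    if (hsymbol[s0] == S) {            /* r1: Plus(S(X), Y) -> S(Plus(X, Y)) */
        int x = arg[s0][1], y = arg[tid][2];
        hsymbol[tid] = S;              /* tid becomes S(...) */
        int fresh = nextFree++;        /* atomic getNewIndex in real code */
        arg[tid][1] = fresh;
        arg[tid][2] = 0;
        hsymbol[fresh] = PLUS;         /* construct the new Plus(X, Y) */
        arg[fresh][1] = x;
        arg[fresh][2] = y;
        nf[fresh] = false;
        refcount[fresh] = 1;           /* referenced once, by tid */
        refcount[x]++;                 /* one extra reference, from fresh */
        /* y: +1 from fresh, -1 from tid -> net 0 */
        refcount[s0]--;                /* S(X) loses tid's reference */
        done = false;                  /* result may not be in normal form */
    } else if (hsymbol[s0] == ZERO) {  /* r2: Plus(Zero, X) -> X */
        int x = arg[tid][2];
        hsymbol[tid] = hsymbol[x];     /* copy X's attributes into slot tid */
        copyTermArgs(tid, x);
        refcount[s0]--;                /* Plus(Zero, X) is removed, so */
        refcount[x]--;                 /* Zero and X each lose a reference */
        nf[tid] = true;                /* X was already in normal form */
    } else {
        nf[tid] = true;                /* no rule applies */
    }
}
```

For example, rewriting Plus(S(Zero), Zero) first yields S(Plus(Zero, Zero)) via r1, after which the inner Plus is reduced to Zero via r2.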
Finally, if no substitutions can be found for any of the rules w.r.t. a term t with index tid, it can be concluded that t is in normal form, i.e., flags[tid].nf can be set to true. In the example, this is done at line 32.
In Section 5, we explain how new indices are retrieved whenever a new term needs to be created, as is the case in the example of Listing 5 at line 8.

Relaxed innermost rewriting
With innermost rewriting, it is ensured that whenever a thread inspects the direct subterms of its own term, no other threads are simultaneously rewriting those direct subterms, since those terms are in normal form. For parallel term rewriting, this offers the nice property that no data races occur when accessing term information. However, it also has a drawback, as it sequentialises the rewriting of terms and their direct subterms.
Outermost rewriting does not seem to be a good alternative, as it requires more complicated thread synchronisation in our setting, in which each (sub)term is assigned to a separate thread. For instance, consider the rules a → b, b → c, f(c) → d and g(d) → e, and the input term g(f(a)). In order for the thread t1 assigned to a to identify that it can rewrite its term, it must somehow discover that both the thread t2 assigned to f(a) and the thread t3 assigned to g(f(a)) cannot yet rewrite their terms. Of course, t2 and t3 can check this, store that information somewhere, and update it in each iteration of the rewriting procedure, similar to how we store the normal form status of all terms. However, this either requires each thread to inspect the status of every 'parent' term, which can involve many checks, or the propagation of this information from terms to their direct subterms, which must be done in sequence.
To encourage more rewriting in parallel while still ensuring data race freedom in the elegant way of innermost rewriting, we implemented a relaxed form of innermost rewriting. In this rewriting strategy, the normal form status of direct subterms is checked at the individual rule level, and only insofar as needed for each particular rule. In general, direct subterms referenced only by a variable do not need to be checked w.r.t. their normal form status. For instance, for the rule Plus(S(X), Y) = S(Plus(X, Y)), the normal form status of the term referenced by Y is not relevant. When applying the rule, this term is not inspected; only its reference is used to construct the new term. On the other hand, the first direct subterm of Plus(S(X), Y) must be inspected, to determine that it has head symbol S, so it must first be checked whether this term is in normal form. Rules whose RHS consists of a single variable, such as Plus(Zero, X) = X, form an important exception: since applying such a rule involves copying the information of the term referenced by X, it must be checked whether that term is in normal form, even though at first sight it may seem that only the reference X is used.
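These per-rule checks can be sketched as follows. This is a hypothetical reconstruction: start_rewriting and encountered_nnf are the flag names used in the text, while the remaining names and the array layout are ours. Each rule inspects only the normal form flags of the subterms it actually needs, so the two Plus rules check different sets of subterms.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of relaxed innermost checking for the Plus
 * rules r1: Plus(S(X), Y) -> S(Plus(X, Y)) and r2: Plus(Zero, X) -> X. */
enum { ZERO = 1, S = 2, PLUS = 3 };
#define MAX_TERMS 64

static int hsymbol[MAX_TERMS];
static int arg1[MAX_TERMS], arg2[MAX_TERMS];
static bool nfRead[MAX_TERMS];

/* returns true iff some rule fired; *needs_retry is set when a
 * subterm that must be inspected is not yet in normal form */
bool rewritePlusRelaxed(int tid, bool *needs_retry) {
    bool encountered_nnf = false;
    int subterm_index = arg1[tid];

    /* r1: only the first subterm is inspected (to see the S), so
     * only its normal form status matters; Y is used by reference */
    bool start_rewriting = true;
    if (!nfRead[subterm_index]) {
        start_rewriting = false;
        encountered_nnf = true;
    }
    if (start_rewriting && hsymbol[subterm_index] == S) {
        /* ... perform the r1 rewrite steps here ... */
        return true;
    }

    /* r2: both subterms matter, because applying the rule copies
     * the attributes of X into slot tid */
    start_rewriting = true;            /* reset before the next rule */
    if (!nfRead[subterm_index] || !nfRead[arg2[tid]]) {
        start_rewriting = false;
        encountered_nnf = true;
    }
    if (start_rewriting && hsymbol[subterm_index] == ZERO) {
        /* ... perform the r2 rewrite steps here ... */
        return true;
    }

    /* no rule fired; tid is in normal form only if no not-yet-normal
     * subterm was encountered along the way */
    *needs_retry = encountered_nnf;
    return false;
}
```

When needs_retry is set, the caller would set done to false so that another derive iteration revisits the term.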
In general, given a rule l → r and a term t for which there exists a substitution σ such that lσ = t, when relaxed innermost rewriting is performed, the normal form status must be checked of those direct subterms of t that are matched against a non-variable subterm of l, and, if r consists of a single variable, of the direct subterm bound to that variable. In relaxed innermost rewriting mode, a derive kernel is used similar to the one in Listing 4, but without lines 6-18, which implement the normal form checking. Instead, normal form checking is now done in the rewrite functions. Listing 6 presents how this is done for the Plus terms and the rules presented in Listing 5. For the rule r1, only the normal form status of the first direct subterm is relevant. After the value of subterm_index has been retrieved (line 3), it is checked whether the first direct subterm still needs to be inspected (line 6), and if so, the check is done. If it is not in normal form, rewriting cannot commence (line 7). In this case, start_rewriting is set to false, and the variable encountered_nnf is set to true. The latter variable is used to keep track of whether at any point during execution of the function, a direct subterm is encountered that is not in normal form. For each individual rule, this is done using the variable start_rewriting (note that this variable is reset before considering the next rule at line 19), but with encountered_nnf, we do this for all the rules checked by the function together. If at the end of the function, no direct subterm has been encountered that is not in normal form and no rule has been applied, the term can be flagged as being in normal form; otherwise, another derive iteration is needed.

hence the size of the input term for this MTRS grows exponentially as its depth is increased linearly. With Binary search tree N, we refer to this MTRS being applied on a binary tree of depth N. Fig. 1 shows an application of the BST MTRS applied on a tree structure of 22 levels deep (Binary search tree 22). The width of each box (Plot 1a) represents the time of a GPU rewrite step (the derive statement at line 8 in Listing 3), whereas its height represents how many terms are rewritten in parallel in this rewrite step. Plots 1a and 1b reveal high degrees of parallelism and throughput, respectively, for the majority of the execution time. The latter reaches a peak of around 4 billion terms rewritten per second. Plots 1c and 1d highlight to what extent we use the capabilities of the GPU. Usually, the performance of a GPU is measured in GFLOPS (floating point operations per second) for compute-intensive applications, or in GiB/s for data-intensive applications. Since term rewriting is a symbolic manipulation that does not involve any arithmetic, it is data-intensive. From Table 1 we have seen that the maximum bandwidth our GPU can achieve is 555 GiB/s for aligned accesses and 22.8 GiB/s for random accesses. Since term rewriting is an irregular problem with a high degree of random accesses (to direct subterms that can be anywhere in memory), we focus on the bandwidth for random accesses. As shown by Plot 1c for Binary search tree 22, the overall random access bandwidth of the GPU implementation reaches 15 GiB/s, which is close to the benchmarked bandwidth. Regarding coalesced accesses, a bandwidth of 50 GiB/s is measured, see Plot 1d. This confirms that term rewriting is indeed an irregular problem and that random accesses form the main bottleneck.
Table 2 shows the GPU performance for the three previously described MTRSs, where we select one input term for each MTRS. The GPU achieves a considerable speedup compared to the sequential CPU performance for all three of them. For instance, for the generation and exploration of a binary tree of 23 levels deep (Tree exploration 23), 7 billion terms are rewritten per second.
Next, we experiment with several configurations of the GPU rewriter, involving the streamCompactCollector and relaxed innermost rewriting. We compare the following configurations:
1. The sequential (CPU) recursive leftmost innermost rewriter.
2. The GPU rewriter with innermost rewriting without garbage collection.
We also considered involving the GPU rewriter with relaxed innermost rewriting and Stream Compaction garbage collecting in our results, but that configuration does not lead to new insights. We address this when we discuss the following results.
To investigate the scalability of these configurations, we select three different input terms for each of the three MTRSs described earlier. Table 3 assembles all our results for the previous configurations. The "Rewritten" column refers to the number of terms that have been rewritten before the term rewriting terminates. For each case, the order in which terms are rewritten differs between configurations, but the total number of rewrites performed is the same for all configurations. The "Rewrites" columns refer to the number of rewriting iterations performed, i.e., the number of iterations through the loop in Listing 3. This is only relevant for the GPU configurations, as the CPU rewriter works recursively.
Table 3 shows that the GPU rewriter with relaxed innermost rewriting outperforms the CPU by more than a factor of seven for a Transformation tree of 22 levels deep, with four million leaves. Frequently, the relaxed innermost rewriting strategy positively influences the GPU rewriter, see column 9. Regarding garbage collection, we observe that the streamCompactCollector (SCC) starts to pay off as the amount of garbage collection work increases, as demonstrated by the Binary search tree cases (compare columns 7 and 11). Unfortunately, we cannot scale this system up beyond 22 levels, as this requires more than the 24 GB of GPU memory available on our device. We expect SCC to outperform the queueCollector (QC) when there are sufficient hardware resources to process larger Binary search tree cases.
On the other hand, the QC seems to be more effective for the other benchmarks, and for several cases, applying QC even results in better coalesced memory accesses compared to the GPU rewriter without garbage collection (see columns 7 and 5, respectively).Hence, not only does garbage collection help to reuse memory and thereby reduce the memory requirements, it also improves the performance.For example, when QC is enabled for the Binary search tree 20 benchmark, the running time decreases by 34%, even though the garbage collection adds extra overhead to the rewrite procedure.
We have not included results for the GPU rewriter with relaxed innermost rewriting and Stream Compaction garbage collecting.That combination provides no new insights: as expected, its runtime results are always between innermost rewriting with Stream Compaction garbage collecting, and relaxed innermost rewriting with Queue garbage collecting.
If we consider the potential for GPU term rewriting in the future, we have to conclude that to achieve even higher performance, it is necessary to introduce more regularity into the implementation, reducing the number of random memory accesses. Other frequently used strategies to improve graph algorithms, such as reducing thread divergence, will probably not yield a significant performance increase, since random accesses are the main performance bottleneck. In addition, the results we present clearly show the different capabilities of GPUs and CPUs. An interesting direction for future work is to create a hybrid rewrite implementation that can switch to a GPU implementation whenever a high degree of parallelism is available.
- term[i].hsymbol stores the head symbol.
- The fields term[i].arg1, ..., term[i].argmaxarity store the indices of the direct subterms of each term. The value 0 is never used as an index. If term[i].argj = 0 for some 1 ≤ j ≤ maxarity, then term[i] has arity j − 1, and all fields term[i].argj, ..., term[i].argmaxarity should be ignored. The number of arg fields needed for an MTRS depends on the maximum arity occurring in the MTRS. A code generator can produce the appropriate number of fields, given an MTRS.
- term[i].subterm_index has a value between 1 and maxarity, and is used to optimise the checking of the normal form status of direct subterms (see Section 4.2).
• An array flags of objects of type Flags. For each term t with index i, a Flags object is stored at flags[i]. Each Flags object consists of four bit fields. The fields flags[i].nfRead and flags[i].nf are needed to read the current normal form status of t and to update its status, respectively. We have two normal form flags instead of one, to avoid read/write conflicts. The other two bits are used in one of our approaches to garbage collection (see Section 5).
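The term and flag layout described above might be declared as follows in C. The bit-field layout, the concrete integer widths and maxarity = 2 are illustrative assumptions; a code generator would emit the right number of arg fields for a given MTRS.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch of the term storage described above, for an
 * MTRS whose maximum arity is 2. Layout details are our assumption. */
#define MAXARITY 2

typedef struct {
    uint32_t hsymbol;        /* head symbol of the term */
    uint32_t arg1, arg2;     /* indices of direct subterms; 0 is never a
                                valid index, so arg1 == 0 means arity 0 */
    uint8_t  subterm_index;  /* 1..MAXARITY, optimises nf checks (Sec. 4.2) */
} Term;

typedef struct {
    uint8_t nfRead : 1;      /* normal form status as read this iteration */
    uint8_t nf     : 1;      /* normal form status being written */
    uint8_t gc1    : 1;      /* two bits reserved for one of the */
    uint8_t gc2    : 1;      /* garbage collectors (Section 5) */
} Flags;

/* the arity of a term follows from its first zero arg field */
int arity(const Term *t) {
    if (t->arg1 == 0) return 0;
    if (t->arg2 == 0) return 1;
    return MAXARITY;
}
```

Keeping nfRead and nf as separate bits mirrors the double-buffering of normal form status discussed for the main loop.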

Table 2
Performance of three instances of different rewrite systems.

Table 3
Performance comparison of different rewriters on various MTRS benchmarks. The QC and SCC acronyms stand for Queue Collector and Stream Compaction Collector, respectively. The Rewritten and Rewrites keywords stand for the number of terms rewritten and the number of rewriting iterations, respectively. The time measurements, in milliseconds, are averaged over 10 runs. The bold measurements identify the fastest running times.