Verifying parallel dataflow transformations with model checking and its application to FPGAs

Dataflow languages are widely used for programming real-time embedded systems. They offer high level abstraction above hardware, and are amenable to program analysis and optimisation. This paper addresses the challenge of verifying parallel program transformations in the context of dynamic dataflow models, where the scheduling behaviour and the amount of data each actor computes may depend on values only known at runtime. We present a Linear Temporal Logic (LTL) model checking approach to verify a dataflow program transformation, using three LTL properties to identify cyclostatic actors in dynamic dataflow programs. The workflow abstracts dataflow actor code to Fiacre specifications to search for counterexamples of the LTL properties using the Tina model checker. We also present a new refactoring tool for the Orcc dataflow programming environment, which applies the parallelising transformation to cyclostatic actors. Parallel refactoring using verified transformations speedily improves FPGA performance, e.g. 15.4× speedup with 16 actors.


Introduction
In the dataflow model of execution, the firing of actors depends only on data availability, allowing each actor in a program to execute asynchronously without a global control flow sequentialising their execution.
Dataflow languages are a natural abstraction for FPGAs, since a program's dataflow structure is distributed across the programmable hardware fabric. In other words, each actor in a dataflow program is compiled to an independent processing hardware block, subject to FPGA resource availability. Moreover, large kernel variables like intermediate arrays may use on-chip block RAM (BRAM). BRAMs are distributed across the FPGA fabric, meaning there is no memory contention when multiple actors update their kernel variables. That is, both computation and memory access are inherently parallel. For FPGAs, dataflow programming environments represent a high level abstraction above Hardware Description Languages (HDLs), i.e. Verilog or VHDL. The dataflow programming model can be exploited for high FPGA performance, e.g. [1], or as an Intermediate Representation in compilers for higher level FPGA languages, e.g. [2]. The dataflow abstraction enables program analysis, e.g. optimal static scheduling [3], and program transformation (this paper).

Related work
This work is influenced by the Box Calculus [7], a program rewrite system for the Hume embedded systems language [8]. The approach specifies Hume programs with Lamport's temporal logic of actions (TLA) [9] and uses deductive verification in TLA to verify Box Calculus transformations [10]. This is exploited for parallelism by transforming single Hume boxes into multiple boxes using high level patterns like divide-and-conquer. The language for implementing Hume boxes (analogous to actors) is purely functional and stateless, mapping input patterns directly to output expressions, and any state is carried via feedback wires. This contrasts with our work, which aims to support stateful dataflow languages like CAL, where actors can have internal variables that persist values between firings. This complicates parallelisation, due to the possibility of changing the read/write order on mutable variables, or introducing race conditions. Moreover, Hume lacks tooling for programmers to apply program transformations. In contrast, our work uses model checking to verify transformations of stateful actors, and we have extended the Eclipse based Orcc IDE with a graphical tool to apply a parallel transformation when model checking enables it.
The most salient related work in the context of parallelising dataflow programs is the StreamIt compiler. That approach is restricted to parallelising stateless actors only [11], whereas our use of model checking enables the parallelisation of stateful actors too. Related work that combines dataflow models with model checking includes determining minimum dataflow buffer sizes [12], and enabling compile-time scheduling of multirate static actors [13]. On the hardware side, related work uses model checking to verify that two Verilog/VHDL modules satisfy the same global requirements [14], hence enabling one to replace another with dynamic partial reconfiguration. None of these approaches consider the parallelisation of hardware designs or the verification of dataflow graph transformations.

Our approach and contributions
Approach. Our approach (Fig. 1) extends dataflow parallelisation beyond stateless actors, to actors that are stateful and contain firing rules with different data rates, provided they have cyclo-static properties. This is important because in practice cyclo-static actors very often hold state between action firings (Section 7.2). This is achieved by using model checking to identify parallelisable multi-rate static (MRDF) and cyclostatic (CSDF) actors that coexist with dynamic (DDF) actors in dataflow programs. These dataflow models are described in Section 2.2. The approach is general to any dynamic dataflow language; this paper demonstrates it with the CAL dataflow language [15] in the Orcc development environment [16]. It marks cyclostatic actors as parallelisable, to enable an interactive graphical program transformation tool.
Contributions. This paper makes the following contributions:
• An abstraction of dataflow actors to Fiacre, a formalised language for representing behavioural and timing aspects of embedded and distributed systems (Section 3).
• Three Linear Temporal Logic (LTL) properties that model cyclostatic dataflow properties. Model checking Fiacre abstractions of actors identifies cyclostatic actors in dynamic dataflow programs (Section 4).
• A graphical interactive refactoring tool that automates the parallelisation of potentially stateful cyclostatic actors (Section 5).
• An evaluation showing counterexamples of the cyclostatic LTL properties, and the efficacy of parallel transformations of a cyclostatic actor on an FPGA with two case studies (Section 6).

Background
Task parallel languages provide either a high level programming model of spawning tasks and defining synchronisation points, or a lower level model with a fixed set of actors, explicit point-to-point FIFO connections and FIFO depths. Software oriented task parallel models are often higher level than parallel languages for hardware. Software languages often support spawning new tasks/actors at runtime across threads on a multicore CPU. This is unsuitable when compiling programs to programmable hardware (i.e. application specific circuits), where the fixed task graph is mapped into the hardware with place and route, and cannot be changed without re-synthesis.
There are numerous programming models to express task parallel programs, e.g. fork/join APIs and threads, where data flows are implicit between parallel tasks. Dataflow programming models offer a more explicit way to express independent parallel tasks, and which tasks communicate. They clearly separate computation and communication, and are popular for embedded systems programming, especially FPGAs, where the entire task graph and communication routing must be fixed in hardware prior to execution.

Properties of dataflow languages
Dataflow programs are directed graphs of connected actors. There are many dataflow languages for multicore processors and embedded systems, each exhibiting a trade-off between expressivity and reasoning power. The hardware architecture they are designed to target reflects their functionality, specifically the graph structure, scheduling and data rates of programs expressed with them, shown in Fig. 2.
Graph structure. Some languages support dynamic task graphs, i.e. where the number of actors changes during runtime. An example is Cilk [17], a C++ fork/join model that creates parallel threads with spawn and synchronises their results with sync. This is similar to OpenMP's task thread creation and taskwait synchronisation pragmas [18]. With other languages, the number of tasks (or actors) and the task graph topology are static at compile time, and do not change at runtime.
Scheduling. This is the selection of executable code within an actor. Some dataflow languages support only fixed scheduling policies, e.g. PREESM [19], whilst other models support code execution choice driven by data availability and data values using pattern matching, e.g. Hume [20] and CAPH [21]. Some languages support both, e.g. Cx [22] has non-blocking reads/writes on default ports (static scheduling) and synchronised ports (dynamic scheduling), where data availability determines if an executable rule is enabled.
Data rates. An actor with a static data rate produces and consumes the same number of values every time it is executed. An actor with a dynamic data rate is free to consume/produce data sequences of arbitrary length with no periodic pattern.

Dataflow models
The models considered in our approach are:
1. Multi-Rate Dataflow (MRDF) [23], where actors consume and produce a fixed number of tokens every firing.
2. Cyclo-Static Dataflow (CSDF) [24], where actors contain multiple fireable actions. Actions may have different consumption/production rates, but the sequence of action executions must be periodic.
3. Dynamic Dataflow (DDF) [25], where, when a DDF actor is fired, its scheduler picks an enabled execution rule and executes it. The enabled status of an execution rule can depend on both the availability and the values of tokens, and hence the data rate is unknown at compile time. If there are no fireable rules, execution is deferred and the actor's scheduler tries again to find an enabled execution rule.
Dataflow actors can have multiple actions but only one action can fire at a time, in contrast to synchronous languages, e.g. Quartz [26], where all enabled actions execute in a single firing. Some dataflow programming environments only support static dataflow models, e.g. direct feedthrough function blocks in Simulink [27]. The open source Orcc development environment [16] supports dynamic dataflow programming with CAL [15].
Real world applications of dynamic dataflow properties include real-time Twitter data processing [28] and MPEG decoding, a widely used dataflow benchmark with practical relevance; [29] presents an MPEG implementation in CAL. The disadvantage of dynamic dataflow is that it inhibits program analysis, meaning a programmer has to resort to manual program refactoring to improve performance. Although this can improve FPGA performance [30], the approach is cumbersome and error-prone.

Constructing static dataflow graph structures with CAL
CAL is a dataflow language for programming real-time embedded systems, which was developed as part of the Ptolemy II project [31]. It constructs dataflow graphs with (1) a fixed set of actors and connections, (2) support for dynamic scheduling and (3) support for dynamic data rates. The latter two dynamic properties make CAL suitable for implementing relatively complex algorithms on embedded systems, e.g. algorithms for which control flow and the input/output data size are determined by runtime values, at the expense of lost static analysis and program transformation opportunities.
Each CAL actor encapsulates an algorithmic kernel. Actors are connected together with lossless order-preserving FIFOs, through which tokens flow. Fig. 3a shows a volume amplifier example. Actor ports are named ❶. The actor may have a store of private variables ❷. Actors contain discrete non-interruptible execution rules, called actions. Actions match on input patterns to consume tokens, and output patterns produce tokens (❸ and ❺). Actions may also update store variables ❹. A Finite State Machine (FSM) determines enabled actions from each FSM state ❻, and the initial FSM state, which in this example is s0. When multiple actions are fireable from a given state, a priority block can disambiguate multiple enabled actions ❼.

Implementing dynamic dataflow languages
Since action selection for dynamic actor firing is a runtime decision, implementations of dynamic dataflow models need a runtime scheduling mechanism to determine which action to execute next. Fig. 4 shows the architecture components necessary for supporting dynamic dataflow execution. Each actor consumes input streams and produces output streams via ports. Connections (edges) enable data to flow between ports. An internal store associates values with actor-local variables.
CAL includes a primitive called guard (Section 3.1), which supports the firing of an action depending on the values of incoming tokens and of variables in the actor's store. Satisfying both of the following conditions enables an action for firing:
1. there is a sufficient number of tokens to match its input patterns,
2. its guard evaluates to true (if a guard exists).
Fig. 5 shows the scheduling algorithm for action selection. If there are multiple enabled actions, and no priority statement to disambiguate action selection, the runtime scheduler chooses one at random.
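The two enabling conditions and the random tie-break can be illustrated with a minimal Python sketch. It is not the Orcc scheduler: the `Action` record, the `select_action` function and the treatment of priorities as a simple ordered list of action names are assumptions made for illustration only.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence

@dataclass
class Action:
    name: str
    consumes: Dict[str, int]  # input port name -> number of tokens required
    # Hypothetical guard signature: peeks at the FIFOs and the store, consumes nothing.
    guard: Optional[Callable[[Dict[str, List[int]], dict], bool]] = None

def is_enabled(action: Action, fifos: Dict[str, List[int]], store: dict) -> bool:
    # Condition 1: enough tokens on every port in the input pattern.
    for port, needed in action.consumes.items():
        if len(fifos.get(port, [])) < needed:
            return False
    # Condition 2: the guard (if any) evaluates to true.
    return action.guard is None or action.guard(fifos, store)

def select_action(actions: Sequence[Action], fifos: Dict[str, List[int]],
                  store: dict, priority: Sequence[str] = ()) -> Optional[Action]:
    enabled = [a for a in actions if is_enabled(a, fifos, store)]
    if not enabled:
        return None  # nothing fireable: the scheduler retries later
    # Assumed simplification: priority is a list, highest priority first.
    for name in priority:
        for a in enabled:
            if a.name == name:
                return a
    return random.choice(enabled)  # no priority applies: pick at random
```

Because the token check runs before the guard, a guard may safely peek at the head of a FIFO, which mirrors the peek semantics of CAL guards described later in the paper.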

Actor data rates
An FSM transition system drives actor execution, e.g. the FSM in Fig. 3b for the amplifier actor. Actions have data rates [C/P], meaning that firing an action consumes C tokens and produces P tokens as the actor transitions between FSM states.
Consider an actor A consuming tokens from a connection v and producing tokens on a connection u .Firing an action in this actor will consume

Program preserving parallelisation
Parallelism can speed up execution. However, parallelisation (and any other program transformation) must preserve a program's functional semantics. Two properties of dynamic dataflow programs that must be preserved by parallel program transformation are:
1. Functional data rates: An actor may require multiple input tokens to compute its function and produce an output. Such an atomic operation is often implemented as multiple action firings (Section 7.2). Parallel instances of an actor should read and write enough tokens to preserve the atomic behaviour of the original sequential actor.
2. Stores: An actor may have a store of local variables. A program transformation should preserve the modification sequence of an actor's store.
Section 4 shows that, with model checking, one can parallelise actors expressed with a dynamic dataflow language to increase their throughput performance, whilst preserving the actors' functional semantics.

Abstracting actors to models for verification
Our approach uses model checking to identify cyclostatic actors for parallelisation. Since CAL is a dataflow programming language and not a model description language, this section shows how we abstract pertinent features of actor code into Fiacre.

Abstract dataflow transition system for CAL
An abstract transition machine model for describing dataflow actors [32] is now presented to formalise the actor model. It describes the data rates and scheduling rules for actors in terms of transitions on an abstract machine. The firing of actions is determined by an actor's FSM, a labelled transition system between control states. An actor is a sequential process, communicating values by transitioning between states in the actor's FSM. Executions of statements in the body of an action can compute output token values and can update the store. The state machine receives tokens and reacts to them, possibly entering another state, and possibly producing tokens. The (fire) dataflow transition describes a transition from FSM control state σ to σ′ by firing action1, and at σ′ the enabled action is action2. The store e with the input stream s is the input state, and the firing of action1 produces a new store e′ and outputs the stream s′. For example, with an action sumStream, the store e is {accum = 0} before the 1st firing, the input stream s is [x], consumed from input port in1, and the new store e′ is {accum = 0 + x} after the firing.
The following (fire-guard) rule supports dynamic dataflow. It adds a predicate function that guards the firing of the transition. This function guard E must evaluate to true for the action to fire. The E predicate can

𝑒 ⊢ 𝐸 𝑔𝑢𝑎𝑟𝑑 𝐸
An example is a sumTen action that consumes token x, adding it each time to accum in the store with 10 successive firings. The guard E predicate checks that count is less than 10. Another action (omitted) might produce the value of accum on the 11th firing, resetting the count and accumulation variables.
If there are multiple fireable actions from σ′ then we use • to indicate a runtime scheduler choice to disambiguate two or more actions that can both fire from the same control state. Hence • appears in the transition whenever there are multiple enabled actions at σ′. An example of two (fire-guard) actions in an actor is positivesRepeat and negativesIdentity. The positivesRepeat action produces twice any positive integer consumed, whilst negativesIdentity simply propagates any negative or 0 values consumed. Here, • means that after firing either of these, the scheduler will not know which action to fire next until the value of x becomes known after consuming the token on the next firing. This becomes problematic for verifying cyclo-static data rates when two or more ambiguous actions have different production or consumption rates.
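To make the data rate problem concrete, the following Python sketch mimics the two guarded actions. It assumes that "produces twice" means the consumed token is emitted twice (rate [1/2]), whereas negativesIdentity has rate [1/1]; the function names are illustrative and are not taken from the CAL source.

```python
from typing import List, Tuple

def fire(x: int) -> Tuple[str, List[int]]:
    """Fire whichever guarded action matches token x.

    Returns the name of the action fired and the tokens it produces.
    """
    if x > 0:
        # positivesRepeat: guard x > 0, assumed to emit the token twice
        return "positivesRepeat", [x, x]
    # negativesIdentity: guard x <= 0, propagates the token unchanged
    return "negativesIdentity", [x]

def production_rates(tokens: List[int]) -> List[int]:
    """Number of tokens produced by each successive firing."""
    return [len(fire(x)[1]) for x in tokens]
```

For the input stream [3, -1, 0, 7] the production rates are 2, 1, 1, 2: the rate of each firing depends on a token value only known at runtime, so no fixed periodic rate can be established for this actor.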

Actor abstractions
The CAL primitives we abstract for model checking are:
1. The production and consumption rates of each action. We need the model checker to verify if there exists a cyclic data rate sequence.
2. Updates to variables in an actor's store. These variables can be used to conditionally fire actions, i.e. the (fire-guard) rule.
3. Statements inside action bodies. Specifically, if/then/else, variable assignment and loop statements, because they can update variables in an actor's store.
4. Actor FSMs. They enforce action scheduling and we need to know (1) if exactly one periodic cycle of visited states exists, and (2) the cycle's data rate, i.e. the sum of data rates for each action representing a state transition in the cycle.
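The cycle data rate in item 4 can be illustrated with a small Python sketch. It is a purely structural walk over an FSM and ignores guards and store updates, which is precisely what the model checking approach adds; the dictionary encoding of the FSM and the function name `cycle_data_rate` are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple

# FSM encoding (assumed): state -> list of (action name, next state)
Fsm = Dict[str, List[Tuple[str, str]]]
# Data rates (assumed): action name -> (tokens consumed C, tokens produced P)
Rates = Dict[str, Tuple[int, int]]

def cycle_data_rate(fsm: Fsm, rates: Rates, initial: str) -> Optional[Tuple[int, int]]:
    """Sum the [C/P] rates along the single cycle through `initial`.

    Returns None if some visited state has zero or several outgoing
    transitions, or if the walk loops without returning to `initial`.
    """
    state, consumed, produced = initial, 0, 0
    seen = set()
    while state not in seen:
        seen.add(state)
        outgoing = fsm.get(state, [])
        if len(outgoing) != 1:
            return None  # dead end or scheduler choice: not a single cycle
        action, state = outgoing[0]
        c, p = rates[action]
        consumed += c
        produced += p
    return (consumed, produced) if state == initial else None
```

For an FSM s0 → s1 → s0 whose two actions have rates [2/0] and [0/1], the cycle data rate is [2/1].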

Fiacre
We abstract CAL actor code to Fiacre [33] to obtain a model for checking cyclostatic properties. Fiacre is a formal intermediate model to represent both the behavioural and timing aspects of embedded and distributed systems for formal verification. It is a target language of program modelling tools for use with verification engines. From Fiacre specifications, the Fiacre compiler produces enriched Petri nets, in which transitions can test and modify a set of data variables. Fiacre therefore provides a high level, compositional syntax for Time Petri Nets. TINA [34] is a model checking toolset to analyse these nets.
Fig. 6. Automaton for Fiacre process P.

Fiacre language
The primary Fiacre primitive is a process. Fiacre processes can contain deterministic statements including assignment and loops (Section 3.4). As with CAL actors, a Fiacre process describes sequential behaviour defined by a set of control states and a set of process transitions. Processes can contain non-deterministic choice, to which we map FSM scheduling in Section 3.5. CAL actor ports are mapped to Fiacre process ports, and dataflow connections are mapped to Fiacre channels. The following Product Fiacre example¹ reads pairs of integers from port b then sends their product out from port a. The evaluation sequence in a Fiacre process is guided by a labelled transition system. The Fiacre process P below has four control states with labelled transitions between them. Here, ports denote pure synchronisation labels (without communications). The automaton for this process is in Fig. 6.
In the abstraction for model checking, actor FSMs map naturally to Fiacre process transitions. The transition labels are the actions to execute to get to new FSM control states. In our approach, action code bodies are abstracted (Section 3.4) and then inlined into Fiacre transitions between control states.

Fiacre evaluation rules
The Fiacre semantics presented here are taken from [36]. Labelled relations express the semantics of Fiacre statements.
The evaluation rules that follow have the shape:
$$\frac{P_1 \quad \dots \quad P_n}{e \vdash E \rightsquigarrow v}$$
which states that under conditions $P_1$ to $P_n$, the value of expression E with store e is v, where ⇝ is expression evaluation.
¹ From [35].
The semantics of statements is expressed operationally by a labelled relation. The big step relation holds triples $(S, e) \stackrel{l}{\Rightarrow} (S', e')$ in which S is a statement, e and e′ are stores, where Λ is the declared set of states of the process. The ⇒ transition updates the store from e to e′, moving execution to statement S′, and the label is either a communication action l or a silent action.

Abstracting dataflow actions to small step Fiacre evaluation rules
This section shows the abstraction of the pertinent CAL code primitives given in Section 3.2, i.e. procedures, data rates, and actor FSMs, to small step Fiacre rules. Each abstraction rule from actors to processes is written = …, which translates CAL to Fiacre evaluation rules.

Action procedures
Action bodies may include statements that modify the actor's store.The translation maps these statements, i.e. for loops and if/then/else statements, to deterministic Fiacre process statements.
Conditional if/then/else CAL statements are translated to the following evaluation rules: Conditional while loops are translated as: CAL loops with a fixed iteration count are translated as: For assignment, Fiacre offers a number of forms but we only need the simplest here. e[X ↦ u] is store e extended with the pair (X, u).
CAL action bodies can contain multiple procedures, i.e. a sequence of assignments, if/then/else blocks and loops. Fiacre statements can be compositions of other statements, and this is used to compile multiple CAL statements (S1; S2), e.g. an if statement following a while loop, to the following Fiacre (composition) evaluation rules. In the second rule, the static semantics of Fiacre ensures that at most one of labels l1 and l2 is not empty, so l1.l2 resumes to either l1 or l2.

Actions with guards
Actions with guards can only fire if the guard predicate function evaluates to true in the current context. Fiacre has an on statement, which blocks if its expression evaluates to false, preventing execution of subsequent statements. The mapping models the actor guard predicate guard E with on, where s is the input queue and s′ is the output queue for action1 when executed. The execution of a sequential composition of Fiacre statements abstracted from action bodies depends on the boolean value of the on condition.
With an environment context of an actor's store and input tokens s , the abstraction of action guards is:

Actor communication
Actors share the results of their computations by explicitly passing tokens to each other via connections between their ports. The parallelising transformation of actors in Section 5 needs to know an actor's cyclo-static data rate, so we capture token communication in the abstraction. That is, during execution of Fiacre statements we keep a count of the tokens coming into and going out of the process.
The syntax for token communication in Fiacre is ! and ?, corresponding to the (send) and (receive) Fiacre rules respectively. Communications are sequences $p_l\, v_1 \dots v_n$ where $p_l$ is a port identified by a label $l$ and the $v_1 \dots v_n$ are values.
Fiacre has a single-communication constraint whereby only one (send) or (receive) statement can occur in a (process-action) execution. In Fiacre, values are consumed from ports with ?, e.g. in1?x.
Whilst it is possible to consume multiple values from a Fiacre port, e.g. in1?x,y,z, this is constrained by the data type of the port, e.g. where the bytepair channel is a byte # byte tuple in the earlier example in Section 3.3.1, where exactly two values must be consumed every time, e.g. with in1?x,y. The same Fiacre constraint applies for producing values with !, e.g. out1!x,y. The single communication rule is enforced by Fiacre's type checker to ensure that only one label is ever carried by a ⇒ transition update.
In CAL, values are consumed/produced from ports by pattern matching, e.g. in1:[x,y,z] and out1:[3,x+2,0]. The communication semantics for dynamic dataflow differ from Fiacre's single-communication semantics in two ways:
1. An action can have multiple communication events, because an action can have both an input pattern and an output pattern and these patterns can match on multiple input/output ports.
2. Actions within the same actor can pattern match on the same ports but consume/produce a different number of tokens.
Therefore, to support dynamic dataflow with Fiacre's single-communication constraint, we give the Fiacre process ports a channel type. For each token in an action's input pattern, Fiacre's front function on the queue reads from this internal channel, the value is then removed with dequeue and the counter monitoring how many tokens are consumed is incremented. Likewise for an action's output patterns, a Fiacre queue receives all output tokens, and the counter monitoring how many tokens are produced is incremented by the size of that queue. Actions that have multiple communication events, i.e. multiple tokens in input/output patterns or consuming/producing from/to multiple ports, are separated into multiple intermediate single-communication events when abstracted to Fiacre.
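The queue-based encoding can be mimicked in Python to show how the counters evolve. This is a sketch under assumptions, not generated Fiacre: `collections.deque` stands in for a Fiacre queue, indexing position 0 plays the role of front, `popleft` plays the role of dequeue, and the helper names and the `counters` dictionary are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List

def consume_pattern(in_queue: Deque[int], n: int, counters: Dict[str, int]) -> List[int]:
    """Consume an n-token input pattern one token at a time."""
    values = []
    for _ in range(n):
        values.append(in_queue[0])  # front: read the head of the queue
        in_queue.popleft()          # dequeue: remove the value just read
        counters["C"] += 1          # one more token consumed
    return values

def produce_pattern(out_queue: Deque[int], tokens: Iterable[int],
                    counters: Dict[str, int]) -> None:
    """Stage all output tokens in a queue, then count them in one step."""
    staged = deque(tokens)
    counters["P"] += len(staged)    # incremented by the size of the queue
    out_queue.extend(staged)

# The CAL patterns in1:[x,y,z] and out1:[3,x+2,0] from the text.
in1, out1 = deque([1, 2, 3, 4]), deque()
counters = {"C": 0, "P": 0}
x, y, z = consume_pattern(in1, 3, counters)
produce_pattern(out1, [3, x + 2, 0], counters)
```

After this single firing both counters hold 3, giving the action a data rate of [3/3], and one token remains unconsumed on in1.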

Actor schedules to process selections
The selection of actions in dataflow actors is determined by the actor's FSM between control states. Given an actor's FSM that specifies multiple next states from the current state, the abstraction maps to Fiacre's select statement, where [] separates the non-deterministic choices. Consider an actor FSM with one control state s0. Such FSMs are abstracted to Fiacre selection statements. Consider a contrived example with actions actn1 and actn2, where both consume a token but produce no tokens. Because the actions actn1 and actn2 each have one communication event, and a guard (abstracted to an on statement after consuming x), the FSM will be abstracted to a Fiacre selection statement. The abstraction of actor FSMs uses three Fiacre rules, (from), (to) and (select), shown in Fig. 7. This requires FSM transitions to be grouped by the control states that transitions start from, for all control states Σ in the FSM. The (from) rule is used for each source state, the action used to transition between states is then abstracted and inlined into each corresponding transition, and finally the (to) rule is appended to these statements to arrive at the destination state, which can be expressed with a set comprehension. This restructures the FSM to group transitions by their source state, to map actor FSMs to the syntactic structure of Fiacre's select statement.
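The regrouping of FSM transitions by source state can be sketched in Python as follows. The triple encoding of transitions and both function names are assumptions; the text emitted by `to_select` is only Fiacre-like pseudo-syntax to show the from/select/to shape and has not been checked against the Fiacre grammar. Each action name stands in for the abstracted action body that would be inlined at that point.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Transition = Tuple[str, str, str]  # (source state, action, target state)

def group_by_source(transitions: Iterable[Transition]) -> Dict[str, List[Tuple[str, str]]]:
    """Group FSM transitions by the control state they start from."""
    grouped: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for source, action, target in transitions:
        grouped[source].append((action, target))
    return dict(grouped)

def to_select(source: str, branches: List[Tuple[str, str]]) -> str:
    """Render one source state as a from / select / to skeleton (pseudo-syntax)."""
    arms = [f"{action}; to {target}" for action, target in branches]
    return f"from {source} select " + " [] ".join(arms) + " end"

fsm = [("s0", "actn1", "s0"), ("s0", "actn2", "s0")]
grouped = group_by_source(fsm)
skeleton = to_select("s0", grouped["s0"])
```

For the single-state FSM above, both transitions end up as alternative arms of one selection statement rooted at s0.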

Abstracting action firing to big step Fiacre evaluation rules
The previous section abstracted data rates, procedures and actor FSMs to small step Fiacre rules. We now require a big step Fiacre rule to abstract the atomic firing of a dataflow actor's action.
The following big-step (process-action) rule abstracts the sequential composition of multiple small-step executions. It describes statement executions between (from) and (to) rules, i.e. originating from an initial control state and arriving at a target state, where → is a process action getting from control state s to s′ and ⇒ is a big-step statement execution. A transition from the source control state to the target control state in the (process-action) rule happens in totality or not at all. If a transition includes an on statement then tokens will only be produced/consumed if the on predicate evaluates to true. This peek semantics is equivalent to CAL's semantics for guard. This big step rule corresponds to firing a dataflow actor's action, i.e. the (fire) or (fire-guard) dataflow execution rules from Section 3.1, which capture the actor CAL code we intend to model check. One big step (process-action) evaluation corresponds to multiple small step evaluations of action selection (select), production and consumption rates (send) and (receive), and procedures inside an action body, i.e. (assignment), (if-then-else) and (foreach).
Consider an action1 whose body is a sequence of statements S1; S2, and an action2 whose body is the statement S3. Abstracting this (fire) transition to Fiacre uses the (process-action) rule. Consider the same action, but this time with a guard predicating its firing. Abstracting this (fire-guard) transition to Fiacre also uses the (process-action) rule, with the guard modelled by an on statement.

Tracking data rates
We introduce a P_static boolean and two numbers P and P_prev that count the number of produced tokens in the current and previous cycle respectively, to check for cyclostatic data rates.Variables C_static , C and C_prev are injected to track consumption rates.
To demonstrate this with an example, the CAL actor implementation on the left of Fig. 8 is a predicate filter that outputs only positive integer input tokens and discards negative integers. The Fiacre translation is on the right of Fig. 8. In the first complete cycle of a cyclo-static actor, i.e. (s0, e) → (s0, e), P_static and C_static will be false since this cycle computes the actor's production rate. In all subsequent cycles, these are true if the actor has a fixed data rate. To determine data rates for cyclostatic actors, the abstraction to Fiacre intercepts all actor transitions to the user defined initial state, e.g. s0, introducing an additional Fiacre state sInit that compares previously recorded data rates for s0 → … → s0. The sInit → s0 Fiacre transition also consumes tokens from the FIFO for this transition sequence.
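The bookkeeping performed at the cycle boundary can be sketched in Python. This is an assumed reading of the mechanism described in the text, not a transcription of the Fiacre in Fig. 8: the `RateTracker` class and its method names are hypothetical, and `cycle_completed` stands for the comparison done when the intercepted sInit state is reached.

```python
class RateTracker:
    """Counts tokens per cycle and flags whether the rate repeats."""

    def __init__(self) -> None:
        self.P = self.C = 0                # tokens in the current cycle
        self.P_prev = self.C_prev = 0      # tokens in the previous cycle
        self.P_static = self.C_static = False
        self._cycles_seen = 0

    def fire(self, consumed: int, produced: int) -> None:
        """Record the data rate [consumed/produced] of one action firing."""
        self.C += consumed
        self.P += produced

    def cycle_completed(self) -> None:
        """Called on every return to the initial FSM state."""
        if self._cycles_seen > 0:
            # From the second cycle on, compare with the recorded rates.
            self.P_static = self.P == self.P_prev
            self.C_static = self.C == self.C_prev
        # The first cycle only records the rates, so the flags stay false.
        self.P_prev, self.C_prev = self.P, self.C
        self.P = self.C = 0
        self._cycles_seen += 1
```

With this reading, an actor that consumes and produces one token per cycle has both flags true from its second cycle onwards, whereas a filter that drops a token in some cycle makes P_static false at the end of that cycle, which is the kind of state a counterexample would expose.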

Cyclo-Static actors as LTL formula
Once the mapping abstracts Fiacre process descriptions from actors, the Fiacre compiler translates them to enriched Petri nets [37] for LTL verification by TINA, which searches for counterexamples of cyclostatic properties. LTL (Linear Temporal Logic) is a temporal logic in which a formula can encode (logical) time and future state observations along runs, where each state in time has a single successor.

LTL Model checking approach
The LTL formulae we use have logical operators negation (not), conjunction (and), disjunction (or) and implication (⇒), and temporal operators eventually (◊), always (□) and until (until). In addition, they have atomic properties asserting that a particular process instance in the Fiacre description is in some particular Fiacre state and that one of its variables has a particular value. In the following LTL formulae these atomic properties are written State(s) and Value(x = v), respectively.
Our LTL model checking approach is:
1. Translate the actor to a Fiacre process (Section 3).
• If a counterexample cannot be found, the actor is classified as a multirate static or cyclostatic actor and can be parallelised using the algorithm in Section 5.2.
• Otherwise the actor is classified as dynamic and is not suitable for parallelisation.

The cyclo-static assertions
Cyclo-static actors always have an infinitely repeating periodic cycle of FSM transitions between control states. We capture cyclostatic dataflow semantics in the following LTL properties:
1. The variables in an actor's store are bound to their initial value after each periodic cycle. Property cyclic_store, Section 4.2.1.
2. There exists a periodic cycle between FSM control states. Property periodic_sequence, Section 4.2.2.
3. The data rates of an actor are static. Property static_rate, Section 4.2.3.
The model checker searches for violations of these LTL properties to identify dynamic actors.We now give more detail of these three LTL properties.

Periodic process configuration
This LTL property checks that all values in an actor's store are reset on completion of the periodic cycle. The LTL formula is generated from actor code, and ensures that all store variables are at their initial value when the initial FSM state is returned to. As an example, consider an actor whose store is {x = 0}, which can be extracted directly from the programmer's code, e.g. int x := 0;, and whose initial FSM state is s0, which the programmer specifies in the FSM declaration (Section 2.3).
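One plausible rendering of cyclic_store for this example is sketched below. It is reconstructed from the prose description alone, using the State and Value atomic properties introduced earlier, so the formula actually generated by the tool may differ in detail.

```latex
\[
\mathit{cyclic\_store} \;=\;
  \Box\,\bigl(\mathit{State}(s0) \Rightarrow \mathit{Value}(x = 0)\bigr)
\]
```

Read: at every position of every run, if the actor is in its initial FSM state s0 then the store variable x holds its initial value 0.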

Periodic state transition sequence
The LTL property to verify the existence of a periodic cycle in an actor's state transitions combines observations of control states with the until temporal operator. Consider an actor with states s0, s1, s2 and s3, and a periodic cycle s0 → s1 → s3 → s2 → s0. The property for proving that this is the sole possible sequence, and that it occurs infinitely often, is: where, for any two Fiacre states i and j, To(i,j) stands for the formula:

State(i) and (State(i) until State(j))
At some position along a run, To(si,sj) holds if the Fiacre state of the actor at that position is si, the actor is in Fiacre state sj at some later position in the run, and it remains in state si until then. Property periodic_sequence asserts that, in each run, the initial state of the actor is s0 and that at each position in the run exactly one of the To(si,sj) propositions in the disjunction holds. The absence of counterexamples proves an infinitely periodic FSM cycle for this actor. In practice, extracting this LTL formula for automated verification will require a CAL language extension for expressing explicit cyclic sequences that the programmer intends the model checker to verify.
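The reading of To(i,j) and periodic_sequence given above can be made concrete with a small sketch. It evaluates the property on a single lasso-shaped run (a finite prefix followed by a loop that repeats forever), which is how counterexamples to LTL properties are usually presented. The encoding of the property as "the run starts in the first state of the cycle, and at every position exactly one To edge of the cycle holds" is taken from the prose; the function names and the run representation are assumptions for illustration, and this is not how Tina evaluates formulas.

```python
def successor(pos, prefix_len, run_len):
    """Next position in a lasso-shaped run: advance, or wrap to the loop start."""
    nxt = pos + 1
    return nxt if nxt < run_len else prefix_len


def holds_to(run, prefix_len, pos, i, j):
    """To(i,j) at pos: State(i) and (State(i) until State(j))."""
    if run[pos] != i:
        return False
    seen = set()
    while pos not in seen:
        seen.add(pos)
        if run[pos] == j:
            return True          # reached j, having stayed in i until now
        if run[pos] != i:
            return False         # left i for a state other than j
        pos = successor(pos, prefix_len, len(run))
    return False                 # stays in i forever, so j is never reached


def periodic_sequence(run, prefix_len, cycle):
    """Check the run starts in cycle[0] and exactly one To edge holds everywhere."""
    edges = list(zip(cycle, cycle[1:] + cycle[:1]))
    if run[0] != cycle[0]:
        return False
    return all(
        sum(holds_to(run, prefix_len, p, i, j) for i, j in edges) == 1
        for p in range(len(run))
    )


cycle = ["s0", "s1", "s3", "s2"]
good_run = ["s0", "s0", "s1", "s3", "s2"]   # loop starts at position 0
bad_run = ["s0", "s1", "s2", "s3"]          # s1 is followed by s2, not s3
```

With prefix_len = 0, periodic_sequence(good_run, 0, cycle) is true, whereas bad_run fails at the position holding s1, because no To edge of the cycle holds there. The sketch requires prefix_len to be smaller than the run length so that the loop is non-empty.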

Cyclo-Static periodic data rate
Finally, we check that the data rate is the same for every loop through the verified periodic cycle with: The LTL static_production_rate property reads: at each state along each run, either P_static is true or it eventually becomes continuously true. A static consumption rate is checked with the static_consumption_rate property, which tests the same temporal observations for cyclic consumption rates.
Although a simple mechanism, the P_static and C_static booleans enable the model checker to search for a single action firing sequence counterexample in which the P and C data rate counters deviate from a fixed value after an initial periodic cycle. Should a counterexample not exist, i.e. the actor is cyclo-static, these two values are used by the parallelisation algorithm in Section 5.2.
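The prose reading of static_production_rate, "at each state along each run, either P_static is true or it eventually becomes continuously true", corresponds to a formula of the shape □(P_static or ◊□ P_static). The following minimal sketch evaluates that shape on one lasso-shaped run, given as a list of boolean observations of P_static with a loop starting at index prefix_len. The formula shape is inferred from the prose, and the function names and the run encoding are assumptions for illustration.

```python
def eventually_always(values, prefix_len):
    """◊□p on a lasso: p must hold at every position of the repeating loop."""
    return all(values[prefix_len:])


def static_rate(values, prefix_len):
    """□(p or ◊□p): at every position, p holds now or holds forever from some point."""
    stable = eventually_always(values, prefix_len)
    return all(p or stable for p in values)


# P_static is false during the first periodic cycle and true in every later one.
cyclostatic_run = [False, False, True, True]   # loop starts at index 2
# P_static keeps becoming false again inside the loop.
dynamic_run = [False, True, False]             # loop starts at index 1
```

On a lasso the property reduces to P_static holding throughout the loop, which matches the intuition that the rate counters may settle during an initial cycle but must never deviate afterwards. The sketch assumes prefix_len is smaller than the run length.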

Model checking an actor
Processes cannot be model checked against the LTL rules above without communicating with another process that feeds data to it and consumes its results. Therefore the following testbench is composed in parallel with the actor being model checked: Their parallel composition is expressed in a top level Fiacre component:

Fig. 9. Orcc IR for Dataflow Transformation (from [38] with permission).

Dataflow transformations
Our dataflow transformation system supports the parallelisation of static and cyclostatic actors. It is implemented using the Java based Orcc Intermediate Representation (IR) API, shown in Fig. 9, by instantiating the Actor and Connection classes to implement the parallelisation algorithm in Section 5.2.

Programming environment
We have extended the open source Orcc dataflow programming environment [16] with a refactoring tool, shown in Fig. 10, for parallelising static and cyclo-static actors, which introduces fork and join actors to scatter and gather data. The user chooses the parallelism degree, i.e. how many actors take the place of the previous single actor. Orcc has multiple backends for different target architectures. We have added two backends to support the tool: 1) CAL, to generate new actor code for the fork and join actors, and 2) Fiacre, for model checking actors. Fig. 11 shows the graphical consequence of applying the transformations.

Parallelisation algorithm
For an actor A consuming from edge u and producing to edge v, the transformation algorithm is:
1. Extract, using model checking, the consumption and production data rates C_A(n) and P_A(n) for a complete cyclostatic cycle, where n is the number of firings in a periodic cycle. This is done by parsing the P and C values from the output of the Tina model checker.
2. Create N parallel instances A_1 to A_N with the Orcc IR API, where N is the user-selected parallelism factor (Fig. 10).
3. Create a fork actor with Orcc IR that distributes data from edge u across A_1, ..., A_N in C_A(n) chunks.
4. Create a join actor with Orcc IR to consume P_A(n) chunks from actors A_1 to A_N.
5. Output the joined chunks as a sequential stream to edge v.
After this transformation the graph remains partially sequential, due to the linear stream of data that the fork actor consumes and then distributes. Fig. 12 shows the sequential nature of stream propagation as the fork actor sends tokens to each parallel instance in turn on an FPGA. In this case each parallel actor receives streams of 10 elements, 1 token per cycle. Here, the high out1_SEND signal instructs the first parallel actor that the data signal is valid for the first 10 cycles, and the low signals out2_SEND, out3_SEND and out4_SEND instruct the other three actors not to read the data signal during these cycles. Stream gathering by the join actor is also sequential. This data propagation latency limits the speedup, since some actors may be idle waiting for their next input stream, so there is a balance between the task size, potential parallel speedups and the latency overheads of parallelism.
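The effect of the fork and join actors on the token stream can be sketched in software. The following is a minimal functional model, not the generated CAL or the Orcc IR code: it assumes that chunks are handed to the parallel instances in round-robin order and gathered in the same order, which is consistent with the sequential propagation described above but is an assumption of this sketch. The function names and the use of Python's sorted as a stand-in actor are hypothetical.

```python
def fork(stream, n_workers, chunk):
    """Scatter consecutive chunks of `chunk` tokens to workers in round-robin order."""
    queues = [[] for _ in range(n_workers)]
    for start in range(0, len(stream), chunk):
        worker = (start // chunk) % n_workers
        queues[worker].append(list(stream[start:start + chunk]))
    return queues


def join(outputs, n_chunks):
    """Gather one result chunk at a time, visiting workers in the same order."""
    result = []
    for k in range(n_chunks):
        worker = k % len(outputs)
        result.extend(outputs[worker][k // len(outputs)])
    return result


def parallelise(actor, stream, n_workers, consume):
    """Apply `actor` to each chunk via fork/join; `consume` is the tokens per cycle."""
    queues = fork(stream, n_workers, consume)
    outputs = [[actor(chunk) for chunk in queue] for queue in queues]
    return join(outputs, sum(len(queue) for queue in queues))


rows = [3, 1, 2, 9, 8, 7, 6, 5, 4, 0, 2, 1]    # four rows of width 3
parallel = parallelise(sorted, rows, n_workers=2, consume=3)
sequential = [x for k in range(0, len(rows), 3) for x in sorted(rows[k:k + 3])]
```

Here parallel and sequential are the same list, which is the behaviour the transformation is meant to preserve for a cyclostatic actor: because every periodic cycle consumes and produces a fixed number of tokens and resets the store, chunks can be processed by independent instances without changing the output stream.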

Evaluation
This section evaluates the model checking based classification of actor models using three actors that implement (1) sparse matrix compression, (2) matrix row sorting and (3) dynamic time warp. One of these actors is identified as not being cyclostatic; the other two are, and hence are parallelisable. They are parallelised using the interactive transformation tool, replacing these sequential actors with up to 16 parallel actors. The speedup results are given in Section 6.3.

Counterexample
An actor can implement a sparse matrix compression algorithm incrementally as a matrix streams through it. It consumes the matrix in row major order, and outputs a (row,column,value) tuple for each non-zero value. The transition rules of the actor are: The actor has one control state s0 and two actions, compress and ignore, that both transition from s0 back to s0. Each matrix value in the stream disambiguates action scheduling: compress fires on non-zero values, and ignore fires on zero values. The compress action increments width and height counters, shown as w′ and h′ in the store after the transition. The actor is 25 lines of CAL code. The Fiacre description is 79 lines of code. The TINA model checker generates a synchronised Büchi automaton and a transition system from the LTL formula in Section 4, then searches the transition system for paths that are counterexamples. does hold, because state 97 is always to. This is due to the nature of sparse arrays: the number of non-zero values in the matrix determines the data rates.
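Why this actor cannot have a static production rate is easy to see in a software model of its behaviour. The sketch below is not the 25-line CAL actor; it is a hypothetical Python stand-in that emits a (row, column, value) tuple for each non-zero value, so that two matrices of identical shape produce different numbers of output tokens.

```python
def compress_stream(matrix):
    """Emit (row, column, value) for each non-zero value, in row major order."""
    out = []
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            if value != 0:
                out.append((r, c, value))   # corresponds to 'compress' firing
            # a zero value corresponds to 'ignore' firing, which produces nothing
    return out


sparse = [[0, 5], [0, 0]]
dense = [[1, 2], [3, 0]]
tokens_sparse = len(compress_stream(sparse))   # 1 output tuple
tokens_dense = len(compress_stream(dense))     # 3 output tuples
```

Both inputs consume four tokens, yet they produce one and three tuples respectively. A production counter therefore cannot settle to a fixed value per cycle, which is the kind of behaviour a static rate property is designed to reject.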

Parallelising verified cyclo-static actors
Two benchmarks are now used to show the performance improvements of parallelisation of verified cyclostatic actors with the mechanised algorithm from Section 5.2 .

Matrix row sorting
The matrix row sorting actor sorts every row in a matrix in ascending numerical order using bubble sort. The extent to which the input is already sorted determines the clock cycle latency of the sorting. There is also the latency of consuming and producing rows, one cycle per element. The actor's FSM is in Fig. 13. With an initial store {  = [] ,  = 0}, the scheduling is: The model checker is unable to find counterexamples of the LTL properties. For a matrix of width 100, the sorting actor A has a periodic cycle of 202 firings: 100 × receive, 100 × output, 1 × sort and 1 × reset. The fork actor scatters C_A(202) = 100 tokens to each parallel actor and the join actor gathers P_A(202) = 100 tokens from each actor.
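The arithmetic behind these figures can be checked with a short sketch. It assumes, in line with the one-cycle-per-element description above, that each receive firing consumes one token, each output firing produces one token, and the sort and reset firings neither consume nor produce. The dictionary layout and function name are hypothetical.

```python
def cycle_rates(cycle):
    """Return (firings, tokens consumed, tokens produced) for one periodic cycle.

    `cycle` maps an action name to (firing count, tokens consumed per firing,
    tokens produced per firing).
    """
    firings = sum(count for count, _, _ in cycle.values())
    consumed = sum(count * c for count, c, _ in cycle.values())
    produced = sum(count * p for count, _, p in cycle.values())
    return firings, consumed, produced


# Row sorting actor for a matrix of width 100 (per-firing rates are assumed).
sorting_cycle = {
    "receive": (100, 1, 0),
    "sort": (1, 0, 0),
    "output": (100, 0, 1),
    "reset": (1, 0, 0),
}
firings, consumed, produced = cycle_rates(sorting_cycle)
```

For this cycle the sketch yields 202 firings, 100 tokens consumed and 100 tokens produced, matching the chunk sizes used by the fork and join actors.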

Dynamic time warp
Dynamic time warping (DTW) is an algorithm to measure the similarity of two temporal sequences. Applications of DTW include automated speech recognition. The actor takes two sequences S and T, then outputs the optimal match between them. Consuming each sequence element costs one cycle, followed by multiple clock cycles to compute the optimal match, then one cycle to output that match. The actor's FSM is in Fig. 14. With an initial store {  = [] ,  = [] ,  = [][] ,  = 0} and a sequence length n, the scheduling is: Again, the model checker is unable to find counterexamples of the LTL properties. For input sequence lengths of 40, the DTW actor A has a periodic cycle of 83 firings: 40 × receiveS, 40 × receiveT, 1 × reset, 1 × dwtDistance and 1 × outputMatch. The fork actor scatters C_A(83) = 80 tokens to each parallel actor and the join actor gathers P_A(83) = 1 token from each actor.
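For readers unfamiliar with DTW, the computation performed between receiving the two sequences and emitting the single result token can be sketched with the textbook dynamic programming formulation. This is not the authors' CAL actor: the use of the absolute difference as the local cost, and the function name, are assumptions of this sketch.

```python
def dtw_distance(s, t):
    """Textbook DTW cost between two numeric sequences, using |a - b| as local cost."""
    n, m = len(s), len(t)
    inf = float("inf")
    # d[i][j] is the minimal accumulated cost of aligning s[:i] with t[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

For example, dtw_distance([1, 2, 3], [1, 2, 2, 3]) is 0, because the repeated 2 can be warped onto a single element. The two-dimensional table explains why the computation takes many clock cycles relative to the single output token.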

Performance results
For the two benchmarks, the Orcc compiler generates an FPGA circuit description using the Xronos Verilog backend [39]. The results in Fig. 15 show the parallel speedup using 4, 8, 12 and 16 actors. Speedup is T_1 / T_n, where T_1 is the clock cycles with one actor and T_n is the clock cycles with n actors. The dashed line shows ideal linear speedup. The Verilog simulation testbench randomises the input values for 128 input sets for both benchmarks, i.e. when using 16 actors each actor processes 8 inputs.
Fig. 15a shows matrix row sorting speedups for different row lengths. The number of elements in a matrix row, and the pre-sorted arrangement of its values, determines the clock cycle latency of sorting the elements, since randomly ordered inputs require repeated sorting passes. The sequential algorithm for a row length of 1000 requires 1,546,477 clock cycles. There is a 5.6 × speedup for 12 and 16 actors when processing rows of lengths between 10 and 200. To sort just 10 elements, performance drops from a speedup of 4.3 with 12 actors to a speedup of 3.3 with 16 actors. At such small workloads, most of the 16 actors will not be firing, and the data propagation latency overheads instead degrade performance. Speedups of 5.6 × and 15.4 × take significantly less time to achieve with our approach than with direct Verilog or VHDL optimisations. This is because our tooling automates fork/join data management and uses model checking to verify that the optimisation can be safely applied, which at the Verilog/VHDL level would require extensive testbench simulation.
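The speedup figures can be related to the ideal linear speedup through parallel efficiency, i.e. speedup divided by the number of actors. The sketch below uses the speedups quoted in the text; the cycle counts passed to speedup in the last line are hypothetical values chosen only to show the T_1 / T_n calculation, and efficiency is a derived quantity that the paper itself does not report.

```python
def speedup(t1_cycles, tn_cycles):
    """Speedup T_1 / T_n from clock cycle counts with one actor and with n actors."""
    return t1_cycles / tn_cycles


def efficiency(speedup_value, n_actors):
    """Fraction of ideal linear speedup achieved with n_actors parallel actors."""
    return speedup_value / n_actors


dtw_16 = efficiency(15.4, 16)            # dynamic time warp, 16 actors
sort_small_16 = efficiency(3.3, 16)      # row sorting, 10 elements, 16 actors
example = speedup(1000, 250)             # hypothetical cycle counts
```

The dynamic time warp result corresponds to roughly 96% of ideal speedup, while sorting rows of 10 elements with 16 actors reaches only about 21%, which quantifies the effect of data propagation latency on small workloads.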
The CAL and Fiacre implementations, generated hardware descriptions and raw results are in an Open Access dataset [40] .

Discussion
The previous section demonstrated how parallelising actors can elicit speedups. However, this is not always the case, e.g. when an actor is not on the critical path of a program's execution [41]. Worse still, rather than increasing performance, parallelisation can have the countereffect of generating larger programs, which in turn can reduce achievable clock frequencies or exceed hardware resource constraints. Tooling can help direct a programmer to bottlenecks that parallelism may alleviate.
In Section 7.1 we present a tool we have developed for Orcc that addresses this issue. The motivation is to present hardware costs to the user, e.g. identifying the actor with the longest datapath, which in turn affects the overall achievable clock speed for the entire dataflow graph. Then in Section 7.2 we quantify the CAL language features used in real world code. The use of some CAL language features potentially results in dynamic dataflow behaviours. The frequent occurrence of these language constructs in actors justifies the need for model checking.

Hardware profiling
The Xronos FPGA backend for Orcc uses Xilinx's open source OpenForge [42] compiler to generate hardware cost reports for each actor at compile time. These are: 1) the number of BRAM blocks, and 2) the datapath depth for each action in an actor. The datapath depth determines the minimum latency (microseconds) between clock cycles.
We have added to the Orcc programming environment a tool that lifts these hardware costs into the visual dataflow editor ( Fig. 16 ), to communicate hardware bottlenecks to the user.Green actors have small datapath depths, relative to the orange critical actor(s) which have the longest datapath.BRAM counts for implementing actor internal variables are also shown to the programmer.
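The colouring rule described above can be summarised in a few lines. This is a hypothetical sketch of the rule only, not the Orcc editor plugin: the actor names and datapath depths are invented for illustration, and the real tool reads these values from the OpenForge cost reports.

```python
def colour_actors(datapath_depth):
    """Map each actor to 'orange' if it has the longest datapath, else 'green'."""
    critical = max(datapath_depth.values())
    return {
        actor: ("orange" if depth == critical else "green")
        for actor, depth in datapath_depth.items()
    }


# Hypothetical per-actor datapath depths.
depths = {"fork": 3, "sorter": 17, "join": 4}
colours = colour_actors(depths)
```

Several actors are marked as critical when they share the longest datapath, which matches the "critical actor(s)" wording above.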

Real world dynamic dataflow programs
A static analysis approach may be able to identify static, cyclostatic and dynamic actors when the actors are small, e.g. an actor with just one action. However, scaling static analysis to identify the models of computation of larger actors, e.g. with multiple guarded actions and store variables updated by multiple actions, would be very challenging. The authors are not aware of a dataflow framework that can do this automatically, that is, one that applies static analysis to actors implemented in a dynamic dataflow language to classify their dataflow model of computation.
We assess the dataflow language features used in a real world code repository (footnote 2) to evaluate the size of CAL actors in practice. It includes 823 actors in 555 dataflow graphs written by 23 developers, across multiple domains including cryptography, image processing and network protocols.
Table 1 shows the frequency of CAL language features in this repository that can increase the complexity of classifying an actor's model of computation, e.g. whether it has cyclostatic or dynamic data rates. An external control signal from an input port is in a guard predicate for 15% of actions, and internal control signals in the store appear in a guard predicate for 56% of actions, i.e. actions for which private variable values may determine runtime scheduling choices. Most actors are stateful, with 78% containing at least one variable in their store. Multiple actions consume from the same port for 49% of all input ports. Multiple actions write to the same port for 41% of all output ports. Multiple actions modify the same store variable for 62% of all store variables.
Static analysis would not be feasible at these scales of code complexity.Machine checked verification would be the only feasible option to overcome this, e.g. with the model checking approach that this paper presents.
Automating the verification and parallelisation approach for the full CAL language will require some additional engineering: (1) the implementation of a parser for the TINA model checking output, (2) an extension of the verification approach to support multiple input or output ports, and (3) a CAL language extension for expressing explicit cyclic sequences (Section 4.2.2). Supporting multiple actor ports (task 2) will likely be the most straightforward, because there is a direct mapping from CAL ports to Fiacre ports (Section 3.3.1). Extending the syntax of CAL FSMs to support a programmer's claim of a deterministic cyclic sequence through states (task 3), for model checking to verify, will require an extension to Orcc's frontend, i.e. changes to the Xtext grammar, and adding it to Orcc IR by extending the Orcc EMF (Eclipse Modelling Framework) model. Processing the output of the TINA model checker (task 1) will require a parser, or a machine readable output format added to TINA's backend.

Conclusion
As the performance of embedded architectures increases, the program sizes that they can support also grow, as do the complexity and dynamic nature of the algorithms in those programs. As embedded accelerators become more widely adopted, the importance of effective code optimisations grows. Previous work [43] explores the NP-hard problem of cyclic scheduling of multiple actors mapped to parallel processing architectures. In contrast, this paper presents a program transformation approach that parallelises a single, potentially stateful, actor into multiple actors to exploit parallel architectures. The approach separates cyclostatic from dynamic actors using LTL model checking. The verification phase takes seconds to run. The graphical parallel refactoring tool speedily increases FPGA performance, e.g. a 15.4 × speedup with 16 actors.
This paper is a step towards auto-parallelisation of dynamic dataflow programs. Verifying other parallelising transformations, e.g. task and pipelined parallelism, and verifying transformations of dynamic actors, would extend our approach. The broader aim of this work is to integrate automated formal verification more widely into optimising compilers and parallel runtime systems for embedded systems.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 15b shows dynamic time warping speedups for a length of 40 for sequences S and T. The sequential algorithm requires 1,332,979 clock cycles. Speedup with 16 actors is almost linear (15.4).
2. Abstract an actor's initial FSM state and store variable values into three LTL formulas ( Sections 4.2.1, 4.2.2 and 4.2.3 ). 3. Model check the Fiacre process for counterexamples of the LTL properties that model cyclostatic actors.
The model checker finds a counterexample for the static_data_rate property. State 60 in the Büchi automaton is: 76 → 88 → 97 → 109 → 115 → 127 → 47 → 59 → 76. There is a loop from state 76 to itself. In this loop is state 97, which in the counterexample is: Therefore the model checker proves the static_production_rate property false, since P_static being false at state 60 serves as the antecedent of the ⇒ implication, and the consequent: