Two challenges in embedded systems design: predictability and robustness

I discuss two main challenges in embedded systems design: the challenge to build predictable systems, and that to build robust systems. I suggest how predictability can be formalized as a form of determinism, and robustness as a form of continuity.


Introduction
An embedded software system is sometimes defined as a computing system that interacts with the physical world. This definition is incomplete, because every software system, once it is up and running, interacts with the physical world. More precisely, what is meant is that an embedded software system has non-functional requirements, which concern the system's interaction with the physical world.
There are two interfaces of a software system with the physical world: the environment and the platform. The environment includes the human users of the system, possibly a physical plant that is controlled by the system, and other application software processes that interact with the system. The platform consists of software and hardware components that implement a virtual machine on which the system is executed; it includes the operating system and network, with specific scheduling and communication mechanisms. Correspondingly, the non-functional requirements of an embedded software system can be classified as follows (Henzinger & Sifakis 2007): reaction requirements, which concern the interaction of the system with the environment, and execution requirements, which concern the interaction of the system with the platform.
to objective (i), because it is often difficult for the programmer to foresee the effect that a given priority assignment has on the real-time behaviour of the system. As a result, task priorities are often tuned during the testing phase of the system, only to be adjusted again whenever the environment or platform changes. I submit that a successful solution to the grand challenge in embedded software design has to exhibit two key characteristics. First, the programming model must have the property that all software written in the model be predictable, in particular, not only predictable in its functional properties (what output does a program compute?) but also predictable in its reaction properties (e.g. when does the program provide the output?) and in its execution properties (e.g. what resources does the computation consume?). I will formalize the notion of predictability as a form of determinism. Second, the programming model must have the property that all software written in the model be robust, in the sense that its reaction properties change only slightly if the environment changes slightly, and that its execution properties change only slightly if the platform changes slightly. I will formalize the notion of robustness as a form of continuity. The remainder of this article will, in turn, discuss these two universal meta-requirements on embedded software systems: predictability and robustness.

Predictability through determinism
The first universal challenge in systems design is the construction of systems whose behaviour can be predicted. In embedded systems, the interesting aspects of system behaviour encompass not only functionality but also reaction and execution properties such as timing and resource consumption. For the sake of illustration, I will concentrate in the following on timing. For this purpose, I will use a notion of system behaviour that includes, in addition to the values that are being computed, also the times at which the computed values become available. If other non-functional dimensions of behaviours are of interest, such as power consumption, then similar arguments can be made.
Predictability, at first glance, seems at odds with the concept of non-determinism. A non-deterministic process is a process that may continue in several different ways, and thus, to predict the outcome of the process, all possible continuations need to be considered. This is often a difficult and prohibitively expensive task. Non-determinism, on the other hand, is one of the central defining notions of computer science and lies at the very heart of several fundamental questions, among which I list two. Is non-deterministic computation more efficient than deterministic computation (P versus NP)? Is there an observable difference between two actions being executed concurrently and the same two actions being executed sequentially in a non-deterministic order?
One approach to the building of predictable systems is to build them entirely from deterministic parts. This would require one to use: only processors for which the execution time of each instruction is predictable, in particular, independent of cache and memory accesses; communication channels for which the delivery time of each message is predictable; etc. I believe that such a fully deterministic approach cannot provide a generally acceptable solution, for two reasons.
First, some residual sources of non-determinism, such as the possibility of hardware failure, are difficult if not impossible to remove. Indeed, as circuit elements become smaller, they also become less predictable. Second, the fully deterministic approach voluntarily forgoes one of the most successful design principles, namely that certain forms of non-determinism can be controlled by hiding them underneath a higher deterministic layer of abstraction. Such a deterministic layer is typically a programming model (but it could also be a processor model; Edwards & Lee 2007). Thus, I believe that the key question before us is not how we can build complex embedded systems from deterministic parts, but how we can build deterministic abstraction layers from non-deterministic parts.
To approach this question, one needs to distinguish between several sources of non-determinism, as follows: (i) input non-determinism, (ii) unobservable implementation non-determinism, (iii) don't care non-determinism, and (iv) observable implementation non-determinism.
I will argue that the fourth kind of non-determinism, and only the fourth kind, needs to be avoided in order to design predictable systems. Since I focus on software systems, I generally refer by the term 'system' to a program written in a high-level programming language, and by the term 'implementation' to a lower layer of abstraction, such as machine code. Hence, when a system is classified as deterministic (or in §3, continuous), we appeal to the semantics of the system as defined by the programming model (i.e. the definition of the programming language). Other, more hardware-centric views reserve the term 'system' for the concrete deployed artefact; in their terminology, what we call a deterministic (or continuous) system would be referred to as an executable, deterministic (or continuous) 'model' of the system (Kopetz 2007). Either way, I argue for avoiding certain forms of non-determinism (and certain discontinuities) at the abstract level of the design.
The first source of non-determinism is the environment of the system, which is free to behave in many different ways. This kind of non-determinism, called input non-determinism, is uncontrollable; it must be present whenever a system reacts to environmental stimuli. Therefore, when a reactive system is referred to as deterministic, what we mean is not that there is a unique stream of input and output values, but that, for every stream of input values that are provided by the environment, the stream of output values that are computed by the system is unique. For embedded systems, the input and output streams must include time stamps. More precisely, a timed input stream is a sequence of time-stamped input values, such as sensor readings or user commands; and a timed output stream is a sequence of time-stamped output values, such as actuator updates and other generated events that are observable by the environment. We say that an embedded system is time deterministic if, for every timed input stream, the timed output stream that is computed by the system is unique (Kopetz 2007). Note that time determinism refers not only to the input and output values but also to the times at which input values are given to the system and the times at which output values are made available to the environment. If an embedded system computes a unique output value, but may make the value available to the environment (say, by updating an actuator) at different time instants, then the system is not time deterministic. Obviously, time determinism is essential for safety-critical real-time systems such as those deployed to control automobiles or aircraft.
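The definition can be read as a check on observed runs. The following Python sketch is an illustration under stated assumptions: timed streams are represented as tuples of (time stamp, value) pairs, and the name is_time_deterministic is hypothetical. Note that a finite set of observed runs can only refute time determinism; it cannot establish it.

```python
from typing import Dict, Hashable, Iterable, Tuple

# Assumed representation: a timed stream is a tuple of (time stamp, value) pairs.
TimedStream = Tuple[Tuple[float, Hashable], ...]

def is_time_deterministic(runs: Iterable[Tuple[TimedStream, TimedStream]]) -> bool:
    """Return False if two runs share a timed input stream but differ in timed output."""
    seen: Dict[TimedStream, TimedStream] = {}
    for timed_input, timed_output in runs:
        # Remember the first output seen for this input; compare later ones against it.
        if seen.setdefault(timed_input, timed_output) != timed_output:
            return False
    return True

# Two runs with the same sensor reading and the same actuator value,
# but the actuator is updated at different time instants.
same_value_different_time = [
    (((0.0, "s1"),), ((5.0, "a1"),)),
    (((0.0, "s1"),), ((7.0, "a1"),)),
]
print(is_time_deterministic(same_value_different_time))
```

The second run computes the same value as the first but releases it later, so the check fails, which matches the remark that a unique output value alone is not sufficient.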
The second source of non-determinism is the omission of implementation details that do not influence the observable behaviour of the system. For example, one may specify a sorting algorithm by saying that, as long as there are two adjacent values x and y with x > y in the input sequence, the two values are swapped, without specifying how the input sequence is scanned in order to find such values. No matter whether the input sequence is scanned left to right (as in conventional bubble sorting) or right to left or in any other way, the result is the same sorted output sequence of non-decreasing values. We call this kind of non-determinism unobservable implementation non-determinism, because the resolution of non-deterministic choices does not affect the uniqueness of the outcome. Unobservable implementation non-determinism is not a source of unpredictability because, by definition, it has no bearing on the observable behaviour of the system. Unobservable implementation non-determinism is nonetheless helpful as it prevents overspecification, permits efficient implementations and often simplifies correctness arguments. Since our definition of time determinism refers only to the input and output values and their times, it does allow unobservable implementation non-determinism.
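The sorting example can be made concrete. In the Python sketch below (the function name swap_sort and the use of a seeded random generator to resolve the choice are assumptions made for illustration), the pair to be swapped is picked at random among all adjacent out-of-order pairs, yet every resolution of the choice produces the same output.

```python
import random

def swap_sort(values, rng):
    """Sort by repeatedly swapping some adjacent pair x, y with x > y.

    Which out-of-order pair is swapped next is left to `rng`; this choice
    models the implementation non-determinism of the specification.
    """
    seq = list(values)
    while True:
        out_of_order = [i for i in range(len(seq) - 1) if seq[i] > seq[i + 1]]
        if not out_of_order:
            return seq  # no adjacent pair with x > y remains: seq is sorted
        i = rng.choice(out_of_order)
        seq[i], seq[i + 1] = seq[i + 1], seq[i]

data = [5, 2, 9, 1, 5, 6]
# Resolve the non-deterministic choice in 20 different ways.
results = {tuple(swap_sort(data, random.Random(seed))) for seed in range(20)}
print(results)
```

Each swap removes exactly one inversion, so the loop terminates regardless of the scan order, and the set of results contains a single sequence.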
The third source of non-determinism is 'don't care' values in the output behaviour of a system. For example, we may not care about the output value y whenever the input value x satisfies a certain condition, or we may not care about the precise time of an output event. This kind of non-determinism, which we call don't care non-determinism, is again useful, as it prevents overspecification. Yet don't care non-determinism is not allowed by our definition of time determinism, which requires a unique output value and output time for each input value and time. However, we can accommodate don't care non-determinism easily by modifying the value and time domains for outputs. If explicit 'don't care' values are added to these domains, then the definition of time determinism does not need to be modified. For example, if on input value 0 the system may output either 0 or 1, then {(0, 0), (0, 1)} is a non-deterministic representation of the system behaviour (there are two possible outputs for the same input), while {(0, ⊥)}, where ⊥ denotes a don't care value, is a deterministic representation of the same system behaviour (the unique possible output is ⊥). This small mathematical trick, which allows don't care non-determinism in deterministic systems, shows that don't care non-determinism, while useful, is not essential.
The last source of non-determinism is the omission of implementation details that do influence the observable behaviour of the system. For example, a system with several tasks may compute different output values depending on the order in which the tasks are scheduled. If the scheduler is non-deterministic, or deterministic but unknown and not observable, then there are many possible output streams generated by the system for the same input stream. I submit that this kind of observable implementation non-determinism is harmful and must be avoided because it leads to unpredictable behaviour. While unobservable implementation non-determinism can be hidden underneath a deterministic programming model, and don't care non-determinism is shallow, observable implementation non-determinism is often difficult to control and prohibitively expensive to analyse. Design in the presence of observable implementation non-determinism requires the exhaustive consideration of all non-deterministic choices and their consequences. Hence, in order to build predictable systems, a designer must strive to minimize all observable implementation non-determinism.
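A two-task example shows how the scheduling order can leak into the output. The Python sketch below is purely illustrative: the tasks task_double and task_increment and the shared variable x are hypothetical, and the scheduler is modelled by enumerating all task orders.

```python
from itertools import permutations

def task_double(state):
    state["x"] = state["x"] * 2

def task_increment(state):
    state["x"] = state["x"] + 1

def run(schedule, initial):
    """Run the tasks in the given order on a shared variable and return the output."""
    state = {"x": initial}
    for task in schedule:
        task(state)
    return state["x"]

# The same input value 3 under every possible schedule of the two tasks.
outputs = {run(order, 3) for order in permutations([task_double, task_increment])}
print(outputs)
```

The input 3 yields 7 under one schedule and 8 under the other, so an analysis of the system has to consider every schedule; with n tasks there are n! of them.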
Consider the case of non-embedded programming. Unobservable implementation non-determinism has created some of the most powerful programming abstractions by freeing the programmer from managing inconsequential implementation details and, at the same time, enabling the compiler to perform optimizations. For example, the exact memory layout of data structures does not influence the results of most conventional programs. Thus, memory management is unobservable implementation non-determinism better left to the compiler than to the programmer. In stark contrast, observable implementation non-determinism is the result of some of the most problematic programming abstractions. The thread model for concurrent programming falls into this class. Concurrent threads may be interleaved in any way, thus producing highly non-deterministic results. The programmer is left to reduce this non-determinism manually by introducing locks and other synchronization constructs, which can have other unwanted consequences such as deadlocks. The current research in transactional memories is aimed at addressing the problem exactly by removing observable implementation non-determinism, by implementing the slogan 'think sequential, run parallel' (Allen 2008). Actor models of concurrency offer a different approach to achieve the same goal (Lee 2006).
In embedded programming, the challenge is to build time-deterministic systems. This is difficult because the scheduler has direct impact on the times at which output values are computed and, owing to race conditions, may have indirect impact on the computed values themselves. As a consequence, scheduled systems are often unpredictable. While there have been successes in removing this observable implementation non-determinism, most prominently the synchronous programming languages (Halbwachs 1993), we have no widely acceptable solution, especially for soft (average-case) real-time requirements. A key question is whether scheduling can be turned into unobservable implementation non-determinism, similar to memory management. Is it possible to have the times of output events specified by the programmer and ensured by the compiler? One approach in this direction has the compiler generate a schedule that guarantees that each output value is computed before it is needed and then withheld until a specified time. The withholding of computed output values until the times specified by the program yields a time-deterministic system.
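The effect of withholding outputs can be sketched with a small calculation. The Python fragment below is a simplified model, not an implementation of any particular compiler or language: the function name release_time, the release offset of 5.0 time units and the sample execution times are all assumptions chosen for illustration.

```python
def release_time(start, execution_time, release_offset):
    """Time at which an output becomes visible if it is withheld until
    start + release_offset, the instant specified by the program."""
    finish = start + execution_time
    if finish > start + release_offset:
        # The schedule failed to compute the value before it is needed.
        raise ValueError("output not computed before its specified release time")
    return start + release_offset

# Hypothetical execution times of the same task under different schedules and loads.
execution_times = [1.2, 3.7, 2.5]

visible_without_withholding = {0.0 + e for e in execution_times}
visible_with_withholding = {release_time(0.0, e, 5.0) for e in execution_times}
print(visible_without_withholding, visible_with_withholding)
```

Without withholding, the output time varies with the execution time; with withholding, the output always appears at time 5.0, provided the schedule meets that deadline. The variation in execution time has become unobservable.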

Robustness through continuity
A second universal challenge in systems design is the construction of systems whose behaviour is robust in the presence of perturbations. While robustness is understood for most physical engineering artefacts, it is rarely heard of in connection with software. This is because computer programs can be readily idealized as discrete mathematical objects, a perspective that has been advocated in computer science since the 1960s (McCarthy 1962). If a program is studied as a discrete mathematical object, then correctness is a Boolean notion and can be established by proof: the program either satisfies its requirements or it does not. This prevalent view of programs as discrete partial functions on values or states has led to tremendous successes: it enabled the most fundamental paradigms of the science of computing, including the theories of computability, complexity and semantics.
However, in computer science, unlike in other engineering disciplines, we often lose sight of the fact that the mathematical representation of a software system is just a model, and that the actual system is physical, executing on a physical imperfect platform and interacting with a physical unknowable environment. The realization that programs are ultimately physical shatters the Boolean illusion. Of two mathematically correct programs, one may be preferable to the other owing to the way it behaves if the platform or environment deviates from the nominal expectations, be it due to resource limitations, failures, attacks or simply erroneous or incomplete specifications. To some degree, this observation guides the design of robust non-embedded software, for example, by having the system check whether the input values lie within expected ranges. Moreover, one program may be more fault tolerant than another, functionally equivalent program, resilient against a larger class of potential attacks, etc. But the incompleteness of the sharp Boolean view becomes most apparent in embedded programming, where computing explicitly meets the physical world. This is because, often, the reaction and execution properties of a system, such as response time or power consumption, are best measured in terms of continuous quantities, and they may satisfy a design specification to different degrees.
In order to study computation as an imperfect, physical process, we must develop metrics for preferring one system over another. Given a set of requirements, a preference metric provides a measure of how close a system comes to meeting the requirements, and how robust it is against small changes in the requirements (Chatterjee et al. 2006). Both reaction and execution requirements may refer to several dimensions, such as the precision of the output values, their timeliness, the physical cost of the system and its expected lifetime. Hard worst-case requirements must be given more weight than soft best-effort requirements, and, in general, it will be unlikely that a single design point is to be preferred according to all criteria. In such cases, design decisions need to be based on a Pareto analysis that illuminates the trade-offs (Szymanek et al. 2004). In short, the Boolean constraint satisfaction problem of building a system that satisfies a set of requirements is to be replaced by a multidimensional optimization problem.
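A Pareto analysis of this kind can be sketched in a few lines of Python. The design points and their two cost dimensions (response time and power, lower being better in both) are invented for illustration, and the function names dominates and pareto_front are hypothetical.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every dimension and strictly
    better in at least one (lower values are better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(designs):
    """Keep the design points that are not dominated by any other point."""
    return {name: cost for name, cost in designs.items()
            if not any(dominates(other, cost) for other in designs.values())}

# Hypothetical design points: (response time in ms, power in mW).
designs = {
    "A": (2.0, 900.0),
    "B": (5.0, 300.0),
    "C": (6.0, 400.0),  # slower and more power-hungry than B
}
print(sorted(pareto_front(designs)))
```

Design C is worse than B in both dimensions and drops out, whereas A and B remain: neither is preferable according to all criteria, and the choice between them is a trade-off.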
In a quantitative setting, we can formalize robustness properties as mathematical continuity: a system is continuous if continuous changes in the environment or platform cannot cause discontinuous changes in the system behaviour. For a continuous system, for every positive real epsilon, there must be a positive real delta such that, for example, delta changes in the values and times of sensor readings cause at most epsilon changes in the values and times of actuator updates, or, delta changes in the processor speeds and loads cause at most epsilon changes in the output values and times. Such continuity properties can hold, of course, only for continuous variables such as real time and other physical quantities; for discrete variables such as switches, the flipping of a single bit may completely alter the behaviour of a system. However, embedded systems and their requirements refer to many continuous quantities, yet we have no design theory or methodology for building continuous systems. For example, a system that reads a sensor value, adds a constant, and outputs the result after some constant time delay behaves continuously, in both output value and time. On the other hand, a system that checks whether the sensor value is greater than a certain threshold, and, depending on the result of the check, invokes two different control algorithms with two potentially different execution times to compute the output, does not behave continuously. Since branching is not continuous, it is usually infeasible to build continuous software systems entirely from continuous parts; so the question before us is how to build continuous systems from non-continuous parts. In the above example, a preferred way to combine the two control algorithms may be to execute both procedures if the input value is near the threshold, to average the result and to withhold the output until a specific time that is independent of the computation times.
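The threshold example can be sketched as follows. The two control laws, the threshold of 5.0 and the band width are hypothetical. The sketch also makes one assumption beyond the text: instead of a plain average, it uses a weight that varies linearly across the band around the threshold, so that the blended output also joins the individual control laws continuously at the edges of the band. Output timing is not modelled here.

```python
def controller_low(v):
    """Hypothetical control law used below the threshold."""
    return 2.0 * v

def controller_high(v):
    """Hypothetical control law used above the threshold."""
    return 0.5 * v + 10.0

def switched_output(v, threshold=5.0):
    """Branch on the threshold: discontinuous in v at the threshold."""
    return controller_high(v) if v > threshold else controller_low(v)

def blended_output(v, threshold=5.0, width=1.0):
    """Run both control laws and blend them within `width` of the threshold."""
    # The weight rises linearly from 0 at threshold - width to 1 at threshold + width.
    w = min(1.0, max(0.0, (v - (threshold - width)) / (2.0 * width)))
    return (1.0 - w) * controller_low(v) + w * controller_high(v)

step = 1e-9  # a tiny change in the sensor value
jump_switched = abs(switched_output(5.0 + step) - switched_output(5.0))
jump_blended = abs(blended_output(5.0 + step) - blended_output(5.0))
print(jump_switched, jump_blended)
```

A change of one billionth in the sensor value moves the switched output by about 2.5, whereas the blended output moves by a correspondingly tiny amount. The price is that both control algorithms are executed near the threshold.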
Continuity properties are essential not only because we expect the system performance to degrade smoothly if the environment or platform changes in unforeseen ways, accidentally or maliciously, but also because physical quantities cannot be measured with infinite precision. In this sense, continuity is a natural requirement with respect to real-valued variables. There is also an altogether more radical approach, which looks at entire discrete transition systems through a continuous lens. For example, for a given program, the Boolean value of whether a functional safety requirement is satisfied forever may be replaced by a quantitative value that measures for how many transitions the requirement is satisfied (de Alfaro et al. 2003). The goal would be to define a continuous mathematics of programs where the resulting notion of robustness, formalized as continuity, naturally corresponds to our intuition of a system behaving well under disturbances.
Before concluding, let us have a brief look at a potential way of designing reliable systems that touches on both determinism and continuity. Suppose that an embedded control system updates an actuator value x at periodic times. Not all of the updates of x are going to be correct, because sensors may give faulty readings, processors may be too overloaded to compute new actuator values on time, hardware may fail, etc. We give the programmer the option to specify the desired reliability of x as a real number between 0 and 1. In particular, a reliability of 0.999 means that, in the long run, at most 1 out of every 1000 actuator updates writes an invalid value; or more precisely, if a valid update has value 1 and an invalid update value 0, then the limit average of the infinite number of updates generated by the system is at least 0.999 (Chatterjee et al. 2008). This reliability requirement is non-deterministic, as there are many different infinite sequences of '0's and '1's with a limit average of at least 0.999. However, we can make the reliability requirement deterministic by strengthening it slightly. For this purpose, we specify the output of the system as a pair consisting of a stream of actuator values x_i, for i ≥ 0, and a probability p (in our example, p = 0.999). The behaviours of the system are precisely those streams that are generated by an infinite repetition of a random coin toss that, in the ith round, yields the specified value x_i with probability p and a special don't care value with probability 1 − p. By the law of large numbers, the resulting system behaviours satisfy the reliability requirement with probability 1. Moreover, the reliable system is now specified by a unique stochastic process generating actuator values, which makes the definition deterministic.
This distinction between non-deterministic and probabilistic outcomes is important, as, in situations where we cannot hope to build fully reliable systems, we still need to strive for building systems whose failures are probabilistically predictable. For instance, if on input value 0 the system outputs 0 with probability 0.6 and 1 with probability 0.4, then even though it is probabilistic, the system behaviour is fully determined, and in that sense predictable. Intuitively, while drawing a random ball from an urn with unknown numbers of red and black balls has a non-deterministic outcome, depending on the ratio of red and black balls, such a drawing from an urn with known numbers of red and black balls has an outcome that, albeit stochastic, is completely specified. Technically, this is similar to our handling of don't care non-determinism in deterministic systems: there, several possible output values are specified by a unique don't care value; here, several possible output values are specified by a unique stochastic process, that is, by the value of a random variable. In other words, probabilistic behaviour that is specified in this way is a special case of deterministic behaviour.
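The coin-toss specification of the actuator example can be simulated over a finite prefix. The Python sketch below rests on assumptions of its own: the don't care value is represented by None, the stream of specified values is a constant, and the random generator is seeded so that the run is repeatable. A finite prefix can only approximate the limit average to which the law of large numbers refers.

```python
import random

DONT_CARE = None  # stands for the special don't care value

def reliable_stream(values, p, rng):
    """In each round, yield the specified value with probability p
    and the don't care value with probability 1 - p."""
    for x in values:
        yield x if rng.random() < p else DONT_CARE

rng = random.Random(0)  # fixed seed: an assumption made for repeatability
n = 200_000             # length of the simulated prefix
stream = reliable_stream((1.0 for _ in range(n)), 0.999, rng)

# Fraction of rounds in which the specified value was actually produced.
valid_fraction = sum(1 for x in stream if x is not DONT_CARE) / n
print(valid_fraction)
```

The individual outcomes are random, but the process that generates them is fixed by the pair of value stream and probability, which is the sense in which the specification is unique.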
Back to our example. While the programmer specifies the reliability requirement, the compiler must ensure it, based on assumptions about the failure rates of individual sensors, processors, actuators and communication links. In particular, if the probability of a failing computation on a single processor exceeds 0.001, then the compiler has to replicate the computation of each actuator value x_i on several processors and use a voting or averaging procedure to compute the results. Furthermore, to achieve continuity, the compiler must ensure that, if the failure assumptions about the hardware are slightly incorrect, then the reliability requirement on the software is missed only by a small amount; or more precisely, for every positive real epsilon there exists a positive real delta such that, if the failure assumptions are wrong by delta, then the system reliability lies within epsilon of the target.
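The replication decision can be illustrated by a small calculation. The Python sketch below is a simplified model with several assumptions that the text does not make: processors fail independently with the same probability q, a majority vote is used, the vote is wrong exactly when more than half of the replicas fail, and the voter itself never fails. The function names are hypothetical.

```python
from math import comb

def majority_failure(n, q):
    """Probability that more than half of n independent replicas fail,
    each failing with probability q."""
    return sum(comb(n, k) * q**k * (1 - q)**(n - k)
               for k in range(n // 2 + 1, n + 1))

def replicas_needed(q, target_failure=0.001, max_replicas=99):
    """Smallest odd number of replicas whose majority vote fails with
    probability at most target_failure."""
    for n in range(1, max_replicas + 1, 2):
        if majority_failure(n, q) <= target_failure:
            return n
    raise ValueError("target not reachable with the allowed number of replicas")

print(replicas_needed(0.0005))  # a single processor is already reliable enough
print(replicas_needed(0.01))    # replication is required
```

Under these assumptions, the failure probability of the vote is a polynomial in q and hence continuous in q: if the assumed failure rate is wrong by a small delta, then the achieved reliability misses its target by a correspondingly small epsilon, as the continuity requirement demands.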

Conclusion
I have outlined what I believe to be two major challenges in embedded systems design, as follows: (i) the challenge to build, on top of non-deterministic system implementations, system abstractions that are deterministic with regard to non-functional properties such as time and resource consumption and (ii) the challenge to build, on top of non-continuous system implementations, system abstractions that are continuous with regard to physical quantities.
Of course, to be of help in systems design, the abstractions need to support compilers and other techniques for synthesizing implementations, usually from a given collection of building blocks such as a specified instruction set. Both challenges require a rethinking of the conventional, purely discrete, purely functional (i.e. non-physical) foundation of computing. Embedded systems design, therefore, offers a prime opportunity to reinvigorate computer science (Lee 2005; Henzinger & Sifakis 2007).