A Self-Similar Traffic Model for Network-on-Chip Performance Analysis Using Network Calculus

Since around the year 2000, Network-on-Chip (NoC) has been proposed as a global communication paradigm to interconnect tens or hundreds of cores on a single chip (Bjerregaard & Mahadevan, 2006). One key challenge for NoCs has been Quality of Service (QoS), which is concerned with performance guarantees or bounds. To achieve QoS, formal performance analysis is essential because it avoids the uncertain results and long run times of simulation-based approaches (Lu, 2007).

The remainder of the chapter is organized as follows. Section 2 summarizes related work and our contributions. In Section 3, we first introduce the property of self-similar traffic. Then we present the Fractional Brownian Motion (FBM) model (Norros, 1995), which is used to characterize the self-similarity of traffic, and how to estimate FBM parameters. In Section 4, we present our main findings in the form of theorems, proposing an extended arrival curve to constrain self-similar traffic. Afterwards, in Section 5, we present formulas to calculate delay and backlog bounds. Assuming the latency-rate server model (Stiliadis & Varma, 1998) for network elements, we give closed-form equations. Moreover, to give a complete picture of our method, we describe a performance analysis flow to show how to conduct performance analysis for self-similar traffic. Experiments and results are reported in Section 6. Finally we draw conclusions in Section 7.

Related work
Since being initially identified in Ethernet by Leland et al. (Leland et al., 1994), traffic self-similarity has had a far-reaching influence on traffic modeling and performance analysis. The nature of self-similarity and the applications of this complex phenomenon have been extensively studied and summarized (Park & Willinger, 2000). In the context of NoCs, researchers have found evidence of self-similarity in on-chip communication traces. In (Varatkar & Marculescu, 2004), Varatkar et al. first introduced self-similarity as a fundamental property exhibited by the bursty traffic between on-chip modules in multimedia video applications. This work captured the traffic characteristics between pair-wise nodes rather than for the entire network. Later, Soteriou et al. (Soteriou et al., 2006) empirically studied a large set of traffic traces gathered from the execution of SPEC, MediaBench and bit-parallel benchmarks over the entire on-chip network with different architectures and showed the presence of self-similar phenomena in on-chip traffic flows.
Cruz (Cruz, 1991) pioneered network calculus, which is based on bounds of traffic flows. A useful family of bound functions for concise descriptions has the form $\alpha(t) = rt + b$, where r is the rate and b limits the burstiness of the flow. Building on Cruz's foundation, Chang (Chang, 2000) and Le Boudec (Le Boudec & Thiran, 2004) further developed network calculus theory and based it on min-plus algebra. The basic elements in this algebra are arrival curves as an abstraction of application traffic and service curves as an abstraction of components (network elements). A well-defined service curve is the so-called latency-rate function $\beta_{R,T}$, where R is the service rate and T the maximum response delay of the node (Stiliadis & Varma, 1998). Stochastic network calculus (Ciucu et al., 2005; Jiang, 2006; Starobinski & Sidi, 2000; Yin et al., 2002) is the probabilistic version of the (deterministic) network calculus. It has recently been developed for stochastic service guarantee analysis and combines the deterministic network calculus with statistical multiplexing. To this end, several stochastic versions of the arrival curve have been proposed by extending the concept of arrival curve to the stochastic case based on the traffic amount property or the virtual backlog property. Among the existing stochastic arrival curves, the Sum of Exponentials, Weibull Bounded Burstiness (WBB), Fractional Brownian Motion (FBM) and Multifractal Brownian Motion (MBM) envelope processes consider self-similar traffic (Mao & Panwar, 2006). In contrast to deterministic arrival curves, stochastic arrival curves envelop traffic more tightly but have higher implementation complexity.

Stochastic network calculus
In (Norros, 1995), Norros introduced the FBM model to capture the long-range dependence within self-similar traffic. This model inspired the WBB envelope process and is the basis for the FBM and MBM envelope processes (Mao & Panwar, 2006). Since the stochastic properties of the FBM process are well preserved when the traffic is multiplexed, randomly split, or passed through a buffering system, the FBM model serves well for concatenating single-hop analyses into an end-to-end analysis (Cheng et al., 2007).
We link self-similar traffic to deterministic network calculus. We develop an extended linear arrival model as its arrival curve, and then apply NetCal analysis to it. Our arrival curve is also constructed based on the FBM process. In contrast to other stochastic arrival curves, it is coupled with deterministic network calculus. Moreover, it is an extension of the traditional linear expression, and is thus easy to use and understand and simple to implement. We summarize our contributions as follows:
• We prove that self-similar traffic cannot be enveloped by any deterministic arrival curve.
• We extend the linear arrival curve $\alpha_{r,b}(t) = rt + b$ with an excess probability ε as $\varepsilon$-$\alpha_{r,b}(t) = rt + b(\varepsilon)$, where ε reflects the probability of traffic burstiness surpassing its arrival curve. We prove that self-similar traffic can be characterized by the extended linear arrival curve $\varepsilon$-$\alpha_{r,b}$.
• Based on the extended self-similar traffic model, we derive delay and backlog bounds for self-similar traffic served by one or a series of concatenated network elements. Furthermore, we give closed-form equations to compute the bounds assuming the network elements are modeled by the latency-rate server (Stiliadis & Varma, 1998).
• We present a performance analysis flow starting from self-similar traffic and ending with results of delay and backlog bounds.

Self-similarity
Let X(t) denote the traffic volume arriving in the tth time unit. Let A(t) be the cumulative process indicating the total traffic volume from time 0 up to time t. X(t) is also termed as the increment process of A(t) as X(t)=A(t) − A(t − 1).
Given a stationary time series $X = (X(t),\ t = 1, 2, 3, \ldots)$, we define the m-aggregated series $X^{(m)} = (X^{(m)}(k),\ k = 1, 2, 3, \ldots)$ by summing the original series X over non-overlapping blocks of size m. The time series process X is called asymptotically second-order self-similar (as-s) if the autocorrelation functions $r^{(m)}(k)$ of $X^{(m)}$ and $r(k)$ of X satisfy

$$\lim_{m \to \infty} r^{(m)}(k) = r(k).$$

That is, at all scales the autocorrelation structure of the aggregated series agrees asymptotically with the autocorrelation structure of the entire series X.
The crucial feature of self-similar processes is that they exhibit long-range dependence (LRD). These LRD processes have an autocorrelation function r(k) that decays with the time lag k as $r(k) \sim k^{-\gamma}$ for $k \to \infty$, where $0 < \gamma < 1$. The Hurst parameter H is commonly used to measure the degree of LRD, and is related to the parameter γ by $H = 1 - \gamma/2$. In fact, with $1/2 < H < 1$, as-s and LRD imply each other, and self-similarity and LRD are often used interchangeably in practice.
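The aggregation and autocorrelation comparison behind this definition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `aggregate` and `autocorr` and the toy series are introduced here and are not part of the chapter.

```python
def aggregate(x, m):
    """Sum x over non-overlapping blocks of size m (a trailing partial block is dropped)."""
    return [sum(x[i:i + m]) for i in range(0, len(x) - m + 1, m)]


def autocorr(x, lag):
    """Biased sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var


# Toy per-time-unit traffic volumes (hypothetical values, not from a real trace).
x = [3, 5, 4, 8, 7, 9, 2, 1, 3, 6, 8, 7]
x2 = aggregate(x, 2)        # X^(2): sums over blocks of two time units
r1 = autocorr(x, 1)         # r(1) of the original series
r1_agg = autocorr(x2, 1)    # r^(2)(1) of the aggregated series
print(x2, round(r1, 3), round(r1_agg, 3))
```

For an asymptotically second-order self-similar series, the aggregated autocorrelation approaches that of the original series as m grows; a meaningful check would need a long trace rather than this toy list.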

FBM and its envelope process
Many different models are widely used to represent self-similarity. We use Fractional Brownian Motion (FBM) (Norros, 1995) to model the cumulative input traffic A(t). The FBM input $\{A(t) : t \ge 0\}$ can be represented by

$$A(t) = \bar{a}t + \sigma Z(t),$$

where $\bar{a} = E\{A(t)/t\}$ is the mean arrival rate, $\sigma^2$ is the variance of traffic in a time unit, and $\{Z(t) : t \ge 0\}$ is the standard (normalized) FBM process with Hurst parameter $H \in [1/2, 1)$.
The basic known property of the FBM model is its marginal distribution (Norros, 1995), which allows computing an envelope process. For an FBM process A(t) with mean $\bar{a}$ and variance $\sigma^2$, the envelope process $\hat{A}(t)$ can be defined as

$$\hat{A}(t) = \bar{a}t + k\sigma t^{H},$$

where the parameter k determines the probability that A(t) exceeds $\hat{A}(t)$ at time t:

$$P\{A(t) > \hat{A}(t)\} = \Phi(k) = \varepsilon,$$

where $\Phi(y)$ is the residual distribution function of the standard Gaussian distribution. Using the approximation $\Phi(y) \approx \exp(-y^{2}/2)$, k is given by $k = \sqrt{-2\ln\varepsilon}$.
The FBM envelope process is advantageous: (1) It is parsimonious, i.e., only three parameters (ā, σ, H) are required to completely characterize a self-similar source; (2) The input parameters (ā, σ, H) can be estimated in real-time from the incoming traffic samples with minimal computational complexity (Fonseca et al., 2000).
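As a small numerical sketch of the envelope definition, the following Python snippet computes k from ε and evaluates the envelope at a given time. The function names are assumptions made for illustration, and the parameter values are borrowed from the MP3 experiment reported later in the chapter.

```python
import math


def envelope_k(eps):
    """k = sqrt(-2 ln eps), from the approximation Phi(k) ~ exp(-k^2 / 2)."""
    if not 0.0 < eps <= 1.0:
        raise ValueError("eps must lie in (0, 1]")
    return math.sqrt(max(0.0, -2.0 * math.log(eps)))


def fbm_envelope(t, a_bar, sigma, hurst, eps):
    """FBM envelope: a_bar * t + k * sigma * t**H."""
    return a_bar * t + envelope_k(eps) * sigma * t ** hurst


k = envelope_k(1e-4)
# Envelope after 10 time units for the MP3 parameters (time unit: 100 cycles).
print(k, fbm_envelope(10.0, a_bar=36.35, sigma=0.33, hurst=0.86, eps=1e-4))
```

A smaller ε gives a larger k and therefore a higher envelope, which matches the role of ε as the probability of exceeding the envelope.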

Estimation of FBM parameters (ā, σ, H)
The FBM parameters ($\bar{a}$, σ, H) can be estimated from a sample of traffic traces. To estimate $\bar{a}$ and σ, we first obtain the cumulative traffic process A(t) from the sample. The mean arrival rate is derived as $\bar{a} = E\{A(t)/t\}$, and σ, whose square is the variance of traffic in a time unit, is given by $\sigma = \sqrt{\mathrm{Var}\{A(t)\}}\,/\,t^{H}$ (Norros, 1995).
To estimate Hurst parameter H, there are a number of methods: analysis of R/S (Range/Scale, rescaled adjusted range) statistic, analysis of the variance-time plot, the Whittle estimation and analysis based on wavelet function (Park & Willinger, 2000). We adopt the R/S method summarized as follows.
Given a sample of n observations in the time series $(X_k,\ k = 1, 2, \ldots, n)$, the R/S statistic satisfies $M[R(n)/S(n)] \sim c\,n^{H}$ as $n \to \infty$, where c is a positive constant. Taking the logarithm of both sides gives $\log M[R(n)/S(n)] \sim H\log(n) + \log(c)$ as $n \to \infty$. Thus the parameter H can be estimated by plotting $\log\{M[R(n)/S(n)]\}$ against $\log(n)$ and fitting a straight line to the obtained points with the least-squares method; the slope of the line is H (Park & Willinger, 2000).
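The R/S procedure described above can be sketched as follows. This is a minimal illustration rather than the estimator used for the experiments: the function names, the choice of block sizes, and the averaging of R/S over non-overlapping blocks are assumptions. The test input is synthetic uncorrelated Gaussian noise, for which the fitted slope is expected to lie near 0.5 (with some upward bias at small block sizes).

```python
import math
import random


def rescaled_range(block):
    """R/S of one block: range of cumulative mean-adjusted sums over the standard deviation."""
    n = len(block)
    mean = sum(block) / n
    dev = w_min = w_max = sq = 0.0
    for x in block:
        dev += x - mean
        w_min = min(w_min, dev)
        w_max = max(w_max, dev)
        sq += (x - mean) ** 2
    s = math.sqrt(sq / n)
    return (w_max - w_min) / s if s > 0 else None


def hurst_rs(x, block_sizes):
    """Least-squares slope of log(mean R/S) against log(block size)."""
    pts = []
    for n in block_sizes:
        vals = [rescaled_range(x[i:i + n]) for i in range(0, len(x) - n + 1, n)]
        vals = [v for v in vals if v is not None]
        if vals:
            pts.append((math.log(n), math.log(sum(vals) / len(vals))))
    mx = sum(p[0] for p in pts) / len(pts)
    my = sum(p[1] for p in pts) / len(pts)
    sxy = sum((px - mx) * (py - my) for px, py in pts)
    sxx = sum((px - mx) ** 2 for px, _ in pts)
    return sxy / sxx


random.seed(1)
noise = [random.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic, not a real trace
h_est = hurst_rs(noise, [16, 32, 64, 128, 256, 512])
print(round(h_est, 2))
```

Applied to a per-time-unit traffic series instead of `noise`, a slope clearly above 0.5 would indicate long-range dependence.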

Self-similar traffic model $\varepsilon$-$\alpha_{r,b}$
In Theorem 1, we prove that a self-similar traffic flow cannot be bounded by any deterministic function.

Theorem 1.
For a self-similar traffic flow whose FBM envelope process is $\hat{A}(t) = \bar{a}t + k\sigma t^{H}$, there does not exist any wide-sense increasing deterministic function $\alpha(t)$ ($t > 0$) that envelops the flow.
Proof. Using reductio ad absurdum, we assume that such an $\alpha(t)$ exists, i.e.,

$$A(t) \le \alpha(t), \qquad (5)$$

where A(t) denotes the cumulative function of the self-similar traffic flow. For any specified time t, the value of $\alpha(t)$ is deterministic.
Since the self-similar flow is modeled by FBM, the concept of the FBM envelope process gives

$$P\{A(t) > \bar{a}t + k\sigma t^{H}\} = \varepsilon.$$

As $\bar{a}$ and σ are positive and $t > 0$, there exists some $\varepsilon^{*} > 0$ for which $k > \dfrac{\alpha(t) - \bar{a}t}{\sigma t^{H}}$, i.e., $\bar{a}t + k\sigma t^{H} > \alpha(t)$. Hence, with probability $\varepsilon^{*} > 0$, A(t) exceeds $\alpha(t)$ at time t, which contradicts Eq. (5). This means the assumption cannot be true, i.e., $\alpha(t)$ does not exist.
Note that, in Theorem 1, α(t) covers any deterministic arrival curve, linear and nonlinear. However, in order to use NetCal theory for performance analysis of self-similar traffic, we develop in Theorem 2 an extended arrival curve for self-similar traffic, which is an ε-enhanced linear arrival curve.

Theorem 2.
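For a self-similar traffic flow whose FBM envelope process is $\hat{A}(t) = \bar{a}t + k\sigma t^{H}$, there exists a deterministic linear arrival curve $\varepsilon$-$\alpha_{r,b}(t) = rt + b(\varepsilon)$ whose values are exceeded by the traffic flow, for any t, with the upper excess probability ε:

$$P\{A(t) > \varepsilon\text{-}\alpha_{r,b}(t)\} \le \varepsilon.$$

Proof. Since the traffic flow exceeds the arrival curve $\varepsilon$-$\alpha_{r,b}$ with the upper excess probability ε ($0 < \varepsilon \le 1$), the arrival curve must not fall below the FBM envelope, so we have

$$rt + b(\varepsilon) \ge \bar{a}t + k\sigma t^{H},$$

hence

$$(r - \bar{a})t + b(\varepsilon) - k\sigma t^{H} \ge 0. \qquad (9)$$

Since the Hurst parameter satisfies $1/2 < H < 1$, Eq. (9) can be satisfied for all t only in the stable case $r - \bar{a} > 0$, therefore $r > \bar{a}$.
To proceed further, it is sufficient to note that Eq. (9) has to be met in the worst case; therefore, the minimum value of the left side of Eq. (9) must be equal to zero (since the inequality is weak). Setting the derivative of the left side with respect to t to zero,

$$(r - \bar{a}) - kH\sigma t^{H-1} = 0,$$

gives the minimizing time $t^{*} = \left(\dfrac{kH\sigma}{r - \bar{a}}\right)^{1/(1-H)}$. Substituting $t^{*}$ into Eq. (9) with equality yields

$$b(\varepsilon) = (1 - H)\,H^{H/(1-H)}\,(k\sigma)^{1/(1-H)}\,(r - \bar{a})^{-H/(1-H)}, \qquad k = \sqrt{-2\ln\varepsilon}.$$

We can see that $b(\varepsilon)$ is a function of r ($r > \bar{a}$) and the FBM parameters ($\bar{a}$, σ, H). How closely the extended arrival curve constrains the traffic flow is sensitive to the excess probability ε, which is a measure of the majorizing precision.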
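The burstiness term can be checked numerically. The sketch below assumes the closed form obtained by minimizing the gap between the linear curve rt + b(ε) and the FBM envelope, as outlined in the proof; the function names are hypothetical, and the parameter values are the MP3 values reported in the experiments section.

```python
import math


def burstiness(r, a_bar, sigma, hurst, eps):
    """b(eps) that makes the line r*t + b touch the envelope a_bar*t + k*sigma*t**H."""
    if not (r > a_bar and 0.5 < hurst < 1.0 and 0.0 < eps <= 1.0):
        raise ValueError("need r > a_bar, 1/2 < H < 1 and 0 < eps <= 1")
    k = math.sqrt(max(0.0, -2.0 * math.log(eps)))
    e = 1.0 / (1.0 - hurst)
    return ((1.0 - hurst) * hurst ** (hurst * e)
            * (k * sigma) ** e * (r - a_bar) ** (-hurst * e))


def gap(t, r, a_bar, sigma, hurst, eps):
    """Left side of the inequality: (r - a_bar)*t + b(eps) - k*sigma*t**H."""
    k = math.sqrt(max(0.0, -2.0 * math.log(eps)))
    b = burstiness(r, a_bar, sigma, hurst, eps)
    return (r - a_bar) * t + b - k * sigma * t ** hurst


params = dict(r=37.0, a_bar=36.35, sigma=0.33, hurst=0.86, eps=1e-4)
k = math.sqrt(-2.0 * math.log(params["eps"]))
# Time at which the gap is minimal (the tangent point of line and envelope).
t_star = (k * params["hurst"] * params["sigma"]
          / (params["r"] - params["a_bar"])) ** (1.0 / (1.0 - params["hurst"]))
print(burstiness(**params), gap(t_star, **params))
```

The gap evaluates to (numerically) zero at the tangent time and is positive elsewhere, which is the worst-case condition used in the proof.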

Performance analysis
Using the proposed arrival curve, we derive delay and backlog bounds based on the concepts of arrival and service curves (Le Boudec & Thiran, 2004).

General bounds
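When a self-similar traffic flow with arrival curve $\varepsilon$-$\alpha_{r,b}$ is processed by a network element with service curve β, the maximum delay for the flow is bounded by the horizontal deviation between the two curves:

$$D \le h(\varepsilon\text{-}\alpha_{r,b}, \beta) = \sup_{s \ge 0}\,\inf\{\tau \ge 0 : \varepsilon\text{-}\alpha_{r,b}(s) \le \beta(s + \tau)\}.$$

When a traffic flow is processed by a sequence of network elements with service curves $\beta_1, \beta_2, \ldots, \beta_n$, we could simply add the maximum delays of the individual components together to obtain an end-to-end delay guarantee. However, in this case we can exploit the phenomenon known as Pay Bursts Only Once (Le Boudec & Thiran, 2004), and the end-to-end delay guarantee can be tightened to

$$D \le h(\varepsilon\text{-}\alpha_{r,b}, \beta_1 \otimes \beta_2 \otimes \cdots \otimes \beta_n),$$

where ⊗ denotes the min-plus convolution. The maximum buffer size required to buffer the traffic flow is bounded by the vertical deviation:

$$B \le v(\varepsilon\text{-}\alpha_{r,b}, \beta) = \sup_{s \ge 0}\{\varepsilon\text{-}\alpha_{r,b}(s) - \beta(s)\}.$$

When the traffic flow traverses several consecutive elements, the total required buffer space can likewise be tightened to

$$B \le v(\varepsilon\text{-}\alpha_{r,b}, \beta_1 \otimes \beta_2 \otimes \cdots \otimes \beta_n).$$

Note that, strictly speaking, the delay and backlog "bounds" should be interpreted as "estimates" of the maximum delay and backlog. Since the traffic is not entirely constrained by the arrival curve in our model due to ε, it is possible in theory that the calculated bounds are exceeded, even though this happens only in extreme cases. However, to follow the terminology used in network calculus based performance analysis, we also use "bounds" for the estimated maximum delay and backlog in the chapter.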
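The horizontal and vertical deviations used in these bounds can be approximated on a time grid. The following sketch is an illustration only: the helper name `deviations`, the grid resolution, and the example curves (a linear arrival curve with an illustrative burst of 10 flits and a latency-rate service curve with rate 100 and latency 0.2) are assumptions introduced here.

```python
from bisect import bisect_left


def deviations(alpha, beta, t_max, dt):
    """Grid approximation of the horizontal (delay) and vertical (backlog) deviations.

    alpha: arrival curve, beta: wide-sense increasing service curve.
    """
    n = int(round(t_max / dt)) + 1
    grid = [i * dt for i in range(n)]
    beta_vals = [beta(t) for t in grid]
    # Vertical deviation: largest gap alpha(s) - beta(s).
    v = max(alpha(t) - bv for t, bv in zip(grid, beta_vals))
    # Horizontal deviation: for each s, smallest tau with beta(s + tau) >= alpha(s).
    h = 0.0
    for s in grid[: n // 2]:  # keep room on the grid to search for s + tau
        j = bisect_left(beta_vals, alpha(s))
        if j == n:
            raise ValueError("t_max too small to locate the delay")
        h = max(h, grid[j] - s)
    return h, v


r, b = 37.0, 10.0      # illustrative arrival curve r*t + b
R, T = 100.0, 0.2      # illustrative latency-rate service curve R*(t - T)^+
delay, backlog = deviations(lambda t: r * t + b,
                            lambda t: R * max(t - T, 0.0),
                            t_max=2.0, dt=0.001)
print(delay, backlog)
```

For these two curves the grid values agree, up to the grid step, with the known closed forms b/R + T for the delay and b + rT for the backlog.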

Bounds for latency-rate servers
In addition to the general performance bounds, we give equations to compute the bounds assuming the latency-rate server model for network elements (Stiliadis & Varma, 1998).
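Consider a self-similar traffic flow with arrival model $\varepsilon$-$\alpha_{r,b}(t) = rt + b(\varepsilon)$ traversing a series of network elements, where each element i (i = 1, 2, 3, ..., n) guarantees a latency-rate service curve

$$\beta_{R_i,T_i}(t) = R_i\,(t - T_i)^{+},$$

where $R_i$ is the service rate, $T_i$ is the delay to serve the flow, and $(x)^{+} = \max(x, 0)$. Let $R_{\min} = \min_{1 \le i \le n} R_i$. If $r \le R_{\min}$, then the delay bound is

$$D = \frac{b(\varepsilon)}{R_{\min}} + \sum_{i=1}^{n} T_i$$

and the buffer bound is

$$B = b(\varepsilon) + r\sum_{i=1}^{n} T_i.$$

If $r > R_{\min}$, the bounds are infinite.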
We can see that, as ε ($0 < \varepsilon \le 1$) approaches 1, the backlog and delay bounds decrease. In particular, when ε equals 1, the value of $b(\varepsilon)$ is zero and the delay and buffer bounds equal $\sum_{i=1}^{n} T_i$ and $r\sum_{i=1}^{n} T_i$, respectively. The reason is that, as ε increases, more bursty traffic exceeds the arrival curve. This is similar to the effect of lowering the traffic arrival curve. Thus the computed delay and backlog bounds become smaller.

[Figure 1 (flow chart): Input: a trace file of self-similar traffic; Step 1: Estimate FBM parameters (including the analysis for the Hurst parameter); Step 2: Derive arrival curve; Step 3: Abstract network elements with service curves; Step 4: Compute delay and backlog bounds; Results: delay and backlog bounds.]
Fig. 1. Performance Analysis Flow Using Network Calculus on Self-Similar Traffic.
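The closed-form bounds for a tandem of latency-rate servers can be sketched as a small helper. The function name `latency_rate_bounds` and the burst value of 10 flits are assumptions used for illustration; the per-router parameters (rate 100, latency 0.05, four hops, time unit of 100 cycles) follow the experimental setup described later in the chapter.

```python
import math


def latency_rate_bounds(r, b, servers):
    """Delay and backlog bounds for a flow r*t + b through latency-rate servers.

    servers: list of (R_i, T_i) pairs. Returns (delay, backlog); both are
    infinite when the arrival rate exceeds the smallest service rate.
    """
    r_min = min(rate for rate, _ in servers)
    t_sum = sum(latency for _, latency in servers)
    if r > r_min:
        return math.inf, math.inf
    return b / r_min + t_sum, b + r * t_sum


routers = [(100.0, 0.05)] * 4
delay, backlog = latency_rate_bounds(r=37.0, b=10.0, servers=routers)
print(delay, backlog)

# With eps = 1 the burst term vanishes and only the latencies remain.
print(latency_rate_bounds(r=37.0, b=0.0, servers=routers))
```

The second call reproduces the limiting case discussed above: the delay reduces to the sum of the latencies and the backlog to r times that sum.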

Performance analysis flow
We illustrate the analysis flow in Figure 1. The input is a trace of self-similar traffic and the output is the delay and backlog bounds. The procedure contains four steps:
• Step 1: Estimate the FBM parameters ($\bar{a}$, σ, H) (Section 3.3). This step checks for self-similarity in the trace and performs, for example, the R/S analysis to derive the Hurst parameter H. With this step, we also obtain the cumulative process of the trace.
• Step 2: Find the FBM envelope process of the traffic, and further derive its ε-enhanced arrival model (Section 4).
• Step 3: Model the network elements with service curves.
• Step 4: Compute delay and backlog bounds for the traversal through a single node or concatenated nodes. If the service models follow the latency-rate model, we can use the closed-form equations in Section 5.2 to compute the bounds.

Experiments and results
We devised experiments to (1) validate the proposed self-similar model and (2) show the correctness and tightness of the calculated bounds by comparing them with simulated results.
With the experiments, we also exemplify the performance analysis flow.

The simulation platform
We use a simulation platform in an open source simulation environment SoCLib (SoCLib Simulation Environment, n.d.) to collect application traces and to simulate their delay and backlog in on-chip networks. As shown in Figure 2, the platform contains a MIPS R3000 processor, on-chip memories, a display component (TTY), and other components such as DSP and DMA. These components are interconnected with a 3 × 3 mesh network. The network performs wormhole flow control and uses XY routing. Routers are uniform, taking 5 cycles to deliver head flits and one cycle for other flits. Application code and data are stored in RAM3. The Network Interfaces (NIs) encapsulate transactions into flits and de-encapsulate flits into transactions.
We run four embedded multimedia programs on the MIPS: an MP3 audio decoder, an MPEG2 video decoder, a JPEG decoder and a JPEG2000 decoder. The MP3 decoder processes a 4KB audio stream, MPEG2 a 176 × 176 video frame, and JPEG and JPEG2000 a 256 × 256 image. We set up two measurement points to observe the transactions between MIPS and RAM3 in the platform, as indicated in Figure 2. While the application code runs on the processor, at Point 1 we record the sequence number and timing of flits generated by MIPS in a trace file, and at Point 2 we observe the end-to-end delay experienced by each flit after traversing four routers, {R1, R2, R3, R4}, and the system backlog.
We have performed analysis and simulation for all four application traces. For concise presentation, we only detail the analysis and simulation results of the MP3 application in Section 6.2 and Section 6.4, respectively. Section 6.3 discusses the derivation of the extended arrival curves for the MP3 application and the selection of the parameters ε and r. Nevertheless, we report both analysis and simulation results on delay and backlog for all the applications in Section 6.5. For all results, delay is given in cycles and backlog in flits. When examining the self-similarity of the traffic, we choose 100 cycles as the time window.

Analysis for MP3 application
The analysis of the MP3 application follows the four steps described in Section 5.3.
Step 1. The entire trace of the MP3 application contains 1,697,249 flits in total and lasts for 46,696 time units of 100 cycles, as drawn in Figure 3. For this 100-cycle aggregated data series, we use the R/S analysis method to derive its Hurst parameter, as illustrated in Figure 4. It turns out that H equals 0.86, which means the MP3 traffic exhibits strong self-similarity. The FBM parameters $\bar{a}$ and σ are also derived using the formulas presented in Section 3.3. We get the mean rate $\bar{a}$ = 36.35 flits/100 cycles and, for the time unit of 100 cycles, the variance parameter σ = 0.33.
Together with the MP3 cumulative process, the two curves $\varepsilon$-$\alpha_{r,b}(t)$ and $\hat{A}(t)$ are plotted in Figure 5. As we can see, the derived model $\varepsilon$-$\alpha_{r,b}(t)$ tightly bounds the cumulative process of the self-similar traffic. This validates the correctness of our proposed self-similar arrival model.
Step 3. The routers are modeled as latency-rate servers with the same service curve $\beta(t) = 100\,(t - 0.05)^{+}$, with t in units of 100 cycles, which represents that each router delays head flits for 5 cycles and forwards 100 flits per 100 cycles.
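How the four identical router service curves combine along the path can be illustrated with a discrete min-plus convolution. This is a sketch under assumptions: time is counted in whole cycles (so each router serves 1 flit per cycle after a 5-cycle latency, equivalent to the curve above), and the helper names are introduced here for illustration.

```python
def latency_rate(rate, latency, horizon):
    """Samples of beta(t) = rate * max(t - latency, 0) at t = 0, 1, ..., horizon."""
    return [rate * max(t - latency, 0) for t in range(horizon + 1)]


def min_plus_conv(f, g):
    """(f (x) g)(t) = min over 0 <= s <= t of f(s) + g(t - s), on an integer grid."""
    return [min(f[s] + g[t - s] for s in range(t + 1)) for t in range(len(f))]


router = latency_rate(rate=1, latency=5, horizon=60)  # one router, time in cycles
path = router
for _ in range(3):            # concatenate the remaining three routers
    path = min_plus_conv(path, router)

print(path[18:24])  # [0, 0, 0, 1, 2, 3]
```

The concatenated curve equals a single latency-rate curve with the same rate and a latency of 20 cycles, i.e. the sum of the per-router latencies, which is why the burst term is paid only once in the end-to-end bound.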
This means that $b(\varepsilon)$ decreases as r and/or ε increases. The relation among b, r and ε is shown in the 3D plot of Figure 6. With a small increase of r from 36.6 to 38, b approaches 0. With an increase of ε, b also decreases toward 0, but more slowly.
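We also give the delay and backlog estimates, obtained from the closed-form equations in Section 5.2 with $R_i = 100$, $T_i = 0.05$ and the four routers on the path. Delay estimate: $D = b(\varepsilon)/100 + 4 \times 0.05$ (in units of 100 cycles). Backlog estimate: $B = b(\varepsilon) + 4 \times 0.05\,r$ (in flits). From the formulas, we can see that D and B decrease as r and/or ε increases, in a similar way to $b(\varepsilon)$. We draw two 3D figures for the delay and backlog estimates in Figure 7. We can see that the three figures are similar in shape.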

Selection of ε and r
As can be observed from Figures 6 and 7, the burstiness b and the delay and backlog estimates (D and B) are very sensitive to the value of r > 36.35. Starting from r = 36.5, a small increase of r sharply reduces b, D and B. We choose r = 37, since, from this point on, the curves no longer fall quickly. With this value, we plot a 2D figure to show how the delay and backlog estimates vary with ε in Figure 8. Figure 8 clearly shows that, as ε increases from 1E-6 to 1E-1, the delay and backlog both decrease, and the decrease is sharp until ε goes beyond 1E-4. From then on, a further increase of ε affects the bounds only slightly. For smaller ε, the arrival curve allows fewer flits to exceed it, and the calculated bounds are consequently larger. "ε = 1E-4 (1 × 10^−4)" means that the tolerance for exceeding the arrival curve is one out of 10,000 flits. Note that the excess probability ε may come from application constraints. In such cases, ε is pre-determined and we only need to consider the relation between r and b.
With ε = 1E-4, we can look more closely at how the selection of the rate r influences the delay and backlog estimates, as shown in Figure 9. While varying r from 36.8 to 38, both the delay and backlog estimates decrease, and the decrease is sharp until r exceeds 37. From then on, a further increase of r affects the bounds only slightly. For smaller r, the burstiness b must be greater to guarantee that $\varepsilon$-$\alpha_{r,b}$ envelops the traffic for a given excess probability, and the calculated bounds are consequently larger. Since r = 37 is the turning point, we have chosen r = 37 for the MP3 application.
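The sensitivity to r and ε discussed in this section can be reproduced approximately with a short sweep. The sketch assumes the closed-form burstiness implied by Theorem 2 and the latency-rate bounds of Section 5.2; the function name `estimates` is hypothetical. Because the published FBM parameters are rounded, the printed numbers are close to, but not necessarily identical with, the values reported in the chapter.

```python
import math

A_BAR, SIGMA, HURST = 36.35, 0.33, 0.86   # MP3 parameters from Step 1 (rounded)
RATE, LATENCY, HOPS = 100.0, 0.05, 4      # per-router curve 100 (t - 0.05)^+, four routers


def estimates(r, eps):
    """Return (b, delay in cycles, backlog in flits) for rate r and excess probability eps."""
    k = math.sqrt(-2.0 * math.log(eps))
    e = 1.0 / (1.0 - HURST)
    b = ((1.0 - HURST) * HURST ** (HURST * e)
         * (k * SIGMA) ** e * (r - A_BAR) ** (-HURST * e))
    delay_cycles = 100.0 * (b / RATE + HOPS * LATENCY)  # time unit is 100 cycles
    backlog_flits = b + r * HOPS * LATENCY
    return b, delay_cycles, backlog_flits


print("sweep over r at eps = 1e-4")
for r in (36.8, 37.0, 37.5, 38.0):
    print(r, [round(x, 2) for x in estimates(r, 1e-4)])

print("sweep over eps at r = 37")
for eps in (1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1):
    print(eps, [round(x, 2) for x in estimates(37.0, eps)])
```

Both sweeps show the monotone behavior described above: the estimates shrink as r or ε grows, with the steepest change for r close to the mean rate.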

Simulation results of MP3 application
We present detailed simulation results for the MP3 application. Figures 11(a) and 11(b) show the delay and backlog histograms, respectively, for the entire trace. We find that the maximum delay is 24 cycles and no flit experiences a delay larger than the bound of 30 cycles, so the excess ratio equals zero. For the backlog, the observed maximum is 20 flits. There are 6 points in total exceeding the bound of 17.4 flits. The actual excess ratio equals 6/1697249 = 3.53E-6, which is far smaller than the assumed excess probability ε = 1E-4. This validates that our arrival curve with a predictive upper excess probability can bound the self-similar traffic well.

Summary of results for all applications
We summarize all calculated bounds and simulated results for the four applications, MP3, MPEG2, JPEG and JPEG2000 in Table 1, where we also list their FBM parameters and extended

Conclusion
Performance analysis techniques must properly characterize traffic flows. In this chapter, we have presented a traffic arrival model for self-similar traffic, which is a very influential category of traffic observed in various networks. This model complies with the linear arrival model, and enhances it with an additional parameter, the excess probability ε, to capture the probability of bursty traffic surpassing the linear arrival envelope. We develop such a model for two reasons. One is that, as we have proved in the chapter, self-similar traffic cannot be bounded by any deterministic function. The other is that we hope to keep the elegance of the traffic abstraction in network calculus. With such an ε-enhanced arrival curve, we have shown how to apply network calculus theory to the performance analysis of self-similar traffic flows. Assuming the latency-rate server model, we give closed-form equations for computing delay and backlog bounds for self-similar traffic traversing a tandem of network elements. We have also devised experiments to exemplify the performance analysis flow. Our simulations with real on-chip multimedia application traces have validated our model and results.
We have aimed our performance analysis of self-similar traffic at on-chip networks. However, the arrival-curve-compliant self-similar traffic model and its associated performance analysis method and formulas are equally applicable to off-chip networks, since we do not make any NoC-specific assumptions. Nevertheless, we believe our approach is most beneficial to the design of NoCs, since a NoC is a closed system focusing on specific application domains, in which traffic can be closely inspected, properly profiled and characterized.