A new 2D mesh routing approach for networks on chip

Abstract
Traditionally, embedded systems and digital electronics technology were confined to computer systems. Today, embedded systems and systems on chip are applied in a wide range of areas that use digital techniques, such as television, communication systems, radar, military systems, medical instrumentation, and consumer electronics. The interconnection between the blocks of these systems is one of the biggest development challenges. Network on Chip (NoC) is a new interconnection structure used for Systems on Chip (SoCs). It is intended to replace classic interconnections and to solve their problems. The NoC structure provides a high-performance, scalable and power-efficient communication infrastructure for connecting SoC modules. In this paper, we propose a new approach for routing in a 2D mesh topology NoC. This approach is based on a combination of a placement strategy, a modified XY routing and a communication load-based clustering technique. We show that our approach provides better latency and improved resource consumption than in the most notable


INTRODUCTION
According to the ITRS (International Technology Roadmap for Semiconductors) [1], transistor feature sizes will keep shrinking into the submicron scale, which enables the integration of more transistors on a single chip and allows integrated circuits built with this technology to be clocked faster. Following Moore's Law, the density of transistors integrated in a circuit doubles every 18 to 24 months [2], which has made it possible to implement a whole electronic system on a single chip (System on Chip, SoC). Typically, such systems integrate several complex and heterogeneous components, such as programmable processors, memories, input/output interfaces, custom hardware, peripherals, external interface IP (intellectual property) blocks, and an on-chip communication architecture that interconnects these components. These elements are used to increase the performance, reduce the cost and improve the energy efficiency of SoCs [3,4]. One of the major problems in SoC design is the interconnection structure used to carry the communications between the different system modules. Since 1990, bus-based interconnection structures have been used to interconnect the IP components of a SoC or an MPSoC system: all the components are attached to a single transport medium that allows only one communication at a time, managed by an arbiter. When the number of participating cores exceeds ten, the bus becomes a performance bottleneck due to its bandwidth limitation [5]; it begins to block traffic and no longer meets the needs of newer technology. In the next 5 to 10 years, the bus structure is expected to become too narrow and to be used less and less. To solve the performance bottleneck of the bus structure, a full crossbar interconnect has been proposed.
However, this approach brings wiring complexity to the circuit: the wiring grows with the square of the number of communicating elements, so for a large number of integrated IPs the wires can become more dominant than the logic. Another problem of the full crossbar interconnect is electromagnetic interference, which can disturb the operation of the interconnect. A point-to-point interconnect is another alternative that addresses both the performance bottleneck and the wiring complexity. However, this structure has limitations in terms of flexibility and scalability. The bandwidth limitation of the shared bus can also be addressed with a hierarchical bus system, in which a bus is connected to other buses through a bridge component. In this kind of structure, communication through the bridge becomes a bottleneck, which increases the latency when two buses communicate with each other. To overcome such problems, researchers have proposed a new interconnection architecture able to accommodate such a high number of integrated modules: the Network on Chip [6,7]. Nowadays, Networks on Chip are considered a scalable solution for on-chip communication, and in recent years they have emerged as a growing and important research field. Topology, routing algorithm and placement of modules are the key factors in NoC performance. The network topology must be defined before the routing algorithm is determined, and placement is the intermediate stage between the choice of topology and the definition of the routing algorithm. Our proposed approach is based on a combination of a clustering technique and a modified XY routing algorithm in a 2D mesh topology. We demonstrate that this approach enhances the performance of the XY routing algorithm and provides better latency, lower logic resource consumption and higher frequency.
To evaluate its performance, we compare it with static XY routing in a classic 4x4 2D mesh topology. The next section describes Network on Chip (NoC) technology and its characteristics. In section 3, we present some clustering techniques that are suitable for the 2D mesh architecture. In section 4, we present our approach. In section 5, we present the experimental results. Finally, we conclude the paper and outline future work.

NETWORK ON CHIP ARCHITECTURE
In NoC-based systems, the traditional shared-bus structure is replaced by a packet-switched communication network. The NoC is emerging as an efficient solution to the growing scalability and contention issues of on-chip communication, and it plays an important role in the performance of current VLSI systems. A NoC-based system typically consists of the following elements: intellectual property (IP) cores, network interfaces (NI), routers (also called switches or nodes) and an interconnection structure (Fig.1). An IP core can be any component, such as a microprocessor, an application-specific integrated circuit (ASIC), a memory, or a combination of components connected together. The IPs are connected to the network routers through a network interface, whose function is to assemble the data into packets before sending them from one core to another through the network, and to disassemble them before they are delivered to the destination core. The input data injected into the NI by the IP core is split into small packets. Additional information about the destination node and the path to follow is added to the header of each packet, which is forwarded hop by hop through the network according to the decision made by each router. The NI also adapts between the IP blocks and the NoC protocols. High-performance global links, which are bidirectional in most NoCs, connect the nodes and transfer data between them. The NoC consists of several routers that route a packet sent by one IP component to another according to a specified routing algorithm. Each router includes a set of communication ports that attach it to the links connecting the nodes. Figure 1 shows the interconnection architecture of a 2D mesh NoC topology, consisting of several IPs connected together via routers and regular-sized links. Recently, network-on-chip (NoC) research has focused on various aspects of on-chip networks, including topologies [8,9], routing strategies [10], flow control techniques [11], router architecture [12,13], and error detection and correction.
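The packet assembly and disassembly performed by the network interface can be illustrated with a minimal Python sketch. The `Packet` fields, the `payload_size` parameter and the sequence number used for reordering are assumptions made for illustration; the paper does not specify the packet format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Packet:
    dest: Tuple[int, ...]  # destination router address (hypothetical header field)
    seq: int               # position of this packet in the original message (assumed)
    payload: bytes         # slice of the data injected by the IP core


def packetize(data: bytes, dest: Tuple[int, ...], payload_size: int = 2) -> List[Packet]:
    """Split the data injected by an IP core into small packets with a header."""
    return [
        Packet(dest, index, data[offset:offset + payload_size])
        for index, offset in enumerate(range(0, len(data), payload_size))
    ]


def reassemble(packets: List[Packet]) -> bytes:
    """Rebuild the original data at the destination NI, whatever the arrival order."""
    ordered = sorted(packets, key=lambda p: p.seq)
    return b"".join(p.payload for p in ordered)
```

For example, `packetize(b"hello", (1, 2))` yields three packets that all carry the same destination address in their header.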

A. Topology
The NoC topology determines the physical layout and the connections between routers and links in the network, and it affects the bandwidth and latency of the network. Each topology can be characterized by a few metrics, some of which are mentioned below. The first metric is the router degree, which refers to the number of links at each node. It defines the regularity or irregularity of the topology. The second metric is the diameter, which is the largest number of links that a message traverses from source to destination along a shortest path. The third one is the maximum channel load, which means the maximum number of bits per second (bps) that every node can inject into the network before it saturates [14]. We also mention the path diversity. As shown in figure 2, many network topologies exist [15], such as mesh, torus, star, ring, butterfly, mixed and custom topologies; most of them were proposed to minimize the number of nodes and the node degree. A NoC survey shows that over 60% of NoCs employ a mesh or torus topology, because their grid-type shape and regular structure are the most appropriate for a two-dimensional layout on a chip such as a Field Programmable Gate Array (FPGA). These regular topologies provide better scalability than traditional interconnection structures. A good topology exploits the characteristics of the available packaging technology to meet the bandwidth and latency requirements of the application at minimum cost. Because of its simplicity and its prevalence in several implementations, we have chosen the mesh topology for our design. It consists of horizontal and vertical lines with routers placed at their intersections. Each router has five input/output ports: a local port to access the IP core connected to the router, and four other ports (North, South, East and West) to connect the router to its neighbors.
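The router degree and the diameter of a mesh can be checked with a short Python sketch. The function names and the (x, y) indexing of routers are assumptions made for illustration, not part of the paper.

```python
from collections import deque


def mesh_neighbors(x, y, cols, rows):
    """Routers directly linked to router (x, y) in a cols x rows mesh."""
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(nx, ny) for nx, ny in candidates if 0 <= nx < cols and 0 <= ny < rows]


def router_degree(x, y, cols, rows):
    """Number of router-to-router links at (x, y); the local IP port is not counted."""
    return len(mesh_neighbors(x, y, cols, rows))


def diameter(cols, rows):
    """Largest shortest-path hop count over all router pairs, found by BFS."""
    best = 0
    for sx in range(cols):
        for sy in range(rows):
            dist = {(sx, sy): 0}
            queue = deque([(sx, sy)])
            while queue:
                current = queue.popleft()
                for nxt in mesh_neighbors(current[0], current[1], cols, rows):
                    if nxt not in dist:
                        dist[nxt] = dist[current] + 1
                        queue.append(nxt)
            best = max(best, max(dist.values()))
    return best
```

For a 4x4 mesh, `diameter(4, 4)` returns 6 (corner to opposite corner), and an inner router such as (1, 1) has degree 4, which together with the local port gives the five ports mentioned above.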
The router plays an important role in the performance of NoC systems, since it routes each packet to the correct destination according to a specified routing algorithm. The router structure in a NoC depends on the network topology. Most routers are composed of buffers, a crossbar, a routing unit and an arbiter unit. In the mesh topology, except for the routers on the network boundaries, each router has five active ports: one is connected to the local IP, while the others are connected to the neighbouring routers (North, South, East and West). In the mesh topology, the number of IP cores is the same as the number of routers. A router mainly consists of four principal elements (Fig.3). The first is the buffers (FIFOs), which are located at each router input and temporarily store the transmitted information (packets). The second is the crossbar, which is the heart of the router; its role is to connect the input channels to the output channels using the address found in the packet header. The third is the routing unit, which ensures the switching function [16] and is responsible for decoding the information provided by the incoming message. The last is the arbiter unit, which determines which packet gets priority when multiple packets from different sources compete for the same interconnection link. The main arbitration policies that can be used in a NoC architecture are Time Division Multiple Access (TDMA), round-robin and fixed priority. Once the network topology has been determined, the next step is to choose an appropriate routing algorithm, which is responsible for determining the path of a packet from the source to the destination.
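Of the arbitration policies listed above, round-robin is the simplest to illustrate. The following Python sketch is a behavioural model only; the class name and the request encoding (one boolean per input port) are assumptions, and the paper does not state which policy its arbiter unit implements.

```python
class RoundRobinArbiter:
    """Grants one requesting input port per call, rotating the priority."""

    def __init__(self, n_ports):
        self.n_ports = n_ports
        # Start as if the last port had just been served, so port 0 is checked first.
        self.last = n_ports - 1

    def grant(self, requests):
        """requests[i] is True when input port i wants the output link.

        Returns the granted port index, or None when nobody is requesting.
        """
        for offset in range(1, self.n_ports + 1):
            port = (self.last + offset) % self.n_ports
            if requests[port]:
                self.last = port  # the port just served gets lowest priority next time
                return port
        return None
```

With five ports and permanent requests on ports 0 and 2, successive calls to `grant` return 0, 2, 0, and so on, so neither source can starve the other.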

B. Routing
The choice of the routing algorithm depends on several objectives, such as minimizing the power required for routing, minimizing logic and routing tables to achieve a lower area, increasing performance by reducing delay, and maximizing the traffic utilization of the network. Several routing algorithms can be used in a NoC. Whichever algorithm is selected, its purpose is to ensure that every data packet correctly reaches its destination. Routing algorithms can be classified into several categories, such as static or dynamic routing, distributed or source routing, and minimal or non-minimal routing.

CLUSTERING
Clustering algorithms, which divide the network into sets of nodes under certain conditions, are used in many network problems. In particular, clustering improves the efficiency of the routing protocol by reducing the control traffic in the network and simplifying the data switching process. Clustering algorithms are designed to satisfy certain goals depending on the context in which the clustering is deployed (routing, security, energy conservation), and several studies [17,18,19] have set out to classify the existing clustering approaches. These algorithms differ in characteristics such as the number of nodes in each cluster, the average number of clusters formed in the network, and the distances between nodes and cluster-heads. The clustering technique consists of organizing the network nodes into virtual groups called "clusters", where nodes are grouped into the same cluster according to certain rules (distance, application, synchronization, communication load ...). A cluster generally contains three types of nodes: head nodes, gateway nodes and ordinary nodes (Fig.4). In our case we only use two types (head and ordinary). The clustering technique is a promising solution for networks based on programmable FPGA circuits. From the literature, we distinguish several clustering mechanisms [17,18,19,20], which can be based on several criteria. One classification is based on the cluster-head selection. Another takes into account the number of hops that separate an ordinary node from the cluster-head to which it is attached. In the following, we present the criteria for the classification of existing clustering approaches.

Radius of clusters: the maximum distance between a cluster member and its cluster-head. There are three classes of algorithms (1 hop, k hops and variable).
Metric for selecting cluster-heads: the cluster-head is chosen according to several criteria. It can be chosen randomly (without a metric), according to its position relative to the other nodes, or according to the communication load.
Cluster characteristics: the average number of nodes managed by a cluster-head, the average number of clusters formed in the network, and the sum of the distances between nodes and cluster-heads.
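The criteria above can be computed for a given clustering with a few lines of Python. This is a sketch under stated assumptions: nodes are (x, y) positions in a mesh, distance is counted in hops (Manhattan distance), and the dictionary layout and function name are hypothetical.

```python
def cluster_characteristics(clusters):
    """Summarise a clustering.

    clusters maps each cluster-head position (x, y) to the list of positions
    of the nodes it manages (the head itself included).
    """
    sizes = [len(members) for members in clusters.values()]
    radius = 0       # largest member-to-head distance over all clusters
    sum_dist = 0     # sum of the distances between nodes and their cluster-head
    for head, members in clusters.items():
        for node in members:
            hops = abs(node[0] - head[0]) + abs(node[1] - head[1])
            radius = max(radius, hops)
            sum_dist += hops
    return {
        "n_clusters": len(clusters),
        "avg_size": sum(sizes) / len(sizes),
        "radius": radius,
        "sum_dist": sum_dist,
    }
```

For two 2x2 clusters whose heads sit in a corner of their cluster, the radius is 2 hops and each cluster contributes 0 + 1 + 1 + 2 = 4 to the sum of distances.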

OUR APPROACH
The main objective of our approach is to minimise the average latency, the energy consumption and especially the logic resource consumption. To improve these parameters, we propose a modified XY routing algorithm and a placement technique for 2D mesh-based NoCs, combined with an appropriate clustering technique. We choose a 4x4 2D mesh network as an instance of the NoC architecture (a popular size in many applications [21,22]). As shown in figure 5, the whole network is divided into four 2x2 2D mesh subnetworks using a clustering technique, where the most communicating nodes are grouped in the same cluster. The communication load density between each pair of nodes is calculated according to the chosen application. Each router is identified by its coordinates (c, x, y), where c is the cluster number, x is the horizontal coordinate and y is the vertical coordinate. Inside each cluster there are two types of routers: ordinary routers and a cluster-head router. Each ordinary router has three input/output ports (one to the local IP and two to neighboring routers), a set of input FIFOs, a crossbar, and a routing and arbiter unit. The cluster-head router has the same architecture as an ordinary one, but it has five input/output ports and adds a routing table to the routing unit.
The flow of our approach is as follows (Fig.6):
Step 1: calculate the communication load density between each pair of nodes for the chosen application, in order to identify the most communicating nodes in the system.

Steps 2 and 3: given the communication load, we place the most communicating cores in the same cluster, where the number of nodes in each cluster does not exceed a certain threshold. In our case, each cluster has 4 routers: three ordinary routers and one cluster-head router. The clusters are arranged so as to minimise inter-cluster communication. For example, cluster 0 communicates with clusters 1 and 2 more than with cluster 3, which is why we place clusters 1 and 2 closer to cluster 0 than cluster 3; the same applies to the other clusters.
Step 4: a cluster-head is selected for each cluster. The cluster-heads hold supplementary information about routing in the whole network in the form of a routing table, while the other members have only an XY routing unit. The role of the cluster-head is to manage the communication between the members of different clusters. Its selection can be based on various criteria. In our case, we choose as cluster-head the node that has the smallest communication load density in the cluster. In doing so, we reduce the amount of traffic that passes through the cluster-head and minimise the chances that it breaks down.
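Steps 1 to 4 can be sketched in Python as follows. This is only an illustration under assumptions: the paper does not give its grouping procedure, so a simple greedy heuristic is used here, the `traffic` dictionary (directed volume per core pair) is a hypothetical input format, and the load of a node is taken as its total traffic with all other nodes.

```python
from itertools import combinations


def pair_load(traffic, a, b):
    """Step 1: communication load between two cores, both directions summed."""
    return traffic.get((a, b), 0) + traffic.get((b, a), 0)


def build_clusters(nodes, traffic, max_size):
    """Steps 2 and 3 (assumed greedy heuristic): seed each cluster with the
    heaviest remaining pair, then add the core that talks most to the cluster,
    until the size threshold is reached."""
    remaining = set(nodes)
    clusters = []
    while remaining:
        if len(remaining) == 1:
            clusters.append([remaining.pop()])
            break
        a, b = max(combinations(sorted(remaining), 2),
                   key=lambda pair: pair_load(traffic, pair[0], pair[1]))
        cluster = [a, b]
        remaining -= {a, b}
        while len(cluster) < max_size and remaining:
            best = max(sorted(remaining),
                       key=lambda n: sum(pair_load(traffic, n, m) for m in cluster))
            cluster.append(best)
            remaining.remove(best)
        clusters.append(cluster)
    return clusters


def select_head(cluster, nodes, traffic):
    """Step 4: pick the member with the smallest communication load."""
    def total_load(n):
        return sum(pair_load(traffic, n, m) for m in nodes if m != n)
    return min(cluster, key=total_load)
```

With the 4x4 mesh of the paper, `max_size` would be 4. Placing the resulting clusters next to each other according to their inter-cluster traffic is not covered by this sketch.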
Step 5: the last step is the selection of the routing algorithm. Our routing algorithm has two levels: intra-cluster and inter-cluster routing. At the intra-cluster level, the chosen routing algorithm is a modified XY routing, where each router is identified by its coordinates (c, x, y). This algorithm begins by comparing the current router address (Cc, Cx, Cy) to the destination router address (Dc, Dx, Dy) of the packet, which is stored in the header flit. First, it compares the current router cluster Cc to the destination router cluster Dc. If they are equal, the chosen algorithm is XY routing: the packet is routed to the West port when Cx > Dx and to the East port when Cx < Dx. When Cx = Dx, the packet has reached the destination column and Dy is compared to Cy: when Cy > Dy the packet is routed to the North, and when Cy < Dy it is routed to the South. When Cc ≠ Dc, the packet is routed to the cluster-head. The cluster-head has information about the other cluster-heads, and all this information is held in a routing table. This algorithm ensures deadlock-free routing. At the inter-cluster level, we use a table-based routing algorithm that adds some information about the cluster-heads (Tab. 1).
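The per-router decision described in Step 5 can be sketched in Python. Several points are assumptions, since the original listing is not available: a non-head router is assumed to steer a packet toward its own cluster-head with the same XY comparison, the x and y coordinates are assumed comparable within a cluster, and the table lookup at the cluster-head is represented by a placeholder return value because the table contents are not given.

```python
def route(current, dest, head_xy):
    """Return the output port chosen by the router at `current`.

    current, dest: (c, x, y) addresses as defined in the paper.
    head_xy: (x, y) of the cluster-head of the current cluster (assumed known
    to every router of that cluster).
    """
    cc, cx, cy = current
    dc, dx, dy = dest

    if cc != dc:
        if (cx, cy) == head_xy:
            # Cluster-head: inter-cluster forwarding uses its routing table,
            # whose contents are not specified here.
            return "TABLE"
        tx, ty = head_xy      # assumption: reach the cluster-head by XY routing
    else:
        tx, ty = dx, dy       # same cluster: plain XY routing to the destination

    if cx > tx:
        return "WEST"
    if cx < tx:
        return "EAST"
    # Column reached: compare the vertical coordinate.
    if cy > ty:
        return "NORTH"
    if cy < ty:
        return "SOUTH"
    return "LOCAL"            # arrived: deliver to the local IP
```

For instance, a router at (0, 1, 1) forwarding to (0, 0, 1) in the same cluster selects the West port, while the same router forwarding to another cluster first moves the packet toward its cluster-head.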

Tab. 1: Source routing table for node 010
Comments on the routing algorithm (Fig.7):
1. If the source node and the destination node are in the same cluster, the packet is sent directly to the destination node using XY routing, without passing through the cluster-head.
2. Otherwise, if the source node and the destination node are in different clusters:
- The source node sends the packet to its cluster-head (CsH).
- The packet is forwarded to the cluster-head of the destination cluster, which checks whether it is itself the destination node.
- If it is the destination node, it delivers the packet to its IP. Otherwise, it sends the packet to the destination node.
This approach provides improvements in terms of:
- Power consumption, which is minimised by reducing the overall routing area.
- Latency, which is improved because the most communicating nodes are in the same cluster.
- Resource usage, which is more efficient because fewer logic resources are needed.

EXPERIMENTAL RESULTS
We conducted experiments to evaluate the performance of our approach and to compare it with the static XY routing algorithm in a classic 4x4 2D mesh topology, which is widely used in NoCs. We implemented our approach in VHDL at the structural Register Transfer Level (RTL) and synthesized it using the Xilinx ISE Design Suite 14.2 tool. The network was prototyped on a Virtex-5 XC5VLX50-3FF324. Figure 8 shows that, for a packet size of 8 bits, an ordinary router uses 15.44% fewer slice LUTs and 18% fewer slice registers than the cluster-head router. For a packet size of 16 bits, an ordinary router uses 16.8% fewer slice LUTs and 20.7% fewer slice registers than the cluster-head router. The resource utilization experiments use 8-bit data (Fig.9). The difference between the classic 4x4 mesh and the proposed 4x4 cluster-based mesh (the same router using, respectively, the static XY and the proposed modified XY routing algorithm) is about 49 registers and 406 LUTs: the classic 4x4 NoC uses 10.8% more registers and 12.1% more slice LUTs than the cluster-based 4x4 NoC. Figure 10 shows that the maximum frequency of the classic 4x4 mesh NoC is slightly higher, by 2.3%, than that of the proposed approach.
Figure 10: Comparison of the maximum frequency.

CONCLUSION
In this paper we presented an approach to manage communication in a 2D mesh NoC through a combination of routing, placement and clustering techniques. The aim of the approach is to reduce the average latency and the logic resource cost, while respecting the constraints on bandwidth and on the power consumed by network communication. We have shown that our approach reduces the time needed to detect and report failures in links and/or routers, while optimizing resource use. Experimental results show that, compared to the classic 2D mesh topology using static XY routing, significant performance improvements can be achieved with the proposed approach, at an acceptable additional cost in the cluster-head router. As future work, we plan to apply our approach to various multimedia benchmark applications.