Scalable THz Network-On-Chip Architecture for Multichip Systems

While THz wireless network-on-chip (WiNoC) offers considerably high bandwidth, its high path loss prevents communication between far-apart nodes, especially in a multichip architecture. In this paper, we introduce a cellular and scalable architecture that reuses frequencies across the chips. Moreover, we use a novel structure called the parallel-plate waveguide (PPW) that is suitable for interchip communication. The low-loss property of this waveguide lets us increase the number of chips. Each chip has a wireless node that serves as a gateway for communicating with other chips. To shorten the intra- and interchip THz links, the optimum configuration is determined with a multiobjective simulated annealing (SA) algorithm. Finally, we compare the performance of the proposed THz multichip NoC with a conventional millimeter-wave one. Our simulation results indicate that when the system scales up from four to sixteen chips, the throughput of our design decreases by about 5.8%, while for the millimeter-wave NoC this reduction is about 21%. Furthermore, the average latency of our system grows by only 1%, compared with an increase of about 40% for the millimeter-wave NoC.


Introduction
The recently emerging intrachip communication infrastructure for multicore systems-on-chip (SoCs) is the network-on-chip (NoC), in which a router-based, packetized communication infrastructure is used [1]. The number of processing elements integrated on a chip has increased rapidly in recent years [2]. Designing such a large multicore chip usually results in a lower yield. Aggregating multiple moderately smaller dies within a package can provide the functionality of a large chip and at the same time offers significant advantages in terms of higher yield and better packing of rectangular dies on a round wafer [3]. Consequently, a suitable interchip communication infrastructure that meets the bandwidth and latency requirements is needed.
As predicted by ITRS, the pitch of conventional I/O interconnects such as peripheral component interconnect (PCI) is not scaling as fast as the gate lengths or the pitch of on-chip interconnects [4]. This creates a gap in the density and performance of conventional I/O systems relative to on-chip interconnections. In addition, off-chip interconnections pose new design challenges such as crosstalk and signal integrity issues, as well as switching between the off-chip and on-chip communication protocols [4]. Although integrated inter- and intrachip photonics is a promising solution to the off-chip interconnection challenges of conventional I/O, the pitch of photonic interconnects does not scale well [5]. Another solution is vertically integrated monolithic 3D ICs [6], which require sophisticated thermal management techniques and suffer from low yields due to vertical misalignment of the layers.
multicore chips. In other words, in a WiNoC, multihop wired paths between far-apart cores are replaced by high-bandwidth, long-range, single-hop wireless links. Consequently, the reduction in average hop count leads to better network performance and to lower average packet latency and power consumption [7]. While several studies have been conducted on wireless interconnect paradigms for intrachip communication [7,8,10], interchip wireless communication is still in its early stages. In particular, the demand for a scalable multichip system persists. We therefore apply the cellular idea at THz frequencies to propose such a scalable multichip system.
In this paper, we propose a wireless communication backbone at THz frequencies that enables data exchange between cores in a single chip as well as between different chips in a large multichip system. We investigate the performance of the proposed multichip WiNoC from the viewpoint of scalability by scaling the system size up from four to sixteen chips. While the largest network reported in previous papers, to the best of our knowledge, connects up to nine chips [4], our sixteen-chip network behaves similarly to the eight- and four-chip systems. While THz communication can provide high data rates, the considerable attenuation at such high frequencies limits the data link length. Therefore, the cores equipped with wireless transceivers inside each chip should be as near to each other as possible. To find the optimum configuration for every chip, it is thus necessary to consider a trade-off between reducing the total number of hops and shortening the wireless links. We reach this goal by adopting a multiobjective simulated annealing (SA) metaheuristic algorithm. Since the distance between the chip gateways may be fairly large, we also propose a new low-loss waveguide for exchanging data between different chips. The interference between different wireless links that use a shared medium is another aspect that must be addressed. Different multiple access techniques such as frequency division multiple access (FDMA), code division multiple access (CDMA), wavelength division multiple access (WDMA), and orthogonal frequency division multiple access (OFDMA) have been reported for different kinds of NoCs in the literature [7,12-18]. FDMA and WDMA have drawbacks such as a bandwidth demand that grows with the core count and the need for many oscillators operating at different frequencies [15].
On the other hand, the CDMA technique, in which multiple transmitters can send data simultaneously through orthogonal codes, has some difficulties. The major drawback of this technique is the synchronization problem between multiple transmitters, which may result in low reliability [19]. Among the aforementioned schemes, OFDMA offers advantages that have encouraged researchers to use it as the multiple access technique for NoCs [15-22]. We therefore use OFDMA-based wireless communication for intrachip wireless links as well as interchip ones. It is worth noting that, to obtain a scalable design, we use the cellular idea in which the frequency bands are reused repeatedly by the chips.
We can summarize our contributions as follows:
(i) We introduce a broadcast-enabled scalable wireless NoC for multichip systems
(ii) We propose a proper low-loss waveguide for interchip communication at THz frequency
(iii) We introduce a novel concept, which we call cellular architecture, in which the frequency can be reused by different chips
(iv) We adopt multiobjective SA to determine the optimal number of wireless nodes as well as their best positions
The rest of this paper is organized as follows. The related work is discussed in Section 2. Section 3 describes the intra- and interchip wireless interconnects, including their architectures and medium access control mechanisms. This section also discusses the details of the multiobjective optimization process for finding the optimum configuration from the viewpoint of wireless node count and position. The simulation results of the proposed network are investigated in Section 4, followed by the conclusion in Section 5.

Related Work
Different interconnect techniques such as photonic interconnects [5], wireless interconnects [4], capacitive- or inductive-based interconnects [23], and vertical 3D integration [6] are used to connect chips within a multichip system. Among these, the wireless technique offers advantages that have encouraged researchers to use it for intrachip as well as interchip connections [4]. Shamim et al. [4] presented the design of a seamless hybrid wired and wireless interconnection network for multichip systems with on-chip wireless transceivers. The authors used a small-world wireline configuration for the intrachip interconnect and a token-based millimeter-wave wireless link architecture for interchip communication. The maximum number of chips in this reference is four. Such a structure does not scale, because as the number of chips increases, the wireless nodes spend more time waiting for the token, and the network performance degrades considerably.
Ganguly et al. [24] discussed the advances and challenges of interconnecting a multichip system with millimeter-wave wireless interconnects including interconnection topology, physical layer, medium access control (MAC), and routing protocols.
The design of an energy-efficient, seamless wireless interconnection network at mm-wave frequencies for multichip systems with in-package memory stacks has been proposed in [25]. The authors considered a multicore system with four 16-core processing chips connected to four in-package DRAM memory stacks. In this reference, instead of using a token-based MAC protocol, each wireless interface broadcasts a control packet at the beginning of its transmission.
This control packet contains information about the number of flits, the destination wireless interface, and the packet ID. Using this control packet reduces the available bandwidth and increases the packet latency. Ahmed et al. [26] proposed large-scale utilization of the abundant interposer resources for multichip integration by implementing a hypercube interconnection architecture in an interposer for chip-to-chip communication. Again, the authors considered a 64-core system with four smaller chips and four in-package DRAM memory stacks. Because of the complexity of such a 2.5D structure, it is not suitable for massive many-core architectures.
Saxena et al. [27,28] used graphene-based wireless links to enable a phase-based communication protocol, instead of the token-based MAC protocol, to create a THz wireless interconnection fabric for multichip systems. This reference considered a multichip system with three different sizes, namely, a single chip, 4 chips, and 9 chips, in which every chip contains 64 cores. Operating in the THz frequency band is the main difference between this work and the other works mentioned in this section. Our approach differs from these works in two aspects. First, using the cellular idea, we provide a scalable architecture in which the number of chips can easily be increased to sixteen and more. Second, instead of propagation in free space, we use a proper low-loss waveguide for interchip THz communication as well as a proper placement algorithm for the wireless routers, which considerably reduces the power consumption.
Different medium access mechanisms in WiNoCs from simple token passing based protocol [10] to more complicated CDMA or OFDMA mechanisms have been proposed. Zhao and Wang [29] suggested a collision-free QoS-aware MAC protocol for wireless NoC. Actually, this MAC protocol permits only one node to send the packets at each time, and therefore, there is no simultaneous communication. Ganguly et al. [7] presented an FDM-TDM combination technique for sharing the common channel between wireless nodes. However, this MAC technique does not lead to high enough bandwidth utilization. CDMA is another technique that has been widely used in this context [30,31]. Nevertheless, such a technique may suffer from the synchronization problem between multiple transmitters.
To alleviate the drawbacks of the abovementioned MAC techniques, researchers have been encouraged to use the OFDMA method in WiNoCs [32]. Unlu and Moy [15] proposed an OFDMA-based wired on-chip RF interconnect as an efficient MAC method. In this work, a bandwidth arbitration mechanism that allocates more bandwidth to long packets carrying cache lines has been presented. It should be noted that the OFDMA-based MAC technique was applied for simultaneous communication between the cores of a single chip. In contrast to this reference, which is restricted to a single-chip architecture, we use an OFDMA-based scheme for simultaneous communication in a multichip system.

Intra- and Interchip Wireless Interconnects
Since the propagation loss of THz waves is quite high, it is necessary to reduce the length of wireless THz links as much as possible. Here we describe a new design for interchip communication as well as an optimization method that determines both the optimum number of wireless nodes and their proper positions for the intrachip interconnect. In this section, we first describe the topology of the THz multichip WiNoC and then concentrate on designing the intra- and interchip interconnects separately.
3.1. Topology. We consider a three-level hierarchical 16384-core (16K) THz WiNoC in which every 16 adjacent cores form a subnet, as can be seen in Figure 1. The cores of each subnet are connected in a Star-Ring topology to a hub to form the first level of the hierarchical network. There are 64 hubs on each chip, connected in a mesh architecture to form the second level. The details of this level are presented in Section 3.2. At the third level, the proposed system consists of sixteen one-kilo-core chips, and each chip has one node that serves as the gateway for communicating with other chips, as will be described in Section 3.3. It is worth noting that a feasibility study of implementing chips with many cores has been addressed in [33].
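The hierarchy above can be made concrete with a small addressing sketch. The flat core numbering and the function name `locate_core` are assumptions introduced for illustration only; the paper does not define an addressing scheme.

```python
CORES_PER_SUBNET = 16   # cores attached to one hub (level 1)
HUBS_PER_CHIP = 64      # hubs connected in a mesh on each chip (level 2)
CHIPS = 16              # chips in the system (level 3)

CORES_PER_CHIP = CORES_PER_SUBNET * HUBS_PER_CHIP
TOTAL_CORES = CORES_PER_CHIP * CHIPS


def locate_core(core_id):
    """Map a flat core index to (chip, hub, local core) indices.

    Hypothetical numbering: cores are counted subnet by subnet, chip by chip.
    """
    if not 0 <= core_id < TOTAL_CORES:
        raise ValueError("core_id out of range")
    chip, within_chip = divmod(core_id, CORES_PER_CHIP)
    hub, local = divmod(within_chip, CORES_PER_SUBNET)
    return chip, hub, local
```

With these constants, `CORES_PER_CHIP` evaluates to 1024 and `TOTAL_CORES` to 16384, matching the one-kilo-core chips and the 16K-core system described in the text.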
3.1.1. Cellular Architecture. One of the most challenging problems in extending many-core systems is the bandwidth limitation. When the number of wireless nodes increases, the bandwidth assigned to each node decreases, and therefore the throughput of the network decreases. In this paper, we introduce a novel concept of cellular architecture in which the frequency can be reused. The network is divided into four four-chip clusters, each of which can use the whole bandwidth. To avoid interference, the chips within a cluster should use different frequencies, and so should the neighboring chips of two adjacent clusters. Therefore, if the entire usable bandwidth is divided into four parts F_1, F_2, F_3, and F_4, the proposed scalable architecture has an arrangement such as that in Figure 2. As can be seen in this figure, there is at least a one-chip distance between any two chips operating at the same frequency. Because of the extremely high loss of THz waves, this distance is enough to avoid interference between chips that share an operating frequency. In addition, each chip has a gateway to communicate with other chips. This THz communication is performed by means of a low-loss structure called the parallel-plate waveguide (PPW). This waveguide is one of the most suitable structures for transmitting THz waves because of its low-loss nature [34]. In addition, the electromagnetic waves propagating through the PPW are completely isolated from the intrachip waves. So, the whole bandwidth can be reused for communication between different chips without any interference. This interchip interconnect is described in detail in Section 3.3.
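The reuse rule can be checked with a short sketch. It assumes that Figure 2 tiles a fixed 2 × 2 pattern of F1 to F4 over the 4 × 4 chip grid; the exact arrangement in the figure is not recoverable from the text, so the pattern and the helper names below are illustrative assumptions.

```python
BANDS = ("F1", "F2", "F3", "F4")


def band_of(row, col):
    """Band of the chip at (row, col), assuming a tiled 2 x 2 pattern."""
    return BANDS[2 * (row % 2) + (col % 2)]


def min_cochannel_gap(rows, cols):
    """Smallest number of chips separating two same-band chips.

    The gap is the Chebyshev distance between the two grid positions minus one.
    """
    chips = [(r, c) for r in range(rows) for c in range(cols)]
    gaps = []
    for i, (r1, c1) in enumerate(chips):
        for r2, c2 in chips[i + 1:]:
            if band_of(r1, c1) == band_of(r2, c2):
                gaps.append(max(abs(r1 - r2), abs(c1 - c2)) - 1)
    return min(gaps)
```

Under this assumed tiling, `min_cochannel_gap(4, 4)` returns 1, that is, any two chips sharing a band have at least one other chip between them, and every pair of directly adjacent chips uses different bands.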

Intrachip Interconnect.
As previously mentioned, each chip contains 1024 cores that are divided into 64 subnets. Each subnet is formed by connecting 16 cores to a hub, and the hub itself can be equipped with a wireless interface. Using wireless links can significantly decrease the latency and power dissipation of the network. This improvement is achieved by replacing the multihop paths between far-apart cores with a high-bandwidth single-hop wireless link. Nevertheless, the interference between different wireless links that use a shared medium should be addressed. To prevent interference, multiple access techniques can be used for such simultaneous communication. One of the major design challenges with wireless interconnects is the design of an efficient medium access control mechanism, as discussed below. Another aspect that should be addressed is attenuation, especially when the operating frequency lies in the THz band. The smaller the distance between wireless hubs, the lower the attenuation and, consequently, the power consumption. Therefore, in designing such a THz WiNoC, it is necessary to consider a trade-off between reducing the total number of hops and shortening the wireless links. In other words, because of space, power, and cost limitations as well as the interference problem, it is crucial to determine both the optimum number of wireless hubs and their proper positions. To reach the optimum solution, we adopt the multiobjective SA metaheuristic algorithm, as described below. The data packets are divided into smaller parts called flits. The header flit carries the routing and control information as it travels through the network. Moreover, we adopt wormhole routing for data transmission. Packets have two strategies for reaching the destination. In the first strategy, the packets use the wireless nodes nearest to the sender and the receiver.
In other words, the packet is forwarded to the nearest wireless hub to be broadcast to other wireless nodes. Afterward, according to the destination address, one of the wireless nodes receives the packet and forwards it to the packet destination. In the second strategy, the packets route only through wired links. Clearly, the path with a minimum number of hops would be chosen by the traveling packets as the shortest path.
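The choice between the two strategies can be sketched as a hop-count comparison. The sketch assumes hubs addressed by (row, column) on a mesh with Manhattan-distance wired routing and a single wireless hop between any two wireless hubs; the function names and hub coordinates are hypothetical.

```python
def mesh_hops(a, b):
    """Hop count between two hubs on a 2D mesh (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def best_route(src, dst, wireless_hubs):
    """Return (hops, strategy) for the shorter of the wired and wireless routes."""
    wired = mesh_hops(src, dst)
    # Nearest wireless hub to the sender and to the receiver.
    entry = min(wireless_hubs, key=lambda h: mesh_hops(src, h))
    exit_hub = min(wireless_hubs, key=lambda h: mesh_hops(dst, h))
    # Wired hops to the entry hub, one broadcast hop, wired hops from the exit hub.
    air = 1 if entry != exit_hub else 0
    wireless = mesh_hops(src, entry) + air + mesh_hops(exit_hub, dst)
    if wireless < wired:
        return wireless, "wireless"
    return wired, "wired"


# Hypothetical hub placement on an 8 x 8 mesh, for illustration only.
hubs = [(0, 0), (0, 7), (7, 0), (7, 7), (3, 3), (4, 4)]
far = best_route((0, 1), (7, 6), hubs)    # (3, "wireless")
near = best_route((2, 2), (2, 3), hubs)   # (1, "wired")
```

Far-apart hubs benefit from the wireless shortcut, while neighboring hubs stay on the wired mesh, which is the behavior the two strategies are meant to produce.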

Antenna and Transceiver.
The on-chip antennas are an important component of the THz WiNoC, and high efficiency with low area overhead is desired. Although each wireless node should broadcast its packets to the others, a single omnidirectional antenna cannot provide efficient data communication because of the high path loss in the THz band as well as the weak coupling of planar antennas. So, we adopt the graphene antenna array proposed in [28,35], which is capable of beam-steering in the THz band. This architecture leverages the unique plasmonic properties of the devices to considerably simplify the array design. Each radiating element of the array consists of the source, the modulator, and the antenna, which share the same active graphene layer. The resonance of the nanoantenna occurs at the surface plasmon polariton (SPP) wavelength, so the nanoantenna is much smaller than a regular patch antenna for the same frequency. In addition, these small nanoantennas can be packed into a very dense array because the mutual coupling of the radiating elements depends on the SPP wavelength and not on the free-space wavelength [35]. Thus, with a very small area footprint, the power output of the plasmonic array as well as its beamforming capabilities can be comparable with those of a similarly designed conventional array. The details of the source, the modulator, and the single antenna design have been presented in [35]. So, in this paper, we concentrate on the antenna array design for one of the wireless nodes in the corner of the chip, specifically the gateway of the configuration Type 3 shown in Figure 3(c). It is worth noting that the plasmonic front end has single-antenna beamforming capability. The radiation pattern depends on both the modulator and the nanoantenna, and the modulator response varies with the Fermi energy of graphene.
So, this architecture can reconfigure the radiation pattern using a single control line for all the modulator blocks, without altering the feed points or increasing the design complexity. Nevertheless, in this work, we assume that all of the individual antennas have a similar pattern with a half-power beamwidth (HPBW) of about 30 degrees for the main lobe [35]. We use a 3 × 6 array of such antennas to increase the power output as well as to concentrate the beam toward the receivers. The antenna elements are spaced one-quarter of a wavelength apart. The total radiation pattern of an array antenna is the product of the pattern of an individual element and the pattern of the array assuming point sources, called the array factor. Figure 4 shows the 2-D and 3-D illustrations of the calculated array factor for the gateway of the configuration Type 3. To reduce sidelobes, we taper the currents of the individual antennas according to the Chebyshev tapering technique, as shown in Figure 5. The 3-dB beamwidth of the array factor is about 75 degrees and the final beamwidth of the array is about 29 degrees, which is sufficient for broadcasting packets to the other wireless nodes on the chip. It is worth noting that the antenna arrays for the other wireless nodes can be designed similarly, according to their desired patterns.
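The array factor of a planar array with point sources can be evaluated directly from its definition. The following is a minimal sketch, assuming a 3 × 6 grid in the x-y plane, uniform excitation by default, and element spacing expressed in wavelengths; it does not reproduce the Chebyshev taper or the beam steering used in the paper, although a taper can be passed in through the `weights` argument.

```python
import cmath
import math


def array_factor(theta, phi, nx=3, ny=6, spacing=0.25, weights=None):
    """Normalized |AF| of an nx-by-ny planar array lying in the x-y plane.

    theta is measured from the array normal (z axis) and phi from the x axis,
    both in radians; spacing is the element pitch in wavelengths.
    weights[m][n] are non-negative current amplitudes (uniform if omitted).
    """
    if weights is None:
        weights = [[1.0] * ny for _ in range(nx)]
    kd = 2 * math.pi * spacing
    ux = math.sin(theta) * math.cos(phi)
    uy = math.sin(theta) * math.sin(phi)
    total = sum(
        weights[m][n] * cmath.exp(1j * kd * (m * ux + n * uy))
        for m in range(nx)
        for n in range(ny)
    )
    norm = sum(sum(row) for row in weights)
    return abs(total) / norm
```

The total pattern in a given direction is then the element pattern multiplied by `array_factor(theta, phi)`, which is why the resulting beam can be narrower than either factor alone.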

Medium Access Control Mechanism.
As previously mentioned, all of the wireless nodes use a shared channel (medium) to transmit their packets. So, multiple access techniques (MACs) should be used to prevent interference in such simultaneous communication. Among all kinds of multiple access techniques, OFDMA shows significant properties such as high spectral efficiency, broadcast capability, robustness against fading and interference, and the capability of capacity allocation [36].
OFDM is based on the well-known technique of frequency division multiplexing (FDM). Unlike FDM, in which each channel is separated from the others by a frequency guard band to reduce interference between adjacent channels, OFDM makes use of a large number of closely spaced orthogonal subcarriers [36]. Consequently, OFDM shows better spectral efficiency than the FDM scheme. In addition, an FDM receiver needs a separate oscillator for each channel to receive all signals, but thanks to the maturity of the FFT (fast Fourier transform), an OFDM receiver can receive the signals of all nodes without any additional oscillator. As a result, the OFDM scheme is suitable for broadcasting data. In this work, for communication at 1 THz, an entire bandwidth of 100 GHz (at most 10 percent of the central frequency) is considered. Therefore, as a rough estimate that ignores the cyclic prefix of the OFDM blocks, a modulation scheme with 8 levels (such as 8QAM) gives a maximum data rate of 300 Gbits/s. This bandwidth, however, must be divided among the 4 chips of each cluster, as mentioned in Section 3.1. So, the wireless nodes belonging to one chip can together use a data rate of up to 75 Gbits/s.
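The estimate above can be reproduced in a few lines. The sketch assumes one symbol per second per hertz of bandwidth and no cyclic prefix or coding overhead, in line with the rough estimate in the text; the function name is illustrative.

```python
import math


def ofdm_peak_rate(bandwidth_hz, constellation_size):
    """Peak rate in bit/s: log2(M) bits per symbol, one symbol per Hz per second."""
    return bandwidth_hz * math.log2(constellation_size)


total = ofdm_peak_rate(100e9, 8)   # 8QAM over the full 100 GHz band
per_chip = total / 4               # band shared by the 4 chips of a cluster
```

Here `total` comes to 300 Gbit/s and `per_chip` to 75 Gbit/s; a real link would fall below these values once the cyclic prefix and pilot subcarriers are taken into account.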

Optimizing the Number of Wireless Interfaces and Placement.
The optimal number of wireless nodes and their best positions have a significant effect on the THz WiNoC performance. As mentioned above, it is necessary to consider a trade-off between reducing the total number of hops and shortening the wireless links. In other words, we should solve a two-objective problem and determine both the optimum number of wireless hubs and their proper positions. To reach the optimum solution, we adopt a multiobjective SA metaheuristic algorithm [7].
This algorithm models the physical process of heating a material and then slowly lowering the temperature to decrease defects, thereby minimizing the system energy.
We should minimize the following two objectives simultaneously:
(1) The total number of hops between all source and destination hubs
(2) The distance between the farthest wireless hubs
It should be noted that the first objective minimizes the intrachip network latency and the second objective decreases the maximum power consumption of the THz wireless nodes. To minimize the power consumption at the interchip level, the gateways should be placed at the minimum distance from each other. Accordingly, there are three configuration types:
(i) Type 1: the gateway is placed at the center of the chip (Figure 3(a))
(ii) Type 2: the gateway is placed on the sidewall of the chip (Figure 3(b))
(iii) Type 3: the gateway is placed at the corner of the chip (Figure 3(c))
Therefore, the optimization procedure should be repeated for each type of configuration. Nevertheless, the number of wireless nodes is taken to be the same for all types.
First, we assign a probability P_i that the ith node has a wireless interface. P_i can be calculated as follows: where H_i is the sum of the distances (counted in number of hops) between the ith node and the others, L_i is proportional to the link loss between the ith node and the gateway, and w is the weighting coefficient. In addition, d_ij is the number of hops between the ith and jth nodes, N is the total number of hubs, and d_ig is the number of hops between the ith node and the gateway. Moreover, k is the atmospheric absorption coefficient corresponding to free-space propagation of THz waves, Δd is the distance between two adjacent nodes in meters, and finally, d_min indicates the minimum distance from the gateway in a chip. As can be seen, P_i is directly related to the distance between the ith node and the others. Furthermore, it is inversely related to the path loss between the node and the gateway. It should be noted that for free-space propagation of THz signals with omnidirectional antennas, the received power (P_r) and the transmitted power (P_t) can be related as follows: where c is the speed of light, f is the frequency, and d is the distance between transmitter and receiver. As can be seen, only the last part in equation (4) is distance-dependent. So, we have considered this term as the loss metric in the optimization as well as in equation (3). With n_w as the predefined number of wireless hubs, the shortest path between each pair of nodes is determined. The SA algorithm updates the positions of the wireless nodes to simultaneously minimize the normalized total number of hops and the normalized distance between the two farthest wireless hubs.
In other words, the SA algorithm should optimize the following function: where H_t is the normalized total number of hops, d_ij^w is the number of hops between the ith and jth nodes when n_w wireless nodes exist, and d_ij^0 is the number of hops between the ith and jth nodes when there are no wireless nodes. Moreover, L_max is proportional to the link loss between the two farthest wireless hubs, d_m is the distance between them (counted in units of Δd), and d_max indicates the distance, in units of Δd, between the farthest hubs on the chip. For an 8 × 8 chip, d_max = 9.89 (the diagonal of a grid with seven steps per side, 7√2 ≈ 9.9).
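The free-space relation referred to as equation (4) did not survive in this copy of the text. A commonly used form for THz links with isotropic antennas combines spreading loss with molecular absorption; it is given here as an assumption, not as the authors' exact expression:

```latex
\frac{P_r}{P_t} = \left(\frac{c}{4\pi f}\right)^{2} \frac{e^{-k d}}{d^{2}}
```

In this form the first factor depends only on frequency and the last factor carries all of the distance dependence, which is consistent with the remark that only the last part is distance-dependent. With k = 0 it reduces to the Friis free-space loss, which at f = 1 THz and d = 14 mm evaluates to roughly 55 dB.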
At the beginning of the SA algorithm, the n_w nodes are randomly selected according to the probabilities calculated in equation (1), and the optimization metric (equation (5)) is calculated accordingly. At the next iteration, a wireless node is randomly replaced by a new one. If the metric of the new configuration is lower than the metric of the current network, the new network is accepted as the new solution. If the new metric is higher, the SA algorithm may still accept the worse solution [7]. The threshold for selecting the worse configuration is shown in equation (8). By accepting networks that raise the metric, the algorithm avoids being trapped in local minima and is able to explore globally for more possible solutions: where H_t and L_max are the current metrics, H_t′ and L_max′ are the new metrics, and T is the temperature. An annealing schedule is selected to systematically decrease the temperature as the algorithm proceeds. As the temperature decreases, the algorithm reduces the extent of its search to converge to a minimum. At each iteration, the temperature decreases by a factor of 0.9. The optimization results for different numbers of wireless hubs as well as for the three configuration types are shown in Figure 6. It should be noted that these figures have been calculated with w = 0.6. In addition, a standard medium with 1% of water vapor molecules is considered. To determine the optimum number of wireless hubs, the values of H_t and L_max should be considered simultaneously, as shown in Figures 6(a) and 6(b). As n_w increases, the maximum loss factor (for Type 3) increases considerably, from 0.086 for n_w = 3 to 0.383 for n_w = 10. On the other hand, for a small number of wireless hubs, the total number of hops is higher, which results in large latency.
Therefore, as a trade-off between power consumption, latency, and area overhead, we consider n_w = 6 as the optimum number of wireless hubs. Figure 7 illustrates H_t and L_max versus different values of the weighting coefficient for n_w = 6. According to these figures, the optimum normalized number of hops varies from 0.66 for w = 1 (minimizing only latency) to around 0.95 for w = 0 (minimizing only loss). With w = 0.6, however, H_t decreases markedly and comes close to its minimum value. On the other hand, for w = 1, the loss factor (L_max) increases to a value between 0.49 and 0.59, which is clearly high in comparison with the values of around 0.178 to 0.235 for w = 0.6. So, the proposed optimization algorithm appears to have found the optimum configurations properly. Figure 3 shows the optimum configurations for the three predefined types. It is worth noting that, depending on the chip position, the gateway is placed on the appropriate sidewall for Type 2 or corner for Type 3. The wireless hubs are rearranged according to the selected corner or sidewall.
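The placement procedure can be illustrated with a compact simulated annealing sketch. Several parts are assumptions rather than the paper's method: the objective is taken as the weighted sum w·H_t + (1 − w)·L_max, the loss term is replaced by the normalized distance between the two farthest hubs (without absorption), the initial hubs are drawn uniformly instead of from the probabilities P_i, no gateway position is fixed, and the acceptance test is the standard Metropolis rule. Only the 8 × 8 mesh, n_w = 6, w = 0.6, and the cooling factor of 0.9 are taken from the text.

```python
import math
import random

N = 8  # hubs per side of the mesh
NODES = [(r, c) for r in range(N) for c in range(N)]
D_MAX = math.hypot(N - 1, N - 1)  # corner-to-corner distance in hub pitches


def hops(a, b):
    """Wired hop count between two hubs (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


WIRED_TOTAL = sum(hops(a, b) for a in NODES for b in NODES)


def cost(hubs, w):
    """Weighted sum of normalized hop count and normalized farthest-hub distance."""
    nearest = {n: min(hops(n, h) for h in hubs) for n in NODES}
    total = sum(
        min(hops(a, b), nearest[a] + 1 + nearest[b])  # wired vs. one wireless hop
        for a in NODES
        for b in NODES
    )
    h_t = total / WIRED_TOTAL
    spread = max(math.dist(p, q) for p in hubs for q in hubs) / D_MAX
    return w * h_t + (1 - w) * spread


def anneal(n_w=6, w=0.6, iterations=200, t0=1.0, cooling=0.9, seed=0):
    """Return (best hub positions, best cost) found by simulated annealing."""
    rng = random.Random(seed)
    current = rng.sample(NODES, n_w)
    current_cost = cost(current, w)
    best, best_cost = list(current), current_cost
    temperature = t0
    for _ in range(iterations):
        candidate = list(current)
        free = [n for n in NODES if n not in current]
        candidate[rng.randrange(n_w)] = rng.choice(free)  # move one hub
        candidate_cost = cost(candidate, w)
        delta = candidate_cost - current_cost
        # Metropolis rule: keep improvements, sometimes keep worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
        temperature *= cooling
    return best, best_cost
```

Calling `anneal()` returns six hub coordinates and the associated cost. Sweeping `w` from 0 to 1 in such a sketch shows the same qualitative trade-off discussed above: larger `w` pulls the hubs apart to cut hops, smaller `w` clusters them to shorten the longest wireless link.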

Area Overheads.
The area overhead of the proposed architecture is about 0.4 mm^2 per 3 × 6 graphene antenna array. The area overheads of the transceiver circuits are negligible in comparison with the antenna array [28]. In our architecture, there are 6 such antenna arrays in a single chip with 64 hubs, which amounts to 2.4 mm^2 of area per chip. In other words, 0.6% of the area of a typical chip of size 400 mm^2 is required for the antenna arrays.

Interchip Interconnect.
Wireless links at THz frequencies exhibit high losses when communicating between far-apart chips, which limits the number of chips in a multichip system. To overcome this drawback, we transmit the THz signals through a suitable medium instead of propagating them in free space.

Parallel-Plate Waveguide.
The PPW is one of the most suitable structures for transmitting THz waves because of its low-loss nature [37]. In addition, the PPW wave spreads out like an expanding cylinder, whereas the free-space wave expands spherically. As a result, the wave power in a PPW decreases at a rate of (1/d), compared with the decay rate of (1/d^2) for free-space wave propagation. In addition, since the interchip traffic is carried through this dedicated layer, the interchip communication channel is electromagnetically isolated from the rest of the system. Moreover, the parallel plates create a closed space where no energy leakage is allowed. Therefore, this structure offers robustness and design flexibility, although it adds fabrication cost and increases the overall volume of the system [37]. Figure 8 shows that the proposed PPW is placed directly beneath the Si layer. The PPW consists of two parallel metallic plates separated by a small distance h. We assume that the space between the two metallic plates is empty. The top plate can be considered as the ground of the chip. Since the packets of each chip should be broadcast to the other chips, there is no need for a directional antenna. Therefore, we can use a probe as an omnidirectional monopole antenna that serves as the gateway for each chip. Table 1 lists the physical and operating parameters of the proposed PPW. To illustrate the performance of this waveguide, we have calculated the transmission characteristics of the structure for different distances using computer simulation technology (CST) software [38]. In this simulation, we have put 9 monopole antennas at different distances from the transmitting antenna to fully investigate the attenuation of the PPW at 1 THz. Supposing the chip size is 20 mm × 20 mm and the spacing between two adjacent chips is 10 mm, the maximum distance between two antennas is about 100 mm.
It is worth noting that the gateway antenna of each chip should be placed at the optimum position, as described in the previous section, to minimize its distance from the other gateways. Figure 9 shows the insertion losses of guided waves in the PPW for different distances between gateway antennas in comparison with the free-space losses for the same distances. According to this figure, the guided waves in the PPW experience far lower losses than free-space waves. This low-loss THz-compatible medium can be used for developing scalable multichip systems.
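The difference between cylindrical and spherical spreading can be made concrete with a short calculation. The following is a minimal sketch that considers geometric spreading only; it ignores conductor loss, antenna coupling, and the absolute loss at the reference distance, so it does not reproduce the simulated values of Figure 9. The function name and the 1 mm reference distance are assumptions introduced for illustration.

```python
import math

def spreading_loss_db(d_mm: float, d_ref_mm: float, exponent: int) -> float:
    """Geometric spreading loss in dB relative to a reference distance.

    exponent=1 models cylindrical spreading (power ~ 1/d, as in the PPW);
    exponent=2 models spherical spreading (power ~ 1/d^2, free space).
    """
    return 10.0 * exponent * math.log10(d_mm / d_ref_mm)

D_REF_MM = 1.0  # hypothetical reference distance, chosen only for illustration

for d in (10.0, 50.0, 100.0):
    ppw = spreading_loss_db(d, D_REF_MM, exponent=1)
    free = spreading_loss_db(d, D_REF_MM, exponent=2)
    print(f"d = {d:5.1f} mm: PPW {ppw:5.1f} dB, free space {free:5.1f} dB")
```

Under these assumptions, the spreading loss in dB grows half as fast in the PPW as in free space, which is the qualitative reason the guided link tolerates the longest gateway-to-gateway distance of about 100 mm.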

Broadcast-Enabled Multichip System.
Again, OFDM modulation is employed for interchip communication. Thanks to the broadcast nature of the OFDM scheme, each gateway broadcasts its packets, and the other gateways receive them according to the destination address. Since the electromagnetic waves propagating through the PPW are completely isolated from the intrachip waves, the entire bandwidth of around 100 GHz can be reused for communication between different chips without any interference. In this work, for communication at 1 THz, the entire bandwidth of 100 GHz (at most 10 percent of the central frequency) is considered. Therefore, as a rough estimate, if a modulation scheme with 64 levels (such as 64QAM, which carries 6 bits per symbol) is employed, we can reach a maximum data rate as high as 600 Gbits/s. On the other hand, this bandwidth should be divided between the 16 gateways. So, each gateway can transmit packets with a maximum data rate of 37.5 Gbits/s.
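The rough data-rate estimate can be checked with a few lines of arithmetic. This is a minimal sketch under an idealized assumption of one symbol per second per hertz with no coding, pilot, or guard-interval overhead; the variable names are introduced here for illustration only.

```python
import math

BANDWIDTH_HZ = 100e9      # usable bandwidth around 1 THz (from the text)
CONSTELLATION_SIZE = 64   # 64QAM
NUM_GATEWAYS = 16         # one gateway per chip in the 16-chip system

bits_per_symbol = math.log2(CONSTELLATION_SIZE)  # 6 bits for 64QAM

# Idealized assumption: 1 symbol/s per Hz, no coding or guard overhead.
aggregate_rate_bps = BANDWIDTH_HZ * bits_per_symbol
per_gateway_rate_bps = aggregate_rate_bps / NUM_GATEWAYS

print(f"aggregate:   {aggregate_rate_bps / 1e9:.1f} Gbit/s")
print(f"per gateway: {per_gateway_rate_bps / 1e9:.1f} Gbit/s")
```

The two printed values correspond to the 600 Gbits/s aggregate rate and the 37.5 Gbits/s per-gateway rate quoted above.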

Power Estimation for Proposed Intra- and Interchip Wireless Links.
In this subsection, we first estimate the power consumption of the graphene-based intrachip wireless links used in our proposed multichip architecture. According to Figure 3, the longest intrachip wireless link is 14 mm, and the path loss for such a link is around 55 dB [39]. On the other hand, the designed antenna array has a maximum directivity of around 9.5 dB, which we can steer in the direction of the longest wireless link. Furthermore, Figure 10 shows the bit error rate (BER) versus SNR for an 8QAM OFDM-based receiver. Accordingly, an SNR of 19.4 dB is assumed for our calculations, as it provides a BER of less than $10^{-8}$. Now, we can calculate the power consumed by the transmitter following [28]. The thermal noise power can be calculated by $P_N = kTB$, where k is the Boltzmann constant, T is the absolute temperature, and B is the bandwidth. Based on modeling estimates of the antenna design in [35], the reflection coefficient (S11) is −21.3 dB. Consequently, the maximum estimated power consumed by each wireless node for intrachip communication is around −7.5 dBm, or 177 microwatts. Now, we can investigate the probability of interference between two nodes with the same frequency. As can be seen in Figure 2, the distance between the nearest stations with the same frequency is about 50 mm. The path loss for such a distance at 1 THz is about 70 dB [39]. On the other hand, we tapered the current of the array elements according to the Chebyshev tapering technique to reduce the sidelobes by at least 30 dB. So, if the thermal noise power at the receivers is −74 dBm, the maximum SNR at the nearest station with the same frequency as the transmitter node is about SNR = −14.5 dB, which means the received power is considerably lower than the thermal noise level. Therefore, interference between same-frequency stations is expected to be negligible.
Considering an omnidirectional monopole antenna with a directivity of around 1.5 dB for interchip communication through the PPW and a maximum distance of about 100 mm between the gateways, which corresponds to an insertion loss of 58 dB, the maximum power consumed by each gateway is around 1.45 mW.
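The interference estimate for same-frequency intrachip nodes can be reproduced with a simple link budget. The sketch below is one budget that yields the quoted −14.5 dB, taking the noise floor as −74 dBm and assuming that the array directivity counts at both the transmitter and the receiver. The temperature of 290 K and the bandwidth of 10 GHz used for the $kTB$ check are assumptions that are not stated in the text; they are chosen only because they give a noise floor close to −74 dBm.

```python
import math

BOLTZMANN = 1.380649e-23  # J/K

def thermal_noise_dbm(temperature_k: float, bandwidth_hz: float) -> float:
    """Thermal noise power kTB expressed in dBm."""
    return 10.0 * math.log10(BOLTZMANN * temperature_k * bandwidth_hz / 1e-3)

# Assumed values (not given in the text): T = 290 K, B = 10 GHz.
noise_check_dbm = thermal_noise_dbm(290.0, 10e9)

# Values quoted in the text.
NOISE_FLOOR_DBM = -74.0
tx_power_dbm = -7.5
directivity_db = 9.5
sidelobe_suppression_db = 30.0
path_loss_db = 70.0  # about 50 mm at 1 THz

# Assumption: directivity applies at both ends, and the interferer is
# seen through a sidelobe that is 30 dB below the main lobe.
rx_power_dbm = (tx_power_dbm + 2 * directivity_db
                - sidelobe_suppression_db - path_loss_db)
snr_db = rx_power_dbm - NOISE_FLOOR_DBM

print(f"kTB check:      {noise_check_dbm:.1f} dBm")
print(f"received power: {rx_power_dbm:.1f} dBm")
print(f"SNR:            {snr_db:.1f} dB")
```

With these inputs the interfering signal arrives at −88.5 dBm, which is 14.5 dB below the noise floor.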

Experimental Evaluation
We modified Noxim [8], a cycle-accurate simulator, to simulate the wireless network-on-chip architecture developed in the previous sections. Each experiment was run for ten thousand cycles to reach stable results, and the first thousand cycles were discarded to eliminate transients. All the digital components run at a clock frequency of 1 GHz. The input, output, and wireless ports have four virtual channels, each with a buffer depth of 4 flits.
The width of all wired links is 32 bits, which is the same as the flit size in this paper. Each packet consists of 4 flits. The parameters used in the simulation are shown in Table 2. While the basic routing algorithm within a chip is XY, we change the virtual channel assigned to each packet at the entrance to the wireless ports.
This is necessary to prevent deadlock, since the wireless links can disturb the XY algorithm and create cycles in a mesh topology. As previously mentioned, to access the wireless channel in a distributed fashion without any collision, we leverage an OFDM-based MAC mechanism for intrachip and interchip communication. In this MAC mechanism, the usable bandwidth is divided between the wireless nodes, and therefore all nodes can broadcast their packets simultaneously.
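The combination of XY routing with a virtual-channel change at the wireless ports can be sketched as follows. This is an illustrative sketch only: the text does not specify how the virtual channels are reassigned, so the split into a lower class (before a wireless hop) and an upper class (after it), as well as the names `Packet`, `xy_next_hop`, and `enter_wireless_port`, are hypothetical.

```python
from dataclasses import dataclass

NUM_VCS = 4  # virtual channels per port (from the text)

@dataclass
class Packet:
    dst: tuple[int, int]  # destination (x, y) within the mesh
    vc: int = 0           # hypothetical split: VCs 0-1 before a wireless hop, 2-3 after

def xy_next_hop(cur: tuple[int, int], dst: tuple[int, int]) -> tuple[int, int]:
    """Dimension-ordered XY routing: correct X first, then Y."""
    x, y = cur
    if x != dst[0]:
        return (x + (1 if dst[0] > x else -1), y)
    if y != dst[1]:
        return (x, y + (1 if dst[1] > y else -1))
    return cur  # already at the destination

def enter_wireless_port(pkt: Packet) -> None:
    """Move the packet to the upper VC class when it takes a wireless link."""
    if pkt.vc < NUM_VCS // 2:
        pkt.vc += NUM_VCS // 2

# Usage: a packet leaves a wireless hub at (0, 0) and continues over wires.
pkt = Packet(dst=(2, 1), vc=1)
enter_wireless_port(pkt)
hop = xy_next_hop((0, 0), pkt.dst)
print(pkt.vc, hop)
```

Separating the traffic before and after the wireless hop into disjoint virtual-channel classes is a common way to break the cyclic channel dependencies that a shortcut link can add to an otherwise deadlock-free XY mesh.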

Performance Evaluation.
To evaluate the performance of the proposed THz multichip NoC, we consider three networks of different sizes, namely, systems of four, eight, and sixteen one-kilo-core chips. In this section, we suppose that the cores generate random traffic. Figure 11 illustrates the throughput of the proposed multichip system for different numbers of chips. As expected, for lower packet injection rates (PIRs), the throughput of the three networks is similar. On the other hand, as the size of the network increases, the throughput decreases slightly, especially for high packet injection rates.
This behavior is expected because, when the number of chips increases, the data rate of the wireless links between the gateways is reduced, and therefore the competition for interchip communication rises. Consequently, the long channel reservation by the head flits can prevent other packets from being transmitted from one chip to another. Figure 12 compares the average latencies of the proposed THz multichip NoC for different sizes. According to this figure, for packet injection rates below approximately 0.01 packet/core/cycle, all three networks show an acceptably low latency. Beyond this PIR threshold, the latency curve grows suddenly because of the long waiting time for the reserved channels to be released. Again, the proposed architecture, including its MAC mechanism, keeps this latency growth under control when the network size is increased from 4 to 8 and 16 chips. Finally, the power consumed during the simulation time for different numbers of chips is shown in Figure 13.

Comparative Performance Evaluation.
In this section, we compare the performance of the proposed THz multichip NoC with that of a conventional wireless multichip NoC that operates at millimeter-wave frequencies, namely, 60 GHz, in terms of peak achievable throughput, average flit latency, and power consumption. To avoid collisions in this millimeter-wave NoC, a token-based medium access mechanism is used for both intra- and interchip wireless communications. In addition, since both intra- and interchip wireless links use a common medium for transmitting the packets, the entire available bandwidth should be divided between them. So, we expect the overall performance of our THz design, in which the intra- and interchip communication channels are separated, to be better. It is worth noting that this millimeter-wave multichip NoC is architecturally similar to that presented in reference [4]. For a fair comparison, its other parameters, such as the clock frequency, the number of cores per chip, and the number of chips, are set to the same values as in the proposed THz multichip system.

Performance Evaluation with Varying System Size.
As previously mentioned, reusing the frequencies at chips that are far enough apart is an important step toward a scalable network. In addition, interchip communication in the THz frequency band can provide enough bandwidth to achieve this scalability. To investigate this idea, we compare the performance of our design with the abovementioned noncellular mm-wave WiNoC when the system scales up. Figures 14-16 illustrate the peak achievable throughput, average flit latency, and power consumption, respectively, of the two networks for different numbers of chips. In this subsection, we suppose that all of the cores generate random traffic. It can be observed that, when the system scales up, the proposed THz multichip NoC still has a high enough throughput. In other words, while the throughput reduction of our proposed network when scaling up from four to sixteen chips is about 5.8%, the mm-wave NoC shows a considerable reduction of 21%. Moreover, unlike the token-based mm-wave network, the average latency of our design does not increase suddenly. According to Figure 15, the mm-wave network experiences a latency growth of about 40%, whereas for our design this increase is only about 1%. It is worth noting that one of the major parts of the power consumption is the static contribution, which is independent of the network workload. Therefore, when the number of chips is increased from 4 to 8 and 16, we expect the power consumption to grow significantly.

Performance Evaluation with Nonuniform Traffic.
In this section, we analyze the peak achievable throughput, average flit latency, and power consumption of the proposed THz multichip system with nonuniform traffic patterns, namely, hotspot and transpose synthetic traffic, and compare them with those of the token-based multichip system. For the hotspot case, 5% of the traffic generated by all cores has the same destination, which can be chosen randomly [4,40]. The destination of the other 95% of the packets generated by a core is chosen randomly. In the case of transpose traffic, the destination of the packets generated by each core is the core diametrically opposite to it in the whole system. It is worth noting that in all the following results, the number of chips has been set to 16 and PIR = 0.01. Figures 17-19 illustrate the peak achievable throughput, average flit latency, and power consumption of the two networks for the different traffic distributions. It can be observed that our proposed structure performs better in terms of throughput and latency for all types of traffic patterns. However, due to the high propagation loss at THz frequencies, as well as its higher throughput and therefore the larger number of flits delivered to their destinations, the proposed THz multichip NoC consumes more power, especially for the hotspot and transpose cases.
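The two synthetic patterns described above can be sketched as destination generators. This is a minimal illustration and not the traffic code of the simulator: the function names are hypothetical, cores are addressed by (x, y) coordinates on a single global grid, and "diametrically opposite" is read here as a mirror through the centre of that grid, which is an assumption.

```python
import random

def transpose_destination(src, width, height):
    """Assumed reading of 'diametrically opposite': mirror through the grid centre."""
    x, y = src
    return (width - 1 - x, height - 1 - y)

def hotspot_destination(src, width, height, hotspot, rng, hotspot_fraction=0.05):
    """With probability hotspot_fraction send to the hotspot, otherwise uniformly."""
    if src != hotspot and rng.random() < hotspot_fraction:
        return hotspot
    while True:
        dst = (rng.randrange(width), rng.randrange(height))
        if dst != src:  # a core does not send packets to itself
            return dst

# Usage on a hypothetical 16 x 16 grid of cores.
rng = random.Random(0)
print(transpose_destination((0, 3), 16, 16))
print(hotspot_destination((0, 3), 16, 16, hotspot=(8, 8), rng=rng))
```

Note that the uniformly chosen share of the traffic can also land on the hotspot by chance, so the measured fraction of packets that reach the hotspot is slightly above 5%.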

Comparison with Related Works.
In this section, we compare our architecture with the three main works cited earlier that address multichip systems. Since the number of cores per chip differs among these references, we calculate the total peak bandwidth per chip. Table 3 gives the details of the related works in comparison with the architecture proposed in this paper. According to this table, the proposed THz multichip system provides a higher total bandwidth for each chip. In addition, it has a moderate average packet latency in comparison with the others for PIR = 0.01.

Conclusion
In this paper, we proposed an OFDM-based cellular THz multichip NoC in which the intra- and interchip communications are separated by means of a THz-compatible parallel-plate waveguide. In addition, to obtain a scalable architecture, the frequencies are reused by chips that are separated by a sufficient distance to cause minimal interference with each other. Due to the high propagation loss of THz waves, it is necessary to consider a trade-off between reducing the total hop count and shortening the wireless links when determining the positions of the wireless nodes on a chip. To do this, we adopted a multiobjective SA metaheuristic algorithm. Finally, we evaluated the performance of the proposed THz multichip NoC in comparison with a conventional token-based mm-wave multichip system. According to this comparison, our design satisfies the scalability property. In other words, when the network scales up from four to sixteen chips, the proposed THz NoC shows a reduction of 5.8% in throughput and an increase of only about 1% in average latency. It is worth noting that, for the same increase in the size of the multichip NoC, the throughput of the mm-wave NoC is reduced by about 21% and its average latency grows by about 40%.
For future work, we will consider more than one gateway for each chip. In addition, we plan to employ routing algorithms other than XY for intrachip communication, especially ones compatible with such a multigateway system.

Data Availability
No data were used to support this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.