Area efficient in-plane nanomagnetic multiplier and convolution architecture design

In this study, we propose, for the first time to the best of the authors' knowledge, a nanomagnetic logic (NML) based 2 bit multiplier architecture design. The proposed complex combinational logic design (nanomagnetic multiplier) is built by exploiting shape and positional hybrid anisotropy together with the ferromagnetically coupled fixed input majority gate. Subsequently, we combine the proposed multiplier architecture with the NML adder architecture to introduce an NML based convolution architecture design that is efficient in terms of the number of nanomagnets, majority gates and clock cycles. The proposed NML design yields ∼21%–72%, ∼26%–42%, ∼36%–63% and ∼20%–68% reductions in the required number of nanomagnets, majority gates, clock cycles and energy, respectively, compared to the state-of-the-art designs.


Introduction
Quantum-dot Cellular Automata (QCA) are nanoscale devices that are physically implemented by combining the discrete nature of quantum mechanics with cellular automata. Fabrication of QCA by exploiting the interaction among magnetic nanoparticles has gained momentum owing to its room temperature implementation. Since its first demonstration by Cowburn et al [1], Magnetic Quantum-dot Cellular Automata (MQCA) based in-plane Nanomagnetic Logic (NML) has been emerging as a rebooting computing platform [2] with the prospect of complementing CMOS devices in the domain of spintronics [3][4][5][6][7][8]. Following this, Csaba et al demonstrated that nanomagnets can be used for information propagation [9]. Subsequently, the first universal majority logic gate (MLG) implementation was shown by Imre et al [10] using oval shaped inputs and elongated drivers clocked by the applied field from magnetic force microscopy (MFM) tips [11]. The first demonstration of an NML full adder was then shown by Varga et al [12] using fanout and interconnects.
Subsequently, researchers started exploiting the shape (S) anisotropy [13,14] of the nanomagnets, leading to optimization of the arithmetic circuit implementation [15]. Furthermore, the positional (P) anisotropy of the nanomagnets [16] has been exploited, aiding misalignment-free design. In continuation, Li et al [16] have shown a 1 bit full adder implementation using 45 degree aligned input nanomagnets.
In parallel, researchers demonstrated approximate arithmetic computing using NML [17]. In our recently reported works, we utilized the SP hybrid anisotropy [18] of nanomagnets to design a 1 bit full adder. Subsequently, we proposed an optimized NML adder implementation by exploiting the physical analogy of the ferromagnetically coupled fixed input majority gate (FMG) [19] and further extended it to perform runtime reconfigurable approximate arithmetic computation [20,21].
In view of the above, both from the material and architecture point of view the adder architecture design has been the focus till date. The main objective of this study is to explore how these basic building blocks can be extended further to implement complex arithmetic logic using NML.
The emerging demand for resource constrained implementation [8] of computationally intensive convolutional neural networks (CNN) on the edge requires an area and energy efficient architecture design for performing the convolution operation. The authors believe that MQCA based logic computing devices would play a major role in such designs. Motivated by this, in a first attempt, we propose an efficient NML based nanomagnetic multiplier architecture design along with the convolution architecture.
The rest of the study is organized as follows: section 2 presents the proposed design and analysis, followed by its discussion and conclusion in section 3.

Proposed design and analysis
The working principle of nanomagnetic majority logic is depicted in figures 1(a)-(e). When two nanomagnets are placed side by side, they exhibit antiferromagnetic data propagation (inverter), and when placed one above another, they enable ferromagnetic data propagation. When both antiferromagnetic and ferromagnetic data propagation are coupled in a single system, the result is a majority logic gate in which the state of the middle compute nanomagnet is determined by the combined influence of inputs 1, 2 and 3 (cf figures 1(d), (e)).
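The logical behaviour described above can be sketched in a few lines of Python. This is only a Boolean abstraction under stated assumptions: the function names are hypothetical, states are encoded as 0 and 1, and any polarity inversion introduced by the physical coupling inside a real gate is not modelled.

```python
def majority(x1: int, x2: int, x3: int) -> int:
    """Return the value held by at least two of the three binary inputs."""
    return 1 if (x1 + x2 + x3) >= 2 else 0


def antiferro_neighbour(x: int) -> int:
    """Side-by-side (antiferromagnetic) coupling: the neighbour takes the opposite state."""
    return 1 - x


def ferro_neighbour(x: int) -> int:
    """One-above-another (ferromagnetic) coupling: the neighbour takes the same state."""
    return x


if __name__ == "__main__":
    # Print the full majority truth table for the three inputs.
    for x1 in (0, 1):
        for x2 in (0, 1):
            for x3 in (0, 1):
                print(x1, x2, x3, "->", majority(x1, x2, x3))
```

Fixing one of the three inputs turns this abstract gate into a two-input gate, which is the property the fixed input majority gate relies on.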
Proposed MQCA based NML Architecture Design for Convolution:-The objective of this section is to introduce the MQCA based architecture to accomplish the computationally intensive convolution in an area and energy efficient way. In general, convolution is expressed as a sum of products. To begin with, we propose the design of a first of its kind nanomagnetic multiplier. The inputs of the 2 bit multiplier comprise a1, a0 (bits of A) and b1, b0 (bits of B), and their corresponding truth table entries are tabulated in table 1.
The 2 bit multiplier [22] outputs are defined as m3, m2, m1 and m0. As shown in figures 2 and 3, we propose to use the majority logic blocks Maj-a, Maj-b, Maj-c, Maj-d, Maj-e, Maj-f, Maj-g and Maj-h to build the 2 bit nanomagnetic multiplier. The majority logic blocks and their functionalities are defined as follows. The majority gates Maj-a and Maj-b are fed with the inputs a0, b0, c and a0, b1, c respectively, where c is set to 1 and a0 acts as a common input for both Maj-a and Maj-b. Maj-a corresponds to the output m0. Maj-c and Maj-d are fed with the inverted inputs of a1, b0, c and a1, b1, c respectively, where a1 and c act as the common inputs for both Maj-c and Maj-d.
We have adopted De Morgan's theorem, as stated in equation (2), for the optimization. By De Morgan's first theorem, the inverted inputs of Maj-c, and similarly those of Maj-d, can be expressed in NOR form. Hence, we propose to use the NOR implementation using Maj-c and Maj-d for the reduction of nanomagnets. Together, the proposed optimization and placement of nanomagnets yield a 37.5% reduction in the number of nanomagnets needed to implement the gate functionalities compared to the state-of-the-art approaches, as detailed in the upcoming discussions.
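To make the gate-level reasoning concrete, the following Python sketch builds a 2 bit multiplier from abstract three-input majority gates and inverters, and checks it against all 16 input combinations. It is a textbook decomposition under stated assumptions, not the exact Maj-a to Maj-h mapping of the proposed layout: the helper names are hypothetical, and the sketch uses the Boolean convention in which a fixed input of 0 gives AND and a fixed input of 1 gives OR. The source fixes c to 1; how that maps onto this abstraction depends on the inversions in the physical layout, which are not modelled here.

```python
from itertools import product


def maj(x: int, y: int, z: int) -> int:
    """Abstract three-input majority gate."""
    return 1 if x + y + z >= 2 else 0


def inv(x: int) -> int:
    """Inverter."""
    return 1 - x


def and_gate(x: int, y: int) -> int:
    # Assumed convention: third input fixed to 0 gives AND.
    return maj(x, y, 0)


def or_gate(x: int, y: int) -> int:
    # Assumed convention: third input fixed to 1 gives OR.
    return maj(x, y, 1)


def and_via_nor(x: int, y: int) -> int:
    # De Morgan: NOT(NOT x OR NOT y) equals x AND y,
    # so a NOR gate fed with inverted inputs yields the partial product.
    return inv(or_gate(inv(x), inv(y)))


def half_adder(x: int, y: int) -> tuple[int, int]:
    """Return (sum, carry) using only majority gates and inverters."""
    carry = and_gate(x, y)
    total = and_gate(or_gate(x, y), inv(carry))  # XOR of x and y
    return total, carry


def multiply_2bit(a1: int, a0: int, b1: int, b0: int) -> tuple[int, int, int, int]:
    """Return the product bits (m3, m2, m1, m0) of A = a1a0 and B = b1b0."""
    p00 = and_gate(a0, b0)
    p01 = and_gate(a0, b1)
    p10 = and_via_nor(a1, b0)
    p11 = and_via_nor(a1, b1)
    m0 = p00
    m1, c1 = half_adder(p01, p10)
    m2, m3 = half_adder(p11, c1)
    return m3, m2, m1, m0


def check_all() -> bool:
    """Compare the gate-level result with integer multiplication for all inputs."""
    for a1, a0, b1, b0 in product((0, 1), repeat=4):
        m3, m2, m1, m0 = multiply_2bit(a1, a0, b1, b0)
        if 8 * m3 + 4 * m2 + 2 * m1 + m0 != (2 * a1 + a0) * (2 * b1 + b0):
            return False
    return True


if __name__ == "__main__":
    print("all 16 input combinations correct:", check_all())
```

The two half adders in this sketch stand in for the adder blocks of the multiplier; the source replaces those with the authors' FMG based adder [19], which is not reproduced here.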
As is evident from the above, the proposed multiplier comprises adder blocks, and therefore its efficient implementation requires an efficient adder architecture. Hence, to enhance the optimization further, an efficient adder architecture recently reported by the authors [19] has been used to replace the existing adder block (for brevity, the details of this architecture are omitted [19]). The 2 bit multiplier architecture representation shown in figure 2 is substituted with the corresponding nanomagnetic representation, as depicted in figure 3. The architecture represented in figure 3 implements the possible logic variations of the 2 bit multiplier tabulated in table 1. Select configurations 1-7 are used to pick the particular architecture corresponding to the input logic variations, as depicted in the figure. Following this procedure reduces the number of design configurations required to achieve all the input logic variations. As an example, the 16 design configurations (cf table 1) required for the 2 bit multiplication implementation are reduced to 7 design configurations (cf figure 3).
The proposed procedure hence leads to a 56.25% reduction in the area footprint required for the optimized multiplier implementation, which will be detailed further. This area efficient design plays a significant role in the convolution architecture design. Thus, the combined adder block and the proposed multiplier block are used to implement the convolution architecture, as detailed below.
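The 56.25% figure follows directly from the configuration counts quoted in the text. A minimal check, with a hypothetical helper name and the 30 nanomagnets per configuration taken from the 30 × 7 = 210 count given for figure 3:

```python
def percent_reduction(baseline: float, proposed: float) -> float:
    """Percentage saved when going from baseline to proposed."""
    return 100.0 * (baseline - proposed) / baseline


NANOMAGNETS_PER_CONFIGURATION = 30  # from 30 x 7 = 210 in the text

config_saving = percent_reduction(16, 7)
magnet_saving = percent_reduction(16 * NANOMAGNETS_PER_CONFIGURATION,
                                  7 * NANOMAGNETS_PER_CONFIGURATION)

if __name__ == "__main__":
    print(f"design configurations: {config_saving:.2f}% fewer")
    print(f"nanomagnets (480 to 210): {magnet_saving:.2f}% fewer")
```

Because every configuration uses the same number of nanomagnets, the reduction in configurations and the reduction in nanomagnets are the same percentage.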
Convolution Illustration:-The modules to implement equation (5) are depicted in figure 4(a). The first module block, representing A0B0, consists of 30 × 7, totaling 210 nanomagnets (cf figure 3); similarly, the module blocks representing A1B1, A2B2 and A3B3 correspond to 210 nanomagnets each. The design of the k-bit nanomagnetic adder using the 1 bit adder is briefly discussed in one of our recently reported works [19].

Subsequently, the module blocks representing A0B0 + A1B1, A2B2 + A3B3 and their final sum comprise a total of 32, 32 and 40 nanomagnets respectively. The convolution design illustrated in figure 4(a) requires 210 × 4 + 32 × 2 + 40 = 944 nanomagnets along with ∼40 interconnect nanomagnets, leading to ∼984 nanomagnets; 56 × 4 + 8 × 2 + 10 = 250 majority gate operations; and 6 × 4 + 4 × 2 + 5 = 37 clock cycles to compute C as defined in equation (5). These calculations have been performed considering both Block 1 and Repeating Block 1, as depicted in figure 4(a). However, Repeating Block 1 is redundant, and hence the resources of Block 1 can be shared for computing both A0B0 + A1B1 and A2B2 + A3B3 individually, which will also result in a drastic reduction in the total number of nanomagnets, majority gate operations and clock cycles.
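The resource tallies above can be reproduced with a short script. The per-block numbers are those quoted in the text (four multiplier blocks, two first-stage adders, one final adder); the dictionary layout and function name are illustrative assumptions.

```python
# Per-block resources as quoted in the text.
MULTIPLIER = {"nanomagnets": 210, "majority_gates": 56, "clock_cycles": 6}
FIRST_STAGE_ADDER = {"nanomagnets": 32, "majority_gates": 8, "clock_cycles": 4}
FINAL_ADDER = {"nanomagnets": 40, "majority_gates": 10, "clock_cycles": 5}
INTERCONNECT_NANOMAGNETS = 40  # approximate figure from the text


def tally(resource: str) -> int:
    """Total over 4 multipliers, 2 first-stage adders and 1 final adder."""
    return (4 * MULTIPLIER[resource]
            + 2 * FIRST_STAGE_ADDER[resource]
            + FINAL_ADDER[resource])


if __name__ == "__main__":
    logic_magnets = tally("nanomagnets")
    print("logic nanomagnets:", logic_magnets)
    print("with interconnects:", logic_magnets + INTERCONNECT_NANOMAGNETS)
    print("majority gate operations:", tally("majority_gates"))
    print("clock cycles:", tally("clock_cycles"))
```

The clock cycle total follows the text in summing the cycles of every block; the saving from sharing Block 1 is not quantified in the text and is therefore not modelled here.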
The reduction in the number of nanomagnets and majority gates accounts for the area and energy efficiency, and similarly the reduction in the number of clock cycles leads to higher speed. Here, we have generalized our design approach to implement the k bit convolution operation using a k bit nanomagnetic multiplier and a 2k bit nanomagnetic adder, as depicted in figure 4(b). This block is represented as module 1, and this module is repeated multiple times to achieve the computationally intensive convolution, where the summation upper limit is set to 15 (generic model using module 1). In the upcoming discussions, we describe the inter-module communication required for the implementation of the proposed convolution architecture design.
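One way to see why a k bit multiplier pairs with a 2k bit adder is that the product of two k bit operands never needs more than 2k bits. The sketch below checks this and evaluates a sum of products at the behavioural level; the function names are hypothetical, the index range 0 to 15 is an assumption based on the stated upper limit, and the operand values are arbitrary illustrations rather than data from the source.

```python
def max_product_bits(k: int) -> int:
    """Bits needed for the largest product of two unsigned k bit operands."""
    largest_operand = 2 ** k - 1
    return (largest_operand * largest_operand).bit_length()


def sum_of_products(a: list[int], b: list[int]) -> int:
    """Behavioural model of the convolution sum: multiply pairwise, then add."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    for k in range(1, 9):
        print(f"k = {k}: product fits in {max_product_bits(k)} bits (2k = {2 * k})")

    # Sixteen pairs of arbitrary 2 bit operands (indices 0 to 15).
    a = [i % 4 for i in range(16)]
    b = [3 - (i % 4) for i in range(16)]
    print("sum of products:", sum_of_products(a, b))
```

The accumulated sum can grow beyond 2k bits as more terms are added; how carries are handled across repeated modules is not specified in this section and is left out of the sketch.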
Each module contains a set of logic functionalities for execution, as detailed above. Output data propagation from one module to another can be achieved as follows. The foremost way of inter-module communication is a nanomagnetic wire architecture in the form of interconnect nanomagnets exploiting SP hybrid anisotropy. Data propagation is also achieved by using buffers and inverters, which comprise odd and even numbers of nanomagnets respectively. Similarly, (a) the input and output of the first and last modules can be interfaced to the external CMOS modules using nanoscale spin valves; (b) the input can be field-coupled and the output read using a spin valve; (c) an electromagnetic interface can also be achieved by domain walls (DWs); and (d) a magnetic tunnel junction based I/O interface can be realized by exploiting the free layer and giant magnetoresistance. However, since the focus of this study is to propose an efficient convolution architecture design, only a short summary of inter-module communication is included for brevity.
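The odd and even rule for buffers and inverters can be illustrated with a simple model of a horizontal wire. The sketch assumes a chain of side-by-side nanomagnets in which each antiferromagnetic coupling flips the state, and that the count includes the first (input) nanomagnet; the function name is hypothetical.

```python
def propagate_chain(input_bit: int, n_magnets: int) -> int:
    """State of the last nanomagnet in an antiferromagnetically coupled chain.

    The first nanomagnet holds input_bit; each of the n_magnets - 1
    neighbour couplings inverts the state once.
    """
    if n_magnets < 1:
        raise ValueError("a chain needs at least one nanomagnet")
    state = input_bit
    for _ in range(n_magnets - 1):
        state = 1 - state
    return state


if __name__ == "__main__":
    for n in range(1, 7):
        role = "buffer" if propagate_chain(1, n) == 1 else "inverter"
        print(f"{n} nanomagnets -> {role}")
```

An odd count gives an even number of inversions, so the chain acts as a buffer; an even count leaves one net inversion, so it acts as an inverter.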
Discussion:-The micromagnetic simulation results of the proposed multiplier design (cf figure 3) are depicted in figure 5. The application of an external magnetic field powers the slanted edge standalone input nanomagnets. To achieve all the input logic variations of the 2 bit nanomagnetic multiplier, different positions of the slanted edges are required, comprising 16 different physical configurations (constituting 480 nanomagnets), as proven effective by earlier experimental realizations [10][11][12][15]. Though proven effective in the literature, scaling up to higher bit multiplier designs creates a significant need for optimization. To address this, our proposed design clusters these into only 7 different physical configurations (constituting 210 nanomagnets), resulting in a 56.25% reduction in the area footprint and in the number of nanomagnets. As shown in figure 3, choosing select configurations 1 to 7 achieves the implementation of the varying input logic combinations (cf table 1). Beyond the number of physical configurations, (a) the proposed optimization of the 2 NOR majority blocks together with the placement of nanomagnets yields a 37.5% reduction in the number of nanomagnets, and (b) the optimization of the 2 AND majority blocks yields a 12.5% reduction in the number of nanomagnets, compared to the traditional implementation of AND and NOR majority blocks. Figure 5(a) corresponds to the multiplication output 0; (b) corresponds to the simulation output of 1 × 1, leading to output 1; (c) corresponds to 2 × 1 vis-a-vis 1 × 2, leading to output 2; (d) corresponds to 2 × 2 and its output 4; (e) corresponds to 3 × 1 vis-a-vis 1 × 3 for output 3; (f) corresponds to 3 × 2 vis-a-vis 2 × 3 for output 6; and (g) corresponds to the simulation output for 3 × 3.
As detailed, the resource requirements for the implementation of the nanomagnetic convolution (for illustration, cf figure 4) are 984 nanomagnets, 250 majority gates and 37 clock cycles, using the proposed efficient multiplier and the authors' state-of-the-art adder architecture.
The simulation outputs are colour coded: interconnects are shown in green, while input, output and compute nanomagnets are shown in red, with varying saturation representing field interaction.
The performance metrics of the proposed MQCA based architecture design in comparison to the state-of-the-art are depicted in figure 6. The analyses presented are for one physical configuration and without interconnects. From these it is evident that the proposed architecture design is efficient in terms of the required number of nanomagnets, majority gates and clock cycles.
Simulation Framework:-The Object Oriented MicroMagnetic Framework (OOMMF) [23], an open source micromagnetic simulation tool developed at ITL, NIST, which solves the Landau-Lifshitz-Gilbert ordinary differential equation using a 4th order Runge-Kutta solver, is used throughout this study. We opted for OOMMF as researchers widely use it for designing, validating and reporting, owing to its reproducible and reliable system development that aids in the real-time realization of the developed theoretical models [10,12,13,14,15,18]. In OOMMF, 3D spins on 2D mesh cells are relaxed using the Landau-Lifshitz PDE solver. Spin-orbit coupling interaction gives rise to the magnetic anisotropy energy. The Stoner-Wohlfarth model is applied for magnetization rotation to place 45° aligned single-domain nanomagnets. The slanted edge of the nanomagnets aids in achieving standalone inputs, and all their magnetic moments are aligned along the easy (long) axis. The direction of the slanted edge determines the final relaxed state (My) of the nanomagnet upon removal of the applied field (X component). We have used permalloy (Py) (78.5% nickel, 21.5% iron composition) magnetic dots; permalloy is a well-known polycrystalline soft magnetic material, with the uniaxial anisotropy constant set to zero (ie, low coercive field) and an exchange energy larger than the magnetocrystalline anisotropy energy (MAE) [24]. High axial symmetry is thus required to maintain the magnetic anisotropy, which is dominated by the exchange interaction as specified by the single-domain bistable nanomagnetic exchange Hamiltonian. In the interest of simulation time, the maximum torque |m × h| is set to 10⁻⁵ A/m, with a damping coefficient of 0.25, a saturation magnetization of 800 × 10³ A/m and an exchange stiffness constant of 13 × 10⁻¹² J/m [18,19]. CleWin [25], a hierarchical layout editor developed by WieWeb software, has been used for modeling.
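As a quick sanity check on the material parameters quoted above, the exchange length of permalloy can be estimated from the saturation magnetization and the exchange stiffness constant using the standard expression sqrt(2A / (μ0 Ms²)). A commonly used rule of thumb in micromagnetics is to keep the mesh cell size at or below this length; the source does not state the cell size used, so this sketch only illustrates the scale involved. The constant and function names are hypothetical.

```python
import math

MU0 = 4 * math.pi * 1e-7        # vacuum permeability, T m / A
MS = 800e3                      # saturation magnetization from the text, A/m
A_EX = 13e-12                   # exchange stiffness constant from the text, J/m


def exchange_length(a_ex: float, ms: float) -> float:
    """Exchange length in metres: sqrt(2 A / (mu0 Ms^2))."""
    return math.sqrt(2 * a_ex / (MU0 * ms ** 2))


if __name__ == "__main__":
    l_ex = exchange_length(A_EX, MS)
    print(f"exchange length: {l_ex * 1e9:.2f} nm")
```

With these values the estimate comes out between 5 nm and 6 nm, which is the same order as the 10 nm thickness and the 10 nm to 15 nm lateral features of the nanomagnets considered here.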
Design specifications for the proposed nanomagnetic architecture are as follows: slanted edge nanomagnets of 15 nm × 30 nm area and 10 nm thickness; oval shaped nanomagnets of 10 nm × 30 nm area and 10 nm thickness; and antiferromagnetic and ferromagnetic coupling wall separation of 10 nm | 15 nm. Finer 3D spins in the 2D mesh have been taken into account, and an underestimation of the gap by ±5% is unlikely to affect the reliability of the circuit. The architecture proposed here is designed at a sub-50 nm design node, which requires special attention (in spite of the earlier experimental realization of slanted edges and 45 degree aligned nanomagnets) and could be better realized with state-of-the-art fabrication techniques. The proposed design has also been evaluated against thermal fluctuations and premature bit-flip as defined by theoretical models and found satisfactory, which is also briefly reported in one of our recently reported works [19]. The sub-50 nm sized nanomagnets used here represent the possible scaling down limit of nanomagnets using in-plane dipolar coupling, adhering to the theoretical standards without entering the superparamagnetic regime. However, it is to be noted that the proposed design methodologies are independent of the design nodes and could be realized using existing design nodes of sub 180 nm and 250 nm [21].
Figure 6. Performance metrics and the corresponding comparative percentage reduction of the proposed MQCA based NML convolution architecture design to the state-of-the-art [FMG [19], SP Hybrid [18], Varga-Adder [12], 45 degree [16], Slant [15], Varga [11]].

Conclusion
We have proposed an area and energy efficient MQCA based 2 bit nanomagnetic multiplier architecture and convolution design approach as a simulation based proof-of-concept demonstration. The proposed design yields ∼21%-72%, ∼26%-42%, ∼36%-63% and ∼20%-68% reductions in the required number of nanomagnets, majority gates, clock cycles and energy, respectively, compared to the state-of-the-art designs. The proposed energy efficient architecture design is envisaged to have applications in edge computing and a potential impact on graphene [26][27][28] based on-chip clocking.