Domain wall and Magnetic Tunnel Junction Hybrid for on-chip Learning in UNet architecture

We present a spintronic-device-based hardware implementation of UNet for segmentation tasks. Our approach involves designing hardware for the convolution, deconvolution, rectified linear unit (ReLU) activation, and max pooling layers of the UNet architecture. We design the convolution and deconvolution layers of the network using the synaptic behavior of the domain wall MTJ, and we construct the ReLU and max pooling functions using the spin Hall driven orthogonal current injected MTJ. To incorporate the diverse physics of spin transport, magnetization dynamics, and CMOS elements in our UNet design, we employ a hybrid simulation setup that couples micromagnetic simulation, the non-equilibrium Green's function formalism, and SPICE simulation with the network implementation. We evaluate our UNet design on the CamVid dataset and achieve a segmentation accuracy of 83.71$\%$ on test data, on par with the software implementation, with 821 mJ of energy consumption for on-chip training over 150 epochs. We further demonstrate a nearly one order of magnitude $(10\times)$ improvement in the energy requirement of the network using an unstable ferromagnet ($\Delta$=4.58) over the stable ferromagnet ($\Delta$=45) based ReLU and max pooling functions while maintaining similar accuracy. The hybrid architecture comprising the domain wall MTJ and the unstable FM-based MTJ leads to an on-chip energy consumption of 85.79 mJ during training, with a testing energy cost of 1.55 $\mu$J.


I. INTRODUCTION
Semantic image segmentation is a pixel-level classification of an image and involves clustering parts of the image that belong to the same class [1-3]. This deep learning task is integral to computer vision and pattern recognition and has substantial use in fields such as medical imaging [4], self-driving cars [5], and satellite imagery analysis [6]. While convolutional neural networks (CNNs) like LeNet, VGGNet, and GoogleNet are commonly used for classification tasks, where the output is a single class label, semantic segmentation requires localization information, i.e., a class label for each pixel. Consequently, image segmentation, with its pixel-wise classification, is computationally more demanding than object classification. Architectures for image segmentation, like UNet, also play a crucial role in diffusion models used for image generation, such as OpenAI's DALL-E. This emphasizes the necessity of developing hardware implementations for these networks.
Implementing these complex deep neural network algorithms on traditional hardware based on the von Neumann architecture is resource-intensive in terms of energy consumption, area, and time. This is primarily due to the separation of memory and processing units. There is therefore a need for specialized hardware designs that utilize the in-memory computing paradigm [7], offering optimization tailored for the efficient implementation of deep neural networks.
Several studies have investigated the specialized hardware implementation of segmentation tasks [8, 9]. These works optimize the segmentation for FPGA implementation [9] or deploy a pipelined VLSI architecture [8]. They are based on CMOS devices and consume high power and area. Spintronic devices, on the other hand, consume less power and area [7] and are compatible with CMOS technology [10]. Spintronic devices also have the advantage of a diverse range of properties such as non-volatility, oscillatory dynamics, plasticity, high endurance, linear response, and stochastic behavior [7, 11-14]. These properties give a wide range of tools to design specialized hardware for deep neural network implementation. While spintronic realizations of multilayer perceptrons [7, 15], convolutional neural networks [16], spiking neural networks [17], and reservoir computing [18] have been demonstrated, the implementation of UNet remains elusive. In this work, we propose a spintronic implementation of the convolution, deconvolution, ReLU, and max-pooling layers that are essential for UNet. A hybrid of domain-wall MTJs and SHE-MTJs is employed for realizing these layers. We utilize a hybrid simulation method that couples micromagnetic simulation, the Keldysh non-equilibrium Green's function (NEGF) formalism, and SPICE simulation with the network implementation to capture the diverse physics of spintronic and CMOS devices in our designs.
The rest of the paper is organized as follows. In Section II, we describe the UNet architecture used for image segmentation and explain how convolution and deconvolution can be realized using cross-bar arrays and the characteristics of ReLU and max pooling layers. Section III delves into the simulation method, outlining the coupling of micromagnetic simulation, NEGF, and circuit simulation with network implementation to execute image segmentation. In Section IV, we describe the domain-wall MTJ and discuss the synaptic behavior of the domain-wall device. In Section V, we present the orthogonal current injected MTJ device and circuit designs for ReLU and max pooling functions. In Section VI, we show the results of the image segmentation using the CamVid dataset and compare the on-chip energy consumption of the proposed network for different thermal stability factors. In Section VII, we discuss the possibility of physical spintronic realization of complex networks with large numbers of parameters. We conclude in Section VIII.

FIG. 1. The UNet structure is illustrated, where the feature map is represented by blue boxes with the number of channels indicated at the top and the size displayed on the left edge. White boxes signify copied features from previous stages, and arrows indicate various operations. An example input image and its corresponding output are also depicted.

II. ARCHITECTURE FOR SEGMENTATION
Multiple architectures have been developed for image segmentation, such as UNet and SegNet. Among these, UNet has been widely adopted in segmentation tasks. The UNet architecture was initially proposed by Ronneberger et al. [19] for medical image segmentation. This architecture consists of two main components: the contracting path (also known as the encoder) and the expanding path (also known as the decoder), connected by a copy path (also referred to as a skip connection). The contracting path reduces the feature map while extracting image features, and the expanding path utilizes these features to localize objects and reconstruct the segmentation mask [19]. As the feature map undergoes a reduction in the contracting path, some information is lost. To address this, the copy connection (skip connection) is employed to reintroduce the lost information to the expanding path. Figure 1 shows the schematic of a UNet structure. Here, the contracting path contains convolution, ReLU, and max pooling layers, while the expanding path contains deconvolution, convolution, and ReLU layers terminated by a softmax function. In this network, we employ 4.65 million domain-wall synapses, 21.45 million ReLU circuit instances, and 2.33 million ReLU-max pooling circuit instances to tackle the highly complex task of semantic image segmentation.
Implementing image segmentation through UNet on hardware necessitates the design of circuits dedicated to convolution, deconvolution, ReLU activation functions, and max pooling layers. In the following sections, we describe the networks designed for these layers.

A. Convolution
The convolution operation entails matrix-vector multiplication, where the input is multiplied with a kernel. This matrix-vector multiplication is fundamental to artificial neural networks, where the input/feature map is multiplied with a weight matrix. In convolution, the kernel can be thought of as a weight matrix. Performing this multiplication requires many memory fetches when using traditional hardware based on the von Neumann architecture. Crossbar arrays [16, 20, 21] have therefore become very popular for matrix-vector multiplication; an example of a crossbar array is shown in Fig. 2. In crossbar arrays, the weight matrix/kernels are stored in non-volatile memory elements (synapses), where analog memory and computing units are intricately interwoven, leading to faster and more energy-efficient matrix multiplication. In Fig. 2, the inputs are applied to the horizontal lines, the kernel weights are stored as conductances of synapses in the vertical lines, and the output of the multiplication (the weighted sum of inputs) is given by the current in the vertical lines.
To implement such a crossbar array, a non-volatile synaptic device is necessary. Therefore, we employ a domain-wall-based magnetic tunnel junction (DW-MTJ) device to store the kernel weight. The neural network can have both positive and negative weights, but the conductance values of the DW-MTJ are positive only. To address this, we add a conductance in parallel to the DW-MTJ as shown in Fig. 2, so that the weight $W_{i,j}$ connecting the $i$th input with the $j$th kernel can be represented in terms of the conductance $G_{DWMTJ}$ of the DW-MTJ and its anti-parallel and parallel conductances $G_{AP}$ and $G_P$. Further details about the DW-MTJ device are elaborated in Section IV.

FIG. 2. The convolution operation using a DW-based cross-bar array. The vertical lines symbolize the convolution kernels, and the input is applied to the horizontal lines. The DW device along with a parallel conductance is used to store the kernel values. The resulting current output from the vertical lines (kernel output) is connected to the ReLU/ReLU+max pooling devices.
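The crossbar picture above can be sketched numerically as follows. This is a minimal illustration, not the authors' circuit: the linear weight-to-conductance mapping, the subtraction of a mid-point reference conductance `g_ref` (standing in for the parallel branch), and the conductance values are all assumptions made for the example.

```python
import numpy as np

def weights_to_conductance(w, g_ap, g_p):
    """Map weights in [-1, 1] linearly onto conductances in [g_ap, g_p] (assumed mapping)."""
    g_mid = 0.5 * (g_p + g_ap)
    g_half = 0.5 * (g_p - g_ap)
    return g_mid + g_half * np.clip(w, -1.0, 1.0)

def crossbar_output(v_in, g_dw, g_ref):
    """Column currents of a crossbar after subtracting a reference conductance.

    v_in : (n_inputs,) voltages on the horizontal lines
    g_dw : (n_inputs, n_kernels) synapse conductances, one kernel per column
    g_ref: scalar reference conductance (hypothetical model of the parallel branch)
    """
    return v_in @ (g_dw - g_ref)

# Illustrative values only; not taken from the paper.
g_ap, g_p = 1e-4, 3e-4                      # siemens
g_ref = 0.5 * (g_p + g_ap)

rng = np.random.default_rng(0)
kernels = rng.uniform(-1, 1, size=(3, 3, 4))  # four 3x3 kernels
patch = rng.uniform(0, 1, size=(3, 3))        # one 3x3 input patch

w = kernels.reshape(9, 4)                     # flatten each kernel into a column
g = weights_to_conductance(w, g_ap, g_p)
i_out = crossbar_output(patch.reshape(9), g, g_ref)

# Rescale the currents back to weight units and compare with the ideal dot product.
recovered = i_out / (0.5 * (g_p - g_ap))
expected = patch.reshape(9) @ w
```

Each column current reproduces one kernel's weighted sum over the input patch, so a single analog read yields all kernel outputs for that patch at once.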

B. Deconvolution
Deconvolution, also referred to as transposed convolution or fractionally-strided convolution, operates in the reverse direction of convolution. It extrapolates new information from the feature map and can be thought of as a one-to-many connection [9]. Deconvolution serves as a technique for upsampling images, resulting in an output size larger than the input size. This operation has significant applications in generative adversarial networks and fully convolutional networks [22].
The deconvolution operation can be achieved by introducing zeros into the input matrix and performing a convolution operation [22, 23]. Figure 3 illustrates the deconvolution operation as a combination of zero insertion and convolution. Zeros are inserted along each row and column, including at the edges of the input matrix, thereby expanding the input size. This up-sampled matrix is then used as input for convolution. This combination of zero insertion and convolution yields the same effect as deconvolution. While this method involves redundant multiplications with zeros, it allows us to utilize the convolution operation, for which we have designed a hardware implementation using cross-bar arrays in the previous section. This approach reduces the complexity of the hardware design for the segmentation tasks. Hence, in our network design, we represent deconvolution through the convolution operation with an additional step of zero insertion.
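A minimal sketch of deconvolution as zero insertion followed by convolution is given below. The exact zero pattern (one zero between neighbouring elements plus a one-element zero border) and the helper names are assumptions for illustration; the convolution is written as the unflipped sliding-window product commonly used in deep learning.

```python
import numpy as np

def zero_insert(x):
    """Insert zeros between elements and around the edges of a 2D array."""
    n_r, n_c = x.shape
    up = np.zeros((2 * n_r + 1, 2 * n_c + 1), dtype=float)
    up[1::2, 1::2] = x          # original values land on the odd rows/columns
    return up

def conv2d_valid(x, k):
    """Plain 'valid' sliding-window convolution (no kernel flip, stride 1)."""
    k_r, k_c = k.shape
    out_r = x.shape[0] - k_r + 1
    out_c = x.shape[1] - k_c + 1
    out = np.zeros((out_r, out_c))
    for r in range(out_r):
        for c in range(out_c):
            out[r, c] = np.sum(x[r:r + k_r, c:c + k_c] * k)
    return out

def deconv2d(x, k):
    """Deconvolution expressed as zero insertion followed by convolution."""
    return conv2d_valid(zero_insert(x), k)

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
k = np.ones((2, 2))
y = deconv2d(x, k)              # 2x2 input becomes a 4x4 output
```

With an all-ones 2 × 2 kernel, every window covers exactly one original value, so the output is a nearest-neighbour upsampling of the input; a trained kernel would instead learn how to spread each value over its neighbourhood.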

C. ReLU and Max-pooling
Activation functions play a crucial role in neural networks, introducing non-linearity that enables the network to learn intricate structures and distinguish between outputs [24]. The rectified linear activation function (ReLU) [25] has emerged as a default choice for various networks, as it has been shown to improve learning in neural networks [26-28]. In convolutional neural networks (CNNs), UNet, and fully connected convolutional networks, a pooling layer is commonly incorporated to reduce the size and parameters while extracting features. Among the various pooling methods, max pooling is popular, and it can also suppress noise by discarding noisy activations [24].
To implement the ReLU function, we employ an orthogonal current-injected MTJ design. Subsequently, we utilize this ReLU circuit to construct a 3 × 3 max pooling network that simultaneously performs both ReLU and max-pooling functions. We discuss these implementations in Section V.

III. SIMULATION METHOD
The implementation of the domain-wall synapse involves micromagnetic simulation, which gives the response in magnetization of the free ferromagnetic layer due to applied current. To perform these micromagnetic simulations, the mumax3 software [29, 30] was employed. These magnetization results are used to obtain the conductance of DW-MTJ devices using the following equation [15, 31]:

$$G_{DWMTJ} = \frac{G_P + G_{AP}}{2} + \frac{G_P - G_{AP}}{2}\cos\theta$$

Here, $\theta$ represents the angle between the free-FM and fixed-FM magnetizations, and $G_P$ and $G_{AP}$ are the parallel and anti-parallel conductances of the MTJ. NEGF simulation is utilized to compute the $G_P$ and $G_{AP}$ conductance values.
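The translation from magnetization to conductance can be sketched as below, assuming the standard cosine form $G(\theta) = (G_P + G_{AP})/2 + (G_P - G_{AP})/2 \cdot \cos\theta$. For the domain-wall device the sketch replaces $\cos\theta$ by the average $m_z$ of the free layer, with the fixed layer taken along $+z$; that substitution, the helper names, and the conductance values are assumptions for illustration.

```python
import numpy as np

def mtj_conductance(theta, g_p, g_ap):
    """Conductance of an MTJ whose free and fixed layers make an angle theta."""
    return 0.5 * (g_p + g_ap) + 0.5 * (g_p - g_ap) * np.cos(theta)

def dw_mtj_conductance(parallel_fraction, g_p, g_ap):
    """Conductance of a DW-MTJ when a fraction of the free layer is parallel.

    parallel_fraction: value in [0, 1], set by the domain wall position.
    The average m_z of the free layer is then 2 * fraction - 1 (assumed model).
    """
    mz_mean = 2.0 * parallel_fraction - 1.0
    return 0.5 * (g_p + g_ap) + 0.5 * (g_p - g_ap) * mz_mean

# Illustrative conductances only; the paper obtains G_P and G_AP from NEGF.
g_p, g_ap = 3e-4, 1e-4
g_center = dw_mtj_conductance(0.5, g_p, g_ap)   # wall at the center of the track
```

In this picture the conductance varies linearly with the wall position between $G_{AP}$ and $G_P$, which is what makes the device usable as an analog synapse.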
In the simulation of the ReLU-max pooling network, the NEGF simulation is self-consistently coupled with the voltage divider (formed by the MTJ and a fixed resistor). Here the MTJ angle is varied to find the resistance of the MTJ by iteratively calculating the voltage across the MTJ. The results from the NEGF simulation are incorporated into the HSPICE circuit simulator through Verilog-A, where the Verilog-A component provides the MTJ resistance based on the MTJ angle given by HSPICE. HSPICE also performs the magnetization dynamics simulation to find the MTJ angle, along with the CMOS device simulations based on the 16 nm predictive technology model [32]. The results of the domain-wall synapse, ReLU, and max pooling simulations are incorporated into Python, where the UNet architecture shown in Fig. 1 is implemented using the TensorFlow package. For implementing the UNet architecture in Python, we utilize the conductance relationship of the domain-wall MTJ derived from mumax3 and NEGF. For the ReLU and ReLU-max pooling circuits, we use the empirical relationship between input current and output voltage, along with the performance in the presence of thermal noise obtained from HSPICE and NEGF.

A. Quantum transport: NEGF
We use the Keldysh NEGF technique [12, 33, 34] to simulate the transport through the MTJ, which has MgO sandwiched between the free and fixed CoFeB FM layers. The NEGF formalism is given by

$$G(E) = \left[E[I] - [H] - [\Sigma]\right]^{-1}$$

$$[H] = [H_0] + [U]$$

$$[\Sigma] = [\Sigma_T] + [\Sigma_B]$$

$$[G^n] = G(E)\,[\Sigma^{in}]\,G(E)^{\dagger}$$

Here $G(E)$ is the Green's function matrix, $[I]$ is the identity matrix, $E$ is the energy variable, $[H]$ is the device Hamiltonian, $[H_0]$ is the device tight-binding matrix, $[U]$ is the Coulomb charging matrix, $[\Sigma]$ is the self-energy matrix, and $[\Sigma_{T,B}]$ are the self-energy matrices for the top (fixed) and bottom (free) FM layers, respectively. $[G^n]$ is the electron correlation matrix and $[\Sigma^{in}]$ is the in-scattering function.
The quantum transport calculation yields the current operator $I_{op}$, which represents the charge current between two lattice points $i$ and $i+1$ and is constructed from the device Hamiltonian and the electron correlation matrix $[G^n]$. The current operator $I_{op}$ is a 2 × 2 matrix in the spin space of the lattice point. The charge current $I$ through the MTJ device is obtained by tracing $I_{op}$ over the spin space and integrating over energy, with $q$ being the quantum of electronic charge.
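To make the NEGF recipe concrete, the sketch below evaluates the Green's function $G(E) = [E I - H - \Sigma_T - \Sigma_B]^{-1}$ for a toy system: a spinless one-dimensional tight-binding chain between two semi-infinite leads, with an optional raised on-site energy mimicking a tunnel barrier. This is a textbook simplification and not the spin-dependent CoFeB/MgO/CoFeB model of the paper; the chain length, hopping `t`, on-site energy `eps`, and barrier parameters are all hypothetical.

```python
import numpy as np

def transmission_1d(energy, n_sites=10, t=1.0, eps=2.0, barrier=0.0, n_barrier=0):
    """Transmission through a 1D tight-binding chain using NEGF.

    barrier / n_barrier raise the on-site energy of the central sites to
    mimic a tunnel barrier (toy model, no spin).
    """
    onsite = np.full(n_sites, eps, dtype=float)
    start = (n_sites - n_barrier) // 2
    onsite[start:start + n_barrier] += barrier

    # Device Hamiltonian: on-site energies and nearest-neighbour hopping -t.
    h = np.diag(onsite) - t * (np.eye(n_sites, k=1) + np.eye(n_sites, k=-1))

    # Self-energy of a semi-infinite 1D lead with dispersion E = eps - 2 t cos(ka).
    ka = np.arccos((eps - energy) / (2.0 * t))
    sigma_lead = -t * np.exp(1j * ka)
    sigma_top = np.zeros((n_sites, n_sites), dtype=complex)
    sigma_bot = np.zeros((n_sites, n_sites), dtype=complex)
    sigma_top[0, 0] = sigma_lead
    sigma_bot[-1, -1] = sigma_lead

    # Retarded Green's function and broadening matrices.
    g = np.linalg.inv(energy * np.eye(n_sites) - h - sigma_top - sigma_bot)
    gamma_top = 1j * (sigma_top - sigma_top.conj().T)
    gamma_bot = 1j * (sigma_bot - sigma_bot.conj().T)

    return float(np.real(np.trace(gamma_top @ g @ gamma_bot @ g.conj().T)))

t_clean = transmission_1d(1.0)                              # uniform chain
t_barrier = transmission_1d(1.0, barrier=3.0, n_barrier=4)  # tunnelling case
```

A uniform chain transmits perfectly inside its band, while the barrier case decays strongly with barrier width, which is the qualitative behaviour an MTJ tunnel barrier relies on.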

B. Magnetization dynamics
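The Landau-Lifshitz-Gilbert-Slonczewski (LLGS) equation [35, 36] is used to describe the magnetization dynamics of the free-FM. The LLGS equation is given by

$$\frac{(1+\alpha^2)}{\gamma H_k}\frac{\partial \hat{m}}{\partial t} = -\hat{m}\times\vec{h}_{eff} - \alpha\,\hat{m}\times(\hat{m}\times\vec{h}_{eff}) - \hat{m}\times(\hat{m}\times\vec{i}_s) + \alpha\,(\hat{m}\times\vec{i}_s)$$

where $\hat{m}$ is the unit vector along the direction of magnetization of the free magnet, $\gamma$ is the gyromagnetic ratio, $\alpha$ is the Gilbert damping constant, $\vec{h}_{eff} = \vec{H}_{eff}/H_k$ is the reduced effective field, and $\vec{i}_s = \hbar\vec{I}_s/(2qM_sVH_k)$ is the normalized spin current. The term $\vec{H}_{eff}$ includes the contribution of the anisotropy field ($H_k$) and the thermal noise ($H_{th}$). The thermal noise [37] is given by $\langle H_{th}^2\rangle = 2\alpha k_B T/(\gamma M_s V)$, where $\langle\,\rangle$ represents the ensemble average.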
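A macrospin integration of the LLGS dynamics can be sketched as follows. The sketch assumes the normalized form $(1+\alpha^2)\,d\hat{m}/d\tau = -\hat{m}\times\vec{h}_{eff} - \alpha\,\hat{m}\times(\hat{m}\times\vec{h}_{eff}) - \hat{m}\times(\hat{m}\times\vec{i}_s) + \alpha\,(\hat{m}\times\vec{i}_s)$ with $\tau = \gamma H_k t$, a uniaxial anisotropy along $z$, and no thermal field; the Heun integrator, step size, and initial tilt are choices made for the example, whereas the paper solves the dynamics inside HSPICE.

```python
import numpy as np

def llgs_rhs(m, h_eff, i_s, alpha):
    """Right-hand side of the normalized LLGS equation, d m / d tau."""
    mxh = np.cross(m, h_eff)
    mxi = np.cross(m, i_s)
    dm = -mxh - alpha * np.cross(m, mxh) - np.cross(m, mxi) + alpha * mxi
    return dm / (1.0 + alpha ** 2)

def anisotropy_field(m):
    """Reduced uniaxial anisotropy field along z (assumed easy axis)."""
    return np.array([0.0, 0.0, m[2]])

def integrate(m0, alpha=0.3, i_s=(0.0, 0.0, 0.0), d_tau=0.01, n_steps=4000):
    """Heun (predictor-corrector) integration with renormalization of |m|."""
    m = np.asarray(m0, dtype=float)
    m = m / np.linalg.norm(m)
    i_s = np.asarray(i_s, dtype=float)
    for _ in range(n_steps):
        k1 = llgs_rhs(m, anisotropy_field(m), i_s, alpha)
        m_pred = m + d_tau * k1
        m_pred = m_pred / np.linalg.norm(m_pred)
        k2 = llgs_rhs(m_pred, anisotropy_field(m_pred), i_s, alpha)
        m = m + 0.5 * d_tau * (k1 + k2)
        m = m / np.linalg.norm(m)
    return m

# Start 60 degrees away from the easy axis and let the damping relax the magnet.
m_start = np.array([np.sin(np.pi / 3), 0.0, np.cos(np.pi / 3)])
m_final = integrate(m_start)
```

Without spin current the magnetization spirals back to the easy axis; passing a non-zero `i_s` polarized orthogonally to the easy axis instead tilts the equilibrium angle, which is the regime the ReLU device operates in.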

C. SHE layer
The charge-to-spin conversion via the spin Hall effect (SHE) in heavy metals is used to effectively manipulate the free-FM magnetization. The charge-to-spin conversion of the SHE layer and the polarization of the generated current are given by [38-40]

$$J_s = \theta_{SH}\,J_c$$

$$I_s = \theta_{SH}\,\frac{L}{t}\,I_c$$

$$\hat{\sigma} = \hat{I}_s \times \hat{I}_c$$

Here, $J_s$ is the spin current density and $J_c$ is the charge current density. $I_s$ is the spin current generated, $\theta_{SH}$ is the spin Hall angle of the heavy metal, $L$ and $t$ are the length and thickness of the heavy metal, and $I_c$ is the charge current injected. $\hat{I}_s$ is the direction of generated spin current flow, $\hat{I}_c$ is the direction of input charge current, and $\hat{\sigma}$ is the polarization of the generated spin current. From the last relation, injection of charge current into the heavy metal in the x-direction results in y-polarized spin current injection into the free-FM (z-direction) on top of the HM layer.
The resistance ($R$) of the heavy metal is given by

$$R = \frac{\rho L}{W t}$$

Here, $\rho$ and $W$ are the resistivity and width of the heavy metal, respectively.

IV. DOMAIN WALL SYNAPSE
The domain wall synapse is a 3-terminal device, as shown in Fig. 5(a). In this 3-terminal configuration, the read and write paths are distinct, preventing accidental modification of synapse information during reading [15, 31]. The write path in the DW device, illustrated in Fig. 5(a), is between terminals T1 and T3, while the read path is between terminals T2 and T3. The free-FM layer of the DW-MTJ has two oppositely polarized magnetic regions separated by a domain wall. This domain wall can be moved by the spin-orbit torque (SOT) exerted by the heavy metal: the charge current flowing through the heavy metal injects spin current into the free-FM layer and moves the domain wall. The two pinned layers on either side of the free-FM layer help prevent the domain wall from being destroyed when a high current is applied. The movement of the domain wall causes one magnetic region to shrink while the other expands, which changes the average magnetization of the free-FM layer. This change in magnetization translates to a variation in the conductance of the device, due to the tunnel magnetoresistance effect of the MTJ.

A. Device parameters
The spin-orbit coupling at the heavy metal/free-FM interface leads to the Dzyaloshinskii-Moriya interaction (DMI), which stabilizes the Néel domain wall [15, 31, 41-43]. For our synaptic device, we consider a PMA CoFeB ferromagnet with dimensions 500 × 100 × 1 nm$^3$, saturation magnetization ($M_s$) of 0.7 MA/m, PMA constant ($K_u$) of 0.8 MJ/m$^3$, exchange stiffness constant ($A_{ex}$) of 10 pJ/m, damping constant ($\alpha$) of 0.3, and DMI constant ($D$) of 1.2 mJ/m$^2$. We consider the highly efficient Au$_{0.25}$Pt$_{0.75}$ heavy metal [44, 45], with a spin Hall angle ($\theta_{SH}$) of 0.3, resistivity ($\rho$) of 83 µΩ·cm, and a thickness of 4 nm, resulting in a resistance of 1037.5 Ω. Au$_{0.25}$Pt$_{0.75}$ is chosen as the heavy metal since it has a low spin Hall power factor [45], making it more power efficient than other heavy metals.
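The quoted heavy-metal resistance can be checked with $R = \rho L/(W t)$. The short sketch below assumes that the write current flows along the 500 nm length of the track and that the heavy metal shares the 500 × 100 nm footprint of the ferromagnet, which the text does not state explicitly.

```python
def heavy_metal_resistance(rho_ohm_m, length_m, width_m, thickness_m):
    """Resistance of a rectangular heavy-metal strip, R = rho * L / (W * t)."""
    return rho_ohm_m * length_m / (width_m * thickness_m)

RHO_AUPT = 83e-6 * 1e-2   # 83 micro-ohm cm converted to ohm m

# Assumed geometry: current along the 500 nm length, 100 nm width, 4 nm thickness.
r_dw = heavy_metal_resistance(RHO_AUPT, 500e-9, 100e-9, 4e-9)
```

Under these assumptions the result reproduces the 1037.5 Ω stated above.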

B. Results
We show in Fig. 5(b) the conductance of the DW-MTJ device when a write current pulse ($I_{write}$) is applied. We observed that a 100 µA current pulse is needed to move the domain wall to the right edge starting from the center, corresponding to the parallel alignment with the fixed FM layer, and -100 µA is needed to move the domain wall to the left edge, corresponding to the anti-parallel alignment with the fixed FM layer. The velocity of the domain wall due to the applied current is shown in Fig. 5(c), which shows a linear relation for the considered parameters.
During the training of the neural network, the weights both increase and decrease. Corresponding to this requirement, we show in Fig. 6 the response of the DW-MTJ to an input write current pulse train. Figure 6(a) shows the current pulse train, and the corresponding conductance of the DW-MTJ over time is shown in Fig. 6(b). Figures 6(c)-(g) show snapshots of the free-FM magnetization illustrating the movement of the domain wall. We observe that the domain wall returns to its initial position over time, as the net current applied is zero. We also noted the tilting of the domain wall in the presence of current, which can be explained through 1D domain wall theory [42].

V. RELU AND MAX POOLING
The magnetic tunnel junction possesses several properties, including the ability to undergo continuous/linear changes in resistance, which can be achieved by applying orthogonal spin currents to the free-FM of the MTJ [11]. This continuous change in resistance is essential since the ReLU function contains linear regions, requiring a device with linear characteristics for its emulation. Figure 7(a) illustrates the schematic of the MTJ injected with orthogonal spin currents generated by the SHE layer. This linear behavior of the MTJ can be utilized to construct a circuit that emulates the ReLU function, as depicted in Fig. 7(b). The circuit incorporates a resistor $R_1$, a CMOS inverter, and the MTJ device, along with a current source $I_b$ to shift the output and generate the ReLU function. Injecting orthogonal currents also enhances the circuit's stability, allowing us to lower the ferromagnet's thermal stability factor to 4.58. This reduction helps decrease energy consumption while only slightly impacting the circuit's error.
The max pooling function entails finding the maximum of the presented inputs. To achieve this functionality, we use multiple ReLU circuits and introduce competition among them so that the circuit with the highest input becomes the winner. We enable this competition through an n-MOSFET and a resistor $R_2$ connected between each pair of ReLU circuits, as shown in Fig. 8(a).

A. Device parameters
For the ReLU-max pooling circuits, we utilize a PMA CoFeB ferromagnet with dimensions 14.4 × 69.4 × 1 nm$^3$. The saturation magnetization ($M_s$) is 1150 emu/cm$^3$ and the Gilbert damping is 0.01. We consider anisotropy fields ($H_k$) of 330, 2180, and 3300 Oe, corresponding to thermal stability factors ($\Delta$) of 4.58, 30.26, and 45.81, respectively. A lower thermal stability factor is used because it reduces the power consumption while our circuit design still gives accurate results [11, 25]. The heavy metal used is Au$_{0.25}$Pt$_{0.75}$ [44, 45], with a spin Hall angle ($\theta_{SH}$) of 0.3, resistivity ($\rho$) of 83 µΩ·cm, and a thickness of 4 nm, resulting in an input resistance of 1000 Ω. The circuit parameters are a bias current $I_b$ of 9.98 µA, a resistor $R_1$ of 698.93 kΩ, and a resistor $R_2$ of 16 kΩ.
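The relation between the anisotropy fields and the thermal stability factors listed above can be checked with a short calculation. The sketch assumes the macrospin expression $\Delta = H_k M_s V/(2 k_B T)$ in CGS units and a temperature of 300 K; neither the expression nor the temperature is stated in the text, so the small differences from the quoted values may reflect rounding of the volume or a slightly different temperature.

```python
K_B_ERG_PER_K = 1.380649e-16   # Boltzmann constant in erg/K

def thermal_stability(h_k_oe, m_s_emu_cm3, volume_cm3, temperature_k=300.0):
    """Thermal stability factor of a uniaxial macrospin, Delta = Hk Ms V / (2 kB T)."""
    return h_k_oe * m_s_emu_cm3 * volume_cm3 / (2.0 * K_B_ERG_PER_K * temperature_k)

NM_TO_CM = 1e-7
volume = (14.4 * NM_TO_CM) * (69.4 * NM_TO_CM) * (1.0 * NM_TO_CM)   # cm^3
m_s = 1150.0                                                          # emu/cm^3

deltas = [thermal_stability(h_k, m_s, volume) for h_k in (330.0, 2180.0, 3300.0)]
```

Because $\Delta$ scales linearly with $H_k$, reducing the anisotropy field by a factor of ten lowers the energy barrier by the same factor, which is what allows the low-energy operating point with $\Delta \approx 4.58$.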

B. Results
Figure 7(c) shows the output of the ReLU circuit, which closely resembles the ReLU activation function for normalized inputs of less than 1, where the normalization current $I_0$ is 14.5 µA. The ReLU circuit consumes an average power of 0.343 µW. We show in Fig. 8(b) the transient results of the 3 × 3 ReLU-max pooling circuit, where the 9 inputs are chosen randomly to show the max pooling functionality. Here we observe competition among the 9 ReLU circuits that enables the max pooling functionality: the ReLU circuit with the highest input reaches its corresponding output while pushing all other ReLU units to settle to 0 V. Figure 8(c) shows the output of the ReLU-max pooling network, where the inputs are chosen using Monte Carlo simulation. This output closely resembles the ReLU function, demonstrating that our network performs both max pooling and ReLU functions simultaneously. The 3 × 3 ReLU-max pooling network consumes an average power of 17.86 µW.
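The functional behaviour of the two circuits can be summarized by an idealized model, sketched below. It ignores the circuit transients, thermal noise, and any deviation from the ideal ReLU, and it is only meaningful for normalized inputs below 1; the function names and the example currents are hypothetical.

```python
import numpy as np

I0 = 14.5e-6   # normalization current in amperes, from the text

def relu_circuit(i_in, i0=I0):
    """Idealized ReLU response to an input current, in normalized units."""
    return np.maximum(np.asarray(i_in, dtype=float) / i0, 0.0)

def relu_maxpool(i_in):
    """Idealized winner-take-all: only the unit with the largest input stays on."""
    acts = relu_circuit(i_in)
    out = np.zeros_like(acts)
    winner = int(np.argmax(acts))
    out[winner] = acts[winner]
    return out

# Nine illustrative input currents for one 3 x 3 pooling window (amperes).
currents = np.array([-3.0, 1.2, 7.5, 0.4, -8.0, 5.1, 2.2, 10.0, 6.3]) * 1e-6
pooled = relu_maxpool(currents)
```

The single non-zero entry of `pooled` equals the ReLU of the largest input, so taking the maximum of the window and applying the activation happen in one step.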

VI. SEGMENTATION RESULTS
We evaluate our UNet design using the Cambridge-driving Labeled Video Database (CamVid) [46]. This data was captured from the perspective of a driving car; the driving scene increases the number and diversity of the observed object classes. The dataset contains 701 color images with dimensions of 512 × 512 pixels, and each pixel is labeled as one of 32 possible classes. These classes include objects such as buildings, cars, roads, children, bicyclists, etc. To evaluate our network, we partitioned the 701 images into sets of 369 for training, 100 for validation, and 232 for testing purposes.
We show in Table I the energy consumed by the network for different thermal stability factors of the ReLU and ReLU-max pooling circuits. This demonstrates a significant reduction in network energy consumption by employing ferromagnets with lower $\Delta$ values, all while maintaining segmentation accuracy. Specifically, there is a 9.57× improvement in energy when utilizing a ferromagnet with a $\Delta$ of 4.58 compared to one with a $\Delta$ of 45.81. Figure 9(a) shows the accuracy (%) and loss of UNet over 150 epochs for testing and validation datasets. We achieved a validation accuracy of 86.87% and a testing accuracy of 83.71% using ReLU and ReLU-max pooling circuits with $\Delta$ = 4.58. These results closely resemble those of the fully software-based implementation shown in Fig. 10, where the validation accuracy is 87.95% and the testing accuracy is 84.53%. We observe a settling time of 4 ns for the ReLU emulation, and the worst-case settling time for the 9-input ReLU-max pooling network is 12 ns. Considering the data path of the UNet architecture, and assuming the timings are primarily influenced by the ReLU and ReLU-max pooling networks, we estimate the minimum time required for the input image to traverse the network to be 48 ns. We also calculated the energy consumed by the synapses during training, as shown in Fig. 9(b). The energy dissipation per epoch decreases as the network undergoes training and the weights converge. The total energy consumed by the network during training over 150 epochs is 85.79 mJ, out of which 44.30 pJ is consumed by the synapses for weight updates. The energy consumed by the network to process one image during testing is 1.55 µJ with a $\Delta$ of 4.58; this energy is dissipated in the ReLU and ReLU-max pooling units.
We show in Fig. 11 the UNet output of four test images along with the ground truth labels. Here the predicted segmentation results based on our spintronic hardware implementation of UNet closely resemble the ground truth labels.
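The headline numbers of this section can be cross-checked with simple arithmetic. The sketch below uses the training energies quoted in the abstract (821 mJ for the stable ferromagnet and 85.79 mJ for the unstable one) and the dataset split given above; the per-epoch figure is only an average, since the text notes that the energy per epoch decreases during training.

```python
# Dataset split reported for CamVid.
n_train, n_val, n_test = 369, 100, 232
n_total = n_train + n_val + n_test

# On-chip training energies over 150 epochs, in millijoules.
e_stable_mj = 821.0     # stable ferromagnet
e_unstable_mj = 85.79   # unstable ferromagnet (Delta = 4.58)
n_epochs = 150

energy_ratio = e_stable_mj / e_unstable_mj
avg_energy_per_epoch_mj = e_unstable_mj / n_epochs

print(f"total images: {n_total}")
print(f"energy improvement: {energy_ratio:.2f}x")
print(f"average training energy per epoch: {avg_energy_per_epoch_mj:.3f} mJ")
```

The ratio of the two training energies agrees with the 9.57× improvement stated in the text.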

VII. DISCUSSION
The physical realization of the UNet architecture poses challenges due to its large number of parameters. However, addressing complex problems requires a large number of parameters, and this number will continue to grow with increasing problem complexity. Therefore, realizing these networks on specialized neuromorphic hardware is essential for efficient, scalable solutions. Recently, significant progress has been made in the commercial realization of very large numbers of spintronic devices. Notable examples include Renesas, Avalanche Technology, NUMEM & IC'Alps, and Everspin Technologies, which have developed memories of up to 8 Gb based on STT-MRAM. Most of these are STT-MTJ based memory realizations, but they can be extended to neuromorphic computing due to the similarity between memory architectures and cross-bar arrays [47].
These developments in fabricating extremely large numbers of devices offer promising prospects for spintronics-based neuromorphic computing, yet there remains a significant journey ahead. Currently, the implementation of SHE-MTJs and domain-wall MTJs is confined to laboratory settings, and further time and effort are required to integrate a large number of these devices on a chip, which would enable the implementation of complex machine learning algorithms.

VIII. CONCLUSION
In this article, we proposed a spintronics-based hardware implementation for highly complex image segmentation tasks. We showcased the convolution and deconvolution designs based on the domain wall MTJ and also presented the ReLU and max pooling implementations using orthogonal current injected MTJs. We presented our simulation platform that couples micromagnetic simulation, NEGF, circuit simulation, and network implementation to capture the diverse physics of spin transport, magnetization dynamics, and CMOS elements. We demonstrated the potential of our hardware implementation of UNet by assessing its performance on the CamVid dataset, where our results closely match those obtained from the software implementation. We showed that employing an unstable ferromagnet for designing the ReLU and max pooling functions leads to a nearly 10× reduction in network energy consumption for on-chip training, down to 85.79 mJ, without compromising segmentation accuracy.

FIG. 3. The deconvolution operation as a combination of zero insertion and convolution.

FIG. 4. Overview of the simulation setup. (a) The domain wall is simulated micromagnetically in mumax3, and the magnetization outcomes are translated to MTJ conductance using the parallel and anti-parallel conductances obtained from NEGF simulation. (b) Hybrid NEGF-CMOS simulation setup for the ReLU and ReLU-max pooling circuits. The NEGF is interconnected with a voltage divider circuit in a self-consistent manner to compute the MTJ resistance. This resistance is then integrated into the HSPICE circuit simulation using Verilog-A. The LLGS equation is interconnected with other circuit components to compute the ReLU and ReLU-max pooling functions. (c) The characteristics of the DW synapse, ReLU circuit, and ReLU-max pooling network are incorporated into the TensorFlow package to implement the UNet architecture, which is utilized for semantic image segmentation.

Figure 4 presents an overview of the simulation method, encompassing micromagnetic simulation, NEGF formalism, magnetization dynamics, circuit simulation, and UNet implementation. The simulation can be divided into three components: domain-wall synapse simulation, ReLU-max pooling design, and UNet implementation.

FIG. 5. (a) Schematic of the domain-wall based synapse. $I_{write}$ denotes the write current passing through terminals T1 and T3, while $I_{read}$ represents the read current flowing through terminals T2 and T3. (b) The conductance of the DW-MTJ device with respect to the input current pulse ($I_{write}$). (c) The velocity of the domain wall with varying input current density.

FIG. 6. Response of the domain wall to a write current pulse train. (a) Train of write current pulses. (b) Conductance of the DW-MTJ corresponding to the current pulse train. (c) $m_z$ magnetization of the free-FM of the DW-MTJ at t = 0 ns. (d)-(g) Snapshots of the $m_z$ magnetization at various time points, illustrating the response to the current pulse train.

FIG. 7. (a) Schematic of the orthogonal current injected SHE-MTJ device for continuous resistance change. (b) Circuit design for ReLU function emulation. (c) The output of the ReLU circuit with $I_0$ = 14.5 µA, $V_{DD}$ = 0.5 V, and $\Delta$ = 4.58.


FIG. 9. (a) Accuracy (%) and loss of the UNet across 150 epochs for the CamVid dataset. (b) Energy consumption in all DW-synapses during network training as a function of the number of epochs.

FIG. 11. Comparison of UNet results to the ground truth. (a) Four images from the test set of the CamVid database. (b) Ground truth corresponding to these images. (c) Results obtained from our UNet for the four test images. Each color in the label images corresponds to a distinct element in the image; for instance, cars are represented by pink, while buildings are depicted in red.

TABLE I. Performance metrics for on-chip training for different thermal stability factors of the ReLU and ReLU-max pooling networks.