Local Bit-Line Sharing Robust Dual-Port 8T SRAM With Virtual VSS for Energy-Efficient In-Memory Computing Architecture



I. INTRODUCTION
The state-of-the-art computing systems are based on the Von-Neumann architecture [1], which separates data storage from the processing element. Energy efficiency and performance are priorities in the further development of data-intensive applications such as Artificial Intelligence (AI), machine learning, and neuromorphic computing. However, the Von-Neumann architecture imposes a bottleneck [2] on the transfer of data between memory and the processing element. Energy consumption and latency increase due to this bottleneck.
In-Memory Computing (IMC) is a new paradigm to mitigate the bottleneck of Von-Neumann architecture-based computing systems. IMC integrates the ALU (computation) and storage in a single memory array, so all computation can be done within the memory array. IMC not only enhances system performance but also improves energy efficiency and supports massive parallelism [3]. IMC is a promising candidate for beyond-Von-Neumann computing, with applications ranging from high-performance artificial intelligence to neuromorphic computing systems and AI-based Internet of Things (IoT) devices [4].
Anil Kumar Rajput was with the Department of Information Technology, ABV-Indian Institute of Information Technology, Gwalior, India (e-mail: anil@iiitm.ac.in).
Recently, the IMC concept has been widely discussed for all types of memory, from emerging nonvolatile memories [5] to Static Random Access Memory (SRAM) [6]-[9]. Nonvolatile memories are still in the research phase and suffer from low endurance and high write energy, which limits nonvolatile-memory-based IMC to systems that do not need to frequently update the stored data [4]. SRAM-based IMC provides faster writes and lower write energy with high (approximately unlimited) endurance, which makes it applicable to systems of small and medium capacity for a wide range of applications. Implementing IMC (integration of computation and storage in the same place) in SRAM [10] is therefore significant for improving system throughput and energy efficiency.
Conventional 6T (C6T) SRAM has been used to perform IMC operations [6], [7], [8] by activating two word lines (WLs) simultaneously. However, C6T SRAM-based IMC suffers from compute-disturbance (data flips during IMC) due to destructive reads, and from an inevitably low operating frequency. A read-decoupled path has been proposed in 4+2T SRAM [9] to resolve the destructive read problem. It improves the read margin and the performance of Content Addressable Memory (CAM) and IMC operations. However, 4+2T SRAM-based IMC requires an extra cycle to write back the computed result, which increases the system's latency. A read computation scheme [11] has been proposed to improve the latency of IMC operation by using 8T SRAM with a skewed inverter as a sense amplifier. However, that scheme suffers from compute-failure (false computing results) due to process variations [12]. The work in [13] presents a sensing scheme to overcome compute-failure in 8T SRAM for IMC operation. Apart from this, the work in [14] uses a separate write word line to perform compound Boolean logic with search operations using 8T SRAM. Further, Transpose-8T SRAM has been introduced to perform programmable IMC [15]. A dual-port 8T SRAM has been proposed for current-based compute-in-memory dot-product operations [16]. However, the 8T SRAMs [11], [13], [15], [16] still suffer from low array efficiency and the half-select issue [12] due to 6T-like write operations [17], [18]. The 10T SRAMs reported in [19], [20] are free from compute-failure and compute-disturb and compute all logic operations in a single cycle. However, 10T SRAM suffers from the half-select issue during IMC operation under process variation [21], which degrades the performance of 10T SRAM-based IMC architectures. To overcome the compute-failure, compute-disturb, and half-select issues, 12T SRAM [12] has been proposed for IMC, at the cost of increased area and energy consumption.
In this paper, a local bit-line sharing Dual-Port 8T (SDP8T) SRAM with Virtual VSS is proposed. The proposed SRAM improves write stability, leakage power, read energy, and read delay at the cost of increased write delay and write energy compared to the recently reported DP8T SRAM [16].
A reliable IMC architecture is proposed using the SDP8T SRAM. The proposed IMC architecture is free from compute-disturb, compute-failure, and half-select issues during IMC operation. Further, the proposed SDP8T SRAM-based IMC architecture fully utilizes all read and write bit-line ports and simultaneously performs In-Memory Boolean Computation (IMBC), which improves latency, throughput, and energy consumption compared to the recently reported IMC architectures in [6], [11], [12], [14]. Furthermore, the SDP8T-IMC architecture can be configured as a Binary Content Addressable Memory (BCAM) for search applications.
The remainder of the paper is organized as follows. Section II gives an overview of the proposed SRAM and the IMC architecture structure; Section III presents the analysis of the proposed SDP8T SRAM; Section IV presents the SDP8T SRAM-based IMC architecture; and Section V presents the evaluation of the IMC architecture. Finally, the conclusion is presented in Section VI.

II. THE PROPOSED SDP8T SRAM
This section describes the structure and operation of the proposed SDP8T SRAM, as shown in Figure 1. Figure 1(a) shows the overall IMC architecture using the proposed SDP8T SRAM. The SRAM has three main parts: read-decoupled bit-cells, two pass-gate transistors (PGL/PGR), and one write-assist transistor (WA), as shown in Figure 1(b). The proposed SDP8T SRAM performs a double-ended write through transistors (GL, PGL) and (GR, PGR), and decouples the read path using high-Vth (HVT) transistors (RL, RR) for a differential read. The VVSS node and the PGL and PGR transistors are shared, together with a low-Vth (LVT) NMOS write-assist transistor (WA), within a block of four SRAM cells, as shown in Figure 1(b). The VVSS is used to reduce leakage power during the hold state and also to enhance the write operation [22].
In the proposed SDP8T SRAM, RWL and RBL/RBLB have little impact on the Q and QB storage nodes. Therefore, the read noise margin is significantly improved compared to C6T SRAM. As a result, reliable multi-word activation for IMC operation is achieved. To perform IMC operations, multiple RWLs and WWLs are activated simultaneously using the RWL driver and WWL driver. The RBL and WBL Sense Amplifier (SA) pairs perform the Boolean logic operations using the four operands available on RBL, RBLB, LWBL, and LWBLB. Multiple IMBC operations (NAND, AND, NOR, OR, XOR) are obtained using the RBL-SA and WBL-SA in a single cycle. This increases the throughput and makes the IMC architecture more energy- and performance-efficient. The Match Line SA (ML-SA) is used to perform the BCAM operation, discussed in Section IV. The Vref generator provides the reference voltage to the RBL-SA, WBL-SA, and ML-SA. The control signals used to perform the hold, read, write, and IMC operations are shown in Figure 2.
1) Hold Operation: RWL is kept high during the hold operation, and the WWL, WS, and CNTRL signals are kept low. The GND is not directly connected to the pull-down transistors (DL, DR). Logic zero is obtained at VVSS by using the low-Vth (LVT) WA transistor with a larger width (approximately four times that of the access transistor), as shown in Figure 1(b). As a result, VVSS settles at 100 mV during the hold operation.
2) Read Operation: During the read operation, RBL and RBLB are precharged to VDD, WWL and WS are kept at GND, and CNTRL is kept high; as a result, logic zero is achieved at VVSS. As RWL is driven from VDD to GND, RBL and RBLB discharge according to the values stored at Q and QB, respectively. The voltage difference between RBL and RBLB is sensed by a low-offset double-tail latch-type voltage sense amplifier [23].
3) Write Operation: During the write operation, RWL is kept at VDD and the CNTRL signals are kept low. The voltages of GWBL and GWBLB are set to the data to be written by the write driver, as shown in Figure 1(a). These voltages are transferred to storage nodes Q and QB, respectively, of the selected bit-cell inside the SDP8T SRAM by enabling the WS and WWL signals. Each write access path is composed of two series NMOS transistors, (GL, PGL) or (GR, PGR), which increases the write delay and reduces write stability. To overcome the write stability problem, the Virtual VSS write-assist technique [24] is used in the proposed SRAM.
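The signal levels described for the three operations can be summarized in a small lookup table. The following is only an illustrative sketch: the dictionary layout and the helper names `read_path_active` and `write_path_active` are hypothetical and not part of the original design description, and the levels are simplified to 1 (VDD) and 0 (GND).

```python
# Illustrative encoding of the control-signal levels described in the text.
# 1 = VDD (high), 0 = GND (low). RWL is active-low: a read drives it to GND.
# The table layout and helper names are assumptions made for this sketch.
CONTROL_SIGNALS = {
    "hold":  {"RWL": 1, "WWL": 0, "WS": 0, "CNTRL": 0},
    "read":  {"RWL": 0, "WWL": 0, "WS": 0, "CNTRL": 1},
    "write": {"RWL": 1, "WWL": 1, "WS": 1, "CNTRL": 0},
}


def read_path_active(op: str) -> bool:
    """The decoupled read path conducts only while RWL is pulled low."""
    return CONTROL_SIGNALS[op]["RWL"] == 0


def write_path_active(op: str) -> bool:
    """The series write path (GL/GR and PGL/PGR) needs both WWL and WS high."""
    signals = CONTROL_SIGNALS[op]
    return signals["WWL"] == 1 and signals["WS"] == 1


for op in CONTROL_SIGNALS:
    print(op, "read path:", read_path_active(op), "write path:", write_path_active(op))
```

In this simplified view, no operation enables the read and write paths at the same time; the IMC mode, whose control signals are given only in Figure 2, is not modeled here.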

4) Half-Select Issue: The half-select issue is defined as the flip of the stored value in a half-selected cell (row or column) during the write-back operation of IMC [12]. The C6T SRAM suffers from bit-line disturbance in the row-half-selected bit-cells [18]. Because the write word line simultaneously enables all bit-cells in a row, including words in non-selected columns, half-select disturbance happens in the row-half-selected cells, whereas column-half-selected bit-cells escape the half-select issue because their WWLs are inactive.
In the proposed SDP8T SRAM, one extra control signal, WS, is incorporated through the pass-gate transistors PGL/PGR to resolve the half-select issue in row-half-selected cells. Figure 3 shows the half-selected cells during a write operation on the selected cell. The storage nodes (Q1 and QB1) of a half-selected cell are maintained because they are disconnected from the GWBL pair, since WWL[X] and WS[Y] are kept low. Thus, row-half-select issues are resolved using the pass-gate transistors PGL/PGR. To analyze the half-select issue, Monte-Carlo (MC) simulations are performed during write operations on the selected cell and on the row- and column-half-selected cells. The MC simulation results in Figure 4 show that the half-select issue is not present in the proposed SDP8T SRAM.
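The selection logic behind this argument can be sketched as a purely logical model. The sketch below assumes that a cell's storage nodes connect to the GWBL pair only when both its row signal WWL and its column-side signal WS are asserted; the function names and the one-WS-per-column indexing are hypothetical simplifications, since in the actual design WS and the pass gates are shared by a block of four cells.

```python
# Logical model of write selection (not an electrical simulation).
# Assumption for this sketch: one WS signal per column index.

def sdp8t_connected_cells(rows: int, cols: int, wwl_row: int, ws_col: int) -> set:
    """Cells whose storage nodes are connected to GWBL in the SDP8T array.

    A cell is connected only if its WWL (row) AND its WS (column) are high,
    because GL/GR and PGL/PGR are in series in the write path.
    """
    return {(r, c) for r in range(rows) for c in range(cols)
            if r == wwl_row and c == ws_col}


def c6t_row_half_selected(cols: int, wl_row: int, sel_col: int) -> set:
    """Cells exposed to bit-line disturbance in a C6T array.

    The word line opens every cell in the row, so all unselected columns
    of that row are row-half-selected.
    """
    return {(wl_row, c) for c in range(cols) if c != sel_col}


print(sdp8t_connected_cells(4, 4, wwl_row=1, ws_col=2))
print(sorted(c6t_row_half_selected(4, wl_row=1, sel_col=2)))
```

Under these assumptions only the selected cell is reached in the SDP8T model, whereas the C6T model leaves three row-half-selected cells exposed in a four-column array.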

III. EVALUATIONS OF SDP8T SRAM
In this section, the benchmark circuits of C6T, Transpose-8T [15], DP8T [16], 8T [11], 10T [20], and the proposed circuit are implemented in the same 65 nm CMOS technology with exact sizing and simulated at a 1 V supply voltage and 27 °C with a typical process corner for performance comparison.
2) RSNM: Table I contains the RSNM values of various SRAM cells at VDD = 1 V. The proposed SRAM has a read-decoupled path. Hence, the read current does not flow through the storage node, which improves read stability. Therefore, the RSNM of the proposed SRAM is improved by 2.43× and 2.11× over C6T and Transpose-8T [15], respectively. The use of a virtual ground through the WA transistor makes the pull-down network slightly weaker than the pull-up network. Therefore, the RSNM of the proposed SDP8T SRAM is improved by 1.6% compared to the other read-decoupled SRAMs (DP8T [16] and 10T [20]), as given in Table I. The RSNM improves as the supply voltage increases, as shown in Figure 5(a). The threshold voltage of a transistor decreases with increasing temperature. As a result, the leakage current increases, thereby reducing the RSNM, as shown in Figure 5(b).
The RSNM of the proposed SRAM across process corners is shown in Figure 6. The threshold voltage changes due to inter-die variation at different process corners, and the read stability changes accordingly. The difference in the mean RSNM between the best-case and worst-case corners of the proposed SRAM is 60 mV, which is lower than that of the DP8T SRAM (64 mV). Therefore, the proposed SDP8T SRAM shows better process-variation tolerance in RSNM compared to the DP8T SRAM [16].
3) WSNM: The write margin (WM, or WSNM) is calculated as a dynamic write margin using the Write Trip Voltage (WTV), to reflect the dynamic behavior of the SRAM using VVSS during a write operation. The WTV is defined as the difference between VDD and the WWL voltage at which the storage nodes Q and QB flip. The proposed SRAM has two series access transistors (PGL, GL), which degrade the WM of the SRAM, as shown in Figure 7(a). However, the use of VVSS at the ground terminal during the write operation improves the WM of the SRAM cell [24]. The VVSS degrades the strength of the pull-down transistors, which improves write-1 stability and degrades write-0 stability. Overall, the WM of the proposed SRAM improves, as shown in Figure 7(b). Therefore, the WSNM of the proposed SRAM is improved by 26.29% over C6T, as given in Table I.
The WM of the proposed SRAM improves as the supply voltage increases, as shown in Figure 8(a). The WM of the proposed SRAM is better than that of C6T, DP8T [16], Transpose-8T [15], and 10T [20] SRAM at all supply voltages from 0.4 V to 1 V. The sub-threshold current increases with temperature; therefore, the WM increases with temperature, as shown in Figure 8(b). Furthermore, the WM of the proposed SRAM across process corners is shown in Figure 9. The difference in the mean WM between the best-case and worst-case corners is 60 mV, which is smaller than the WM variation of the DP8T [16] SRAM cell (101 mV). Therefore, the proposed SRAM shows better process-variation tolerance in terms of WM compared to DP8T [16].
B. Dynamic Noise Margin
1) WDNM: Write Dynamic Noise Margin (WDNM) is an essential parameter for analyzing an SRAM's noise-handling capability under active conditions. WDNM is defined as the minimum pulse width of the write word line (WWL) required to bring the bit-cell storage node to the switching threshold [25]. As observed from Table I, the WDNM of the proposed SRAM is 20%, 3.44%, and 8.19% better than that of C6T, Transpose-8T [15], and 10T [20] SRAM, respectively.
2) RDNM: The Read Dynamic Noise Margin (RDNM) is defined as the voltage difference between storage nodes Q and QB when the read bit line reaches the sensing voltage during a read operation [26]. The RDNM of the proposed SRAM is 29.23% and 13.91% better than that of C6T and Transpose-8T [15] SRAM, respectively, as observed from Table I.

C. Leakage Power
The power consumed by the cell during standby is measured as leakage power. Sub-threshold leakage is a major component of leakage power in the hold condition. The VVSS offers a stacking effect that reduces sub-threshold leakage [22] in the hold condition. Therefore, the leakage power of the proposed SRAM is improved by 46.24%, 55.49%, 54.38%, 47.95%, and 52.63% compared to C6T, DP8T [16], 8+T [11], Transpose-8T [15], and 10T [20] SRAM, respectively, as observed from Table I. At lower supply voltages, the sub-threshold leakage power decreases due to reductions in drain-induced barrier lowering (DIBL), gate-induced drain leakage (GIDL), and gate tunneling current [22]. Therefore, the leakage power is reduced as VDD is reduced, as shown in Figure 10(a). The leakage power is also measured at different temperatures. Figure 10(b) shows that the proposed SRAM has the smallest change in leakage current with respect to temperature, which indicates better temperature tolerance than C6T, DP8T [16], Transpose-8T [15], and 10T [20] SRAM.

D. Cost Comparison
The cost of the SRAM is compared in terms of read/write energy and read/write delay.
1) Read/Write Delay: The proposed SRAM uses series transistors for the write operation. Therefore, the write delay of the proposed SRAM is 21% larger than that of DP8T SRAM [16]. The proposed SRAM uses high-Vth transistors for the read operation. Therefore, the read delay of the proposed SRAM is reduced by 1.37% compared to DP8T SRAM [16], as given in Table I.
2) Read/Write Energy: The proposed SRAM uses high-Vth transistors for the read operation. Therefore, the read energy of the proposed SRAM is improved by 47.02% compared to DP8T SRAM [16], as observed from Table I. The write energy of the proposed SRAM is relatively higher because of the floating VVSS and the two series transistors in the write path.
3) Area: Figure 11 shows the layouts of the proposed SDP8T SRAM and the DP8T SRAM [16] in 65 nm UMC technology. The three extra transistors PGL, PGR, and WA are laid out at the left, right, and bottom, respectively, as shown in Figure 11(a). The extra transistors lead to an area overhead of 20% in the proposed SDP8T SRAM (80.44 µm², 6.41 µm × 12.55 µm) compared to DP8T SRAM [16] (66.91 µm²).
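The quoted area figures can be cross-checked with a few lines of arithmetic. The snippet below uses only the dimensions and areas stated above; the variable names are illustrative.

```python
# Cross-check of the layout figures quoted in the text (values from the paper).
width_um, height_um = 6.41, 12.55   # SDP8T layout dimensions in micrometres
sdp8t_area_um2 = 80.44              # reported SDP8T area
dp8t_area_um2 = 66.91               # reported DP8T area [16]

# Area implied by the quoted dimensions, and the relative overhead.
layout_area_um2 = width_um * height_um
overhead_pct = (sdp8t_area_um2 - dp8t_area_um2) / dp8t_area_um2 * 100.0

print(f"area from dimensions: {layout_area_um2:.4f} um^2")
print(f"overhead vs DP8T: {overhead_pct:.2f} %")
```

The product of the dimensions agrees with the reported area to within rounding, and the computed overhead is slightly above 20%, consistent with the rounded value in the text.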

E. Figure of Merit
There are different parameters (SNM, delay, power, and area) that characterize the performance of an SRAM. These parameters are application-dependent and trade off against one another. Thus, a Figure of Merit (FOM) is defined in Equation 1 [27]. It is observed from Table I that the FOM of the proposed SRAM is 2.005×, 1.11×, 2.20×, and 1.56× better than that of C6T, DP8T [16], Transpose-8T [15], and 10T [20] SRAM, respectively.

IV. SDP8T BASED IMC ARCHITECTURE
The overall IMC architecture using the SDP8T SRAM is shown in Figure 1(a); it performs both data storage and computation inside the memory array. The proposed IMC architecture works in two modes: IMBC mode and in-memory CAM mode. The control signal CAMEN selects the mode. In the SDP8T-IMC architecture, multiple rows are activated simultaneously to perform IMBC operations. The SAs compute different Boolean logic operations in a single clock cycle, and the result is stored in the same column. An asymmetric SA is used in [11] to sense bit-wise NAND and NOR operations; it reduces the impact of process variation but requires a longer discharging time (bit-line sensing delay) to develop the bit-line voltage difference, which reduces the maximum operating frequency and degrades the performance of the IMC architecture [28]. A symmetric SA with low offset voltage [23] is used in the proposed SDP8T-IMC architecture to sense bit-wise NAND and NOR operations of A and B. Therefore, the proposed IMC architecture operates at a higher maximum operating frequency than [11], [28].
If both operands A and B are 0, the pre-charged RBL line retains its pre-charged state, and the pre-charged RBLB line starts discharging from VDD. The discharging of RBLB is indicated by ↓ in Figure 12(b). When RBLB discharges below the reference voltage, the SEN signal is activated for the SA and the RWL signals are deactivated. After that, SA1 senses the difference between the RBL voltage and V-REF and correspondingly produces 1 and 0 at the positive and negative terminals of SA1, respectively. The timing waveform for this case is shown in Figure 13(a).

A. In Memory Boolean Computation
If exactly one of the operands A or B is 1, both RBL and RBLB discharge. After that, SA1 senses the difference between the RBL voltage and V-REF and correspondingly produces 0 and 1 at the positive and negative terminals of SA1, respectively. The timing waveforms for these cases are shown in Figure 13(b-c).
If both operands A and B are 1, the pre-charged RBLB line retains its pre-charged state, and the pre-charged RBL line starts discharging. After that, SA1 senses the difference between the RBL voltage and V-REF and correspondingly produces 0 and 1 at the positive and negative terminals of SA1, respectively. The timing waveform for this case is shown in Figure 13(d).
From Figure 12(a) and Figure 13(a-d), it is found that the positive and negative terminals of SA1 represent the NOR and OR operations, respectively. RBLB is connected to the complementary nodes. Therefore, RBLB represents the AND/NAND operation of A and B through sense amplifier SA2. The LWBL and LWBLB perform the basic logic operations on operands C and D through SA3 and SA4, in the same way as described for operands A and B. An additional NOR gate is used between the two sense amplifiers SA1 and SA2 to perform the XOR operation of A and B, and another NOR gate is used between SA3 and SA4 to perform the XOR operation of C and D, as shown in Figure 12(a). To perform a Boolean logic operation between C and D, WWL[2] and WWL[3] are activated simultaneously, which resembles the C6T scenario, so there is a chance of compute-disturbance. Figure 14(a-b) shows 1K-point Monte-Carlo simulation results of compute-disturbance for different write word line (WWL) activation times. It is observed from Figure 14(a) that there is no compute-disturbance under process variation when the WWL activation time is 200 ps. When WWL is turned on for 300 ps, compute-disturbance occurs and 16.6% of the data is flipped, as shown in Figure 14(b). In the proposed IMC architecture, WWL[2] and WWL[3] are activated for 200 ps for reliable sensing (no compute-disturbance) of Boolean logic [28].
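The bit-line behavior described for the four operand combinations can be captured in a small truth-table model. This is a logic-level sketch only: it assumes, following the text, that RBL stays precharged only when both stored bits are 0 and RBLB stays precharged only when both are 1. The function name `imbc` and the assignment of AND and NAND to particular SA2 terminals are assumptions made for illustration.

```python
# Logic-level sketch of the two-operand IMBC on the read bit lines.
# No voltages, timing, or sense-amplifier offsets are modeled.

def imbc(a: int, b: int) -> dict:
    """Return the Boolean results for stored bits a and b (each 0 or 1)."""
    # RBL keeps its precharge only if both Q nodes are 0.
    rbl_high = int(a == 0 and b == 0)
    # RBLB keeps its precharge only if both QB nodes are 0, i.e. a = b = 1.
    rblb_high = int(a == 1 and b == 1)

    nor_ab, or_ab = rbl_high, 1 - rbl_high        # SA1 positive / negative
    and_ab, nand_ab = rblb_high, 1 - rblb_high    # SA2 (terminal mapping assumed)
    xor_ab = int(not (nor_ab or and_ab))          # extra NOR gate on SA1 and SA2
    return {"NOR": nor_ab, "OR": or_ab, "AND": and_ab,
            "NAND": nand_ab, "XOR": xor_ab}


for a in (0, 1):
    for b in (0, 1):
        print(a, b, imbc(a, b))
```

The same model applies to operands C and D on LWBL and LWBLB through SA3 and SA4. The XOR result follows because NOR(A, B) and AND(A, B) are both 0 exactly when the operands differ.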

B. In Memory CAM operation
CAM can be divided into BCAM and Ternary CAM (TCAM) according to search accuracy. The cross-coupled inverter structure is used to store the data, and two decoupled read paths corresponding to two match lines are used for comparison.
1) BCAM Search: The proposed SDP8T-IMC architecture can also be configured as a binary content-addressable memory (BCAM). The CAMEN control signal configures the proposed architecture in BCAM mode. Figure 15 shows the BCAM configuration for a 2-bit search operation using SDP8T SRAM. In the BCAM configuration, RBL, RBLB, and RWL are configured as the select line (SL), select line bar (SLB), and match line (ML), respectively, when the CAMEN signal is 1. One terminal of the sense amplifier is connected to ML and the other terminal is connected to Vref, similar to a conventional BCAM [29]. In BCAM mode, the search data 1 and 0 are applied on SL[0] and SL[1], and ML is precharged to VDD. ML remains at its initial value VDD if the stored data matches the select lines; otherwise, ML discharges. The row-wise sense amplifier has the same circuit structure as the column-wise small differential sense amplifier used for logic operations. As observed from Figure 15, the data in row[0] matches the search data string; as a result, ML[0] remains at the precharged value VDD, and the sense amplifier outputs logic 1 for row[0]. The search data does not match the stored values of row[1], row[2], and row[3]; as a result, ML[1], ML[2], and ML[3] discharge from the precharged value to zero, and the corresponding sense amplifiers output logic 0.
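The match-line behavior of the 2-bit search can be sketched at the logic level. In the sketch below, a match line stays at 1 (precharged) only if every stored bit in the row equals the corresponding search bit. The function name `bcam_search` and the stored contents of rows 1 to 3 are hypothetical; only the search word (1, 0) and the fact that row[0] matches come from the text.

```python
# Logic-level sketch of a BCAM search: one match line (ML) per row.

def bcam_search(stored_rows, search_word):
    """Return the sensed ML value per row: 1 = match (ML stays at VDD)."""
    results = []
    for row in stored_rows:
        ml = 1  # ML is precharged to VDD before the search
        for stored_bit, search_bit in zip(row, search_word):
            if stored_bit != search_bit:
                ml = 0  # any mismatching bit discharges the match line
        results.append(ml)
    return results


# Hypothetical array contents: row[0] holds the search word, the others do not.
rows = [(1, 0), (0, 1), (1, 1), (0, 0)]
print(bcam_search(rows, (1, 0)))
```

With these assumed contents, only the first row reports a match, mirroring the ML[0] behavior described for Figure 15.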

V. EVALUATION OF IMC ARCHITECTURE
Reliability remains the primary concern in existing SRAM-based IMC. The reliability of an IMC architecture is assured by avoiding compute-disturb, compute-failure, and half-select issues during IMBC operation. To verify the reliability of SDP8T-IMC, 2000-point Monte-Carlo simulations across process/mismatch variation and 3-sigma variation in nominal VDD are performed.

A. Mitigating Disturbance and Failure
The proposed SDP8T contains a read-decoupled path, which removes sneak-path current [28] and eliminates compute-disturbance. Figure 16 shows that there is no compute-disturbance (data flip) and no compute-failure in the SDP8T-IMC architecture during IMBC operation. Further, the half-select stability of the SDP8T-IMC architecture is tested under process variations. MC simulations are performed during write operations on row-half-selected cells at different supply voltages (VDD) to analyze the half-select issue. The proposed SDP8T SRAM uses the pass-gate transistors PGL and PGR to resolve the half-select issue. Therefore, the proposed SRAM has zero failure probability at all simulated supply voltages (0.3 V to 0.7 V), as shown in Figure 17. In contrast, the failure probability of DP8T SRAM [16] increases as the supply voltage decreases. Hence, the SDP8T SRAM-based IMC architecture is able to mitigate the compute-disturb, compute-failure, and half-select issues observed in previous IMC proposals.

B. Energy Benefits
The proposed SDP8T SRAM-based IMC architecture has a low bit-line swing and a decoupled read path, which eliminates sneak-path current, and the virtual ground condition leads to a significant reduction in energy consumption. When the works based on C6T [6], 6TCSRAM [28], 8+T [11], 8T [14], 10T [20], 12T [12], and 4+2T [9] are scaled to 65 nm (Energy ∝ Technology² [20]), they consume 16.29 fJ/bit, 15.13 fJ/bit, 27.67 fJ/bit, 22.5 fJ/bit, 27.94 fJ/bit, 17 fJ/bit, and 31.8 fJ/bit, respectively. The proposed work reduces energy by 32.22%, 27.03%, 60.10%, 50.93%, 60.48%, 35.05%, and 65.28% relative to these designs, respectively, as observed from Table II. The proposed SDP8T-IMC architecture performs reliable IMBC operation without any disturbance or half-select issue from 1 V down to below 0.4 V. Therefore, the energy efficiency of the proposed IMC architecture can be further improved by scaling the supply voltage toward the sub-threshold region. The energy consumption of the proposed IMC architecture is improved by 76% when the supply voltage is scaled down from 1 V to 0.4 V, as shown in Figure 18.
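The energy figures quoted above can be checked for internal consistency. The sketch below back-computes, from each prior design's scaled energy and its quoted percentage, the per-bit energy that the proposed work would need to have. That value of roughly 11.04 fJ/bit is inferred here and is not stated in the text. The helper `scale_energy` illustrates the quadratic technology scaling rule under the assumption that "Technology" means the feature size in nanometres; both function names are hypothetical.

```python
# Consistency check on the quoted energies (fJ/bit at 65 nm) and percentages.

def scale_energy(energy_fj: float, from_nm: float, to_nm: float = 65.0) -> float:
    """Scale energy between nodes using Energy proportional to Technology^2."""
    return energy_fj * (to_nm / from_nm) ** 2


def implied_proposed_energy(prior_fj: float, reduction_pct: float) -> float:
    """Energy of the proposed work implied by a relative reduction."""
    return prior_fj * (1.0 - reduction_pct / 100.0)


# (scaled energy of prior work, quoted percentage), taken from the text.
reported = {
    "C6T [6]":      (16.29, 32.22),
    "6TCSRAM [28]": (15.13, 27.03),
    "8+T [11]":     (27.67, 60.10),
    "8T [14]":      (22.50, 50.93),
    "10T [20]":     (27.94, 60.48),
    "12T [12]":     (17.00, 35.05),
    "4+2T [9]":     (31.80, 65.28),
}

implied = {name: implied_proposed_energy(e, r) for name, (e, r) in reported.items()}
spread = max(implied.values()) - min(implied.values())

for name, value in implied.items():
    print(f"{name:13s} -> {value:.3f} fJ/bit")
print(f"spread: {spread:.4f} fJ/bit")
```

All seven pairs imply the same energy for the proposed work to within rounding, which supports reading the percentages as reductions achieved by the proposed architecture.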

C. Figure of Merits
The C6T [6], 6TCSRAM [28], 8+T [11], 8T [14], 10T [20], 12T [12], and 4+2T [9] SRAM-based IMC architectures are used to evaluate the proposed IMC architecture in terms of compute error rate (CER), energy, throughput, latency, and area in the same 65 nm CMOS technology. CER is defined as the percentage of MC simulations that fail during IMBC operations, as described in [12]. Energy is defined as the average energy consumption per bit of IMBC operation. Throughput is measured as the number of operands computed in a single clock cycle. Latency is defined as the number of cycles required to perform the read, compute (IMBC), and store operations. These performance parameters trade off against one another. Therefore, the Figure of Merit (FOM) [12] described in Equation 2 is used to compare IMC architectures.

D. Array Efficiency
The bit-interleaving (also termed column selection) architecture, which is universally employed in SRAM to achieve array efficiency [18], cannot be used with the DP8T SRAM cell because of the half-select issue. A non-bit-interleaving architecture requires a sense amplifier for each column [17]. The array efficiency (AE) of the SRAM array is estimated using Equation 3 [17], given below, where TTA is the total transistor count in the cell array and TTP is the total transistor count in the peripheral circuits (decoder, drivers, precharge circuit, and sense amplifiers).

$$AE = \frac{TTA}{TTA + TTP} \qquad (3)$$
It is found that the non-bit-interleaving architecture has lower array efficiency due to the increased peripheral circuitry (a sense amplifier is required for each column). However, the proposed SDP8T SRAM supports a bit-interleaved structure. Therefore, the proposed SDP8T SRAM has higher array efficiency than the DP8T SRAM [16], as shown in Figure 19. The array efficiency of the SDP8T SRAM can be further improved by increasing the column-interleaving size (Col. int).
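The effect of column interleaving on Equation 3 can be illustrated numerically. Every transistor count in the sketch below is hypothetical (array size, transistors per sense amplifier, and the remaining peripheral count are not given in the text), and the shared PGL/PGR/WA transistors of the SDP8T blocks are ignored. The only point is that sharing one sense amplifier among several interleaved columns lowers TTP and therefore raises AE.

```python
# Illustration of AE = TTA / (TTA + TTP) with hypothetical transistor counts.

def array_efficiency(tta: int, ttp: int) -> float:
    """Array efficiency from cell-array (TTA) and peripheral (TTP) counts."""
    return tta / (tta + ttp)


def peripheral_count(cols: int, interleave: int,
                     sa_transistors: int = 20, other: int = 5000) -> int:
    """Hypothetical TTP: one sense amplifier per group of interleaved columns."""
    sense_amps = cols // interleave
    return other + sense_amps * sa_transistors


rows, cols, transistors_per_cell = 128, 128, 8   # assumed array organisation
tta = rows * cols * transistors_per_cell

ae_no_interleave = array_efficiency(tta, peripheral_count(cols, interleave=1))
ae_interleave_4 = array_efficiency(tta, peripheral_count(cols, interleave=4))

print(f"no interleave : {ae_no_interleave:.4f}")
print(f"4:1 interleave: {ae_interleave_4:.4f}")
```

Under these assumed counts the interleaved organisation gives the higher efficiency, in line with the trend reported for Figure 19; the absolute values carry no meaning beyond the illustration.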

E. Performance of IMC architecture
Table III compares the proposed scheme with other available methods for logic/BCAM operations. Figure 18 shows the simulated maximum frequency and average energy consumption of the IMC operation between two operands in two rows and of the read operation at 27 °C. The difference between the read and IMC maximum operating frequencies is only 12% over a wide supply voltage range, as shown in Figure 18. The proposed SDP8T-IMC architecture can operate at a maximum frequency of 1050 MHz in the TT process corner at VDD = 1 V, as shown in Figure 18. The proposed architecture is 2.23× faster than the work reported in [14] using 65 nm technology and 3.15× faster than the work reported in [11] using 45 nm technology, as observed in Table III. The IMBC operation consumes approximately twice the energy of a read, as shown in Figure 18, because the operands of the IMBC operation are two words. In terms of BCAM, the proposed architecture consumes 0.60 fJ/search/bit at 1 V.

VI. CONCLUSION
A local bit-line sharing Dual-Port 8T (SDP8T) SRAM with Virtual VSS for an energy-efficient IMC architecture is proposed, with 26.49% and 52.63% improvements in write margin and leakage power, respectively, and a 62.97% reduction in read energy, at the cost of a 31.19% reduction in HSNM, compared to 10T SRAM. The FOM of the proposed SRAM is 2.005×, 1.11×, 2.20×, and 1.56× higher than that of the recently reported C6T, DP8T, Transpose-8T, and 10T SRAM, respectively. The half-select issue is resolved in the proposed SDP8T-based IMC architecture by using pass-gate transistors. Therefore, the proposed SDP8T SRAM can be operated over a wide dynamic supply voltage range from 0.4 V to 1 V and obtains higher array efficiency than DP8T SRAM. Reliable IMC operation is performed by the SDP8T-IMC architecture without any compute-disturbance. The proposed SDP8T-IMC architecture achieves a 2× improvement in throughput and a 60.48% reduction in average energy consumption per bit of IMC operation, with similar latency, compared to the 10T SRAM-based IMC architecture. The FOM of the proposed SDP8T-IMC architecture is improved by 4.88×, 1.84×, 3.90×, 1.96×, 5.70×, 2.06×, and 4.45× compared to the recently reported C6T, 6TCSRAM, 8+T, 8T, 10T, 12T, and 4+2T SRAM-based IMC architectures, respectively. The proposed SDP8T-IMC architecture shows excellent potential for developing low-power and reliable computing systems.

Figure 1: Circuit schemes. (a) The proposed SDP8T SRAM-based IMC architecture. (b) Schematic of the local bit-line sharing Dual-Port 8T (SDP8T) SRAM with Virtual VSS.

Figure 3: SDP8T SRAM array and the storage nodes of the row- and column-half-selected cells during a write '1' operation.

Figure 4: Waveforms of the selected cell and half-selected cells during the write operation.

Figure 5: Variation of RSNM with (a) VDD, (b) temperature.

A. Static Noise Margin
1) HSNM: HSNM is the noise margin of the SRAM during the standby condition. The HSNM of different SRAM cells is shown in Table I.

Figure 6: Comparison of RSNM under different process corners.

Figure 7: WM of the proposed SDP8T SRAM (a) without VVSS write assist, (b) with VVSS write assist.

Figure 9: Comparison of WM under different process corners.

$$FOM = \frac{RSNM_N \times WSNM_N \times RDNM_N \times HSNM_N}{Area_N \times Energy_N \times WDNM_N \times Delay_N \times Leakage_N} \qquad (1)$$
where $RSNM_N$, $WSNM_N$, $RDNM_N$, $HSNM_N$, $Area_N$, $Energy_N$, $WDNM_N$, $Delay_N$, and $Leakage_N$ are values normalized with respect to 6T SRAM.
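Equation 1 can be evaluated directly once each metric has been normalized to the 6T SRAM value. The function below is a minimal sketch of that calculation; the function name and keyword arguments are illustrative, and the sample inputs are placeholders rather than values from Table I.

```python
# Direct evaluation of the SRAM figure of merit in Equation 1.
# All inputs are expected to be normalized to the 6T SRAM value.

def figure_of_merit(*, rsnm: float, wsnm: float, rdnm: float, hsnm: float,
                    area: float, energy: float, wdnm: float,
                    delay: float, leakage: float) -> float:
    """FOM = (RSNM*WSNM*RDNM*HSNM) / (Area*Energy*WDNM*Delay*Leakage)."""
    numerator = rsnm * wsnm * rdnm * hsnm
    denominator = area * energy * wdnm * delay * leakage
    return numerator / denominator


# By construction, the 6T reference cell has every normalized metric equal to 1.
reference = dict(rsnm=1.0, wsnm=1.0, rdnm=1.0, hsnm=1.0, area=1.0,
                 energy=1.0, wdnm=1.0, delay=1.0, leakage=1.0)
print(figure_of_merit(**reference))

# Placeholder example: halving leakage alone doubles the FOM.
print(figure_of_merit(**{**reference, "leakage": 0.5}))
```

Because the reference cell evaluates to 1, a FOM above 1 indicates a cell that is better than 6T overall under this weighting.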

Figure 12(a) shows the simulation setup used to realize basic logic operations such as NAND/AND, OR/NOR, and XOR/XNOR with the proposed IMC architecture, with a BL and BLB capacitance of 100 fF in all simulations. Four sense amplifiers, SA1, SA2 and SA3, SA4, are used to realize the basic logic operations on two pairs of operands, (A, B) and (C, D), respectively. RBL and RBLB are connected to the positive and negative terminals of SA1 and SA2, respectively. LWBL and LWBLB are connected to the negative and positive terminals of SA3 and SA4, respectively. The optimal V-REF value of 0.9 VDD is applied to one terminal of each of SA1, SA2, SA3, and SA4. To perform the basic logic operations on A and B, both RBL and RBLB are pre-charged to VDD. After that, two RWLs, RWL0 and RWL1, are connected to ground. The voltage levels of RBL, RBLB, LWBL, and LWBLB for IMBC operations on different operands (A and B, C and D) are shown in Figure 12(b).

Figure 13: Timing waveforms of the IMBC operation for different operands: (a) A = 0, B = 0, (b) A = 0, B = 1, (c) A = 1, B = 0, (d) A = 1, B = 1.

Figure 14: Monte-Carlo simulation at different write word line access times (TWWL): (a) 16.6% data flip when TWWL = 300 ps, (b) no data flip when TWWL = 200 ps.

Figure 17: Failure probability due to the half-select issue.

Figure 18: Simulated frequency and energy consumption of IMC and read operations.

Figure 19: Array efficiency with no column interleaving and with different column-interleaving structures.

1 Divided by array size.

Table I: Performance comparison of various SRAM cells at 1 V supply voltage.
*Area of various cells normalized with respect to 6T SRAM.

During the hold state, the VVSS is connected to an off-state LVT WA transistor. Therefore, the HSNM of the proposed SRAM is 31.19% lower with respect to C6T SRAM.

Table II: Performance comparison of IMC architectures based on various SRAM cells.
Area of various cells normalized with respect to 6T SRAM.
2 Number of computations in a single cycle.
3 Number of cycles required to perform all computations.

Table III: Comparison with reported IMC architectures for logic/BCAM operations.