Advancement of Memory-based Hardware for Efficient Resource-Constraint Digital Signal Processing Systems

Rapid advances in very large scale integration (VLSI) technology and in the hardware performance of digital devices have paved the way for efficient memory-based computing systems as an alternative to conventional logic-only computing, in order to meet the stringent constraints and growing requirements of digital signal processing (DSP) systems in different application environments. Several algorithms and architectures have been suggested in the past to reduce the area and time complexities of commonly encountered computation-intensive cores of DSP functions by memory-based computing. Different scientific programs with high efficiency, faster operation and better performance gain have been developed. However, most scientific programs and applications remain compute-bound today, and there is a need to develop many more algorithms and architectures for the flexible design of area-delay-power-efficient systems for various DSP applications.


I. INTRODUCTION
Digital signal processing (DSP) is considered a major component of the digital revolution currently taking place around the world. The increasing popularity of digital technology in recent years has not only made DSP applications more prevalent in daily use, but has also subjected the algorithms to more stringent specifications in order to meet the basic constraints of the application environments. As a natural consequence, significant research interest has been observed in recent years in developing improved algorithms and architectures for DSP systems with lower power dissipation, higher speed and lower area complexity. Because these constraints conflict with one another, however, one usually has to trade one or more aspects to meet a more important requirement [1]. Architectural solutions can be obtained that trade area for time and power, or time for area and cost, but it is difficult to minimize cost, area, delay and power all together in a given architecture. Several efforts have been made to minimize the arithmetic complexities of the algorithms in order to reduce the overall area-delay-power complexities [2].
Algorithms for DSP operations are basically computation-intensive, and most of their applications are hard-real-time by nature [3]. In addition, DSP systems are very often used in small portable devices that depend mostly on limited battery power [4]. The rigid constraints on size and cost usually leave no scope for a cooling arrangement in these systems, while system reliability falls by half for every 10 to 20 degree Celsius rise in temperature [5]. General-purpose machines, however, very often do not meet the speed requirements of real-time applications or the size constraints of many portable systems. It is, therefore, important to design dedicated very large scale integration (VLSI) chips for fast and efficient computation in DSP applications. Efforts have been made to derive modular VLSI structures for fast DSP algorithms based on recursive decomposition [6], [21]. Although these algorithms require fewer arithmetic operations, they involve complicated routing and long design time due to their irregular signal-flow graphs. Moreover, the accuracy of fixed-point implementations of these algorithms degrades as a result of successive truncation during the recursive decomposition process. Similarly, the VLSI realization of time-recursive algorithms suffers from numerical problems, is difficult to pipeline, and involves increased hardware complexity [22], [23]. It is further observed that algorithms optimized for software implementation are, in general, not well suited for dedicated hardware implementation. Parallel algorithms and architectures are, therefore, imperative for the efficient realization of DSP functions in VLSI structures. Appropriate algorithm design plays a major role in developing a hardware entity that can meet the system requirements and specifications.
Not only should the algorithm design lead to a reduction in computational complexity, but it should also facilitate maximization of concurrency by exploiting the available parallelism to achieve high-throughput performance. Moreover, the architecture should be developed in synergy with the underlying algorithms to derive a cost-effective and area-time-power-efficient VLSI design.
As scaling in silicon technology has progressed over the last four decades, semiconductor memory has become cheaper, faster and more efficient in terms of power dissipation. Memory-based designs are consequently gaining substantial popularity in the DSP application space. Most DSP algorithms involve repetitive multiply-accumulate operations, and the multipliers not only consume most of the resources of the system but also account for most of the computation time. Significant research has, therefore, been carried out in the past two decades on efficient multiplierless implementation of DSP systems, which can be classified into three basic categories, namely adder-based implementation, CORDIC implementation and memory-based implementation [7]-[12]. The objective of this paper is to highlight some of the current trends in memory technology, along with some developments in algorithms and architectures for memory-based hardware design that handle the multiple conflicting constraints of DSP applications.
The rest of the paper is organized as follows. A brief overview of VLSI implementation of DSP algorithms is presented in Section II. Two important developments in the algorithmic aspects and architectural approaches of memory-based computing systems are described in Section III. Some of the current trends in the growth of memory technology resulting from different application environments are briefly discussed in Section IV, and the conclusions of the paper are presented in Section V.

II. COMPARISON STUDY OF VLSI IMPLEMENTATION OF DSP ALGORITHMS
In general, a significant improvement in a VLSI implementation can be obtained by appropriately restructuring the computational structure of a DSP algorithm. This is sometimes called algorithmic engineering [13]. To make efficient use of this restructuring approach, it is necessary to have a clear architectural target. However, restructuring the forward and inverse discrete cosine transform (DCT)/discrete sine transform (DST) algorithms so as to obtain an efficient structure that allows the use of memory-based implementation techniques is a challenging design problem. The DCT and DST [14]-[16] are orthogonal transforms, represented by basis functions, that are used in many signal processing applications, especially in speech and image transform coding [17]. These transforms are good approximations to the statistically optimal Karhunen-Loeve transform (KLT) [16], [17]. The choice between the DCT and the DST depends on the statistical properties of the input signal, which in the case of image processing is subject to relatively fast changes. The DCT provides better results for a wide class of signals. However, there are other statistical processes, such as first-order Markov sequences with specific boundary conditions, for which the DST is a better solution. Also, for weakly correlated input signals, the DST provides a lower bit rate [17]. There are some applications, as in [18] and [19], where both the DCT and DST are involved. Thus, a VLSI structure that supports both the DCT and DST is desirable. Since both the DCT and DST are computationally intensive, many efficient algorithms have been proposed to improve the performance of their implementation, but most of these are only good software solutions.
For hardware implementation, appropriate restructuring of the classical algorithms or the derivation of new ones that can efficiently exploit the embedded parallelism is highly desirable. In order to obtain an optimal hardware implementation, it is necessary to treat the development of the algorithm, its architecture, and implementation in a synergetic manner.
Fast DCT and DST algorithms, based on a recursive decomposition, result in butterfly structures with a reduced number of multiplications, but lead to irregular architectures with complicated data routing and large design time, due to the structure of their signal flow graphs, even though efforts have been made to improve their regularity and modularity as in [20] and [21]. Also, the successive truncations involved in a recursive decomposition structure lead to a degradation in accuracy for a fixed point implementation. The VLSI structures based on time-recursive algorithms [21]- [24] are not suitable for pipelining due to their recursive nature, and suffer from numerical problems, which can severely compromise their low hardware complexity.
The data movement and transfer play a key role in determining the efficiency of a VLSI implementation of the hardware algorithms [26]- [29]. This is the reason why regular computational structures such as the circular correlation and cyclic convolution lead to efficient VLSI implementations [26]- [28] using modular and regular architectural paradigms such as the distributed arithmetic [11] and systolic arrays [30]. These structures also avoid complex data routing and management, thus leading to VLSI implementations with reduced complexity, especially when the transform length is sufficiently large.
Systolic arrays [30] represent an appropriate architectural paradigm that leads to efficient VLSI implementations due to their regularity and modularity, with simple and local interconnections between the processing elements (PEs); at the same time, they yield high performance by exploiting concurrency through pipelining or parallel processing. However, a large portion of the chip is consumed by the multipliers, which severely limits the number of PEs that can be included.
The memory-based techniques [27], [28], [11], [31] are known to provide improved efficiency in the VLSI implementation of DSP algorithms through increased regularity, low hardware complexity and higher processing speed by efficiently replacing multipliers with small ROMs as in the distributed arithmetic (DA) or in the look-up table approach. The DA is popular in various digital signal processing (DSP) applications dominated by inner-product computations, where one of the operands can be fixed. It uses ROM tables to store the pre-computed partial sums of the inner product. Such a scheme has been adopted to implement several commercial products due to its efficiency in VLSI implementation [32], [33]. However, the main problem is that the ROM size increases exponentially with the transform size, thus rendering the technique impractical for large transform sizes. Moreover, due to the feedback connection in the accumulator stage, the structure obtained is difficult to pipeline.
In [27], a new memory-based implementation technique that combines some of the characteristics of the DA and systolic array approaches has been proposed. When one of the operands is fixed, one can efficiently replace the multipliers by small ROMs that contain the pre-computed results of the multiplication operations. If the size of the ROM is small, a significant increase in the processing speed can be obtained since the ROM access time is considerably smaller than the time required for a multiplication. The resulting VLSI structures are easy to pipeline allowing an efficient combination of the memory-based implementation techniques with the systolic array concept. Using the partial sums technique [32], it has been shown that the size of the ROM necessary to replace a multiplier can be further reduced to half at the cost of an extra adder [27].
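The ROM-for-multiplier substitution described above can be sketched behaviourally as follows. This is a minimal illustration under stated assumptions, not the structure of [27]: the inputs are taken to be unsigned integers of even bit width, and the function names `build_rom`, `rom_multiply` and `split_rom_multiply` are hypothetical. The second variant splits the input word into two halves so that a much smaller table plus one shift and one adder replaces the full table; it is shown only as one common way of trading memory for an adder and is not claimed to be the partial sums technique of [32].

```python
def build_rom(coeff, width):
    """Precompute coeff * x for every unsigned `width`-bit input x."""
    return [coeff * x for x in range(1 << width)]


def rom_multiply(rom, x):
    """Multiply by the fixed coefficient with a single table read."""
    return rom[x]


def split_rom_multiply(half_rom, x, width):
    """Multiply using a table indexed by half of the input bits.

    coeff * x = (coeff * x_hi) * 2**(width/2) + coeff * x_lo, so two reads
    of a 2**(width/2)-entry table, one shift and one adder are enough.
    Assumes `width` is even and `half_rom` was built with width // 2.
    """
    half = width // 2
    x_lo = x & ((1 << half) - 1)   # lower half of the input word
    x_hi = x >> half               # upper half of the input word
    return (half_rom[x_hi] << half) + half_rom[x_lo]
```

For an 8-bit input, the full table built by `build_rom(coeff, 8)` holds 256 words, whereas the split form reads a 16-word table built by `build_rom(coeff, 4)` twice.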
Most of the reported unified systolic array-based VLSI designs [24], [25], [35], [36] obtain the flexibility of computing DCT/DST and/or inverse DCT/inverse DST (IDCT/IDST) by feeding the different transform coefficients into the hardware structure. They cannot use efficiently the memory-based implementation techniques since they are not able to use the constant property of the coefficients, namely that for both the DCT and DST, the coefficients are the same and are fixed for each processor. Moreover, they use an additional control module to manage the feeding of the transform coefficients into the VLSI structure, and have a high I/O cost.
The unified DA-based implementations of the DCT/DST and IDCT/IDST algorithms based on a general formulation, presented in [37] and [38], also do not exploit the constant property of the transform coefficients in each processor, nor do they benefit from the advantages of cyclic convolution structures. Thus, the unification is achieved with a lower computational throughput and a higher hardware complexity. In addition, they carry the overheads of bit-serial implementations, with parallel-to-serial and serial-to-parallel conversions, and have lower processing speeds than bit-parallel ones since they need more than one clock cycle per operation. Moreover, they are difficult to pipeline and are appropriate only for small transform lengths. Using a dual-port ROM-based DA-like realization technique [39], an efficient design strategy for a unified VLSI implementation of the DCT/DST/IDCT/IDST is obtained by an appropriate reformulation of the DCT, DST, IDCT and IDST algorithms, whose transform length is a prime number, so that they retain all the advantages of cyclic convolution-based implementations. The result is an efficient unified VLSI structure in which a large percentage of the chip area is shared by all the transforms, and which offers high computing speed with low hardware complexity, low I/O cost, and a high degree of regularity, modularity and local connectivity. Much more pioneering work is expected in the near future to further address the resource-constraint requirements of memory-based implementations of DSP systems.

III. ALGORITHMS AND ARCHITECTURES FOR MEMORY BASED COMPUTING SYSTEMS
Most DSP algorithms involve repetitive multiply-accumulate operations and inner-product computation. Moreover, the multiplying coefficients (e.g., filter coefficients or transform kernel coefficients) very often remain constant during the DSP operations. This behavior of DSP algorithms is exploited to realize memory-based computing systems. Two basic variants of memory-based computing techniques are in popular use. One is the direct memory-based implementation of multiplications [12], while the other is based on distributed arithmetic (DA) [11].
2) Distributed Arithmetic: The DA principle is used primarily to compute inner products by repeated shift-add operations on partial products corresponding to the successive bit-vectors of one of the input vectors. When one of the vectors is invariant, it is possible to store all the partial product values in a memory. In filtering operations and discrete transform evaluation, one of the vectors is derived from the input samples while the other vector is usually fixed (e.g., the impulse response of a filter or the coefficients of the transform kernel). The inner products can thus be computed using a look-up table (LUT) and a shift-accumulator, by a straightforward implementation of the DA principle.
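The LUT-plus-shift-accumulator computation described above can be illustrated with a short behavioural model. This is a sketch under simplifying assumptions rather than a hardware description: the input samples are unsigned integers of a known bit width, the bits are processed least significant first, and the names `build_da_lut` and `da_inner_product` are hypothetical.

```python
def build_da_lut(coeffs):
    """Store every partial sum of the fixed coefficients.

    Entry `addr` holds the sum of coeffs[k] over all k whose bit is set
    in `addr`, so the table has 2**len(coeffs) words.
    """
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_inner_product(lut, xs, width):
    """Compute sum(coeffs[k] * xs[k]) with table reads and shift-adds only."""
    acc = 0
    for b in range(width):
        # Form the bit-vector made of bit b of every input sample.
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> b) & 1) << k
        # One LUT read per bit position, weighted by 2**b and accumulated.
        acc += lut[addr] << b
    return acc
```

The loop runs once per input bit, so the number of iterations depends on the word length and not on the number of coefficients, while the table grows as 2 to the power of the vector length.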

B. Comparison
Each of these memory-based techniques, however, has some advantages and disadvantages relative to the other.
• Direct memory-based implementation involves less hardware complexity than the DA-based method when the word length is less than the transform length, while the latter involves less hardware complexity otherwise.
• In the DA-based method, the time complexity is independent of the transform size or the number of filter taps and depends only on the word length. In direct memory-based implementations, the time complexity is independent of the word length but increases linearly with the transform size.
• To minimize the I/O bandwidth and to achieve hardware efficiency in direct memory-based implementation, sinusoidal transforms are usually converted into cyclic convolution form [27]. Since the average computation time and the latency of direct memory-based implementation are high for large transform lengths, novel algorithms have been proposed in the last few years to decompose the sinusoidal transforms into multiple circular convolutions or convolution-like structures of smaller lengths [29], [34] and [39]-[43]. Such decompositions have improved throughput performance with a substantial reduction of hardware and computational latency. New decomposition schemes have similarly been suggested to reduce the computational latency and overall area-delay complexity of direct memory-based implementation of large-order finite impulse response (FIR) filters [44].
• The major disadvantage of the DA approach is that its memory size increases exponentially with the transform length or the filter order. Memory partitioning and multiple memory bank approaches, along with flexible multi-bit data-access mechanisms, have been suggested for FIR filtering and inner-product computation in order to reduce the memory size of DA-based implementation [11], [12] and [45]-[48]. Attempts have also been made to reduce the memory space in DA-based architectures using offset binary coding [49] and a group distributed technique [50]. A systolic realization of linear and circular convolution based on coefficient partitioning has been suggested in a recent paper for area-delay-efficient DA-based systolic architectures [51]. An LUT-less adder-based DA approach has been suggested in which memory space is reduced at the cost of additional adders [52]. Some efforts have also been made toward DA-based implementation of recursive filters and reduction of dynamic power by minimizing bit transitions during the addition of partial results [53], [54]. Based on the DA decomposition technique, several systolic and systolic-like architectures for discrete sinusoidal transforms have also been suggested in the last ten years to improve area-delay performance over existing structures [28], [37], [38], [55] and [56]. A few DA-based architectures have been suggested for video and multimedia applications and adaptive FIR filtering [57]-[59], and many more DA-based accelerators are expected in the coming years.
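The effect of memory partitioning on the DA table size can be made concrete with a small behavioural sketch. It is an illustration under assumptions, not the scheme of any particular cited design: inputs are unsigned integers, the coefficient vector is simply cut into consecutive groups of `group` entries with one small LUT per group, and the name `partitioned_da` is hypothetical.

```python
def partitioned_da(coeffs, xs, width, group):
    """DA inner product with one small LUT per group of coefficients.

    Returns the inner product and the total number of LUT words used.
    """
    # Build one LUT per group; each holds 2**(group size) partial sums.
    banks = []
    for start in range(0, len(coeffs), group):
        part = coeffs[start:start + group]
        lut = [
            sum(c for k, c in enumerate(part) if (addr >> k) & 1)
            for addr in range(1 << len(part))
        ]
        banks.append((start, len(part), lut))

    acc = 0
    for b in range(width):
        bank_sum = 0
        for start, size, lut in banks:
            # Address each bank with bit b of its own input samples.
            addr = 0
            for k in range(size):
                addr |= ((xs[start + k] >> b) & 1) << k
            bank_sum += lut[addr]          # extra adders combine the banks
        acc += bank_sum << b
    total_words = sum(len(lut) for _, _, lut in banks)
    return acc, total_words
```

For eight coefficients, a single table needs 2**8 = 256 words, whereas two banks of four coefficients need 2 * 2**4 = 32 words, at the cost of an adder to combine the bank outputs in every cycle.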

IV. DIFFERENT APPLICATION ENVIRONMENT LEADING TO CURRENT TRENDS IN MEMORY TECHNOLOGY
DSP plays a vital role in digital modulation and demodulation, speech and image data compression, speech recognition, synthesis and equalization, and spectral estimation and analysis, along with a wide range of adaptive filtering applications [60], [61]. DSP functionalities are, therefore, appearing increasingly in electronic systems for wired and wireless communication, interactive multimedia systems, biomedical instrumentation, military surveillance and target tracking operations, satellite and aerospace control, remote sensing, and in a host of digital consumer products.
Memory technology has advanced in wide and diverse directions according to the requirements of different application environments. Radiation-hardened memories for space applications, wide-temperature memories for automotive use, high-reliability memories for biomedical instrumentation, low-power memories for consumer products, and high-speed memories for multimedia applications are under continued development to address these special needs [62], [63]. Although memory has traditionally been an integral part of general-purpose computers as a subsystem that stores programs and data, it has undergone considerable transformation in terms of its hierarchical organization and access mechanisms. Interestingly, the concept of memory as a standalone subsystem is also being replaced by embedded memories that are integrated within the processor chip to obtain much higher bandwidth between a processing unit and a memory macro with much lower power consumption [64]. To enhance the overall performance of computing systems and to minimize the bandwidth requirement, access delay and power dissipation, either the processor has been moved to the memory or the memory has been moved to the processor, so that the computing logic and memory elements are placed in the closest proximity to each other [65].
According to the International Technology Roadmap for Semiconductors (ITRS) [5], system complexity increases dramatically with the amount of software in embedded systems and the rapid adoption of multi-core SOC architectures. Not only does software dominate the overall design effort, as shown in Fig. 1, but hardware-dependent software, which is tightly coupled to the hardware and the required functionality, must eventually be handled by an SOC integration and verification process that is still hardware-centric today. Methodological aspects are rapidly becoming much harder than tool aspects: enormous system complexity can be realized on a single die, but exploiting this potential reliably and cost-effectively will require roughly a 50-fold increase in design productivity over what is possible today.
Here we briefly discuss some of the current trends in memory technology that appear to strongly favor the efficient realization of dedicated and reconfigurable memory-based computing systems for DSP applications. Some of the ITRS projections of the periods during which research, development, qualification/pre-production and continuous improvement should take place for potential solutions involving significant technological innovation are shown in Figs. 2, 3 and 4 for logic, DRAM and non-volatile memory technology, respectively. The industry faces a major overall challenge due to the sheer number of major logic technology innovations required over the next five years: enhanced mobility and high-field transport, high-κ/metal gate stacks (already implemented, but requiring continuous improvement with scaling), ultra-thin-body fully depleted SOI, and multi-gate MOSFETs with quasi-ballistic transport. Future innovations in logic technology, as depicted in Fig. 2, are enhanced transport with alternate channels (III-V and/or germanium), enhanced transport with alternate channels (CNT, nanowire, graphene), and non-CMOS logic devices and circuits/architectures. As the DRAM storage capacitor gets physically smaller with scaling, dielectric materials with a high relative dielectric constant (κ) will be needed. Metal-insulator-metal (MIM) capacitors using a high-κ stack (ZrO2/Al2O3/ZrO2) have therefore been adopted as the capacitor of DRAM at half-pitches in the 40 nm to 30 nm range; this material evolution will continue, and ultra-high-κ materials (perovskites, κ > 50 to 100) will be introduced. The physical thickness of the high-κ insulator must also be scaled down to fit the minimum feature size, and as a result the 3-D capacitor structure will change from a cylinder to a pillar shape. On the other hand, with the scaling of peripheral CMOS devices, a low-temperature process flow is required for the process steps that follow the formation of these devices.
This is a challenge for DRAM cell processes, which are typically constructed after the CMOS devices are formed and are therefore limited to low-temperature processing. The other major topic is migration to the 4F² cell. As half-pitch scaling becomes very difficult, it becomes impossible to sustain the cost trend through half-pitch scaling alone. The most promising way to maintain the cost trend and keep increasing the total bit output per generation is to scale the cell size factor a, where a = [DRAM cell size]/[DRAM half pitch]². Currently, 6F² (a = 6) cells are in the majority. Migrating from the 6F² to the 4F² cell is very challenging: for example, a vertical cell transistor is needed, but several challenges remain. All in all, sufficient storage capacitance and adequate cell transistor performance must be maintained to preserve the retention time characteristic in the future. These requirements are becoming harder to meet as DRAM scaling continues and larger product sizes (i.e., > 16 GB) are targeted. The potential DRAM solutions are listed in Fig. 3, but many future technologies will be necessary for half-pitches of 30 nm or less, and these technologies are still unknown [5].
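As a quick worked check of what the cell size factor implies, the following assumes that the half pitch F is held fixed and only the factor a changes from 6 to 4; the symbol A_cell for the cell area is introduced here for illustration only.

```latex
A_{\text{cell}} = a\,F^{2}
\quad\Longrightarrow\quad
\frac{A_{\text{cell}}(a=4)}{A_{\text{cell}}(a=6)} = \frac{4F^{2}}{6F^{2}} = \frac{2}{3},
\qquad
\frac{\text{bits per unit area}\,(a=4)}{\text{bits per unit area}\,(a=6)} = \frac{6}{4} = 1.5
```

Under that assumption, the cell area shrinks by one third and the bit density rises by a factor of 1.5 without any reduction in the half pitch.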
Non-volatile memories are used in a wide range of applications, some standalone and some embedded, with requirements that vary with the application. The memory array architecture and signal sensing method also differ between applications. The technical challenges are difficult, and in some cases fundamental physical limits may be reached before the end of the current roadmap. For charge storage devices, the number of electrons in the storage node, whether for single level logic cells (SLC) or multi-level logic cells (MLC), needs to be high enough to maintain a stable threshold voltage against statistical fluctuation, and cross talk between neighboring bits must be reduced even as the spacing between neighbors decreases. Meanwhile, data retention and cycling endurance requirements must be maintained, and in some cases even increased for new applications. Non-charge-storage devices may also face fundamental limitations when the storage volume becomes so small that random thermal noise starts to interfere with the signal. A host of non-volatile random access memories, such as NAND Flash, NOR Flash, phase change memory (PCRAM), ferroelectric RAM (FeRAM) and magnetic RAM (MRAM), are emerging at present. These may provide faster and easier access mechanisms, consume less power, and can be embedded directly into the structure of a microprocessor or integrated with the functional elements of a dedicated processor.
According to the ITRS projections, embedded memories will continue to have a dominating presence in system-on-chip (SoC) content, and may exceed 90% of total SoC content in the next few years. It has also been found that the transistor packing density of memory devices is not only high but is also increasing much faster than the transistor density of logic devices. In addition, memory-based implementations are more regular than multiply-accumulate structures. Memory-based computing systems have several other advantageous features from an architectural point of view, as listed below:
• Memory-based computing systems have potential for high-throughput and reduced-latency hardware implementation, since the memory access time is usually much shorter than the multiplication time.
• Memory-based designs are expected to involve much less dynamic power consumption, owing to the minimal switching activity involved in obtaining the output product/inner-product values by memory read operations.
• Memory-based designs also offer considerable scope for flexible implementation, so that the throughput can be scaled to match the temporal requirements of the application.
From the above observations, it is apparent that memory-based computation has greater potential to yield compact and cheaper computing structures. In application-specific SoCs, a memory-based computing system would, therefore, be a promising alternative to conventional logic-only implementation, where an appropriate combination of logic-based arithmetic circuits and memory-based computing elements may be integrated for dedicated implementation of DSP functionalities.

V. CONCLUSION
The current trends in the advancement of VLSI technology indicate reasonable scope for area-delay-power-efficient memory-based computing systems that may have the potential to meet the growing requirements of DSP systems in various application environments. Several algorithms and architectures have been suggested in the literature to reduce the area and time complexities of commonly encountered computation-intensive cores of DSP functions by memory-based computing, but many more novel algorithms and architectures need to be developed to design flexible area-delay-power-efficient systems for DSP applications in various domains. Memory elements and logic elements can be integrated to form more compact functional elements, and novel memory access schemes may also be explored to maximize power efficiency and speed.