Approximate Circuits in Low-Power Image and Video Processing: The Approximate Median Filter



Introduction
An efficient implementation of computer vision algorithms is crucial for many smart embedded systems such as traffic control systems, driver assistance systems, production line inspection systems, and robotics. However, providing high-quality outputs in these applications is usually associated with a high computation cost and non-trivial energy requirements. In order to meet real-time constraints and cope with a limited power budget, image and video processing algorithms are often accelerated in application-specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs). If a further reduction of energy consumption is required, for example because of the very limited energy available in remote sensors or in mobile and wearable devices, circuit approximation is one of the most promising approaches to deliver a suitable solution.
Approximate computing [1] exploits the fact that many applications (image and video processing in particular) are highly error resilient. If occasional errors are acceptable to the users, which is often the case because the users, as consumers of the outputs of these applications, are unable to recognize small imperfections in images or video sequences, implementations of these applications can be simplified. The goal is to create an implementation which shows the best trade-off between error, performance and power consumption. Approximate computing has been progressively developed over the last 5 years and has influenced how energy efficient computer systems (ranging from tiny battery powered devices via common desktop computers to supercomputers) are now constructed and operated.
In this paper, we focus on approximate circuits that are used in image and video processing applications. On the basis of a literature survey, we identified the components whose implementations are most frequently approximated and the methods used for obtaining these approximations. One of the components is the median-outputting circuit (median for short) which is typically employed to filter out undesired artefacts (such as shots) in digital images.
As the circuit approximation problem can be formulated as a multi-objective optimization problem (with error, performance and power consumption as objectives), various ad hoc and heuristic methods have been introduced to solve it. In our previous work, we have developed circuit approximation methods [2], [3] based on Cartesian genetic programming (CGP), which is a branch of evolutionary algorithms capable of designing and optimizing digital circuits.
Unfortunately, the quality of approximation methods has been compared in the literature only rarely (see a survey of associated methodological problems in [2]); in most cases only parameters of approximate adders and multipliers were compared [4]. In this paper, we compare two approximation strategies based on CGP applied to approximate various common implementations of the median filter. The first strategy starts with an exact median filter implementation and tries to remove some circuit components (comparators) and re-connect the remaining ones in such a way that the error of filtering is minimized. The second strategy employs CGP to evolve the image filter from scratch, solely on the basis of the training data supplied during the evolution. The goal is to demonstrate how different approximation strategies can influence the trade-offs that are obtained between the quality of filtering and power consumption for target circuits.
The rest of the paper is organized as follows. The research area of approximate computing is introduced in Sec. 2. Section 3 surveys circuits that were approximated in order to reduce power consumption in image and video processing applications, and analyzes various aspects of the approximation strategies used to obtain the desired approximations. Section 4 is devoted to our case study: approximate circuits for image filtering. We present conventional implementations of image filters, CGP as the method used to perform the desired approximations, and two different approximation strategies based on CGP. Results are summarized in Sec. 5. Conclusions are given in Sec. 6.

Approximate Computing
The concept of approximation has been well established in computer science and engineering for decades. For example, a paper with the title "Approximate signal processing" was published in 1997 [5]. However, new problems emerged in the last decade that stimulated new research in applying approximation techniques, but in a slightly different context than before.
In particular, problems with the high power density of integrated circuits led to the end of Dennard scaling, i.e. simultaneously doubling the number of transistors on a chip, increasing the operating frequency and reducing Vdd no longer work together. The coming era of "dark silicon", when many transistors are available on a chip but cannot be used at the same time at a high operating frequency because of thermal issues, has forced us to rethink the basic design principles of computer-based systems [6]. As conventional power reduction techniques such as dynamic voltage-frequency scaling and power gating do not scale sufficiently and alternative post-CMOS technologies are not widely adopted, the only solution seems to be to relax the requirement on precise computing across the computer stack.
In approximate computing, the requirement of exact equivalence between the specification and all implementation levels is relaxed in order to reduce power consumption or improve other system parameters such as performance [1], [7].
The approximation can be conducted at the level of software as well as hardware. Mittal [1] discusses a wide spectrum of approximation techniques which include precision scaling, loop perforation, load value approximation, memoization, task dropping/skipping, memory access skipping, using different SW/HW versions, reducing the refresh rate in memory, inexact read/write and relaxed synchronization.
In the case of digital circuit approximation, voltage over-scaling and functional approximation are the most popular techniques. In the case of voltage over-scaling, the circuit is supplied with a lower Vdd than nominal, which reduces power consumption, but introduces errors for those inputs whose processing exercises the critical path of the design. In the case of functional approximation, a slightly different function is implemented with respect to the original one, provided that the error is acceptable and key system parameters are improved. The errors induced by approximation are measured using various error metrics such as the average error, error probability, and worst case error.
The approximate solution is usually obtained by a heuristic procedure that modifies the original implementation. In the case of software approximation, programmers can typically declare which parts of the program can be computed approximately, and a specialized compiler and optimizer then perform the requested approximations (see, e.g., EnerJ [8]). In the case of hardware approximation, either general-purpose or circuit-specific approximation methods have been applied. While the aim of general-purpose approximation methods (e.g. SALSA [9], AXILOG [10], ASLAN [11], ABACUS [12], CGP [2], [3]) is to automatically approximate any circuit regardless of its structure, the circuit-specific methods are focused on a rather specific class of circuits (such as adders or multipliers [4]).
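The three error metrics named above can be illustrated with a short Python sketch. The approximate adder used here (`approx_add`, which simply clears the `k` least significant bits of both operands) is a hypothetical toy example chosen for illustration; it is not one of the circuits discussed in this paper.

```python
from itertools import product


def approx_add(a: int, b: int, k: int = 2) -> int:
    """Hypothetical approximate adder: ignores the k least significant bits."""
    mask = ~((1 << k) - 1)
    return (a & mask) + (b & mask)


def error_metrics(exact, approx, inputs):
    """Average error, error probability and worst case error over `inputs`."""
    errors = [abs(exact(*x) - approx(*x)) for x in inputs]
    n = len(errors)
    return {
        "mean_error": sum(errors) / n,
        "error_probability": sum(e != 0 for e in errors) / n,
        "worst_case_error": max(errors),
    }


# Exhaustive evaluation over all pairs of 4-bit operands.
inputs = list(product(range(16), repeat=2))
metrics = error_metrics(lambda a, b: a + b, approx_add, inputs)
print(metrics)
```

For wider operands an exhaustive enumeration quickly becomes infeasible, so the metrics would typically be estimated on sampled or application-specific inputs instead.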

Approximate Circuits for Image and Video Processing
Based on the analysis of 12 image and data processing applications, Chippa et al. showed that about 83% of the runtime is spent in error resilient computation kernels that are suitable for approximation [7]. The most dominant kernels were the dot product computation and distance computation. The fact that image and video processing circuits are good targets for circuit approximation is documented by dozens of papers dealing with this topic in the literature.
It should be noted that elementary arithmetic circuits (such as adders and multipliers) are often approximated independently of a potential application. The objective is to create a general-purpose library of approximate implementations showing different trade-offs between power consumption and error. Jiang et al. [4] provided a detailed survey of approaches developed in this direction. In this paper, we will deal with approximate adders or multipliers only if they have been applied in an approximate implementation of an image or video processing system.
Approximate circuits are also crucial in energy efficient implementations of image and video processing systems (image classifiers, object detectors) based on (deep) neural networks (DNN). As this is a rather specific area [13], [14], we will not consider it in our survey table, but provide a brief introduction in this paragraph. In DNNs, approximations were introduced at the levels of data type quantization, microarchitecture (e.g. neurons contributing insignificantly to the quality of outputs can be removed), the training algorithm (an iterative process which can be stopped when good enough results are obtained), the multiply-accumulate-transform circuits (where the design of approximate multipliers and adders for DNN applications represents an independent topic [15], [16]), and memory cells and architecture (where, e.g., less significant bits can be stored in energy efficient, but less reliable memory cells [17]). An ultralow power deep learning ASIC for IoT was implemented on a single chip, achieving 374 GOPS/W and consuming less than 300 µW. However, the performance of this solution is limited as it operates at 3.9 MHz only [18]. While specific low-power electronic circuits can be developed in ASICs (see, e.g., specialized on-chip memory cells and architecture in [18]) to minimize the power consumption of DNNs, the optimization of an FPGA solution has to be focused on the microarchitecture and memory subsystem organization, which are composed of (fixed and pre-defined) FPGA primitives.
In our survey, we will primarily focus on functional approximation, which is less technology dependent and provides more predictable errors than voltage over-scaling. The survey is based on representative papers published in 2011-2017 at key relevant conferences and in journals.
The result of our survey is presented in Table 1. The surveyed papers are organized according to the Application Type, where the following major application types were identified: Filters, Metrics, Transforms, image compression (JPEG), and video (de)coders (according to the MPEG and HEVC standards). For these Application Types, we investigated:
• what is approximated, i.e. whether the approximation is performed at the level of components (such as adders, multipliers, comparators) or modules (such as filters, DCT and FFT created using these components),
• how the approximation is conducted, i.e. whether an ad hoc or general-purpose method is taken,
• what is the level of abstraction at which the approximation is conducted, i.e. whether circuits are approximated at the transistor, gate, register-transfer (RT) or behavioral source code level, and
• the target platform, i.e. an ASIC or FPGA.
It can be seen that less complex applications such as image filters can be holistically approximated as a single system. In the case of more complex applications, the design is first decomposed and selected components then undergo the approximation process. Some of them can even be removed to further reduce power consumption. The approximation is predominantly conducted at the gate level, but there are tools (such as AXILOG, ABACUS and GRATER) in which requirements on the approximation are specified directly at the source code (RT or behavioral) level. The actual approximations are then performed internally by the tool during the synthesis and netlist optimization.
It remains unclear which approximation approach performs best in the area of image and video processing. Unfortunately, approximate solutions have mostly been compared only with exact solutions, and almost never with other competitive approximate solutions.

Case Study
The purpose of this case study is to compare the impact of two fundamentally different approximation strategies on the quality and power consumption of a selected module of an image processing system. We decided to approximate the circuits implementing the shot noise image filter. The approximations will be conducted by CGP, which proved to be highly competitive with respect to other circuit approximation methods [3], [19].

Median Filters
Conventional implementations of shot noise elimination filters are usually based on calculating the median over the pixels belonging to the filtering window.
The median filter (MF) is a special case of order statistic filters which may be implemented in several different ways [20]. In this paper, we will consider a pipelined implementation based on a median network which is suitable for high-performance applications. The median network consists of a sequence of compare-and-swap operations (Fig. 1). Each compare-and-swap (CS) operation acts as a small 2-input sorting network which produces a sorted sequence by outputting the minimum and maximum of the input values.
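As an illustration of how a median network operates, the following Python sketch evaluates a 9-input median using 19 compare-and-swap operations. The particular ordering of operations is one commonly used arrangement and is an assumption here; the network drawn in Fig. 1 may be organized differently, and the pipelining registers are not modeled.

```python
# Each pair (i, j) is one compare-and-swap: afterwards v[i] <= v[j].
# Assumed arrangement (19 operations); not necessarily that of Fig. 1.
MEDIAN9_NETWORK = [
    (1, 2), (4, 5), (7, 8), (0, 1), (3, 4), (6, 7),
    (1, 2), (4, 5), (7, 8), (0, 3), (5, 8), (4, 7),
    (3, 6), (1, 4), (2, 5), (4, 7), (4, 2), (6, 4), (4, 2),
]


def compare_and_swap(v, i, j):
    """Place the minimum of v[i], v[j] at position i and the maximum at j."""
    if v[i] > v[j]:
        v[i], v[j] = v[j], v[i]


def median9(window):
    """Median of a 9-pixel filtering window computed by the network."""
    v = list(window)
    for i, j in MEDIAN9_NETWORK:
        compare_and_swap(v, i, j)
    return v[4]


print(median9([7, 1, 9, 3, 5, 8, 2, 6, 4]))  # prints 5
```

Because the sequence of operations is fixed and independent of the data, each compare-and-swap maps directly to one hardware block, which is what makes the structure easy to pipeline.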
The weighted median filter is an extension of the common median filter, which gives more weight to some values within the filtering window.The center weighted median filter (CWMF) represents a special case in which only the central value of the window is counted with additional weight [21].Compared to the median filter, this modification can preserve more details along the horizontal and vertical directions while suppressing additive white and/or impulsive-type noise.
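A behavioral sketch of the center weighted median on a 3 × 3 window is given below. It assumes the window is supplied in row-major order and that the weight is odd, so that the number of samples stays odd; the function name `cwmf9` is only illustrative.

```python
from statistics import median


def cwmf9(window, weight=3):
    """Center weighted median of a 3x3 window given in row-major order."""
    if len(window) != 9 or weight % 2 != 1:
        raise ValueError("expected 9 pixels and an odd center weight")
    center = window[4]
    # Replicating the central pixel `weight` times yields 8 + weight samples.
    samples = list(window[:4]) + [center] * weight + list(window[5:])
    return median(samples)


# A thin vertical line: the plain median removes it, the CWMF keeps it.
line = [0, 9, 0,
        0, 9, 0,
        0, 9, 9]
print(median(line), cwmf9(line))  # prints 0 9
```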
The median filters uniformly replace the value of every pixel of the filtered image by the median of its neighbors. Consequently, in addition to the removal of noisy pixels, these filters also remove desirable details and thus smudge the resulting image. In order to address this problem, more advanced concepts were introduced. The adaptive median filter (AMF) represents a multi-level approach which tries to detect and subsequently replace corrupted pixels only [22]. At each level, filtering windows of different sizes are utilized. Usually, two levels working with the 3 × 3 and 5 × 5 filtering window, respectively, are sufficient to obtain a very good image quality (Fig. 2). The hardware implementation consists of two median filters, circuitry that determines the minimal and maximal values for each filter window, delay buffers to compensate for the different latencies of the median filters, and simple logic.
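The two-level decision can be sketched behaviorally as follows. The selection rule shown is one commonly described variant of the adaptive median filter and is an assumption here; the exact logic of the Selector in Fig. 2, as well as the fallback used when both levels fail, may differ.

```python
def adaptive_median(win3, win5, pixel):
    """Two-level adaptive median (assumed variant).

    win3, win5: pixels of the 3x3 and 5x5 windows around `pixel`.
    """
    med = pixel
    for win in (win3, win5):
        s = sorted(win)
        lo, med, hi = s[0], s[len(s) // 2], s[-1]
        if lo < med < hi:
            # The median is not an extreme value, so it is trusted at this
            # level; the pixel is replaced only if it looks like an impulse.
            return pixel if lo < pixel < hi else med
    # Assumption: fall back to the median of the largest window.
    return med


noisy3 = [10, 12, 11, 13, 255, 12, 11, 10, 12]
print(adaptive_median(noisy3, noisy3 + [12] * 16, 255))  # prints 12
```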

Evolutionary Approximation
CGP [23] is a form of genetic programming in which each candidate solution is modeled using a two-dimensional array of $n_c \times n_r$ programmable $n_a$-input/$n_b$-output nodes whose functions are taken from a set $G$. The circuit utilizes $n_i$ primary inputs and $n_o$ primary outputs. A unique address is assigned to all primary inputs and to the outputs of all nodes to define an addressing system enabling circuit topologies to be specified (Fig. 3). As no feedback connections are allowed in the basic version of CGP, only combinational circuits can be created. Each candidate circuit is represented using the so-called chromosome which contains $n_c \times n_r \times (n_a + n_b) + n_o$ integers. The $(n_a + n_b)$ integers specify one programmable node: $n_a$ integers specify destination addresses for its inputs and $n_b$ integers determine the function codes from $G$. All possible legal chromosomes constitute the search space.
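To make the addressing scheme more concrete, the sketch below decodes and evaluates a chromosome in a deliberately simplified setting: nodes have two inputs and a single output, each node is encoded by two connection genes and one function gene, and the nodes are stored in a feed-forward order. This simplification and the small function set `FUNCTIONS` are assumptions for illustration; the encoding used in the paper employs $n_b$-output nodes.

```python
# Hypothetical function set G for 8-bit pixel values.
FUNCTIONS = [
    lambda a, b: min(a, b),
    lambda a, b: max(a, b),
    lambda a, b: a,             # identity (wire)
    lambda a, b: (a + b) >> 1,  # average
]


def evaluate(chromosome, inputs, n_nodes, n_outputs):
    """Evaluate a simplified CGP chromosome on one input vector."""
    values = list(inputs)  # addresses 0 .. n_i - 1 are the primary inputs
    for k in range(n_nodes):
        in1, in2, func = chromosome[3 * k: 3 * k + 3]
        # A node may only read addresses that are already computed,
        # which rules out feedback connections.
        values.append(FUNCTIONS[func](values[in1], values[in2]))
    out_genes = chromosome[3 * n_nodes: 3 * n_nodes + n_outputs]
    return [values[g] for g in out_genes]


# Median of three inputs: max(min(a, b), min(max(a, b), c)).
MEDIAN3 = [0, 1, 0,   # node 3 = min(in0, in1)
           0, 1, 1,   # node 4 = max(in0, in1)
           4, 2, 0,   # node 5 = min(node4, in2)
           3, 5, 1,   # node 6 = max(node3, node5)
           6]         # primary output reads node 6
print(evaluate(MEDIAN3, [7, 2, 5], n_nodes=4, n_outputs=1))  # prints [5]
```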
The search is usually performed using a simple (1 + λ) evolutionary algorithm. In this algorithm, every new population consists of the best individual of the previous population and its λ offspring created using a mutation operator. This operator randomly modifies up to h randomly selected genes (integers) of the chromosome. The search is typically terminated after generating a given number of populations. Each candidate solution in the population is evaluated using the so-called fitness function. As the problem is in principle multi-objective (error versus power consumption or area), a suitable multi-objective optimization algorithm has to be used [2], [3]. While the circuit area on a chip can be easily estimated by summing the areas of the components involved in the circuit, the error computation is more time demanding (see the next sections).
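The search loop can be sketched generically as follows. This is a single-objective illustration on a toy problem: the fitness function `toy_fitness`, the target vector and the rule that an offspring with equal fitness replaces the parent are assumptions, and the multi-objective handling used in [2], [3] is not modeled.

```python
import random


def mutate(parent, gene_ranges, h, rng):
    """Copy the parent and modify up to h randomly selected genes."""
    child = list(parent)
    for _ in range(rng.randint(1, h)):
        idx = rng.randrange(len(child))
        child[idx] = rng.randrange(gene_ranges[idx])
    return child


def one_plus_lambda(parent, fitness, gene_ranges, lam=20, h=5,
                    generations=200, seed=0):
    """Minimize `fitness` with a (1 + lambda) evolutionary algorithm."""
    rng = random.Random(seed)
    parent = list(parent)
    parent_fit = fitness(parent)
    for _ in range(generations):
        offspring = [mutate(parent, gene_ranges, h, rng) for _ in range(lam)]
        best_child = min(offspring, key=fitness)
        best_fit = fitness(best_child)
        if best_fit <= parent_fit:  # assumption: ties favor the offspring
            parent, parent_fit = best_child, best_fit
    return parent, parent_fit


TARGET = [3, 1, 4, 1, 5, 9, 2, 6]  # toy target, for illustration only


def toy_fitness(genes):
    return sum(abs(g - t) for g, t in zip(genes, TARGET))


start = [0] * 8
best, best_fit = one_plus_lambda(start, toy_fitness, [10] * 8)
print(best, best_fit)
```

In a circuit approximation setting, `fitness` would decode the chromosome into a circuit and evaluate its error (and cost) on the training data, which dominates the run time.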

Approximation Strategies based on CGP
Two strategies are compared in this case study:

AS1: Since the median filter is implemented as a network of compare-and-swap operations, an obvious approximation strategy is to remove some of them and reconnect the remaining ones in such a way that the error of filtering is minimized. We propose to seed CGP with the best known implementations of median filters and evolve approximate median filters containing fewer comparators than needed in the fully functional implementation. The fitness is constructed according to [24]. AS1, therefore, works at the level of comparators.

AS2: The whole image filtering function is evolved with CGP from scratch. The function set $G$ contains all suitable two-input components, not only the minimum and maximum functions. CGP thus holistically develops a new image filter with the aim to minimize the error of filtering on the training data. Following the approach developed for the evolutionary design of image filters [25], the error is measured by means of the mean absolute error (MAE) between the outputs $O_f$ produced by a candidate filter and the reference (golden) outputs $O_g$ for a given training data set, formally:

$$\mathrm{MAE} = \frac{1}{K} \sum_{i=1}^{K} \left| O_f(i) - O_g(i) \right|,$$

where $K$ is the number of filtered pixels.
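The mean absolute error between the filtered and the golden pixels can be computed as in the following minimal sketch, in which both images are assumed to be given as flat sequences of 8-bit pixel values of equal length.

```python
def mean_absolute_error(filtered, golden):
    """MAE between candidate filter outputs and reference (golden) outputs."""
    if not filtered or len(filtered) != len(golden):
        raise ValueError("both sequences must be non-empty and equally long")
    return sum(abs(f - g) for f, g in zip(filtered, golden)) / len(filtered)


print(mean_absolute_error([10, 20, 30, 40], [12, 20, 27, 40]))  # prints 1.25
```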
As this approach is not biased by a conventional solution (median filter), there is a chance to discover an implementation showing better filtering properties and lower power consumption.
It should be noted that these approximation strategies differ from the approximate median filters proposed in the literature because: paper [26] utilizes approximate transistor-level circuits to implement the comparators (our comparators are always exact) and papers [3], [27] do not initialize CGP with existing median implementations, but rather evolve approximate circuits from scratch using insufficient resources.

Results
This section presents the setup used to perform the desired approximations, the parameters of the evolved circuits and a comparison of approximate and original filters in terms of power consumption, area and filtering quality. In order to obtain the parameters of the evolved filters, we described the filters in VHDL and synthesized them using Synopsys Design Compiler with a 45 nm PDK. The filters were implemented as pipelined circuits with 8-bit operands. The goal of the synthesis was to produce implementations operating at 1 GHz or more. Section 5.1 deals with the implementation cost of conventional and approximate filters. The filtering quality is compared in Sec. 5.2.

Implementation Cost
Conventional (Exact) Filters: Table 2 summarizes the synthesis results for the various median filters discussed in Sec. 4.1: the median filter operating on a 3 × 3 (5 × 5) filter window denoted as MF9 (MF25), the center weighted median filter operating on 3 × 3 pixels with the weight equal to 3 (CWMF9), and the adaptive median filter (AMF25). While MF9 consists of 19 compare-and-swap operations (ops), AMF25 requires nearly ten times more operations. Each compare-and-swap operation is implemented using an 8-bit magnitude comparator and two 8-bit multiplexers. For each filter, the number of compare-and-swap operations, total power consumption and occupied area are presented. Contributions to power and area are given separately for registers and logic. The delay is intentionally omitted in all tables because timing constraints were met in all cases.
The key observation is that logic accounts for less than 20% of the total power consumption. This is due to the pipelined nature of the circuits. The area on a chip increases with the increasing complexity (i.e. with the number of compare-and-swap operations) of the filters. As expected, the common median filter operating with 3 × 3 pixels is the cheapest solution. If we extend the filter window to 5 × 5, the power consumption increases more than 6 times and the area on a chip increases nearly 5 times. The adaptive median filter represents the most complex and power-demanding filter in our study. Its power consumption is more than 8 times higher compared to MF9. The implementation cost of CWMF9 lies between MF9 and MF25 since CWMF9 is, in fact, a median network with 11 inputs, three of which are connected to the central pixel of the filter window. The power as well as the area on a chip are doubled compared to MF9.

Filters Approximated with AS1:
In order to obtain approximate median filters, CGP was seeded with the known optimal implementations of 9-input, 11-input and 25-input median networks exhibiting the minimal number of compare-and-swap operations. CGP operated with $n_a = n_b = 2$, $\lambda = 20$, $h = 5$, and $10^7$ ($6 \cdot 10^5$, respectively) generations were produced for 9-input (25-input, respectively) circuits. The function set contained 8-bit compare-and-swap functions and the identity function. The error was determined as the position distance with respect to the exact median according to [24]. The goal of CGP was to minimize the position error under a constrained area (we experimented with area limits from 20% to 95% of the area of the exact implementation). As the statistical evaluation of this type of evolutionary design has been performed in the literature [24], we will report just the best evolved solutions.
Several hundred approximate implementations were produced by CGP in total. We identified ten Pareto-dominant solutions for each type of filter and synthesized them using Synopsys Design Compiler to obtain their electrical parameters. Table 3 summarizes the total number of operations, power consumption and area for selected approximate filters. The obtained reduction with respect to the (exact) original solution is included in the 'red.' columns.
Table 3 shows that pruning the compare-and-swap operations and rearranging the remaining ones makes it possible to significantly reduce not only the area on a chip but also the power consumption. For example, the 9-input approximate median filter MF9 #9 exhibits a 75% reduction in power consumption and a 69% reduction in area compared to the accurate optimal implementation. The filtering quality will be reported in Sec. 5.2. Overall, more than 75% of the power budget is due to the switching activity of registers. The majority of the area on a chip is utilized by registers. Table 3 also includes parameters of approximate adaptive median filters. These approximate filters were obtained by replacing the exact 9-input median and 25-input median with their selected approximate implementations. The rest of the circuitry remained unchanged. Three variants of the approximate adaptive median filter are presented: AMF25 #19, AMF25 #79 and AMF25 #99. The first variant contains the exact 9-input median network MF9, the second the approximate MF9 #7, and the third employs MF9 #9. In all cases, the approximate MF25 #9 is employed. The approximate AMFs occupy nearly half of the area and achieve up to 61% power saving with respect to AMF.
Parameters of the best performing filters evolved under AS2 are summarized in Tab. 4. Two noise-specific filters are included in our comparison: a salt-and-pepper noise filter (denoted EVO #1) and a random-valued impulse noise filter (denoted EVO #2). Please refer to Sec. 5.2 for a description of the noise types. Both filters operate on a filter window consisting of 5 × 5 pixels. EVO #1 consists of 27 8-bit components (including 17 min/max functions) and occupies approximately the same area as MF9 but consumes about 50% more power. This is an interesting result because it operates on nearly three times as many inputs. EVO #2 is a more complex circuit having 33 8-bit components (including 20 min/max functions). Considering the fact that both filters have 25 inputs, they exhibit a significantly lower implementation cost and power compared to MF25. Their filtering quality will be discussed in Sec. 5.2.
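One plausible reading of the position distance is sketched below: the error is the distance between the rank of the returned value in the sorted window and the rank of the true median. This interpretation, including the handling of repeated values, is an assumption; the precise definition is the one given in [24].

```python
def position_error(window, output):
    """Distance (in ranks) between `output` and the exact median position.

    Assumed interpretation; ties are resolved in favor of the closest rank.
    """
    s = sorted(window)
    mid = len(s) // 2
    ranks = [i for i, v in enumerate(s) if v == output]
    if not ranks:
        raise ValueError("output is not one of the window values")
    return min(abs(r - mid) for r in ranks)


window = [7, 1, 9, 3, 5, 8, 2, 6, 4]
print(position_error(window, 5), position_error(window, 7))  # prints 0 2
```

A metric of this kind depends only on the relative order of the inputs, not on the pixel values themselves.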
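Selecting the non-dominated candidates from a set of evolved circuits can be sketched as follows. Both objectives are assumed to be minimized, and the (error, area) pairs in the example are made-up values used purely for illustration.

```python
def dominates(p, q):
    """True if p is no worse than q in all objectives and better in one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))


def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (error, relative area) pairs of candidate circuits.
candidates = [(0.0, 100), (0.5, 80), (0.7, 85), (1.2, 60), (1.2, 70), (3.0, 30)]
print(pareto_front(candidates))
# prints [(0.0, 100), (0.5, 80), (1.2, 60), (3.0, 30)]
```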
In order to improve the quality of output images, an ensemble of filters is often employed. In this evaluation, a bank of filters was constructed using the 3 best filters evolved for each type of noise [25] (i.e. BNK #1 for salt-and-pepper and BNK #2 for random shot noise). As Fig. 4 shows, these filters operate in parallel over the filter window. If the majority of the filters of the bank indicates that the processed pixel is a shot, then the median value is calculated from the outputs of these filters and sent to the primary output of the bank. Otherwise, the original value of the processed pixel is sent to the primary output of the bank. While BNK #1 occupies a significantly lower area on a chip compared to MF25 or AMF25, BNK #2 is comparable to AMF25.
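The voting scheme of the bank can be sketched as follows. The sketch assumes that a filter "indicates a shot" whenever its output differs from the processed pixel, which is an interpretation rather than a detail stated here. The three member filters are simple stand-ins operating on a 3 × 3 window for brevity, not the evolved 5 × 5 filters of BNK #1 or BNK #2.

```python
from statistics import median


def bank_output(window, filters):
    """Majority-voted output of a bank of filters for one window."""
    pixel = window[len(window) // 2]
    outputs = [f(window) for f in filters]
    # Assumption: a filter flags the pixel as a shot if it changes its value.
    votes = sum(out != pixel for out in outputs)
    if votes > len(filters) / 2:
        return median(outputs)
    return pixel


# Stand-in member filters (hypothetical, for illustration only).
def plain_median(w):
    return sorted(w)[len(w) // 2]


def guarded_median(w):
    s, p = sorted(w), w[len(w) // 2]
    return p if s[0] < p < s[-1] else s[len(s) // 2]


def identity(w):
    return w[len(w) // 2]


bank = [plain_median, guarded_median, identity]
print(bank_output([10, 11, 12, 11, 255, 12, 10, 11, 12], bank))  # prints 11
print(bank_output([10, 11, 12, 11, 12, 12, 10, 11, 13], bank))   # prints 12
```

In the second call only one member would modify the pixel, so the original value is preserved, which illustrates how the vote protects uncorrupted details.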

Filtering Quality
The quality of the proposed filters was evaluated using a set of 30 test images corrupted by two common types of noise: salt-and-pepper noise and random shot noise.
While salt-and-pepper noise removal represents a typical benchmark problem which can be satisfactorily addressed using the adaptive median filter, random shot noise removal is known to be a significantly harder problem. The reason is that the values of noisy pixels for salt-and-pepper noise are equal to either 0 or 255. In the case of random shot noise, a noisy pixel can take an arbitrary value from the whole range (i.e. 0-255). Therefore, it is more difficult to detect this noise because the value of a noisy pixel can be very close to its original value.
Figure 5 shows the filtering quality of common and approximate filters in terms of the mean peak signal-to-noise ratio (PSNR). As 7 noise intensities (ranging from 1% to 30%) were considered, every filter was, in fact, applied to 2 (noise type) × 30 (images) × 7 (noise intensity) = 420 images. The resulting trade-offs between power consumption and filtering quality for noise intensities of 1%, 15% and 30% are illustrated in Fig. 6.
The most interesting observations are as follows. Regarding the filtering quality, the mean PSNR indicates that the filters evolved in AS2 significantly outperform the other filters, especially at lower noise intensities (up to 15-20%, depending on the noise type). For highly corrupted images, the bank of evolved filters can be employed to further improve the quality of filtering.
AMF performs well on salt-and-pepper noise, but rather poorly on random shot noise; moreover, it is a very expensive solution. When approximate filters are introduced into AMF, the mean PSNR remains practically the same as for AMF25. The output quality depends mainly on the quality of MF9 (see the resulting PSNR for AMF25 #79 and AMF25 #99), but the difference is below 1 dB even when MF9 #9 consisting of eight compare-and-swap operations was employed. In any case, the approximate versions of AMF significantly reduced the power consumption of the original AMF. It has to be emphasized that the filters evolved in AS2 still consume only around 50% of the power budget of AMF25 #19.
CWMF9 and its approximations provide very good results for random shot noise. Hence, CWMF9 (or CWMF #7 having a slightly worse PSNR) seems to be a solution of first choice for energy constrained applications because it provides a 25% benefit in power compared to EVO #2.
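The difference between the two noise models can be illustrated by the following generator. It assumes that each pixel is corrupted independently with probability `intensity`; the exact procedure used to corrupt the test images is not specified here, so this is only an approximation of it.

```python
import random


def add_impulse_noise(pixels, intensity, kind, rng):
    """Corrupt 8-bit pixels with salt-and-pepper or random-valued shots."""
    if kind not in ("salt_and_pepper", "random_valued"):
        raise ValueError("unknown noise type: " + kind)
    noisy = list(pixels)
    for i in range(len(noisy)):
        if rng.random() < intensity:
            if kind == "salt_and_pepper":
                noisy[i] = rng.choice((0, 255))  # only extreme values
            else:
                noisy[i] = rng.randint(0, 255)   # any value in 0..255
    return noisy


rng = random.Random(1)
image = [128] * 1000
salt_pepper = add_impulse_noise(image, 0.15, "salt_and_pepper", rng)
random_shots = add_impulse_noise(image, 0.15, "random_valued", rng)
print(sorted(set(salt_pepper)))
```

For salt-and-pepper noise a corrupted pixel is always an extreme value and is therefore easy to detect, whereas a random-valued shot may differ from the original pixel by only a few gray levels.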
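For reference, the PSNR of a single 8-bit image can be computed as in the sketch below, which uses the standard definition based on the mean squared error with a peak value of 255. The images are assumed to be flat sequences of pixel values; the mean PSNR is then the average of this value over all test images.

```python
import math


def psnr(filtered, reference, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    if not reference or len(filtered) != len(reference):
        raise ValueError("images must be non-empty and equally sized")
    mse = sum((f - r) ** 2 for f, r in zip(filtered, reference)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)


def mean_psnr(pairs):
    """Average PSNR over (filtered, reference) image pairs."""
    return sum(psnr(f, r) for f, r in pairs) / len(pairs)


print(round(psnr([101, 99, 100, 100], [100, 100, 100, 100]), 2))
```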
If low power consumption is the key design objective then approximate versions of MF9 show the best trade-off.

Conclusions
On the basis of the literature survey, we reported approximate circuits and approximation methods that have been applied in the area of image and video processing.We observed that the approximations are conducted at different levels of abstraction (from transistors via gates to RT) and focused either on the whole modules (such as filters or DCT) or elementary components (such as adders and multipliers) of these modules.In addition to ad hoc approximation methods, many general-purpose approximation methods have been used.Only in rare cases the approximation methods were mutually compared in terms of quality of produced approximate circuits.
In order to investigate the impact of approximation methods on the quality of the resulting approximate circuits, the median circuit approximation problem was chosen. We compared two CGP-based approximation strategies based on removing and rearranging some components (AS1) and on a complete redesign of the circuit (AS2). Three conventional median-based circuits (MF, CWMF, and AMF) were included in our study. The approximations were performed for two types of noise and evaluated for 7 noise intensities. As all circuits were implemented as pipelined structures operating at 1 GHz or more, the approximation and optimization were focused on obtaining the best trade-offs between power consumption and filtering error. In the case of AS1, approximate circuits consistently show slightly worse filtering quality, but significantly reduced power consumption with respect to their exact counterparts. The best trade-offs were obtained with AS2, i.e. when CGP was not biased by conventional designs and could deliver new well-optimized filtering structures.
We can conclude that a complete resynthesis of the circuit, rather than approximating a conventional solution, provides better trade-offs, especially if good filtering quality is desired. While this approach (AS2) was applicable to image filtering circuits, it is not currently applicable to complex circuits (such as the whole HEVC coder) because the design process based on CGP is not fully scalable. Improving its scalability will be one of our future research objectives.
Section 4.1 provides a brief overview of conventional implementations of median filters and their extensions. CGP is then introduced in Sec. 4.2. Two approximation strategies (AS) are proposed in Sec. 4.3: (AS1) CGP is employed to approximate circuit implementations of the considered filters. (AS2) CGP is used to holistically evolve desired image filters from scratch.

Fig. 1. Pipelined implementation of 9-input median filter consisting of compare-and-swap (CS) blocks and registers (D). All CS blocks contain the output register.

Fig. 2. Adaptive median filter internally computing minimum, maximum and median over kernels with 3×3 and 5×5 pixels and determining the output value using Selector.

Fig. 5. Mean PSNR on 30 test images and different noise intensities obtained for conventional and approximate filters: salt-and-pepper noise (left) and impulse noise (right).

Fig. 6. Mean PSNR and power consumption of selected image filters: salt-and-pepper noise (left) and impulse noise (right).