Towards the Optimal Hardware Architecture for Computer Vision

Introduction
Computer Vision systems are experiencing a large increase in both range of applications and market sales (BCC Research, 2010). From industry to entertainment, Computer Vision systems are becoming more and more relevant. The research community is making a great effort to develop systems able to handle complex scenes, focusing on the accuracy and robustness of the results. New algorithms provide more advanced and comprehensive analysis of the images, expanding the set of tools available to implement applications (Szeliski, 2010).
Although new algorithms make it possible to solve new problems and to approach complex situations with a high degree of accuracy, not all of them are suitable for deployment in industrial systems. Parameters like power consumption, integration with other system modules, cost and performance limit the range of suitable platforms. In most cases, the algorithms must be adapted to reach a trade-off and to take advantage of the target platform.
Conventional PC-based systems offer ever-increasing performance, but their use is still limited to areas where portability, power consumption and integration are not critical. For highly complex algorithms, with an irregular execution flow, complex data representations and elaborate data-access patterns, little is gained by moving to an ad hoc hardware design. In this case, a high-end CPU combined with a GPU with general-purpose capabilities (GPGPU) is a flexible and very powerful option that will outperform the alternatives (Castano-Díez et al., 2008).
However, when a conventional system does not meet the requirements of the application, a more aggressive approach is needed, for instance migrating the application to a dedicated device such as a DSP, an FPGA or a custom chip (Shirvaikar & Bushnaq, 2009). At this point, designers have to consider alternatives such as reducing the operating range, accuracy and robustness of the results, or removing expensive operations in order to simplify the hardware to be implemented (Kolsch & Butner, 2009). PC-based systems provide great flexibility at the cost of performance, so pure software-based algorithms hardly match pure hardware implementations. This is a serious limitation because it can compromise the efficiency of the application. This is why the industry is making great efforts to develop novel architectures that provide greater flexibility to adapt the algorithms without compromising the quality of the results.

Low-level vision
Low-level vision covers the earliest processing stages, which operate directly on the pixels of the input images: noise reduction, color balancing, geometrical transformations, etc. Most of these operations are based on point or near-neighborhood operations. Point operations are performed at pixel level in such a way that the output only depends on the value of individual pixels from one or several input images. With this type of operation it is possible to modify the pixel intensity to enhance parts of the image, by increasing contrast or brightness. Likewise, simple pixel-to-pixel arithmetic and Boolean operations enable the construction of operators such as alpha blending, for image combination, or color space conversion. Neighborhood operations also take into account the values of adjacent pixels. This operation type is the basis of filtering, binary morphology and geometric transformations. They are characterized by simple operations, typically combining weighted sums, Boolean and thresholding processing steps.
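As a concrete illustration, the following NumPy sketch (function names and parameter values are illustrative, not taken from the chapter) implements one point operation and one neighborhood operation:

```python
import numpy as np

def adjust_contrast_brightness(img, gain=1.2, bias=10):
    """Point operation: each output pixel depends only on the same input pixel."""
    out = gain * img.astype(np.float32) + bias
    return np.clip(out, 0, 255).astype(np.uint8)

def box_filter3x3(img):
    """Neighborhood operation: each output pixel is a weighted sum of its 3x3 neighborhood."""
    img = img.astype(np.float32)
    padded = np.pad(img, 1, mode='edge')
    out = np.zeros_like(img)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += padded[1 + dy: 1 + dy + img.shape[0],
                          1 + dx: 1 + dx + img.shape[1]]
    return (out / 9.0).astype(np.uint8)
```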
After the preprocessing stages, useful information has to be extracted from the resulting images. Common operations are edge detection, feature extraction and image segmentation. Edges are usually defined as step discontinuities in the image signal, so finding local maxima in the first derivative of the image, or zero-crossings in the second derivative, is a suitable way to detect boundaries. Both tasks are usually performed by convolving the input image with spatial filtering masks that approximate a first- or second-derivative operator.
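For instance, a first-derivative edge detector reduces to two 3x3 convolutions and a magnitude threshold; the sketch below is a straightforward NumPy version with an illustrative threshold value:

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float32)
SOBEL_Y = SOBEL_X.T  # transpose gives the vertical-derivative mask

def filter3x3(img, kernel):
    """Apply a 3x3 spatial mask to every pixel (edge-replicated borders)."""
    padded = np.pad(img.astype(np.float32), 1, mode='edge')
    out = np.zeros(img.shape, dtype=np.float32)
    for ky in range(3):
        for kx in range(3):
            out += kernel[ky, kx] * padded[ky:ky + img.shape[0], kx:kx + img.shape[1]]
    return out

def sobel_edges(img, threshold=100.0):
    gx = filter3x3(img, SOBEL_X)       # approximate d/dx
    gy = filter3x3(img, SOBEL_Y)       # approximate d/dy
    magnitude = np.hypot(gx, gy)
    return magnitude > threshold       # binary edge map
```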
Feature points are widely used in subsequent computing steps of many CV applications. Basically, a feature represents a point in the image which differs from its neighborhood. One of the benefits of local features is their robustness against occlusion and their ability to handle geometric deformations between images when dealing with viewpoint changes. In addition, they improve accuracy when, in the same scene, objects lie on different planes (i.e. at different scales). One of the most popular techniques is the corner detector proposed by Harris (Harris & Stephens, 1988). It is widely used due to its strong invariance to rotation, image noise and moderate illumination changes. It uses the local auto-correlation function, which describes the gradient distribution in a local neighborhood of each image point, to detect the location of the corners. Using the locally averaged moment matrix of the image gradients, corners are located at its maximum values. Another frequently used technique is the Scale-Invariant Feature Transform (SIFT) (Lowe, 2004). SIFT localizes extrema both in space and scale. Using the Difference of Gaussians as scale-space function, the images are filtered with Gaussian kernels of different sizes (scales). This is performed for different image sizes (octaves). The response of each filter is subtracted from the immediately following one in the same octave. The interest points are scale-space extrema, so local maxima and minima are extracted by comparing each point with its neighbors in the same, the previous and the following scales. To improve accuracy, a sub-pixel approximation step is performed, interpolating the location of the feature inside the scale-space structure. The number of octaves and scales can be tuned to meet the system requirements. SIFT provides invariance against scale, orientation and affine distortion, as well as against partial occlusion and illumination changes. Other algorithms were proposed to improve accuracy or performance, like the Speeded Up Robust Features (SURF) (Bay et al., 2006) or the Gradient Location and Orientation Histogram (GLOH) (Mikolajczyk & Schmid, 2005). These detectors are quite complex and their performance can be low even on custom hardware. For this reason, less reliable algorithms such as the Harris corner detector, FAST (Trajkovi & Hedley, 1998) or the Smallest Univalue Segment Assimilating Nucleus (SUSAN) (Smith & Brady, 1997) are still in use, because of their efficiency under controlled situations and their low hardware requirements.
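The Harris response can be summarized in a few lines. The sketch below uses SciPy with illustrative parameter values (the averaging window sigma and the empirical constant k) and omits the final non-maximum suppression step:

```python
import numpy as np
from scipy.ndimage import sobel, gaussian_filter

def harris_response(img, sigma=1.5, k=0.04):
    """Harris corner response R = det(M) - k*trace(M)^2, where M is the
    locally averaged moment matrix of the image gradients."""
    img = img.astype(np.float32)
    ix = sobel(img, axis=1)                  # horizontal gradient
    iy = sobel(img, axis=0)                  # vertical gradient
    # Locally averaged auto-correlation (second-moment) matrix entries.
    ixx = gaussian_filter(ix * ix, sigma)
    iyy = gaussian_filter(iy * iy, sigma)
    ixy = gaussian_filter(ix * iy, sigma)
    det = ixx * iyy - ixy * ixy
    trace = ixx + iyy
    return det - k * trace ** 2              # corners are local maxima of R
```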

Segmentation refers to the process of separating the data into several sets according to certain characteristics. There are several techniques to carry out this task, either based on boundaries or on regions (Pal & Pal, 1993) (Haralick & Shapiro, 1985). Nevertheless, most of them rely on near-neighborhood operations. Particular attention is deserved by clustering methods like the popular k-means, which partitions the data set into several clusters according to a proximity criterion defined by a distance function. These methods are not restricted to image data: N-dimensional sets of abstract data can also be partitioned. Furthermore, information about the scene or domain can be introduced (number and characteristics of the target clusters). Therefore, they might be classified either as a low- or a mid-level vision stage.
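A plain NumPy sketch of k-means applied to color-based segmentation follows; the number of clusters, the fixed iteration count (used in place of a convergence test, for brevity) and the helper name are illustrative:

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign each sample to the nearest centroid (Euclidean
    distance), then recompute the centroids; repeat for a fixed number of iterations."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Distance from every sample to every centroid, shape (n, k).
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = points[labels == c].mean(axis=0)
    return labels, centroids

# Color-based segmentation: cluster the pixels of an RGB image into k regions.
# img = ...  # H x W x 3 array
# labels, _ = kmeans(img.reshape(-1, 3).astype(np.float32), k=4)
# segmented = labels.reshape(img.shape[:2])
```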

Mid-level vision
Mid-level CV stages usually operate on images from previous processing steps, often binary images, and produce a lower amount of data but with a higher concentration of information. Some common operations are object classification and scene reconstruction.
One of the goals of Computer Vision is to recognize objects in a scene. Based on object location, pose or 2D/3D spatial relations between the objects, the algorithms have to be able to analyze the scene and its content. This involves issues such as dealing with object models, classifiers and the ability to integrate new information into the models. In the literature a large number of techniques can be found, usually classified as global methods, more intended for object detection, and local feature-based methods, for object recognition. In all of them good image registration is essential for both accuracy and performance (Zitova & Flusser, 2003). As for the global methods, common techniques are based on template matching, which employs a convolution mask or template to measure the similarity between an object patch and the template. In this sense, normalized cross-correlation (NCC), the sum of squared differences (SSD) and the sum of absolute differences (SAD) are widely used. As for the local methods, local feature descriptors play an important role. Roughly speaking, a descriptor is an abstract characterization of a feature point based on its surroundings. One of the most popular techniques is the one proposed in the second part of the SIFT algorithm, based on stacked orientation histograms which associate a high-dimensional vector with each keypoint. In order to reduce the number of false positives and negatives during the matching stage, the search area is limited using strategies like the nearest neighbor search (NNS), which attempts to find the points nearest to a given one in a vector space. An indexing structure allows features near a given feature to be found rapidly. This is the case of k-dimensional trees, which organize points in a k-dimensional space in such a way that each node has at most two child nodes.
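A minimal sketch of this descriptor-matching step using SciPy's k-d tree is shown below; the ratio-test threshold is illustrative, and the descriptor inputs are assumed to be one-row-per-feature matrices (e.g. 128-dimensional SIFT vectors):

```python
import numpy as np
from scipy.spatial import cKDTree

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Nearest-neighbour search over feature descriptors using a k-d tree.
    A match is kept only if the closest descriptor in B is clearly better than
    the second closest (ratio test), which prunes ambiguous matches."""
    tree = cKDTree(desc_b)
    dists, idx = tree.query(desc_a, k=2)       # two nearest neighbours in B
    keep = dists[:, 0] < ratio * dists[:, 1]
    return np.flatnonzero(keep), idx[keep, 0]  # matched indices into A and B
```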
Scene reconstruction consists of generating scene models starting from their parts. There are different techniques to reconstruct one or several objects in a scene. To build a 3D model, the coordinates of scene points have to be computed from the objects. If the location of the camera is known, the 3D coordinates of a scene point can be determined from its projections on the image planes of different viewpoints. The whole process starts with feature extraction and matching. Using geometric consistency tests it is possible to eliminate wrong matches. There are different methods to estimate the fundamental matrix, such as RANdom SAmple Consensus (RANSAC). Once the matches between images are consistent, the camera poses and the scene geometry are reconstructed using Structure from Motion methods and refined with bundle adjustment techniques (Triggs et al., 2000).
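A model-agnostic sketch of the RANSAC loop is given below; the fit_model and residuals callbacks are placeholders that, for this use case, would implement the eight-point fundamental-matrix estimate and an epipolar-distance residual:

```python
import numpy as np

def ransac(data, fit_model, residuals, min_samples, threshold,
           iterations=1000, seed=0):
    """Generic RANSAC: repeatedly fit a model to a random minimal sample and
    keep the model that explains the largest set of inliers."""
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, np.zeros(len(data), dtype=bool)
    for _ in range(iterations):
        sample = data[rng.choice(len(data), min_samples, replace=False)]
        model = fit_model(sample)
        inliers = residuals(model, data) < threshold
        if inliers.sum() > best_inliers.sum():
            best_model, best_inliers = model, inliers
    return best_model, best_inliers
```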


High-level vision
The high-level stage often starts from an abstract representation of the information. This stage is highly application dependent and, due to the variety of operations, data structures, memory access patterns and program flow characteristics, it is often only compatible with a general-purpose processor. High-level processing is characterized by the use of a small set of data to represent knowledge about the application domain. More complex data structures are needed to store and process this information efficiently, making the operations and the memory access patterns more elaborate. This, together with the inherent complexity of decision making, makes the program flow very variable.
Robust pattern recognition, object identification, complex decision making or system adaptation are some of the benefits of integrating Artificial Intelligence methods with Computer Vision. Otherwise, the system would be limited to a predetermined set of actions. Machine learning makes computers capable of improving automatically with experience. This way it is possible to generalize the behavior from unstructured information with techniques such as neural networks, decision trees, genetic algorithms, regression models or support vector machines. Machine learning has emerged as a key component of intelligent computer vision systems, contributing to a better understanding of complex images (Viola & Jones, 2001) (Oliver et al., 2000).
Data mining is the process of analyzing data using a set of statistical techniques in order to summarize it into segments of useful information. This makes it possible to analyze data from different dimensions or angles, categorizing and summarizing the identified relationships and revealing hidden relationships or patterns between events. In contrast to machine learning, data mining focuses on discovering hidden patterns instead of generalizing known patterns to new data.
It is very difficult to establish a classification of tasks and operations for high-level processing. Some of the tasks performed fall within the scope of measuring application-specific parameters, such as the size and pose of objects, fault detection or monitoring specific events such as traffic situations. The algorithms and technologies are very diverse and most of them lie in the statistical analysis and artificial intelligence domains.
As previously mentioned, some tasks that initially fit into the low- or mid-level stages due to their context are actually closer to high-level operations because of the type of operations performed. This is related to the commonly used bottom-up image analysis, which starts from raw data to extend the knowledge of the scene. However, new approaches include feedback to perform top-down analysis. In this way, low- and mid-level stages can be controlled using general knowledge of the image, improving the results.

Computing platforms
Given the wide range of algorithms and applications of Computer Vision, it is clear that there is no unique computing paradigm or optimal hardware platform. The type of operations, the complexity of the data structures and especially the data access patterns largely determine parameters such as the range of application, performance, power consumption and cost. This section presents some of the most prominent platforms in image processing, focusing on their strengths and weaknesses.


Computing paradigms
Flynn's taxonomy (Flynn, 1972) classifies computer architectures into four large groups according to the number of concurrent instruction and data streams processed. Image processing tasks perform more or less efficiently depending on the selected paradigm. In order to develop a Computer Vision application it is crucial to exploit spatial (data) or temporal (task) parallelism to meet trade-offs among performance, power consumption and cost.

SISD
Single Instruction Single Data (SISD) refers to the conventional computing model. A single processing unit executes a sequence of instructions on a single data stream. Most modern computers fall under this category; even those able to pipeline their data-path are generally classified as SISD, as they are still serial computers with one processor and one memory element.
This paradigm performs better when spatial and temporal parallelism are hard to exploit. As seen previously, high-level image processing fits the SISD paradigm because most tasks are sequential, with a complex program flow and strong dependences between data. As processing is done sequentially, most optimizations aim to improve the path between the memory and the arithmetic unit. Memory and processor speed are the main constraints. Furthermore, some degree of parallelism can be exploited by pipelining the data-path. Data allocation, pre-fetching and reducing pipeline stalls are some of the possible optimizations.

SIMD
SIMD (Single Instruction Multiple Data) computers have a single control unit and multiple processing units. The control unit sends the same instruction to all processing units, which operate on different data streams. This paradigm focuses on exploiting spatial parallelism. It is also possible to pipeline the data-path or to employ several memories to store the data in order to increase the bandwidth. SIMD computers are commonly special-purpose, intended to speed up certain critical tasks.
One of the drawbacks of SIMD machines is data transfer. A network is required both to supply data to each processing unit and to share data among them. Its size grows with the number of connected nodes, so SIMD architectures face a practical limitation. Another restriction is data alignment when gathering and scattering data into SIMD units, which reduces the flexibility of practical implementations: the correct memory addresses must be determined and the data reordered adequately, affecting performance. In addition, as this paradigm exploits spatial parallelism, program flow is heavily limited because all units execute the same instruction. Additional operations are needed to enable even simple flow-control tasks.
Low-level image processing benefits greatly from SIMD units. As described previously, most operations are quite simple but repetitive over the whole data set. In addition, certain mid- and high-level processing tasks can also take advantage of SIMD units when used in conjunction with other paradigms. The simplicity of the arithmetic units and of the memory access patterns makes it feasible to design efficient units, which is crucial to increase the parallelism. Memory bandwidth and data distribution among the processors are also key for performance.
Two types of SIMD accelerators can be distinguished based on the number of processing units: fine-grain and coarse-grain processors. While the former includes a large number of very simple processors with a rigid network, the latter offers greater flexibility although with much lower parallelism. When used in low-level image processing, fine-grain processor arrays match massively parallel operations such as filters or morphological operations. Using a processor-per-pixel scheme and local communications, neighborhood operations are completed in just a few instructions. On the contrary, when the parallelism level is lower, a vector-processor configuration is usually preferable. By reducing communications and increasing core complexity, vector processors are much more flexible and efficient, not only for low-level operations but also for other processing stages.
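On a conventional machine this contrast can be illustrated in software: the vectorized NumPy expression below maps naturally onto SIMD units, applying one operation to many pixels at once, whereas the explicit loop processes one pixel per iteration (the image is synthetic and used only for the comparison):

```python
import numpy as np

img = np.random.randint(0, 256, (480, 640), dtype=np.uint8)

# Scalar (SISD-style) thresholding: one pixel per iteration.
binary_scalar = np.empty(img.shape, dtype=np.uint8)
for y in range(img.shape[0]):
    for x in range(img.shape[1]):
        binary_scalar[y, x] = 255 if img[y, x] > 128 else 0

# Data-parallel (SIMD-style) thresholding: one instruction, many pixels.
binary_simd = np.where(img > 128, 255, 0).astype(np.uint8)

assert np.array_equal(binary_scalar, binary_simd)
```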

MIMD
Multiple Instruction Multiple Data (MIMD) refers to architectures where several data streams are processed using multiple instruction streams. MIMD architectures have several processing units executing different instructions to exploit task parallelism. Processors operate independently and asynchronously. MIMD systems are classified according to their memory architecture.
In Shared Memory Systems, all processors have access to a single memory. The connection hierarchy and latencies are the same for all processors. This scheme eases data transfer among processors, although simultaneous accesses must be handled to avoid data hazards. Scalability is also limited because it is hard to increase the memory bandwidth at the same rate as the number of processors.
If each processor has its own private memory, it is easier to scale up, as memory and processor are packed as a unit. This scheme is known as a Distributed Memory System. In addition, local memory access is usually faster. The major disadvantage is the access to data located outside the private memory, because dedicated buses and a message-passing system to communicate between the processors are needed. This can result in high access times and increased hardware requirements.
In a Distributed Shared Memory System, the processors have access to a common shared memory but without a shared channel. Each processor is provided with local memory and is interconnected with the other processors through a high-speed channel. All processors can access the different banks of a global address space. Memory access follows schemes such as Non-Uniform Memory Access (NUMA), where accessing local memory takes less time than accessing the remote memory of another processor. In this way scalability is not compromised.
Very Long Instruction Word (VLIW) and superscalar architectures are also classified within the MIMD paradigm because they exploit instruction-level parallelism, executing multiple instructions in parallel. Pipelining also executes multiple instructions, but by splitting them into independent steps to keep all the units of the processor working at the same time. Mid-level image processing and some operations of the other processing levels of Computer Vision can exploit MIMD processors. Operations are relatively simple, with data-dependent program flow. Temporal and task parallelism are easier to exploit than spatial parallelism, although a reduced degree of the latter is usually present in this type of algorithms. Each MIMD processing element can include SIMD units. This makes it possible to process complex tasks more efficiently, from kernel operations, such as pre-processing images concurrently in multi-view vision systems, to high-level tasks such as multiple object recognition and tracking. In general, any set of tasks with weak dependences between them, so as to reduce internal communications, can take advantage of this paradigm.
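A software analogy of MIMD task parallelism is to split an image into independent tiles and process them in separate processes; the sketch below (tile count and filter chosen purely for illustration) omits the halo exchange a real neighborhood operation would need at tile borders:

```python
import numpy as np
from multiprocessing import Pool

def process_tile(tile):
    """Per-tile work: a simple 3x3 mean filter stands in for any kernel."""
    padded = np.pad(tile.astype(np.float32), 1, mode='edge')
    out = sum(padded[dy:dy + tile.shape[0], dx:dx + tile.shape[1]]
              for dy in range(3) for dx in range(3)) / 9.0
    return out.astype(np.uint8)

if __name__ == "__main__":
    img = np.random.randint(0, 256, (1024, 1024), dtype=np.uint8)
    tiles = np.array_split(img, 4, axis=0)      # one horizontal strip per worker
    with Pool(processes=4) as pool:
        result = np.vstack(pool.map(process_tile, tiles))
```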

MISD
There is one more paradigm, Multiple Instruction Single Data (MISD), which achieves higher parallelism than SISD by executing different instructions over the same data set, employing several computing units.
Systolic arrays are regular n-dimensional arrays of simple cores with nearest-neighbor interconnections. Each core operates on the input data and passes the result to its neighbor, with data flowing synchronously, usually in different directions. They are employed for tasks such as image filtering or matrix multiplication. Pipelined architectures belong to this type, as they can be seen as one-dimensional systolic arrays, but they are commonly considered an improved version of the other aforementioned paradigms.
This paradigm is rarely used for Computer Vision, as the other paradigms fit better and offer higher performance and flexibility when dealing with real Computer Vision problems.

Remarks
Low-level Computer Vision entails the largest processing times in most applications. Data sets are usually very large and the operations simple and repetitive. However, the operations are inherently massively parallel and the data access patterns are regular. It is feasible to exploit these features to design highly optimized SIMD custom hardware accelerators or to map the algorithms onto existing hardware.
It is harder to extract parallelism in the mid-level stage because the operations involve a more complex data flow. In addition, the data set is smaller, so the benefit of including dedicated units to speed up the computation is lower than expected. Despite this, hybrid processors (SIMD-MIMD) able to exploit both spatial and temporal parallelism can overcome this limitation.
Finally, the amount of data involved in the high-level stage is usually small, so it is rarely necessary to sacrifice precision in order to get better performance. Moreover, unlike in previous stages, the diversity of data types often makes floating-point arithmetic a requirement. Another characteristic of this stage is the program flow, far more complex, which may even consume more computation time than the arithmetic operations. The kinds of computation performed at this stage are so varied that the best option is often a general-purpose SISD processor.


Current devices
There are different possibilities to implement the aforementioned computing paradigms. There is no unique, direct correspondence between a paradigm and its hardware implementation. On the one hand, dedicated hardware can be designed following the original conception. On the other hand, the paradigm can be emulated either in hardware or in software.

Microprocessors
Microprocessors, as SISD machines, are the most straightforward devices on which to develop a Computer Vision application. Their main advantage is their versatility, the ability to perform very different tasks at a low cost. They can perform any type of data processing, although their efficiency, measured in parameters such as cost, power or integration capabilities, is not always optimal because of their general-purpose nature. The large variety of available technologies, libraries, support and programs cuts down the cold start, making it possible to get the system ready for development in a short time. Developers can focus on the problem itself instead of on technical issues (Bradski & Kaehler, 2008).
Basically, they are composed of a main memory and a processing unit which includes the arithmetic and control modules. From this basic structure, more optimized microprocessors can be designed: caches to cut down memory access times, tightly coupled high-speed memory controllers or specialized units for critical tasks; the variety of architectures is as large as the number of market segments (Singhal, 2008). However, despite the evolution of the industry, pure SISD microprocessors do not offer adequate performance for a large set of tasks. That is why a wide range of accelerator modules have been included, such as special-purpose arithmetic units and instruction sets, or co-processors. The inclusion of SIMD units is decisive for tasks such as video encoding and decoding, but any data-intensive algorithm can take advantage of them (Franchetti et al., 2005).
As will be discussed later, advances in the semiconductor industry increase the integration density, so it is possible to include more processing power on the same silicon area. This has led to abandoning the race for speed (increasing the working frequency) in favor of more efficient systems, where energy consumption is vital and parallelism is the way to overcome the limitations of Moore's Law (Naffziger, 2009). Most modern processors are multi-core and the number of cores is expected to grow in the near future. Programming languages and techniques, as well as image processing algorithms, have to be adapted to this new reality (Chapman, 2007).
Microprocessors are employed in a wide range of applications, from developing and testing algorithms, such as autonomous driving (Urmson et al., 2008), to final platforms, such as medical image reconstruction (Chu & Chen, 2009). Even their use in restrictive stand-alone devices, such as autonomous flight (Meier et al., 2011), is viable. Microprocessors stand out in high-level tasks, such as the latest stages of image retrieval (Deselaers et al., 2008) and scene understanding (Li et al., 2009), where handling image databases, storing and communicating data are too complex to implement on specific-purpose devices. Video surveillance tasks can take advantage of these features for image processing (Ahmed & Terada, 2010) and event control in complex distributed systems (Chen et al., 2008). Mobile processors have become a benchmark in innovation and development after the explosion of the mobile market. As discussed below, they integrate several general-purpose cores, graphics processing units and other co-processors on a single chip while keeping power consumption very low. The applications they can address are increasingly complex (Taylor et al., 2009) (Wither et al., 2011) (Ren et al., 2010).

Graphics Processing Units
A Graphics Processing Unit (GPU) is a specialized co-processor for graphics processing, intended to reduce the workload of the main microprocessor in PCs. GPUs implement highly optimized graphics operations or primitives. Current GPUs provide high processing power and exploit the massive spatial parallelism of these operations. Because of their specialization, they can perform such operations faster than a modern microprocessor even at lower clock rates. GPUs have hundreds of independent processing units working on floating-point data. Memory access, both in bandwidth and speed, is critical to avoid processing stalls.
Their high processing power makes GPUs attractive devices for non-graphics tasks. General-Purpose GPU (GPGPU) computing is a technique to perform computation not related to graphics on these devices (Owens et al., 2007) (Che et al., 2008). It makes it possible to use their specialized and limited pipeline to perform complex operations over complex data types. In addition, it eases memory management and data access. Flow control, such as looping or branching, is restricted, as in other SIMD processors. Modern GPU architectures such as (Seiler et al., 2008) or (Lindholm et al., 2008) have added support for these operations, although with a slight throughput penalty. Word size is also a limitation in GPGPU techniques. It was reduced to increase the integration density, as graphics operations do not usually require high precision. However, since this was a serious limitation for scientific applications, support for larger word sizes was added later (Thall, 2006).
GPUs are effective when using stream processing, a paradigm related to SIMD (Rixner et al., 1998). A set of operations (the kernel) is applied to each element of a set of data (the stream). Flexibility is reduced in order to increase the parallelism and lower the communication requirements when hundreds of processing elements are involved; otherwise, supplying data to hundreds of processors would be a bottleneck. Processors are usually pipelined in such a way that results pass from one arithmetic unit to the next. In this way, locality and concurrency are better exploited, reducing communication requirements because most of the data are kept on-chip.
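As an illustration of the stream-processing model, the sketch below (assuming the Numba package and a CUDA-capable GPU; function and parameter names are illustrative) applies the same small kernel to every pixel of a flattened image, which is how a GPGPU framework maps a point operation onto hundreds of processing elements:

```python
import numpy as np
from numba import cuda

@cuda.jit
def scale_offset_kernel(gain, bias, src, dst):
    # Each GPU thread processes one element of the stream.
    i = cuda.grid(1)
    if i < src.size:
        dst[i] = gain * src[i] + bias

pixels = np.random.rand(1 << 20).astype(np.float32)   # flattened image
result = np.empty_like(pixels)
threads = 256
blocks = (pixels.size + threads - 1) // threads
scale_offset_kernel[blocks, threads](1.2, 0.1, pixels, result)
```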
The use of GPUs to speed up computation has greatly increased, especially after the optimization of libraries and functions that mask low-level technical difficulties. Most image processing kernels and algorithms have been adapted to GPGPU, obtaining significant improvements. Basic image processing (Podlozhnyuk, 2007), FFT transforms (Govindaraju & Manocha, 2007), feature extraction (Heymann et al., 2007) and stereo vision (Lui & Jarvis, 2010), all of them computationally expensive, benefit greatly from the massive parallelism of GPUs. They can also be used to emulate other computing paradigms frequently used in low-level vision (Dolan & DeSouza, 2009). However, some authors argue that the gap between GPUs and CPUs is not as large as it seems if key optimizations are carried out (Lee et al., 2010).

Digital Signal Processors
A Digital Signal Processor (DSP) is a microprocessor-based system with an instruction set and hardware optimized for data-intensive applications. DSPs are especially useful for real-time signal processing, but they offer high throughput in any data-intensive application. The DSP market is well established and offers a large range of devices, optimized for each particular task (Schneiderman, 2010). Apart from being attached processors that assist a general-purpose host microprocessor, DSPs are often used in embedded systems including all the necessary elements and software.
They are able to exploit parallelism both in instruction execution and in data processing. In a von Neumann architecture, instructions and data share the same memory space. However, DSP applications usually require several memory accesses per instruction to read and write data. To exploit concurrency, many DSPs are based on a Harvard architecture, with separate memories for data and instructions. Many modern devices are based on Very Long Instruction Word (VLIW) architectures, so they are able to execute several instructions simultaneously (Lin et al., 2008). Compilers are fundamental to finding the parallelism in the instructions, and a large improvement can be obtained from an efficient placement of data and programs in memory. Superscalar processors and pipelined data-paths also improve the overall performance, although this is done in hardware. They include specialized hardware for intensive calculation, such as the multiply-accumulate operation, which is able to produce a result in one clock cycle. Although many DSPs have floating-point arithmetic units, fixed-point units fit better in battery-powered devices. Formerly, floating-point units were slower and more expensive, but this gap is getting smaller and smaller. They also include zero-overhead looping, rounding and saturated arithmetic, and dedicated units for address management (Talavera et al., 2008; Texas Instruments, 2002; Wang et al., 2010).
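The sketch below is a behavioural Python model of the multiply-accumulate (MAC) pattern a fixed-point DSP implements in hardware, here as an FIR filter; the Q1.15 coefficient format and the helper name are illustrative assumptions:

```python
import numpy as np

def fir_fixed_point(samples, coeffs, frac_bits=15):
    """FIR filter written as a chain of multiply-accumulate (MAC) operations,
    the primitive a DSP executes in a single cycle. Coefficients are quantized
    to Q1.15 fixed point; samples are assumed to be integer PCM values."""
    coeffs_q15 = np.round(np.asarray(coeffs) * (1 << frac_bits)).astype(np.int32)
    out = np.zeros(len(samples), dtype=np.int32)
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs_q15):        # one MAC per filter tap
            if n - k >= 0:
                acc += int(c) * int(samples[n - k])
        out[n] = acc >> frac_bits                 # scale back to sample range
    return out
```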
DSPs are usually designed for intensive data processing, so performance can suffer in mixed tasks. However, current all-in-one devices are able to handle complete applications efficiently. In addition, DSPs are available not only as independent devices but also as part of integrated circuits such as FPGAs or SoCs.
DSPs have a long tradition in image processing tasks. As optimized versions of conventional processors, they were used to accelerate the most expensive operations. These operations were mainly related to low-level image processing (Baumgartner et al., 2009), where parallelism and data access enable a large performance increase. Stereo vision (Lin & Chiu, 2008), the Fourier transform (Sun & Yu, 2009) or video matching and tracking (Shah et al., 2008) are some examples. However, higher-level algorithms are also suited to DSPs, especially in industrial tasks (Suzuki et al., 2007) (Neri et al., 2005). Nowadays, DSPs are able to handle large sets of operations efficiently, both for co-processing (Rinnerthaler et al., 2007) and standalone (Bramberger & Rinner, 2004).

Field Programmable Gate Arrays
A Field Programmable Gate Array (FPGA) is a device with user-programmable hardware logic. It is made of a large set of logic cells connected together through a network. Both elements are programmable, in such a way that the logic cells emulate combinational functions and the network joins them to build more complex functions. In addition, the configuration can be rewritten as many times as needed.
The main advantage of FPGAs is their high density of interconnections between cells, which provides very high flexibility. This network has a complex hierarchy with optimizations for specific functions. It provides specialized lines to propagate clock or reset signals across the whole FPGA, or to build buses with high fan-out within acceptable time delays. With these very basic elements it is possible to build highly complex modules such as arithmetic units, controllers or even embedded microprocessors. In addition to simple elements such as 1-bit flip-flops, FPGAs have a large set of embedded modules. Large memory elements, DSP arithmetic units, networking and memory controllers, or even embedded microprocessors, are available on modern FPGAs (Leong, 2008).
These devices are an excellent vehicle for building proof-of-concept prototypes. On the one hand, thanks to their flexibility, any computing paradigm can be implemented, limited mainly by the number of available cells. On the other hand, thanks to their re-programmability, it is possible to debug and test on real hardware. Moreover, nowadays some final products are implemented exclusively on FPGAs instead of migrating the design to custom integrated circuits. The achieved performance can be tens of times higher, with lower power consumption, than standard PC-based approaches (Fasang, 2009). FPGAs are widely employed as co-processors in personal computers, much like GPUs, or as accelerators in specific-purpose devices such as high-capacity network systems (Djordjevic et al., 2009) or high-performance computing (Craven & Athanas, 2007). Nowadays it is possible to embed full systems on a single FPGA.
One of the major disadvantages of FPGAs is the set-up time, still higher than in pure software approaches. Traditional FPGA programming is done with HDL languages, forcing designers to work at a very low level. High-level languages, such as C extensions, are friendlier for software engineers, although the control over the design is much lower (Coussy et al., 2010). FPGA-based designs can be exported and distributed as IP Cores because these languages are platform-independent, unless very specific features of a given FPGA are employed. In this way, non-recurring engineering (NRE) costs are cut down. FPGAs therefore sit at an intermediate stage between software and hardware. Algorithms are programmed in software and compiled to a hardware architecture, so careful hardware/software co-design is fundamental (Moreno et al., 2010). They are able to exploit both spatial and temporal parallelism very efficiently. Since logic cells are independent, many arithmetic units can operate concurrently, with custom routing between them. In addition, the memory subsystem can be tuned to exploit the on-chip memory banks, reducing accesses to the external memories.
Regarding Computer Vision, FPGAs are widely used both in industry and in research. They offer a high degree of flexibility and performance to handle many different applications. Most compute-intensive algorithms have been migrated to FPGAs: stereo vision (Jin et al., 2010), geometric algebra (Franchini et al., 2009), optical flow (Martineau et al., 2007), object recognition (Meng et al., 2011) or video surveillance (Nair et al., 2005) (Salem et al., 2009), to name some examples. Low- and mid-level image processing stages, based on the SIMD/MIMD paradigms, can be implemented efficiently. However, high-level processing, although it could be implemented on an FPGA, fits better on conventional processors. Nevertheless, FPGAs can use external processors connected through high-speed off-chip links, or include general-purpose processors. Both soft-core (emulated) (Lysecky & Vahid, 2005) and hard-core (Veale et al., 2006) embedded processors are available. While the former offer a large degree of configurability, adapting all parameters of the design to the particular needs of the application, the latter feature better performance. In this way FPGAs are able to implement all the stages of a complete application (Shi, 2010).

Application-specific integrated circuits
An Application-Specific Integrated Circuit (ASIC) is a device designed for a particular task instead of for general-purpose functionality. In this case, designers have to work at the very bottom level of design, so the process is long and error-prone. As some elements are present in almost all ICs, a set of libraries is usually provided to ease system design. In this way it is possible to work at different levels of abstraction. Full-custom designs require a greater effort because it is necessary to design both the functionality and the physical layout; however, they allow better optimization and performance. Standard cells allow designers to focus on the logic operation instead of on the physical design, splitting the process into two parts. The physical design is usually done by the manufacturer, who provides anything from simple logic gates to more complex units such as flip-flops or adders. Apart from simple units, third-party manufacturers provide more complex modules for specific functions, known as IP Cores. In this way, they can be used as subcomponents in a larger design. There is a large variety of cores to cover all needs, as happens with FPGAs, from IO controllers (RAM, PCI-Express, Ethernet) to arithmetic cores (signal processing, video and audio decoding) or even complete microprocessors.
IC-level design makes it possible to build embedded systems, Systems-on-Chip, efficiently. It is possible to include all the elements of the system on a single chip, even if there are digital, analog and mixed-signal modules. Nowadays, custom ASICs are the only alternative for complex SoCs, although FPGA capacity grows with each generation and they are becoming a viable alternative. IC design leads to high costs, both in design and in manufacturing, when the production volume is low. However, some applications still need custom ICs because the alternatives do not meet the performance, power consumption or size requirements (Kuon & Rose, 2007).
It is true that some of the devices previously mentioned are in some way ASICs. However, designing a custom architecture for a concrete algorithm provides the best possible results. Many custom chips have been designed and built, instead of mapping the design onto a programmable device, for specific algorithms (Kim et al., 2008), application domains (Stein et al., 2005) or complete general-purpose SoCs (Khailany et al., 2008a).
Because of this flexibility, a large set of exotic devices can be found on the market or in the specialized literature. While some aim to address very specific tasks or novel computing paradigms, others are taking their first steps on the market now that technology has made them viable. Pixel-parallel processor arrays are the natural platform for low-level vision. They are massively parallel SIMD processors laid out on a 2D grid with a processor-per-pixel correspondence and local connections among neighbors. Each processor, which is very simple, can also include an image sensor to eliminate the IO bottleneck. Some representative examples are (Foldesy et al., 2008; Garrido et al., 2008; Lopich & Dudek, 2008). There are also approaches closer to biological vision, such as (Koyanagi et al., 2001) or (Constandinou et al., 2004). More information is available in (Zarándy, 2011). Massively Parallel Processor Arrays (MPPAs) provide hundreds to thousands of processors. They are encapsulated and independent and have their own program and memories. They work in MIMD mode but also include internal enhancements such as pipelines, superscalar capabilities or SIMD units. Some examples are (Bell et al., 2008; Butts et al., 2007; Duller et al., 2003).

Discussion
As described previously in this section, there is a wide range of hardware devices suitable for Computer Vision. Depending on the application requirements, a compromise between performance, cost, power consumption and development time is needed. Commercial applications are heavily constrained by time-to-market, so suboptimal solutions are preferable if the development cycle is shorter. For this reason, software-based solutions are usually better from the commercial point of view.
As discussed before, the large number of highly optimized libraries makes PCs (conventional microprocessors) the first choice both as development and as production platform. Multi-core and SIMD programming are key for performance, although they can significantly increase the development time. One of the benefits of choosing a PC as platform is that a GPU is included "at no cost": most applications require some kind of graphical display, so including a GP-capable GPU provides a much greater benefit with a very low cost/performance ratio, even using a low-cost card. The combined use of CPU and GPU has proved to be very effective, although it is very restricted in terms of form factor and especially power consumption. Only if these parameters are very constrained are DSP-based solutions preferable. This is the case of mobile applications, although low-power microprocessors are gaining ground in this field. As these devices are extensively used, development kits, compilers and libraries are very optimized, helping to cut down time-to-market and the related costs.
However, if the aforementioned devices do not provide acceptable results, it will be necessary to move into hardware. As application developers, we need to look for exotic devices such as MPPAs or dedicated image processors. In contrast to the previous devices, these are not normally industry standards, so a greater effort during development is needed, although we are still under software coverage. On the contrary, if the requirements are very strict or if the production volume is very high, a custom chip is the only alternative to reach the desired performance or to lower the cost per unit. FPGAs are an excellent platform for testing before manufacturing the final design as a custom chip. As they are reconfigurable, different architectures can be evaluated before sending the design to the fab. On the other hand, some authors claim that software development is the bottleneck in current ASIC development: the software stage cannot start until a device on which to run tests has been built. Although emulators (both functional and cycle-accurate) are used, their performance is very low, resulting in very long test cycles and poor feedback for the programmers at the hardware control layer. This is why FPGAs are widely used as proof-of-concept devices, as they enable software development cycles many months before a test chip is built.
Tab. 1 shows a performance comparison of a computationally intensive algorithm, SURF (Bay et al., 2006), on different platforms. PCs offer uneven performance depending on whether multithreaded programming is heavily used (Zhang, 2010) or a straightforward implementation is chosen (Bouris et al., 2010). GPUs feature very high performance at the expense of high power consumption. However, the FPGA-based implementation delivers the best performance in terms of speed and power consumption. Tab. 2 shows a comparison between a low-power CPU and GPU and a standard laptop CPU. The authors conclude that optimization is critical and a careful analysis of low-level operations must be performed, but the achieved performance is quite close to that of standard conventional processors. Complex algorithms which include a higher abstraction level, such as the Viola-Jones detector (Viola & Jones, 2001), are also candidates for mobile platforms. In (Aby et al., 2011), a Beagleboard xM board with a Texas Instruments DM3730 SoC (ARM Cortex-A8 and TMS320C64x DSP) achieves around a 0.5x speed-up compared with a conventional Intel 2.2 GHz processor, both using the OpenCV (Bradski & Kaehler, 2008) library. In (Arth & Bischof, 2008), a DSP-based embedded system for object recognition achieves up to 4 fps, including SIFT-based feature detection and description and object recognition. Although its performance is lower than that of other approaches, the major advantage is that the whole application fits in a single device, also reducing power consumption. Some operations, such as the Fourier transform, are computationally very expensive and yet required in many applications, including low-power, low-cost and high-performance devices. Optimized libraries for both CPU (Takahashi, 2007) and GPU (Ogata et al., 2008) aim to exploit SIMD units and multitasking, achieving high performance with respect to straightforward implementations. In particular, the Fourier transform is very suitable for FPGAs and ASICs when performance and power consumption are critical. In (He & Guo, 2008), an FFT core design for FPGAs is proposed, consuming less than 1 W in the worst case and lowering the manufacturing cost by more than 15x compared with a DSP implementation. In (Guan et al., 2009), a more aggressive approach is taken, developing an application-specific instruction-set processor (ASIP). With very little hardware overhead and a consumption of a few tenths of a watt, it outperforms standard software and DSP implementations by more than 800x and 5x respectively. However, these designs have longer development cycles, as (Pauwels et al., 2011) shows. In this work, some low-level operations (phase-based optical flow, stereo and local image features) are compared on both FPGA and GPU. Tab. 3 summarizes some of the results of this work. As the authors conclude, high-performance or low-cost implementations should be done on CPUs with GPU co-processing. GPUs outperform FPGAs in terms of absolute performance due to their memory throughput. However, if a standalone platform is needed, an FPGA board should meet the requirements or establish the basis for testing and validating an ASIC design.
Finally, general-purpose custom designs reconcile form factor, power consumption and performance. For instance, the SCAMP processor (Dudek & Hicks, 2005) exploits the massive spatial parallelism of low-level operations, integrating processing units and sensors in a processor-per-pixel fashion. As it is an analog design, the integration density and the performance are very high, keeping power consumption under 240 mW. Current digital solutions also offer similar performance and many advantages, such as faster development and array scalability. The ASPA processor (Lopich & Dudek, 2007) includes novel techniques to increase performance, especially on global operations, without sacrificing the other trade-offs. Hardware-oriented algorithms are also key to taking advantage of custom hardware. Tab. 4 shows a comparison between different specific-purpose processors executing an active-contour algorithm for retinal vessel-tree extraction. The algorithm is designed for focal-plane processing arrays. The cycles-per-pixel parameter illustrates how specific platforms are able to lower hardware requirements and even increase performance. As with other custom designs described in this chapter, they are not suitable for handling a whole Computer Vision application, as they are intended to reduce the workload of the main processor in the most computationally expensive tasks. Approaches such as (Khailany et al., 2008b) or (Fijany & Hosseini, 2011) can completely embed highly complex applications without compromising their efficiency.

Table 1. SURF implementations on different devices: performance and power consumption (W).

Table 2. Low-power CPU and GPU compared with a standard laptop CPU.

Looking ahead
The progress of new technologies, marked by Moore's Law, allows ever-increasing integration density. More hardware resources, with higher clock frequencies, are available to the designer. However, although the ultimate goal is to increase performance, other parameters come into play. Nowadays, one of the critical trade-offs is power consumption, directly related to energy efficiency and power dissipation, which are among the most decisive design constraints (Kim et al., 2003).

Power consumption has two sources, dynamic and static. Static consumption is a result of leakage current and refers to the situation in which all inputs are held constant, so the circuit is not changing state. On the contrary, the dynamic term refers to the circuit switching at a given frequency. The dynamic component dominates in today's CMOS circuits and is directly proportional to frequency. This is one of the main reasons why the semiconductor industry has moved from a race for frequency to a race for parallelism. In recent years, the industry has made a big effort to increase the parallelism of most devices in order to keep up the rate of performance improvement. Apart from more arithmetic units, leading architectures integrate more systems previously contained in separate circuits, such as microcontrollers or GPUs. To achieve these results, it is still necessary to scale down the transistors. In this sense, the advent of emerging technologies like CMOS-3D (Philip Garrou, 2008) will allow heterogeneous functions to be integrated more easily in the same monolithic solution. A vision-oriented ASIC could integrate the image acquisition stage with the processor itself. At the same time, more conventional solutions such as PCs or FPGAs would achieve large parallelization using this and other advances such as Tri-Gate technology (Intel, 2011). However, this involves problems such as the increase of leakage currents, thereby increasing the static power consumption, which is not at all negligible nowadays (Koch, 2005) (Kim et al., 2003). In addition, new manufacturing methods are more expensive because the yield is lower and more time is needed to recoup the investments; or, equivalently, it is necessary to sell more devices to continue growing at the rate set by Moore's Law.
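For reference, the usual first-order model of the dynamic term (a textbook approximation, not taken from this chapter) is P_dyn ≈ α · C · V^2 · f, where α is the switching activity, C the switched capacitance, V the supply voltage and f the clock frequency; this is why lowering the frequency, and with it the voltage, is so effective, and why the industry favors parallelism over higher clock rates.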
Conventional microprocessors are at the leading edge of this evolution. There is a large market which justifies large investments in R&D to meet the growing needs of consumers, driven especially by the large increase in media consumption. As a result, it is now possible to find low-cost multicore microprocessors. The current evolution towards a greater number of cores is expected to continue, increasingly including more elements previously located on external chips and thus reducing the bottleneck when communicating with off-chip elements (Singhal, 2008). New parallel computing techniques need to be developed to take advantage of the available multithreading capabilities.
PCs also benefit from GPU capabilities. GPU performance grows at a higher rate than that of microprocessors. As they are very specialized devices, even though they support general-purpose computing, the technical improvements in the semiconductor industry are clearly more beneficial to them. As discussed previously, hardware resources are available to increase the parallelism and enhance the datapath pipeline. Leading GPUs have more than 1000 processing units and high-speed, high-bandwidth memories. It is also possible to combine several GPUs to work together, achieving a very large throughput. Still, their major disadvantage remains power consumption. New architectures are taking advantage of fixed-function hardware to improve area usage and power efficiency. GPU design will increasingly focus on improving GPGPU computing (Brookwood, 2010a).
DSPs are also moving to multicore architectures. As specialized microprocessors, they can take advantage of all the improvements in the consumer market, both in hardware and in software, such as compilers and other optimization techniques. Although the competitors are strong, DSPs will continue to be used because they lead to compact circuit boards, lower power consumption and lower cost, provided the appropriate device is selected based on the application requirements. In addition, they benefit from the extensive experience in DSP development, with shorter time-to-market thanks to highly optimized compilers and libraries. This is especially relevant in embedded applications, to take advantage of the multi-core capabilities of modern DSPs. In this way it is possible to integrate several DSP cores, each one optimized for a specific task, on a single chip (Friedmann, 2010). Low-power devices which still keep reasonable performance are fundamental in handheld and portable devices, where traditional microprocessors are not suitable.
Microprocessors and GPUs tend to converge on a single chip. Apart from the obvious benefits of integration, reducing cost, size and power consumption, performance will increase because of the reduction in off-chip communications. In addition, architectures such as AMD Fusion integrate in the same units 3D acceleration, parallel processing and other GPU functions (Brookwood, 2010b). On the other hand, mobile microprocessors are becoming more important. These microprocessors embed very low-power GPUs and auxiliary DSP units for co-processing on the same chip (NVIDIA, 2011).
Programmable systems, not only FPGAs, are able to achieve the same performance as ASICs of the recent past, while keeping time-to-market and non-recurring engineering costs lower than custom ICs. As discussed previously, FPGAs sit between software and hardware solutions. Modern FPGAs have experienced a large increase in hardware resources, both in dedicated units and in logic cells. High-level programming languages are another major reason why FPGAs are becoming increasingly competitive, especially when dealing with complex FPGAs and when maintaining designs and keeping them portable (Singh, 2011). Nowadays these devices can host complete SoCs, integrating memory and IO controllers natively. Manufacturer roadmaps show further such integration in the very near future, and a big leap in performance and flexibility is expected (DeHaven, 2010).

Summary and conclusions
The large variety of Computer Vision applications makes it difficult to classify them into tight categories. As a result, it is extremely difficult to design a single hardware architecture which efficiently handles all the processing stages of any Computer Vision algorithm. Several studies are available in the literature in which different platforms are tested under the same conditions (Asano et al. (2009); Baumgartner et al. (2009); Kisacanin (2005); Wnuk (2008)). They show that tuning is key for performance and that new parallel computing techniques are a requirement to exploit parallel devices. However, the growth of the market makes investment in new platforms that implement different algorithms a necessity.
The most accessible platform is a Personal Computer equipped with a GP-capable GPU.
As a test or final platform, it cuts down development time and costs. GPUs give enough performance for the most intensive tasks, while using the CPU multimedia extensions makes it possible to meet the requirements of the other stages. In addition, they include all the necessary elements for user IO, communication, storage and information display. The range of available models is large enough to select the adequate platform according to the application trade-offs. When CPU performance is not adequate, DSPs are a serious alternative. They become almost mandatory when dealing with embedded devices without compromising performance, where power consumption and form factor are very restrictive. FPGAs are widely used for prototyping custom ICs, but FPGA-based applications have their own niche: integration and high flexibility, besides a large number of available IP Cores, allow NRE costs to be dropped. Although all the custom devices described in this chapter are ASICs, they were not conceived for a unique application. To lower costs, the manufacturers expand their range of application, although it is possible to find families specialized in specific tasks. But there are also devices which are very specific to critical tasks, where the requirements are very tight and no other device can meet them. Flexibility is complete and there is no restriction on employing cutting-edge technologies which will not be available in commercial devices until the near future.
Almost all Computer Vision applications need to face all the processing stages to a lesser or greater degree. Generally, this leads to implementing efficient mechanisms to tackle massive spatial parallelism, mixed spatial and temporal parallelism and sequential processing. Each stage matches a level of processing, so all the mechanisms have to be implemented in most applications. Low-level stages benefit from massive parallelism, with simple operations and simple data distribution schemes. When the data abstraction level grows, during mid-level tasks, more information about the problem is required by the algorithms, increasing their complexity. This leads to complex architectures, where information distribution and sharing make it difficult to exploit spatial parallelism, although it is usually present. Task-parallel architectures are better able to exploit their possibilities. Low- and mid-level processing stages can be implemented in pure hardware solutions because they often implement kernel operations. However, the high level is closer to software, and designers can take advantage of this to build complex systems more easily by using general-purpose processors. In addition, the device which performs the image processing tasks needs to communicate with or control other devices. This is not strictly related to the Computer Vision domain, but it is clearly a requirement in the final solution. In this case, the use of a general-purpose processor is beneficial because it allows easier control and increases the flexibility of the whole system.
Although it is almost impossible to develop a system able to run all operations in an optimal way, due to the rich nature of Computer Vision applications, it is desirable to provide the capability to perform any operation. The design must be scalable, so that it can be adapted to the specific needs of each application. In this way, a product range from low- to high-end devices can be easily built. The internal architecture should also be modular, so that more features can be added to a basic outline without dramatic changes. In general, a high-end microprocessor is a requirement to manage complex operations and the communications between the system and external components. An on-chip SIMD-MIMD hybrid co-processor would tackle the most expensive computation, reconfiguring its internal interconnections according to the current task. Embedded high-speed memory controllers are also key to reducing the data-access bottleneck. All these elements together are able to face efficiently most of the situations described throughout this chapter.

Acknowledgments
This work is funded by Xunta de Galicia under the projects 10PXIB206168PR and 10PXIB206037PR and the program Maria Barbeito.


Table 3. FPGA and GPU implementations of optical flow, stereo and local image features (Pauwels et al., 2011). See (Pauwels et al., 2011) for complete details and performance results.

Table 4. On-chip retinal vessel-tree extraction (hardware-oriented algorithm) on different platforms. See (Nieto et al., 2011) for a complete performance analysis. Note: the algorithm was slightly adapted for SCAMP-3 and substantially lightened for the Ambric Am2045; the FPGA implementation followed the original algorithm faithfully; the Virtex-6 results are an estimation.