Real-time multi-window stereo matching algorithm with fuzzy logic

Stereo matching obtains a depth map, called a disparity map, that indicates the positions of the objects in a scene. To estimate a disparity map, the most popular approach consists of comparing two images (left-right) of the same scene taken from two different viewpoints. Unfortunately, small window sizes are suitable to preserve edges, while large window sizes are required in homogeneous areas. To solve this problem, in this article, a novel real-time stereo matching algorithm embedded in an FPGA is proposed. The approach consists of estimating disparity maps with different window sizes by using the sum of absolute differences (SAD) as a local correlation metric. Once the disparity maps are obtained, the left-right consistency for each window size is computed. At the end of this stage, the centre pixel deviation is estimated through a 5 × 5 window and the Sobel gradient is extracted from the left image. Finally, both parameters are processed by a Fuzzy Inference System (FIS), which combines the calculated disparities and generates a final disparity map. An architecture embedded in an FPGA is established and hardware acceleration strategies are discussed. Experimental results demonstrate that this algorithmic formulation provides promising results compared with the current state of the art.


| INTRODUCTION
Depth estimation is one of the most important tasks of current computer vision systems and is also the basis of several real-world applications, such as positioning systems for mobile robotics [1], medical applications [2] and so on. Even though there are several approaches and resources to estimate depth in a scene, stereo matching has become the most popular way to estimate depth from images obtained by a stereo configuration. In the stereo-matching approach, point correspondences between stereo pairs (two different viewpoints of the same scene under the epipolar constraint) allow images of depth, called disparity maps, to be obtained. In order to estimate a disparity map, it is necessary to measure the similarity of the points inside the stereo pair. The stereo matching process generally consists of the following four steps: (1) matching cost computation; (2) cost aggregation; (3) disparity computation/optimization; and (4) disparity refinement. In order to guarantee real-time processing, stereo-matching algorithms are often implemented using dedicated hardware such as FPGA [3] or CUDA technologies [4].

| Stereo-matching algorithms suitable for hardware implementation
The current state of the art has demonstrated that local algorithms reach promising results (in terms of processing speed and embedded capabilities) when implemented inside dedicated hardware [5,6]. As a result, local algorithms implemented inside FPGA or GPU architectures are currently the most popular solution for applications in which real-time processing and embedded capabilities are required [7][8][9]; FPGA is the technology with the higher performance in terms of embedded capabilities [10,11].
For real-time processing and embedded capabilities, several stereo-matching algorithms have been implemented using FPGA technology [3,12]. Depending on the configuration of the cameras, the range of disparity levels varies for algorithms implemented inside FPGA devices; this implies an increase in hardware resource consumption. This problem has motivated several authors to study the possibility of decreasing hardware resource usage while estimating a large range of disparity levels for each pixel in the scene. However, this is still an open problem; therefore, several works in the current literature search for new approaches to implement 'efficient' stereo-matching algorithms inside FPGA devices [13].

| Motivation and scope
Over the last decade, several works have demonstrated that local stereo-matching algorithms implemented inside FPGA devices are highly useful in several real-world applications. Unfortunately, high flexibility (regarding depth estimation) requires a high range of disparity levels for each pixel in the scene, and estimating these disparity levels implies high hardware resource consumption. Hence, several investigations have looked for new approaches to implement 'efficient' stereo-matching algorithms inside FPGAs.
In this work, a novel real-time stereo matching algorithm embedded in an FPGA device was proposed. The approach consisted of estimating disparity maps with different window sizes by using a simple local correlation metric. Then, a Fuzzy Inference System combined the previously computed disparities in order to generate a final disparity map. This disparity map used small correlation windows at object boundaries and large correlation windows in homogeneous areas. In order to guarantee embedded capabilities, an FPGA architecture was presented and hardware acceleration strategies were discussed.
Concerning the correlation metric, the use of the widely known Sum of Absolute Differences (SAD) was proposed. Advanced stereo matching techniques, such as those based on semi-global matching or on cross-based aggregation, could outperform SAD in terms of accuracy, but here a small system design (suitable for smart cameras) is analysed and, to the authors' knowledge, SAD provides a better trade-off between accuracy and hardware requirements. Furthermore, in order to guarantee low system requirements, the design of a parallel-pipeline architecture was presented. So, the main contributions were: 1. A novel stereo matching algorithm which used multiple correlation windows and guaranteed not only high performance in terms of accuracy but embedded capabilities as well.
2. An FPGA architecture suitable for smart cameras.

| RELATED WORK
The 'Middlebury Stereo Evaluation v.3' is a popular dataset for stereo matching evaluation. In order to have a proper comparison baseline, only the consulted related works are presented in this section. The system presented in [14] shows an architecture called the Multi-Scale and Multi-Dimension network (MSMD). First, a new multi-scale equalisation costing subnet, in which two different receptive field sizes were implemented in parallel, was developed. In addition, it was demonstrated that a multi-dimensional aggregation subnet containing 2D convolution and 3D convolution operations provided context-rich information and semantic information to estimate an accurate initial disparity.
In [15], a differentiable PatchMatch module that allows management of most disparities without requiring full cost volume evaluation is presented. The proposed representation (PatchMatch) was exploited to learn which range to prune for each pixel. By progressively reducing the search space and effectively propagating such information, it was able to efficiently compute the cost volume for high-likelihood hypotheses and achieve savings in both memory and computation. All models were trained on four Nvidia TitanXp GPUs.
In [16], dual support windows for each pixel were established, that is, a local window and the whole-image window were considered in the matching process. As such, the primitive connectivity between each pixel and its neighbouring pixels in the local window is maintained, so that each pixel not only gets appropriate support from neighbouring pixels within its local support window, but also receives more adaptive support from the other pixels outside the local window. Furthermore, a local edge-aware filter and a non-local edge-aware filter were merged to achieve collaborative filtering of the cost volume. All the experiments were conducted on a PC with a 3.2 GHz Intel Core i5-6500 CPU and 8 GB of memory.
The architecture developed in [17] obtains dense stereo reconstructions using high-resolution images for infrastructure inspections. The proposed approach used a less resource-demanding non-learning method, guided by a learning-based model. This allows handling high-resolution images and achieving accurate stereo reconstruction. First, a deep-learning model produced an initial disparity prediction with uncertainty for each pixel of the down-sampled stereo image pair. Then, a downstream process performed a modified version of a generic semi-global block matching method with the upsampled per-pixel searching range. This approach was trained with a mini-batch of four on four NVIDIA TITAN X GPUs for five epochs.
The authors of [18] presented a local stereo matching algorithm whose main novelty relies on the block-matching aggregation step. An adaptive support weights approach is presented, in which the weight distribution favours pixels that share the same displacement as the reference. The proposed weight function depended additionally on the tested shift, giving more importance to those pixels in the block matching with a smaller cost. The proposed approach was embedded in a pyramidal procedure to locally limit the search range, which helped to reduce ambiguities in the matching process and saved computational time. The method was implemented on a MacOS system using a 3.3 GHz Intel Xeon CPU.
In [19], a GPU architecture for real-time semantic stereo matching was proposed. The proposed framework relied on coarse-to-fine estimations in a multi-stage fashion, allowing: (1) very fast inference even on embedded devices, with marginal drops in accuracy; and (2) trading accuracy for speed, according to the specific application requirements. Experimental results on high-end GPUs as well as on an embedded Jetson TX2 demonstrated the superiority of semantic stereo matching compared with standalone tasks and highlighted the versatility of the framework on any hardware and for any application.
In [20], a deep self-guided cost aggregation method to obtain an accurate disparity map from a pair of stereo images was proposed. Based on observations, each cost volume slice could guide itself based on its internal features. However, finding a direct mapping function between the initial and the filtered cost volume slice without any guidance image was difficult. To solve this problem, an advanced deep learning technique to perform self-guided cost aggregation was proposed. The algorithm was implemented on an Intel i7 4770 @ 3.4 GHz with 16 GB RAM.
In [21], a novel approach to binocular stereo for fast matching of high-resolution images was presented. The presented formulation built a prior on the disparities by forming a triangulation on a set of support points which could be robustly matched, reducing the matching ambiguities of the remaining points. This allowed for efficient exploitation of the disparity search space, yielding accurate dense reconstruction without the need for global optimization. Further, the proposed approach automatically determined the disparity range and could be easily parallelised. All experiments were conducted on a single i7 CPU core running at 2.66 GHz.
A new taxonomy of Recursive Edge-Aware Filters (REAF) suitable for the stereo matching problem was provided in [22]. The one-tap recursive filters were classified according to recursion rate calculation, recursion type, and the unification of reverse directions. Experimental results demonstrated the advantages of un-normalized recursion for matching accuracy and of sequential integration of reverse directions for execution speed. These are important conclusions for future directions of REAFs. The experiments were conducted on an i5 3.10 GHz 32-bit CPU with 4 GB RAM plus an Nvidia Quadro 4000 graphics card.
The authors in [23] presented an improved stereo matching algorithm which utilised per-pixel difference adjustment for Absolute Difference and gradient matching to reduce radiometric distortions. Then, both differences were combined with the census transform to reduce the effect of illumination variations. A new approach was introduced in which an iterative guided filter was applied at the cost aggregation stage to preserve and improve object boundaries. Undirected graph segmentation was used at the last stage to smoothen low-textured areas. All of the experiments were run on a central processing unit (CPU) with a Core i5 3.2 GHz and 12 GB of RAM.
In [24], a multi-stage stereo matching technique was developed to improve the matching cost computation stage, where window-size selection for the Sum of Absolute Differences (SAD) and threshold adjustment were applied. They also used the median filter as the primary filter in the disparity map refinement stage to overcome the precision limitation of the disparity map.
Finally, in [25], a deep learning architecture for stereo disparity estimation was proposed. The proposed Atrous Multiscale Network (AMNet) adopts an efficient feature extractor with depthwise-separable convolutions and an extended cost volume that deploys stereo matching costs on the deep features. A stacked atrous multiscale network was proposed to aggregate rich multiscale contextual information from the cost volume, which allowed estimating the disparity with high accuracy at multiple scales. All models were implemented with PyTorch and trained on NVIDIA GPUs.

| THE PROPOSED ALGORITHM
Several previous works that looked for real-time processing and embedded capabilities have used SAD (Sum of Absolute Differences) as a local correlation metric [26][27][28]. This is due to the fact that SAD has a regular structure with a fixed running time, which facilitates hardware implementations suitable for real-time processing. To obtain the SAD formulation, the correlation windows were compared as shown in Equation (1):

SAD(x, y, s) = Σ_{i=−w}^{w} Σ_{j=−w}^{w} | I_l(x + i, y + j) − I_r(x + i + s, y + j) |, 0 ≤ s ≤ s_m,   (1)

where I_l(x + i, y + j) (left image) and I_r(x + i + s, y + j) (right image) were the greyscale values of the pixels within the window in both images, (2w + 1)² represented the correlation window size, s was the shift of the window in the right image and s_m was the maximal shift of the correlation window in the right image. Finally, a correlation coefficient was determined for each pixel and the shift which minimised the correlation coefficient was retained as the disparity.
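The SAD search of Equation (1) can be sketched in software as follows. This is an illustrative brute-force Python model, not the FPGA implementation; the function name, border handling and defaults are our assumptions:

```python
import numpy as np

def sad_disparity(left, right, w=1, s_max=16):
    """Brute-force SAD disparity map for rectified greyscale images.

    w is the half-window, so the correlation window is (2*w + 1)**2 pixels;
    s_max is the maximal shift of the window in the right image, following
    the shift convention of Equation (1): I_r(x + i + s, y + j).
    """
    h, wid = left.shape
    l = left.astype(np.int32)
    r = right.astype(np.int32)
    disp = np.zeros((h, wid), dtype=np.int32)
    for y in range(w, h - w):
        for x in range(w, wid - w):
            best_s, best_cost = 0, None
            # only shifts that keep the right window inside the image
            for s in range(min(s_max + 1, wid - w - x)):
                patch_l = l[y - w:y + w + 1, x - w:x + w + 1]
                patch_r = r[y - w:y + w + 1, x + s - w:x + s + w + 1]
                cost = np.abs(patch_l - patch_r).sum()
                if best_cost is None or cost < best_cost:
                    best_cost, best_s = cost, s
            disp[y, x] = best_s  # shift minimising the correlation coefficient
    return disp
```

The quadruple loop makes the regular, fixed-running-time structure of SAD explicit, which is precisely what makes it attractive for a parallel-pipeline hardware mapping.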
To improve the scope and performance of the traditional SAD-based stereo matching formulation, in this work a hypothesis was proposed stating that a disparity map that uses small correlation windows at object boundaries and large correlation windows in homogeneous areas could deliver promising results in terms of accuracy. Furthermore, since the keystone of the proposed algorithm was a local correlation metric (SAD), an FPGA implementation was feasible and suitable for real-time processing and embedded capabilities. In Figure 1, the block diagram of the proposed algorithm is shown; first, dense disparity maps were computed. To obtain these, the Sum of Absolute Differences (SAD) was used as a local correlation metric for different window sizes (w1, w2, w3, …, wn). Then, the left-right consistency at each window size was estimated. Finally, the Sobel gradient and the CPD (Centre Pixel Deviation) were estimated; these parameters, together with the disparity maps, were used as inputs to a fuzzy inference system which combined the disparities in order to generate a final disparity map. This disparity map used small correlation windows at object boundaries and large correlation windows in homogeneous areas.
In order to validate the hypothesis, the performance of different combinations of correlation window sizes is shown in Figure 2 (error percentage of the window combinations, w); moreover, Table 1 shows the six different window sizes considered and the combinations between them. The percentage of success was obtained by comparing the hypothetical estimated disparity map (which combined the different disparity maps generated for each correlation window size so as to minimise the estimation error) with the ground truth. The column labelled Window (see Table 1) represents the combinations of the different window sizes. The row w gives an indicative value for the window size: the number one corresponds to a 3 × 3 window, two to 5 × 5, three to 9 × 9, four to 13 × 13, five to 17 × 17 and, finally, six to 21 × 21. The dataset used is the well-known Tsukuba scene, which consists of a stereo pair with 77 maximum disparity. The error percentage is defined as the percentage of 'bad' pixels. A 'bad' pixel is defined as any pixel (pix_i,j) whose absolute difference between the ground truth and the estimated disparity (groundTruth_i,j − estimatedDisparity_i,j) is greater than a quality threshold (in this hypothesis validation, threshold = 1). To select a 'proper' window combination, two important criteria were considered: the first was the lower percentage of error, and the second was the fewer windows used in the process.
For the combination '1 − 3 − 5', an error equal to 7.62% was reached; other combinations were capable of improving this performance, but they used more windows, which led to increased usage of computational resources. Although the error obtained when all the windows were used (6.66%) was lower, it only insignificantly outperformed other combinations that were more efficient in terms of computational requirements. Therefore, the combination '1 − 3 − 5' reached the best performance (in terms of accuracy and computational requirements), as illustrated in Figure 3, and this is the configuration recommended when implementing the proposed algorithm.
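The 'bad'-pixel error metric used for this validation can be sketched directly from its definition above. This is an illustrative Python version; the function name is ours:

```python
import numpy as np

def bad_pixel_percentage(ground_truth, estimated, threshold=1.0):
    """Percentage of 'bad' pixels: pixels where
    |groundTruth[i,j] - estimatedDisparity[i,j]| > threshold."""
    gt = np.asarray(ground_truth, dtype=np.float64)
    est = np.asarray(estimated, dtype=np.float64)
    bad = np.abs(gt - est) > threshold
    return 100.0 * bad.mean()
```

With threshold = 1, a pixel whose estimated disparity differs from the ground truth by two levels counts as bad, while a one-level difference does not.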

| Correlation computation and inference parameters
Let SAD(x, y, d) be the mathematical formulation of the Sum of Absolute Differences, Equation (2):

SAD(x, y, d) = Σ_{i=−w}^{w} Σ_{j=−w}^{w} | I_t(x + i, y + j) − I_r(x + i + d, y + j) |,   (2)

where I was the greyscale value and the subscripts r and t denote the reference image and the target image, respectively, located at the (x, y) position; (2w + 1) was the side of the correlation window, d was the correlation index and d_max the maximum shift of the correlation window in the search image. The disparity d(x, y) between one pixel of the target image and the same pixel in the reference image was defined as the displacement d that minimises the correlation index, as illustrated in Equation (3):

d(x, y) = argmin_{0 ≤ d ≤ d_max} SAD(x, y, d).   (3)

The original SAD formulation used a fixed square window (correlation window) around the pixel of interest in the target image; this window was correlated with a second window that moved through all positions d in the second image, and the position where the lowest correlation value was obtained determined the disparity value corresponding to the pixel of interest. Three different window sizes (3 × 3, 9 × 9 and 17 × 17) were used in our algorithmic formulation. Two disparity maps were calculated for each window size, one based on the correspondences of the left image with the right and another based on the correspondences of the right image with the left; inconsistent disparities were then produced by occlusions in the scene, that is, regions visible in one image that were not seen in the other [29]. Once the disparities for each window size were estimated, the smallest size was selected as the base. A FIS (fuzzy inference system) with two different input parameters was used to infer which correlation window size was more suitable for each pixel in the scene. Let I_l be the left image of the input stereo pair; the first input parameter for the inference process was the image gradient.
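A left-right consistency check over the two per-window disparity maps can be sketched as follows. This is an illustrative Python model under our assumptions: it follows the shift convention of the SAD formulation above (a left pixel at column x with disparity d corresponds to the right pixel at column x + d), and it marks inconsistent pixels with a sentinel value; the hardware architecture described later instead keeps the smaller of the two disparities:

```python
import numpy as np

INVALID = -1  # marker for occluded / inconsistent pixels (a choice made here)

def left_right_consistency(disp_left, disp_right, tol=1):
    """Cross-check two disparity maps and invalidate disagreements.

    A left pixel (y, x) with disparity d is checked against the right-image
    disparity at (y, x + d); differences beyond `tol` are marked INVALID.
    """
    h, w = disp_left.shape
    out = disp_left.copy()
    for y in range(h):
        for x in range(w):
            xr = x + int(disp_left[y, x])
            if xr < 0 or xr >= w or abs(int(disp_right[y, xr]) - int(disp_left[y, x])) > tol:
                out[y, x] = INVALID
    return out
```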
The gradient of the base window S(x, y) was calculated using the Sobel operator, where the mask G_x, given by Equation (4), detected the horizontal borders and the mask G_y, given by Equation (5), the vertical borders; the combination of both gives the gradient per pixel, Equation (6):

G_x = [ −1 0 1; −2 0 2; −1 0 1 ],   (4)

G_y = [ −1 −2 −1; 0 0 0; 1 2 1 ],   (5)

Sobel = √(G_x² + G_y²).   (6)

The result obtained with the horizontal and vertical gradient masks on the Tsukuba scene is shown in Figure 4. The other input parameter for the inference process is the Centre Pixel Deviation (CPD), where x_cp is the reference point and the neighbours are evaluated to measure the difference of their values by means of Equation (7); that is, the CPD is a numerical index of the dispersion of the data in a 5 × 5 block:

CPD = √( (1/25) Σ_{i=1}^{25} (x_i − x_cp)² ).   (7)
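The Sobel gradient input to the inference process can be sketched as follows. This is an illustrative Python version using the standard Sobel masks; the function name and the zero-valued borders are our assumptions:

```python
import numpy as np

GX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])  # horizontal-border mask
GY = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]])  # vertical-border mask

def sobel_magnitude(img):
    """Per-pixel gradient magnitude sqrt(Gx^2 + Gy^2); borders are left at 0."""
    img = img.astype(np.float64)
    h, w = img.shape
    grad = np.zeros((h, w))
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            patch = img[y - 1:y + 2, x - 1:x + 2]
            gx = (GX * patch).sum()
            gy = (GY * patch).sum()
            grad[y, x] = np.hypot(gx, gy)
    return grad
```

A strong response marks an edge pixel (where a small correlation window is preferable), while a near-zero response marks a homogeneous region.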

Figure 3 shows disparity maps of the original 'Tsukuba' scene obtained by applying the original SAD formulation: small correlation windows needed to be used in order to preserve edges, see (a); for homogeneous areas, only large correlation windows delivered accurate results, as illustrated in (b) and (c); when small correlation windows at object boundaries and large correlation windows in homogeneous areas were used, (d), a significant improvement was achieved.

The greater the deviation of the centre pixel, the greater the dispersion of the neighbours; the calculated deviation is an average of the individual deviations of each observation with respect to the reference pixel, which is the one in the centre. Thus, it allows measuring the degree of dispersion or variability: first, by measuring the difference between each value in the data set and the one in the centre; then, by adding all these individual differences to give the total of all the differences; finally, by dividing the result by the total number of observations to reach an average of the distances between each individual observation and the one in the centre. This average of the distances is the Centre Pixel Deviation and thus represents dispersion (Figure 5).
The main function of the CPD was to evaluate the relation of the pixel in process to its neighbours. A greater dispersion indicated one of two situations: either the evaluated area contained two or more objects or was a textured area, or the scene had errors in its appearance. The advantage of using the CPD was that it helped to identify homogeneous areas so that larger windows could be employed there.

| The inference process
The centre pixel deviations of the windows (d_w2, d_w3) and the edges obtained by Sobel from the left image (I_L(Sobel)) at each point (x, y) for each level of disparity (d) were subjected to a fuzzy analysis that determined which window size, and hence which previously computed disparity, was assigned in the final disparity map. For the FIS formulation, the input membership functions CPD_w2, CPD_w3 and Sobel determined the degree of membership of the fuzzy sets Min, Med and Max for CPD_wn, and NoEdge, PosEdge and Edge for Sobel. In the CPD membership functions, triangular functions were proposed for the sets Min and Med and a trapezoidal (keystone) function for Max, where Min described a neighbourhood similar to the central pixel, Med described values with similar neighbours and some with a high difference, and Max expressed a neighbourhood with totally radical differences in its values. The membership functions of fuzzy logic for the CPD parameter are shown in Equations (8)-(10). The input functions for the NoEdge and PosEdge sets were defined using triangular functions and Edge used a trapezoidal function; the input membership function settings were defined as shown in Equations (11)-(13).
For example, the Edge set was defined as μ_Edge(x; 50, 100, 1100, 1200). The fuzzy inference rules given by these restrictions are those presented in Table 2. The defuzzification stage returned a crisp number, which was the x-coordinate of the centre of gravity of the fuzzy output set, through Equation (14):

y* = ∫_S y μ_Y(y) dy / ∫_S μ_Y(y) dy,   (14)

where μ_Y was the membership function of the output set Y, whose output variable was y, and S was the domain or range of integration.
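The pieces of this FIS can be sketched as follows: triangular and trapezoidal membership functions, and a discrete centre-of-gravity defuzzification. This is an illustrative Python version; the trapezoid parameters shown for Edge come from the text above, while the function names and the discretisation of the centroid integral are our assumptions:

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def trap(x, a, b, c, d):
    """Trapezoidal membership function, e.g. Edge = trap(x, 50, 100, 1100, 1200)."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)

def centroid(ys, mu):
    """Crisp output: x-coordinate of the centre of gravity of the output set,
    a discrete version of the centroid integral."""
    ys = np.asarray(ys, dtype=np.float64)
    mu = np.asarray(mu, dtype=np.float64)
    return float((ys * mu).sum() / mu.sum())
```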

| FPGA ARCHITECTURE FOR REAL-TIME PROCESSING
In Figure 6, an overview of the FPGA architecture for the proposed algorithm is shown. The architecture was centred on an FPGA implementation where all recursive/parallelisable operations were accelerated in the FPGA fabric. First, the 'Buffer_ImL'/'Buffer_ImR' units were used to hold local sections of the left-right (stereo pair) images being processed and allowed local parallel access, which facilitated parallel processing. Then, the 'BufferSAD_wn' units, where n was the correlation window size, were used to hold local patches for the correlation computation. Finally, the 'Module_Sobel' and the 'Module_CPD' computed in parallel the input parameters for the fuzzy inference system, which estimated the corresponding disparity value. In the following paragraphs, details about the algorithm parallelisation are given.

The Buffer_ImL consisted of one input of 8 bits and 41 outputs of 8 bits, and the same holds for the Buffer_ImR. These modules aimed to receive the complete lines of the left-right (stereo pair) images and then provide the values of the image in the form of windows. Both modules (Buffer_ImL and Buffer_ImR) internally contained a buffer17x17, which was composed of 18 RAMs, as presented in Figure 7, where 17 were in reading mode and one, the last position, was in writing mode. The one in writing mode stored the new values from the input. The reading and writing modes alternated during the process: the RAM1 module begins in writing mode, then the role passes to the RAM2 module and so on until it reaches the RAM18 module. This change between reading and writing operation was defined by the driver17x17 module, which, with the help of a counter, defined the duration and the RAM module that operated in reading or writing mode.
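The rotation of the 18 row memories can be modelled behaviourally as follows. This is an illustrative Python sketch of the buffering scheme, not the VHDL implementation; the class and method names are our assumptions:

```python
from collections import deque

class LineBuffer17:
    """Software model of the buffer17x17: 18 row memories, 17 readable
    completed rows plus one row being written; roles rotate after each
    completed row (oldest row is overwritten once all 18 are full)."""

    def __init__(self, width):
        self.width = width
        self.rows = deque(maxlen=18)  # completed rows, RAM1..RAM18 roles
        self.current = []             # the RAM currently in writing mode

    def push(self, pixel):
        self.current.append(pixel)
        if len(self.current) == self.width:  # writing RAM is full:
            self.rows.append(self.current)   # rotate read/write roles
            self.current = []

    def window_rows(self):
        """The 17 most recently completed rows (read-mode RAMs), oldest first."""
        return list(self.rows)[-17:]
```

Once 17 rows are readable, a 17 × 17 window can be extracted column by column as the pixel stream advances, which is what allows the downstream SAD units to work in parallel.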
When the process from the Ram1 module to the Ram17 module was completed in reading mode, the corresponding 17 lines (rows) of the original image were available; thus, the matrix17x17 module delivered the corresponding 41 window data items following the pixel reallocation criteria.
The BufferSAD_wn, where n was the correlation window size, was responsible for calculating the disparity for the different windows. For example, in Figure 8 the FPGA architecture for the BufferSAD_w3 is shown. It consisted of three different modules implemented in parallel-pipeline form. The Buffer3x3 module was responsible for taking the 50 8-bit inputs of the previous modules and then applying the SAD correlation method; since the search range was a disparity of 16, there were 16 outputs for the correlation search from right to left and 16 more for the search from left to right, for a total of 32 outputs of 13 bits; this process was carried out in order to realize the left-right consistency check. The muxL_3x3 modules worked with the correlation values of the left-right search and the right-left search, respectively. Both modules worked on the principle of a 16-input multiplexer with a 4-bit output, where only the input with the lowest correlation magnitude was selected and assigned the corresponding disparity value. Finally, the Minor_disp module took the outputs of the two previous multiplexer modules, stored them in a buffer (memory) of the disparity size, compared the disparity of the left image with the disparity of the right image and chose the smaller one, obtaining the final disparity value.
The Sobel module: this module took the same eight inputs from the left image as the module SAD_size3x3, with the exception of the central pixel. The Sobel mask was applied to these values and the result was divided by the size of the window (3x3); to do this, the caseEdges module was used, in which the division was implemented beforehand as a case statement to save logical elements in the operations. For the hardware architecture, see Figure 9. In general, the Sobel_Module stored lines of the Sobel image that was created and then generated the data of a 3x7 window of the new image, while the module mux_Sobel consisted of a multiplexer that selected the largest value.
The Module_CPD: this module calculated the deviation of the centre pixel of the disparity map of the proposed windows; it had three inputs of 4 bits and three outputs of 8 bits, and consisted of two main internal units: 1. Buffer_Rules5: like the previous buffers, it stored the part of the image that was created with the disparity values and then delivered the image values in the form of a 5x5 window, within a module with one input of 8 bits and 25 outputs of 8 bits.
2. DesvEstPixCent: in this module, the deviation rule for the 5x5 window was applied, with an 8-bit output. One of the implementation problems was calculating the square root and the powers of the formula.
The module Rules_FIS: this module consisted of six inputs, three of 4 bits and three of 8 bits, and a 4-bit output. The module acquired the gradient value resulting from Sobel, the three disparity values of the windows and the two Centre Pixel Deviation (CPD) values, as presented in Figure 10. It consisted of four case modules, where the caseEdges, case_CPDw2 and case_CPDw3 modules provided 3-bit outputs whose values depended on the respective inputs, defining whether each was high, medium or low. The last module, caseRules, took the three values calculated by the previous modules and converted them into a 9-bit vector; once formed, this vector was evaluated by a case statement and, according to the fuzzy rules defined, one of the inputs was selected as the output: DispW1, DispW2 or DispW3, corresponding to the disparity value of the proposed windows.

| DISCUSSION AND ANALYSIS OF RESULTS
The developed FPGA architecture was implemented using a top-down approach. All modules were encoded in VHDL and verified via post-synthesis simulations performed in ModelSim SE-64 10.5. Quartus Prime Lite Edition 18.0 was used for the synthesis process and the FPGA implementation. A Cyclone V 5CSEBA6U23I7 FPGA integrated into an Altera DE10-Nano development board was used. The consumption of hardware resources for the developed FPGA architecture is shown in Table 3.
Table 4 shows quantitative results for the proposed algorithm compared with the current state of the art. To obtain these, we submitted our results to the 'Middlebury Stereo Evaluation - Version 3'. The datasets are split into test and training sets with 15 image pairs each. Both sets have public tables listing the results of all submitted methods. Ground-truth disparities are only provided for the training set. To have a proper comparison baseline, the error margins, 'Average (Bad 2.0)', were obtained from the Middlebury webpage considering the test images (without public ground truth) of the dataset. For the error metric, global accuracy can be defined as the percentage of 'bad' pixels; a 'bad' pixel is defined as any pixel (pix_i,j) whose absolute difference between the ground truth and the estimated disparity (groundTruth_i,j − estimatedDisparity_i,j) is greater than the quality threshold 'bad 2.0'.
The analysis of Table 4, comparing the results with previous works, demonstrated relatively high performance in terms of accuracy and high processing speed; the percentage of error obtained was 30.9% and the processing time was 0.03 s (for more details please see http://vision.middlebury.edu/stereo/eval3/ and look for the 'MANE' algorithm). In the leader table, the proposed approach was ranked in 84th place; however, most algorithms in the leader table were based on advanced stereo matching techniques, which limits their processing speed. As a result, the most accurate algorithms, [30] (error 5.1%) and [31] (5.43%), had a poor performance in terms of processing speed, ranked 103rd and 107th, respectively. Besides, the most accurate algorithms in the leader table required high-end processors such as the i7-7700K or i7-4770K, and this limited their embedded capabilities.
Among previous algorithms with similar accuracy, the most accurate was ELAS_ROB (2010) [21], with an error of 27.3%. Other works, such as DAWA-F (2019) [18] and SGBMP (2019) [17], delivered similar accuracy, with errors of 27.4% and 27.8%, respectively. Although the proposed approach obtained an error of 30.9%, it outperformed these previous works in terms of processing speed (ranking sixth on the Middlebury leader table for processing speed), reaching 0.03 s of processing time compared with 0.48 s for ELAS_ROB (2010) [21], 1683 s for DAWA-F (2019) [18] and 9.12 s for SGBMP (2019) [17]; see Table 5. Other previous works, such as REAF (2015) [22,23], with an error of 34.0% and a processing time of 132 s, or SM-AWP (2019) [24], with an error of 38.1% and a processing time of 1.21 s, were outperformed by the proposed algorithm in both accuracy and processing speed.
In a second test, the results were submitted to the 'ETH3D Multi-view stereo/3D reconstruction dataset' proposed by Schöps et al. [32]. This dataset covers a variety of indoor and outdoor scenes; ground-truth geometry was obtained using a high-precision laser scanner, and images were captured with a DSLR camera as well as a synchronized multi-camera rig with varying fields of view. For the two-view stereo challenge, 27 training and 20 test frames (data/results) for low-resolution two-view stereo on frames of the multi-camera rig were provided. To have a proper comparison baseline, the error margins, 'Average (Bad 0.5)', were obtained from the ETH3D webpage considering the test images (without public ground truth) of the dataset. For the error metric, global accuracy is defined as the percentage of 'bad' pixels; a 'bad' pixel is any pixel pix(i,j) whose absolute disparity error, |groundTruth(i,j) − estimatedDisparity(i,j)|, is greater than the quality threshold 'bad 0.5'. Quantitative results for the proposed algorithm compared with the current state of the art are presented in Table 6. These results demonstrated relatively high performance in terms of accuracy together with high processing speed: the error obtained was 23.11% (for more details, please see https://www.eth3d.net/low_res_two_view?metric=bad-0-5 and look for the 'MANE' algorithm). Among previous algorithms with similar accuracy, the most accurate was the GANetREF_RVC [33] algorithm, with an error of 25.41%. Other works, such as PASM [34], LSM [35] and DispFullNet [36], delivered similar accuracy, with errors of 28.19%, 29.98% and 30.27%, respectively.
Other previous works, such as ELAS [21] with an error of 33.68%, ELAS_RVC [21] with 33.79%, NVStereoNet_ROB [37] with 41.93%, the combination of SGM + DAISY [38,39] with 54.67%, SPS-STEREO [40] with 55.62% or MFMNet_re [41] with 72.05%, were outperformed in terms of accuracy.
With regard to previous hardware implementations, the proposed approach reached 427 fps@640 × 480, matching the frame rate reported by Perri et al. [42] (2013) (427 fps@640 × 480), who, however, required an approximately 3× clock speed, and outperforming Humenberger et al. [43] (68 fps@640 × 480). In both cases [42,43], a similar disparity range (15) was considered; please see Table 7. Other previous studies, such as Zha et al. [44], with a processing speed of 30 fps@1920 × 1680, or Ttofis et al. [45], with a processing speed of 60 fps@1280 × 720, processed higher image resolutions than the proposed approach; however, these works used higher clock speeds, which could limit their embedded capabilities. Concerning computer hardware requirements, low hardware resources were required to perform the different experiments presented in this research (a Cyclone V 5CSE-BA6U23I7 FPGA integrated into an Altera DE10-Nano development board). Comparing this system's hardware and software with previous works (Table 4), which used high-end processors such as an Intel Xeon 6-Core@3.33 GHz for DAWA-F (2019) [18], an Nvidia Titan X for SGBMP (2019) [17] or an Nvidia Titan XP for AMNet (2019) [25], it was concluded that this approach allows an efficient implementation on embedded processors (making it possible to develop a smart camera/sensor suitable for embedded applications). Finally, qualitative results for Middlebury and ETH3D were presented in Figures 11-16. In all cases, accurate results with errors similar to those of several works in the current literature were obtained.

| CONCLUSIONS
In this research article, a novel real-time stereo matching algorithm embedded in an FPGA was proposed. To address the SAD correlation-window size problem, disparity maps with different window sizes were estimated. Then, the left-right consistency for each window size was computed. Finally, the centre pixel deviation and the Sobel gradient were estimated and used as inputs to a FIS (fuzzy inference system), which combined the previously computed disparities to generate a final disparity map that uses small correlation windows near object boundaries and large correlation windows in homogeneous areas. To guarantee embedded capabilities, an FPGA architecture was proposed, and experimental results demonstrated that our algorithmic formulation provides promising results (in terms of accuracy, processing speed and hardware requirements) compared with the current state of the art.
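The core of the pipeline recapped above (winner-take-all SAD matching for one window size, followed by a left-right consistency check) can be sketched in software as follows. This is an illustrative NumPy sketch under simplified border handling, not the VHDL implementation; all function names are our assumptions, and the multi-window fuzzy combination stage is omitted:

```python
import numpy as np

def box_sum(a, window):
    """Sum of `a` over a window x window box centred on each pixel
    (edge-replicated borders), computed via an integral image."""
    r = window // 2
    p = np.pad(a, r, mode="edge")
    c = np.cumsum(np.cumsum(p, axis=0), axis=1)
    c = np.pad(c, ((1, 0), (1, 0)))  # zero row/column for the subtraction below
    h, w = a.shape
    return (c[window:window + h, window:window + w] - c[:h, window:window + w]
            - c[window:window + h, :w] + c[:h, :w])

def sad_disparity(ref, tgt, window=5, max_disp=15, direction=1):
    """Winner-take-all SAD disparity with `ref` as the reference image.
    direction=+1 when ref is the left image, -1 when it is the right."""
    h, w = ref.shape
    best = np.full((h, w), np.inf)
    disp = np.zeros((h, w), dtype=np.int32)
    for d in range(max_disp + 1):
        shifted = np.roll(tgt, direction * d, axis=1)  # crude wrap-around borders
        cost = box_sum(np.abs(ref.astype(np.float64) - shifted), window)
        win = cost < best
        disp[win], best[win] = d, cost[win]
    return disp

def lr_consistency(disp_l, disp_r, tol=1):
    """True where left- and right-referenced disparities agree within tol."""
    h, w = disp_l.shape
    cols = np.clip(np.arange(w)[None, :] - disp_l, 0, w - 1)
    return np.abs(disp_l - disp_r[np.arange(h)[:, None], cols]) <= tol
```

For example, running `sad_disparity` on a synthetic pair where the right image is the left shifted by a constant disparity recovers that disparity, and `lr_consistency` then accepts every pixel; in the full algorithm, this is repeated per window size before the fuzzy combination.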