Neuro-inspired edge feature fusion using Choquet integrals

It is known that the human visual system performs hierarchical information processing in which early vision cues (or primitives) are fused in the visual cortex to compose complex shapes and descriptors. While different aspects of the process have been extensively studied, such as lens adaptation or feature detection, some others, such as feature fusion, have been mostly left aside. In this work we elaborate on the fusion of early vision primitives using generalizations of the Choquet integral, novel aggregation operators that have been extensively studied in recent years. We propose to use generalizations of the Choquet integral to sensibly fuse elementary edge cues, in an attempt to model the behaviour of neurons in the early visual cortex. Our proposal leads to a full-fledged edge detection algorithm, whose performance is put to the test on state-of-the-art boundary detection datasets.


Introduction
The Human Visual System (HVS) has been, historically, a source of inspiration for researchers in computer vision. The advent of Convolutional Neural Networks [1], featuring far more complex and powerful schemas than initial attempts at image processing using neural networks [2], might seem to indicate that the ultimate simulation step has been reached: the (yet modest) simulation of neural tissue. However, it is well known that the human visual system does not consist of randomly connected layers of neural networks coping with the information gathered by cones and rods. Instead, it features a more evolved system in which information is successively analysed to detect basic shapes, which are further combined to detect more complex structures. Ever since the experiments by Barlow [3,4] and Hubel and Wiesel [5,6], it has been known that the ganglion cells in the retina use organised cell structures to seek pictorial primitives. These studies have been repeated in different animals, with similar conclusions [7,8]. Specifically, humans are known to have center-surround receptive fields at the retina, which are further combined at the Early Visual Cortex (V1) to compose lines, boundaries, etc. This discovery had a massive impact on computer vision in the early 80's, and in fact led to the introduction of Marr-Hildreth's Laplacian-based edge detection method [9,10]. This edge detection method aimed at faithfully simulating the role of retinal ganglions using Laplacian (or, equivalently, Difference of Gaussian) kernels [11,12], in an attempt to simulate the early stages of recognition in the HVS. Different theories have been developed in the context of image processing, trying to mimic the features of the HVS. A relevant case is the use of techniques from Fuzzy Set Theory to cope with ambiguity, noise and blurring [13,14]. These ideas are often combined with bio-inspired optimization techniques [15].
In this way, the inherent uncertainty of the edge detection process is channelled, improving edge detection performance. Other approaches in the literature avoid the (human-directed) modelling of the happenings in the HVS, relying instead on machine learning techniques (such as random decision forests [16]) to somehow simulate the neural interaction in brain tissue. While these detectors often give very good performance and real-time detection, they do not attempt to mimic the HVS and require a training phase for model learning.
In general, (convolutional) neural networks rely deeply on the optimization of their weights, as well as on the presence of multiple maxima in their parameter space, to mimic neural plasticity and reach good computational solutions. Some of these networks have shown excellent performance, approaching or even surpassing human performance in statistical terms [17,18]. At the same time, some have extreme difficulties performing when input data is significantly different from training data [19]. However, the artificial neural network might be neither the only nor the best way to simulate the HVS, despite the fact that most of this system is actually supported by neural tissue. Specifically, we believe that a system based on the extraction of primitive features, followed by a sensible fusion process, might have a better rapport with how the HVS actually works.
We believe that a sensible structure for object detection (in this case, for edge detection) is a multi-step algorithm (Fig. 1), featuring the phases visual information undergoes in the HVS. First, visual data is set to the adequate level of focus, a task carried out by the lens in the HVS. Then, some basic features are extracted from the resulting image, representing the role of retinal ganglions. The third step consists of fusing the feature information to produce goal-oriented representations, as happens in successive layers of the visual cortex. Finally, the different stimuli need to be discriminated between those actually corresponding to the presence of an object (edge) and those to be discarded. This thresholding process embodies the basic spike-firing profile of neurons in human neural tissue. The fact that this 4-phase schema is both computationally handy and familiar [20] is no coincidence. In the early days of computer vision, computational solutions were highly influenced by advances in neuroscience and psychology. These solutions were further adopted by researchers in upcoming decades, e.g. the edge detection frameworks by Bezdek et al. [20] and Law et al. [21].
Analysing the previous schema, it becomes clear that some phases have been studied significantly more than others. While image regularization (either adaptive or not) and feature extraction are well covered in the literature, feature fusion and thresholding are notably understudied. Feature fusion is normally done trivially and, when studied in depth, the focus is rather on solving the problems and computational challenges generated in previous phases of the algorithms. For example, well-known strategies such as edge tracking [22,23] or multiscale edge surfaces [24] are designed to cope with multiscale information, while Di Zenzo-Cumani operators [25,26] were introduced because of the multidimensional nature of colour inputs. Some authors have indeed elaborated on the idea of intelligently fusing edge cues in an interpretable manner [21], but they are rather exceptional.
In this work, we propose an edge detection algorithm which tries to simulate the information flow in the human visual system for boundary detection. This algorithm is faithful to the 4-phase schema presented above, and is guided by two principles: first, keeping all phases at minimum complexity, in an attempt to mimic basic unitary behaviour in human neural tissue; second, using a sensible, adaptive feature fusion method to combine edge features. The first principle results in a simple, interpretable edge detection structure. The second leads to edge cue fusion based on generalizations of the Choquet integral [27], an aggregation operator that has been intensively studied in recent years [28]. The application of the Choquet integral in Fuzzy Rule-Based Classification Systems [29] led to an improvement of such systems [30]. This is due to the fact that this aggregation, when applied in the Fuzzy Reasoning Method (FRM) [31], considers all the information related to the example and its relation with the fired rules. Observe, for example, that the classical FRM of the Winning Rule (WR) [32] takes into consideration just one rule, the one having the largest relation with the example, and discards the remaining information.
In a similar way, we believe that the Choquet integral and its generalizations are good computational representatives of neuronal behaviour in terms of accumulating ionic charge, with the weights of the operator representing the neuronal spiking profile. As their properties have proven successful in a variety of tasks [28], we think that, in order to bridge the gap between the HVS and edge detection, the generalizations of the Choquet integral permit a better detection of the presence of edges, as they consider the relation between the elements to be aggregated. Hence, generalizations of the Choquet integral are taken as the keystone of the edge cue fusion.
In this context, the objectives of this work are: (i) Introduce a novel approach to the edge detection problem, consisting in the application of gravitational force smoothing, a technique initially proposed by Marco-Detchart et al. [33], combined with the use of generalizations of the Choquet integral in the feature blending phase of the Bezdek Breakdown Structure (BBS); (ii) Perform a quantitative evaluation of the obtained results, comparing them to different non-supervised edge detectors, such as the Canny method [34], Fuzzy Morphology [35] and Gravitational approaches [36,37], using both the gravitational image smoothing mentioned in (i) and the standard Gaussian image smoothing approach [38].
This paper is organized as follows. In Section 2, we present some preliminary concepts necessary to develop the paper. In Section 3, we present and discuss some of the generalizations of the Choquet integral considered in this work. Section 4 is devoted to presenting the proposed methodology for edge detection. In Section 5, we show some examples of edges obtained with our proposal, along with the statistical analysis and discussion of the obtained results. Finally, in Section 6, some conclusions are drawn along with future work.

Preliminaries
In this section, we recall some basic concepts about the theory of aggregation that will be used further on in the work.
In this paper, any function F : [0, 1]^n → [0, 1] is called a fusion function [39]. A particular family of fusion functions is that of aggregation functions [41,42]. In the context of this work, the following two properties are important for aggregation functions: idempotency, i.e. F(x, . . . , x) = x for all x ∈ [0, 1]; and averaging behaviour, i.e. min(x) ≤ F(x) ≤ max(x) for every x ∈ [0, 1]^n. Observe that idempotency and averaging behaviour are equivalent concepts in the context of aggregation functions.

Definition 3. [45]
An aggregation function T : [0, 1]^2 → [0, 1] is said to be a t-norm if, for all x, y, z ∈ [0, 1], the following conditions hold: (T1) commutativity, T(x, y) = T(y, x); (T2) associativity, T(x, T(y, z)) = T(T(x, y), z); (T3) monotonicity, T(x, y) ≤ T(x, z) whenever y ≤ z; (T4) boundary condition, T(x, 1) = x. A very relevant class of aggregation functions, which covers means [42], is that of Choquet integrals. To define these functions, we first need to recall the notion of fuzzy measure: a set function m : 2^N → [0, 1] which is increasing with respect to set inclusion and satisfies m(∅) = 0 and m(N) = 1. In the following, consider N = {1, 2, . . . , n}. The discrete Choquet integral of x ∈ [0, 1]^n with respect to m is given by C_m(x) = Σ_{i=1}^{n} (x_(i) − x_(i−1)) · m(A_(i)), where x_(1) ≤ . . . ≤ x_(n) is an increasing reordering of x, x_(0) = 0 and A_(i) = {(i), . . . , (n)}. We now present an example of fuzzy measure, namely the power measure, which is adopted in this work: m(X) = (|X| / n)^q, where |X| is the number of elements to be aggregated, n the total number of elements and q > 0. We have selected this measure because it is the one achieving the highest accuracy in classification problems [46,30], where the exponent q is learned using a genetic algorithm. Here, however, we shall consider fixed values for q.
The Choquet integral is idempotent and presents an averaging behaviour. Observe that the Choquet integral is defined using a fuzzy measure, which allows it to take into consideration the relations or interactions among the elements to be aggregated (i.e., the components of an input x). This is the key property of Choquet-like integrals. Also remark that, when the fuzzy measure depends only on the cardinality of the criteria, the Choquet integral collapses to the OWA operators [47].
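As an illustration, the discrete Choquet integral with respect to the power measure can be sketched in a few lines of Python (a minimal sketch of the standard construction; the function names are ours, not from the paper):

```python
def power_measure(card, n, q=1.0):
    # Power measure m(X) = (|X|/n)^q, the fuzzy measure adopted in this work.
    return (card / n) ** q

def choquet(x, q=1.0):
    # Discrete Choquet integral: sort the inputs increasingly and weight
    # each increment x_(i) - x_(i-1) by the measure of the set of indices
    # whose value is at least x_(i).
    n = len(x)
    total, prev = 0.0, 0.0
    for i, xi in enumerate(sorted(x)):
        total += (xi - prev) * power_measure(n - i, n, q)
        prev = xi
    return total

print(choquet([0.2, 0.5, 0.9], q=1.0))  # equals the arithmetic mean when q = 1
```

For q = 1 the power measure is additive in the cardinality and the integral reduces to the arithmetic mean; other values of q weight the largest inputs differently while preserving idempotency.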
For some specific applications, imposing monotonicity might be too restrictive (e.g., the mode is not increasing with respect to all its arguments, but it is a valid function for certain applications). This consideration led Bustince et al. [39] to introduce the notion of directional monotonicity.
Given r = (r_1, . . . , r_n) ∈ R^n, r ≠ 0, a fusion function H : [0, 1]^n → [0, 1] is said to be r-increasing if, for all x ∈ [0, 1]^n and c > 0 such that x + c·r ∈ [0, 1]^n, it holds that H(x + c·r) ≥ H(x). That is, an r-increasing function is a function which is increasing along the ray (direction) determined by the vector r. For this reason, we say that H is directionally monotone or, more specifically, directionally r-increasing. In fact, in applications such as classification tasks or decision making, strict monotonicity is not necessary, and this property has been proved to have no interference in the quality of the results [48,49]. This is why we intend to test this behaviour in our experiments. A fusion function H is called a pre-aggregation function if: (1) H is r-increasing for some r ∈ R^n, r ≠ 0; (2) H satisfies the boundary conditions: H(0, . . . , 0) = 0, H(1, . . . , 1) = 1.
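The distinction can be checked numerically. The following small script (ours, not from the paper) uses the sample mode, the classic example of a function that fails standard monotonicity yet is (1, . . . , 1)-increasing:

```python
from statistics import mode

# The mode is NOT monotone: raising coordinates can lower the output.
x = [0.1, 0.1, 0.2, 0.2, 0.2]   # mode = 0.2
y = [0.1, 0.1, 0.3, 0.4, 0.5]   # x <= y componentwise, but mode = 0.1
print(mode(x), mode(y))          # monotonicity fails: 0.2 > 0.1

# Yet shifting every coordinate by the same c > 0 (direction r = (1, ..., 1))
# shifts the mode by c, so the mode never decreases along that ray.
c = 0.5
print(mode([v + c for v in x]))
```

Note that Python's `statistics.mode` returns the first mode encountered in case of ties; the example above avoids ties so the check is unambiguous.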
Observe that idempotency and averaging behavior are not equivalent concepts in the context of preaggregation functions.

Generalizations of the Choquet integral
The first generalization of the Choquet integral was proposed by Lucca et al. in [50]. In that work, the authors replace the product operator of the Choquet integral by a t-norm T. This generalization is referred to as the C_T-integral and, for a fuzzy measure m, is defined as C_T,m(x) = Σ_{i=1}^{n} T(x_(i) − x_(i−1), m(A_(i))), where x_(1), . . . , x_(n) is an increasing permutation of the input x, that is, x_(1) ≤ . . . ≤ x_(n), with the convention x_(0) = 0, and A_(i) = {(i), . . . , (n)}.
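A minimal sketch of a C_T-integral in Python (our own illustration, taking T = min, a valid t-norm, and the power measure with q = 1; names are ours):

```python
def c_t_integral(x, measure, tnorm=min):
    # C_T-integral: the product of the Choquet integral is replaced by a
    # t-norm T applied to each increment and the measure of A_(i).
    n = len(x)
    total, prev = 0.0, 0.0
    for i, xi in enumerate(sorted(x)):
        total += tnorm(xi - prev, measure(n - i, n))
        prev = xi
    return total

# Power measure with q = 1 for illustration.
m = lambda card, n: card / n
print(c_t_integral([0.2, 0.5, 0.9], m))
```

With T = min the result generally differs from the plain Choquet integral (which would give the arithmetic mean for this measure), showing how the choice of T changes the fusion behaviour.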

Table 1: Families of C_T- and C_F-integrals used in this work, as generalizations of the Choquet integral. For each Choquet-like integral, the table lists the base function, its family [52,53], and whether the resulting integral is averaging and idempotent or not.
Then, for any fuzzy measure m, C_M,m is a (1, . . . , 1)-pre-aggregation function which is idempotent and averaging.
C_F,m(x) = min(1, Σ_{i=1}^{n} F(x_(i) − x_(i−1), m(A_(i)))), where x_(1), . . . , x_(n) is an increasing permutation of the input x, that is, x_(1) ≤ . . . ≤ x_(n), with x_(0) = 0 and A_(i) = {(i), . . . , (n)}. Table 1 presents some of the best-performing functions in our experimental study that can take up the role of the mapping F, combined with the corresponding fuzzy measure. Note that the generalizations are obtained by replacing the x and y variables in each function by (x_(i) − x_(i−1)) and m(A_(i)), respectively.

A methodology for edge detection using C F -integral-based feature fusion
In our proposal, we intend to create an edge detection framework based on (a) the computation of simplistic local edge features and (b) the aggregation of such features using generalizations of the Choquet integral. In this manner, we intend to simulate the process of early detection of visual primitives in the retina, followed by a neural combination of such features in the early visual cortex [54].
We consider images to be functions I : D → [0, 1], where D ⊂ N^2 is the set of pixel positions. Edge detection is often regarded as applying a kernel over the image to obtain a set of cues that represent the degree of existence of an edge. In this sense, image features are obtained with a sliding window over each pixel of the image. Those features are then fused and filtered to obtain the final edge image. As the features obtained are usually a set of values (in our case, 8 features per pixel), we need to represent them with a unique value. This is why we fuse them using some type of operator, which in our proposal is a generalization of the Choquet integral.
Our edge detection framework sticks to the Bezdek Breakdown Structure (BBS, [20,55]). In this structure, the process undergone by an image for edge detection consists of four different phases: image conditioning (P1), feature extraction (P2), blending (P3) and scaling (P4). This structure, summarized in Fig. 2, allows the process of edge detection to be understood as four independent, coordinated phases. In the conditioning phase, the image is enhanced for a better discrimination of its edges. Examples of conditioning techniques include regularization and content-aware smoothing. The feature extraction phase consists of gathering local or semi-local cues of the presence/absence of edges. These cues are further fused at the blending phase to produce scalar representations of the likelihood of an edge being present at each pixel. Finally, the scaling phase transforms the blended information into the desired representation, which is usually binary, 1-pixel wide edges. Each of these phases is detailed individually in the upcoming sections.

Conditioning (P1)
As one of the steps in edge detection, image conditioning aims at removing noise or extra information not suitable for the edge extraction process. It is a crucial step, as it helps to remove spurious artefacts from image acquisition, as well as to reduce the effect of textures in edge detection. Ideally, the conditioning process shall decrease the visual saliency of non-edge structures, while preserving or maximizing the visibility of edges.
We consider, in our proposal, two different conditioning methods. Firstly, the well-known Gaussian smoothing regularizes the image with a Gaussian pulse of standard deviation σ. This σ controls the level of smoothing over the image; the pulse generates a 2-D kernel which is applied over the image with a convolution. Gaussian smoothing is known to produce a blurring effect which affects edge regions. An example of this effect can be seen in the first two images of Figure 4. Secondly, we use a more recent proposal for image regularization based on gravitational forces between particles, known as Gravitational Smoothing (GS) [33]. This approach, framed within the class of content-aware smoothing operators, iteratively simulates the grouping of pixels in a 5-D spatial-tonal space. Similarly to the effects caused by Anisotropic Diffusion (in any of its proposed models [56,57]), Mean Shift [58] or Bilateral Filtering [59], GS succeeds in regularizing intra-object regions, while maintaining or improving the visibility of actual edges.
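For reference, a normalized 2-D Gaussian kernel of standard deviation σ can be built as follows (a minimal sketch; the truncation radius of 3σ is our choice, not a value from the paper):

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    # 2-D Gaussian kernel of standard deviation sigma, truncated at
    # ~3*sigma and normalized to sum to 1, so convolution preserves
    # the mean intensity of the image.
    r = radius if radius is not None else int(3 * sigma)
    ax = np.arange(-r, r + 1)
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()

k = gaussian_kernel(1.0)
print(k.shape, round(k.sum(), 6))  # (7, 7) 1.0
```

The kernel is then convolved with the image; larger σ widens the kernel and increases the blurring over edge regions.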
The regularization process (Figure 3) is done considering each pixel as a particle that exerts a force over every surrounding pixel. The total force exerted over a particle is the sum of all the forces exerted by the remaining particles over it, as indicated by Equation 4:

F_i = Σ_{j ≠ i} G · (m_i · m_j / |r_ij|^3) · r_ij ,    (4)

where the mass of each particle is considered to be 1 and r_ij is the 5-D distance vector in the spatial-tonal space. Unlike Gaussian Smoothing, GS is highly configurable. Following [33], we consider the following process-controlling parameters: ω_c controls the influence between the spatial and colour information of each pixel, t indicates the number of iterations to be performed, and G is the gravitational constant. The influence of these parameters on a conditioned image can be seen in the last two images in Figure 4.
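The net force on a particle can be sketched as follows (our own illustration with unit masses; for readability the example uses a 3-D spatial-tonal space (x, y, tone) for a grey image rather than the 5-D space of a colour image, and the tonal weighting stands in for the paper's ω_c):

```python
import numpy as np

def gs_force(points, i, G=1.0):
    # Net force on unit-mass particle i from all the others (cf. Eq. 4):
    # F_i = sum_{j != i} G * (p_j - p_i) / |p_j - p_i|^3
    diff = np.delete(points, i, axis=0) - points[i]
    dist = np.linalg.norm(diff, axis=1, keepdims=True)
    return (G * diff / dist ** 3).sum(axis=0)

# Toy example: three particles; the tonal axis is scaled by wc.
wc = 0.5
pts = np.array([[0.0, 0.0, wc * 0.20],
                [1.0, 0.0, wc * 0.25],
                [5.0, 5.0, wc * 0.90]])
print(gs_force(pts, 0))  # pulled mostly towards the nearby, similar particle
```

The inverse-cube weighting of the displacement vector makes nearby, tonally similar particles dominate the attraction, which is what drives the grouping of pixels into homogeneous regions over the t iterations.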
In this work, we consider the parameter settings in Table 2.
Table 2: Smoothing configurations considered in this work (name, smoothing method and parameters).

Feature extraction (P2)
At the feature extraction phase (Alg. 1, step 1), we intend to gather simplistic edge cues at each pixel. The evident alternative would have been using Laplacian filters [9,54], hence mimicking the receptive fields at the retina. However, we have opted for an even simpler set of edge cues.
At each pixel, we select the 8-point immediate neighbourhood (3 × 3 window) around each position (x, y) ∈ D (i.e. neighbours from (x − 1, y − 1) to (x + 1, y + 1)). From such a neighbourhood, we compute the absolute discrete difference (Alg. 1, step 3) in the direction of each cardinal point, that is, |I(x + i, y + j) − I(x, y)| for each (i, j) ∈ {−1, 0, 1}^2 with (i, j) ≠ (0, 0). This could have been done in more complex manners, e.g. with anisotropic Gaussian kernels [60,61]. However, the present configuration is maintained to preserve the simplicity of the process.
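The eight differences can be computed directly (a minimal sketch; function and variable names are ours):

```python
import numpy as np

def edge_features(img, x, y):
    # Eight absolute discrete differences between pixel (x, y) and its
    # 3x3 neighbourhood, one per cardinal/diagonal direction.
    c = float(img[x, y])
    return [abs(float(img[x + dx, y + dy]) - c)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]

# Vertical step edge between the second and third columns.
img = np.array([[0.1, 0.1, 0.9],
                [0.1, 0.1, 0.9],
                [0.1, 0.1, 0.9]])
print(edge_features(img, 1, 1))  # large differences towards the right column
```

At the centre pixel, the three differences pointing towards the bright column are large while the remaining five are zero, which is the raw cue pattern the blending phase later fuses.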

4: Order the eight values of step 3 in increasing order;
5: Apply the generalization of the Choquet integral C_T,m or C_F,m to the values obtained in step 4 to fuse the different features;
6: Assign the value obtained in step 5 as the intensity of pixel (x, y) of I_f;
7: end for

Blending (P3)
The blending phase is in charge of aggregating, at each pixel, the information gathered at the feature extraction phase into a scalar representative of the edges. This phase aims at mimicking the hierarchical composition of features in the early visual cortex (V1-V4), leading to the recognition of complex shapes from simple cues.
We understand that the proposed generalizations of the Choquet integral are adequate computational units to simulate this process of neural aggregation. Although, in general, any operator could be used (e.g. any aggregation operator [43,62]), we believe that the generalizations of the Choquet integral are a better fit for the non-linear spiking profile of neurons in the visual cortex. In fact, their advantage lies in considering all the information related to the problem and in modelling the relation between the cues being fused by means of the fuzzy measure.
In our method, edge features are blended (Alg. 1, step 5) using either C_T,m or C_F,m. The process is carried out at each pixel of the image, obtaining a blended feature image I_f with values in [0, 1].
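The per-pixel blending loop can be sketched as follows (ours, not the paper's implementation; the arithmetic mean stands in here for a Choquet-like integral, and any fusion function can be plugged in via `fuse`):

```python
import numpy as np

def blend(img, fuse):
    # Blended feature image I_f: at each interior pixel, the eight sorted
    # neighbourhood differences are fused into a single edge-strength value.
    h, w = img.shape
    out = np.zeros((h, w))
    for x in range(1, h - 1):
        for y in range(1, w - 1):
            feats = sorted(abs(float(img[x + dx, y + dy]) - float(img[x, y]))
                           for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                           if (dx, dy) != (0, 0))
            out[x, y] = fuse(feats)
    return out

img = np.array([[0.1, 0.1, 0.9]] * 3)  # vertical step edge
print(blend(img, lambda f: sum(f) / len(f))[1, 1])
```

The features are pre-sorted so an increasing permutation, as required by the Choquet-like integrals, is already available to the fusion function.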

Scaling (P4)
In our proposal, we stick to the standard representation of edges as thin, binary lines. Hence, the scaling phase needs to convert the scalar, real-valued representation generated in the blending phase into a binary image. This is done by applying non-maxima suppression [63], considering cue orientations, in order to thin the peak values obtained from the blending step, followed by edge binarization using hysteresis. Note that the process of setting the thresholding parameters needs to be carefully taken care of, and can be determined either using generic algorithms [64,65] or task-dependent procedures.
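A minimal hysteresis binarization can be sketched as follows (our own illustration, not the paper's implementation; note that `np.roll` wraps around the borders, which a padded implementation would avoid):

```python
import numpy as np

def hysteresis(strength, low, high):
    # Keep strong pixels (> high) and weak pixels (> low) that are
    # 8-connected to a strong one, growing the edge set until it is stable.
    strong = strength > high
    weak = strength > low
    edges = strong.copy()
    changed = True
    while changed:
        changed = False
        grown = np.zeros_like(edges)
        for dx in (-1, 0, 1):        # dilate the current edges by one pixel
            for dy in (-1, 0, 1):
                grown |= np.roll(np.roll(edges, dx, axis=0), dy, axis=1)
        new = grown & weak
        if (new & ~edges).any():
            edges |= new
            changed = True
    return edges

# The weak responses 0.4 and 0.35 survive via the strong 0.9; 0.05 does not.
strength = np.array([[0.9, 0.4, 0.05],
                     [0.0, 0.35, 0.0]])
print(hysteresis(strength, low=0.3, high=0.8).astype(int))
```

The double threshold keeps faint but connected edge segments while discarding isolated weak responses, which is why the choice of the two thresholds matters so much.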

Experimental framework
In this section, we compare the performance of our proposed framework with that of other well-established proposals in the literature. First, in Section 5.1, we briefly introduce the dataset used for the experimentation, along with the different quality measures used for the quantification of the results. In Section 5.2 we briefly describe the methods to which we compare. Then, in Section 5.3 we present the quantitative results of the comparison.

Dataset and quantification of the results
For our experiments, we have used the Test Set of the Berkeley Segmentation Dataset and Benchmark (BSDS500, [66]), which contains 200 natural images together with over 1000 ground truth labellings (see Fig. 5).
The evaluation of an edge detection method is normally done by comparing its results with the hand-labelled results by human experts. Considering that both the output of an edge detection method and the ground truth labellings are represented as binary images, edge image comparison can be treated as a binary classification problem. This strategy results in the generation of a confusion matrix. Note that, in order to compute the confusion matrix, an intelligent strategy needs to be adopted to match the edges in the automatically-generated results with those in the ground truth. The reason is that edges at slightly displaced positions in different images should actually be accounted for as true positives, as long as such displacement is within a certain margin. Among the different alternatives for computing the displacement-tolerant correspondence of edges, we have opted for the standard procedure by Estrada and Jepson [67] (the code is available at [68]). The spatial tolerance has been set to 2.5% of the length of the image diagonal.
Normally, the confusion matrix in a binary image comparison procedure is used as an intermediate result for the generation of scalar descriptors of the similarity between the images. We select the most recognized descriptor, the F_0.5 measure [69,70]. This measure is computed from the precision and recall descriptors:

Prec = TP / (TP + FP),   Rec = TP / (TP + FN),   (6)

according to the following formula:

F_0.5 = (Prec · Rec) / (0.5 · Prec + 0.5 · Rec).

The quantification of the result of an edge detection method on a given image is computed from the individual comparisons to the images in its ground truth. Specifically, we keep the triplet (Prec, Rec, F_0.5) for the ground truth yielding the greatest F_0.5. This process is repeated, then averaged, over all images in the dataset.
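The computation of the triplet from the confusion matrix counts is straightforward (a minimal sketch with illustrative counts of ours):

```python
def f_measure(tp, fp, fn):
    # Precision, recall and the F_0.5 measure (the alpha = 0.5 weighting
    # reduces to the harmonic mean of precision and recall).
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f = prec * rec / (0.5 * prec + 0.5 * rec)   # = 2PR / (P + R)
    return prec, rec, f

print(f_measure(80, 20, 40))  # Prec = 0.8, Rec = 2/3, F_0.5 = 8/11
```

Because the harmonic mean is dominated by the smaller of the two descriptors, a detector cannot inflate F_0.5 by trading recall away for precision (or vice versa).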

Methods in the experimental comparison
The performance of our proposal is compared with that of other methods in the literature, specifically: • The Canny method [34]. The Canny method is normally implemented using a double filtering with Gaussian filters. First, at the conditioning phase, a zero-th order Gaussian pulse (standard deviation σ 1 ) is used to regularize the signal. Then, two orthogonal first order Gaussian filters (standard deviation σ 2 ) are used to estimate the partial derivatives of the underlying signal at each pixel. The blending is performed as the Euclidean norm of the gradient at each pixel. For this experiment, σ 1 ∈ {1, 2} and σ 2 = 2.25, both being fairly standard values for images in the BSDS [71,72].
• Gravitational Edge Detector based on a t-norm T [36,37]. This method is based on the computation of the resulting local gravitational forces around each pixel. Those forces are expected to be small in magnitude in homogeneous regions, while being significantly larger at boundaries. The method uses Gaussian smoothing for regularization, then computes the attractive gravitational forces by replacing the product in Newton's formulation by any t-norm or t-conorm. In the present experiments, the functions used for such task are the probabilistic sum (S P (x, y) = x + y − xy) and the maximum (S M (x, y) = max(x, y)), which are both t-conorms.
• Fuzzy Morphology [35]. This method is based on a generalization of the morphological operators, considering general t-conorms (t-norms) for erosion (dilation). The morphological operators (erosion and dilation) consist in growing or shrinking an image. Thus, by enlarging or eroding away the areas where an edge could be present, new images with enlarged or shrunk cues are obtained. From such pairs of new images (e.g., by subtracting them), the final cues are obtained so that edges can be extracted. The fuzzy erosion and dilation are based on the Schweizer-Sklar [73] t-norm and t-conorm, respectively, which are defined, for all x, y ∈ [0, 1] and λ < 0, by T_λ(x, y) = (x^λ + y^λ − 1)^(1/λ) if x, y > 0, and min(x, y) otherwise; the t-conorm is the dual S_λ(x, y) = 1 − T_λ(1 − x, 1 − y).
As advised in [35], λ = −5 for this experiment. We refer to this approach as FM_SS.
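For λ < 0, the pair can be sketched as follows (our own illustration; the exact case split used in [35] may differ):

```python
def ss_tnorm(x, y, lam=-5.0):
    # Schweizer-Sklar t-norm for lam < 0. For 0 < x, y <= 1 the powers
    # x**lam, y**lam are >= 1, so the base is always positive.
    if x == 0.0 or y == 0.0:
        return 0.0          # min(x, y) when either argument is 0
    return (x ** lam + y ** lam - 1.0) ** (1.0 / lam)

def ss_tconorm(x, y, lam=-5.0):
    # Dual t-conorm: S(x, y) = 1 - T(1 - x, 1 - y).
    return 1.0 - ss_tnorm(1.0 - x, 1.0 - y, lam)

print(ss_tnorm(0.6, 0.7), ss_tconorm(0.6, 0.7))
```

As for any t-norm, the output never exceeds the smaller argument, and T(x, 1) = x; λ interpolates the family between the minimum (λ → −∞) and the product (λ → 0).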
The process used for performance evaluation is kept as similar as possible across all the compared methods. The conditioning is based on zero-th order Gaussian filters (as in the Canny method). Then, after blending, the images are scaled using non-maxima suppression [34] and hysteresis [65].

Experimental results
In this section we present the results obtained with our method compared to the ones obtained with well-known alternatives in the literature. We analyse both the qualitative and quantitative results, showing firstly a visual analysis where the different configurations are compared at a feature and boundary level and secondly a statistical analysis of the results as described in Section 5.1.
The difference obtained at a feature level can be seen in Fig. 6, where we show the result of the blending phase (P3) of our proposal. We can observe that the use of Gaussian smoothing clearly penalizes the edge extraction process. Gaussian smoothing is not content-aware, which makes cues be extracted, and hence edges be detected, where only very small intensity variations exist.
On the contrary, when the GS approach is used, as it is neighbour-dependent and capable of building homogeneous intensity regions, most of the small intensity variations are removed and fewer spurious elements are detected. When looking at the edge results in Fig. 7, we can observe the effect of both smoothing processes. On the one hand, the literature methods miss many edges even with spurious cues present when using Gaussian smoothing, except in the case of FM_SS. On the other hand, our proposal is greatly affected by the smoothing method, showing that the Gaussian approach and the texture cues generate unwanted edges. Performing a global analysis, it is noticeable that our method performs visually better when using the GS approach. With the best smoothing configuration (S_3), we can see over a different set of images that our four different functions perform quite equally, outperforming the literature methods.
In terms of quantitative analysis, we show in Table 3 the results for the (Prec, Rec, F_0.5) measures for each smoothing configuration and each edge processing method. With the gathered results we can confirm, in numerical terms, the observations of the previous visual analysis.
When using Gaussian smoothing, our method (with the four different alternatives) obtains worse results than the Canny method under the S_1 configuration. It nevertheless remains above the other literature methods.
In the case of a more intense Gaussian smoothing (the S_2 configuration), our method decays considerably and is beaten by G_SM. These results indicate that our method is subject to the influence of the blurring effect and to slight differences in intensity variations. When using an alternative smoothing, i.e., the gravitational one, we can clearly see that our method is the best performer with any of the generalizations of the Choquet integral, obtaining the best result in terms of F_0.5 with C_CF (power measure, q = 0.8) and C_OB (q = 1). If we compare the results globally, we can observe that both methods achieve a higher result than the Canny method and even G_SM.
The results when using the GS method (S_3 and S_4) are very similar, removing most of the spurious edges that are present when using either of the Gaussian smoothing configurations (S_1 and S_2). It is worth mentioning that the smoothing has the opposite effect on the literature methods to which we compare, removing artifacts but also losing a great quantity of edges.
With the S_3 configuration we have a less smoothed image, achieving a good compromise between regularization (with some blurring) and good edge definition. With the S_4 configuration, regions are homogenized, with a clearer definition of possible intensity variations (with less blurring effect), yielding large parts of the image with very close intensity values. For the literature methods, we can clearly see that the strictest smoothing (S_2) is the best approach.
More results are available in the repository at GitHub 1 , where a larger number of functions for the Choquet generalizations has been put to the test along with different values for the power measure.

Conclusions
In this work, we have proposed a different use of the generalizations of the Choquet integral for aggregating information in images, concretely for fusing extracted image features in the context of edge detection. With this method, we are able to represent the relationship between all the variations of intensity around each pixel, not only along some directions, simulating in a better way the process carried out in the HVS.
We have analysed the behaviour of different functions for the Choquet integral generalizations, showing that the results are very similar among them and outperform classical methods from the literature, such as the Canny method. From our experiments, the use of the Choquet integral allows our proposal to perform better than other methods. This fact is arguably related to the ability of the generalizations of the Choquet integral to work in terms of the relation between the elements (in this case, the image pixels). Specifically, r-monotonicity allows for the detection of edges according to the tonal direction. In fact, in order to extract the edge cues, the relation between the pixels plays a crucial role in discriminating between true edges and outliers.
Moreover, we have tested our method with two alternative techniques for the regularization process, showing how crucial this initial step of the edge extraction process is. In fact, a suitable image smoothing can make one method the best performer.
We can conclude that the best combination for our Choquet-like feature extraction process is to use Gravitational smoothing, as the regularized image obtained is more homogeneous while preserving edges. This prevents our method from detecting edges where only small variations in the intensity information are present (e.g., textures).
As future research lines, we intend to study a larger variety of generalizations of the Choquet integral, like the C_F1F2-integrals [30,74], which have shown promising results in Fuzzy Rule-Based Classification Systems, as well as other generalization structures such as the d-Choquet integrals [75], taking advantage of their power in comparison processes such as those involving interval-valued data, which is also one of our future goals to improve the proposed edge detection method. Moreover, other types of fuzzy measures (e.g. weighted mean, OWA, etc.) should be studied, as we have stuck to the power measure, as well as methods to learn the measure so that it better represents the relation between the elements to be aggregated.