A trainable monogenic ConvNet layer robust in front of large contrast changes in image classification

Convolutional Neural Networks (ConvNets) at present achieve remarkable performance in image classification tasks. However, current ConvNets cannot guarantee the capabilities of the mammalian visual systems such as invariance to contrast and illumination changes. Some ideas to overcome the illumination and contrast variations usually have to be tuned manually and tend to fail when tested with other types of data degradation. In this context, we present a new bio-inspired entry layer, M6, which detects low-level geometric features (lines, edges, and orientations) similar to the patterns detected by the V1 visual cortex. This new trainable layer is capable of coping with image classification even with large contrast variations. The explanation for this behavior is the monogenic signal geometry, which represents each pixel value in a 3D space using quaternions, a fact that confers a degree of explainability to the networks. We compare M6 with a conventional convolutional layer (C) and a deterministic quaternion local phase layer (Q9). The experimental setup designed to evaluate the robustness of our M6-enriched ConvNet models includes three architectures, four datasets, and three types of contrast degradation (including non-uniform haze). The numerical results reveal that the models with M6 are the most robust to all kinds of contrast variations. This amounts to a significant enhancement over the C models, which usually have reasonably good performance only when the same training and test degradation are used, except for the case of maximum degradation. Moreover, the Structural Similarity Index Measure (SSIM) is used to analyze and explain the robustness effect of the M6 feature maps under all kinds of contrast degradation.


I. INTRODUCTION
One important feature of the mammalian visual cortex is its built-in capacity to recognize objects independently of size, contrast, illumination, angle of view, or brightness, among other transformations [1]. Achieving an equivariant or invariant response to these transformations is an important goal in Deep Learning (DL) [2], [3]. In fact, an increasing number of studies have found various weaknesses in the generalization capacity of ConvNet models [2], [3] related to large contrast changes. In addition, previous evidence demonstrates that the deployment of DL models in the real world could be affected significantly by day-to-night light changes, haze, or the effects of glare, in applications such as self-driving cars [4], medical images with surface glazes [5], or 24-hour surveillance.
Data augmentation is one idea for overcoming illumination and contrast variations in image classification problems [6], [7]. However, previous work indicates that learning an invariant response may fail even with large data augmentation in the training process [8]. In addition, data augmentation techniques have three main problems [9], [10]: i) they must be tuned (manually) by human experts, which causes large variances in DL-model performance in practice; ii) because of a lack of analytic tools (even for simple models), it is not well understood how training on augmented data affects the learning process; and iii) data augmentation approaches focus on improving the overall performance of a model, whereas it is often imperative to have a finer-grained perspective; that is, data augmentation techniques are required to mitigate weak performance when dealing with under-represented classes.
Another common approach to tackling contrast and illumination problems is data normalization, such as local response normalization [11]. However, the main problem of these approaches appears when the change of illumination is non-uniform across the images in the dataset [12]. Rad et al. [12] proposed using an adaptive local contrast normalization based on a window (region) difference of Gaussians for detection. Although their approach is novel, the parameters of the Gaussian functions are based on the dataset illumination. As a result, overall detection performance will be reduced for a new image with very different contrast.
Generative Adversarial Networks (GANs) are capable of restoring contrast affected images [2], [13]. Nevertheless, these architectures are primarily designed for visual restoration, not for image classification.
The limitations of these approaches reveal opportunities to explore novel methods. This work proposes a new strategy to progress toward achieving contrast-illumination invariance in image classification tasks. Specifically, a new bio-inspired trainable layer, M6, is presented, which detects low-level geometric features (such as lines, edges, and orientations) similar to the patterns detected by the V1 visual cortex [14].
The M6 layer is based on the 2D extension of the analytic signal, called the monogenic signal. As a result, each pixel value of an image I(x, y) ∈ R is mapped to a Hamilton quaternion (see Figure 1). The geometry of this approach has the remarkable simple property that the quality of the representation is not affected by large changes of the pixel intensities, a fact that confers a degree of explainability to the networks.
On the experimental side, to evaluate the predictive performance and robustness of M6, contrast changes were simulated in three different ways using four datasets: MNIST [15], Fashion MNIST [16], CIFAR-10 [17], and Dogs and Cats [18], with three architectures: A1 (shallow), A2 (medium), and A3 (very deep). The performance of M6 was compared to a conventional 2D ConvNet layer (C) and to the Q9 layer [19], which follows a similar approach to handle contrast changes. To evaluate robustness, the models were trained with a specific data degradation and tested with other types of data degradation. The numerical results on the test sets confirm that M6 achieves a remarkably resilient response to contrast variations when compared to standard convolutional networks and to the Q9 approach.
The rest of the paper is organized as follows: Section II reviews previous related work. Section III summarizes background material concerning the local phase computation, the monogenic signal, and bio-inspired tools. Section IV presents the computational aspects of the monogenic layer M6, and Section V the data and experimental arrangements. Section VI describes the experimental outcomes and their analysis, and Section VII presents the authors' conclusions and future work.

II. RELATED WORK
Promising approaches were advanced in the bio-inspired papers [1] (which introduces VisNet) and [20]. These introduce and study, via quite distinct approaches, interesting hypotheses about how the cortex achieves various invariant representations. However, testing is performed on relatively small datasets, and although lighting invariance is considered, contrast invariance is not.
The mathematical tools used in [21]-[23] involve only complex numbers, but are otherwise somewhat akin to those used here. Their goals, however, are quite different, and in particular, they do not seek a robust response to large contrast or illumination changes. The first presents (complex) harmonic networks (H-nets) and shows, by means of complex circular harmonic functions, that they exhibit equivariance to translations and to in-plane rotations. The aim of the second is to present a complex-valued learnable Gabor-modulated network which features orientation robustness. The third reference applies the analytic signal (defined in Section III) using the Hilbert-Huang transform to decompose the ranked pooling features into finite and often few data-dependent functions. This approach can deal with occlusion or heavy camera movement in action recognition. However, the scope of this work does not include the occlusion problem.
Recent developments dealing with contrast robustness have been reported. In [24], for instance, the authors present an entry layer (VOneNet) which uses Gabor functions and report the results of testing for different data degradations and perturbations. The layer is deterministic (non-trainable) and consequently requires fine-tuning several parameters (related to input size, normalization, and random parameters) to set the Gabor function. For contrast degradation, the test accuracy drops from 75.6% to 28.5% on the ImageNet dataset. Another deterministic approach is presented in [25]. The authors propose combining the chromatic component of a perceptual color space to improve image segmentation. Nevertheless, tests in this work deal only with outdoor images with normal degradations, and their parameters must be tuned according to the dataset.
A previous, still deterministic version of the M6 layer was presented in [26], already with quite robust responses to illumination and contrast variations. In another recent work [19], the authors proposed a bio-inspired approach that leverages quaternion Gabor filters to improve classification even with contrast degradations. An adaptive local contrast normalization was proposed in [12]. Although this approach was more effective than conventional normalizations, the trainable parameters are based on the dataset illumination. As a result, overall detection performance will be reduced for images with different degradations.
To assess the performance and robustness of our M6 layer in more concrete terms, it was compared primarily to conventional 2D convolutional layers C, which are the most popular and provide the architecture for many ConvNet models. Because of this, the authors looked for studies that satisfied the following conditions: i) robustness to contrast or illumination changes in image classification tasks, and ii) available and implementable code for different architectures and datasets. The works selected for that purpose, already cited above, are [12], [19], [24], [25]; on closer inspection, it was determined that only [19] satisfies both criteria. As mentioned before, the main strength of M6 with respect to [19] is that the latter is based on a deterministic unit, while M6 is trainable, making it possible to improve robustness and performance in image classification tasks.

III. BACKGROUND
This section explains background and notations used in this work. These include the analytic and monogenic signal properties. The bio-inspired connection to the proposed layer is also explained.

A. SIGNALS
We define 1D (resp. 2D) multivectorial signals as C^1 maps U → G from an interval U ⊂ R (a region U ⊂ R^2) into a geometric algebra G (see [27] for a detailed definition). For G = R (G = C, G = H) we say that the signal is scalar (complex, quaternionic). For technical reasons, it is also assumed that signals are in L^2 (that is, their modulus is square-integrable). For more information about quaternions and geometric algebra, please see the Appendix and [27].

B. ANALYTIC SIGNAL
In 1946, D. Gabor proposed a complex-valued function (signal) called the analytic signal for removing negative frequencies [28]. Using the analytic signal instead of the original real-valued signal has mitigated estimation biases and eliminated cross-term artifacts due to the interaction of negative and positive frequencies [29]. However, the most interesting property of the analytic signal is its phase representation. In contrast to amplitude-based computer vision techniques, phase-based methods are not sensitive to smooth shading and lighting variations [30], [31]. Moreover, beyond the global (non-localized) Fourier phase, the analytic signal encodes both local space and frequency characteristics of a signal simultaneously. Phase-based feature detection has been investigated extensively in the classic computer vision approach, as in [30], [32]-[34]. For a 1D real signal (function) f(x), its analytic signal f_A(x) is defined as follows [31]:

f_A(x) = f(x) + i (Hf)(x),

where H is the Hilbert transform. The amplitude A and the phase ϕ of f_A are defined by the following expressions:

A(x) = |f_A(x)| = sqrt(f(x)^2 + (Hf)(x)^2),    ϕ(x) = atan2((Hf)(x), f(x)).

The local phase computation needs an additional operator to enhance the localization of the features [30], which can be achieved by computing a filtered version of the signal f(x) with an even function f_e(x), as follows [30]:

f_A(x) = (f * f_e)(x) + i (f * Hf_e)(x),

where * represents the convolution operator. According to [30], [31], the approximation filter f_e must be a symmetric band-pass filter. In practice, the approximation of the local phase and the local amplitude uses a pair of band-pass quadrature filters such as log-Gabor.
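As a concrete illustration, the analytic signal and the contrast insensitivity of its local phase can be sketched in a few lines of NumPy. This is a minimal sketch, not the paper's implementation; the test signal and the frequency-domain construction of the Hilbert transform are illustrative choices.

```python
import numpy as np

def analytic_signal(f):
    """Analytic signal f_A = f + i*H{f}, built in the frequency domain
    by suppressing negative frequencies and doubling positive ones."""
    n = len(f)
    F = np.fft.fft(f)
    h = np.zeros(n)
    h[0] = 1.0
    h[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        h[n // 2] = 1.0
    return np.fft.ifft(F * h)

# Illustrative test signal: a pure cosine of integer frequency.
x = np.linspace(0, 1, 512, endpoint=False)
f = np.cos(2 * np.pi * 8 * x)
f_a = analytic_signal(f)

amplitude = np.abs(f_a)   # local amplitude A(x)
phase = np.angle(f_a)     # local phase  phi(x)

# Linearity of H: a global contrast rescaling leaves the phase unchanged
# (phases compared on the unit circle to avoid wrap-around at +/- pi).
phase_scaled = np.angle(analytic_signal(0.1 * f))
assert np.allclose(np.exp(1j * phase_scaled), np.exp(1j * phase))
```

The final assertion is the 1D analogue of the robustness property M6 exploits in 2D: shrinking the signal amplitude by a factor of ten does not move the local phase.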

C. MONOGENIC SIGNAL
The most accepted 2D generalization of the analytic signal is the monogenic signal, proposed by Felsberg and Sommer in [35], which satisfies the generalized Cauchy-Riemann equations of Clifford analysis using the G framework.
A monogenic signal I_M = I_M(x, y) ∈ ⟨1, i, j⟩ ⊂ H associated to an image I = I(x, y) ∈ R (where x, y ∈ U, U a region of R^2) is defined as follows [35]:

I_M(x, y) = I(x, y) + i I_1(x, y) + j I_2(x, y),

where the signals I_1 and I_2 are the Riesz transforms of I in the x and y directions. In this work, a quadrature filter approximation was applied by using I = g * I, where * is the convolution operator and g = g(x, y) is a log-Gabor function (radial, isotropic, band-pass filter). Rewriting the monogenic signal equations (adding quadrature filters) in the Fourier domain returns:

F(I_1) = (-i u_1 / ρ) G(ρ) F(I),    F(I_2) = (-i u_2 / ρ) G(ρ) F(I),

where ρ = sqrt(u_1^2 + u_2^2) and G is the radial log-Gabor filter,

G(ρ) = exp(-(log(ρ/ρ_s))^2 / (2 log(σ)^2)),

with ρ_s the centre frequency of scale s determined by ω, f, and s. Here, u_1, u_2 are the frequency components, σ is the variance of the log-Gabor, ω is the minimum wavelength, f is a scale factor, and s is the current scale.
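The frequency-domain computation of the Riesz components can be sketched as follows. The log-Gabor parametrization below (centre frequency `omega0`, bandwidth `sigma`) and the sign convention of the Riesz transfer functions are illustrative stand-ins, not the paper's exact eq. (11).

```python
import numpy as np

def monogenic(I, omega0=0.1, sigma=0.55):
    """Band-passed image and its two Riesz components, computed in the
    Fourier domain with a radial log-Gabor filter (one common convention)."""
    rows, cols = I.shape
    u1, u2 = np.meshgrid(np.fft.fftfreq(cols), np.fft.fftfreq(rows))
    radius = np.hypot(u1, u2)
    radius[0, 0] = 1.0  # avoid division by zero at the DC bin

    # Radial log-Gabor band-pass filter; its DC component is exactly zero.
    G = np.exp(-(np.log(radius / omega0) ** 2) / (2 * np.log(sigma) ** 2))
    G[0, 0] = 0.0

    F = np.fft.fft2(I) * G
    If = np.real(np.fft.ifft2(F))                        # band-passed I
    I1 = np.real(np.fft.ifft2(F * (-1j * u1 / radius)))  # Riesz, x direction
    I2 = np.real(np.fft.ifft2(F * (-1j * u2 / radius)))  # Riesz, y direction
    return If, I1, I2

rng = np.random.default_rng(0)
I = rng.uniform(size=(16, 16))
If, I1, I2 = monogenic(I)

# All three operators are linear, so a contrast rescaling of the input
# rescales every component by the same factor.
assert np.allclose(monogenic(2.0 * I)[0], 2.0 * If)
```

Note that the convolutions reduce to pointwise multiplications in the Fourier domain, which is the property M6 uses to avoid choosing a kernel size.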

VOLUME 4, 2016
The local amplitude A_M = A_M(x, y) is defined by the expression [35]:

A_M = sqrt(I^2 + I_1^2 + I_2^2).

The local phase I_φ and the local orientation I_θ associated to I are defined, again following [35], by the relationships

I_φ = atan2(|I_R|, I),    I_θ = atan2(I_2, I_1),

where |I_R| = sqrt(I_1^2 + I_2^2) and the signal quotients are taken pointwise. The geometric interpretation of the monogenic signal is depicted in Figure 1. Note how changes in the pixel value intensity I do not affect the local phase I_φ and the local orientation I_θ values. This theoretically invariant response of the local phase to large illumination changes is in line with that reported in [19], [30].
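Given the three monogenic components, the local amplitude, phase, and orientation, together with their invariance to a global contrast rescaling, can be checked directly. The component arrays below are synthetic, purely for illustration; in practice they come from the band-passed image and its Riesz transforms.

```python
import numpy as np

# Illustrative monogenic components (If, I1, I2).
rng = np.random.default_rng(1)
If = rng.normal(size=(8, 8))
I1 = rng.normal(size=(8, 8))
I2 = rng.normal(size=(8, 8))

A_M = np.sqrt(If**2 + I1**2 + I2**2)   # local amplitude A_M
I_R = np.hypot(I1, I2)                 # |I_R|
phase = np.arctan2(I_R, If)            # local phase I_phi
orientation = np.arctan2(I2, I1)       # local orientation I_theta

# A positive contrast factor k scales (If, I1, I2) uniformly, so the
# phase and orientation are unchanged while the amplitude scales by k.
k = 0.05
assert np.allclose(np.arctan2(np.hypot(k * I1, k * I2), k * If), phase)
assert np.allclose(np.arctan2(k * I2, k * I1), orientation)
```

Only the amplitude carries the contrast information; the two angular components, which M6 feeds forward, are geometrically immune to it.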

D. BIO-INSPIRED PROPERTIES AND TOOLS
This subsection highlights the main properties of the V1 cortex layer and how they are reflected in the functionality of M6. The reasons for choosing two bio-inspired technical tools as key ingredients as guides in its design and construction are also outlined.
V1 cells form the first layer of the hierarchical cortical processing [36]. The analogy for M6, quite literally, is that it is meant to be the first layer of a ConvNet.
Moreover, V1 neurons respond vigorously only to edges (odd signals) and lines (even signals) at a particular spatial direction through their orientation columns [31]. The counterpart of this in M6 is realized by computing the local phase and orientation, as these have the capability of detecting lines and edges and their orientations. This also justifies placing M6 as the first layer: the features in question are the most primitive, and hence it would not be very effective to use the M6 layer in a deeper position.
In [37], the authors reported a cortical adaptation to brightness and darkness in the primary visual cortex V1 of a macaque. In relation to this, it was previously noted in [26], [30], [31] that the local phase and local orientation should be robust (invariant) with respect to contrast changes due to their geometric representation.
Daugman [38] discovered that Gabor functions resembled the experimental findings of Hubel and Wiesel [14] on orientation selectivity of the visual cortical neurons of cats. However, Gabor functions (filters) may cause fairly poor pattern retrieval accuracy in certain applications (see [39]) because they have, for certain bandwidths, an undesirable non-zero value of the so-called DC component (cf. [39]). For pattern recognition applications, this DC component entails that a feature can change with the average value of the signal. Fortunately, this weakness can be overcome with log-Gabor filters, as explained in [39], and this is why they are used henceforth.
The other bio-inspired tool is the HSV color space. Although the main virtue of this space is that it best fits the human perception of color [40], it is used in this work as a means to geometrically code the phase and orientation in the Hue channel. This is quite natural, as the purpose of this channel is to hold a phase. The use of this transformation in the M6 layer results in an increase of classification accuracy (see Figure 3).

IV. MONOGENIC LAYER M6
Monogenic signal characteristics (contrast invariance based on its geometry, feature extraction in the frequency domain, and its similarities to the V1 properties) are the main stimuli for the design of our M6 trainable entry layer.
A scheme of the computational flow of the M6 unit on a one-channel image as input is displayed in Figure 2. For color images (I_c), the mean value of the channels was computed and used as the input value. The convolution operations are carried out in the Fourier domain, that is, on F(I), where I is the input image. This allows a straightforward implementation of the monogenic signal and avoids the problem of having to select a convolution kernel size, proceeding instead with a band-pass filter.
It is fundamental to note that M6 has only four trainable parameters (s, f, ω, σ), which are the log-Gabor function parameters (eq. (11)). Remember that s is the current scale indicator; σ, the variance of the log-Gabor; ω, the minimum wavelength; and f a scaling factor.
Although the local phase I_φ(x, y) (Eq. 13) and local orientation I_θ(x, y) (Eq. 14) are theoretically invariant to illumination changes, it was found that inserting additional processing operations is beneficial to better mimic the V1 behavior and increase classification performance, as explained below. Figure 3 presents an example of how performance increases by leveraging the RGB phases, especially for shallow architectures such as A1 (see Figure 10 for the ConvNet architecture definition). After the local phase is computed, a normalization step, I_θ, I_φ, I_R = Norm(I_θ, I_φ, I_R), is added, where Norm is generically defined by the following expression:

Norm(I)(x, y) = (I(x, y) − min(I(x, y))) / (max(I(x, y)) − min(I(x, y))).
Next, using the polar geometry of the HSV color space, the normalized local phase I_φ and local orientation I_θ are stacked in the hue channel (H) of an HSV color space, with I_R as the saturation (S) and the constant matrix 1_[m,n] as the value component (V). Finally, the HSV images are converted by the standard function HSV2RGB into RGB images (see [41, page 304]). As a result, an M6 unit produces six output feature maps per input channel. Figure 4 depicts an example of M6 outputs using a white circle as the input image.
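The normalization and HSV stacking steps can be sketched as follows. The pure-Python `colorsys` loop is for clarity only (the actual layer runs vectorized inside TF), and all input maps here are synthetic; the function names are ours.

```python
import colorsys
import numpy as np

def norm(a):
    """Min-max normalization to [0, 1] (the Norm step of the text)."""
    return (a - a.min()) / (a.max() - a.min())

def hue_encode(hue, sat):
    """Stack a normalized phase or orientation map into the hue channel
    of an HSV image (value fixed to 1) and convert it to RGB: three
    feature maps per input map, hence six M6 outputs per input channel."""
    h, w = hue.shape
    rgb = np.empty((h, w, 3))
    for i in range(h):
        for j in range(w):
            rgb[i, j] = colorsys.hsv_to_rgb(hue[i, j], sat[i, j], 1.0)
    return rgb

# Illustrative local phase / orientation / Riesz-magnitude maps.
rng = np.random.default_rng(1)
I_phi, I_theta, I_R = (norm(rng.normal(size=(8, 8))) for _ in range(3))

rgb_phi = hue_encode(I_phi, I_R)      # 3 feature maps from the phase
rgb_theta = hue_encode(I_theta, I_R)  # 3 feature maps from the orientation
print(rgb_phi.shape, rgb_theta.shape)  # (8, 8, 3) (8, 8, 3)
```

Encoding angles in the hue channel is natural because hue is itself a cyclic (phase-like) coordinate, so nearby angles map to nearby colors.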
For backpropagation, the automatic differentiation and weight updating implemented by TensorFlow 2 (TF) were used; their symbolic expressions are not reproduced here because they lie beyond the scope of this paper. Instead, two graphics that document the learning process are provided. Figure 5 illustrates how the M6 trainable parameters are adjusted during the training process using the MNIST dataset. Note that the major changes in the parameters take place in the first fifty epochs. Figure 6 displays the accuracy and loss in the training and validation processes corresponding to the same training job. The validation loss (val loss) and validation accuracy (val acc) undergo major changes before the fiftieth epoch, thus matching what was found in Figure 5. Validation loss rises slightly after the fiftieth epoch, a sign that the training has entered the overfitting regime. In addition, Figure 7 shows an example from another perspective, namely a CIFAR-10 image and the associated pre- and post-training feature maps, revealing that the post-training feature maps are sharper. This behavior is expected inasmuch as the trainable parameters define the band-pass size of the log-Gabor filter. Table 1 summarizes a comparison of some characteristics of a 2D convolutional layer with those of the proposed layer M6. An important difference between a conventional 2D ConvNet layer (C) and the proposed layer M6 is that the convolutions in the latter are carried out in the Fourier domain. This feature helps to avoid the problem of selecting a convolution kernel size depending on the input-image size, as the convolution in the Fourier domain is a pointwise multiplication and the band-pass filter size is learned with the log-Gabor parameters.
Furthermore, the number of trainable M6 parameters (four) is significantly lower than the number of C parameters (weights), which depends on the number and size of their filters. For example, for six C units with 3 × 3 filters, the number of parameters is 6 × 3^2 = 54. On the other hand, while both C and M6 can detect lines (l) and edges (e), C can usually detect more features, such as corners (c) or some irregular shapes, whereas M6 is also crucially sensitive to orientations (thus resembling V1). The orientation response is important to obtain robust performance even with rotated images. Another important difference is the activation function. The ReLU (Rectified Linear Unit) function is often used in C instances [42], while M6 uses arctan and arcsin as activation functions. This notwithstanding, both M6 and C behave similarly with respect to all the tested optimizers (for instance, SGD, ADAM, and NADAM). An important remark is that M6's trainability sets it apart from the deterministic (pre-processing) layers used in other systems, as stressed in Section II.

V. EXPERIMENTAL SETUP
The experimental setup is designed to evaluate the robustness and classification performance under different datasets (simple to complex task, different input shapes), architectures (shallow to deep) and degradations (three different types of contrast).

B. DEGRADING PROCEDURES
Three contrast transformations were used to degrade images I: the max-min scale transformation, C_S I; the TF contrast transformation, C_TF I; and the haze transformation, C_H I. Figure 8 displays an example of three degradation levels d_j (j = 1, 2, 3) for each degradation procedure applied to the image in column d_0 (no degradation). Max-min scale transformation (C_S). Applied to an image I(x, y) with respect to an interval S = [a, b], it produces an image C_S I(x, y) as follows:

C_S I(x, y) = a + (I(x, y) − min(I(x, y)))(b − a) / (max(I(x, y)) − min(I(x, y))).
The degradation levels d_1, d_2, and d_3 correspond to the intervals S defined by a = 0.3, 0.7, and 0.9, respectively, with b = 1 in all cases.
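A minimal sketch of the max-min scale degradation (the function name `degrade_scale` is ours, not the paper's):

```python
import numpy as np

def degrade_scale(I, a, b=1.0):
    """Max-min scale degradation C_S: squeeze pixel values into [a, b]."""
    return a + (I - I.min()) * (b - a) / (I.max() - I.min())

I = np.linspace(0.0, 1.0, 16).reshape(4, 4)
d3 = degrade_scale(I, a=0.9)  # strongest level: values squeezed into [0.9, 1]
assert np.isclose(d3.min(), 0.9) and np.isclose(d3.max(), 1.0)
```

With a = 0.9 the full dynamic range of the image collapses into the top tenth of the intensity scale, which is what makes d_3 so hard for amplitude-based layers.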
Contrast degradation using C_TF. This is defined by

C_TF I(x, y) = μ + F (I(x, y) − μ),

where μ is the mean value of the input image I and F ∈ [0, 1] is a contrast factor. The degradation levels d_1, d_2, and d_3 correspond, respectively, to the values F = 0.7, 0.3, and 0.1.
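A NumPy sketch of C_TF, mirroring the semantics of TF's `tf.image.adjust_contrast` with a factor F in [0, 1] (the function name is ours):

```python
import numpy as np

def degrade_contrast(I, F):
    """Contrast degradation C_TF: shrink deviations from the mean by F."""
    mu = I.mean()
    return mu + F * (I - mu)

I = np.linspace(0.0, 1.0, 16).reshape(4, 4)
d3 = degrade_contrast(I, F=0.1)  # strongest level
assert np.isclose(d3.mean(), I.mean())  # mean luminance preserved
assert d3.std() < I.std()               # contrast reduced
```

Unlike C_S, this transformation keeps the mean luminance fixed and shrinks only the deviations around it, so the two degradations probe different failure modes.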
Contrast degradation by haze C_H. This method works by adding non-uniform haze to a given image. The main basis for gauging haze and fog in images stems from the atmospheric scattering model proposed by [45], which can be summarized, following [46], by the equation

C_H I(x, y) = t(x, y) I(x, y) + (1 − t(x, y)) A(x, y).     (18)

In this expression, I(x, y) denotes what the image would be if haze were removed and C_H I(x, y) is the measured (observed) hazed image. The rationale behind (18) is that I(x, y) undergoes an attenuation t(x, y)I(x, y) caused by the medium transmission t(x, y) ∈ [0, 1], which measures the fraction of light reaching the camera from the (x, y) direction. Its value is 1 for perfectly transparent air and 0 when no light from (x, y) reaches the camera. The term A(x, y) denotes the total atmospheric light, and hence (1 − t(x, y))A(x, y) measures the fraction of light contributing to C_H I(x, y) not originating from the source (x, y), usually produced by scattering and reflection processes. The main point of contrast degrading with C_H is thus using equation (18): generating various values of A and, from these and I(x, y), estimating t(x, y) using the dark channel prior proposed in [46]. It remains to explain how the various As are generated. The components of an RGB image I (I_r, I_g, I_b) allow handling I(x, y) as a 3-vector, I(x, y) = [I_r(x, y), I_g(x, y), I_b(x, y)]. In this case, A(x, y) must also be a 3-vector, say A(x, y) = [A_r(x, y), A_g(x, y), A_b(x, y)], and as a result, C_H I(x, y) can be treated in the same way. The generation of A vectors is done here by choosing each channel c independently and randomly in the interval [0.8, 1], A_c ∈ [0.8, 1]. Note that the contribution of A is not limited to the light intensity, as it produces changes in color, a fact that is in accordance with the physical effects of the atmospheric light. See Figure 9 for an example of the transmission map estimation for CIFAR-10 and DvsC.
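A sketch of the haze model of equation (18). Here the transmission map t is drawn synthetically rather than estimated with the dark channel prior as in the experiments, and the function name is ours:

```python
import numpy as np

def degrade_haze(I, t, A):
    """Haze degradation from the atmospheric scattering model:
    C_H = t*I + (1 - t)*A, with transmission t in [0, 1] and
    per-channel atmospheric light A drawn from [0.8, 1]."""
    return t[..., None] * I + (1.0 - t[..., None]) * A

rng = np.random.default_rng(2)
I = rng.uniform(size=(4, 4, 3))          # RGB image
t = rng.uniform(0.3, 1.0, size=(4, 4))   # non-uniform transmission map
A = rng.uniform(0.8, 1.0, size=3)        # atmospheric light, per channel
hazy = degrade_haze(I, t, A)
assert hazy.shape == I.shape

# With t == 1 (perfectly transparent air) the image is unchanged.
assert np.allclose(degrade_haze(I, np.ones((4, 4)), A), I)
```

Because t varies across the image and A differs per channel, this degradation is both spatially non-uniform and color-shifting, which is what makes it the hardest of the three.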

C. ARCHITECTURES, TRAINING AND TEST
The three architectures used in this work are labeled A1, A2, and A3. A1 is a shallow ConvNet; A2, a medium-depth ConvNet; and A3, a deep ConvNet that uses a ResNet20 [47] as a subnet or backbone. Each of these architectures actually stands for three: one, AjC, topped by a standard 2D convolutional layer (C); another, AjQ9, topped by the Q9 layer defined in [19]; and a third, AjM6, topped by the M6 layer. See Figure 10 for details on each of them. The classification robustness of the M6 layer was compared against the conventional ConvNet layer (kernel 3 × 3, with six output channels and no data augmentation in the training process). The K in the last layer (with a softmax function) denotes the size of the output layer (number of classes), which is 10 for CIFAR-10, MNIST, and f-MNIST and 2 for DvsC. Note that the large depth of A3 does not allow processing small-size images such as those from MNIST and f-MNIST. Additional architectural details are summarized in Table 2; the terminology used follows the conventions of the TF framework. For more details, see the source code available through the link provided below.
To evaluate the response robustness, the models were trained with a specific data degradation and tested not only with the same data degradation but also with three additional data degradations. This kind of experimental setup is common in other works to evaluate robustness, such as [2], [19], [24]. The experimental setup concerns the aforementioned four primary datasets, the nine networks summarized in Figure 10, and the three contrast degradation methods. For each degradation method and each primary dataset, three additional datasets were constructed by applying the degradation method to the primary dataset with degradation levels d_1, d_2, and d_3. Then, each network was trained four times, once for each of the degraded datasets, including the primary one (degradation level d_0). Note that all training was done from scratch. Finally, each of the four trained classifiers was tested on the four degraded datasets. These arrangements are sketched in Table 3. As indicated above, the A3 networks can be used only on CIFAR-10 and DvsC data.
The hyperparameters used in the experimental setup are a learning rate of 0.001, 100 epochs, a batch size of 128, ReLU as the activation function for C, and ADAM as the optimizer. Keras Tuner (Bayesian optimization) [48] was employed to choose the kernel size of the first 2D convolutional layer and the learning rate of the ConvNet. The initial parameters of the M6 layer are s = 1, f = 1, σ = 0.33, and ω = 1, based on the previous version of the layer [26]. All datasets were normalized to [0, 1] by dividing each pixel value by 255. The TF 2.1 deep learning framework, run on an Nvidia V-100 GPU, was used for all experiments. Reproducibility is supported by the supplemental material uploaded at the GitLab link M6 project.

D. METRICS FOR ANALYSIS
A Structural Similarity Index Measure (SSIM) index was chosen in order to compare the effects of the degradation procedures on the M6 feature maps (in Section VI). This index was selected because it allows numerically comparing the feature-map changes made by the degradations, taking into account some key aspects of human perception [49]. Moreover, SSIM can quantify image quality as the perceived changes in the structural information (SSIM map), and at the same time, SSIM takes the luminance and contrast changes into account. The SSIM index formula is defined as follows [49]:

SSIM(x_1, y_1) = ((2 μ_x1 μ_y1 + c_1)(2 σ_x1y1 + c_2)) / ((μ_x1^2 + μ_y1^2 + c_1)(σ_x1^2 + σ_y1^2 + c_2)),

where x_1 and y_1 are two arrays of size N × N; μ_x1, μ_y1 are the averages, and σ_x1^2 and σ_y1^2 the variances, of x_1 and y_1, respectively; σ_x1y1 is the covariance of x_1 and y_1; and c_1 = (k_1 L)^2 and c_2 = (k_2 L)^2, with L being the dynamic range, k_1 = 0.01, and k_2 = 0.03. The SSIM value belongs to the interval [0, 1], with SSIM = 1 corresponding to maximum similarity and SSIM = 0 to minimum similarity.
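A direct, single-window transcription of the SSIM formula above can be sketched as follows (library versions such as `skimage.metrics.structural_similarity` use a sliding window instead; the function name is ours):

```python
import numpy as np

def ssim_global(x1, y1, L=1.0, k1=0.01, k2=0.03):
    """Global SSIM between two arrays, computed directly from the formula
    with a single window covering the whole array."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mx, my = x1.mean(), y1.mean()
    vx, vy = x1.var(), y1.var()
    cov = ((x1 - mx) * (y1 - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(3)
a = rng.uniform(size=(16, 16))
assert np.isclose(ssim_global(a, a), 1.0)      # identical images
assert ssim_global(a, 0.3 + 0.2 * a) < 1.0     # contrast-degraded copy
```

The second assertion mirrors the analysis of Section VI: a contrast-squeezed copy of an image scores well below 1 against the original.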
In addition, the Peak Signal-to-Noise Ratio (PSNR) was used to evaluate the robustness of the proposed layer. The PSNR is the ratio between the maximum possible signal (image) power and the power of the corruption in a degraded image [50]. PSNR can be defined via the Mean Squared Error (MSE). Given an m × n monochrome image I and its corrupted approximation K, the MSE is defined as:

MSE = (1 / (m n)) Σ_{i=0}^{m−1} Σ_{j=0}^{n−1} [I(i, j) − K(i, j)]^2.

The PSNR (in decibels) is defined as:

PSNR = 10 log_10(MAX_I^2 / MSE),

where MAX_I is the maximum possible pixel value of the image; for instance, when the pixels are represented using 8 bits per sample, this is 255. In general, the higher the PSNR value, the better. Note that for two equal images, the MSE will be zero and the PSNR value will be infinite.

Figures 11, 12, and 13 present a grid of boxen plots of the classification testing-accuracy values according to the experimental setup. Each box represents five numbers: minimum (bottom whisker), first quartile (Q_1), median (Q_2), third quartile (Q_3), and maximum (top whisker). C and Q9 were compared against M6 (grid columns) using four datasets (box color), three architectures (grid rows), and four degradation levels (using C_S, C_TF, and C_H, respectively). From a wider angle, the analysis of these results can be divided into two major classes: i) evaluating the robustness using the size of the box and whiskers, and ii) evaluating the maximum performance with the top whisker. For i), these tests highlight that the ConvNets with M6 present the smallest boxes in all cases. These results provide strong evidence for the robustness effect of M6 under different contrast degradations. Please note that training was performed on each degradation level and testing with all degradation levels. According to these results, M6 can be trained with any degradation level and yet achieve almost the same performance. This is in marked contrast to the C models, which have quite low performance when trained with d_3 degraded images.
For ii), dashed lines were added to facilitate viewing the maximum values for each model type (A1, A2, and A3) and each dataset. Maximum performance occurs in 17/26 cases for C, 3/26 for Q9, and 6/26 for the M6 models. However, ConvNets with M6 have the performance closest to the maximum test-accuracy value, as is easily confirmed by computing the square of the difference with respect to each maximum over all models, resulting in 29.3, 4.7, and 1.0 for the C, Q9, and M6 architectures, respectively. It is important to note that the maximum values from the C models are generally achieved when the training uses the same data degradation as the test dataset. However, for the maximum degradation level d_3, the classification performance is significantly lower, even with the same data degradation level as that used for training.

VI. RESULTS AND ANALYSIS
The results also reveal parallel behavior patterns between the C and M6 models. For instance, all models have higher test-accuracy performance with simple datasets (MNIST and f-MNIST). In addition, performance tends to increase with the depth of the architecture. However, it is important to remark that the single M6 unit, which has only four weights, can extract enough features for different types of datasets, in contrast to the C layer with six convolution units and fifty-four weights. Moreover, the increase in test accuracy of M6 with the number of layers can be ascribed to the fact that the output features of M6 can be learned and combined by the later layers, similar to how the outputs of a conventional convolutional layer are processed. Combining the robustness and the maximum classification performance of M6, the later layers are boosted in efficacy when fed with the M6 output rather than having to elicit comparable features from a raw input. In other words, M6 delivers sharper features for image classification.
As stressed in the description of M6, its robust performance stems from the geometry behind its design. One way to evaluate (numerically) its resilience to contrast changes is by using the SSIM index and the PSNR. The main idea is to compute the SSIM and PSNR between $d_0$ and $d_j$ images ($j = 1, 2, 3$) and compare them with the two SSIM indices of the image transformed by M6 (whose output is regarded as two images, $RGB_\theta$ and $RGB_\phi$). The SSIM comparison is even more forceful when the indices are displayed together with the SSIM maps. In the top row of Figure 14, for example, SSIM is applied to a $d_0$ image and its $d_3$ degradation by $C_S$, and it is not surprising to find a low index, 0.31, and a discrepancy map that is quite close to the original image. In the bottom row, SSIM is applied to the same images after being transformed by M6, and it can be seen that the two indices are high (close to 1) and that the discrepancy maps have very small values, which means that M6 drastically reduces the discrepancy between M6($d_0$) and M6($d_3$). This behavior is compelling evidence of M6's robustness and suggests the potential of layers designed on similar, possibly more general, principles. Figure 15 shows a box plot of the results of a similar computation, this time involving 1000 random images, three degradation methods ($C_S$, $C_{TF}$, $C_H$), and three degradation levels $d_j$ ($j = 1, 2, 3$). It is clear that the box-plot values for the M6 transformation are more compact and quite close to one. This confirms and explains the findings of the preceding example, namely that M6 tends to see the $d_j$ degraded images as a $d_0$ image. The corresponding PSNR results are also reported.

To compare the computational performance of M6 against C, the time and memory consumption in training and testing were averaged, showing that the M6 architectures spend 7% more time on training and 26% more time on testing, and consume five times as much memory.
This difference can be attributed to the fast Fourier transform computation, and further work will be needed to optimize memory consumption. Note also that its performance cannot be compared with that of Q9, because the Q9 code is only available for CPU, while M6 runs on GPU.
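The contrast-invariance effect that SSIM detects in the M6 feature maps can also be sketched numerically. The snippet below is an illustrative stand-in, not the paper's implementation: it uses a single-window SSIM (no sliding window) and a simple mean/variance normalization, `phase_like`, as a proxy for the contrast-invariant local phase computed by M6; all names are ours.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over the whole image (no sliding window, for illustration)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()
    return ((2*mx*my + c1) * (2*cxy + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

def phase_like(img):
    """Contrast-normalized representation: invariant to positive affine intensity
    changes, mimicking the contrast invariance of the monogenic local phase."""
    return (img - img.mean()) / img.std()

rng = np.random.default_rng(0)
d0 = rng.random((64, 64))  # stand-in for a clean image d_0
d3 = 0.25 * d0             # severe global contrast compression (a C_S-style degradation)

print(round(global_ssim(d0, d3), 3))                          # low: raw images disagree
print(round(global_ssim(phase_like(d0), phase_like(d3)), 3))  # prints 1.0: maps agree
```

The second index is 1 because a positive rescaling of the image leaves the normalized representation unchanged, which is exactly the behavior observed for M6($d_0$) versus M6($d_3$).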

VII. CONCLUSIONS AND FUTURE WORK
A new trainable bio-inspired front-end layer for ConvNets has been designed and presented. This new layer generates a 3D geometric representation of each pixel value by computing the quaternion monogenic signal in the Fourier domain. As a result, it is possible to leverage the local phase and the local orientation to elicit low-level geometric features, such as oriented lines or edges. Coupling the proposed layer with a regular ConvNet, or with dense networks, achieves an image classification that is little affected by severe contrast degradations and that, on the whole, has better accuracy. The experimental results are consistent with the SSIM and PSNR results and with the geometrical observation that the local phase and the local orientation are invariant to variable contrast conditions. The impact of this work is to be sought in situations where an invariant response to contrast alterations is required. Among the possible scenarios we count self-driving cars under haze conditions, surface glazes in medical images (biopsies), and round-the-clock autonomous video surveillance. The authors hope that this research will serve as one of the lines for future studies on equivariant and invariant representations by ConvNets. In addition, we will try to compare or combine this layer with other approaches, such as deformable ConvNets [51] or depthwise ConvNets [52], among others. Moreover, further work is needed to extend this approach to other tasks, such as object detection or segmentation.
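For readers who want to experiment with the geometric idea summarized above, the following sketch computes a monogenic local phase and local orientation via the Riesz transform in the Fourier domain. It is a deliberate simplification of M6 (the trainable band-pass filtering and quaternion bookkeeping of the actual layer are omitted), but it exhibits the contrast invariance the conclusions rely on: scaling the image by a positive constant scales the signal and both Riesz components equally, leaving both angles unchanged.

```python
import numpy as np

def monogenic_phase_orientation(img):
    """Monogenic local phase and local orientation via the Riesz transform,
    computed in the Fourier domain (a minimal sketch, not the M6 layer itself)."""
    rows, cols = img.shape
    u = np.fft.fftfreq(cols)[None, :]
    v = np.fft.fftfreq(rows)[:, None]
    radius = np.sqrt(u**2 + v**2)
    radius[0, 0] = 1.0                             # avoid division by zero at DC
    h1, h2 = -1j * u / radius, -1j * v / radius    # Riesz transfer functions
    spectrum = np.fft.fft2(img)
    r1 = np.real(np.fft.ifft2(spectrum * h1))      # first Riesz component
    r2 = np.real(np.fft.ifft2(spectrum * h2))      # second Riesz component
    phase = np.arctan2(np.sqrt(r1**2 + r2**2), img)  # local phase
    orientation = np.arctan2(r2, r1)                 # local orientation
    return phase, orientation

# A positive contrast change leaves both angles (numerically) unchanged.
rng = np.random.default_rng(1)
f = rng.random((32, 32))
p1, o1 = monogenic_phase_orientation(f)
p2, o2 = monogenic_phase_orientation(0.3 * f)
print(np.allclose(p1, p2, atol=1e-6), np.allclose(o1, o2, atol=1e-6))
```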

APPENDIX A QUATERNION ALGEBRA
Hamilton's quaternions have many representations in geometric algebra. For example, they can be represented as the even subalgebra of $G(3, 0)$ with basis $\{1, e_2e_3, e_3e_1, e_1e_2\}$, as may be seen by identifying $i \mapsto -e_2e_3$, $j \mapsto -e_3e_1$, and $k \mapsto -e_1e_2$. This work uses the most widely known definition: the quaternion algebra $\mathbb{H}$ is a four-dimensional real vector space with basis $1, i, j, k$, endowed with the bilinear product (multiplication) defined by Hamilton's relations, namely

$$i^2 = j^2 = k^2 = ijk = -1. \qquad (23)$$

As is easily seen, these relations imply that

$$ij = -ji = k, \quad jk = -kj = i, \quad ki = -ik = j. \qquad (24)$$

The elements of $\mathbb{H}$ are named quaternions, and $i, j, k$ quaternionic units. By definition, a quaternion $q$ can be written in a unique way in the form $q = a + bi + cj + dk$, with $a, b, c, d \in \mathbb{R}$, and its conjugate is $\bar q = a - bi - cj - dk$.
Note that $(q + \bar q)/2 = a$, which is called the real part or scalar part of $q$, and $(q - \bar q)/2 = q - a = bi + cj + dk$ the vector part of $q$.
Since the conjugates of $i, j, k$ are $-i, -j, -k$, the relations (23) and (24) imply that conjugation is an antiautomorphism of $\mathbb{H}$, which means that it is a linear bijection satisfying $\overline{pq} = \bar q\,\bar p$. Using Hamilton's relations again, we easily conclude that $q\bar q = a^2 + b^2 + c^2 + d^2 = |q|^2$.
Finally, for $q \neq 0$ we have $|q| > 0$ and $q(\bar q/|q|^2) = 1$, which shows that any non-zero quaternion has an inverse and therefore that $\mathbb{H}$ is a (skew) field.
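The algebraic facts above are easy to check mechanically. The following minimal quaternion class (our own illustrative code, not tied to the paper's implementation) encodes the Hamilton product and verifies the relations (24), the norm, and the inverse formula:

```python
from dataclasses import dataclass
import math

@dataclass(frozen=True)
class Quaternion:
    """Quaternion q = a + b*i + c*j + d*k with Hamilton's multiplication rules."""
    a: float
    b: float
    c: float
    d: float

    def __mul__(self, q):
        # Hamilton product, expanded from i^2 = j^2 = k^2 = ijk = -1 (relation (23))
        return Quaternion(
            self.a*q.a - self.b*q.b - self.c*q.c - self.d*q.d,
            self.a*q.b + self.b*q.a + self.c*q.d - self.d*q.c,
            self.a*q.c - self.b*q.d + self.c*q.a + self.d*q.b,
            self.a*q.d + self.b*q.c - self.c*q.b + self.d*q.a,
        )

    def conj(self):
        """Conjugate: q̄ = a - b*i - c*j - d*k."""
        return Quaternion(self.a, -self.b, -self.c, -self.d)

    def norm(self):
        """|q| = sqrt(q q̄) = sqrt(a^2 + b^2 + c^2 + d^2)."""
        return math.sqrt(self.a**2 + self.b**2 + self.c**2 + self.d**2)

    def inverse(self):
        """q^{-1} = q̄ / |q|^2, defined for q != 0."""
        n2 = self.norm() ** 2
        c = self.conj()
        return Quaternion(c.a/n2, c.b/n2, c.c/n2, c.d/n2)

i = Quaternion(0, 1, 0, 0)
j = Quaternion(0, 0, 1, 0)
k = Quaternion(0, 0, 0, 1)
print(i * j == k, j * i == Quaternion(0, 0, 0, -1))  # True True  (ij = k, ji = -k)
q = Quaternion(1, 2, 3, 4)
print(q.norm() ** 2)  # prints 30.0, i.e. 1 + 4 + 9 + 16
```

A quick usage check of the inverse: `q * q.inverse()` returns the identity quaternion up to floating-point rounding, confirming the skew-field property stated above.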