Learning 2D Gabor Filters by Infinite Kernel Learning Regression

Gabor functions have widespread applications in image processing and computer vision. In this paper, we prove that 2D Gabor functions are translation-invariant positive-definite kernels and propose a novel formulation for the problem of image representation with Gabor functions based on infinite kernel learning regression. Using this formulation, we obtain a support vector expansion of an image based on a mixture of Gabor functions. The problem with this representation is that all Gabor functions are present at all support vector pixels. Applying LASSO to this support vector expansion, we obtain a sparse representation in which each Gabor function is positioned at a very small set of pixels. As an application, we introduce a method for learning a dataset-specific set of Gabor filters that can be used subsequently for feature extraction. Our experiments show that using the learned Gabor filters improves the recognition accuracy of a recently introduced face recognition algorithm.


Introduction
Gabor functions are extensively used for feature extraction in numerous computer vision applications such as face recognition [7,36,47,28,45], image retrieval [25], palmprint recognition [34,33], forgery detection [23], and facial expression recognition [46,19]. Neurologists have shown that the receptive fields of simple cortical cells can be modeled with Gabor functions [10,11,29]. For this reason, computational models for visual recognition that are inspired by the visual cortex use Gabor filters in the very first phase of feature extraction [40]. Even now that deep learning methods have remarkably influenced the field of machine vision, combining Gabor functions with convolutional neural networks (CNNs) has been observed to improve recognition rates [8].
In this paper, we show that Gabor functions are translation-invariant (also called stationary) positive-definite kernels. It is somewhat strange that this fact, despite its simple proof, has gone unnoticed by researchers: it is mentioned neither in classical books such as [39,9,41] nor in the seminal work of Genton [15], who reviewed the class of stationary kernels. We believe that the positive-definiteness of Gabor functions can be exploited in numerous ways for applying kernel algorithms to machine vision problems. In this paper, we target the problem of learning Gabor filters from data, which in kernel-methods terminology is a kernel learning problem. Perhaps the most widespread kernel learning approach is the multiple kernel learning (MKL) framework, which seeks the best convex combination of a finite set of kernel functions [22,3,42,35,44,21,6]. One weakness of the MKL framework is that the initial set of kernel functions must be chosen by hand. To overcome this limitation, the infinite kernel learning (IKL) framework was introduced, in which the set of initial kernels is extended to an infinite number of kernels parameterized over a continuous space [30,1,2,14,32,31,17]. Some of the solutions proposed for this problem were restricted to Gaussian kernels [30,1,2] and some were restricted to binary classification with support vector machines [14,32,31,17]. However, to apply these IKL algorithms to the problem of learning Gabor functions, one would have to formulate that problem as binary classification, which seems to be impossible. Fortunately, Ghiasi-Shirazi [18] generalized the stabilized infinite kernel learning (SIKL) algorithm [17] to a more general class of machine learning problems that includes ε-insensitive support vector regression (SVR). In this paper, we reduce the problem of representing an image with Gabor functions to the problem of learning a convex combination of an infinite number of Gabor kernels for regression.
This gives us a mixture of Gabor functions that, when placed at the positions determined by the support vectors, reconstructs the given image. As a practical application of the SIKL algorithm, we propose a simple method for learning Gabor functions for a specific dataset of images from a tiny fraction of its images. However, the representation obtained by the SIKL algorithm has the problem that all Gabor functions are present at all support vector pixels. This may arouse the suspicion that the SIKL algorithm merely learns a universal approximator kernel that is subsequently used by SVR for representing the input image, which would rule out any link between the Gabor functions generating an image and the learned Gabor functions. In fact, we will show experimentally that the mixture of Gabor functions learned by the SIKL algorithm is approximately a highly concentrated Laplacian kernel. Using the LASSO algorithm [43], we obtain a sparse representation of the original image in which the Gabor functions are located at a very sparse set of pixels. Experimental results on artificial images generated by combining two Gabor functions confirm the potential of our sparse representation algorithm in discovering the scales, orientations, and locations of the constituting Gabor functions.
In Section 2, we give a concise and simplified introduction to SIKL regression (the general form of the SIKL algorithm and its mathematical analysis can be found in [18]). We introduce our method for representing an image as a mixture of Gabor functions in Section 3. Our algorithm for choosing the parameters of Gabor filters for a specific dataset is given in Section 4. In Section 5, we show how LASSO can be utilized to obtain a sparse Gabor-based representation of an image. We experimentally evaluate the proposed method in Section 6 and conclude the paper in Section 7. In Appendix A, we give a formal proof of the positive-definiteness of Gabor functions.

Stabilized infinite kernel learning regression
The stabilized infinite kernel learning (SIKL) algorithm was initially introduced in [17] for binary classification and was then generalized to broader classes of machine learning problems in [18]. In this section, we give a short and simple introduction to SIKL regression, without going into mathematical details and without presenting the SIKL framework in its most general form. For a comprehensive introduction to the SIKL framework, the reader is referred to [18].
Assume that the training set consists of the input samples $x_1, \ldots, x_l \in \mathbb{R}^d$ and their corresponding target values $y_1, \ldots, y_l \in \mathbb{R}$. Support vector regression attempts to learn the relation between input and output by a function of the form: The coefficients $\hat{\beta}$ and $\beta$ are obtained by solving the following optimization problem [9]: where $C \in (0, \infty]$ is a regularization constant. The above optimization problem can be rewritten in the following more succinct form: where $K$ is the kernel matrix obtained by applying the kernel function $k$ to the input samples $x_1, \ldots, x_l$, and $I$ is the identity matrix of size $l$. Consider the set of kernels $\{k_\gamma : \gamma \in \Gamma\}$, where $\Gamma$ is a continuously parameterized index set. Let $P(\Gamma)$ be the set of all probability measures on $\Gamma$. It can be shown (see [30]) that for any probability measure $p \in P(\Gamma)$, the function $k_p(x, z) = \int_{\Gamma} k_\gamma(x, z)\, dp(\gamma)$ (Eq. (5)) is a convex combination of the set of kernels $\{k_\gamma : \gamma \in \Gamma\}$. Conversely, any convex combination of the set of kernels $\{k_\gamma : \gamma \in \Gamma\}$ can be written in the form of Eq. (5). In the IKL framework, it is assumed that $\Gamma$ is a compact Hausdorff space (e.g., a bounded and closed subset of $\mathbb{R}^2$) and the problem is to find the best kernel in the form of Eq. (5). The SIKL framework relaxes the assumption on $\Gamma$ to locally compact Hausdorff spaces (e.g., $\mathbb{R}$ or $\mathbb{R}^2$). For mathematical concreteness, and to provide a mechanism for controlling the capacity of the learning machine, the SIKL framework introduces a vanishing function $G(\gamma) : \Gamma \to [0, 1]$ into the framework.
The stabilized convex combination of the kernels $\{k_\gamma : \gamma \in \Gamma\}$ with stabilizer $G(\gamma)$ and probability measure $p \in P(\Gamma)$ is defined as: Correspondingly, the set of stabilized convex combinations of the kernels $\{k_\gamma : \gamma \in \Gamma\}$ with stabilizer $G(\gamma)$ is defined as: The problem of simultaneously learning the regression function along with a kernel function $\tilde{k}_p \in \mathcal{K}$ can be formulated as: where $K(\gamma)$ is the kernel matrix that is obtained by applying the kernel function $k_\gamma$ to the input training data $x_1, \ldots, x_l$. Ghiasi-Shirazi [18] proved that the probability measure that optimizes the above problem is discrete with finite support. The SIKL toolbox optimizes the above problem by semi-infinite programming and returns the weights $\mu_i$ and the parameters $\gamma_i \in \Gamma$ which identify the optimal kernel $\tilde{k}$ by the following formula: We added the Gabor kernel to the SIKL toolbox and exploited some special properties of Gabor kernels to optimize the toolbox. Specifically, since Gabor kernels are two-dimensional, we modified the global optimization algorithm of SIKL to search the space of parameters systematically.
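Because the optimal probability measure is discrete with finite support, the learned kernel can be evaluated as a finite sum. The following minimal sketch illustrates this, assuming the optimal kernel has the form $\sum_i \mu_i G(\gamma_i) k_{\gamma_i}$. The Gaussian kernel family, the linear `stabilizer`, and all function names are hypothetical choices made for illustration; they are not the SIKL toolbox API.

```python
import numpy as np

def gaussian_kernel_matrix(X, gamma):
    """Kernel matrix of k_gamma(x, z) = exp(-gamma * ||x - z||^2) (illustrative family)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def stabilizer(gamma, gamma_max=10.0):
    """Hypothetical stabilizer G with values in [0, 1] that vanishes for gamma >= gamma_max."""
    return float(np.clip(1.0 - gamma / gamma_max, 0.0, 1.0))

def combined_kernel_matrix(X, mus, gammas):
    """Assumed finite expansion: sum_i mu_i * G(gamma_i) * K(gamma_i)."""
    mus = np.asarray(mus, dtype=float)
    if np.any(mus < 0) or not np.isclose(mus.sum(), 1.0):
        raise ValueError("mus must be a probability vector")
    K = np.zeros((X.shape[0], X.shape[0]))
    for mu, gamma in zip(mus, gammas):
        K += mu * stabilizer(gamma) * gaussian_kernel_matrix(X, gamma)
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
K = combined_kernel_matrix(X, mus=[0.5, 0.3, 0.2], gammas=[0.1, 1.0, 5.0])

# A non-negative combination of positive semi-definite matrices stays
# positive semi-definite, so the smallest eigenvalue is >= 0 up to rounding.
min_eig = np.linalg.eigvalsh(K).min()
```

The resulting matrix can be passed to any kernel regression solver that accepts a precomputed kernel matrix.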

Gabor-based image representation using SIKL
In this section, we show how SIKL regression can be applied to the task of image representation by Gabor functions. We consider the following form for Gabor functions, which is essentially a slightly modified version of the form chosen by [36]: where the point $(x_0, y_0)$ is the center of the Gabor function in the spatial domain and the parameters $\omega$ and $\theta$ determine the scale and orientation of the Gabor function, respectively. Note that, in Eq. (10), the only inputs are $x$ and $y$, while $x_0$ and $y_0$ are parameters of the Gabor function. By considering $x_0$ and $y_0$ as inputs as well, we arrive at the following definition for Gabor kernels: There is another parameterization for Gabor functions, which is obtained from Eq. (10) by setting $\omega = \frac{\pi/2}{2^{\nu/2}}$ and $\theta = \mu \frac{\pi}{8}$. This $\mu\nu$-parameterization is especially important since manual selection of Gabor parameters is usually done in that form. We use this form when a parameter is to be chosen by hand or when reporting the learned parameters of Gabor functions. Appendix A elaborates on the chosen form for Gabor functions and gives a proof of the positive-definiteness of Gabor kernels. Now, assume that we want to search for the best convex combination of Gabor kernels whose scale parameters are in the range $[\omega_1, \omega_{u1}]$. This choice corresponds to a rectangular vanishing function in the SIKL formulation, which is not appropriate because of its jumps from 0 to 1 and vice versa.
Therefore, we choose the following trapezoidal stabilizing function: The stabilized convex combination of Gabor functions with stabilizer $G(\omega, \theta)$ and probability measure $p$ is defined as: Consequently, the set of stabilized convex combinations of Gabor functions with stabilizer $G(\omega, \theta)$ can be expressed as: As stated previously, although the optimization is over a continuous space of parameters, the optimal kernel has a finite expansion of the form: For a given image $I$, we generate a training set that consists of the positions of pixels as inputs and the intensities at those pixels as desired outputs. We then use the SIKL regression algorithm to learn the above kernel and the parameters of an SVR machine simultaneously in order to predict the intensity of each pixel correctly. The solution of the SIKL problem gives the number of participating kernels $m$, the Gabor parameters $\omega_i$ and $\theta_i$ for $i = 1, \ldots, m$, and the support vector coefficients $\hat{\beta}_i, \beta_i$ for $i = 1, \ldots, l$, where $l$ is the number of pixels in the image, such that: This representation identifies the Gabor functions that contribute to the construction of the input image $I$.
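The construction of the training set and the Gabor kernel matrix can be sketched as follows. This is a minimal illustration under stated assumptions: the kernel below is a generic real Gabor function (a Gaussian envelope multiplied by a cosine carrier) that depends only on the difference of the two pixel positions, and it is not claimed to coincide with Eq. (11). The envelope width `SIGMA` and all function names are hypothetical.

```python
import numpy as np

SIGMA = 2.0 * np.pi  # assumed envelope width; not taken from the paper

def gabor_kernel(p, q, omega, theta, sigma=SIGMA):
    """Assumed real Gabor kernel between pixel positions p and q (shape (..., 2)).

    Gaussian envelope times cosine carrier; it depends only on p - q,
    so it is translation-invariant.
    """
    d = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    r2 = np.sum(d**2, axis=-1)
    carrier = np.cos(omega * (d[..., 0] * np.cos(theta) + d[..., 1] * np.sin(theta)))
    return np.exp(-(omega**2) * r2 / (2.0 * sigma**2)) * carrier

def pixel_training_set(image):
    """Inputs are pixel coordinates (x, y); targets are the intensities at those pixels."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    return coords, image.ravel().astype(float)

def gram_matrix(coords, omega, theta):
    """Kernel matrix over all pairs of pixel positions."""
    return gabor_kernel(coords[:, None, :], coords[None, :, :], omega, theta)

# Synthetic 8x8 image used only to exercise the functions.
image = np.random.default_rng(0).random((8, 8))
coords, targets = pixel_training_set(image)
K = gram_matrix(coords, omega=np.pi / 4, theta=np.pi / 8)

# Gaussian and cosine factors are both positive-definite functions of p - q,
# and so is their product; the smallest eigenvalue is >= 0 up to rounding.
min_eig = np.linalg.eigvalsh(K).min()
```

Every pixel contributes one training pair, so an image of width $w$ and height $h$ yields a $wh \times wh$ kernel matrix per candidate $(\omega, \theta)$, which is why the experiments operate on small image regions.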

Learning dataset-specific Gabor filters
When Gabor filters are used for feature extraction from a dataset, their parameters are usually tuned by hand, and it is customary to use 40 Gabor functions with 5 scales and 8 orientations [27,26,36,20]. However, since Gabor functions are defined over pixel space, the appropriate choice of their parameters is sensitive to the resolution of the images. In Section 3, we proposed an algorithm for learning an image representation based on Gabor functions by SIKL. It is an accepted practice in machine learning that the first phases of information processing usually model the distribution of the input data, while the task of discrimination is assigned to higher layers [5,12]. We therefore assume that the Gabor functions that are appropriate for representing an image can also be used for feature extraction. By clustering the parameters obtained from a small fraction of the images of a dataset using the k-means algorithm, we obtain a set of Gabor functions that are appropriate for representing any image in that dataset. Dataset-specific details of our method for learning Gabor filters for the CMU-PIE and EYaleB datasets are given in Section 6.1.
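The clustering step can be sketched with a plain implementation of Lloyd's k-means algorithm. The sketch assumes that the learned parameters are pooled as $(\nu, \mu)$ pairs and clustered with the ordinary Euclidean distance; the text does not specify the parameter space or distance, and this choice ignores the periodicity of the orientation. The pooled parameters below are random placeholders, not learned values.

```python
import numpy as np

def assign(points, centers):
    """Index of the nearest center (Euclidean distance) for every point."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign(points, centers)

# Placeholder for the (nu, mu) pairs collected from several images.
rng = np.random.default_rng(1)
pooled = np.column_stack([rng.uniform(-5, 15, size=500), rng.uniform(0, 8, size=500)])

# Each of the 40 cluster centers defines one Gabor filter of the bank.
filters, labels = kmeans(pooled, k=40)
```

A weighted variant that uses the kernel weights returned by SIKL as sample weights would be a natural alternative, but whether the original method does so is not stated here.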

Sparse image representation using Gabor kernels
The Gabor kernels learned by the method proposed in the previous section are global in the sense that each kernel is present at every support vector location. In Section 6.2, we show that the mixture of the learned Gabor functions is approximately a concentrated Laplacian kernel. One may therefore question whether the Gabor kernels learned by the SIKL algorithm are those that actually participate in the generation of an image, or whether the learned combined kernel merely acts as a universal approximator that the SVR machine can use for approximating any input image. In this section, we aim to represent an image sparsely by a combination of Gabor functions such that each Gabor function is located at a small number of pixels. This has the benefit of associating Gabor functions with the specific locations at which they are present. This problem has been previously considered by Fischer et al. [13], who proposed an algorithm based on local competition. It must be mentioned that the set of Gabor functions chosen by the SIKL algorithm is already sparse. This sparseness is the result of the implicit L1 constraint $\|p\|_1 = 1$ over the probability measure $p$ in Eq. (8), which holds since any probability measure integrates to 1. Thus, we assume that all the Gabor kernels that are found by the SIKL algorithm should be present in the sparse representation as well. We then try to sparsify the set of pixels at which each kernel is present. We start from Eq. (16) obtained in Section 3. By exchanging the order of summation we obtain: Our goal is to approximate the inner summation with a sparse combination of the training input data. Let $b^j$ be an $l \times 1$ vector whose $n$'th element is: Assume $K^j$ is the $l \times \#\mathrm{sv}$ kernel matrix associated with the kernel function $k_{\omega_j, \theta_j}$, in which rows correspond to the image coordinates and columns correspond to the support vector image coordinates. According to Eq. (17), to obtain a sparse representation for image $I$, we should find a sparse vector $\rho^j$ such that, for $n = 1, \ldots, l$, we have: Eq. (19) can be written in matrix notation as: We have: where $K^j_{n*}$ is the $n$'th row of the kernel matrix $K^j$, and the sparseness of this representation follows from the sparseness of the vector $\rho^j$. To find a sparse vector $\rho^j$ that satisfies Eq. (20), we use LASSO [43], which solves the following optimization problem: To undo the negative effect of the L1 regularization term in LASSO on the quality of the approximation of Eq. (20), we solve Eq. (20) again using the least squares method, with the constraint that the sparsity pattern of $\rho^j$ found by LASSO is preserved.
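The two-stage procedure (LASSO to select a sparsity pattern, followed by a least-squares refit restricted to that pattern) can be sketched as follows. The sketch assumes the standard LASSO objective $\frac{1}{2}\|K\rho - b\|_2^2 + \lambda\|\rho\|_1$ and solves it with a simple proximal gradient (ISTA) loop; the scaling of the objective and the solver are assumptions, and the matrix and vector below are synthetic stand-ins for $K^j$ and $b^j$.

```python
import numpy as np

def lasso_ista(K, b, lam, n_iter=5000):
    """Minimize 0.5 * ||K @ rho - b||^2 + lam * ||rho||_1 by proximal gradient (ISTA)."""
    step = 1.0 / np.linalg.norm(K, 2) ** 2  # 1 / Lipschitz constant of the gradient
    rho = np.zeros(K.shape[1])
    for _ in range(n_iter):
        z = rho - step * (K.T @ (K @ rho - b))
        rho = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return rho

def refit_on_support(K, b, rho):
    """Least-squares refit restricted to the nonzero pattern found by LASSO."""
    support = np.flatnonzero(rho)
    refit = np.zeros_like(rho)
    if support.size > 0:
        refit[support] = np.linalg.lstsq(K[:, support], b, rcond=None)[0]
    return refit

# Synthetic problem: 60 "pixels", 30 "support vectors", two active positions.
rng = np.random.default_rng(0)
K = rng.normal(size=(60, 30))
true_rho = np.zeros(30)
true_rho[[3, 17]] = [2.0, -1.5]
b = K @ true_rho

rho_lasso = lasso_ista(K, b, lam=0.1)
rho_refit = refit_on_support(K, b, rho_lasso)

# The refit can only lower (or keep) the residual, because the LASSO solution
# is itself a feasible point of the restricted least-squares problem.
res_lasso = np.linalg.norm(K @ rho_lasso - b)
res_refit = np.linalg.norm(K @ rho_refit - b)
```

The refit removes the shrinkage bias that the L1 penalty introduces in the nonzero coefficients while leaving the selected positions unchanged.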

Experiments
The experiments of this section are designed with two goals in mind. First, we want to analyze the proposed algorithm in detail and discover the nature of the learned Gabor functions. Second, we want to show the usefulness of the proposed method in automatic learning of Gabor functions for a given dataset. In Section 6.1, we report our experiments on the application of the learned Gabor functions to the face recognition problem and show that they yield favorable recognition accuracy compared with a hand-tuned choice made by experts. In Section 6.2, we analyze the learned Gabor functions and show that the weighted combination of the learned Gabor kernels is approximately a concentrated Laplacian kernel. Finally, in Section 6.3, we analyze the proposed algorithm for Gabor-based sparse representation of images.

Selection of Gabor filters for face recognition
In this section, we want to show that the use of Gabor filters learned by the method proposed in Section 4 can increase the accuracy of machine vision applications compared with Gabor filters chosen by hand. For this purpose, we chose the MOST system, recently proposed by Ren et al. [36] for the task of face recognition, which uses Gabor filters for feature extraction. The code of the MOST algorithm along with the CMU-Light and EYaleB face datasets were obtained by contacting Ren et al. [36]. CMU-Light is the name that [36] gave to the illumination part of the CMU-PIE dataset [4], which consists of 43 images captured under different illumination conditions from 68 persons, amounting to 2924 images. The Extended Yale B dataset [16], abbreviated as EYaleB, consists of 64 frontal images from 38 persons, again taken under different illumination conditions, amounting to 2432 images. Ren et al. [36] removed the 5 darkest images from the original 64 instances provided for each person in the Extended Yale B dataset. In addition, all images had been histogram-equalized and resized to width 46 and height 56. Since the SIKL algorithm is time-consuming, and considering the locality of Gabor functions, instead of representing a whole image with Gabor functions, we break images into several smaller (sometimes overlapping) regions and represent each region with a set of Gabor functions. From each face, we extract four regions around the two eyes, the nose, and the mouth (see Figure 1). We used a trapezoidal vanishing function in the formulation of the SIKL algorithm with $\omega_0 = \frac{\pi}{512}$, $\omega_1 = \frac{\pi\sqrt{2}}{512}$, $\omega_{u1} = 2\sqrt{2}\pi$, and $\omega_{u0} = 4\pi$. In $\mu\nu$-space, these choices correspond to $\nu_0 = 16$, $\nu_1 = 15$, $\nu_{u1} = -5$, and $\nu_{u0} = -6$.
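The correspondence between the breakpoints in $\omega$ and in $\nu$ follows from the $\mu\nu$-parameterization $\omega = \frac{\pi/2}{2^{\nu/2}}$, $\theta = \mu\frac{\pi}{8}$, and can be checked with a short sketch. The function names are hypothetical, and the `trapezoid` function assumes ramps that are linear in $\omega$, which is one possible reading of a trapezoidal vanishing function rather than the exact definition used in the experiments.

```python
import math

def omega_from_nu(nu):
    """Scale parameter: omega = (pi / 2) / 2**(nu / 2)."""
    return (math.pi / 2.0) / 2.0 ** (nu / 2.0)

def nu_from_omega(omega):
    """Inverse of omega_from_nu."""
    return 2.0 * math.log2((math.pi / 2.0) / omega)

def theta_from_mu(mu):
    """Orientation parameter: theta = mu * pi / 8."""
    return mu * math.pi / 8.0

def trapezoid(omega, w0, w1, wu1, wu0):
    """Assumed trapezoidal stabilizer: 0 outside (w0, wu0), 1 on [w1, wu1],
    with ramps that are linear in omega in between."""
    if omega <= w0 or omega >= wu0:
        return 0.0
    if omega < w1:
        return (omega - w0) / (w1 - w0)
    if omega > wu1:
        return (wu0 - omega) / (wu0 - wu1)
    return 1.0

# nu = 16, 15, -5, -6 map to pi/512, pi*sqrt(2)/512, 2*sqrt(2)*pi, 4*pi.
breakpoints = [omega_from_nu(nu) for nu in (16, 15, -5, -6)]
```

For example, `trapezoid(1.0, *breakpoints)` evaluates to `1.0`, since $\omega = 1$ lies on the flat part of the trapezoid between $\omega_1$ and $\omega_{u1}$.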
From each dataset, we randomly selected 28 images (which in both cases amounts to less than 2% of the data) for learning the parameters of Gabor filters. We used the k-means algorithm to cluster the Gabor parameters extracted from these 28 images to obtain 40 Gabor filters. Finally, we evaluated the original and learned Gabor filters on the task of face recognition using the MOST method. The number of training images used by the MOST algorithm, called ntrain, is an important factor in the accuracy of the face recognition system. We compare the accuracies obtained by the original 40 filters used by Ren et al. [36] and the 40 filters learned by our method. Each experiment is repeated 30 times. The results of these experiments are summarized in Table 1. As can be seen, when the number of training images for the MOST algorithm is low, the use of the learned Gabor filters significantly increases the recognition rate. Figure 2 shows the parameters of the original and the learned filters in the $\mu\nu$-space. It is clear from the figure that the parameters of the Gabor filters used by Ren et al. [36] do not cover the whole region of parameters that are required for representing images with Gabor functions. In addition, the dataset-specific distributions of Gabor kernel parameters depicted in Figures 2.b and 2.c can be used as a guideline for manual tuning of the parameters of Gabor filters.

Analysis of the learned Gabor functions
In Section 3, we showed how the SIKL algorithm can be exploited for representing an image with Gabor kernels. The learned representation can be equivalently obtained by support vector regression with a single kernel that is the weighted combination of the selected Gabor kernels (see Eq. 15). An interesting question is what is the single kernel that is equivalent to the combination of the learned Gabor functions. We answer this question empirically by drawing the shape of the combined kernel. Figure 3 shows the combined kernels for two sample images from CMU-Light and EYaleB datasets. As can be seen, the weighted combination of the learned Gabor functions is approximately a concentrated Laplacian kernel.

Discovering locations of constituting Gabor functions
Table 1: Comparison of accuracies obtained by the MOST algorithm on the CMU-Light and EYaleB datasets when using the manually tuned Gabor filters of Ren et al. [36] and when using Gabor filters learned by the proposed algorithm. The parameter "ntrain" refers to the number of training faces used by the MOST face recognition algorithm [36]. The proposed method uses less than 2% of the images of each dataset for learning the parameters of Gabor kernels. For each dataset, the last column shows the two-tailed P-values of a paired t-test. Results that are statistically significant are bold-faced. It must be emphasized that, since a paired t-test is used, P-values cannot be computed from the statistics summarized in this table.
In this section, we experimentally evaluate the SIKL+LASSO algorithm of Section 5 in discovering the exact locations of the Gabor functions participating in a sparse representation of an image. For this purpose, we first produced a few artificial images by combining two randomly generated Gabor functions. Figure 4.a shows several examples of these artificially generated images. Then, we used the SIKL+LASSO method proposed in Section 5 for discovering the positions of the original Gabor functions. We used a regularization constant of λ = 0.1 for the LASSO algorithm. Figure 4.b shows the approximations of the images of Figure 4.a generated by the SIKL algorithm. The set of support vector pixels found by the SIKL algorithm is depicted in Figure 4.c. The approximations of the images of Figure 4.a generated by the SIKL+LASSO algorithm, along with the positions of the discovered Gabor functions, are shown in Figure 4.d. As can be seen, both the SIKL algorithm of Section 3 and the SIKL+LASSO algorithm of Section 5 generate acceptable approximations to the original images. On the other hand, while the set of support vectors obtained by the SIKL algorithm contains many pixels, the SIKL+LASSO algorithm has been successful in obtaining a very sparse representation of the images.
However, in some cases the Gabor functions learned by the SIKL+LASSO algorithm do not correspond exactly to the generating ones. It must be mentioned that, since Gabor functions constitute an overcomplete system, it is natural that an image can be represented by different combinations of these functions. Noting that the SIKL+LASSO method uses exactly those Gabor kernels that were obtained by the SIKL method, this experiment reveals that the set of Gabor kernels learned by the SIKL algorithm is strongly related to those generating an image.

Conclusion
In this paper, we exploited the fact that a practical form of Gabor functions is also a positive-definite kernel to find an image representation based on Gabor functions. This representation is learned by the stabilized infinite kernel learning regression algorithm that was previously proposed by Ghiasi-Shirazi [18]. The obtained representation has the weakness that the learned Gabor kernels are not localized and are present at all support vector pixels. We proposed a sparse representation algorithm based on LASSO and showed that in simple cases it can recover the underlying generating Gabor functions of images. As an application of our method, we proposed an algorithm for automatic learning of the parameters of Gabor filters in the task of face recognition. Our experiments on the CMU-PIE and Extended Yale B datasets confirm the usefulness of the proposed algorithm in automatic learning of Gabor filters.

Acknowledgment
The author wishes to express appreciation to Research Deputy of Ferdowsi University of Mashhad for supporting this project by grant No.: 2/38449. The author thanks Chuan-Xian Ren for providing him with the code of the MOST algorithm [36] and the processed versions of CMU-PIE and Extended Yale B datasets. The author also thanks his colleagues, Ahad Harati and Ehsan Fazl-Ersi for their valuable comments.