Adaptive Explicit Kernel Minkowski Weighted K-means

The K-means algorithm is among the most commonly used data clustering methods. However, regular K-means can only be applied in the input space, and it is effective only when clusters are linearly separable. Kernel K-means, which extends K-means into the kernel space, is able to capture nonlinear structures and identify arbitrarily shaped clusters. However, kernel methods often operate on the kernel matrix of the data, which either scales poorly with the number of samples or incurs a high clustering cost due to repeated evaluations of kernel values. Another issue is that such algorithms access the data only through evaluations of $K(x_i, x_j)$, which limits the processing that can be applied to the data during clustering. This paper proposes a method that combines the advantages of the linear and nonlinear approaches by deriving corresponding approximate finite-dimensional feature maps based on spectral analysis. Approximate finite-dimensional feature maps were previously discussed only in the context of Support Vector Machines (SVM). We suggest applying this idea to kernel K-means, as it avoids storing a huge kernel matrix in memory, allows cluster centers to be computed more efficiently, and gives explicit access to the data in the feature space. These explicit feature maps enable us to access the data in the feature space directly and take advantage of K-means extensions in that space. We demonstrate that our Explicit Kernel Minkowski Weighted K-means (Explicit KMWK-means) method adapts better to the new space and finds best-fitting values there by introducing an additional Minkowski exponent and feature-weight parameters. Moreover, it can reduce the impact of concentration on nearest-neighbour search by investigating norms other than the Euclidean norm, including Minkowski norms and fractional norms (an extension of the Minkowski norms with p < 1).

Clustering is a fundamental task in many applications such as web search, image segmentation, image compression, gene expression analysis, recommendation systems and text mining [4,17,8]. K-means clustering [20] is one of the most popular conventional clustering algorithms, despite its age. It aims to partition a sample of N observations into K compact clusters in an iterative process. The K-means algorithm only works reasonably well when 1) clusters can be separated by hyperplanes and 2) each data point belongs to the closest cluster center. If either of these assumptions does not hold, the standard K-means algorithm is unlikely to give a good result. Kernel-based clustering methods overcome these limitations by using an appropriate non-linear mapping to a higher-dimensional feature space. This enables the K-means algorithm to partition data points by a linear separator in the new space, which corresponds to a non-linear boundary in the original space. Various types of kernel-based methods, such as the kernel version of the SOM (Self-Organizing Map) algorithm [19,16], kernel neural gas [21], one-class SVM (Support Vector Machines) [2] and kernel fuzzy clustering [26,25], have been proposed. In this paper, we focus on kernel K-means clustering because of its efficiency and simplicity. Furthermore, various studies [24,7,6] report that different kernel-based clustering methods give results similar to kernel K-means.
Although kernel-based methods have received considerable attention from the machine learning community in recent years, they still suffer from the following problems in real applications. First, the high clustering cost, due to either the repeated calculation of kernel values or the large amount of memory required to store the kernel matrix, makes them unsuitable for large corpora. Second, the algorithms access the data only through evaluations of K(x, y); therefore, many operations on the data points, such as fighting the concentration phenomenon or handling noise, are limited to the original space.
The aim of this paper is to develop a clustering method that can group data points with both linear and non-linear structures while trying to address the two aforementioned problems of kernel clustering methods. As the main contribution of this paper, we address both the space complexity of storing the kernel matrix and the lack of access to the data in the feature space by proposing the new Adaptive Explicit Kernel Minkowski Weighted K-means (Explicit KMWK-means) method. The proposed method combines the advantages of the linear and nonlinear approaches by deriving corresponding approximate finite-dimensional feature maps, based on 1D Fourier analysis [23], for a large family of additive kernels known as γ-homogeneous kernels. The proposed method first maps the data to a feature space which is a data-independent approximation of the γ-homogeneous kernel space. Then, in order to provide a better fit to the cluster structure, the weighted version of Minkowski K-means is applied in the low-dimensional feature space. In particular, we analyze the concentration of the Euclidean norm and the impact of the distance measure on concentration, and investigate Minkowski norms and fractional norms (an extension of the Minkowski norms with p < 1) as measures of distance in the kernel space. The adaptive Minkowski metric allows us to fight possible concentration in high-dimensional spaces, and the weighting property enables our algorithm to cover spherical and non-spherical (elliptical) structures in the feature space, giving it the chance of finding even more complex clusters.
The remainder of this article is organized as follows. Section 2 briefly describes K-means and kernel K-means as preliminary notions. Section 3 introduces explicit feature maps, focusing on homogeneous kernels. Section 4 presents our modified version of kernel K-means, analyzing the alternative Minkowski distance for arbitrary p ∈ R+ instead of the Euclidean one in the feature space. In Section 5 we describe the results of experiments on four benchmark data sets. Section 6 concludes the paper.

K-means and Kernel K-means
The K-means method is designed to partition N D-dimensional samples $X = (x_1, x_2, \ldots, x_N)$ into K clusters $C_1, C_2, \ldots, C_K$ and return a centroid vector for each cluster, $M = (m_1, m_2, \ldots, m_K)$. The batch-mode K-means algorithm operates by the following iterative procedure: 1. Initialize the K cluster centers $m_1, m_2, \ldots, m_K$. 2. Assign each sample $x_i$ to its closest center; namely, compute the indicator matrix $\delta_{ik}$, $(1 \le k \le K)$.
3. Update the cluster centers.
Note that, in (1), $d(x_i, m_k)$ is the Euclidean distance between $x_i$ and $m_k$. The preceding procedure is in fact an iterative solution to an optimization problem that minimizes the corresponding objective function. In most cases, the distance in use is the squared Euclidean distance. However, Euclidean distances tend to concentrate when the data are high-dimensional; that is, all the pairwise distances may converge to the same value. Accordingly, the relevance of the Euclidean distance has been questioned in the past, and alternative norms, especially fractional norms ($L_p$ semi-norms with p < 1), have been suggested to reduce the concentration phenomenon [13,1]. Obviously, when different metrics are used, the cluster centers no longer follow from the equation in (2). Finding fractional and Minkowski centers, whose components minimize the sum of the corresponding distances, is discussed in Section 4.1.3.
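The batch procedure above can be sketched in a few lines of NumPy. This is an illustrative implementation of standard Lloyd-style K-means, not the code used in the paper:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal batch (Lloyd's) K-means with squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    # 1. Initialize the K cluster centers by sampling distinct data points.
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        # 2. Assignment step: delta[i, k] = 1 iff x_i is closest to m_k.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 3. Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return labels, centers
```

The assignment and update steps are iterated until the centers stop moving, which monotonically decreases the objective in (3).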
The K-means algorithm with Euclidean distance generally works on ellipse-shaped clusters; it is not applicable when the elliptical assumption does not hold. By applying some transformation to the data, mapping it to a new space, the K-means algorithm may achieve better performance than in the original space. Again, suppose we are given N samples $X = (x_1, x_2, \ldots, x_N)$, $x_i \in \mathbb{R}^D$, and a mapping function Φ that transforms $x_i$ from the original space $\mathbb{R}^D$ to a high-dimensional feature space H. Kernel functions are implicitly defined as the dot product of two vectors in the new transformed feature space.
In the rest of the paper, we use $\Phi_i$ instead of $\Phi(x_i)$ for ease of description. Essentially, the transformation is defined implicitly, without knowing the concrete form of Φ [22]. The computation of distances in the transformed space is one of the most important issues when K-means is extended to kernel K-means. The squared Euclidean distance between $x_i$ and $x_j$ in the feature space is $\|\Phi_i - \Phi_j\|^2 = K(x_i, x_i) - 2K(x_i, x_j) + K(x_j, x_j)$. The cluster center in the transformed space can be calculated as $m_k = \sum_{j=1}^{N} \delta_{jk}\,\Phi_j \big/ \sum_{j=1}^{N} \delta_{jk}$. Therefore, the distance between each data point and a cluster center in the new space can be computed without knowing the transformation Φ explicitly.
Here δ is the indicator matrix, where $\delta_{jk}$ indicates whether $\Phi_j$ is assigned to $C_k$ or not. We move from K-means to kernel K-means by applying (8) to the standard K-means.
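The point-to-center distance in the feature space can be computed purely from the Gram matrix, by expanding $\|\Phi_i - m_k\|^2$ into kernel evaluations. The sketch below is an illustrative NumPy version (names such as `center_dists` are ours, not from the paper):

```python
import numpy as np

def center_dists(K_mat, labels, n_clusters):
    """d2[i, k] = ||Phi_i - m_k||^2 computed only from the kernel matrix:
    K_ii - (2/|C_k|) sum_{j in C_k} K_ij + (1/|C_k|^2) sum_{j,l in C_k} K_jl."""
    N = K_mat.shape[0]
    d2 = np.full((N, n_clusters), np.inf)   # empty clusters get infinite distance
    diag = np.diag(K_mat)
    for k in range(n_clusters):
        idx = np.flatnonzero(labels == k)
        if idx.size:
            d2[:, k] = (diag
                        - 2.0 * K_mat[:, idx].mean(axis=1)
                        + K_mat[np.ix_(idx, idx)].mean())
    return d2

def kernel_kmeans(K_mat, n_clusters, n_iter=50, seed=0):
    """Kernel K-means driven entirely by the precomputed Gram matrix."""
    labels = np.random.default_rng(seed).integers(0, n_clusters, K_mat.shape[0])
    for _ in range(n_iter):
        new_labels = center_dists(K_mat, labels, n_clusters).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

With a linear kernel, `K_mat = X @ X.T`, these distances coincide exactly with the Euclidean distances to the cluster means, which is a useful sanity check.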
The following notation is used in the rest of the paper. Multidimensional additive kernels are denoted by $K(\mathbf{x}, \mathbf{y})$, while $k(a, b)$ is used for the scalar ones. A multidimensional additive kernel is obtained from the scalar one as $K(\mathbf{x}, \mathbf{y}) = \sum_{l=1}^{D} k(x_l, y_l)$. The scalar feature map is denoted by $\phi(a)$, and the multidimensional map $\Phi(\mathbf{x})$ is obtained by stacking the scalar feature maps of the individual dimensions.

Explicit Feature Maps
Namely, in the kernel learning context, for each positive-definite (PD) kernel $K(x_i, x_j)$ there exists a corresponding mapping function Φ to a space of arbitrary dimension such that $K(x_i, x_j) = \Phi_i \cdot \Phi_j$. Even though the explicit form of the mapping function is conceptually useful, it is not often used in computations. Typically, these feature spaces are infinite-dimensional, yet it is possible to find an efficient finite-dimensional approximation of Φ. In the following, we introduce a class of kernels commonly used in computer vision, called homogeneous kernels, and then describe how to derive the corresponding approximate explicit finite-dimensional feature maps proposed by Vedaldi and Zisserman [23]. The main focus of this paper, as in the influential work [23], is the class of additive kernels such as the Hellinger's, χ², intersection, and Jensen-Shannon kernels, which are frequently used in computer vision applications. All of the mentioned kernels share the two properties of additivity and homogeneity. Common homogeneous kernels with their properties are listed in Table 1, which shows well-known kernel functions together with their corresponding distances, signatures and closed-form feature maps (adopted from [23]).
A scalar kernel $k(a, b)$ is γ-homogeneous if $k(ca, cb) = c^{\gamma} k(a, b)$ for all $c \ge 0$. Choosing $c = 1/\sqrt{ab}$, the homogeneous kernel can be rewritten as $k(a, b) = (ab)^{\gamma/2}\,\kappa(\log(b/a))$. The scalar function κ, called the signature function, is used to derive the associated feature map of homogeneous kernel functions; it is defined as $\kappa(\lambda) = k(e^{\lambda/2}, e^{-\lambda/2})$. Bochner's theorem, applied to the Fourier transform of the signature function κ(λ), answers the question of which mapping function generates a given homogeneous kernel. As a result, the feature map ϕ for γ-homogeneous kernels is derived as $\phi_\omega(a) = e^{-i\omega \log a} \sqrt{a^{\gamma} \rho(\omega)}$ (12), where ω can be viewed as the index of the vector dimension and ρ(ω) is the spectrum function, obtained from the inverse Fourier transform of the signature κ(λ).
In (12) the feature map is a continuous function, but a finite approximate feature map can still be generated by sampling the continuous spectrum and rescaling it. Closed forms of common kernel feature maps are given in Table 1. Moreover, Hein and Bousquet [12] introduced a large class of γ-homogeneous Hilbertian metrics and corresponding kernels which encompasses all the kernels described in Table 1. This class of metrics is defined with two parameters, α and β, to be tuned.
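As a concrete illustration of sampling (12), the following sketch builds the approximate feature map for the 1-homogeneous χ² kernel $k(a,b) = 2ab/(a+b)$, whose spectrum is $\rho(\omega) = \operatorname{sech}(\pi\omega)$ (Table 1). The sampling period `L` and the number of samples `n` are free parameters; the values below are illustrative choices, not ones prescribed by the paper:

```python
import numpy as np

def chi2_feature_map(x, n=10, L=0.5):
    """Approximate explicit feature map for the chi^2 kernel k(a,b) = 2ab/(a+b).
    Each scalar x >= 0 is mapped to a (2n + 1)-dimensional vector by sampling
    the spectrum rho(omega) = sech(pi * omega) with period L, so that the dot
    product of two mapped values approximates k."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    logx = np.log(np.maximum(x, 1e-12))            # guard log(0); features are 0 there
    feats = [np.sqrt(x * L)]                       # omega = 0 component, sech(0) = 1
    for j in range(1, n + 1):
        scale = np.sqrt(2.0 * x * L / np.cosh(np.pi * j * L))
        feats.append(scale * np.cos(j * L * logx))  # real part
        feats.append(scale * np.sin(j * L * logx))  # imaginary part
    return np.stack(feats, axis=-1)                 # shape (..., 2n + 1)

def additive_map(X, **kw):
    """Map a D-dimensional histogram by concatenating the per-dimension maps,
    matching the additive kernel K(x, y) = sum_l k(x_l, y_l)."""
    return np.concatenate([chi2_feature_map(X[..., l], **kw)
                           for l in range(X.shape[-1])], axis=-1)
```

The dot product of two mapped values is a Riemann sum of the inverse Fourier transform of the spectrum, so it converges to the exact kernel value as `n` grows and `L` shrinks; after this mapping, plain (weighted) K-means can run directly on the explicit features.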

Adaptive Explicit Kernel K-means
In this section, we present the Adaptive Explicit Kernel K-means method for homogeneous kernels. Homogeneous kernels were introduced in the previous section, together with a data-independent method [23] for deriving approximate finite-dimensional feature maps. This class of kernels is a frequently used measure for comparing image histograms due to its effectiveness compared with the linear kernel (Euclidean distance). Using explicit feature maps alleviates the issues of storing a huge similarity matrix or repeatedly calculating kernel values. Moreover, we use the explicit feature map to take advantage of direct access to the data in the feature space. By choosing a suitable kernel, we expect a good match between the K-means model and the real structure of the data in the transformed space, but we still have the chance to obtain a better fit by extending from the squared Euclidean metric to an arbitrary weighted Minkowski metric in the new space.
In addition to adding more flexibility, the adaptive Minkowski metric allows us to fight possible concentration in high-dimensional spaces. Furthermore, the weighting property, which reflects within-cluster feature variances, enables our algorithm to cover spherical and non-spherical (elliptical) structures. In this way, features with smaller within-cluster variances receive a larger weight, and features that are more evenly distributed across the clusters get a smaller weight.

Minkowski Metric
The distance we consider here is the Minkowski distance. It can be seen as a generalization of the Euclidean distance and is defined as $d_p(x, y) = \left(\sum_{l=1}^{D} |x_l - y_l|^p\right)^{1/p}$. Note that for 0 < p < 1 the metric properties do not hold: the corresponding "norms" are not actual norms, since they do not satisfy the triangle inequality. They are usually called prenorms or fractional norms.
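A minimal sketch of the distance, together with a small demonstration that the triangle inequality indeed fails for p < 1 (the example points are ours):

```python
import numpy as np

def minkowski_dist(x, y, p):
    """Minkowski distance for p >= 1; for 0 < p < 1 it is a fractional
    (pre-)norm distance that violates the triangle inequality."""
    return np.sum(np.abs(np.asarray(x, float) - np.asarray(y, float)) ** p) ** (1.0 / p)
```

For example, with p = 0.5 the points x = (0, 0), y = (1, 1), z = (1, 0) give d(x, y) = 4 while d(x, z) + d(z, y) = 2, so the direct route is "longer" than the detour.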

Concentration in high-dimensional spaces
The loss of the Euclidean distance's discriminative power for indexing points in high-dimensional spaces has been demonstrated in the past: as the dimension increases, the distance to the nearest point appears to be the same as the distance to the farthest one. This phenomenon is known as the concentration of distances. The inability of the Euclidean distance to distinguish distances in high dimensions has led to the suggestion of alternative distance measures. In [13], a theoretical analysis of the absolute difference between the farthest-point distance $d_{\max}$ and the closest-point distance $d_{\min}$ for Minkowski norms is presented. It points out that, for D-dimensional i.i.d. random vectors, $\lim_{D \to \infty} E\left[(d_{\max} - d_{\min})/D^{1/p - 1/2}\right] = C$ (18), where C is some constant independent of the distribution of the $x_i$. This means that the contrast between the closest and farthest neighbors on average grows as $D^{1/p - 1/2}$. The authors concluded that the $l_1$ and $l_2$ norms may be more relevant than $l_p$ with p ≥ 3; in fact, for $l_p$ with p ≥ 3, the difference between the farthest and nearest neighbors goes to 0 as the dimensionality increases. These results encouraged researchers to examine fractional distances, i.e., $l_p$ distances with p ∈ (0, 1). In [1], the authors extended the previous work and proposed using fractional distance metrics; they showed that fractional distances can provide higher relative contrast and more meaningful results under the same conditions as in (18). However, the results obtained in [13] and [1] do not hold in general when the data are not uniformly distributed, and it is quite rare that data spread uniformly through such spaces: observation of real data shows that high-dimensional spaces are mostly empty, and data commonly lie on a sub-manifold. In [10], the data distribution, rather than the data set, has been studied. For that purpose the relative variance is proposed, defined as $RV_{F,p} = \sqrt{\operatorname{Var}(\|X\|_p)}\,/\,E[\|X\|_p]$ (19), where X is drawn from the distribution F. Similarly to the relative contrast (18), the relative variance can be seen as a measure of concentration: smaller values of $RV_{F,p}$ indicate stronger concentration.
We can see that both the shape of the distribution F and the value of p may affect the value of $RV_{F,p}$. Therefore, when adjusting the value of p, the shape of the distribution should also be considered: it is entirely possible that a higher relative variance is achieved by higher-order norms.
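The concentration effect is easy to observe empirically. The sketch below estimates the relative variance of the Euclidean norm on uniform random data in low and high dimension (an illustrative simulation, not an experiment from the paper):

```python
import numpy as np

def relative_variance(X, p):
    """Sample estimate of RV = std(||x||_p) / mean(||x||_p); smaller values
    mean the p-norms of the points are more concentrated."""
    norms = np.sum(np.abs(X) ** p, axis=1) ** (1.0 / p)
    return norms.std() / norms.mean()

rng = np.random.default_rng(0)
rv_low = relative_variance(rng.uniform(size=(2000, 10)), p=2)     # D = 10
rv_high = relative_variance(rng.uniform(size=(2000, 1000)), p=2)  # D = 1000
# The Euclidean norm concentrates as the dimension grows: rv_high << rv_low.
```

For i.i.d. uniform coordinates the relative variance of the Euclidean norm shrinks roughly like $1/\sqrt{D}$, which is exactly the concentration the adaptive exponent p is meant to counteract.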

Choosing the Optimal Value of p
In supervised learning tasks like classification, the value of p could be chosen to maximize model accuracy. However, we do not have access to the class labels in unsupervised learning. In that case, a sensible way of choosing p could be to use $RV_{F,p}$ as an objective to be maximized. In [9], the relation between the concentration phenomenon and hubness has also been studied; the authors proposed an unsupervised approach for choosing the value of p by measuring hub and anti-hub occurrences as defined in that paper. Although $RV_{F,p}$ and hubness can give a sense of the concentration level, they are not always reliable measures. As shown in Figure 1, different values of p give distinct measures of distance in a Euclidean space. It can be seen that a rotation of the coordinate system also changes the distance measurement; the only exception is the circular shape at p = 2 (Figure 1d). As p → 0, being exact in one dimension has more value than being balanced in two (Figure 1a); conversely, as p → ∞, only the maximum difference matters (Figure 1f). Thus, moving far from p = 2 completely changes the meaning of the distance, so the value of p should be chosen by balancing meaningfulness against the reduction of concentration. Because ground-truth annotation is often costly and inaccessible in real-world applications, there are usually only limited class labels available. In this case, semi-supervised learning provides a better choice by leveraging the unlabeled data together with a small set of labeled data. We discover the optimal value of the exponent p by uncovering the labels of only a few percent of the data, while employing the entire data set, labeled or not, to run a series of clustering experiments at various values of p; this has been reported to give superior results compared with using only the labeled data [5].

Finding Centers
After deriving the approximate feature map and an optimal value for p, the kernel clustering algorithm can be started in order to optimize the objective function under the Minkowski distance. It minimizes the sum of Minkowski distances between the instances and their associated centers. The Minkowski K-means objective function is obtained by applying the Minkowski distance to (4):
$\arg\min_{\delta, M} \sum_{k=1}^{K} \sum_{i=1}^{N} \delta_{ik} \sum_{l} |\Phi_{il} - m_{kl}|^{p}$ (20). It should be noted that it is not possible to use the definition of the center previously given in Eq. (2), since it does not minimize (20) when δ is held constant. In other words, to find the Minkowski centers we need to find, for each cluster, the vector $m_k$ minimizing the following function: $\arg\min_{m_k} \sum_{i=1}^{N} \delta_{ik} \sum_{l} |\Phi_{il} - m_{kl}|^{p}$ (21). Being a minimizer of (21) is required of the vector $m_k$ in order to prove the convergence of Minkowski K-means to a local optimum; in other words, it guarantees that the cost is lowered monotonically in each iteration. In this way, the algorithm is proven to converge in a finite number of iterations, since it iterates over a finite set of partitions while the cost decreases monotonically. The search space for the vector $m_k$ is too large for an exhaustive search, and the algorithm needs accurate solutions. Note, however, that finding the vector $m_k$ is a single-objective problem whose components along different dimensions are independent. Various evolutionary algorithms can be designed to find good solutions; moreover, for p > 1, Eq. (21) is a convex function, so more efficient algorithms like steepest descent can be used to find the global minimizer [5].
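Since the components along different dimensions are independent, the center can be found one coordinate at a time. The following is a minimal steepest-descent sketch for a single coordinate with p > 1 (the learning rate and iteration count are illustrative choices, not values from the paper):

```python
import numpy as np

def minkowski_center_1d(x, p, lr=0.01, n_iter=2000):
    """Steepest-descent minimizer of f(m) = sum_i |x_i - m|^p for p > 1
    (a convex problem); each feature of a cluster center can be solved
    independently this way. Initialized at the mean (the p = 2 solution)."""
    m = x.mean()
    for _ in range(n_iter):
        # f'(m) = -p * sum_i sign(x_i - m) * |x_i - m|^(p-1)
        grad = -p * np.sum(np.sign(x - m) * np.abs(x - m) ** (p - 1))
        m -= lr * grad / len(x)
    return m
```

For p = 3 and the points {0, 1, 10}, setting the derivative to zero gives $m^2 + 18m - 99 = 0$, i.e. m ≈ 4.416, which differs from both the mean (3.667) and the median (1), illustrating why the update in (2) cannot be reused.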

Weighted Version of Explicit Kernel K-means
Using the feature-weighted version of the Minkowski distance enables the algorithm to provide a better fit to the cluster structures than is possible with Minkowski K-means alone. It allows our algorithm to find both spherical and non-spherical (elliptical) cluster structures in the feature space and, accordingly, gives a much better fit to arbitrarily shaped clusters in the original space. The authors of [5] extended the weighted K-means variants [3,14,15] by treating feature weights as feature rescaling factors; this means the Minkowski exponent is applied to the feature weights as well. The objective function for the weighted version of Minkowski K-means in Equation (20) can be written as $\arg\min_{\delta, M, W} \sum_{k=1}^{K} \sum_{i=1}^{N} \delta_{ik} \sum_{l} w_{kl}^{p} |\Phi_{il} - m_{kl}|^{p}$ (22). The weight $w_{kl}$ reflects the relevance of feature l in cluster k. The weights are assigned in inverse proportion to the dispersion of a feature within a given cluster; thus, a feature with a small dispersion within a specific cluster receives a higher weight, and vice versa. More precisely, $w_{kl}$ is updated in each iteration as in Equation (23), based on the within-cluster dispersions. The entire algorithm is summarized in the flowchart in Fig. 2.
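A sketch of the weight update, reconstructed following the Minkowski weighted K-means literature [5] (the exact form of the paper's Equation (23) is assumed here to be the standard dispersion-based rule $w_{kl} = 1/\sum_h (D_{kl}/D_{kh})^{1/(p-1)}$ with $D_{kl} = \sum_i \delta_{ik}|\Phi_{il} - m_{kl}|^p$):

```python
import numpy as np

def update_weights(X, labels, centers, p, eps=1e-12):
    """Per-cluster feature weights, inversely related to within-cluster
    dispersion D_kl = sum_{i in C_k} |x_il - m_kl|^p via
    w_kl = 1 / sum_h (D_kl / D_kh)^(1/(p-1)).
    The weights of each cluster sum to 1, and low-dispersion features dominate."""
    K, D = centers.shape
    W = np.empty((K, D))
    for k in range(K):
        Xk = X[labels == k]
        disp = (np.abs(Xk - centers[k]) ** p).sum(axis=0) + eps  # D_kl, kept > 0
        ratios = (disp[:, None] / disp[None, :]) ** (1.0 / (p - 1))
        W[k] = 1.0 / ratios.sum(axis=1)
    return W
```

With this normalization, a feature whose values are tightly packed around the cluster center receives a large weight, while a feature spread evenly across the cluster is down-weighted, which is what lets the method capture elliptical clusters.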

Accelerate Minkowski K-means clustering
The computation of Minkowski centers can significantly slow down the running time. We already know that at p = 2 the center is the mean, and at p = 1 it is given by the component-wise median; otherwise, it requires an iterative steepest-descent process or an evolutionary computation, which must be counted in the computational cost. A significant speed-up can be achieved by an appropriate initialization that lessens the number of iterations and consequently decreases the time spent searching for Minkowski centers as a minimization process. We use the output cluster centers of K-means with p equal to 2 or 1 as the initialization. This approach results in an impressive reduction of the clustering time, because all of these Minkowski distances share the same properties and the data arrive half-clustered. In [11], the effect of the initialization of K-means has been studied; the authors found that when the clusters overlap, the choice of initialization strategy does not matter much for the results.

Experimental Results
In this section, we present the results of performance experiments with our Explicit Kernel MWK-means. We conducted our experiments on four benchmark data sets: USPS, MNIST, Caltech101, and MSRC-V1. Some relevant statistics are shown in Table 2.
Two standard metrics were used to measure the image clustering performance: Normalized Mutual Information (NMI) and Purity. The USPS dataset includes 11,000 instances of the digits 0-9. The dataset is known to be very difficult, with a recorded human error rate of 2.5%. The images are 16×16 gray-scale pixels.
The Caltech101 dataset comprises 9144 images of objects belonging to 101 classes and one background class. The number of images per class differs, and the size of each image is approximately 300×200 pixels. We selected the commonly used 7 groups, i.e., Face, Motorbike, Dollar-Bill, Garfield, Snoopy, Stop-Sign and Windsor-Chair, for our experiments.
The MSRC-V1 dataset is from Microsoft Research in Cambridge and is commonly used for scene recognition. We adopt Lee and Grauman's approach [18] to refine it into 7 classes, namely tree, building, airplane, cow, face, car and bicycle, where each class has 30 images.

Setting of the Experiments
To determine which value of the exponent p gives the best performance, a semi-supervised procedure was employed. The value of p was derived by uncovering the class labels of a 15% data sample, though clustering was conducted over the whole dataset. After running a series of clustering experiments at different values of p, the p that produced the highest clustering accuracy was picked. Each experiment was repeated 50 times and the average NMI is reported.
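The selection loop can be sketched as follows. Here `minkowski_kmeans` is a simplified stand-in clusterer (Minkowski-p assignments with mean centers, not the full Explicit KMWK-means of Section 4), and `pick_p` scores each candidate p by majority-label accuracy on the revealed subset; both names and the scoring details are illustrative assumptions:

```python
import numpy as np

def minkowski_kmeans(X, K, p, n_iter=30, seed=0):
    """Stand-in clusterer: Lloyd-style iterations with Minkowski-p assignment
    and mean centers (the 1/p root is omitted since argmin is unaffected)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        d = (np.abs(X[:, None, :] - centers[None]) ** p).sum(-1)
        labels = d.argmin(1)
        centers = np.array([X[labels == k].mean(0) if np.any(labels == k)
                            else centers[k] for k in range(K)])
    return labels

def pick_p(X, y_partial, mask, candidates, K):
    """Cluster the WHOLE data set at each candidate p, but score accuracy
    only on the labelled sample (indices where mask is True)."""
    def acc(labels):
        hits = 0
        for k in range(K):
            yk = y_partial[mask & (labels == k)]
            if yk.size:                       # majority label of the cluster
                hits += (yk == np.bincount(yk).argmax()).sum()
        return hits / mask.sum()
    scores = {p: acc(minkowski_kmeans(X, K, p)) for p in candidates}
    return max(scores, key=scores.get)
```

In the paper's setting, the mask would cover 15% of the data and each experiment would be repeated to average out the initialization.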
Clustering images using raw pixels as features is unlikely to work effectively; the standard practice is to use visual descriptors such as HOG and SIFT. Both HOG and SIFT use histograms of pixel-intensity gradients in their descriptors, and the class of homogeneous kernels is popular for comparing such histograms due to its effectiveness. HOG describes the shape and edge information of an object, while SIFT features are invariant to image scale, rotation, noise and illumination changes. For the MNIST and USPS databases the HOG feature is used; we choose 4×4 grid cells, 2×2 cells per block and 1-cell spacing as the HOG parameters. For the two other datasets, Caltech7 and MSRC-V1, key-points are extracted from each image and each key-point is represented as a 128-dimensional SIFT descriptor. A randomly chosen subset of SIFT features was clustered to form a visual vocabulary for each dataset; each SIFT descriptor was then quantized to the visual word given by the nearest cluster center, yielding a 500-dimensional vector representation for each image. Table 3 clearly shows the match between the values of p learned in the semi-supervised and fully supervised settings. The only exception is on MNIST with the χ² kernel, where the learned and optimal values of the exponent p do not match; however, Table 4 shows that it still performs better than common kernel K-means with the exponent p = 2. We compared the clustering performance of our method (Explicit Kernel MWK-means) with the corresponding equal-weight versions and the fixed Euclidean one. In addition, we also compared the results of our method with exact kernel K-means clustering. Tables 4 to 7 present the clustering results in terms of NMI and Purity on all data sets. It can be seen that, compared with their exact kernel K-means counterparts, our proposed Explicit Kernel MWK-means improves the clustering performance on all the data sets.

Conclusions
In this paper, we proposed a kernel K-means method based on explicit feature maps, with further fitting performed in the feature space. Using the adaptive Minkowski metric and feature weighting in the feature space enables our algorithm to achieve high clustering quality and to show strong robustness to noisy features and the concentration phenomenon. Approximate finite-dimensional feature maps had previously been discussed only for Support Vector Machine (SVM) problems. We suggested using this technique to avoid storing a huge kernel matrix in memory, in addition to accessing the data explicitly in the feature space. Our experiments demonstrate that the proposed method consistently achieves superior clustering performance in terms of two standard metrics, Normalized Mutual Information (NMI) and Purity, evaluated on four standard benchmark data sets for object and scene recognition.