
Neural Networks

Volume 15, Issues 8–9, October–November 2002, Pages 1059-1068

2002 Special Issue
Generalized relevance learning vector quantization

https://doi.org/10.1016/S0893-6080(02)00079-5

Abstract

We propose a new scheme for enlarging generalized learning vector quantization (GLVQ) with weighting factors for the input dimensions. The factors allow an appropriate scaling of the input dimensions according to their relevance. They are adapted automatically during training according to the specific classification task, whereby training can be interpreted as stochastic gradient descent on an appropriate error function. This method leads to a more powerful classifier and to an adaptive metric with little extra cost compared to standard GLVQ. Moreover, the size of the weighting factors indicates the relevance of the input dimensions, which suggests a scheme for automatically pruning irrelevant input dimensions. The algorithm is verified on artificial data sets and the Iris data from the UCI repository. Afterwards, the method is compared to several well-known algorithms which determine the intrinsic data dimension on real-world satellite image data.

Introduction

Self-organizing methods such as the self-organizing map (SOM) or vector quantization (VQ) as introduced by Kohonen provide a successful and intuitive method of processing data for easy access (Kohonen, 1995). Assuming the data are labeled, an automatic clustering can be learned via attaching maps to the SOM or enlarging VQ with a supervised component to so-called learning vector quantization (LVQ) (Kohonen, 1997, Meyering and Ritter, 1992). Various modifications of LVQ exist which ensure faster convergence, a better adaptation of the receptive fields to optimum Bayesian decision, or an adaptation to complex data structures, to name just a few (Kohonen, 1997, Sato and Yamada, 1995, Somervuo and Kohonen, 1999).

A common feature of unsupervised algorithms and LVQ is that information is provided by the distance structure between the data points, which is determined by the chosen metric. Learning heavily relies on the commonly used Euclidean metric and hence crucially depends on the Euclidean metric being appropriate for the respective learning task. Therefore, data have to be preprocessed and scaled appropriately such that the input dimensions have approximately the same importance for the classification. In particular, the features important for the respective problem have to be found, which is usually done by experts or with rules of thumb. Of course, this may be time consuming and requires prior knowledge which is often not available. Hence, methods have been proposed which adapt the metric during training. Distinction sensitive LVQ (DSLVQ), as an example, automatically determines weighting factors for the input dimensions of the training data (Pregenzer, Pfurtscheller, & Flotzinger, 1996). The algorithm adapts the weighting factors with an LVQ3-type update according to plausible heuristics. The approaches of Kaski et al. (2001) and Sinkkonen and Kaski (2002) enhance unsupervised clustering algorithms by the possibility of integrating auxiliary information, such as a labeling, into the metric structure. Alternatively, one could use information geometric methods to adapt the metric, as in Hofmann (2000).
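To make the role of such weighting factors concrete, here is a minimal sketch of a scaled Euclidean metric (Python; the function and variable names are ours and are not taken from any of the cited methods):

    import numpy as np

    def weighted_sq_distance(x, w, lam):
        """Squared Euclidean distance with per-dimension weighting factors lam.

        With lam = (1, ..., 1) this reduces to the ordinary squared Euclidean
        distance; unequal factors rescale the input dimensions according to
        their importance for the classification task.
        """
        d = np.asarray(x) - np.asarray(w)
        return float(np.dot(lam, d * d))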

Concerning SOM, another major problem consists in finding an appropriate topology of the initial lattice of prototypes such that the prior topology of the neural architecture mirrors the intrinsic topology of the data. Hence various heuristics exist to measure the degree of topology preservation, to adapt the topology to the data, to define the lattice a posteriori, or to evolve structures which are appropriate for real-world data (Bauer and Villmann, 1997, Fritzke, 1995, Martinetz and Schulten, 1993, Ritter, 1999, Villmann et al., 1997). In all these tasks, the intrinsic dimensionality of the data plays a crucial role since it determines an important aspect of the optimum neural network: the topological structure, i.e. the lattice for SOM. Moreover, superfluous data dimensions slow down training for LVQ as well. They may even cause a decrease in accuracy since they add possibly noisy or misleading terms to the Euclidean metric on which LVQ is based. Hence a data dimensionality as small as possible is desirable for the above mentioned methods in general, for the sake of efficiency, accuracy, and simplicity of neural network processing. Therefore various algorithms exist which allow one to estimate the intrinsic dimension of the data: PCA and ICA constitute well-established methods which are often used for adequate preprocessing of data and which can be implemented with neural methods (Hyvärinen and Oja, 1997, Oja, 1995). A Grassberger–Procaccia analysis estimates the dimensionality of attractors in a dynamical system (Grassberger & Procaccia, 1983). SOMs which adapt the dimensionality of the lattice during training, like the growing SOM (GSOM), automatically determine the approximate dimensionality of the data (Bauer & Villmann, 1997). Naturally, all adaptation schemes which determine weighting factors or relevance terms for the input dimensions constitute an alternative method for determining the dimensionality: the dimensions which are ranked as least important, i.e. those with the smallest relevance terms, can be dropped. The intrinsic dimensionality is reached when an appropriate quality measure, such as an error term, changes significantly. There exists a wide variety of input relevance determination methods in statistics and the field of supervised neural networks, e.g. pruning algorithms for feedforward networks as proposed in Grandvalet (2000), the application of automatic relevance determination for the support vector machine or Gaussian processes (van Gestel et al., 2001, Neal, 1996, Tipping, 2000), or adaptive ridge regression and the incorporation of penalizing functions as proposed in Grandvalet, 1998, Roth, 2001, Tibshirani, 1996. However, note that our focus lies on improving metric based algorithms via an adaptive metric which allows dimensionality reduction as a byproduct. The above mentioned methods do not yield a metric which could be used in self-organizing algorithms but primarily pursue the goal of sparsity and dimensionality reduction in neural network architectures or alternative classifiers.
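As a hedged illustration of this pruning idea (the names and the fixed cut-off are ours; in practice the cut-off would be chosen where a quality measure such as the classification error starts to change significantly): dimensions are ranked by their relevance terms and the least relevant ones are dropped.

    import numpy as np

    def prune_by_relevance(X, lam, n_keep):
        """Keep only the n_keep input dimensions with the largest relevance factors.

        X: data matrix of shape (m, n); lam: relevance factors of shape (n,).
        Returns the reduced data and the indices of the retained dimensions.
        """
        keep = np.sort(np.argsort(lam)[::-1][:n_keep])  # most relevant dimensions
        return X[:, keep], keep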

In the following, we will focus on LVQ since it combines the elegance of simple and intuitive updates in unsupervised algorithms with the accuracy of supervised methods. We will propose a possibility of automatically scaling the input dimensions and hence adapting the Euclidean metric to the specific training problem. As a byproduct, this leads to a pruning algorithm for irrelevant data dimensions and the possibility of computing the intrinsic data dimension. Approaches like Kaski (1998) clearly indicate that often a considerable reduction in the data dimension is possible without loss of information. The main idea of our approach is to introduce weighting factors for the data dimensions which are adapted automatically such that the classification error becomes minimal. As in LVQ, the update formulas are intuitive and can be interpreted as Hebbian learning. From a mathematical point of view, the dynamics constitute a stochastic gradient descent on an appropriate error surface. Small factors in the result indicate that the respective data dimension is irrelevant and can be pruned. This idea can be applied to any generalized LVQ (GLVQ) scheme as introduced in Sato and Yamada (1995) or to other plausible error measures such as the Kullback–Leibler divergence. With the error measure of GLVQ, a robust and efficient method results which can push the classification borders close to the optimum Bayesian decision. This method, generalized relevance LVQ (GRLVQ), generalizes relevance LVQ (RLVQ) (Bojer, Hammer, Schunk, & Tluk von Toschanowitz, 2001), which is based on simple Hebbian learning and leads to worse and unstable results in the case of noisy real-life data. However, like RLVQ, GRLVQ has the advantage of an intuitive update rule and allows efficient input pruning, compared to other approaches which adapt the metric to the data by involving additional transformations, as proposed in Gath and Geva, 1989, Gustafson and Kessel, 1979, Tsay et al., 1999, or which depend on less intuitive differentiable approximations of the original dynamics (Matecki, 1999). Moreover, it is based on gradient dynamics, in contrast to heuristic methods like DSLVQ (Pregenzer et al., 1996).
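To fix ideas, the quantities involved can be sketched as follows (the notation is ours; the precise error function and update rules are given in Section 2): the metric is scaled by nonnegative, normalized relevance factors, and training descends on a GLVQ-type cost that compares, for each sample, the closest prototype with the correct label and the closest prototype with a wrong label.

    d_\lambda(x, w) = \sum_{i=1}^{n} \lambda_i (x_i - w_i)^2 ,
    \qquad \lambda_i \ge 0, \quad \sum_{i=1}^{n} \lambda_i = 1 ,

    E = \sum_{j=1}^{m} f\!\left( \frac{d_\lambda^{+}(x^j) - d_\lambda^{-}(x^j)}
                                      {d_\lambda^{+}(x^j) + d_\lambda^{-}(x^j)} \right) ,

where d_\lambda^{+}(x^j) denotes the weighted distance of x^j to the closest prototype with the correct label, d_\lambda^{-}(x^j) the weighted distance to the closest prototype with a wrong label, and f is a monotonically increasing function as in GLVQ (Sato & Yamada, 1995).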

We will verify our method on various small data sets. Moreover, we will apply GRLVQ to classify a real-life satellite image with approximately 3 million data points. As already mentioned, the weighting factors allow us to approximately determine the intrinsic data dimensionality. An alternative method is the GSOM, which automatically adapts the lattice of neurons to the data and hence gives hints about the intrinsic dimensionality as well. We compare our GRLVQ experiments to the results provided by GSOM. In addition, we relate them to a Grassberger–Procaccia analysis. We obtain comparable results concerning the intrinsic dimensionality of our data. In the following, we will first introduce our method GRLVQ, present applications to simple artificial and real-life data, and finally discuss the results for the satellite data.

Section snippets

The GRLVQ algorithm

Assume a finite training set X = {(x^i, y^i) ∈ R^n × {1, …, C} | i = 1, …, m} of training data is given and the clustering of the data into C classes is to be learned. We denote the components of a vector x ∈ R^n by (x_1, …, x_n) in the following. GLVQ chooses a fixed number of vectors in R^n for each class, the so-called prototypes. Denote the set of prototypes by {w^1, …, w^M} and assign the label c_i = c to w^i iff w^i belongs to the cth class, c ∈ {1, …, C}. The receptive field of w^i is defined by R_i = {x ∈ X | ∀w^j: |x − w^i| ≤ |x − w^j|}. …
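A minimal single-step sketch of how such a scheme can be implemented follows (Python; variable names, learning rates, and the choice of f as the identity are our assumptions for illustration, not the exact formulation of the paper):

    import numpy as np

    def grlvq_step(x, y, prototypes, labels, lam, eps_w=0.05, eps_l=0.005):
        """One stochastic update for a single labeled sample (x, y).

        prototypes: array of shape (M, n); labels: class labels of shape (M,)
        (at least one prototype per class assumed); lam: relevance factors of
        shape (n,), nonnegative and summing to one.
        """
        diff = x - prototypes                       # (M, n)
        dist = (diff ** 2) @ lam                    # weighted squared distances

        correct = labels == y
        j_plus = np.where(correct)[0][np.argmin(dist[correct])]     # closest correct prototype
        j_minus = np.where(~correct)[0][np.argmin(dist[~correct])]  # closest wrong prototype
        dp, dm = x - prototypes[j_plus], x - prototypes[j_minus]
        d_plus, d_minus = dist[j_plus], dist[j_minus]

        denom = (d_plus + d_minus) ** 2
        gp = 2.0 * d_minus / denom                  # derivative of mu w.r.t. d_plus
        gm = 2.0 * d_plus / denom                   # minus the derivative of mu w.r.t. d_minus

        # prototype updates: attract the closest correct prototype, repel the wrong one
        prototypes[j_plus]  += eps_w * gp * 2.0 * lam * dp
        prototypes[j_minus] -= eps_w * gm * 2.0 * lam * dm

        # relevance update: gradient step, then projection back to nonnegative, normalized factors
        lam = lam - eps_l * (gp * dp ** 2 - gm * dm ** 2)
        lam = np.clip(lam, 0.0, None)
        lam /= lam.sum()
        return prototypes, lam

Iterating this step over randomly drawn training samples realizes the stochastic gradient descent; small components of lam at convergence mark dimensions that can be pruned.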

Relation to previous research

The main characteristics of GRLVQ as proposed in Section 2 are as follows: the method allows an adaptive metric via scaling the input dimensions, where the metric is restricted to a diagonal matrix. The advantages are the efficiency of the method, the interpretability of the matrix elements as relevance factors, and the resulting possibility of pruning. The update proposed in GRLVQ is intuitive and efficient; at the same time, a thorough mathematical foundation is available due to the gradient dynamics.

Artificial data

We first tested GRLVQ on two artificial data sets from Bojer et al. (2001) in order to compare it with RLVQ. We refer to the sets as data 1 and data 2, respectively. The data comprise two-dimensional clusters with small and large overlap, respectively, as shown in Fig. 1. We embed the points in R^10 as follows: assume (x_1, x_2) is one data point. Then we add eight dimensions, obtaining a point (x_1, …, x_10). We choose x_3 = x_1 + η_1, …, x_6 = x_1 + η_4, where η_i comprises noise with a Gaussian …
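A sketch of this embedding is given below (Python; the noise scales and the content of the remaining dimensions x_7, …, x_10 are placeholders, since the snippet is cut off before they are specified):

    import numpy as np

    rng = np.random.default_rng(0)

    def embed_10d(X2, noise_scales=(0.05, 0.1, 0.2, 0.5)):
        """Embed two-dimensional points (x1, x2) into R^10.

        x3, ..., x6 are copies of x1 corrupted by Gaussian noise of increasing
        spread (the scales above are assumptions); x7, ..., x10 are filled with
        pure noise here, which is likewise an assumption.
        """
        m = X2.shape[0]
        copies = X2[:, [0]] + rng.normal(0.0, noise_scales, size=(m, 4))  # x3..x6
        pure = rng.normal(0.0, 0.5, size=(m, 4))                          # x7..x10
        return np.hstack([X2, copies, pure])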

Conclusions

The presented clustering algorithm GRLVQ provides a new robust method for automatically adapting the Euclidean metric used for clustering to the data, determining the relevance of the individual input dimensions for the overall classifier, and estimating the intrinsic dimension of the data. It reduces the input dimensions to the essential parameters, which is required to obtain optimal network structures. This is an important feature if the network is used to reduce the amount of data for subsequent …

References (38)

  • Grassberger, P., et al. (1983). Measuring the strangeness of strange attractors. Physica.
  • Pregenzer, M., et al. (1996). Automated feature selection with distinction sensitive learning vector quantization. Neurocomputing.
  • Augusteijn, M. F., et al. A study of neural network input data for ground cover identification in satellite images.
  • Bauer, H.-U., et al. (1997). Growing a hypercubical output space in a self-organizing feature map. IEEE Transactions on Neural Networks.
  • Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Irvine, CA: Department of Information…
  • Bojer, T., Hammer, B., Schunk, D., & Tluk von Toschanowitz, K. (2001). Relevance determination in learning vector…
  • Campbell, J. (1996). Introduction to remote sensing.
  • Duch, W., et al. (2001). A new method of extraction, optimization and application of crisp and fuzzy logical rules. IEEE Transactions on Neural Networks.
  • Fritzke, B. (1995). Growing grid: A self-organizing network with constant neighborhood range and adaptation strength. Neural Processing Letters.
  • Gath, I., et al. (1989). Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • van Gestel, T., et al. Automatic relevance determination for least squares support vector machine classifiers.
  • Grandvalet, Y. Least absolute shrinkage is equivalent to quadratic penalization.
  • Grandvalet, Y. (2000). Anisotropic noise injection for input variables relevance determination. IEEE Transactions on Neural Networks.
  • Gustafson, D., & Kessel, W. (1979). Fuzzy clustering with a fuzzy covariance matrix. Proceedings of IEEE CDC'79 (pp.…
  • Hofmann, T. Learning the similarity of documents: An information geometric approach to document retrieval and categorization.
  • Hyvärinen, A., et al. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation.
  • Kaski, S. (1998). Dimensionality reduction by random mapping: Fast similarity computation for clustering. Proceedings…
  • Kaski, S., et al. (2001). Bankruptcy analysis with self-organizing maps in learning metrics. IEEE Transactions on Neural Networks.
  • Kohonen, T. Learning vector quantization.
