Entropy of Overcomplete Kernel Dictionaries

In signal analysis and synthesis, linear approximation theory considers a linear decomposition of any given signal in a set of atoms, collected into a so-called dictionary. Relevant sparse representations are obtained by relaxing the orthogonality condition of the atoms, yielding overcomplete dictionaries with an extended number of atoms. More generally than the linear decomposition, overcomplete kernel dictionaries provide an elegant nonlinear extension by defining the atoms through a mapping kernel function (e.g., the gaussian kernel). Models based on such kernel dictionaries are used in neural networks, gaussian processes and online learning with kernels. The quality of an overcomplete dictionary is evaluated with a diversity measure, such as the distance, the approximation, the coherence and the Babel measures. In this paper, we develop a framework to examine overcomplete kernel dictionaries with the entropy from information theory. Indeed, a higher value of the entropy is associated with a more uniform spread of the atoms over the space. For each of the aforementioned diversity measures, we derive lower bounds on the entropy. Several definitions of the entropy are examined, with an extensive analysis in both the input space and the mapped feature space.


I. INTRODUCTION
SPARSITY in representation has gained increasing popularity in signal and image processing, for pattern recognition, denoising and compression [1]. A sparse representation of a given signal consists in decomposing it on a set of elementary signals, called atoms and collected in a so-called dictionary. In the linear formalism, the signal is written as a linear combination of the dictionary atoms. This decomposition is unique when the latter defines a basis, and in particular with orthogonal dictionaries such as the Fourier basis. Since the 1960s, there has been much interest in this direction with the use of predefined dictionaries based on some analytical form, such as wavelets [2]. Predefined dictionaries have been widely investigated in the literature for years, owing to the mathematical simplicity of such structured dictionaries when dealing with orthogonality (as well as bi-orthogonality). When dealing with sparsity, analytical dictionaries perform poorly in general, due to the rigid structure imposed by the orthogonality.
Within the last 15 years, a new class of dictionaries has emerged with dictionaries learned from data, thus with the ability to adapt to the signal under scrutiny. While the Karhunen-Loève transform, also called principal component analysis in advanced statistics [3], falls in this class, the relaxation of the orthogonality condition delivers an increased flexibility with overcomplete dictionaries, i.e., when the number of atoms (largely) exceeds the signal dimension. Several methods have been proposed to construct overcomplete dictionaries by solving a highly non-convex optimization problem, such as the method of optimal directions [4], its singular-value-decomposition (SVD) counterpart [5], and the "convexification" method [6].
P. Honeine is with the Institut Charles Delaunay (CNRS), Université de technologie de Troyes, 10000, Troyes, France. Phone: +33(0)325715625; Fax: +33(0)325715699; E-mail: paul.honeine@utt.fr
Overcomplete dictionaries are more versatile in providing relevant representations, owing to an increased diversity. Several measures have been proposed to "quantify" the diversity of a given dictionary. The simplest measure of diversity is certainly the cardinality of the dictionary, i.e., the number of atoms. While this measure is too simplistic, several diversity measures have been proposed by examining relations between atoms, either in a pairwise fashion or in a more thorough way. The most used measure to characterize a dictionary is the coherence, which is the largest pairwise correlation between its atoms [7]. Using the largest cumulative correlation between an atom and all the other atoms of the dictionary yields the more exhaustive Babel measure [8]. Over the last twenty years or so, the coherence and its variants (such as the Babel measure) have been used for the matching pursuit algorithm [9] and the basis pursuit with arbitrary dictionaries [10], with theoretical results on the approximation quality studied in [11], [8]; see also the extensive literature on compressed sensing [1].
Beyond the literature on linear approximation, several diversity measures for overcomplete dictionary analysis have been investigated separately in the literature, within different frameworks. This is the case of the distance measure, which corresponds to the smallest pairwise distance between all atoms, as often considered in neural networks. Indeed, in resource-allocating networks for function interpolation, the network of gaussian units is assigned a new unit if this unit is distant enough from any other unit already in the network [12], [13]. It turns out that these units operate as atoms in the approximation model, with the corresponding dictionary having a lower-bounded distance measure. While the distance measure of a given dictionary relies only on its nearest pair of atoms, a more thorough measure is the approximation measure, which corresponds to the least error of approximating any atom of the dictionary with a linear combination of its other atoms. This measure of diversity has been investigated in machine learning with gaussian processes [14], online learning with kernels for nonlinear adaptive filtering [15], and more recently kernel principal component analysis [16].
In order to provide a framework that encloses all the aforementioned methods, we consider the reproducing kernel Hilbert space formalism. This allows us to generalize the well-known linear model used in sparse approximation to a nonlinear one, where each atom is substituted by a nonlinear one given with a kernel function. This yields the so-called kernel dictionaries, where each atom lives in a feature space, the latter being defined with some nonlinear transformation of the input space. While the linear kernel yields the conventional linear model, as given in the literature of linear sparse approximation, the use of nonlinear kernels such as the gaussian kernel allows us to include in our study neural networks with resource-allocating networks, nonlinear adaptive filtering with kernels and gaussian processes.
All the aforementioned diversity measures quantify the heterogeneity within the dictionary under scrutiny. In this paper, we derive connections between these measures and the entropy in information theory (which is also related to the definition of entropy in other fields, such as thermodynamics and statistical mechanics) [17]. Indeed, the entropy measures the disorder or randomness within a given system. By considering the generalized Rényi entropy, which encompasses the definitions given by Shannon and Hartley as well as the quadratic formulation, we show that any overcomplete kernel dictionary with a given diversity measure has a lower-bounded entropy. These results on the high values of the entropy illustrate that the atoms tend to be spread uniformly over the space. We provide a comprehensive analysis, for any kernel type and any entropy definition, within the Rényi entropy framework as well as the more recent nonadditive entropy proposed by Tsallis [18], [19]. Finally, we provide an entropy analysis in the feature space by deriving lower bounds depending on the diversity measures. As a consequence, we connect the diversity measures between both input and feature spaces.
The remainder of this paper is organized as follows. The next section introduces the sparse approximation problem, in its conventional linear model as well as its nonlinear extension with the kernel formalism. Section III presents the most used diversity measures for quantifying overcomplete dictionaries, while Section IV provides a preliminary exploration with results that will be used throughout this paper. Section V is the core of this work, where we define the entropy and examine it in the input space, while Section VI extends this analysis to the feature space. Section VII concludes this paper.

Related work
In [20], Girolami considered the estimation of the quadratic entropy with a set of samples, by using the Parzen estimator based on a normalized kernel function. This formulation was investigated in regularization networks, and in particular least-squares support vector machines (LS-SVM), in order to reduce the computational complexity by pruning samples that do not contribute sufficiently to the entropy [21]. More recently, an online learning scheme was proposed in [22] for LS-SVM by using the approximation measure as a sparsification criterion. In our paper, we derive the missing connections between this criterion and the entropy maximization.
Richard, Bermudez and Honeine considered in [23] the analysis of the quadratic entropy of a kernel dictionary in terms of its coherence. We provide in our paper a framework to analyse overcomplete dictionaries with a more extensive examination, in both input and feature spaces, and generalizing to other entropy definitions and all types of kernels. The conducted analysis examines several diversity measures, including, but not limited to, the coherence measure.

II. A PRIMER ON OVERCOMPLETE (KERNEL) APPROXIMATION
In this section, we introduce the sparse approximation problem, in its conventional linear model as well as the kernel-based formulation. We conclude this section with an outline of the issues addressed in this paper.

A. A primer on sparse approximation
Consider a Banach space $\mathbb{X}$ of $\mathbb{R}^d$, called the input space. The approximation theory studies the representation of a given signal $x$ of $\mathbb{X}$ with a dictionary of atoms (i.e., set of elementary signals), $x_1, x_2, \ldots, x_n \in \mathbb{X}$, and estimating their fractions in the signal under scrutiny. In linear approximation, the decomposition takes the form:
$$ x \approx \sum_{i=1}^{n} \alpha_i \, x_i. \qquad (1) $$
This representation is unique when the atoms form a basis, by approximating the signal with its projection onto the span of the atoms, namely
$$ \arg\min_{\alpha_1, \ldots, \alpha_n} \Big\| x - \sum_{i=1}^{n} \alpha_i \, x_i \Big\|^2. \qquad (2) $$
Examples that involve orthonormal bases include the Fourier transform and the discrete cosine transform, as well as the data-dependent Karhunen-Loève transform (i.e., the PCA).
Beyond these orthogonal bases, the relaxation of the orthogonality provides more flexibility with the use of overcomplete dictionaries, which makes it possible to investigate different constraints more properly, such as the sparsity of the representation. In this case, the coefficients $\alpha_i$ in (1) are obtained by promoting the sparsity of the representation. This optimization problem is often called sparse coding, assuming that the dictionary is known. In view of the vector $[\alpha_1 \; \alpha_2 \; \cdots \; \alpha_n]^\top$, sparsity can be promoted by minimizing its $\ell_0$ pseudo-norm, which counts the number of non-zero entries, or its $\ell_1$ norm, which is the closest convex norm to the $\ell_0$ pseudo-norm [24].
Since the seminal work [25] where Olshausen and Field considered learning the atoms from a set of available data, data-driven dictionaries have been widely investigated. A large class of approaches have been proposed to solve iteratively the optimization problem by alternating between the dictionary learning (i.e., estimating the atoms $x_i$) and the sparse coding (i.e., estimating the coefficients $\alpha_i$). The former problem is essentially tackled with the maximum likelihood principle of the data or the maximum a posteriori probability of the dictionary. The latter corresponds to the sparse coding problem. The best known methods for solving the optimization problem, subject to some sparsity-promoting constraint, are the method of optimal directions [4] and the K-SVD algorithm [5], where the dictionary is determined respectively with the Moore-Penrose pseudo-inverse and the SVD scheme. For more details, see [1] and references therein. It is worth noting that the sparsity constraint yields a difficult optimization problem, even when the model is linear in both coefficients and atoms.

B. Kernel-based approximation
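Nonlinear models provide a more challenging issue. The formalism of reproducing kernel Hilbert spaces (RKHS) provides an elegant and efficient framework to tackle nonlinearities. To this end, the signals $x_1, x_2, \ldots, x_n$ are mapped with a nonlinear function into some feature space $\mathbb{H}$, as follows:
$$ x_i \in \mathbb{X} \;\mapsto\; \kappa(x_i, \cdot) \in \mathbb{H}. $$
Here, $\kappa : \mathbb{X} \times \mathbb{X} \to \mathbb{R}$ is a positive definite kernel and the feature space $\mathbb{H}$ is the so-called reproducing kernel Hilbert space. Let $\langle \cdot, \cdot \rangle_{\mathbb{H}}$ and $\| \cdot \|_{\mathbb{H}}$ denote respectively the inner product and norm in the induced space $\mathbb{H}$. This space has some interesting properties, such as the reproducing property, which states that any function $\psi(\cdot)$ of $\mathbb{H}$ can be evaluated at any $x$ of $\mathbb{X}$ with $\psi(x) = \langle \psi(\cdot), \kappa(x, \cdot) \rangle_{\mathbb{H}}$; in particular, $\langle \kappa(x_i, \cdot), \kappa(x_j, \cdot) \rangle_{\mathbb{H}} = \kappa(x_i, x_j)$.
Kernels can be roughly divided into two categories: projective kernels, as functions of the data inner product (i.e., $\langle x_i, x_j \rangle$), and radial kernels, as functions of their distance (i.e., $\|x_i - x_j\|$). The most used kernels and their expressions are given in TABLE I. Among these kernels, only the gaussian and the radial-based exponential kernels are unit-norm, that is $\|\kappa(x, \cdot)\|_{\mathbb{H}} = 1$ for any $x \in \mathbb{X}$. In this paper, we do not restrict ourselves to a particular kernel. We denote
$$ r^2 = \inf_{x \in \mathbb{X}} \kappa(x, x) \quad \text{and} \quad R^2 = \sup_{x \in \mathbb{X}} \kappa(x, x), $$
where $\kappa(x, x) = \|\kappa(x, \cdot)\|_{\mathbb{H}}^2$. For unit-norm kernels, we get $R = r = 1$.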
While the linear kernel yields the conventional model given in (1), nonlinear kernels such as the gaussian kernel provide the models investigated in RBF neural networks, gaussian processes [26] and kernel-based machine learning [27], including the celebrated support vector machines [28]. For a set of data $x_1, x_2, \ldots, x_n \in \mathbb{X}$ and a given kernel $\kappa(\cdot, \cdot)$, the induced RKHS $\mathbb{H}$ is defined such that any element $\psi(\cdot)$ of $\mathbb{H}$ takes the form
$$ \psi(\cdot) = \sum_{i=1}^{n} \alpha_i \, \kappa(x_i, \cdot). $$
When dealing with an approximation problem in the same spirit as (1)-(2), the element $\psi(\cdot)$ approximates $\kappa(x, \cdot)$.
Compared to the linear case given in (1), it is easy to see that the above model is still linear in the coefficients $\alpha_i$, as well as in the "atoms" $\kappa(x_i, \cdot)$, while it is nonlinear with respect to $x_i$. Indeed, the resulting optimization problem consists in minimizing the residual in the RKHS, with
$$ \arg\min \Big\| \kappa(x, \cdot) - \sum_{i=1}^{n} \alpha_i \, \kappa(x_i, \cdot) \Big\|_{\mathbb{H}}^2. $$
On the one hand, the estimation of the coefficients is similar to the one given in the linear case with (2); the classical (linear) sparse coders can be investigated for this purpose. On the other hand, the dictionary determination is more difficult, since the model is nonlinear in the $x_i$; thus, conventional techniques such as the K-SVD algorithm can no longer be used. It turns out that the estimation of the elements in the input space is a tough optimization problem, known in the literature as the pre-image problem [29]. More recently, the authors of [30], [31] adjusted the elements $x_i$ in the input space for nonlinear adaptive filtering with kernels. In another context, the authors of [32], [33] estimated these elements for the kernel nonnegative matrix factorization.
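As a minimal sketch of the coefficient estimation described above, the following Python code solves the least-squares problem in the RKHS for a fixed dictionary. It assumes a gaussian kernel with unit bandwidth and a small hand-picked set of atoms; the names `gaussian_kernel` and `kernel_coefficients` and the data are illustrative and do not come from the paper. Setting the gradient of the residual to zero gives the linear system $K\alpha = \kappa(x)$, with $K$ the Gram matrix of the atoms.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel values between the rows of A and the rows of B."""
    sq = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def kernel_coefficients(atoms, x, sigma=1.0):
    """Coefficients minimizing ||kappa(x,.) - sum_i alpha_i kappa(x_i,.)||_H^2."""
    K = gaussian_kernel(atoms, atoms, sigma)             # Gram matrix of the atoms
    k = gaussian_kernel(atoms, x[None, :], sigma)[:, 0]  # entries kappa(x_i, x)
    alpha = np.linalg.solve(K, k)                        # normal equations K alpha = k
    # Squared residual norm via the kernel trick (kappa(x, x) = 1 for this kernel).
    residual = 1.0 - 2.0 * k @ alpha + alpha @ K @ alpha
    return alpha, residual

# Hypothetical dictionary of five atoms in the plane (illustrative data).
atoms = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]])
alpha_atom, res_atom = kernel_coefficients(atoms, atoms[2])   # x is itself an atom
alpha_new, res_new = kernel_coefficients(atoms, np.array([0.5, 0.5]))
```

When the signal is itself an atom, the solution selects that atom with a unit coefficient and a vanishing residual; for any other signal the residual stays between 0 and $\kappa(x, x) = 1$.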

C. Addressed issues
In either analysis or synthesis of overcomplete (kernel) dictionaries, as the number of atoms grows, an increase in the heterogeneity of the atoms is needed. Such diversification requires that the atoms are not too "close" to each other. Depending on the definition of closeness, several diversity measures have been proposed in the literature. This is the case when the closeness is given in terms of the metric, as given with the distance measure for a pairwise measure between atoms, or the approximation measure for a more thorough measure. This is also the case when the collinearity of the atoms is considered, such as with the coherence and the Babel measures. These diversity measures are described in detail in Section III, within the formalism for a kernel dictionary $\{\kappa(x_1, \cdot), \kappa(x_2, \cdot), \ldots, \kappa(x_n, \cdot)\}$.
In this paper, we connect these diversity measures to the entropy from information theory [17]. Indeed, from the viewpoint of information theory, the set $\{x_1, x_2, \ldots, x_n\}$ can be viewed as a finite source alphabet. A fundamental measure of information is the entropy, which quantifies the disorder or randomness of a given system or set. It is also associated with the number of bits needed, on average, to store or communicate the set under investigation. A detailed definition of the entropy is given in Section V, with connections between the entropy of the set $\{x_1, x_2, \ldots, x_n\}$ and the aforementioned diversity measures of the associated kernel dictionary $\{\kappa(x_1, \cdot), \kappa(x_2, \cdot), \ldots, \kappa(x_n, \cdot)\}$. Several entropy definitions are also investigated, including the generalized Rényi entropy and the Tsallis entropy. Finally, Section VI extends this analysis to the RKHS, by studying the entropy of the set of atoms $\{\kappa(x_1, \cdot), \kappa(x_2, \cdot), \ldots, \kappa(x_n, \cdot)\}$.

III. DIVERSITY MEASURES
In this section, we present measures that quantify the diversity of a given dictionary $\{\kappa(x_1, \cdot), \kappa(x_2, \cdot), \ldots, \kappa(x_n, \cdot)\}$. Each diversity measure is associated with a sparsification criterion for online learning, in order to construct dictionaries with large diversity measures.

A. Cardinality
The cardinality of the dictionary, namely the number $n$ of atoms, is the simplest measure. However, such a measure does not take into account that some atoms can be close to each other, e.g., duplicates.

B. Distance measure
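A simple measure to characterize a dictionary is the smallest distance between all pairs of its atoms, namely
$$ \min_{i \neq j} \big\| \kappa(x_i, \cdot) - \kappa(x_j, \cdot) \big\|_{\mathbb{H}}. $$
In the following, we consider the distance between any two atoms up to a scaling factor, which is a tighter measure since we have
$$ \min_{\xi} \big\| \kappa(x_i, \cdot) - \xi \, \kappa(x_j, \cdot) \big\|_{\mathbb{H}} \leq \big\| \kappa(x_i, \cdot) - \kappa(x_j, \cdot) \big\|_{\mathbb{H}}. $$
A dictionary is said to be $\delta$-distant when
$$ \min_{i \neq j} \min_{\xi} \big\| \kappa(x_i, \cdot) - \xi \, \kappa(x_j, \cdot) \big\|_{\mathbb{H}} \geq \delta. $$
Since the above distance is equivalent to the residual error of approximating any atom by its projection onto another atom, the optimal scaling factor $\xi$ takes the value $\kappa(x_i, x_j) / \kappa(x_j, x_j)$, yielding
$$ \min_{i \neq j} \Big( \kappa(x_i, x_i) - \frac{\kappa(x_i, x_j)^2}{\kappa(x_j, x_j)} \Big) \geq \delta^2. $$
When dealing with unit-norm atoms, this expression boils down to
$$ \max_{i \neq j} |\kappa(x_i, x_j)| \leq \sqrt{1 - \delta^2}. $$
A sparsification criterion for online learning is studied in resource-allocating networks [12], [34] with the "novelty criterion", by imposing a lower bound on the distance measure of the dictionary. Thus, any candidate atom is included in the dictionary if the distance measure of the latter does not fall below a given threshold that controls the level of sparseness.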
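The distance measure and the associated novelty criterion can be sketched as follows. This is an illustrative implementation under stated assumptions, not the procedure of [12], [34]: it uses a gaussian kernel with unit bandwidth, evaluates the scaled distance from the Gram matrix through $\kappa(x_i, x_i) - \kappa(x_i, x_j)^2 / \kappa(x_j, x_j)$, and the names `gaussian_gram`, `distance_measure` and `novelty_accepts` are hypothetical.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix of the gaussian kernel for the rows of X."""
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    return np.exp(-sq / (2.0 * sigma**2))

def distance_measure(K):
    """min over i != j of sqrt(K_ii - K_ij^2 / K_jj), from a Gram matrix K."""
    d = np.diag(K)
    sq = d[:, None] - K**2 / d[None, :]   # squared residual of atom i projected on atom j
    np.fill_diagonal(sq, np.inf)          # ignore the trivial pairs i == j
    return float(np.sqrt(max(sq.min(), 0.0)))

def novelty_accepts(atoms, candidate, delta, sigma=1.0):
    """Accept the candidate if the augmented dictionary remains delta-distant."""
    K = gaussian_gram(np.vstack([atoms, candidate]), sigma)
    return distance_measure(K) >= delta

# Illustrative atoms; the closest pair is at unit distance.
atoms = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
delta_hat = distance_measure(gaussian_gram(atoms))
```

For these atoms the closest pair has $\kappa(x_i, x_j) = e^{-1/2}$, so the measure equals $\sqrt{1 - e^{-1}}$. A candidate placed almost on top of an existing atom is rejected for a threshold of 0.5, whereas a distant candidate is accepted.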

C. Approximation measure
While the distance measure relies only on the nearest two atoms, the approximation measure provides a more exhaustive analysis by quantifying the capacity of approximating any atom with a linear combination of the other atoms of the dictionary. A dictionary is said to be $\delta$-approximate if the following is satisfied:
$$ \min_{i} \min_{\xi_1, \ldots, \xi_n} \Big\| \kappa(x_i, \cdot) - \sum_{j \neq i} \xi_j \, \kappa(x_j, \cdot) \Big\|_{\mathbb{H}} \geq \delta. \qquad (6) $$
This expression corresponds to the residual error of projecting any atom onto the subspace spanned by the other atoms. By nullifying the derivative of the above cost function with respect to each coefficient $\xi_j$, we get the optimal vector of coefficients
$$ \xi = K_{\backslash\{i\}}^{-1} \, \kappa_{\backslash\{i\}}(x_i). \qquad (7) $$
Here, $K_{\backslash\{i\}}$ and $\kappa_{\backslash\{i\}}(x_i)$ are obtained by removing the entries associated to $x_i$ from $K$ and $\kappa(x_i)$, respectively, where $K$ is the Gram matrix of entries $\kappa(x_i, x_j)$ and $\kappa(\cdot)$ is the column vector of entries $\kappa(x_j, \cdot)$, for $i, j = 1, \ldots, n$. By plugging the above expression in (6), we obtain:
$$ \min_{i} \Big( \kappa(x_i, x_i) - \kappa_{\backslash\{i\}}(x_i)^\top K_{\backslash\{i\}}^{-1} \, \kappa_{\backslash\{i\}}(x_i) \Big) \geq \delta^2. \qquad (8) $$
The sparsification criterion associated with the approximation measure is studied in [35], [36] and more recently in [37] for system identification and [16] for kernel principal component analysis. This criterion constructs dictionaries with a high approximation measure, thus including any candidate atom in the dictionary if it cannot be well approximated by a linear combination of atoms already in the dictionary, for a given approximation threshold.
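A direct, unoptimized way to evaluate the approximation measure from a Gram matrix is sketched below: for each atom, the optimal coefficients are obtained by solving the linear system with the Gram matrix of the other atoms, and the squared residual follows. The function name `approximation_measure` and the example data are assumptions made for illustration; practical online schemes would rather update the inverse recursively.

```python
import numpy as np

def approximation_measure(K):
    """Smallest residual norm of projecting an atom onto the span of the others."""
    n = K.shape[0]
    residuals = np.empty(n)
    for i in range(n):
        others = np.delete(np.arange(n), i)
        K_sub = K[np.ix_(others, others)]    # Gram matrix without atom i
        k_i = K[others, i]                   # correlations of atom i with the others
        xi = np.linalg.solve(K_sub, k_i)     # optimal coefficients
        residuals[i] = K[i, i] - k_i @ xi    # squared residual error
    return float(np.sqrt(max(residuals.min(), 0.0)))

# Illustrative gaussian Gram matrix (unit bandwidth) of four atoms in the plane.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1) / 2.0)
delta_hat = approximation_measure(K)
```

As a sanity check, the squared residual of atom $i$ is the Schur complement of $K_{\backslash\{i\}}$ in $K$, which equals the reciprocal of the $i$-th diagonal entry of $K^{-1}$; an orthonormal dictionary (identity Gram matrix) gives a measure of one.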

D. Coherence measure
In the literature of sparse linear approximation, the coherence is a fundamental quantity to characterize dictionaries. It corresponds to the largest correlation between atoms of a given dictionary, or mutually between atoms of two dictionaries. While initially introduced for linear matching pursuit in [9], it has been studied for the union of two bases [38], for basis pursuit with arbitrary dictionaries [10], and for the analysis of the approximation quality [11], [8]. While most works consider the use of a linear measure, we explore in the following the coherence of a kernel dictionary, as initially studied in [39].
For a given dictionary, the coherence is defined by the largest correlation between all pairs of atoms, namely
$$ \max_{i \neq j} \frac{|\langle \kappa(x_i, \cdot), \kappa(x_j, \cdot) \rangle_{\mathbb{H}}|}{\|\kappa(x_i, \cdot)\|_{\mathbb{H}} \, \|\kappa(x_j, \cdot)\|_{\mathbb{H}}}. $$
It is easy to see that this definition can be written, for a so-called $\gamma$-coherent dictionary, as follows:
$$ \max_{i \neq j} \frac{|\kappa(x_i, x_j)|}{\sqrt{\kappa(x_i, x_i) \, \kappa(x_j, x_j)}} \leq \gamma. $$
For unit-norm atoms, we get $\max_{i \neq j} |\kappa(x_i, x_j)| \leq \gamma$. The coherence criterion for sparsification constructs a "low-coherent" dictionary, thus enforcing an upper bound on the cosine of the angle between each pair of atoms [23]. In this case, any candidate atom is included in the dictionary if the coherence of the latter does not exceed a given threshold. This threshold controls the level of sparseness of the dictionary, where a null value yields an orthogonal basis.
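The coherence of a dictionary with atoms that are not unit-norm can be computed from the Gram matrix by normalizing with the diagonal entries. The sketch below uses the linear kernel on three hand-picked atoms so that the result can be checked by hand; the names `coherence` and `coherence_accepts` are hypothetical, and the acceptance rule is a simplified reading of the coherence criterion rather than a reference implementation.

```python
import numpy as np

def coherence(K):
    """Largest normalized correlation |K_ij| / sqrt(K_ii K_jj) over pairs i != j."""
    norms = np.sqrt(np.diag(K))
    C = np.abs(K) / np.outer(norms, norms)
    np.fill_diagonal(C, 0.0)               # discard the trivial pairs i == j
    return float(C.max())

def coherence_accepts(atoms, candidate, gamma0):
    """Accept the candidate if the augmented dictionary stays gamma0-coherent."""
    X = np.vstack([atoms, candidate])
    return coherence(X @ X.T) <= gamma0     # linear kernel: K = X X^T

# Illustrative atoms with different norms.
atoms = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
mu = coherence(atoms @ atoms.T)
```

Here the third atom makes an angle of 45 degrees with each of the first two, so the coherence is $\cos(\pi/4) = \sqrt{1/2}$, regardless of the atom norms.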

E. Babel measure
While the coherence relies only on the most correlated atoms in the dictionary, a more thorough measure is the Babel measure, which considers the largest cumulative correlation between an atom and all the other atoms of the dictionary. The Babel measure can be defined in two ways. The first one is by connecting it to the coherence measure, with a definition related to the cumulative coherence, namely
$$ \max_{i} \sum_{j \neq i} \frac{|\kappa(x_i, x_j)|}{\sqrt{\kappa(x_i, x_i) \, \kappa(x_j, x_j)}}. \qquad (10) $$
The second (and most conventional) way to define the Babel measure is by investigating an analogy with the norm operator [40], [8]. Indeed, while the coherence is the $\infty$-norm of the Gram matrix when dealing with unit-norm atoms, the Babel measure explores the $\ell_1$ matrix-norm, where $\|K\|_1 = \max_i \sum_j |\kappa(x_i, x_j)|$. As a consequence, a dictionary is said to be $\gamma$-Babel when
$$ \max_{i} \sum_{j \neq i} |\kappa(x_i, x_j)| \leq \gamma. \qquad (11) $$
Connecting this definition with (10), for not necessarily unit-norm atoms, is straightforward, since the latter can be box-bounded for any $\gamma$-Babel dictionary defined by (11), with
$$ \frac{1}{R^2} \max_{i} \sum_{j \neq i} |\kappa(x_i, x_j)| \;\leq\; \max_{i} \sum_{j \neq i} \frac{|\kappa(x_i, x_j)|}{\sqrt{\kappa(x_i, x_i) \, \kappa(x_j, x_j)}} \;\leq\; \frac{1}{r^2} \max_{i} \sum_{j \neq i} |\kappa(x_i, x_j)|. $$
For this reason and for the sake of simplicity, we consider the definition (11) in this paper.
The sparsification criterion associated with the Babel measure constructs dictionaries with a low cumulative coherence [41]. To this end, any candidate atom $\kappa(x_t, \cdot)$ is included in the dictionary if (and only if) the Babel measure of the resulting dictionary does not exceed a given positive threshold.
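The two ways of measuring cumulative correlation can be compared numerically. In the sketch below, `babel` sums the absolute off-diagonal entries of each row of the Gram matrix and `cumulative_coherence` does the same after normalizing by the atom norms; both names are illustrative, and the Gram matrix is the linear-kernel matrix of three hand-picked atoms, chosen so that the values can be verified by hand.

```python
import numpy as np

def babel(K):
    """Largest cumulative correlation: max_i sum_{j != i} |K_ij|."""
    A = np.abs(K)
    np.fill_diagonal(A, 0.0)
    return float(A.sum(axis=1).max())

def cumulative_coherence(K):
    """Same as babel, after normalizing each entry by sqrt(K_ii K_jj)."""
    norms = np.sqrt(np.diag(K))
    return babel(K / np.outer(norms, norms))

# Linear-kernel Gram matrix of the atoms (1,0), (0,2), (1,1).
K = np.array([[1.0, 0.0, 1.0],
              [0.0, 4.0, 2.0],
              [1.0, 2.0, 2.0]])
gamma_babel = babel(K)                    # row sums off the diagonal: 1, 2, 3
gamma_cumul = cumulative_coherence(K)     # third row: 1/sqrt(2) + 2/sqrt(8)
r2, R2 = np.diag(K).min(), np.diag(K).max()
```

With these atoms, the unnormalized measure is 3 and the normalized one is $\sqrt{2}$, which lies between the measure divided by the largest squared norm and the measure divided by the smallest one.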

IV. SOME FUNDAMENTAL RESULTS
Before proceeding throughout this paper with a rigorous analysis of any overcomplete dictionary in terms of its diversity measure, we provide in the following some results that are essential to our study. These results provide an attempt to bridge the gap between the different diversity measures.

A. Coherence versus Babel measure
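The following theorems connect the coherence of a dictionary to its Babel measure by quantifying the Babel measure of a $\gamma$-coherent dictionary, and vice-versa. The following theorem has been known for a while in the case of unit-norm atoms.
Theorem 1: A $\gamma$-coherent dictionary has a Babel measure that does not exceed $(n - 1) \gamma R^2$.
Proof: Following the definition (11), the Babel measure of a $\gamma$-coherent dictionary is upper-bounded as follows:
$$ \max_{i} \sum_{j \neq i} |\kappa(x_i, x_j)| \leq \gamma \max_{i} \sum_{j \neq i} \sqrt{\kappa(x_i, x_i) \, \kappa(x_j, x_j)} \leq (n - 1) \gamma R^2. $$
Furthermore, it is also easy to provide an upper bound on the coherence of a dictionary with a given Babel measure, as given in the following theorem.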
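Theorem 2: A $\gamma$-Babel dictionary has a coherence that does not exceed $\gamma / r^2$.
Proof: The proof follows from the relation
$$ \frac{|\kappa(x_i, x_j)|}{\sqrt{\kappa(x_i, x_i) \, \kappa(x_j, x_j)}} \leq \frac{|\kappa(x_i, x_j)|}{r^2} $$
and the inequality between matrix norms:
$$ \max_{i \neq j} |\kappa(x_i, x_j)| \leq \max_{i} \sum_{j \neq i} |\kappa(x_i, x_j)| \leq \gamma. $$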
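The two relations between the coherence and the Babel measure can be checked numerically on a dictionary with atoms that are not unit-norm. The sketch below is only an illustration under stated assumptions: it draws random atoms, uses a polynomial kernel of degree 2, and replaces $r^2$ and $R^2$ by the smallest and largest diagonal entries of the Gram matrix (which can only tighten the bounds). The upper bound $(n - 1)\gamma R^2$ on the Babel measure of a $\gamma$-coherent dictionary follows from bounding each of the $n - 1$ correlations of an atom by $\gamma R^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))               # six illustrative atoms in R^3
K = (1.0 + X @ X.T)**2                    # polynomial kernel of degree 2
n = K.shape[0]
d = np.diag(K)
r2, R2 = d.min(), d.max()                 # extreme squared norms over the atoms

off = ~np.eye(n, dtype=bool)              # mask of the pairs i != j
coherence = (np.abs(K) / np.sqrt(np.outer(d, d)))[off].max()
babel = (np.abs(K) * off).sum(axis=1).max()

# Babel measure of a coherent dictionary, and coherence of a Babel dictionary.
babel_bound = (n - 1) * coherence * R2
coherence_bound = babel / r2
```

Both inequalities hold for any Gram matrix, so the check does not depend on the random seed; the bounds are typically loose when the atom norms differ widely.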

B. Analysis of a δ-approximate dictionary
The following theorem is fundamental in the analysis of a dictionary resulting from the approximation criterion.
Theorem 3: A $\delta$-approximate dictionary has a Babel measure that does not exceed $R^2 - \delta^2$, and a coherence measure that does not exceed $(R^2 - \delta^2) / r^2$.
Proof: For a $\delta$-approximate dictionary, we have from (7): $K_{\backslash\{i\}} \, \xi = \kappa_{\backslash\{i\}}(x_i)$, for any $i = 1, 2, \ldots, n$. By plugging this relation in (8), we obtain
$$ \kappa(x_i, x_i) - \xi^\top \kappa_{\backslash\{i\}}(x_i) \geq \delta^2. $$
By considering the special case of the vector $\xi$ with $\xi_j = \mathrm{sign}(\kappa(x_i, x_j))$, for any $j = 1, 2, \ldots, n$ and $j \neq i$, we get
$$ \kappa(x_i, x_i) - \sum_{j \neq i} |\kappa(x_i, x_j)| \geq \delta^2, $$
for all $i = 1, 2, \ldots, n$. As a consequence,
$$ \max_{i} \sum_{j \neq i} |\kappa(x_i, x_j)| \leq R^2 - \delta^2. $$
This concludes the proof for the Babel measure, since it is the left-hand side in the above expression, while the upper bound on the coherence measure is obtained from the aforementioned connection between the coherence and the Babel measures as given in Theorem 2.

V. ENTROPY ANALYSIS IN THE INPUT SPACE
The entropy measures the disorder or randomness within a given system. The Rényi entropy provides a generalization of well-known entropy definitions, such as Shannon and Hartley entropies as well as the quadratic entropy (see TABLE II). It is defined for a given order $\alpha$ by
$$ H_\alpha = \frac{1}{1 - \alpha} \log \int_{\mathbb{X}} P(x)^\alpha \, dx, \qquad (13) $$
for the probability distribution $P$ that governs all elements $x$ of $\mathbb{X}$. When dealing with discrete random variables as in source coding, this definition is restricted to the set $\{x_1, x_2, \ldots, x_n\}$ drawn from the probability distribution $P$, yielding the expression
$$ H_\alpha = \frac{1}{1 - \alpha} \log \sum_{j=1}^{n} P(x_j)^\alpha. $$
Large values of the entropy correspond to a more uniform spread of the data. Since this probability distribution is unknown in practice, it is often approximated with a Parzen window estimator (also called kernel density estimator). The estimator takes the form
$$ \hat{P}(x) = \frac{1}{n} \sum_{j=1}^{n} w(x - x_j), $$
for a given window function $w$ centered at each $x_j$. For more details, see for instance [42].
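The Parzen window estimator and the discrete Rényi entropy can be sketched in a few lines. The code below is illustrative only: it assumes a normalized gaussian window, and it renormalizes the density values at the samples so that they sum to one before plugging them into the discrete entropy, which is a simplifying assumption of this sketch rather than a step taken from the paper. The names `parzen_estimate` and `renyi_entropy` are hypothetical.

```python
import numpy as np

def parzen_estimate(points, samples, sigma=1.0):
    """Parzen estimate (1/n) sum_j w(x - x_j) with a normalized gaussian window."""
    d = samples.shape[1]
    sq = np.sum((points[:, None, :] - samples[None, :, :])**2, axis=-1)
    w = np.exp(-sq / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)**(d / 2.0)
    return w.mean(axis=1)

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (alpha != 1) of discrete probabilities p."""
    p = np.asarray(p, dtype=float)
    return float(np.log(np.sum(p**alpha)) / (1.0 - alpha))

# Three illustrative one-dimensional samples: two close together, one far away.
samples = np.array([[0.0], [0.5], [3.0]])
density = parzen_estimate(samples, samples)
probs = density / density.sum()           # assumption: renormalize to sum to one
h2 = renyi_entropy(probs, 2.0)
```

A uniform distribution over $n$ elements reaches the maximal value $\log n$ for every order, which is why larger entropy values indicate a more uniform spread.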
In the following, we provide lower bounds on the entropy of an overcomplete dictionary, in terms of its diversity measure. To this end, we initially restrict ourselves to the case of the quadratic entropy (i.e., $\alpha = 2$), first with the gaussian kernel then with any type of kernel, before generalizing these results to any order $\alpha$ of the Rényi entropy as well as the Tsallis entropy.

A. The quadratic entropy with the gaussian kernel
Before generalizing to any window function in Section V-B and any order in Section V-C, we restrict ourselves first to the case of the gaussian window function with the quadratic entropy. The quadratic entropy is defined by
$$ H_2 = -\log \int_{\mathbb{X}} P(x)^2 \, dx. $$
With a gaussian window function $w$ for some bandwidth parameter $\sigma$, the Parzen estimator becomes a sum of gaussian functions centered at the elements $x_j$. Since the convolution of two gaussian distributions leads to another gaussian distribution, then
$$ H_2 \approx -\log \Big( \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \kappa(x_i, x_j) \Big), \qquad (16) $$
where $\kappa(\cdot, \cdot)$ is the gaussian kernel. This expression shows that the sum of the entries in the Gram matrix describes the diversity of the dictionary elements, a result corroborated in [20] and more recently in [42]. This property was investigated in [21] for pruning the LS-SVM, by removing samples with the smallest entries in the Gram matrix.
Each diversity measure studied in Section III yields a lower bound on the entropy of the dictionary under scrutiny. To show this, we consider first the Babel measure with a $\gamma$-Babel dictionary. Following the Babel definition in (11), the entropy given in (16) is lower-bounded as follows:
$$ H_2 \geq \log n - \log(1 + \gamma), $$
where we have used the following upper bound on the summation:
$$ \sum_{i=1}^{n} \sum_{j=1}^{n} \kappa(x_i, x_j) = n + \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \kappa(x_i, x_j) \leq n + n \gamma. $$
This result provides the core of the proof. Indeed, Theorem 2 shows that this result holds also for a $\gamma$-coherent dictionary. Furthermore, we can improve this bound for the coherence measure, since $\sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} \kappa(x_i, x_j) \leq n(n - 1)\gamma$, thus yielding the following lower bound on the entropy:
$$ H_2 \geq \log n - \log\big(1 + (n - 1)\gamma\big). $$
This result is also shared with a $\delta$-distant dictionary, by substituting $\gamma$ with $\sqrt{1 - \delta^2}$, since the distance is equivalent to the coherence when dealing with normalized kernels. Finally, Theorem 3 establishes the connection with a $\delta$-approximate dictionary, where the above upper bound becomes
$$ \sum_{i=1}^{n} \sum_{j=1}^{n} \kappa(x_i, x_j) \leq n + n(1 - \delta^2), \quad \text{that is,} \quad H_2 \geq \log n - \log(2 - \delta^2). $$
All these results provide lower bounds on the entropy, with the following observations. These bounds increase with the number of elements in the dictionary, i.e., $n$, which is expected as the diversity grows. They decrease when the coherence and the Babel measures increase, while they increase when the distance and the approximation measures increase. These results provide quantitative details that confirm the fact that, when using a sparsification criterion for online learning, low values of the coherence and Babel thresholds provide less "correlated" atoms and thus more diversity within the dictionary, as opposed to high values of the distance and approximation thresholds.
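The lower bounds in terms of the Babel and coherence measures can be checked numerically for the gaussian kernel. The sketch below estimates the quadratic entropy as minus the logarithm of the mean entry of the Gram matrix and compares it with $\log n - \log(1 + \gamma)$ for the Babel measure and $\log n - \log(1 + (n - 1)\gamma)$ for the coherence, both of which follow from bounding the sum of the off-diagonal entries. The random atoms, the bandwidth and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 2))                           # eight illustrative atoms
sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-sq / (2.0 * 0.5**2))                      # gaussian kernel, bandwidth 0.5
n = K.shape[0]

h2 = -np.log(K.mean())                                # quadratic entropy estimate
off = K - np.eye(n)                                   # off-diagonal part (unit diagonal)
babel = off.sum(axis=1).max()                         # Babel measure
coherence = off.max()                                 # coherence (unit-norm atoms)

bound_babel = np.log(n) - np.log(1.0 + babel)
bound_coherence = np.log(n) - np.log(1.0 + (n - 1) * coherence)
```

Since the Babel measure never exceeds $(n - 1)$ times the coherence, the Babel bound is at least as tight as the coherence bound, and the estimate itself cannot exceed $\log n$.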

B. The quadratic entropy with any kernel
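The results presented so far can be extended to any kernel, even non-unit-norm kernels. To see this, we define the Parzen estimator in an RKHS, by writing the integral $\int_{\mathbb{X}} P(x)^2 \, dx$ as the quadratic norm $\|\hat{P}\|_{\mathbb{H}}^2$ of
$$ \hat{P}(\cdot) = \frac{1}{n} \sum_{j=1}^{n} \kappa(x_j, \cdot), $$
where the norm is given in the subspace spanned by the kernel functions $\kappa(x_1, \cdot), \kappa(x_2, \cdot), \ldots, \kappa(x_n, \cdot)$. Therefore, we have
$$ H_2 \approx -\log \|\hat{P}\|_{\mathbb{H}}^2 = -\log \Big( \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \kappa(x_i, x_j) \Big). $$
By following the same steps as in Section V-A, we can derive the following lower bounds on the quadratic entropy:
$$ H_2 \geq \log n - \log(R^2 + \gamma) \quad \text{for a } \gamma\text{-Babel dictionary}, $$
$$ H_2 \geq \log n - \log(2R^2 - \delta^2) \quad \text{for a } \delta\text{-approximate dictionary}, $$
$$ H_2 \geq \log n - \log\big(R^2 (1 + (n - 1)\gamma)\big) \quad \text{for a } \gamma\text{-coherent dictionary}, $$
$$ H_2 \geq \log n - \log\big(R^2 + (n - 1) R \sqrt{R^2 - \delta^2}\big) \quad \text{for a } \delta\text{-distant dictionary}. $$
Before providing the proof of these results, it is worth noting that the conclusion and discussion conducted in the case of the gaussian kernel are still satisfied in the general case of any kernel type.
Proof: The bounds for the $\delta$-approximate and $\gamma$-Babel dictionaries are straightforward from Theorem 3 and the definition in (11). The lower bounds for $\gamma$-coherent and $\delta$-distant dictionaries are a bit trickier to prove. To show this, we use for the former the following relation
$$ |\kappa(x_i, x_j)| \leq \gamma \sqrt{\kappa(x_i, x_i) \, \kappa(x_j, x_j)} \leq \gamma R^2, $$
and for the latter the following relation
$$ \kappa(x_i, x_j)^2 \leq \kappa(x_j, x_j) \big( \kappa(x_i, x_i) - \delta^2 \big) \leq R^2 (R^2 - \delta^2). $$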

C. Generalization to Rényi and Tsallis entropies
So far, we have investigated the quadratic entropy and derived lower bounds for each diversity measure. It turns out that these results can be extended to the general Rényi entropy and Tsallis entropy, as shown next. Special cases of the former are listed in TABLE II, including the Hartley or maximum entropy, which is associated with the cardinality of the set, the Shannon entropy, which is essentially the Gibbs entropy in statistical thermodynamics, the quadratic entropy, also called collision entropy, as well as the min-entropy, which is the smallest measure in the family of Rényi entropies.
Corollary 4: Any lower bound $\zeta$ on the quadratic entropy provides lower bounds on the Hartley entropy $H_0$, the Shannon entropy $H_1$, and the min-entropy $H_\infty$, with
$$ H_0 \geq H_1 \geq \zeta \quad \text{and} \quad H_\infty \geq \tfrac{1}{2} \zeta. $$
The proof is due to Jensen's inequality and the concavity of the Rényi entropy for nonnegative orders. First, the relation of the Shannon entropy is given by exploring the following inequality:
$$ H_1 = -\sum_{j=1}^{n} P(x_j) \log P(x_j) \geq -\log \sum_{j=1}^{n} P(x_j)^2 = H_2. $$
The connection to the Hartley entropy is straightforward, with $H_0 = \log n$. Finally, it is trickier to study the min-entropy, since it is the smallest entropy measure in the family of Rényi entropies; as a consequence, it is the strongest way to measure the information content. To provide a lower bound on the min-entropy, we use the relations
$$ H_\infty \leq H_2 \leq 2 H_\infty, $$
which yields the following inequality:
$$ H_\infty \geq \tfrac{1}{2} H_2 \geq \tfrac{1}{2} \zeta. $$
Furthermore, one can easily extend these results to the class of the Tsallis entropy, also called nonadditive entropy, defined by the following expression for a given parameter $q$ (called entropic index) [18], [19]:
$$ \frac{1}{q - 1} \Big( 1 - \int_{\mathbb{X}} P(x)^q \, dx \Big). $$
To this end, the aforementioned lower bounds on the Rényi entropy can be extended to the Tsallis entropy by using for instance the well-known relation $\log u \leq u - 1$ for any $u > 0$. As a consequence, the lower bounds on the quadratic entropy given in Sections V-A and V-B can be extended to other orders of the Rényi entropy and to the Tsallis entropy.
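The ordering of the Rényi entropies used above, together with the link to the Tsallis entropy, can be illustrated on a small discrete distribution. The sketch below is a hedged illustration: the functions `renyi` and `tsallis` and the probability vector are chosen for this example only, and the Tsallis entropy is written in its discrete form.

```python
import numpy as np

def renyi(p, alpha):
    """Discrete Renyi entropy, with the Hartley, Shannon and min-entropy limits."""
    p = np.asarray(p, dtype=float)
    if alpha == 0:
        return float(np.log(np.count_nonzero(p)))     # Hartley entropy
    if alpha == 1:
        q = p[p > 0]
        return float(-np.sum(q * np.log(q)))          # Shannon entropy
    if np.isinf(alpha):
        return float(-np.log(p.max()))                # min-entropy
    return float(np.log(np.sum(p**alpha)) / (1.0 - alpha))

def tsallis(p, q):
    """Discrete Tsallis entropy of entropic index q (q != 1)."""
    p = np.asarray(p, dtype=float)
    return float((1.0 - np.sum(p**q)) / (q - 1.0))

p = np.array([0.5, 0.25, 0.125, 0.125])               # illustrative distribution
h0, h1, h2, hinf = (renyi(p, a) for a in (0, 1, 2, np.inf))
t2 = tsallis(p, 2.0)
```

For this distribution the chain $H_0 \geq H_1 \geq H_2 \geq H_\infty \geq H_2 / 2$ holds, and the relation $\log u \leq u - 1$ applied to $u = \sum_j P(x_j)^2$ gives $H_2 \geq 1 - \sum_j P(x_j)^2$, which is the Tsallis entropy of index 2.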

VI. ENTROPY IN THE FEATURE SPACE
By analogy with the entropy analysis in the input space conducted in Section V, we propose to revisit it in the feature space, as given in this section. By examining the pairwise distance between any two atoms of the investigated dictionary, we first establish in Section VI-A a topological analysis of overcomplete dictionaries. This analysis is explored in Section VI-B with the study of the entropy of the atoms in the feature space. By providing lower bounds in terms of the diversity measures, these results provide connections to the entropy analysis conducted in the previous section.

A. Fundamental analysis
The following theorem is used in the following section for the analysis of the atoms of a kernel dictionary.
Theorem 5: For any dictionary with a non-zero approximation measure, or a non-unit coherence measure, or a Babel measure below $r^2$, the distance measure is lower-bounded by a strictly positive value.
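Proof: The proof is straightforward for a $\delta$-approximate dictionary, since
$$ \big\| \kappa(x_i, \cdot) - \kappa(x_j, \cdot) \big\|_{\mathbb{H}} \geq \min_{\xi_1, \ldots, \xi_n} \Big\| \kappa(x_i, \cdot) - \sum_{j \neq i} \xi_j \, \kappa(x_j, \cdot) \Big\|_{\mathbb{H}} \geq \delta. $$
For the coherence measure, we consider the pairwise distance in terms of kernels as given in (4), namely $\|\kappa(x_i, \cdot) - \kappa(x_j, \cdot)\|_{\mathbb{H}}^2 = \kappa(x_i, x_i) - 2 \kappa(x_i, x_j) + \kappa(x_j, x_j)$. Since a $\gamma$-coherent dictionary satisfies $\max_{i \neq j} |\kappa(x_i, x_j)| / \sqrt{\kappa(x_i, x_i) \, \kappa(x_j, x_j)} \leq \gamma$, we get
$$ \big\| \kappa(x_i, \cdot) - \kappa(x_j, \cdot) \big\|_{\mathbb{H}}^2 \geq \kappa(x_i, x_i) - 2 \gamma \sqrt{\kappa(x_i, x_i) \, \kappa(x_j, x_j)} + \kappa(x_j, x_j). $$
Therefore, to complete the proof, it is sufficient to show that this expression is always strictly positive. Indeed, it is a quadratic polynomial of the form $u^2 - 2 \gamma u v + v^2$ where $u = \sqrt{\kappa(x_i, x_i)}$ and $v = \sqrt{\kappa(x_j, x_j)}$ (this form is valid since $\kappa(x, x) = \|\kappa(x, \cdot)\|_{\mathbb{H}}^2 > 0$ for any $x \in \mathbb{X}$). Considering the roots of this quadratic polynomial with respect to $u$, its discriminant is $4 \kappa(x_j, x_j)(\gamma^2 - 1)$, which is strictly negative since $\gamma \in [0, 1)$ and $\kappa(x_j, x_j)$ cannot be zero. Therefore, the polynomial has no real roots, and it is strictly positive.
Finally, for any $\gamma$-Babel dictionary, we have
$$ \big\| \kappa(x_i, \cdot) - \kappa(x_j, \cdot) \big\|_{\mathbb{H}}^2 = \kappa(x_i, x_i) - 2 \kappa(x_i, x_j) + \kappa(x_j, x_j) \geq 2 r^2 - 2 \gamma, $$
which is strictly positive when $\gamma < r^2$.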
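The lower bounds on the pairwise distance in terms of the coherence and the Babel measure can be verified numerically. The sketch below is an illustration under stated assumptions: it uses a gaussian kernel with a small bandwidth on five hand-picked atoms (so that $r = R = 1$ and the Babel measure is below one), expands the squared distance with the kernel trick, and compares it with $u^2 - 2\gamma u v + v^2$ and $2r^2 - 2\gamma$. The variable names are illustrative.

```python
import numpy as np

# Five illustrative atoms, at least one unit apart; gaussian kernel, bandwidth 0.4.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5]])
sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-sq / (2.0 * 0.4**2))
n = K.shape[0]
d = np.diag(K)
off = ~np.eye(n, dtype=bool)                          # mask of the pairs i != j

# Squared pairwise distances in the feature space, via the kernel trick.
D2 = d[:, None] - 2.0 * K + d[None, :]

u = np.sqrt(d)                                        # atom norms
coherence = (np.abs(K) / np.outer(u, u))[off].max()
babel = (np.abs(K) * off).sum(axis=1).max()

# Lower bounds: coherence-based (pairwise) and Babel-based (uniform).
coh_bound = u[:, None]**2 - 2.0 * coherence * np.outer(u, u) + u[None, :]**2
babel_bound = 2.0 * d.min() - 2.0 * babel
```

The Babel-based bound is informative only when the Babel measure is below $r^2$, as in this example; for strongly correlated atoms it becomes negative and therefore trivial, whereas the coherence-based bound stays strictly positive as long as the coherence is below one.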

B. Entropy in the RKHS
The entropy in the feature space provides a measure of diversity of the distribution of the atoms. In the following, we show that the entropy estimated in the feature space is lower-bounded, with a bound expressed in terms of a diversity measure.
We denote by $P_{\mathbb{H}}(x)$ the distribution associated to the kernel functions in the feature space, namely by definition $P_{\mathbb{H}}(x) = P(\kappa(x, \cdot))$. The entropy in the RKHS is given by expression (13) where $P(x)$ is substituted with $P_{\mathbb{H}}(x)$, yielding
$$ \frac{1}{1 - \alpha} \log \int_{\mathbb{X}} P_{\mathbb{H}}(x)^\alpha \, dx. $$
By approximating the integral in this expression with the set $\{x_1, x_2, \ldots, x_n\}$, we get
$$ \frac{1}{1 - \alpha} \log \sum_{j=1}^{n} P_{\mathbb{H}}(x_j)^\alpha. $$
The distribution $P_{\mathbb{H}}(\cdot)$ is estimated with the Parzen window estimator. The use of a radial function $w(\cdot)$ defined in the feature space $\mathbb{H}$ yields
$$ \hat{P}_{\mathbb{H}}(x) = \frac{1}{n} \sum_{j=1}^{n} w\big( \| \kappa(x, \cdot) - \kappa(x_j, \cdot) \|_{\mathbb{H}} \big). $$
Examples of radial functions are, up to a scaling factor to ensure the integration to one, the gaussian, the radial-based exponential and the inverse multiquadratic kernels, given in TABLE I and applied here in the feature space. Radial kernels are monotonically decreasing in the distance, namely $\kappa(x_i, x_j)$ grows when $\|x_i - x_j\|$ is decreasing. This statement results from the following lemma; see also [45, Proposition 5].
Theorem 7: Consider an overcomplete kernel dictionary with a lower bound $\epsilon$ on its distance measure, or any bounded diversity measure as given in Theorem 5. A Parzen window estimator, estimated over the dictionary atoms in the feature space, is upper-bounded by $w(\epsilon)$, where $w(\cdot)$ is the used window function.
Proof: For any atom of the dictionary, we have
$$ \hat{P}_{\mathbb{H}}(x_i) = \frac{1}{n} \sum_{j=1}^{n} w\big( \| \kappa(x_i, \cdot) - \kappa(x_j, \cdot) \|_{\mathbb{H}} \big) \leq \frac{1}{n} \sum_{j=1}^{n} w(\epsilon) = w(\epsilon), $$
where the inequality is due to the monotonically decreasing property of the window function $w$ and Theorem 5.
This theorem is the main building block of the following corollary that provides lower bounds on the entropy, with the Shannon entropy and generalizing to the Rényi entropy for any order $\alpha > 1$.
Corollary 8: Consider an overcomplete kernel dictionary with a lower bound $\epsilon$ on its distance measure, or any bounded diversity measure as given in Theorem 5. The Shannon entropy and the generalized Rényi entropy for any order $\alpha > 1$ are lower-bounded by $-n \, w(\epsilon) \log w(\epsilon)$ and $\frac{1}{1 - \alpha} \log\big( n \, w(\epsilon)^\alpha \big)$, respectively, where $w(\cdot)$ is the used window function.
More generally, the Rényi entropy for any order $\alpha$ is estimated by
$$ \frac{1}{1 - \alpha} \log \sum_{j=1}^{n} \hat{P}_{\mathbb{H}}(x_j)^\alpha \geq \frac{1}{1 - \alpha} \log\big( n \, w(\epsilon)^\alpha \big), $$
where we have used Theorem 7 and $\alpha > 1$.
These results illustrate how the atoms of an overcomplete dictionary are uniformly spread in the feature space.

VII. FINAL REMARKS
This paper provided a framework to examine linear and kernel dictionaries with the notion of entropy from information theory. By examining different diversity measures, we showed that overcomplete dictionaries have lower bounds on the entropy. While various definitions were explored here, these results open the door to bridging the gap between information theory and diversity measures for the analysis and synthesis of overcomplete dictionaries, in both input and feature spaces. As for future work, we are studying connections to the entropy component analysis [42], in order to provide a thorough examination and develop an online learning approach.

TABLE I
THE MOST USED KERNELS WITH THEIR EXPRESSIONS, INCLUDING TUNABLE PARAMETERS p, σ > 0 AND c ≥ 0. THESE KERNELS ARE GROUPED IN TWO CATEGORIES: PROJECTIVE KERNELS AS FUNCTIONS OF ⟨x_i, x_j⟩, AND RADIAL KERNELS AS FUNCTIONS OF ‖x_i − x_j‖.

TABLE II
THE MOST KNOWN ENTROPIES AS SPECIAL CASES OF THE GENERALIZED RÉNYI ENTROPY.