Two dimensional histogram analysis using the Helmholtz principle

An algorithm for two dimensional histogram modal analysis is presented. A major challenge in two dimensional histogram analysis is to provide an accurate location and description of the extended modal shape. The approach presented in this paper combines the Fast Level Set Transform of the histogram and the Helmholtz principle to find the location and shape of the modes. Furthermore, the algorithm is devoid of any a priori assumptions about the underlying density or the number of modes. At the core, this approach is a new way to manage and search the number of regions that must be examined to identify meaningful sets. Computational issues required a new tail sum bound on the multinomial distribution to be stated and proven. This bound reduces to the Hoeffding inequality for the binomial distribution. The histogram segmentation procedure was applied to the two problems of image color segmentation and correlation pattern recognition. With no a priori knowledge about the color image assumed, the two dimensional modal analysis is applied to the CIELAB color space to find perceptually uniform dominant colors. The modal analysis is also extended to correlation pattern recognition to find multiple targets in a single correlation plane.


1. Introduction. Histogram modal analysis is a concise and flexible technique for identifying the global properties of large data sets. Despite a vast body of research, a definitive answer on how to locate the modes of a histogram has been elusive. Segmenting a histogram into its separate modes is difficult for several reasons: the number of modes is unknown, the number of candidate regions that must be examined to localize the modes is large, and the extent of a mode is often irregular. Delon et al. [9,10,11] successfully circumvented these limitations by developing a nonparametric approach for one dimensional histogram analysis based on the Helmholtz principle. This paper extends their ideas to create a robust two dimensional histogram analysis procedure that combines the Helmholtz principle with the topographic map of the Fast Level Set Transform (FLST) tree structure.
The modes of a multinomial 2-D histogram are complex because the location and number of local maxima are unknown. Furthermore, their geometrical shape is not necessarily simple but rather arbitrarily structured. This complexity has motivated many researchers to develop histogram modal analysis techniques. Nonparametric procedures are often preferred since these methods do not have embedded assumptions. The Expectation-Maximization (EM) algorithm, the mode tree of Minnotte and Scott [35], and the mean shift method are popular methods for mode seeking. These three procedures are briefly reviewed below.
McLachlan et al. provide a comprehensive introduction to both the EM algorithm [33] and finite mixture models [34]. The EM algorithm is a general method that maximizes successive approximations of the likelihood function. It alternates between a step that constructs the approximation (E step) and a step that maximizes it (M step). The EM algorithm works well in fitting mixture models and as a method to estimate unknown parameters, especially when they depend on latent or unobserved variables. It is appropriate for standard non-binned data, when the number of modes is known a priori and the data distribution can be described well by a mixture of density functions. When the number of modes is unknown and the modes are not sufficiently separated, estimating the number of modes and the unknown model parameters is an important problem [34]. The EM methodology may exhibit difficulties in such situations. These difficulties include overfitting or oversmoothing, which result from incomplete information about the large number of parameters that must be estimated in an unsupervised way. Other complications include local convergence, the repeated random starting locations required to overcome poor initialization, and the merging of two very close peaks into a single mode [34]. Moreover, fitting mixture densities to multivariate binned data intensifies the complexity of the EM iterations since analytical expressions must be replaced with numerical integration at each step [4].
Minnotte and Scott [35] discuss building a mode tree based on kernel density estimates as a method of visualizing the modal structure in a one dimensional data sample of size M. Kernel density estimators are usually radially symmetric with a single bandwidth parameter. The monograph of Silverman [42] and the histogram focused text of Scott [39] discuss multivariable kernel estimators. The survey by Scott and Sain [40] provides some additional focus on multi-dimensional density estimators. The construction of the mode tree exploits a rare property of the univariate Gaussian kernel that ensures modal continuity, a behavior described by Silverman [41]. The Gaussian kernel density estimates for a range of bandwidths reveal a splitting relationship between new modes and old modes. This behavior provides a graphical means to visualize the structure of an unknown data set. While the resulting mode tree is informative for 1-D histograms, visualizing the structure even for bivariate data is challenging.
The mean shift estimate of the gradient of a density function was developed by Fukunaga and Hostetler [24] with later convergence refinements by Cheng [6] and extensions to discrete data by Comaniciu and Meer [7]. It is a clever method to locate the modes without estimating the density. It locates modes by finding the stationary points of a kernel estimator. The analysis depends on the mean shift vector being proportional to the derivative of a kernel density estimator. When the directions of the mean shift and the gradient vector are aligned, the mean shift will always point toward a local maximum. It is a gradient ascent algorithm that avoids the step size pitfalls of traditional gradient based methods because, under mild conditions, the iterates converge while maintaining an adaptive step size. Artifacts induced by the termination thresholds must be eliminated by post-processing. In order to accommodate hidden multiple local maxima, Comaniciu and Meer recommend tessellating the entire feature space with the kernels. As with all kernel based methods, the choice of bandwidth is critical since an inappropriate choice can either merge peaks or generate false peaks [43].
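The mean shift iteration reviewed above can be illustrated with a minimal one dimensional sketch. This is not the procedure developed in this paper; the Gaussian kernel, the sample values, the bandwidth, and the function name `mean_shift_1d` are all assumptions chosen only for illustration.

```python
import math

def mean_shift_1d(samples, start, bandwidth, tol=1e-8, max_iter=500):
    """Iterate the Gaussian-kernel mean shift from `start` until the shift is below `tol`."""
    x = start
    for _ in range(max_iter):
        # Kernel weight of every sample relative to the current position.
        weights = [math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples]
        # The new position is the weighted mean of the samples (the adaptive step).
        new_x = sum(w * s for w, s in zip(weights, samples)) / sum(weights)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

# Hypothetical data with two well separated clusters.
samples = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
left_mode = mean_shift_1d(samples, start=0.0, bandwidth=0.5)
right_mode = mean_shift_1d(samples, start=6.0, bandwidth=0.5)
print(left_mode, right_mode)
```

The result depends on both the starting point and the bandwidth, which is the sensitivity the paragraph above describes.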
Unlike the methods described above, the Helmholtz principle can be applied to two dimensional binned histograms to develop a direct and fully automated modal detection strategy. The method described in this paper does not require repeated random starts or adjustable bandwidths to overcome incomplete information. Also, post-processing is not needed to adjust for truncation effects. Furthermore, the level sets and the associated graphical tree structure are a complete representation of a two dimensional histogram. Direct connectivity of the FLST sets guarantees that no false peaks are introduced and that the integrity of the modal shape geometry in a neighborhood of a peak is preserved, since no smoothing, overfitting, or underfitting is introduced. In fact, the estimated modal shape is not constrained by the kernel size or support near a peak; rather, its shape is a discrete set of bins that satisfy the Helmholtz selection criterion.
A form of the Helmholtz operational guidelines is summarized in the following quote: "the Helmholtz principle states that whenever some large deviation from randomness occurs, a structure is perceived" [20]. This principle has been developed and applied extensively to perceptual gestalt theory by Desolneux et al., who successfully found alignments, shapes and other important geometric structures in digital images [15,18,14,16,19]. Recently, using the Helmholtz principle, Delon et al. [11] constructed an automatic and parameter-free methodology for one dimensional histogram segmentation. This procedure consists of an initial screening step and a refinement step. Candidate modes are identified by merging histogram bins into meaningful intervals ordered by the number of false alarms (NFA). A maximal principle is then applied to refine the family of candidate modes into a collection of disjoint sets. The Helmholtz principle combined with a unimodal hypothesis provides a further improvement. This technique segments a one dimensional histogram into intervals that contain only one mode. It provides a complete segmentation of the histogram, and it also finds modes that are small compared to the largest mode. While their process segments the histogram, the unimodal construction developed in their work is not easily generalizable to two dimensional histograms.
A unimodal hypothesis for two dimensional histograms was not pursued in this work, but several of the features common with the work of Delon et al. are maintained. The maximal meaningful mode algorithm, presented in Section 3, consists of an initial screening step followed by further refinement steps. The initial screening step locates groups of bins in which the data concentrates. Such groups of bins are in accordance with a preconceived notion of a mode. The further refinement steps motivated by a maximal principle will locate disjoint sets where the data concentrates. The two dimensional domain of the histogram compounds the complexity of the algorithm when compared with the one dimensional case. This increased complexity introduces a merging problem which can be understood in the context of a maximal principle.
The Helmholtz principle has the benefit of being weakly dependent on the parameters. The dependence on the parameters is reduced by the application of a maximal principle to find disjoint groups of independent and identically distributed random variables that minimize the NFA. Meaningful groups that contain or are contained in other meaningful groups are common. For example, a large peak of the histogram may contain two or more separate peaks. It is then natural to consider whether the larger peak should be decomposed into the two smaller peaks, or whether the smaller peaks should be merged into the larger peak resulting in a single mode. A level set graphical structure that embeds the peaks in a containment tree structure is used in this work to define a maximal principle and address this merging problem.
The merging problem mentioned above in data grouping was tackled by Cao, Delon, Desolneux, Muse, and Sur [5]. Their method merges groups of data if the NFA for the merged group is less than the NFA for a pair of groups. This procedure requires the use of a trinomial since three outcomes are possible. In the analysis of two dimensional histograms, a countable number of outcomes is possible. The multinomial becomes an essential component of the analysis. The two dimensional nature of the problem and the need to use multinomial distributions introduce some difficulties.
The family of all possible merging candidates in a two dimensional histogram is combinatorially large. The number of merging candidates is reduced by imposing a data structure. Cao et al. use bounding rectangles to reduce their search space. In this work, this problem is approached by applying the Fast Level Set Transform to the histogram, thus extracting a graphical structure. When combined with the Hoeffding inequality, this methodology provides a merging and clustering framework for mode selection. Furthermore, the Fast Level Set Transform has the added benefit of detecting sets with arbitrary shapes [1,2,36,37].
The candidate modes are identified by using a maximal principle. The maximal principle merges meaningful histogram peaks into other meaningful peaks according to their NFA. Initially, the NFA is defined in terms of binomial and multinomial probability distributions, resulting in numerous calculations of multinomial tail sums. The repeated calculation of binomial and multinomial distributions is computationally expensive, and this computational complexity motivates a modification of the NFA definition. This modification uses the Hoeffding bound to define and merge meaningful peaks. A major contribution of this paper is a derivation of the necessary mathematical theory of a new multivariate Hoeffding inequality for the multinomial tail probabilities. This inequality generalizes Hoeffding's approximation for independent bounded univariate densities. With these new bounds, the Helmholtz principle, within the Cao et al. framework, can be extended to multidimensional histograms. The multinomial Hoeffding bound will be shown to have the following desirable properties: it is the minimization of an upper bound on the multinomial distribution, it inherits the conditional probability factoring of the multinomial distribution, and it has a natural relation to the mode of the multinomial distribution. Moreover, the new bound is always less than the standard binomial product upper bound of Mallows and Joag-Dev et al. Furthermore, when some typical parameter values are examined, the new bound is many orders of magnitude tighter [32,26].
The rest of the paper is organized as follows. The main line of reasoning is presented in Sections 2.1 and 2.2 and Section 3. Sections 2.1 and 2.2 review the Fast Level Set Transform and discuss the Helmholtz principle as applied to two dimensional histograms. The notion of a family of ε-meaningful modes in terms of the generalized Hoeffding bound is also defined. Section 3 continues by defining maximal ε-meaningful modes, and the maximal meaningful mode algorithm is presented. This algorithm decomposes a histogram into its maximal ε-meaningful modes. Sections 2.3, 2.4, and 4.1 provide supporting material. A new relative entropy upper bound approximation for multinomials is derived in Section 2.3. This upper bound reduces to the Chernoff-Hoeffding bound for the binomial distribution. Section 4 provides some insight into the maximal ε-meaningful mode definition, and it presents some theorems that decrease the computational burden. The first theorem in this section furnishes a new simple algebraic test that allows the direct comparison of two sets and predicts their comparative relative entropy ordering.
Figure 1. (Panels: Upper Tree, FLST Tree, Lower Tree.) The tree of connected shapes obtained through the lower level set transform, upper level set transform and FLST is shown. The root of the tree in all three cases is the entire image. All three tree structures are ordered by containment, but the upper tree and lower tree both allow shapes that contain holes. The FLST tree closes all the holes.
Experiments and applications are addressed in Section 5. The effect of binning on mode location is discussed, and the number of ε-meaningful modes found as a function of the parameter ε is explored. Two applications are presented, namely color segmentation and multiple target location in correlation plane pattern recognition. Additionally, the effectiveness of the approach is shown using some of the examples in the works of Delon et al. [10,11].

2. Preliminaries.
2.1. The fast level set transform. The image level set decomposition of Monasse [36] is a morphological image representation that readily adapts to two dimensional histogram analysis. Two essential features of this representation are exploited in the following. First, at a fixed gray level, this representation partitions the image into connected non-overlapping regions. Second, as the gray level changes, the connected components are ordered by a containment tree structure. This tree structure will be utilized in this work to guide the search for two dimensional histogram modes.
Consider the level sets of a continuous real valued function $h : \mathbb{R}^2 \to \mathbb{R}$. Let $\lambda \in \mathbb{R}$ and define the upper level set $X_\lambda$ and the lower level set $X^\lambda$ of $h$ as
$$X_\lambda = \{x \in \mathbb{R}^2 : h(x) \ge \lambda\}, \qquad X^\lambda = \{x \in \mathbb{R}^2 : h(x) \le \lambda\}. \tag{1}$$
Since $h$ is a function, it is clear that the level sets have a natural inclusion structure given by
$$X_{\lambda_1} \subseteq X_{\lambda_2} \quad \text{and} \quad X^{\lambda_2} \subseteq X^{\lambda_1} \qquad \text{for } \lambda_1 \ge \lambda_2. \tag{2}$$
Both the upper and lower level sets are contrast invariant morphological representations of an image. These representations decompose the image into distinct regions for fixed $\lambda$, but the level sets by themselves lack some useful topological information. Monasse's level set tree structure, the Fast Level Set Transform (FLST) tree, organizes the connection between the upper and lower level sets as well as the level set containment properties. The set containment topological information is encoded through the parent-child relationship.
The graph theoretic definition of a tree is inadequate to describe the topology of upper and lower level sets since a level set parent may have an infinite number of children. Monasse overcomes this difficulty by defining a tree structure using a partial order relation. The following definition is from [36].
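Definition 2.1. Let $\mathcal{E}$ be a family of sets and $\preceq$ a partial order relation on $\mathcal{E}$. We say that $\preceq$ induces a tree structure in $\mathcal{E}$ if the following two conditions hold: 1. $\exists R \in \mathcal{E}$, $\forall E \in \mathcal{E}$, $E \preceq R$; 2. $\forall A, B, C \in \mathcal{E}$, if $A \preceq B$ and $A \preceq C$, then $B$ and $C$ are comparable.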
A tree structure is guaranteed to have a root by the first condition, while the second condition prevents closed loops.
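For a finite family of sets ordered by inclusion, the two conditions of Definition 2.1 can be checked directly. The following is a minimal sketch under that finiteness assumption; the function name `induces_tree` and the example families are hypothetical and are not taken from [36].

```python
from itertools import combinations

def induces_tree(family):
    """Check both conditions of Definition 2.1 for a finite family ordered by inclusion."""
    sets = [frozenset(s) for s in family]
    # Condition 1: some member of the family contains every member (the root).
    has_root = any(all(s <= root for s in sets) for root in sets)

    def comparable(b, c):
        return b <= c or c <= b

    # Condition 2: any two members that contain a common member are comparable.
    no_loops = all(
        comparable(b, c)
        for a in sets
        for b, c in combinations([s for s in sets if a <= s], 2)
    )
    return has_root and no_loops

nested = [{1, 2, 3, 4}, {1, 2}, {3, 4}, {1}]          # siblings are disjoint
overlapping = [{1, 2, 3}, {1, 2}, {2, 3}, {2}]        # {2} has two incomparable parents
print(induces_tree(nested), induces_tree(overlapping))
```

The second family fails condition 2 because {2} is contained in both {1, 2} and {2, 3}, which are not comparable; this is the closed loop that the definition rules out.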
The tree structure in Definition 2.1 is sufficient to build upper level set and lower level set trees where the partial order relation is set containment. For example, the upper level set $X_\lambda$ can be represented as a union of connected component equivalence classes. The inclusion relationship in Equation 2 demonstrates that for $\lambda_1 \ge \lambda_2$ the connected components of $X_{\lambda_1}$ are contained in the connected components of $X_{\lambda_2}$. This inclusion relation on connected components is sufficient to satisfy condition 2 of Definition 2.1. Furthermore, if $h$ is bounded, then there exists a $\lambda < \infty$ such that $X_\lambda$ can serve as the root of an upper level set tree; thus both conditions for a tree structure are satisfied. Similar statements can be made about the creation of a lower level set tree.
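The containment of connected components across levels can be seen on a small discrete example. The sketch below assumes a histogram stored as a list of rows and uses a simple flood fill; the helper names and the sample array are illustrative assumptions, and the connectivity is left as a parameter.

```python
def upper_level_set(h, lam):
    """Pixels (row, col) of the 2-D list `h` whose value is at least `lam`."""
    return {(i, j) for i, row in enumerate(h) for j, v in enumerate(row) if v >= lam}

def components(pixels, connectivity=8):
    """Split a pixel set into connected components (4- or 8-connectivity)."""
    if connectivity == 8:
        steps = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
    else:
        steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    remaining, found = set(pixels), []
    while remaining:
        stack = [remaining.pop()]
        comp = set(stack)
        while stack:                      # flood fill from the seed pixel
            i, j = stack.pop()
            for di, dj in steps:
                nb = (i + di, j + dj)
                if nb in remaining:
                    remaining.remove(nb)
                    comp.add(nb)
                    stack.append(nb)
        found.append(frozenset(comp))
    return found

# Hypothetical histogram with two peaks sitting on a common plateau.
h = [
    [0, 0, 0, 0, 0],
    [0, 3, 1, 2, 0],
    [0, 1, 1, 2, 0],
    [0, 0, 0, 0, 0],
]
high = components(upper_level_set(h, 2))   # components at the higher level
low = components(upper_level_set(h, 1))    # components at the lower level
nested = all(any(c <= parent for parent in low) for c in high)
print(len(high), len(low), nested)
```

Each component at the higher level lies inside a component at the lower level, which is the parent-child relation used to build the upper level set tree.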
Both the upper and lower level sets are complete representations of an image, but there is redundant information if both level sets are used. These two representations have been combined into one tree structure by Monasse. A complete discussion of his construction is too detailed and would distract from the main line of reasoning, but a few comments are essential. Ignore rigor for a moment and consider a bounded connected component, $A$, of an upper level set. The set $\mathbb{R}^2 \setminus A$ can be decomposed into its connected components, and the bounded connected components of $\mathbb{R}^2 \setminus A$ are surrounded by the set $A$. These sets will be called holes. The set obtained by taking the union of $A$ with all of its holes forms a new set. Following Monasse's dissertation, these sets will be called shapes. The shapes derived from the upper and lower level sets can be partially ordered using set inclusion, and the domain of definition of $h$ can be used as the root of a tree structure. The tree structure obtained by this procedure is called the FLST tree structure. A rigorous approach to the building of this tree structure is given in Monasse's dissertation [36]. Figure 1 demonstrates the tree of shapes obtained from the lower level set, the upper level set, and the FLST, where lighter colors represent larger gray level values. All three constructions are ordered by set containment, but the FLST shapes do not have any holes.
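The step from a connected component to a shape can be sketched on a finite grid. In this illustration a hole is taken to be any part of the complement that cannot reach the border of the grid, which is a discrete stand-in for the bounded components of the complement; the function name `fill_holes` and the 4-connected treatment of the complement are assumptions, not Monasse's algorithm.

```python
def fill_holes(component, n_rows, n_cols):
    """Return the component together with its holes on an n_rows x n_cols grid.

    The complement is flooded from the grid border using 4-connectivity;
    whatever the flood cannot reach is either the component or one of its holes.
    """
    stack = [
        (i, j)
        for i in range(n_rows)
        for j in range(n_cols)
        if (i in (0, n_rows - 1) or j in (0, n_cols - 1)) and (i, j) not in component
    ]
    outside = set(stack)
    while stack:
        i, j = stack.pop()
        for nb in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            inside_grid = 0 <= nb[0] < n_rows and 0 <= nb[1] < n_cols
            if inside_grid and nb not in component and nb not in outside:
                outside.add(nb)
                stack.append(nb)
    all_pixels = {(i, j) for i in range(n_rows) for j in range(n_cols)}
    return all_pixels - outside

# A ring of pixels surrounding the single pixel (2, 2).
ring = {(1, 1), (1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2), (3, 3)}
shape = fill_holes(ring, 5, 5)
print(sorted(shape - ring))
```

The ring has one hole, so the resulting shape is the full 3 by 3 block.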
The FLST tree structures for functions defined on a continuous domain can be constructed using the procedure outlined above, but images and histograms are defined on discrete domains. An image or histogram can be extended to the continuous domain using interpolation methods, and the interpolation method has consequences in the FLST tree construction. The interpolation techniques used by Monasse result in an upper semicontinuous function, and this interpolation scheme produces a simple relationship between connected sets in the discrete and continuous domains. In particular, upper level sets are eight connected while lower level sets are four connected.
There are many details not included in this discussion necessary for a rigorous approach to building the tree of shapes. The delicate aspect of the algorithm is deciding the connectedness of the shape and determining if there are holes in the shape. Rosenfeld showed that the number of holes can be counted using information derived from a pixel's neighborhood [38]. This local knowledge is sufficient to determine the number of holes after the whole region is grown. The thesis by Monasse discusses this issue and provides a rigorous development of the topic and an effective algorithm [36].

2.2. Meaningful modes. The Helmholtz principle states that a group of features is "meaningful" whenever, given an a-contrario model, the group of features is unlikely to occur due to the a-contrario model. The works of Cao, Delon, Desolneux, Morel, Muse, and Sur [5,9,10,11,12,13,17,15,18,20,14,16,19,21] develop a mathematical procedure to find perceptual features based on the Helmholtz principle. Their implementation of the Helmholtz principle requires three coupled ingredients. First, an a-contrario background probability model must be defined. This background model can be given or learned from the data. Second, a procedure to group independent and identically distributed (IID) random variables must be given. The background model and the grouping procedure must work in conjunction to describe the absence of the feature of interest. Third, a function mapping the group of IID random variables to the real numbers, the NFA, must be defined. The NFA must be an upper bound on the expected number of feature groups that would occur due to the a-contrario model. The NFA must also scale with the sample size and the grouping procedure. The development of these three ingredients for 2-D histogram analysis is discussed in more detail below.
A sample space, a set of events, and a probability function are the essential elements needed to describe the a-contrario probability model. The definition of such a space without any reference to the experiment it describes obscures the relationships between the probability model, the grouping procedure, and the NFA. For this reason, a constructive approach will be used to develop the probability model. The classic probability model of randomly placing $M$ indistinguishable objects (balls) into $N$ bins is a natural model for two dimensional histogram creation. If each ball is equally likely to be placed in any bin, then the probability of a ball ending up in a given bin is $1/N$. Since the balls are indistinguishable, the simple events correspond to the $N$-tuples $(n_1, n_2, \ldots, n_N)$ where $n_i$ is the number of balls in the $i$th bin. The model will be completed by deriving the probabilities for the compound events corresponding to the grouping procedure.
The grouping procedure must work in conjunction with the a-contrario model to describe a feature's absence. Fundamentally, the grouping procedure identifies compound events that accurately characterize the feature. The features of interest in this work are modes of a histogram, and, as stated in the introduction, the modes of a histogram can be qualitatively understood as connected sets in the histogram's domain where the data concentrates. This qualitative notion of modes suggests the following compound event. Consider a histogram $h : \mathcal{N} \to \mathbb{Z}$, where $\mathcal{N}$ is the histogram domain consisting of $N$ bins. Let $I \subset \mathcal{N}$ be a connected set, and define
$$k(I) = \sum_{x \in I} h(x).$$
The set function $k(I)$ is the number of objects in $I$. The notation $|I|$ will represent the number of elements in the set $I$. The probability that at least $k(I)$ elements are in the connected region $I$ is given by the binomial tail
$$\mathcal{B}(M, k(I), p(I)) = \sum_{j=k(I)}^{M} \binom{M}{j} p(I)^{j} \bigl(1 - p(I)\bigr)^{M-j},$$
where $p(I) = |I|/N$. This model is a Bernoulli scheme with $M$ trials and a probability of success $p(I)$.
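The quantities k(I), p(I), and the binomial tail can be computed directly for a small histogram. The sketch below is only an illustration: the 4 by 4 histogram, the candidate region, and the function name `binomial_tail` are hypothetical.

```python
from math import comb

def binomial_tail(M, k, p):
    """P(X >= k) for X ~ Binomial(M, p), summed term by term."""
    return sum(comb(M, j) * p**j * (1 - p) ** (M - j) for j in range(k, M + 1))

# Hypothetical 4x4 histogram (N = 16 bins) holding M = 40 samples.
hist = [
    [1, 2, 1, 0],
    [2, 9, 8, 1],
    [1, 7, 6, 0],
    [0, 1, 1, 0],
]
N = sum(len(row) for row in hist)            # number of bins
M = sum(map(sum, hist))                      # number of samples
region = {(1, 1), (1, 2), (2, 1), (2, 2)}    # candidate connected set I
k_I = sum(hist[i][j] for i, j in region)     # k(I): objects observed in I
p_I = len(region) / N                        # p(I) = |I| / N under the uniform model
tail = binomial_tail(M, k_I, p_I)
print(k_I, p_I, tail)
```

The central block holds 30 of the 40 samples while covering a quarter of the bins, so the tail probability under the uniform a-contrario model is very small.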
Briefly, let $\mathcal{R}$ denote the family of all connected sets in the histogram domain $\mathcal{N}$. This definition of $\mathcal{R}$ will be modified in the sequel. Let $|\mathcal{R}|$ equal the number of elements in $\mathcal{R}$. The number of false alarms (NFA) of the set $I \in \mathcal{R}$ is given by the following definition.
This definition leads directly to the notion of an ε-meaningful mode.
When ε = 1 the set $I$ will be called a meaningful mode. The condition $k(I)/M > p(I)$ is present since a mode should contain more elements than is expected.
A driving force for mode detection using the Helmholtz principle is a natural interpretation of the NFA. The NFA serves as an upper bound on the expected number of occurrences for the set $I \in \mathcal{R}$ given the a-contrario model. Furthermore, the $|\mathcal{R}|$ scale factor implies that the NFA scales appropriately with the number of histogram bins. This statement is made precise in the following theorem proven in the appendix.
Theorem 2.4. The expected number of ε-meaningful modes is less than or equal to ε.
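The above definition of the NFA suffers from two computational difficulties. First, the binomial tail is computationally expensive to calculate. This difficulty is mitigated through the use of the tight Hoeffding upper bound on the binomial tail [22]. For $r(I) \ge p(I)$, this bound is given by
$$\mathcal{B}(M, k(I), p(I)) \le \exp\left(-M\left[r(I)\ln\frac{r(I)}{p(I)} + \bigl(1 - r(I)\bigr)\ln\frac{1 - r(I)}{1 - p(I)}\right]\right),$$
where $r(I) = k(I)/M$. Define the function
$$H_1(r, p) = r\ln\frac{r}{p} + (1 - r)\ln\frac{1 - r}{1 - p}.$$
The Hoeffding inequality motivates the following definition.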
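A short numerical comparison shows how the relative entropy form of the Hoeffding bound tracks the exact binomial tail. This is a sketch under stated assumptions: the values of M, k, and p are hypothetical, `n_regions` stands in for |R|, and the NFA is taken to be |R| times the tail (or its bound), following the scale factor described in the text.

```python
from math import comb, exp, log

def binomial_tail(M, k, p):
    """Exact P(X >= k) for X ~ Binomial(M, p)."""
    return sum(comb(M, j) * p**j * (1 - p) ** (M - j) for j in range(k, M + 1))

def H1(r, p):
    """Binary relative entropy r ln(r/p) + (1-r) ln((1-r)/(1-p)), with 0 ln 0 = 0."""
    total = 0.0
    if r > 0:
        total += r * log(r / p)
    if r < 1:
        total += (1 - r) * log((1 - r) / (1 - p))
    return total

def hoeffding_bound(M, k, p):
    """Upper bound exp(-M * H1(k/M, p)) on the binomial tail, valid when k/M >= p."""
    return exp(-M * H1(k / M, p))

M, k, p = 40, 30, 0.25        # hypothetical sample size, count, and region probability
n_regions = 500               # hypothetical |R|
exact = binomial_tail(M, k, p)
bound = hoeffding_bound(M, k, p)
nfa_exact = n_regions * exact  # assumed NFA: |R| times the binomial tail
nfa_bound = n_regions * bound  # assumed NFA with the tail replaced by its bound
print(exact, bound, nfa_bound)
```

Because the bound dominates the tail, a region declared meaningful with the bounded NFA is also meaningful with the exact one, at the cost of a single logarithm evaluation instead of a sum of binomial terms.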
The second difficulty with Definition 2.2 is subtler, and it requires the definition of $\mathcal{R}$ to be changed. A quick computation reveals that the number of connected sets in the histogram's domain is large for large $N$. Assume that the histogram has $L_1$ rows and $L_2$ columns; then the number of rectangular regions in the histogram is $L_1 L_2 (L_1 + 1)(L_2 + 1)/4$. The number of connected regions is substantially greater. An exhaustive search of a histogram with $L_1 = L_2 = 100$ would therefore examine at least $L_1^2 (L_1 + 1)^2/4 \approx 2.5 \times 10^7$ regions. Performing all of these computations would be prohibitively time consuming. This computational dilemma is mitigated through the use of the FLST tree.
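The rectangle count quoted above can be checked against a brute force enumeration on a small grid. The function names below are illustrative.

```python
def count_rectangles_formula(L1, L2):
    """Closed form L1 * L2 * (L1 + 1) * (L2 + 1) / 4 for axis-aligned rectangles."""
    return L1 * L2 * (L1 + 1) * (L2 + 1) // 4

def count_rectangles_brute(L1, L2):
    """Enumerate every rectangle by its top/bottom rows and left/right columns."""
    return sum(
        1
        for top in range(L1)
        for bottom in range(top, L1)
        for left in range(L2)
        for right in range(left, L2)
    )

print(count_rectangles_brute(3, 4), count_rectangles_formula(3, 4))
print(count_rectangles_formula(100, 100))
```

A rectangle is fixed by choosing an ordered pair of rows and an ordered pair of columns, which gives L1(L1 + 1)/2 times L2(L2 + 1)/2 choices.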
As shown in Section 2.1, the FLST constructs a topographic map that combines the interpolated histogram's upper and lower level sets. The idea is to search only the histogram's FLST shapes. In effect, the FLST guides the search for meaningful modes. There are no known estimates of the number of shapes in an FLST tree, but in all experiments performed by the authors, a 100 × 100 histogram contains fewer than 500 shapes. This is several orders of magnitude less than the above estimate. The FLST tree, however, does not consist of arbitrary connected sets, but rather of connected sets without holes. The definition of $\mathcal{R}$ needs to be modified to take this change into account. Hereafter and in Definitions 2.2, 2.3, and 2.5, $\mathcal{R}$ will be the set of all connected sets without holes.
The preceding Helmholtz principle concepts can be extended to the case of Bernoulli trials with more than two outcomes, and this extension will play an essential role in Section 3. The following notational conventions will be used. The 1-norm of a non-negative $q$-tuple $w$ will be denoted by $|w| = w_1 + w_2 + \cdots + w_q$. Recall that the number of ways in which a population of $M$ elements can be divided into $q + 1$ subpopulations is given by the multinomial coefficient
$$\binom{M}{k_1, \ldots, k_q, M - |k|} = \frac{M!}{k_1!\, k_2! \cdots k_q!\, (M - |k|)!},$$
where $k = (k_1, k_2, \ldots, k_q)$ is a $q$-tuple of non-negative integers subject to the condition $|k| < M$.
Let $\{G_i \in \mathcal{R} \mid i = 1, 2, \ldots, q;\ G_i \cap G_j = \emptyset \text{ if } i \ne j\}$ be $q$ mutually disjoint sets; then the set functions $k_i = k(G_i)$ and $p_i = p(G_i)$ are well defined. Furthermore, the $q$-tuples $k = (k_1, k_2, \ldots, k_q)$ and $p = (p_1, p_2, \ldots, p_q)$ satisfy $|k| < M$ and, because the sets are disjoint, $|p| \le 1$. The probability that in $M$ trials there are $k_1$ elements in $G_1$, $k_2$ elements in $G_2$, etc., is the multinomial distribution
$$\binom{M}{k_1, \ldots, k_q, M - |k|} \prod_{i=1}^{q} p_i^{k_i}\, (1 - |p|)^{M - |k|};$$
thus the probability of at least $k_1$ elements in $G_1$, $k_2$ elements in $G_2$, etc., is given by the multinomial tail
$$\mathcal{M}(M, k, p) = \sum_{\substack{x_1 \ge k_1, \ldots, x_q \ge k_q \\ |x| \le M}} \binom{M}{x_1, \ldots, x_q, M - |x|} \prod_{i=1}^{q} p_i^{x_i}\, (1 - |p|)^{M - |x|}.$$
Whenever $k$ and $p$ are 1-tuples, the multinomial tail $\mathcal{M}(M, k, p)$ is equal to the binomial tail $\mathcal{B}(M, k, p)$. The will be used to simplify notation. Also, to limit confusion, the family of sets $\{G_i\}_{i=1}^{q}$ will be called a $q$-family of disjoint sets whenever $q$ is a fixed number. In other words, every $q$-family of disjoint sets must consist of exactly $q$ mutually disjoint non-empty sets. The multinomial tail motivates the following definition of the number of false alarms for a $q$-family of disjoint sets.
Definition 2.6. Let $\{G_i\}_{i=1}^{q}$ be a $q$-family of disjoint sets, $p(G_i) = |G_i|/N$, and let $k(G_i)$ be the observed number of objects in $G_i$. Suppose $k = (k(G_1), \ldots, k(G_q))$ and $p = (p(G_1), \ldots, p(G_q))$; then the number of false alarms of $\{G_i\}_{i=1}^{q}$ is
$$\mathrm{NFA}\bigl(\{G_i\}_{i=1}^{q}\bigr) = (M + 1)^{q-1} \binom{|\mathcal{R}|}{q}\, \mathcal{M}(M, k, p).$$
This definition is the same as Definition 2.2 when $q = 1$. A couple of comments on the scale factor are required. The $(M + 1)^{q-1}$ term arises because the multinomial is a multi-dimensional probability distribution. The proofs in the Appendix elucidate its presence. The multiplicative factor $\binom{|\mathcal{R}|}{q}$ is the number of $q$-families of sets, not necessarily disjoint, chosen from $\mathcal{R}$. This term overcounts the total possible number of $q$-families of disjoint sets. The NFA, however, is still an upper bound on the expected number of false alarms, and the experiments in Section 5 also indicate that this overcounting is not critical.
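For small M the multinomial tail can be evaluated by direct enumeration, which also makes the reduction to the binomial tail for q = 1 easy to verify. The sketch below is illustrative only: the parameter values and the function names `multinomial_pmf` and `multinomial_tail` are assumptions, and the enumeration is far too slow for realistic histograms.

```python
from itertools import product
from math import comb, factorial

def multinomial_pmf(M, x, p):
    """P(X_1 = x_1, ..., X_q = x_q); the other M - |x| trials fall outside all G_i."""
    rest = M - sum(x)
    coef = factorial(M) // factorial(rest)
    for xi in x:
        coef //= factorial(xi)
    prob = coef * (1 - sum(p)) ** rest
    for xi, pi in zip(x, p):
        prob *= pi**xi
    return prob

def multinomial_tail(M, k, p):
    """P(X_1 >= k_1, ..., X_q >= k_q) by summing the pmf over all admissible counts."""
    ranges = [range(ki, M + 1) for ki in k]
    return sum(multinomial_pmf(M, x, p) for x in product(*ranges) if sum(x) <= M)

M = 12                                   # hypothetical number of trials
k, p = (5, 4), (0.2, 0.25)               # hypothetical counts and region probabilities
tail = multinomial_tail(M, k, p)
single = multinomial_tail(M, (5,), (0.2,))
binom = sum(comb(M, j) * 0.2**j * 0.8 ** (M - j) for j in range(5, M + 1))
print(tail, single, binom)
```

Requiring both regions to be overpopulated at once is a stricter event than requiring it of one region, so the two-set tail is smaller than the single-set tail.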
The definition of an ε-meaningful mode follows mutatis mutandis for an ε-meaningful $q$-family of disjoint sets.
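Definition 2.7. Let $\{G_i\}_{i=1}^{q}$ be a $q$-family of disjoint sets, $p(G_i) = |G_i|/N$, and let $k(G_i)$ be the observed number of objects in $G_i$. Suppose $r = k/M$; then $\{G_i\}_{i=1}^{q}$ is an ε-meaningful family of modes if and only if $p < r$ and $\mathrm{NFA}\bigl(\{G_i\}_{i=1}^{q}\bigr) < \varepsilon$.
The following theorem, of which Theorem 2.4 is a special case, will be proven in the appendix.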
Theorem 2.8. The expected number of ε-meaningful $q$-families of disjoint sets is less than ε.
Multinomial tail calculations are more computationally expensive than their binomial equivalents. For computational ease, the NFA of a $q$-family of sets is given below in terms of an apparently new generalized Hoeffding bound for the multinomial. This upper bound can be rewritten as the exponential of a relative entropy function, and using the definition $r = k/M$ it is given by
$$\mathcal{M}(M, k, p) \le \exp\bigl(-M H_q(r, p)\bigr), \qquad H_q(r, p) = \sum_{i=1}^{q} r_i \ln\frac{r_i}{p_i} + (1 - |r|)\ln\frac{1 - |r|}{1 - |p|}.$$
The subscript $q$ of a subscripted function (e.g. $H_1(r_1, p_1)$, $H_2(r_1, r_2, p_1, p_2)$) identifies a particular $q$-state relative entropy function. Two qualities of the function $H_q(r, p)$ are worth noting. By the log sum inequality [8], the functions $H_q(r, p)$ are jointly convex in both variables. The functions $H_q(r, p)$ for $q > 1$ are symmetric under the interchange action (e.g. $H_2(r_1, r_2, p_1, p_2) = H_2(r_2, r_1, p_2, p_1)$), while $H_1(r_1, p_1)$ is symmetric about 1/2. The proof of this upper bound is deferred to Section 2.3.
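The relative entropy form of the bound and the two symmetry properties can be checked numerically on a small case. The sketch below assumes the q-state relative entropy is the sum of r_i ln(r_i/p_i) over the q cells plus the matching term for the leftover cell; the parameter values and function names are hypothetical, and the exact tail is obtained by brute force enumeration.

```python
from itertools import product
from math import exp, factorial, log

def relative_entropy(r, p):
    """H_q(r, p): sum of r_i ln(r_i/p_i) over the q cells and the leftover cell (0 ln 0 = 0)."""
    terms = list(zip(r, p)) + [(1 - sum(r), 1 - sum(p))]
    return sum(ri * log(ri / pi) for ri, pi in terms if ri > 0)

def multinomial_tail(M, k, p):
    """Exact P(X_i >= k_i for all i) by enumeration; practical only for small M."""
    total = 0.0
    for x in product(*(range(ki, M + 1) for ki in k)):
        rest = M - sum(x)
        if rest < 0:
            continue
        coef = factorial(M) // factorial(rest)
        for xi in x:
            coef //= factorial(xi)
        prob = coef * (1 - sum(p)) ** rest
        for xi, pi in zip(x, p):
            prob *= pi**xi
        total += prob
    return total

M, k, p = 12, (5, 4), (0.2, 0.25)            # hypothetical values with p < k/M
r = tuple(ki / M for ki in k)
bound = exp(-M * relative_entropy(r, p))     # generalized Hoeffding bound
exact = multinomial_tail(M, k, p)
print(exact, bound)

# Interchange symmetry for q = 2 and symmetry about 1/2 for q = 1.
swap_gap = relative_entropy((0.3, 0.2), (0.1, 0.15)) - relative_entropy((0.2, 0.3), (0.15, 0.1))
flip_gap = relative_entropy((0.3,), (0.1,)) - relative_entropy((0.7,), (0.9,))
```

Evaluating the bound costs q + 1 logarithms, whereas the exact tail requires a sum whose size grows rapidly with M and q.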
The following definition is a generalization of Definition 2.5.
Definition 2.9. Let $\{G_i\}_{i=1}^{q}$ be a $q$-family of disjoint sets, $p(G_i) = |G_i|/N$, and let $k(G_i)$ be the observed number of objects in $G_i$. Suppose $r = k/M$; then the number of false alarms of $\{G_i\}_{i=1}^{q}$ is
$$\mathrm{NFA}\bigl(\{G_i\}_{i=1}^{q}\bigr) = (M + 1)^{q-1} \binom{|\mathcal{R}|}{q} \exp\bigl(-M H_q(r, p)\bigr).$$
This definition is used in all subsequent analysis. The Helmholtz principle has been formulated above for a single set and for a $q$-family of disjoint sets. Also, as mentioned in the text, the $q$-family formulation reduces to the single set formulation when $q = 1$. As a consequence, there is a redundancy between many of the definitions and theorems. This redundancy is intentional since the NFA of a single set will play a different role than the NFA of a $q$-family of disjoint sets. The different roles will be discussed more thoroughly in Section 3 when the notion of a maximal ε-meaningful mode is given. A brief discussion of potential algorithms to find modes using the Helmholtz principle will highlight the rationale behind this separation.
A direct application of the Helmholtz principle would calculate the family of disjoint sets $\{G_i\}_{i=1}^{q}$ that minimizes the NFA. A simple estimate will highlight the impracticality of such a procedure. The number of $q$-families of sets, not necessarily disjoint, is given by the binomial term $\binom{|\mathcal{R}|}{q}$. The sum of this term over $1 \le q \le M$ is $2^{|\mathcal{R}|}$; thus a simple estimate of the total number of families of sets to search is given by substituting the number of rectangular regions $L_1 L_2 (L_1 + 1)(L_2 + 1)/4$ for $|\mathcal{R}|$. A histogram consisting of $N = 25$ bins requires approximately $1.0863 \times 10^{47}$ searches. Performing such a search is an impractical task. This estimate is a rough approximation since the term $\binom{|\mathcal{R}|}{q}$ is an upper bound on the number of sets while the number of rectangular regions in $\mathcal{R}$ is a lower bound. A more accurate estimate, however, would not make this procedure practical. Instead of searching all possible sets, the topological tree structure created using the FLST will be utilized to guide the search for meaningful modes. The details of this methodology will be presented in Section 3.
In the next two sections, Sections 2.3 and 2.4, a proof of the generalized Hoeffding upper bound of equation 20 is given and a relation between the ordering of $H_1(r, p)$ and $H_q(r, p)$ is discussed. This relation will be used to simplify the algorithms, but both sections can be omitted on a first reading.

2.3. Generalizing the Hoeffding upper bound.
A common requirement throughout the applications of Desolneux et al. is the repeated computation of tail probabilities for the binomial and trinomial distributions [15,18,14,16,19]. For the binomial distribution, the Chernoff-Hoeffding inequality provides a reasonably tight and mathematically useful relative entropy upper bound, but the authors are unaware of a non-asymptotic Hoeffding-type bound for the multinomial distribution. Such a bound simplifies all computations.
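For the binomial case, the Chernoff-Hoeffding relative entropy bound states that $P[X \ge k] \le \exp(-M H_1(r, p))$ with $r = k/M$ and $p < r < 1$, where $H_1(r, p) = r \log(r/p) + (1 - r)\log((1 - r)/(1 - p))$. The sketch below compares this bound with the exact tail sum for one set of illustrative values; the function names and the numerical values are assumptions chosen for the example.

```python
import math


def binomial_tail(M: int, k: int, p: float) -> float:
    """Exact tail probability P[X >= k] for X ~ Binomial(M, p)."""
    return sum(math.comb(M, j) * p**j * (1 - p) ** (M - j) for j in range(k, M + 1))


def relative_entropy(r: float, p: float) -> float:
    """One-state relative entropy H_1(r, p), assuming 0 < p < 1 and 0 < r < 1."""
    return r * math.log(r / p) + (1 - r) * math.log((1 - r) / (1 - p))


def hoeffding_bound(M: int, k: int, p: float) -> float:
    """Chernoff-Hoeffding bound exp(-M H_1(k/M, p)), valid for p < k/M < 1."""
    return math.exp(-M * relative_entropy(k / M, p))


if __name__ == "__main__":
    M, k, p = 100, 30, 0.2  # illustrative values only
    print("exact tail     :", binomial_tail(M, k, p))
    print("Hoeffding bound:", hoeffding_bound(M, k, p))
```

The bound needs one logarithm evaluation per set, whereas the exact tail requires a sum over up to M terms.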
The following theorems show that the ratio bounds U(r, p) are upper bounds to $\mathcal{M}(M, k, p)$, and that the choice r = k/M minimizes this upper bound.
Theorem 2.10. Let M denote the number of trials, and suppose r and p are q-tuples. If 0 < p ≤ r and |r| < 1, then $\mathcal{M}(M, k, p) \le U(r, p)$, and $r^* = k/M$ minimizes U(r, p).
A simple calculus exercise demonstrates that this equation is an increasing function in p 1 , p 2 , . . . , p m because the exponents x 1 − k 1 , . . . , x m − k m and |x| − |k| are all nonnegative. This observation combined with the inequality p ≤ r motivates the second relation in the following set of equations: The last two relations are true since M(M, k, r) is the tail of the multinomial distribution.
The minimal value is derived by noticing that the multinomial ratio bound holds for any $r \in [0, 1]^q$ such that p ≤ r. It is a simple calculus exercise to show that the gradient of U(r, p) with respect to r is zero at $r^* = k/M$. Moreover, for fixed p the map r → U(r, p) is convex in r. Thus the global minimizer of U(r, p) occurs at $r^* = k/M$. The generalized Hoeffding bound is much tighter than the Joag-Dev/Proschan product bound typically used for multinomial random variables. Their product bound is appropriate since the random variables of a multinomial distribution are negatively associated; therefore, for the q-tuples p and r that satisfy p < r, this bound can be written as [26]

$$\mathcal{M}(M, k, p) \le \prod_{i=1}^{q} P[X_i \ge k_i].$$

Combining the Hoeffding inequality and the previous equation yields the result

$$\mathcal{M}(M, k, p) \le \prod_{i=1}^{q} \exp\bigl(-M H_1(r_i, p_i)\bigr).$$

The next theorem shows that the ratio bound U(r, p) is less than or equal to this product bound for a multinomial random variable.
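The comparison between the ratio bound and the product bound can be illustrated numerically for a trinomial (q = 2). The sketch below assumes the explicit form $U(r, p) = \prod_i (p_i/r_i)^{k_i} \bigl((1 - |p|)/(1 - |r|)\bigr)^{M - |k|}$, which follows from the monotonicity argument in the proof of Theorem 2.10 but is a reconstruction rather than a formula quoted from the text. Function names and numerical values are likewise assumptions.

```python
import math


def trinomial_tail(M, k, p):
    """Exact P[X1 >= k1, X2 >= k2] for a multinomial with q = 2 sets plus the rest."""
    (k1, k2), (p1, p2) = k, p
    p0 = 1.0 - p1 - p2
    total = 0.0
    for x1 in range(k1, M + 1):
        for x2 in range(k2, M - x1 + 1):
            x0 = M - x1 - x2
            coeff = math.factorial(M) // (
                math.factorial(x1) * math.factorial(x2) * math.factorial(x0)
            )
            total += coeff * p1**x1 * p2**x2 * p0**x0
    return total


def ratio_bound(M, k, p, r=None):
    """Assumed form of U(r, p); r defaults to the minimizer k / M."""
    if r is None:
        r = tuple(ki / M for ki in k)
    value = ((1.0 - sum(p)) / (1.0 - sum(r))) ** (M - sum(k))
    for ki, pi, ri in zip(k, p, r):
        value *= (pi / ri) ** ki
    return value


def product_bound(M, k, p):
    """Product of one-dimensional Hoeffding bounds exp(-M H_1(r_i, p_i))."""
    value = 1.0
    for ki, pi in zip(k, p):
        r = ki / M
        h1 = r * math.log(r / pi) + (1 - r) * math.log((1 - r) / (1 - pi))
        value *= math.exp(-M * h1)
    return value


if __name__ == "__main__":
    M, k, p = 20, (7, 9), (0.2, 0.3)  # illustrative values with p < k / M
    print("exact tail   :", trinomial_tail(M, k, p))
    print("ratio bound  :", ratio_bound(M, k, p))
    print("product bound:", product_bound(M, k, p))
```

For these values the exact tail lies below the ratio bound, which in turn lies below the product bound, in agreement with the ordering stated in the text.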
Theorem 2.11. Suppose M is the number of trials, p and k are q-tuples, and define r = k/M . Assume 0 < p ≤ r and |r| < 1 then and the inequality is strict when p i < r i for some i.
Proof. Define The function f must be shown to be an increasing function of p. This is equivalent The same argument shows that ∂g/∂p i ≥ 0 for i > 1. The inequality is strict if any i is greater than zero; therefore the inequality is strict if p i < r i for some i. Table I shows that the standard product bounds can be several orders of magnitude less accurate than the true tail sum bound. In contrast, the ratio bounds are approximately within a factor of ten.

2.4.
Family of sets. Just as the multinomial distribution can be factored into a product of a binomial and a conditional multinomial distribution, the ratio bound can be factored, resulting in a recursive sum of relative entropies [27,28]; therefore the ratio bound preserves the binomial factoring. This factoring helps demonstrate that the entire family of subsets almost always has greater relative entropy than a smaller collection of the sets. This observation is made more precise below, and it greatly reduces the number of tests needed to find the maximal ε-meaningful modes defined in Section 3.
Theorem 2.12. Suppose M is the number of trials, p and k are q-tuples, and define r = k/M. Let $\tilde{r} = (r_1, r_2, \ldots, r_{q-1})$ and $\tilde{p} = (p_1, p_2, \ldots, p_{q-1})$ denote the (q − 1)-tuples obtained by deleting the q-th component. If 0 < p < r and |r| < 1, then

Proof. The ratio bound can be factored as follows Using the rules of logarithms, these factors can be converted to yield the relative entropy sum.
The following theorem, which will be utilized in the next section, explores the relationship between the ordering of two relative entropy functions and the NFA.
The last two theorems illustrate that removing one set from the q-family $\{G_i\}_{i=1}^{q}$ of disjoint sets almost always results in an increase in the NFA in practical histogram applications. Theorem 2.13 asserts that the NFA ordering in Equation 37 is almost always implied by the ordering between $H_q(r, p)$ and $H_{q-1}(\tilde{r}, \tilde{p})$ for large M. This claim can be further understood by noting that the left hand side of Equation 36 depends on M and |R| while the right hand side does not. Moreover, |R| is a function of the histogram domain and does not depend on the number of samples M. As shown in Section 3.1, an upper bound on |R| in terms of the number of bins N is $|R| < 2^N$. As a consequence, the left hand side of Equation 36 decreases with increasing M faster than cN/M for large M and some constant c. Furthermore, in a typical histogram application, there will be many more samples than there are bins; thus N/M ≪ 1.
Theorem 2.12 demonstrates that the difference $H_q(r, p) - H_{q-1}(\tilde{r}, \tilde{p})$ is a positive quantity. Combining the last two observations indicates that removing one set almost always increases the NFA for M ≫ N. This observation will be employed to simplify the algorithm in Section 3.1.
3. Maximal ε-meaningful modes. The collection of all ε-meaningful modes typically contains many sets that overlap, but, to maintain coherence with the notion of a mode, a reduced set of disjoint ε-meaningful modes needs to be identified. This issue of overlapping ε-meaningful sets occurs in all previous applications of the Helmholtz principle. It is typically resolved using a maximal ε-meaningful containment condition. This containment condition locates disjoint ε-meaningful sets that are not contained in, and do not contain, sets with a lower NFA. The meaningful grouping procedure of Cao et al. [5], however, added a merging criterion to verify whether two ε-meaningful sets should be merged into one or kept separate. The discussion in [5] demonstrates the necessity of this added test, and the same conditions can exist in the algorithm presented below. For this reason, a maximal ε-meaningful mode will have to satisfy the containment conditions as well as a merging criterion. Before the maximal ε-meaningful definition is given, however, a short discussion of the interplay between the FLST tree structure and the notion of ε-meaningful is presented.
Throughout all applications of the maximal principle to image analysis, events are ordered using the NFA. A clear interpretation exists for this ordering: given that the image follows the a-contrario hypothesis model, the lower the NFA of an event, the smaller the expected number of such events. As mentioned in Section 2.2, a search for a disjoint set of modes with a minimal NFA is computationally unreasonable. For this reason, the FLST tree structure is used to guide the search for maximal ε-meaningful modes. The following two definitions will simplify the discussion.
Definition 3.1. Let I ∈ R be a FLST tree structure node. The FLST tree structure node G ∈ R is a descendant of I if the shortest path from G to the root crosses I.

Definition 3.2. Let I ∈ R be a FLST tree structure node. The FLST tree structure node G ∈ R is an ascendant of I if the shortest path from I to the root crosses G.

Section 2.1 notes that two aspects of the FLST tree structure will be utilized. To reiterate, at a fixed gray level, the FLST tree structure partitions the image into connected non-overlapping regions, and as the gray level changes, the connected components are ordered in a containment tree structure. These two observations indicate that the maximal ε-meaningful containment conditions can be met by comparing the NFA of a set I with the NFA of its ascendants and descendants. The merging criterion poses a more complex problem.
Many solutions to the merging problem that exploit the FLST tree structure exist. The solution presented below is a compromise between computational efficiency and completeness. The most obvious solution would be to consider all possible collections of descendants. A node would then be labeled as not maximal ε-meaningful when the NFA of the node is less than the NFA of any family of disjoint descendant sets. Again, the computational burden of such a procedure would be too large.
Instead, a maximal meaningful set will be defined in terms of any q-family of disjoint sets such that all the sets are siblings.
In summary, maximal ε-meaningful sets will be found using the FLST tree structure, and these sets will satisfy three conditions: they will be ε-meaningful, they will satisfy a containment condition, and they will satisfy a merging condition. The merging condition will be made more precise using the following two definitions.
The FLST tree structure and the notion of indivisibility motivate the definition of maximal ε-meaningful modes. Each non-leaf node of the FLST has one or more children; therefore for each non-leaf node there exists a q such that the collection of all children forms a q-family of disjoint sets. This topological structure of the FLST tree inspires the following definition of a family of sibling descendants of a tree node.
Definition 3.4. Let I ∈ R be a FLST tree structure node. The q-family of disjoint sets $\{G_i\}_{i=1}^{q}$ is a family of sibling descendants of I if there exists a FLST tree structure node H ⊆ I such that $G_i$ is a child of H for all 1 ≤ i ≤ q.
The notions of indivisible, ascendant, and descendant allow for the following definition of a maximal ε-meaningful mode. The next lemma shows that two maximal ε-meaningful modes are indeed disjoint.
Lemma 3.6. Let $G_1$ and $G_2$ be two maximal ε-meaningful FLST tree structure nodes; then $G_1$ and $G_2$ are disjoint.
Proof. Suppose $G_1$ and $G_2$ are maximal ε-meaningful modes and $G_1 \cap G_2 = J$ with J ≠ Ø; then by the FLST tree structure $G_1 \subset G_2$ or $G_2 \subset G_1$. (This is Proposition 2.4 in [36].) Without loss of generality assume that $G_1 \subset G_2$. Note that, by the definition of maximal ε-meaningful, $G_1$ is indivisible with respect to any family of sibling descendants of $G_1$, and $G_2$ is indivisible with respect to any family of sibling descendants of $G_2$. Since $G_1 \subset G_2$, $G_1$ is a descendant of $G_2$. There are three possible cases: Case 2 contradicts item three in Definition 3.5 of maximal ε-meaningful. Cases 1 and 3 contradict item four in Definition 3.5 of maximal ε-meaningful. Therefore $G_1 \cap G_2$ = Ø.
3.1. The meaningful mode algorithm. The algorithm to detect maximal ε-meaningful modes is a direct application of the definitions given above, but two computational issues need to be addressed before the algorithm is stated. First, the value used for |R| is discussed. The FLST transform motivated defining R as the family of sets without holes. A direct count of the number of elements in this family is difficult to obtain. A lower bound on |R| was used in this work, and the effect of using this lower bound is discussed. Second, the algorithm was simplified using the results in Section 2.4. More information on this issue follows.

The quantity |R| is used in two different situations: when a set is determined to be ε-meaningful, and when a set G is determined to be indivisible with respect to a q-family of sets. The consequences for the definition of ε-meaningful modes when |R| is changed are discussed first. Definition 2.3 for an ε-meaningful mode shows that changing |R| is equivalent to changing ε. In this paper the lower bound

$$|R| = \frac{L_1 (L_1 + 1) L_2 (L_2 + 1)}{4} \qquad (41)$$

was used for |R|. The relationship between ε and |R| implies that by using a lower bound more modes will be determined meaningful. Furthermore, Figure 5, discussed in more detail in Section 5.1, indicates that this choice of |R| is consistent with the notion of ε-meaningful given in Theorem 2.4.
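The interplay between |R| and ε can be made concrete with a short sketch. It assumes that the NFA of a single set is evaluated as $|R| \exp(-M H_1(r, p))$, that is, the number of tests times the Hoeffding bound on the binomial tail; Definition 2.3 is not reproduced here, so this form, the function names, and the numerical values should be read as assumptions.

```python
import math


def region_count_lower_bound(L1: int, L2: int) -> int:
    """Lower bound on |R| from equation 41: the number of rectangular regions."""
    return L1 * (L1 + 1) * L2 * (L2 + 1) // 4


def relative_entropy(r: float, p: float) -> float:
    """One-state relative entropy H_1(r, p) for 0 < p < 1 and p < r <= 1."""
    if r >= 1.0:
        return -math.log(p)
    return r * math.log(r / p) + (1 - r) * math.log((1 - r) / (1 - p))


def log_nfa(M: int, k: int, p: float, num_regions: int) -> float:
    """log of the assumed NFA bound |R| exp(-M H_1(k/M, p)).

    When k/M <= p the Hoeffding bound is uninformative, so the tail is
    bounded by 1 and the result is log |R|.
    """
    r = k / M
    if r <= p:
        return math.log(num_regions)
    return math.log(num_regions) - M * relative_entropy(r, p)


def is_meaningful(M: int, k: int, p: float, num_regions: int, eps: float = 1.0) -> bool:
    """A set is declared eps-meaningful when its NFA bound falls below eps."""
    return log_nfa(M, k, p, num_regions) < math.log(eps)


if __name__ == "__main__":
    R = region_count_lower_bound(100, 100)
    # A set covering 1% of the bins that holds 2% of M = 25000 samples.
    print(R, log_nfa(25000, 500, 0.01, R), is_meaningful(25000, 500, 0.01, R))
```

Because only the product of |R| and the tail bound is compared with ε, multiplying |R| by a constant has the same effect as dividing ε by that constant.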
The quantity |R| is also used to determine if a set is indivisible. Definition 3.3 for an indivisible set in conjunction with Definition 2.7 imply that a set G is indivisible with respect to a q-family of disjoint sets with r = r(G), p = p(G), $r = (r(G_1), r(G_2), \ldots, r(G_q))$, and $p = (p(G_1), p(G_2), \ldots, p(G_q))$. The binomial coefficient $\binom{|R|}{q}$ is bounded from below by $(|R|/q)^q$; therefore G is indivisible with respect to An upper bound on |R| is $2^{L_1 L_2}$, where $L_1$ and $L_2$ are the number of rows and columns of the histogram respectively. This upper bound is easily derived by noting that there are Note that q is typically much smaller than $L_1 L_2$; therefore the (q/M) log q term is small compared with the other term. Two comments can be deduced from this equation. First, the factor $(L_1 L_2)/M$ is the ratio of the number of bins to the number of objects; therefore, by choosing the number of bins to be significantly smaller than the number of samples, the second term in the right hand side of equation 42 is small. Second, the choice of |R| in equation 41 is smaller than the number of sets. The right hand side of equation 42 is, therefore, larger than the true value, and the calculations will err on the side of not merging sets. This completes the discussion on |R|.
The result given in Theorem 2.13 demonstrates that if a set G is indivisible with respect to the q-family of sets $\{G_i\}_{i=1}^{q}$, then G is almost always indivisible with respect to a sub-collection of the sets $\{G_i\}_{i=1}^{l}$ for l < q. These results motivate the following simplification in the algorithm. Definition 3.5 states that a node G is maximal ε-meaningful only if it is indivisible with respect to any family of sibling descendants. This definition implies that for a node with q children, the algorithm should check if G is indivisible with respect to all combinations of the q children. Consider a sub-collection $\{G_i\}_{i=1}^{l}$ taken from the q-family of sets $\{G_i\}_{i=1}^{q}$. Theorem 2.13 implies that it is unlikely that G is indivisible with respect to the l-family $\{G_i\}_{i=1}^{l}$ of disjoint sets and G is not indivisible with respect to the complete q-family $\{G_i\}_{i=1}^{q}$ of disjoint sets. As a consequence, for each descendant node, indivisibility will be checked only for all of the node's children $\{G_i\}_{i=1}^{q}$. The authors believe that this simplification does not substantially reduce the algorithm's effectiveness.
The meaningful mode algorithm consists of two essential parts. The first part is to calculate the FLST transform. The FLST algorithm implemented in this work is due to Monasse, and the details of his methodology will be left to the reference [36]. The choice of quantization levels is the only aspect of his algorithm that requires some discussion. The FLST algorithm can be performed for an arbitrary number of quantization levels. The largest bin value was chosen as the number of quantization levels for all histogram applications presented in this paper, because with this choice of quantization, no information is lost.
The second part of the algorithm locates maximal ε-meaningful modes, and its pseudo-code is presented below. The only required input to the algorithm is a FLST tree structure with the gray level of each node recorded. The algorithm consists of four loops. The first loop, lines 2 to 10, calculates important quantities for each node. This loop also determines if a set is ε-meaningful. A set cannot be maximal ε-meaningful if it is not ε-meaningful; therefore the other loops only consider the meaningful sets. The second loop tests the indivisibility condition for each ε-meaningful mode. The third loop determines whether any of the ε-meaningful modes contains a mode with a lower NFA. Any mode for which all the descendants have a higher NFA is called a dominant ancestor. The last loop verifies that each indivisible ε-meaningful mode is not contained in another ε-meaningful mode that is both indivisible and a dominant ancestor. Any tree searching technique (depth first, etc.) that visits each node once for each loop will suffice to perform the algorithm.
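The four loops described above can be sketched as follows. This is a simplified illustration and not the authors' pseudo-code: the `Node` class, its field names, and the function names are hypothetical, and the NFA value and the outcome of the indivisibility test are assumed to be supplied for each node rather than computed from the histogram.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    name: str
    nfa: float                 # assumed to be precomputed for the node
    indivisible: bool          # assumed outcome of the indivisibility test
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None


def link_parents(node: Node) -> None:
    """Record the parent of every node so that ascendants can be walked."""
    for child in node.children:
        child.parent = node
        link_parents(child)


def descendants(node: Node) -> Iterator[Node]:
    for child in node.children:
        yield child
        yield from descendants(child)


def maximal_meaningful_modes(root: Node, eps: float = 1.0) -> List[Node]:
    link_parents(root)
    nodes = [root, *descendants(root)]
    # Loop 1: keep only the eps-meaningful nodes.
    meaningful = [n for n in nodes if n.nfa < eps]
    # Loop 2: keep only the nodes that pass the indivisibility test.
    candidates = [n for n in meaningful if n.indivisible]
    # Loop 3: a dominant ancestor has a lower NFA than all of its descendants.
    dominant = {id(n) for n in candidates if all(d.nfa > n.nfa for d in descendants(n))}
    # Loop 4: reject nodes contained in an indivisible, meaningful dominant ancestor.
    modes = []
    for n in candidates:
        if id(n) not in dominant:
            continue
        ancestor = n.parent
        while ancestor is not None and id(ancestor) not in dominant:
            ancestor = ancestor.parent
        if ancestor is None:
            modes.append(n)
    return modes


if __name__ == "__main__":
    tree = Node("root", 0.5, False, [
        Node("A", 1e-6, True, [Node("A1", 1e-3, True), Node("A2", 5.0, True)]),
        Node("B", 1e-4, True, [Node("B1", 1e-9, True)]),
    ])
    print([n.name for n in maximal_meaningful_modes(tree)])
```

In this toy tree, node A is retained because both of its children have a higher NFA, while B is rejected in favor of its child B1, which has a lower NFA.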
The algorithm's complexity will be discussed next. Define the height of the tree, h, as the maximum distance from a leaf node to the root of the tree, where the distance is measured by the minimum number of parents that must be traversed to go from the leaf node to the root. The complexity of the algorithm can be estimated in terms of four integers; the height of the tree h, the number of leaves l, the number of objects M , and the number of shapes s. Since there are no assumptions on the number of modes or the shape of the modes, the height of the tree, the number of leaves, and the number of shapes are random variables with unknown distributions.
A discussion of the complexity of the FLST algorithm is given in Monasse's Ph.D. dissertation [36], and it will not be pursued further in this work. The first loop performs a calculation for each node and therefore requires on the order of s operations. The number of operations in each of the remaining three loops is bounded above by $s^2$.

Meaningful Mode Algorithm
Estimating the number of shapes in a histogram is difficult; an estimate of the complexity of the algorithm in terms of the number of samples or the number of bins in the histogram is therefore more useful, although such an estimate clearly overestimates the complexity. The difficulty in making a more precise estimate is due to the random nature of h, l, and s. The number of shapes, s, is bounded above by the number of leaves times the height of the tree, s < hl; therefore the number of operations O satisfies $O < hl + 3(hl)^2$. The height of the tree is bounded above by the number of objects, h < M. When the number of bins of the histogram, N, is less than the total number of objects, M, the number of leaves is bounded above by the number of bins, l < N. Combining these terms, the total number of operations is bounded above by $MN + 3(MN)^2$. The estimates of the complexity appear to be quite large at face value, but only sets that are ε-meaningful need to be checked in the last three loops, and only ε-meaningful and indivisible sets need to be checked in the last two loops. It is impossible to predict a priori the number of ε-meaningful, indivisible sets in a histogram, but, in all applications presented in this paper, the number of sets that were both ε-meaningful and indivisible was much less than the number of sets found during the FLST. The meaningful shapes found in Section 5 were located at least ten times faster than the computational time of the FLST despite the reduced theoretical complexity of the FLST tree. Better estimates of the complexity of the algorithm could be obtained if there were better estimates of the height h of the FLST tree given the total mass of the histogram and the number of histogram bins. The authors are not aware of any such estimates.
The next section collects results that increase the efficiency of the algorithm and discusses properties of ε-meaningful modes. This section is outside the main line of reasoning, but is included to aid the understanding of ε-meaningful modes.

4. Hoeffding bound relations.
Calculations of the NFA based on Hoeffding bounds serve as a basis for locating maximal ε-meaningful modes. Computational gains can be achieved by finding simple algebraic relations that infer the ordering between Hoeffding bounds, since multiplications are computationally inexpensive compared with logarithmic calculations. Furthermore, properties of the generalized Hoeffding bound in terms of the variables r and p, and relationships between $H_q(r, p)$ and $H_l(r, p)$ with q ≠ l, yield insights into the properties of ε-meaningful modes. This section begins with a simple algebraic test for the ordering of $H_1(r, p)$, and continues with two results about ε-meaningful modes. The relationship between the maximal modes of the multinomial and the variables p and r is also discussed.

4.1. Algebraic tests. The calculation of logarithms can be an order of magnitude more time consuming than simple multiplication; therefore algebraic tests that compare q-state relative entropy functions are important for fast calculations of ε-meaningful modes. The following theorem was motivated by the mean value algebraic tests developed by Desolneux, Moisan, and Morel in their book [20]. They asked the question: Is there a fast way to infer the monotone ordering $H_1(r_1, p_1) < H_1(r_2, p_2)$, and what properties does it reveal about meaningful intervals? They showed that the comparison of the 1-state relative entropy can be predicted by a simple equality test based on the mean value ratio µ(r, p) = r/p combined with the ordering $p_2 < p_1$. Their mean value test is insightful since it reveals the type of scaling ratio that is sufficient to infer ordering [20].
The next theorem provides a quick test for the ordering of H 1 (r 1 , p 1 ) with H 1 (r 2 , p 2 ) based on the ratios µ = r/p and ω = (1 − r)/(1 − p). To simplify notation, let µ i = µ(r i , p i ) and ω i = (1 − r i )/(1 − p i ) for i = 1, 2. The method of proof and the ω(r, p) = (1 − r)/(1 − p) test are apparently new. The proof is in the appendix.
Monte Carlo simulations show that the µ(r, p) test is valid 81% of the time in the region 0 < r < p < 1, which can reduce the computation time by one third over computing H 1 (r, p) for every value of r and p.

4.2.
ε-meaningful mode properties. The next series of results provides information on the set properties of maximal meaningful modes. To simplify notation, let $r(G_i) = r_i$ and $p(G_i) = p_i$.

Theorem 4.2. Let $G_1$ and $G_2$ denote two ε-meaningful disjoint sets and $G_3 = G_1 \cup G_2$. If $r_1, r_2, p_1, p_2 \in [0, 1]$ with $r_1 > p_1$, $r_2 > p_2$, $0 < r_1 + r_2 < 1$ and $0 < p_1 + p_2 < 1$ then

Proof. Since $G_1$ and $G_2$ are disjoint, both the area probability and frequency count are additive, which implies that $\min\{\mu(r_1, p_1), \mu(r_2, p_2)\} < \mu(r_3, p_3)$. When $\mu(r_1, p_1) = \mu(r_2, p_2)$ then $\max\{\mu(r_1, p_1), \mu(r_2, p_2)\} \le \mu(r_3, p_3)$.

The previous theorem illustrates the relationship between a parent shape and its decomposition into children sets. Statement 1 of this theorem indicates that when a set is decomposed into two other sets, then one of the other sets has a lower NFA than the parent set. Statement 2 of this theorem demonstrates that when the relative frequency µ(r, p) is equal for the two children sets then the NFA of one of the children sets is greater than the parent.
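The additivity used in the proof can be illustrated numerically. For disjoint sets, $r_3 = r_1 + r_2$ and $p_3 = p_1 + p_2$, so $\mu(r_3, p_3)$ is the mediant of $\mu(r_1, p_1)$ and $\mu(r_2, p_2)$ and therefore lies between them. The sketch below uses hypothetical function names and illustrative values.

```python
import math


def mean_value_ratio(r: float, p: float) -> float:
    """Mean value ratio mu(r, p) = r / p."""
    return r / p


def union_ratio(r1: float, p1: float, r2: float, p2: float) -> float:
    """mu(r3, p3) for the union of two disjoint sets (additive r and p)."""
    return (r1 + r2) / (p1 + p2)


if __name__ == "__main__":
    r1, p1, r2, p2 = 0.30, 0.10, 0.20, 0.15  # illustrative values with r_i > p_i
    mu1, mu2 = mean_value_ratio(r1, p1), mean_value_ratio(r2, p2)
    mu3 = union_ratio(r1, p1, r2, p2)
    print(min(mu1, mu2), mu3, max(mu1, mu2))
```

When the two ratios coincide, the ratio of the union equals their common value.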
Simple and insightful tests to predict the comparative ordering between H 1 (r, p) evaluated at the union of two sets and H 2 (r, p) evaluated on the respective separate sets are unknown. However, under the hypothesis of the following theorem, their interrelation can be compared for the important case when the two sets are disjoint. The q set counterpart of this theorem, which clearly exists, is left unstated. Theorem 4.3. Let G 1 and G 2 denote two meaningful cells such that r 1 + r 2 < 1, Proof. Consider the difference Since the expression t log t is strictly convex the log sum inequality applies [8]; therefore This statement proves the theorem.
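The log sum inequality invoked in the proof can be checked numerically. The sketch below assumes the q-state relative entropy has the form $H_q(r, p) = \sum_i r_i \log(r_i/p_i) + (1 - |r|)\log((1 - |r|)/(1 - |p|))$, which is consistent with the ratio bound but is not quoted from the text. Under that assumption the complement terms of $H_1$ on the union and $H_2$ on the separate sets are identical, so the log sum inequality gives $H_1(r_1 + r_2, p_1 + p_2) \le H_2((r_1, r_2), (p_1, p_2))$.

```python
import math
from typing import Sequence


def relative_entropy(r: Sequence[float], p: Sequence[float]) -> float:
    """Assumed q-state relative entropy H_q(r, p) for q-tuples with |r| < 1, |p| < 1."""
    rest_r = 1.0 - sum(r)
    rest_p = 1.0 - sum(p)
    total = sum(ri * math.log(ri / pi) for ri, pi in zip(r, p))
    return total + rest_r * math.log(rest_r / rest_p)


def entropy_gap(r1: float, p1: float, r2: float, p2: float) -> float:
    """H_2 on the two separate sets minus H_1 on their union (nonnegative)."""
    h2 = relative_entropy((r1, r2), (p1, p2))
    h1 = relative_entropy((r1 + r2,), (p1 + p2,))
    return h2 - h1


if __name__ == "__main__":
    print(entropy_gap(0.30, 0.10, 0.20, 0.15))  # illustrative values
```

The gap vanishes when $r_1/p_1 = r_2/p_2$, which is the equality case of the log sum inequality.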
The previous theorem combined with Theorem 2.13 implies that if the children of a set G partition the set into two disjoint subsets $G_1$ and $G_2$ such that $G = G_1 \cup G_2$, then the set G is almost always indivisible with respect to $\{G_1, G_2\}$.

4.3.
Relationship to the maximal mode. The boundary between the inequalities 0 < p < r < 1 and 0 < r < p can be simply and directly related to the maximal term of a binomial random variable. The multinomial characterization of the maximal term is more complex, and for completeness only the highlights are summarized in this section. For this discussion, both the probability q-tuple p and the outcome q-tuple k are augmented with the additional terms $p_{q+1} = 1 - |p|$ and $k_{q+1} = M - |k|$ respectively. This is not unreasonable, since any one of the terms in the multinomial distribution is determined by the remaining terms and either formulation is recognized. Among all the possible occurrence (q+1)-tuples k with $\sum_{i=1}^{q+1} k_i = M$, the mode is the one for which $m_q(M, k, p)$ is a maximum for fixed M and p. In the case of non-uniqueness, the joint modes are equiprobable neighbors. The inequalities for the multinomial maximal term are derived by considering the ratios of successive local neighbors of the multinomial distribution $m_q(M, k, p)$. It can be directly verified that a (q+1)-tuple k is maximal if and only if

$$p_i k_j \le p_j (k_i + 1)$$

for all pairs (i, j) with i, j ∈ {1, 2, ..., q, q+1}. Since the inequality is trivially satisfied if i = j, there are only q(q + 1) distinct inequalities.
According to an exercise in Feller, these inequalities can be combined to yield the following sharp set of inequalities attributed to P.A.P. Moran:

$$M p_i - 1 < k_i \le (M + q)\, p_i$$

for all i ∈ {1, 2, ..., q, q+1}. In the region r > p, if all of the components of the augmented (q+1)-tuples p and k satisfy the Moran inequalities, then they are possible modal candidates; otherwise, the probabilities $m_{q+1}(M, k, p)$ are always less than the mode(s). It is interesting to note that the minimum Moran inequality implies that by increasing each component of the modal (q+1)-tuple by one it falls in the region r > p, and if $M p_i < 1$, then $k_i = 0$ can be a component of the mode(s). Finucan demonstrated that the modal region is unique if

$$\max_{i \in \{1, 2, \ldots, q+1\}} \frac{k_i}{p_i} < \min_{i \in \{1, 2, \ldots, q+1\}} \frac{k_i + 1}{p_i}$$

or includes joint neighboring modes, which occur all in one cluster [23], whenever

$$\max_{i \in \{1, 2, \ldots, q+1\}} \frac{k_i}{p_i} = \min_{i \in \{1, 2, \ldots, q+1\}} \frac{k_i + 1}{p_i}.$$

It is instructive to rewrite the binomial example in the augmented form with $p_2 = 1 - p_1$, $k_2 = M - k_1$ and reinterpret the modal analysis. When the condition $M p_1 = k_1$ is satisfied, it is easy to verify that the 2-tuple $(k_1, k_2)$ is the maximal mode, because the identity $k_2 = M - M p_1 = M p_2$ yields the inequality $(k_2 + 1)/p_2 = (M p_2 + 1)/p_2 = k_1/p_1 + 1/p_2 > k_1/p_1$, which is equivalent to Finucan's. Thus, for any other occurrence 2-tuple $(k_1, k_2)$ satisfying the meaningful inequality 0 < p < r < 1, its probability $B(M, k_1, p_1) = P[X = k_1]$ is less than the maximal probability. By similar reasoning the condition for the multinomial becomes $M p_i = k_i$ for every i ∈ {1, 2, ..., q} such that the (q+1)-tuple $(k_1, k_2, \ldots, k_q, k_{q+1})$ is the modal (q+1)-tuple. The condition $(M + 1) p_1 = k_1$ implies that the 2-tuples $(k_1 - 1, k_2)$ and $(k_1, k_2 - 1)$ constitute a joint mode, because the identity $k_2 + 1 = M - (M + 1) p_1 + 1 = (M + 1) p_2$ yields the equality pairing needed in equation 53. This example motivates the following observation by Le Gall for the multinomial [30]. If for every i ∈ {1, 2, ..., q}, $(M + 1) p_i = k_i$, then there are q + 1 joint modal (q+1)-tuples $(k_1, k_2, \ldots, k_q, k_{q+1}) - \gamma_j$ with j ∈ {1, 2, ..., q+1}, where $\gamma_j$ is a (q+1)-tuple whose only nonzero component is the jth one. Since these arguments are reversible, it follows that the equality $(M + 1) p_1 = k_1$ is equivalent to the existence of a joint mode for the binomial. The meaningful inequality $M p_1 < (M + 1) p_1 = k_1$ is also satisfied; therefore an equiprobable maximal mode can also be meaningful. Thus the interrelation between simple algebraic constraints associated with the meaningful inequality and their interconnection with the modal behavior of the multinomial distribution are similar to those of the binomial distribution. Moreover, the role of the meaningful inequality test for the q-state entropy is consistent.

Figure 2. The average number of meaningful modes found in an $L_1 = L_2 = 100$ random histogram is plotted. For each ε = $2^j$ with j = 1, ..., 50, a hundred 2D random histograms were created where the x and y values were uniform random variables ranging from 0 to 1. The number of meaningful modes for each ε was then averaged. (Vertical axis: number of meaningful sets.)
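The modal inequalities can be checked by brute force on a small example. The sketch below enumerates all outcomes of a three-category multinomial, locates the mode, and tests the pairwise neighbor inequality $p_i k_j \le p_j (k_i + 1)$ and Moran's inequalities $M p_i - 1 < k_i \le (M + q) p_i$, written here for q + 1 categories. The function names and the numerical values are assumptions made for illustration.

```python
import math
from typing import Iterator, Sequence, Tuple


def multinomial_pmf(k: Sequence[int], p: Sequence[float]) -> float:
    """Probability of the occurrence tuple k under category probabilities p."""
    coeff = math.factorial(sum(k))
    for ki in k:
        coeff //= math.factorial(ki)
    value = float(coeff)
    for ki, pi in zip(k, p):
        value *= pi**ki
    return value


def compositions(M: int, parts: int) -> Iterator[Tuple[int, ...]]:
    """All tuples of `parts` nonnegative integers that sum to M."""
    if parts == 1:
        yield (M,)
        return
    for first in range(M + 1):
        for rest in compositions(M - first, parts - 1):
            yield (first,) + rest


def find_mode(M: int, p: Sequence[float]) -> Tuple[int, ...]:
    """Occurrence tuple with the largest probability, found by enumeration."""
    return max(compositions(M, len(p)), key=lambda k: multinomial_pmf(k, p))


def satisfies_pairwise(k: Sequence[int], p: Sequence[float]) -> bool:
    n = len(p)
    return all(
        p[i] * k[j] <= p[j] * (k[i] + 1) for i in range(n) for j in range(n) if i != j
    )


def satisfies_moran(k: Sequence[int], p: Sequence[float]) -> bool:
    M, q = sum(k), len(p) - 1
    return all(M * pi - 1 < ki <= (M + q) * pi for ki, pi in zip(k, p))


if __name__ == "__main__":
    M, p = 10, (0.2, 0.3, 0.5)  # here M * p_i is an integer for every i
    mode = find_mode(M, p)
    print(mode, satisfies_pairwise(mode, p), satisfies_moran(mode, p))
```

In this example $M p_i$ is an integer for every component, and the enumeration returns $k_i = M p_i$ as the mode, in line with the condition stated above.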

5.
Experiments. The experimental section is divided into three subsections. The first subsection investigates the effects of binning, sampling, and the support of the histogram. Two test problems are presented in this subsection. The first test problem consists of finding meaningful modes in a uniform random histogram. When the a-contrario model is correct, the number of ε-meaningful sets should be less than or equal to ε. The second test model is the geyser data analyzed by Venables and Ripley [45]. This data set illustrates the binning and sampling aspects of the algorithm.
The second and third subsections demonstrate two applications of the meaningful mode detection algorithm. The first application automatically finds dominant color features of an image. These features are established by transforming a color image to the CIELAB color space and analyzing the projection of this color space onto the (a, b) plane. The dominant color features located in this manner are shown to be robust to illumination changes. The third subsection extends the meaningful mode detection to correlation planes. A correlation plane is not a histogram, but this difficulty is overcome by using the Hoeffding and ratio bounds instead of multinomial tail sums. This demonstrates that the meaningful mode algorithm extends to arbitrary distributions, because the q-tuples r and p are the only variables of the relative entropy functions. The number of gray scale values for the FLST is reduced in the correlation plane problem from all the possible values to a fixed number. The algorithm is shown to be robust through a large range of quantization levels.
5.1. Test models. Theorem 2.4 asserts that for a random histogram the expected value of the number of ε-meaningful modes is less than or equal to ε, but an approximation to |R| was used and the FLST locates a reduced family of sets in R. The dependence of the number of false alarms on ε was investigated by creating 100 two dimensional histograms with M = 25,000 points and $L_1 = L_2 = 100$. The x and y values of these histograms were distributed uniformly between 0 and 1. The average number of meaningful modes found as a function of ε, ranging from 1 to $2^{50}$ on a logarithmic scale, is shown in Figure 5. In the figure the average number of meaningful modes is always less than the parameter ε, and, in particular, for ε = 1 the average number of meaningful modes is less than one. This result is consistent with the interpretation of ε as an upper bound on the expected number of false alarms given in Theorem 2.4.

This paper does not attempt to address the problem of histogram binning. The effect of bin size on the number of maximal meaningful modes found is investigated in the second test example, the well known Old Faithful geyser data analyzed by Venables and Ripley [45]. This data is freely obtainable from the R software package. The Old Faithful data, shown in Figure 3, consists of the waiting time in minutes between eruptions and the duration in minutes of the first eruption. A scatter plot of the data shows many possible groupings. The scatter plot is a little misleading, since there are several data points with the same value.
Three histograms are shown indicating the effect of binning on the number of modes found. When the bin size shrinks, the number of modes increases as more peaks are resolved. The two limiting cases of too coarse a bin size and too fine a bin size can be easily understood. If the histogram bin size is too coarse, then all the data will sit in one bin or a few connected bins. In this case, one maximal meaningful shape will be found. If each of the data points has a unique value, then it is possible to choose a bin size fine enough that each bin contains only one point. In this case there will be no maximal meaningful modes. If there are repeated values in the data, then some bins will contain more than one point, and enough repeated values will make that bin meaningful. These results are consistent with Venables and Ripley [45].

5.2.
Color feature extraction. Two dimensional histogram analysis applied to color image segmentation was investigated. Image color information is often stored in the RGB color space as a three dimensional vector per pixel, but for two dimensional histogram segmentation one of the color channels must be omitted. For this reason other color spaces were considered, and the CIELAB color space was chosen. The CIELAB color space attempts to be more perceptually linear and can be obtained from the sRGB color space through a nonlinear transformation. After transforming to the CIELAB color space, an image is still a set of pixels with a three dimensional vector representing the color value. Denote this vector as (l, a, b), with l ∈ [0, 1] indicating the luminance, a ∈ [−1, 1] representing the position between magenta and green, and b ∈ [−1, 1] representing the position between yellow and blue. In this color space an attempt is made to represent the color of the object in the two dimensional projection onto the (a, b) coordinates. The l component determines the brightness of the color, thereby reducing, but not eliminating, the effect of shadows on the color feature extraction. The two dimensional analysis presented here can be contrasted with the one dimensional histogram segmentation of Delon et al. [11]. They were able to correctly segment the image of beans, given in Figure 5, using a one dimensional histogram of the hue. The color features extracted from the beans image using the two dimensional histogram segmentation are also shown in Figure 5. The background is the predominant color feature with log(NFA) = −1000645. The different color beans are then extracted. The yellow, green, purple, and red beans had values for the log(NFA) equal to -8741.11, -4438.61, -3413.64, and -3374.42 respectively. In the analysis of Delon et al., all the pixels were assigned a color depending on the mode the pixel belonged to.

[Figure: Histogram of Beans Image, with axes a and b.]
In the analysis presented in this work, a pixel that does not belong to any maximal meaningful group was not assigned a color. Figure 6 shows the two dimensional (a, b) histogram that was obtained by converting the RGB data of the peppers image shown in Figure 7 to the (l, a, b) color space and then suppressing the l value. Through inspection of the histogram it is difficult to determine the number of modes or the functional form of the modes. The dominant mode is not symmetric; thus an isotropic Gaussian approximation is not appropriate for this mode.
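The construction of the (a, b) histogram described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the conversion uses the standard sRGB (D65) to CIELAB formulas, and dividing a and b by 128 to bring them into [−1, 1] is an assumed normalization, since the text does not specify one. The names `srgb_to_lab` and `ab_histogram` are hypothetical; the default of 500 bins per axis follows the histogram size quoted for Figure 6.

```python
import numpy as np

# sRGB (D65) to XYZ matrix and reference white; standard published constants.
RGB_TO_XYZ = np.array([[0.4124564, 0.3575761, 0.1804375],
                       [0.2126729, 0.7151522, 0.0721750],
                       [0.0193339, 0.1191920, 0.9503041]])
WHITE_D65 = np.array([0.95047, 1.00000, 1.08883])

def srgb_to_lab(rgb8):
    """Convert 8 bit sRGB values of shape (..., 3) to CIELAB with L in [0, 100]."""
    c = np.asarray(rgb8, dtype=float) / 255.0
    # Undo the sRGB gamma to obtain linear RGB.
    linear = np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)
    t = (linear @ RGB_TO_XYZ.T) / WHITE_D65
    delta = 6.0 / 29.0
    f = np.where(t > delta ** 3, np.cbrt(t), t / (3.0 * delta ** 2) + 4.0 / 29.0)
    L = 116.0 * f[..., 1] - 16.0
    a = 500.0 * (f[..., 0] - f[..., 1])
    b = 200.0 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)

def ab_histogram(rgb8, bins=500, ab_scale=128.0):
    """2D histogram of (a, b) with the lightness channel suppressed.

    Dividing a and b by ab_scale to land in [-1, 1] is an assumed normalization.
    """
    lab = srgb_to_lab(rgb8).reshape(-1, 3)
    a, b = lab[:, 1] / ab_scale, lab[:, 2] / ab_scale
    hist, _, _ = np.histogram2d(a, b, bins=bins, range=[[-1.0, 1.0], [-1.0, 1.0]])
    return hist
```

Because the lightness channel is dropped before binning, two pixels that differ only in lightness contribute to the same bin.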
There were six maximal meaningful modes found in the segmentation, and the first five, ordered by increasing NFA, are shown in Figure 7.
Figure 6. The logarithm of the (a, b) histogram of the image in Figure 7 is shown. It is difficult from this image to determine the correct number of modes of the histogram. The logarithm was used for visualization only, and the analysis was performed on the original histogram. This histogram had $L_1 = 500$ and $L_2 = 500$.
The most meaningful dominant color feature was the skin tones. The image in the top panel of Figure 8 underwent two different transformations. The middle panel was obtained by scaling the luminance value l down by a factor of four before the dominant color features were found. The luminance value is independent of the (a, b) plane; therefore the algorithm is invariant to this change. Images, however, are typically stored in RGB format. To simulate the luminance change in real images, the l value was scaled and the result was converted back to 8 bit sRGB values before being re-analyzed. This transformation introduces some error due to the 8 bit quantization. This error explains the slight differences between the top and middle right hand images. The bottom panel of Figure 8 was obtained by scaling the sRGB values by a factor of four before the dominant color features were extracted. The most dominant color feature was still the skin tones, but not all of the same pixels were extracted. The difference in the pixels being extracted can be explained by the nonlinear transformation from the sRGB to the CIELAB color space.
The theory presented for two dimensional histogram segmentation is readily adaptable to histograms of other dimensions. The only difficulty is in deriving an algorithm that will quickly construct a containment tree structure analogous to the FLST. For one dimensional histograms the algorithm is obvious, and the application to a one dimensional histogram color segmentation is shown in Figure 9. In this example the RGB color space was first transformed to the HSV color space, and then only the hue (H) histogram was analyzed to extract the lady bug from the image. With a two dimensional histogram, the lady bug is not extracted since the red pixels of the lady bug are spread over many bins, while in the one dimensional histogram the red pixels are concentrated in a few bins.
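The one dimensional hue histogram used in this example can be built with a few lines. The sketch below is only an illustration; the function name `hue_histogram` and the default of 360 bins are assumptions, and the conversion relies on the standard library routine `colorsys.rgb_to_hsv`, which returns the hue scaled to [0, 1).

```python
import colorsys
import numpy as np

def hue_histogram(rgb8, bins=360):
    """Histogram of the HSV hue channel of 8 bit RGB pixels (hue scaled to [0, 1))."""
    pixels = np.asarray(rgb8, dtype=float).reshape(-1, 3) / 255.0
    # colorsys works on one pixel at a time; gray pixels are reported with hue 0.
    hues = [colorsys.rgb_to_hsv(r, g, b)[0] for r, g, b in pixels]
    hist, _ = np.histogram(hues, bins=bins, range=(0.0, 1.0))
    return hist
```

Since gray pixels have no well defined hue, a practical analysis would probably treat low saturation pixels separately; that step is not shown here.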
Delon et al. analyzed the same lady bug image to find an automatic color palette using K-means clustering and the Helmholtz principle applied to one dimensional histograms [10]. Their work assigns a color to each pixel, and it is able to automatically determine the correct color to pixel mapping to obtain an accurate representation of the image. The three colors extracted by the method presented in this paper are analogous to the three modes in the hue histogram found in the work of Delon et al.
Figure 9. Color Segmentation of Lady Bug Image. Using a one dimensional histogram, the lady bug is extracted from this image. Analysis of the one dimensional histogram gives three maximal meaningful modes. The first two modes correspond to the background leaves, and the third mode is the lady bug. A two dimensional histogram will miss the lady bug.
... and ratio tail bounds can be extended to analyze a correlation plane for pattern recognition. The NFA definition is stated in terms of binomial or multinomial tail sums, and it depends on the variables r, p, |R|, and M; therefore the calculation of the binomial or multinomial tail sums requires knowledge of the total number of objects in the histogram, the count of the elements in a given set, and the probability of an element to be in the set. The Hoeffding upper bound and the ratio tail upper bounds, however, were used in the operational definition of the NFA, and these upper bounds only depended on the ratios, r, of the number of elements in the sets, k, to the total number of elements, M.
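As a rough illustration of an NFA computed from the Hoeffding bound, the sketch below evaluates the logarithm of |R| exp(−M H_1(r, p)) with r = k/M. It assumes that H_1 is the Bernoulli relative entropy and that the bound takes the q = 1 form of the expression given in the appendix; both are assumptions about definitions stated elsewhere in the paper, and the function names are hypothetical.

```python
import math

def relative_entropy(r, p):
    """Bernoulli relative entropy r*log(r/p) + (1-r)*log((1-r)/(1-p)), for 0 < p < 1."""
    def term(x, y):
        return 0.0 if x == 0.0 else x * math.log(x / y)
    return term(r, p) + term(1.0 - r, 1.0 - p)

def log_nfa_bound(k, M, p, num_regions):
    """Logarithm of num_regions * exp(-M * H_1(r, p)) with r = k / M (assumed q = 1 form)."""
    r = k / M
    if r <= p:
        # The Hoeffding bound applies for r > p; fall back to the trivial bound tail <= 1.
        return math.log(num_regions)
    return math.log(num_regions) - M * relative_entropy(r, p)
```

Working with the logarithm avoids underflow for large M, which is consistent with the very negative log(NFA) values reported for the color examples.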
Correlation pattern recognition consists of using a reference signal as a template and calculating the cross-correlation between the template and an input scene. The output is a correlation plane. Local maxima of the correlation plane correspond to potential targets, but a measure of the peak height and quality must be used to distinguish clutter from targets. There are many measures of peak quality, but one of the most popular is the peak to side lobe ratio (PSR) [29]. The PSR is measured by considering a square annulus around the local maximum consisting of two square windows of size $R_1$ and $R_2$ with $R_1 < R_2$. If p is the location of the peak, then the annular region, A, is the set of points that lie inside the larger window centered at p but outside the smaller one. Let c(x) be the correlation plane, let m be the mean of c(x) over the annular region, and let s be the corresponding standard deviation. The PSR is then given by the difference between the peak value c(p) and m, divided by s, that is, PSR = (c(p) − m)/s. If the PSR is above a certain threshold, then the peak is identified as a target. More details on the MACH filter and the PSR can be found in Kumar et al. [29].
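The PSR computation can be sketched as follows, using the usual definition in which the mean and standard deviation of the correlation values over the annulus are subtracted from and divided into the peak value. Treating R_1 and R_2 as half-widths of the square windows and using the population standard deviation are assumptions made for this illustration, as is the function name.

```python
import numpy as np

def peak_to_sidelobe_ratio(c, peak, r1, r2):
    """PSR of correlation plane c at index peak = (row, col).

    r1 < r2 are taken as half-widths of the inner and outer square windows (an assumption).
    """
    rows, cols = np.indices(c.shape)
    # Chebyshev distance to the peak gives square windows.
    dist = np.maximum(np.abs(rows - peak[0]), np.abs(cols - peak[1]))
    annulus = (dist > r1) & (dist <= r2)
    side = c[annulus]
    return (c[peak] - side.mean()) / side.std()

# Hypothetical plane: a checkerboard of 0 and 2 with a peak of height 6 at the center.
plane = (np.indices((11, 11)).sum(axis=0) % 2) * 2.0
plane[5, 5] = 6.0
psr = peak_to_sidelobe_ratio(plane, (5, 5), r1=1, r2=3)
```

In this toy plane the annulus holds equal numbers of zeros and twos, so its mean and standard deviation are both one and the PSR equals five. A flat annulus would give a zero standard deviation, which a practical implementation would need to guard against.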

Figure 10 (panel titles: Local Maxima Correlation Plane; Meaningful Modes Correlation Plane). The application of finding maximal meaningful modes in a correlation plane is shown. A template was used to identify multiple targets in a single image. The targets are identified as peaks in the correlation plane that have a high peak to side lobe ratio (PSR). The top panel shows the original ladar data with the potential targets in blue boxes. The middle panel illustrates the results using the PSR values of local maxima. The bottom panel shows the targets found in the correlation plane using the meaningful mode algorithm. The meaningful modes procedure reduced the false alarms by three, and none of the targets were repeated. The local maxima technique found 117 locations for targets, and moreover most of the targets were found multiple times using the local maxima technique.
There are many different types of correlation filters, and the one investigated in this work is the maximum average correlation height (MACH) filter. This filter has been used to locate multiple targets of the same type in synthetic aperture radar (SAR) and laser radar (ladar) imagery [29,31,44]. The MACH filter was chosen since it offers improved performance and reduced computational load compared with many other filters [44].
A problem exists when local maxima are used as a screening step in correlation planes generated with MACH filters because the correspondence between local maxima with high PSR and targets is not one to one. This lack of pairing between local maxima and targets is partially due to mismatch between the template and the output signal, which is caused by a rotation and scaling with respect to the reference template. The mismatch between the reference template and the output signal results in smaller modes of arbitrary shape. Just as the meaningful mode algorithm can find modes of arbitrary shape in histograms, the algorithm will extend to find modes of arbitrary shape in correlation planes.
Figure 11. The robustness of the correlation plane algorithm under a change in quantization levels is explored. The threshold for the peak to side lobe ratio (PSR) was fixed at 5.3. There are slight differences in the number of false alarms and the number of missed targets until 12 quantization levels. The meaningful mode algorithm does not have adequate resolution to distinguish between separate peaks at a low quantization level since many of the peaks are merged into one.
Since the meaningful mode algorithm uses the FLST and requires a probability distribution as input, the correlation plane underwent two transformations before it was analyzed. First it was shifted by a constant such that the minimum was zero, and then it was quantized. The shift does not change the location of local maxima; therefore the local maxima still correspond to potential targets. Because the maximum number of distinct values in a discrete grid is the number of grid points, the number of quantization levels for the FLST can be as high as the number of grid points in the digital correlation plane. The number of quantization levels was reduced to 250 to speed up the analysis. Figure 11 demonstrates that there was little difference in the results when the quantization was changed from 250 levels to 25. Errors in finding targets arose when 12 quantization levels were chosen and separate peaks were merged into one. After a quantization level is chosen, M must be the number of quantization levels times the area under the shifted correlation plane in order to preserve the probability interpretation of r.
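The two transformations can be sketched as below. The text does not state how the quantization was carried out, so the uniform mapping onto the integers 0 to levels − 1 used here is an assumption, as is the function name; the total count M described above is not computed in this sketch.

```python
import numpy as np

def shift_and_quantize(c, levels):
    """Shift the plane so its minimum is zero, then map it to integers 0 .. levels - 1.

    Uniform quantization relative to the maximum is an assumed scheme.
    """
    shifted = np.asarray(c, dtype=float) - np.min(c)
    top = shifted.max()
    if top == 0.0:
        # A constant plane carries no peaks; return all zeros.
        return np.zeros(shifted.shape, dtype=int)
    return np.rint(shifted / top * (levels - 1)).astype(int)

# Hypothetical 2 x 2 correlation plane.
plane = np.array([[-1.0, 0.0], [1.0, 3.0]])
quantized = shift_and_quantize(plane, levels=5)
```

The largest value stays at the same grid location after the transformation, while coarser quantization can map neighboring peaks to the same level, which is the merging effect noted for 12 levels.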
An example of the meaningful mode algorithm applied to a correlation plane is given in Figure 10. The top panel is a rendered image of ladar data with targets and clutter present in the image. The imaged area has 19 potential targets, and the targets are displayed in blue boxes. The boxes are for display purposes only and were not used in the analysis. The local maxima method identified 117 potential target locations, many of them duplicates, missed two of the targets, and had five false alarms, as shown in the middle panel. The bottom panel shows the improved ability of the meaningful mode algorithm to correctly identify the targets. A total of 17 of the 19 targets were found with the meaningful mode algorithm, and two false alarms were found.

6.
Conclusions. A novel method of finding two dimensional histogram modes based on the Helmholtz principle and the FLST tree structure has been presented. This method was then applied to the problems of image color segmentation and correlation plane pattern recognition. The Helmholtz principle was used to create a parameter free technique to locate the modes while the FLST tree structure was exploited to coordinate the search for modes. Its inclusion structure was utilized to find disjoint modes of the histogram that are unlikely to occur in a uniform histogram.
The development of maximal ε-meaningful modes using the FLST tree structure required advancements in multinomial tail bounds. These advancements include a generalization of the Hoeffding relative entropy bound to the multinomial distribution. The containment genealogy ordering in the tree of shapes incorporates parents with a variable number of descendants, something the new q-state relative entropy bound can accommodate. Moreover, the q-state bound is always tighter than the Joag-Dev/Proschan product bound. Finally, the maximal ε-meaningful mode criterion depends on the q-state entropy, which fits within the Helmholtz framework of Cao et al.
Generalized algebraic mean value ratio tests that predict the monotone ordering of the 1-state relative entropy function $H_1(r, p)$ were derived; these tests provide algorithmic shortcuts for the determination of maximal ε-meaningful modes. Algebraic factoring of the q-state relative entropy function $H_q(r, p)$ and a simple test that predicts the monotone ordering between the 1-state and 2-state entropy functions also facilitate the maximal ε-meaningful mode determination.
The meaningful mode algorithm was applied to color image segmentation and to correlation pattern recognition. A useful two dimensional histogram of color pixels was created by transforming the RGB color space into the CIELAB color space and suppressing the luminance value. This created a color histogram that is less susceptible to shadows. The resulting color image segmentation was demonstrated on two pictures. The extracted colors in both images were very uniform and captured most of the pixels that visually represented that color. The color feature extraction also benefited from the morphological nature of the FLST tree structure. The FLST tree structure, combined with the invariance of the variables r and p upon multiplication of the histogram by a constant integer, generated a morphologically invariant approach. This morphological invariance translated to robustness under contrast changes for color image segmentation.
An alternative multiple target identification procedure based on MACH filters was also described and demonstrated. This technique was able to reliably locate multiple unique targets, and, without the addition of new parameters, it lowered the false detection rate compared with previous methods.
Applications of the meaningful mode algorithm are not restricted to those presented here. The method has also been used by Hewer et al. to analyze two dimensional wavelet histograms to generate an image cartoon-texture decomposition. See reference [25] for a complete discussion of their procedure. The algorithm also naturally extends to all families of correlation filters. The work of Kumar, Mahalanobis, and Juday discusses other correlation filters and correlation pattern recognition [29].
Meaningful gaps, introduced in [20], are defined as intervals that contain fewer points than the expected average. While meaningful gaps are not included in the discussion, they could be accommodated within the q-state entropy framework. Moreover, the maximal ε-meaningful mode analysis is readily extensible to histograms of three or more dimensions. Unfortunately, a three dimensional FLST algorithm does not exist at this time.

Appendix.
6.1. Expected number of false alarms. The interpretation of the NFA as an upper bound on the expected number of ε-meaningful modes serves as an argument for its use. The proof that the expected number of false alarms is less than ε is given below. Only the multinomial theorem will be proven since the binomial theorem is a special case. First a necessary lemma will be stated and proven. The proof of this lemma is modeled after a lemma in [5].
The binomial tail is well defined on the set $C = \{k \in \mathbb{Z}^q \mid k_i > 0,\ k_1 \le M,\ k_2 \le M - k_1,\ \ldots,\ k_q \le M - k_1 - k_2 - \cdots - k_{q-1}\}$. Extend $m_q(M, k, p)$ to $\mathbb{Z}^q$ by defining $m_q(M, k, p) = 0$ for $k \notin C$. Define the function $k \mapsto (k_2, k_3, \ldots, k_q)$ as the projection of k onto the last q − 1 elements and let ...
Proof. Define D as the collection of all the q-families of disjoint sets of the form ..., then the number of q-families in D consisting of disjoint sets is less than $|R|^q$. The random variable $K = (K(G_1), K(G_2), \ldots, K(G_q))$ represents the random number of elements in each set, and define $p = (p(G_1), p(G_2), \ldots, p(G_q))$ with $p(G_i) = |G_i|/N$, where N is the number of bins.
Denote by $Y(\{G_i\}_{i=1}^q)$ the binary random variable equal to 1 if $\{G_i\}_{i=1}^q$ is ε-meaningful and to 0 otherwise. The random variable $Z = \sum_{\{G_i\}_{i=1}^q \in D} Y(\{G_i\}_{i=1}^q)$ counts the number of ε-meaningful modes. By linearity of the expectation ... where the last inequality is true by the previous lemma. Using this result, then for equation 65 ... Let $r = k/M$; then the inequality $\mathcal{M}(M, k, p) \le \exp(-M H_q(r, p))$ implies that $(M + 1)^{q-1} |R|^q \mathcal{M}(M, k, p) < \varepsilon$ whenever $(M + 1)^{q-1} |R|^q \exp(-M H_q(r, p)) < \varepsilon$, and the expected number of ε-meaningful modes is less than ε.
6.2. Algebraic tests proof. The following is a proof of theorem 4.1. The algebraic tests in this theorem can reduce the computational time needed to determine whether a set is ε-meaningful. Recall the definitions µ(r, p) = r/p and ω(r, p) = (1 − r)/(1 − p). The subscripts on $\mu_i$ and $\omega_i$ are a shorthand to distinguish between the variables $p_i$ and $r_i$.
Theorem 6.3. Let $r_1, r_2, p_1, p_2 \in [0, 1]$ with $r_1 > p_1$, $r_2 > p_2$, $0 < r_1 + r_2 < 1$, and $0 < p_1 + p_2 < 1$.
(3) The proof for the ω inequality follows by similar reasoning. Note that $D_\omega(p)$ is the difference between the relative entropies evaluated along the two curves $r_\omega(p)$ and $r_{H_1}(p)$. After applying the differential inequality for the lower bound ω(r, p), it follows that the total derivative is now negative. The final inequality $H_1(r_1, p_1) > H_1(r_2, p_2)$ is satisfied because the needed inequality $r_2 \le r_\omega(p_2) = \omega_1 (p_2 - 1) + 1 < r_1$ follows from the inequalities $p_1 > p_2$ and $\omega_2 \le \omega_1$.
The proofs of (2) and (4) follow from obvious continuity considerations of $H_1(r, p)$.