Plugin procedure in segmentation and application to hyperspectral image segmentation

In this article we give our contribution to the problem of segmentation with plug-in procedures. We give general sufficient conditions under which plug in procedure are efficient. We also give an algorithm that satisfy these conditions. We give an application of the used algorithm to hyperspectral images segmentation. Hyperspectral images are images that have both spatial and spectral coherence with thousands of spectral bands on each pixel. In the proposed procedure we combine a reduction dimension technique and a spatial regularisation technique. This regularisation is based on the mixlet modelisation of Kolaczyck and Al.


Introduction
In this article we study the segmentation problem which is a particular learning problem that generalizes classification (as defined in [5]) by asking for multiple simultaneous decisions instead of a simple decision. In segmentation, we have an observation x = (x [1], . . . , x[N ]) in X N (in this paper, X = R p but some results are more general). This observation is associated to a label y = (y [1], . . . , y[N ]) with values in the product space {0, 1} N . Note that we restrict ourself to the binary segmentation mainly to simplify the theoretical study, however, in the applications of this paper y takes values in {1, . . . , K} N . In the segmentation problem, the label y is unknown and a segmentation procedure is a function g : X N → {0, 1} N that tries to guess the correct label associated to a given observation. For example, in a grayscale image segmentation, N is the number of pixel in the image and X = R, in the hyper-spectral image segmentation problem X = R p with p very large. The segmentation error of a segmentation procedure g can be measured with a distance d on {0, 1} N by d(g(X), Y ). In this article we will use the normalized Hamming distance d H defined by The value of d H (g(X), Y ) represents the proportion of misclassified pixels.
In order to analyze the theoretical performances of the proposed procedures, we introduce a probabilistic setting, and let (X, Y ) be an X N × {0, 1} N valued random pair, modeling an observation and the corresponding label. Let P X be the distribution of X, P Xi the distribution of X[i] for all i ∈ {1, . . . , N }, and P Y the distribution of Y . For k = 0, 1, i = 1, . . . , N , let P ik be the probability distribution of X[i] given Y [i] = k, let π i = P Y (Y [i] = 0). The distribution of the random pair (X, Y ) may be described by ((P ik ) i,k , (π i ) i ). In this article, we make the following assumption We measure the performance of a segmentation procedure g by E[d H (g(X), Y )] and it is easy to see that, under assumption 1, the optimal procedure, e.g the one that minimizes E[d H (g(X), Y )], is given by The two step framework in a plug-in perspective, the construction of a segmentation procedure approaching g * can be divided into two steps.
• Step 1: Learning step. Find the substitute (P i0 ,P i1 ) i=1,...,N for the conditional distributions on each pixel (P i0 , P i1 ) i=1,...,N . • Step 2: Segmentation Step. FindP ∈ × N i=1 Conv(P i0 ,P i1 ) 1 using the observation X drawn from P X (the observed image). Note that finding such a distribution is equivalent with finding a set of weights π(P X ) = (π i (P X )) i=1,...,N ∈ [0, 1] N witĥ π i (P X )P i0 + (1 − π i (P X ))P i1 . ( The (plugin) segmentation procedure obtained with this construction is Remark 1. The segmentation rule depends on the observation X[i] through the evaluation of the likelihood ratio dPi0 dPi1 (X[i]) but also depends on the whole image X through the evaluation of the weigh vector π(P X ) in step 2.
In the applications of this article, a learning set, composed of n independent random variables drawn from P 0i and P 1i ∀i = 1, . . . , N , is given in the first step. This is what we will refer to as the supervised segmentation, and this justifies the name of the first step.
Obtained rate of convergence. In order to measure the difference, between a segmentation procedure g and the best segmentation procedure g * , it convenient to introduce the excess risk: In this article, we give rates for the convergence of S(g) to zero. The procedure we describe in Section 2 for the estimation of the weight π(P X ) is a general model selection procedure introduced by Kolaczyk et Al. [9]. The algorithm in step 2 has never been used before but it is in line with step (c) in the work of Antoniadis et Al. [1]. Apart from our numerical studies and the fact that we combine step 1 and step 2, our main contribution is theoretical. Indeed, we obtained rates of convergence for a class of plugin segmentation procedures. The corresponding results are summarized in Theorem 2. It gives a relation between the convergence of S(g) and separately with the choice of (P i0 ,P i1 ) i=1,...,N and the complexity of the class of possible weights π(P X ).
• f : i → π(P X )[i] satisfies some smoothness assumption (which are fulfilled in the case of the boundary fragment model as described in [11]).
(recall that p is the dimension of X , N is the number of pixels in the image, n is the size of the learning set and the notation A B means that there exists a constant c > 0 such that A ≤ cB.) Foreseen applications. In satellite imagery, images often contain more than 200 spectral bands. In mars satellite imagery, (see [13]) geologists have a clear idea of what type component they will find within images and they can create a learning set. This learning set can be made out of samples from images that have been analyzed by an expert. Anyway, this expert cannot identify spectra from the tera bytes of data that flows from Mars to the earth, and the proportions of the different component in the learning set taken from a randomly chosen place on mars cannot be used to infer on what will be the proportion in a new image coming from another part of Mars.
In medical imagery of the brain the problem is also exploratory, but the number of images is relatively small and if experts can analyze images by themselves, contamination by noise makes a statistical support attractive. Images contain thousands of spectral bands.
The remote-sensing literature about supervised and unsupervised segmentation procedure of images is really large, however, only a few procedures have been developed to tackle the multi and hyperspectral image segmentation problem. Finaly, we are not aware of any work providing theoretical assessment of image (or hyperspectral image) segmentation procedure with a learning step.
Structure of the paper. This article is constructed as follows. In Section 2 we give our main theoretical result which concerns step 2 (segmentation step). In Section 3 we give an algorithm that aims at estimating the conditional density under gaussian assumption and when X is a high dimensional space R p with p large. This algorithm gives a solution for step 1 and is shown to satisfy necessary condition for step 2 to be consistent (i.e Assumptions 5 from Section 2). In Section 4 we apply the whole algorithm (step 1 plus step 2) to hyperspectral (medical and satellite) image segmentation. In Section 6 we give the proof of our theoretical results.

Algorithm and main result
In this section, we give the algorithm for step 2 (estimating the weights) and associated theoretical results. This can be considered as the main result of this paper.

Mixlet estimation
The mixlet algorithm has been introduced by Kolaczyk et Al. [9]. It a model selection algorithm based on a penalized maximum likelihood estimation.
Let M N be a finite subset of × N i=1 Conv(P i0 ,P i1 ) (i.e a subset of models) and pen N : M N → R + a penalty function. Note that the set M N can either be seen as a set of measures, as a set of weights, or (because it is finite) as a set of densities with respect to a dominating probability measure. The mixlet estimation of π(P X ) is obtained by findingP X given bŷ For the penalty function and the associated set of models, we only require a kraft inequality: This type of assumption is standard in model selection theory (see [2]). This can be seen as a complexity assumption on the set of models and penalty. Such inequality then results from Kraft inequality, as, for example in Kolaczyk and Nowak [10]. Equivalently, it can be seen as a topological covering bound, as, for example in Barron et Al. [2]. The way we use this inequality in the proof of the following theorem is inspired from the work of Birgé [3].

The "d-dimensional" hyperspectral image
In this example, we examine the choice of M N and pen N in a particular case related to a "d-dimensional" hyperspectral image.
Each index of {1, . . . , N } will now be associated to the center of one of the N = n d pixels of a d-dimensional hypercube: [0, 1] d (here we assume that pixels in the image are d-dimensional hypercube with equal size). As a consequence, giving a segmentation g(X) of X is equivalent with finding a particular partition of [0, 1] d into groups of pixels. We will search those partition within the set of recursive dyadic partition. Recall that a recursive dyadic partition of [0, 1] d (in short RDP d ) is a partition constructed recursively and associated to a 2 d -tree, e.g a tree with 2 d sons or one leave at each node. A splitting of a node in the tree correspond to a splitting of a d-dimensional hypercube into 2 d identical hypercubes.
The set of models M N and the associated penalty function will be use in the numerical application to hyperspectral image segmentation. For i = 1, . . . , N , we will search π(P X ) i in a regular grid of [0, 1] with N 3/2 elements (take the entire part of N 3/2 if it is not round). This grid of [0, 1] will be denoted G N . Finally M N will be the set of product distribution . . , N and i → π(Q) i constant on each piece of a given RDP d . The minimal RDP d on which π(Q) i is constant will be P(Q). The function pen(Q) penalize the partitions that are too rich: where m d−1 is the number of elements of the partition P(Q). It is known that with this type of penalty, we have a kraft inequality (see for example [9]) end hence Assumption 2 is fulfilled.
The corresponding model selection algorithm (for step 2), i.e used to find the minimum in Equation (5) with the defined set of models and associated penalty function, can be implemented efficiently with a pyramidal algorithm (see [9]) and has been called mixlet algorithm.

Theoretical result
Before we state the theoretical result we give assumptions that have to be fulfilled Assumption 5. There exists C > 0 such that where χ 2 (P 1 , P 0 ) is the chi square divergence between two probability distribution (P 1 and P 0 ) defined by The obtained result is the following Theorem 1. Under the Assumptions [1][2][3][4][5], and ifĝ is a classification rule constructed with the two step given in the preceding section (defined by Equation as long as N e −N ψN,n = O(ψ N,n ) where c 0 is a positive constant, S the excess risk defined by Equation (4) ∀R ∈ M N , h(R, N ) = π(P X ) − π(R) l1 + pen N (R).' (10) and the error term related to the learning step is given by where The proof of this result is postponed in the annex. Let us discuss the assumptions of this Theorem.
• In order to get a rate of convergence for the full process (step 1 and step 2) we need to provide a bound to E[L n,N ] (where this last expectation is with respect to the learning set). This will be the purpose of Section 3. • Assumption 4 is rather strong. If the number of pixel with a pure component (π i = 0 or 1) is small (i.e the order of the upper bound in the preceding theorem), the results are still valid. We think that the construction of M N should be changed to avoid this assumption, in particular the discretization of the set of values for π i should be refined near 0 and 1. This will be the purpose of further work. • Assumption 4 is related to the choice of the model in ad-equation with the structure of π(P X ). In the next section we explain how this choice can be done in the case of a "d−dimensional image".

Turning back to the "d dimensional image"
In order to be able to upper bound inf R∈MN h(R, N ) (tradeoff between bias and complexity) it is natural to introduce assumption about the "spatial" regularity (i.e regularity of the weights in the image) that can be handled by a RDP d . This is done by the following Definition and Assumption .
where t i is the center of the hypercube i of P I d .

Remark 2.
This assumption is an assumption on the topological structure of T N . This structure is more complex when d is bigger.
With M N and pen N as defined previously, and under Assump- The proof of this proposition can be founded in Donoho [7] or in the Annex of Kolaczyk et Al. [9]. Corollary 1. Let d ≥ 2 and suppose that for all i = 1, . . . , N , k = 0, 1, we haveP ik = P ik . Under the Assumptions 1, 3, 4 and ifĝ is a classification rule constructed with the two step given in the preceding section (defined by Equation (3)) withP X in the first step, given by Equation (5) where S is the excess risk of segmentation refined by (4).
This corollary is a direct consequence of the preceding theorems.
Remark 3. This result together with the one in the next Section may be seen as a complete convergence description of the algorithm that is used in Section 4. Unfortunately it is not the case because the only application we have are not in the case where the number of possible class is two.

Dimension reduction in segmentation: a solution to step 1 in high dimension
In this section, we investigate Step 1 when X = R p under the following assumption Assumption 7. For k = 0, 1, i = 1, . . . , N , P ki is gaussian with mean µ k and covariance C. For k = 0, 1, C − µ k has less than p 0 + 1 non null components, where p 0 is bounded with respect to p, n and N . The matrix C is diagonal.
Note that, under this assumption, P ki does not depend on the position i. For k = 0, 1, we suppose that we have n k independent random variables Z kj (n k is a positive integer) drawn from distribution P k1 . The set Z = {Z kj , k = 0, 1, j = 1, . . . , n k } is the learning set. If A is a squared matrix, we will use the notation A − for the associated generalised inverse.
The algorithm for estimating P k1 (k = 0, 1), i.e the learning step, is as follows.

ComputeÎ aŝ
3. The means µ 0 and µ 1 are estimated bŷ . . , p, k = 0, 1, and the covariance C by the diagonal matrixĈ with diagonal elementŝ  (3) withP X in the first step, given by Equation (5), we have: The proof of this Theorem is given in Subsection 6.4. The weakness of this theorem is that it require the knowledge of C. We did not succeed in giving a proof in more general case and we believe that further improvement of this result is beyond the scope of this paper.

Application to hyperspectral image segmentation
Before we give the details of our application to hyperspectral medical image segmentation we have to emphasis that the theoretical results we gave are designed for a two class segmentation (K = 2). However, in most application the number of possible classes is larger than two and the algorithms we gave can easily be extended to the case when K > 2. Indeed, the penalized maximum likelihood estimation of the weight (π i ) i can be used when K > 2 and the dimension reduction algorithm can be extended to a multiclass framework. This last extension can be done using a global measure of the contrast between groups.

Application to medical hyperspectral segmentation
Hyperspectral images of the brain from magnetic resonance imaging are high dimensional data. These images have only a few pixel (N = 256 pixels) giving the detail of a slice of the brain but on each pixel, we observe a high dimensional spectra with p = 1024, hence X = R 1024 . A given spectra is expected to give a complete information on the tissular characteristic at a given spatial position. These tissular characteristics can be classified into groups. In this medical problem, we have a learning set composed of 62 spectra from four different groups: 21 Glioblastomas of type A, 9 Glioblastomes of type B, 16 Méningiomes, and 9 healthy tissues. We were given an hyperspectral image associated to a Glioblastoma mixing type A and type B. This image (its spatial configuration) is simulated from spectra obtained in a real experimentation. The obtained segmentation and the true segmentation are given in Figure 2. Our conclusion is that the tumor is well localized but that the different types of Glioblastomas are not distinguished. Note that if the result are positive, this partly results from a pre-treatment of the data (ad-hoc re-phasing of the spectra) and from the fact that we did not include any metastases in the problem (metastases and glioblastomas are hard to distinguish). Studying automatic re-phasing will be the purpose of further research. Figure 2. Obtained segmentation -on the left-, and segmentation that we should obtainon the right-. The pixels that are colored in blue correspond to healthy tissues, the green is associated to type B Glioblastomas and the red is associated to type A glioblastomas.

Conclusion
We studied the problem of supervised segmentation. We gave theoretical results in a plugin perspective that allow to consider a wide range of model selection procedure. We showed that the procedure of Kolaczyk et Al. [9] (the mixlet procedure) can be applied consistently for segmentation of images with smooth boundaries. We gave a theoretical result that separate the segmentation error due to the learning step and the segmentation error due to the segmentation step (estimation of the weigh in the mixture model). We gave a reduction dimension procedure for the learning step and gave associated theoretical results. The corresponding result gives the convergence rate of the whole segmentation procedure (learning step plus segmentation step), this convergence rate is adapted to the case when the dimension p of the feature space of a pixel observation is much larger than the number n of elements in the learning set. Finally, we applied the whole methodology to medical image segmentation.

A general result in segmentation with a plugin rule
Any segmentation procedureĝ can be summarized byR π : X N → [0; ∞] N throughĝ[i](X) = 1 1<Rπ[i](X) ∀i ∈ {1, . . . , N }. We obtained Theorem 3 below which gives an upper bound on the excess risk S(ĝ) under the following assumption on the error made while estimating R π : where E(R π ,R π ) is given by and We also need the following additional assumption . Take R π and the associated optimal segmentation procedure g * as defined by Equation 1. Under Assumption 1, and ifR π satisfies assumption 8 9, then

where S is the excess risk defined by Equation 4.
This Theorem is proven in Annex. Assumption 9 is a weak assumption that will be satisfied if P 0i and P 1i are not as closed as desired (in i ∈ {1, . . . , N }) for a well chosen distance. Notice that Assumptions 3 and 4 imply Assumption 9. The way to obtain inequality 13 in Assumption 8 will be the topic of Subsection 6.2 but the Theory developed by Birgé in [3] is our main reference and inspiration on the topic.
To understand the interest of Theorem 3 one should notice that a simple analysis gives On the other hand, it is possible (using same argument that are used in the proof of Corollary 2) to show that Assumption 8 implies However, Assumption 8 is weaker than and this shows that Theorem 3 is sharper than Equation 17.
Proof. To simplify notation we will set The proof of the Theorem is decomposed into 3 steps Step 1: we claim that there exists 1 > c > 0 such that (where the supremum is taken over all subset of {0, . . . , N } of size k). Indeed, we can notice that we have, for all where this last inequality results from the bounded difference inequality. Also, setting inf I:|I|=k c(I) = 2c (c > 0 from Assumption ??) gives Inequality 19.
Step 2: we claim that with c 0 , c 1 > 0 and ψ N,M as in Assumption 8 we have Cauchy Schwartz inequality gives, for all which implies: Finally, Inequality 20 follows from Assumption 8 and the fact that, for any i ∈ {1, . . . , N }, Step 3: Standard calculous (see for example [6] noticing that Ω(R π [i]) = |η i − 1/2|) leads to where these last two inequalities follows from the results of step 1 and 2. Since N e −c ′ N ψN,n = O(ψ N,n ) for any c ′ > 0 this gives the desired result

A general oracle inequality for penalized maximum likelihood estimation in mixlet model
In this Subsection, we give a general result on the estimation of the weigths. We first define the mean Hellinger distance between product measures.
are two product distributions on X N , we will call mean Hellinger distance: H N , the positive quantity defined by where h 2 (P i , Q i ) = X ( √ dP i − √ dQ i ) 2 is the squared Hellinger distance between P i and Q i . Theorem 4. Suppose that M N satisfies the Assumption 2 and thatP X is given by Equation 5. Then, under Assumption 5, there exists c ′ , c ′′ > 0 such that where h is given by Equation 10 and L N,n by Equation 11.
Before we give the proof of this theorem, let us give some comments. The result of this theorem is an oracle inequality aiming to verify assumption ?? as it has been noticed in Subsection 2.4. The function g(R, N ) measures the tradeoff between bias and complexity for model R. The error term N i=1 χ 2 (P i1 , P i1 ) + χ 2 (P i0 , P i0 ) is related to step 1 but should also be connected to Remark ??.
The Assumption 5 is necessary to obtain theoretical results. It is weaker than the Assumption sup x∈X ,k1,k2∈{1,...,K} 2 which is common in mixture model estimation (see for example the thesis of Li [12], or the work of Kolaczyk et Al. [9]). Note that the Assumption given by (22) is not satisfied when the P k are gaussian. Our Assumption 5 allows us to consider gaussian mixture. Kolaczyk et Al. [9] introduced the idea of mixture weight estimation by maximum likelihood estimation. In the same paper, they give a theoretical result without using the mean Hellinger distance which weakened their result. In addition, they consider only the case whereP ik = P ik with d = 2 and use the assumption related to Equation 22. From this point of view, our result (Theorem 4 together with Proposition 1) is a significant improvement of the result obtained in [9]. Indeed, for all i = 1, . . . , N k = 0, 1, under assumption given by equation (22), and assumption 6 with d = 2, there exists a positive constant c 0 such that We did not succeed in using this last bound to obtain a result in the segmentation problem (such as Theorem 1). This is the reason why we worked on obtaining stronger results such as Theorem 4 and its consequences.
The result may be difficult to apprehend in the preceding theorem, also we give the following simple corollary (it is a weaker result). Proof. The proof relies on the same principle that the one exposed by Birgé in [3]. More precisely, the densityP X is a penalized maximum likelihood estimator, but it is also a T -estimator. As a consequence, we have (from the sub-additivity of probability measures). In addition, for all Q ∈ M N Markov inequality leads to For all R, Q ∈ M N , by applying twice Cauchy-Schwartz inequality, we have:

(this equation defines A, B and C)
where We first give an upper bound for A: This bound is easy to obtain by using the standard inequality Equations (24), (25) and (23) and Assumption 2 (Kraft inequality) then give We now only need to show that (h(R, N ) is given by Equation (10)) and with c ′ > 0 and c ′′ > 0.
Let us begin with Equation (26). Easy calculous (using in particular the concavity of x → x 1/2 lead to On the other hand, Assumption 4 easily gives (for a positive constant c) which gives, using Equation (28) The proof of this Lemma is only simple variational analysis (all the functions of x that appear have maximum on 0 or 1), and the use of the identity We now show Equation (27). We have and this, together with the third equation of the preceding Lemma, gives: with, for all i = 1 . . . , N , Using the inequality max(a, b) ≤ a + b for all a, b ∈ R gives the desired restult (i.e Equation (27)).

Proof of Theorem 1
The aim of this proof is to use Theorem 4. First, one should notice that Assumptions 3 and 4 imply assumption 9. Let us defineR Then use the following lemma gives a kind of triangular inequality, it results from simple analysis.
Using the Lemma with x = R π /R π and y =R π /R π gives Also, because P (X + Z ≥ δ) ≤ P (X ≥ δ/2) + P (X ≥ δ/2) for any real valued random variable X, Y , we have We first bound the first term of the right hand side of Inequality 29. Using Assumption 4 gives (for a positive constant c ′ using Assumption 4) and we can use Theorem 4 to conclude that where h is given by Equation 10.
Let us now bound the second term of the right hand side of Inequality 29. To simplify notations, we set The finite difference inequality implies that N Also, taking t = (δ/2 − e) + ( (x) + stands for the positive part of x) gives Finally, because e ≤ N φ N,M , there exists c 0 , c 1 , c 2 > 0 such that ∀δ ≥ 0 P X E(R π ,R π ) ≥ δ ≤ c 2 e c0N φN −c1δ , This inequality and inequality 31 imply that Assumption 8 from Theorem 3 is fulfilled and ends the proof.

Proof of Theorem 2
First notice that the subscript i in Theorem 2 does not play any role, we will chose i = 1 and omit the corresponding subscript in the rest of the proof (in particular g * (x) will stand for g * (x) [1]). Because S(ĝ) is upper bounded by 1, we only need to upper bound separately the 3 terms that substantially appear in E[min(L N,n /N, 1)]: E 1 = E min 1, E Ω 2 dP 1 dP 1 , E 2 = E min 1, χ 2 (P 1 ,P 1 ) where the last expectations in E 1 E 2 and E 3 are with respect to the learning set.
Upper bound for E 1 The upper bound for E 1 easily follows from the fact that E Ω 2 (dP 1 /dP 1 ) ≤E log 2 dP 1 /dP 1 ≤E L 2 (X) Upper bound for E 2 follows directly from the upper bound for E 3 since χ 2 (P 1 ,P 1 ) = E P1 e 2L(X) − 1.
This ends the proof.

Proof of corollary 2
Proof. We only have to use Proposition 3 of Birgé [3]: Lemma 3. Let Y be a positive random variable with P (Y > y) ≤ αe −y 2 for y ≥ȳ and α > 0.
Then, for all q ≥ 1, where ζ q is a function defined on R + decreasing and such that ∀x ≥ cq, ζ q (x) = q 2 e −x , where c = 1/2 if q ≤ 2πe and 0.612 otherwise .
We applied the preceding Theorem to check the hypothesis of the Lemma withȳ 2 = 2 c ′ inf R∈MN {g(R, N )} + c ′′ L N,n , α = eȳ 2 , Y 2 = N H 2 N (P X ,P ). As a consequence, whenȳ > (cq) 2 we have αζ q (ȳ) ≤ q 2 e −ȳ/2 , which leads to the desired result (for N large enough, and because H 2 N ≤ 2, for all N by changing the constant).