Model selection for the segmentation of multiparameter exponential family distributions

We consider the segmentation problem of univariate distributions from the exponential family with multiple parameters. In segmentation, the choice of the number of segments remains a difficult issue due to the discrete nature of the change-points. In this general exponential family distribution framework, we propose a penalized log-likelihood estimator where the penalty is inspired by papers of L. Birg\'e and P. Massart. The resulting estimator is proved to satisfy an oracle inequality. We then further study the particular case of categorical variables by comparing the values of the key constants when derived from the specification of our general approach and when obtained by working directly with the characteristics of this distribution. Finally, a simulation study is conducted to assess the performance of our criterion for the exponential distribution, and an application on real data modelled by the categorical distribution is provided.


Introduction
Penalized-likelihood approaches are becoming more and more popular in numerous domains in statistics: density estimation, variable selection, machine learning, etc. Here we consider a multiple changepoint detection setting for univariate datasets where we are interested in the estimation of the number of segments K. While earlier works have focused on variables distributed from specified distributions, for instance Gaussian [1], or either Poisson or negative binomial distributions [2], we consider here a more general framework of exponential family distributions. More precisely, the model is the following: where G is a distribution from the exponential family with parameter θ t ∈ R d . In the context of changepoint models, we want to consider that θ t is piece-wise constant along the time-line 1, . . . , n and we therefore wish to identify a partition of {1, . . . , n} into K segments within which the observations can be modeled as following the same distribution while they differ between segments.
In our framework we will consider the minimal canonical form of the exponential family distribution which we will write as where we use the . symbol to denote the canonical scalar product of R d , A is the log-partition function of G and T t = T (Y t ) (∈ R d ) is the minimal sufficient statistic associated with variable Y t .
While almost all methods for choosing the number of segments can be seen as penalized-likelihood approaches (Akaike Information Criterion, [3], Bayes Information Criterion [4], Integrated Completed Likelihood [5], etc), we and other authors (see for instance [1,6,7]) have previously emphasized how crucial the choice of the penalty function is in contexts such as segmentation where the size of the collection of models grows with the size of the data. For this reason we have extended the approach developed in the Poisson and negative binomial cases to more general distributions from the exponential family. The motivation for this work initially relied on the very strong similarities which we had observed between the Poisson and negative binomial distributions. Most of those similarities are in fact related to properties of the log-partition function A.
Exponential family distributions, and in particular the log-partition function, have been well studied in the past years. In a pioneer work [8], Brown has described the fundamental properties of exponential family distributions, such as parametrization using sufficient statistics, differentiability of the log-partition function and its relation to moments, etc. More recently, [9] has demonstrated the strong links between graphical models and exponential family, [10] has studied the sub-exponential growth of the cumulants of an exponential family distribution and studied convergence rates of regularization algorithm under sparsity assumptions while [11] has studied consistency properties of the lasso procedure under some convexity assumption.
In this paper our goal is to mimic the procedure followed in the Poisson and negative binomial cases for obtaining the oracle inequality while emphasizing the role of the log-partition function. Considering the essential role played by the chi-square statistic in this earlier work, we will restrict the considered families to those with positive marginal sufficient statistics (hence allowing the definition of χ 2 ), thus we will assume that the set of natural parameters θ t is restricted to a compact ν such that ∇A(ν) belongs to {R + } d . This for instance excludes the Gaussian distribution unless we assume the mean parameter to be positive.
Among the key features of minimal exponential family is the relationship between the derivatives of A and the moments of the sufficient statistics. Hence the first two moments are given by Moreover, using minimal representation of the exponential family ensures that the gradient mapping ∇A : ν → ∇A(ν) is a bijection (see for instance [9]). These properties will be fundamental in the construction of the oracle inequality. The proof of the later mimicks the approach adopted in [2], most of the time studying each marginal of the sufficient statistic separately.
In Section 2 we introduce the collection of models and the penalized-likelihood framework and state our main result while in Section 3 we derive exponential bounds for the sufficient statistic. Proofs of our main statement and oracle inequality are given in Section 4. In Section 5 we explicit the constants and assumptions used in a list of classical law from the exponential family such as Poisson, exponential, gamma, beta, etc, and we study the particular case of categorical variables in Section 6 to assess the precision lost in dealing with general exponential family instead of directly bounding the particular distribution. Finally, in Section 7, we illustrate the performance of our approach on a simulation study based on the exponential distribution, and propose an application to DNA sequence distribution on a real data-set which we model using a piece-wise constant categorical distribution.

Model Selection Procedure
2.1. Penalized maximum-likelihood estimator. In our change-point setting, we will want to consider partitions m of the set {1, . . . , n} on which our models will be piece-wise constant. More precisely, for a given partition m and denoting J a generic segment of m, we define the collection of models associated to m as: We will consider the log-likelihood contrast γ and the associated Kullback-Leibler distance K(s, u) = E[γ(u)−γ(s)] between distribution s and u so that for distributions The minimal contrast estimatorŝ m of s on the collection S m is thereforeŝ m = arg min u∈Sm γ(u) and is given byŝ where T J = t∈J T t , is the sum of the sufficient statistics on segment J, andT J = T J /|J| its mean; the bijective mapping of the gradient of A ensuring the existence and uniqueness of ∇A −1 (T J ). Similarly, the projections m of s in terms of Kullback-Leibler on collection S m iss m = arg min u∈Sm K(s, u) and is given bys As is classical in penalized-likelihood settings, since minimizing the Kullback-Leibler risk would require knowing the true distribution s, we will wish to choose a penalty function pen(m) such that the penalized estimatorŝm, wherem = arg min γ(ŝ m ) + pen(m), satisfies an oracle inequality of the type To this effect, here as in previous works (see for example [12]), we will introduce an event of large probability, Ω m f (ε) defined on a minimal partition m f , where the fluctuation of each centered marginal is bounded. On this event we will derive tight controls of the risk of the models which will lead to the first part of the oracle inequality, i.e. E[K(s,ŝm)] ≤ C 1 E[K(s,ŝ m(s) )]. On the complementary of this event, we will obtain less tight results which are compensated on expectation by the negligible probability of the event. This control will result in the C 2 constant. The choice of ε is therefore crucial in insuring that both C 1 and C 2 are as small as possible, while having the negligibility property of C 2 compared to C 1 E[K(s,ŝ m(s) )]. In practice, this choice is data-driven and is efficiently performed through the use of the slope heuristic [13]. In this paper, we therefore consider a generic but fixed ε, and aim at obtaining the shape of the penalty function. This choice is given in the next section.

Main result.
Theorem 2.1. Let s = {s(t)} 1≤t≤n be a distribution from the exponential family such that its parameters θ t belong to a compact set ν with greatest band ν and such that ∇A(ν) ⊂ {R + } d . Assume there exist some positive constants ζ i and κ i and a positive constant κ such that where R i t is the radius of convergence of A(θ t + zδ i ), δ i = (0, . . . , 0, 1, 0, . . . , 0) and α i t is an exponential control of the growth rate of the coefficients of its power series. Let ζ = min i {ζ i }. Let M n be a collection of partitions constructed on a partition m f such that there exists Γ > 0 satisfying ∀J ∈ m f , |J| ≥ Γ log(n) 2 , and let (L m ) m∈Mn be some family of positive weights satisfying where E min = min 1≤i≤d {E(T i )} and E max = max 1≤i≤d {E(T i )} are bounds on the expectation of the sufficient statistics.
Remark 2.2. It can be noted that some assumptions of theorem 2.1 are in fact redundant and were included to simplify notations and clarify the dependency to the required assumptions. Indeed, • the existence of E min and E max are guaranteed since we assume that θ belongs to a compact ν.
Similarly, the greatest bound ν of the compact needs not be introduced and we could use θ max −θ min instead. However in general the marginals of the sufficient statistic might not fluctuate in the same subset of R + and therefore ν is a refinement of the bound. • bounding θ (on ν) also implies that the variance of the sufficient statistics are bounded (since A is analytic), and therefore the existence of ζ and κ in assumption (H) are guaranteed.
The theorem could therefore be re-written with the minimal conditions that θ ∈ ν, ∇A(ν) ⊂ {R + } d and the hypothesis on m f and Σ.
In our change-point setting, we will choose weights L m that depend on m only through their number of segments |m|, resulting in L m = 1.1 + log n |m| (see [1] for a justification of this choice). This leads to a penalty function of the form: (2) pen(m) = β ε d|m| 1 + 4 1.1 + log n |m| , and the inequality of theorem 2.1 therefore becomes: E h 2 (s,ŝm) ≤ C βε inf m∈Mn K(s,s m ) + dβ ε |m| 1 + 4 1.1 + log n |m| The following proposition gives a bound on the Kullback-Leibler risk associated toŝ m , which we prove in Section 4.1. This bound is then integrated into the main theorem to obtain our oracle inequality in corollary 2.4.
where a ≥ 2 and C(ε) is a constant that depends on ε, the constants involved in hypothesis (H), and the constraints on the compact ν.
Corollary 2.4. Let s = {s(t)} 1≤t≤n be a distribution from the exponential family such that its parameters θ t belong to a compact ν of {R + } d . Let M n be a collection of partitions constructed on a partition m f such that there exists Γ > 0 satisfying ∀J ∈ m f , |J| ≥ Γ log(n) 2 , and assume ∃ ζ and κ such that assumption (H) is verified. There exists some constant C such that We can therefore apply the large deviation result from [16]: Exponential bound for χ 2 m . We now consider m f a partition of {1, . . . , n} such that ∀J ∈ m f , |J| ≥ Γ(log(n)) 2 , and we assume that M n is a set of partitions that are constructed on this grid m f . We first introduce the following set Ω m f defined by: with a > 2.
The following proposition gives an exponential bound for χ 2 m on the restricted event Ω m f (ε). Proposition 3.1. Let Y 1 , . . . , Y n be independent random variables with distribution s (from the exponential family) and verifying (H). Let m be a partition of M n with |m| segments and χ 2 m be the statistic given by (6). For any positive x, we have We can then conclude by applying Bernstein's inequality [12] taking v = 2 5 (κ(1 + ε)) 2 |m| and c = 4 (κ(1 + ε)): While for a given i the {Z J (i)} J∈m are independent variables, in general for a given J the variables {Z J (i)} 1≤i≤d are not. We conclude the proof using lemma 3.2: Lemma 3.2. Let X 1 , . . . , X n be real random variables and let a 1 , . . . , a n ∈ R n and b 1 ,

4.1.
Proof of proposition 2.3. Before we focus on the main theorem, we prove our lower bound on the risk of a model. By definition, we have We define the compact subset K(ε) of R d as the pre-image by ∇A of the domains induced by Ω m , that is where B(x, r) denotes the closed ball centered in x with radius r of R d . Since we consider the union of a finite number of balls, homeomorphic properties of ∇A ensures that K(ε) is a compact set of R d . Since ∇ 2 A is definite positive, A is m ε -strongly convex on the compact set K(ε), and we have where m ε can be chosen as a lower bound on the smallest eigen-value of ∇ 2 A on K(ε).

Now introducing
we obtain Now if we consider M ε an upper bound of the eigen-values of ∇ 2 A on K(ε), then on Ω m f (ε) we have and therefore Combining the previous results leads to and using E[χ 2 m ] ≥ dζ|m| and Cauchy-Schwarz inequality on E[χ 2

4.2.
Proof of theorem 2.1. To prove theorem 2.1, we will control each of the three terms in equation (5) individually. To this effect, we introduce a generic partition m of M n . Let Then we can controlγ(ŝ m ) −γ(s m ) with the following proposition : We therefore havē Using Cauchy-Schwarz inequality, where χ 2 m has been defined in equation (6) and V 2 m has been defined in equation (7). Then from equation Finally, using proposition 3.1, The expectation of the second term can be bounded using the following proposition: where ν is the greatest length of the compact ν.
Proof: We recall thatγ Then Finally, we control the last term using proposition 4.3 for which a proof can be found in [2].
Applying it to u =s m yields: We then define

4.2.4.
Proof of the theorem. We can now combine the previous propositions in order to prove our main theorem. To this effect, we introduce the following event: We then apply the proof of [2] with C(ε) = E max κ(1 + ε) 2m ε , which yields: .

Characteristics of classic laws
In this section we provide some explicit values for the constants used in the main theorem for some usual distributions from the exponential family. While we do not detail all computations, in Table 1 we summarize the parameters of the distribution, the sufficient statistics associated with the natural parameters, the expression of the log-partition function and possible choices for the values of ζ and κ.
Here we detail the computations for the exponential distribution. Some details for other distributions are provided in Appendix A. The exponential distribution can be written in its natural form as We then obtain Table 1. List of most usual univariate laws from the exponential family.

Particular case of categorical variables
The goal of this section is double. We first wish to illustrate how our general approach can be used when working with a specific distribution, in this example the categorical distribution. Indeed, while we have shown that the penalty shape is the same for all distribution, the constants can be defined properly with quantities depending on the distribution of interest. We then show how the main results could be derived if working directly with the characteristics of the categorical distribution instead of dealing with the log-partition function from the exponential family. We conclude this section by comparing the constants obtained in each case.
6.1. Application of the general approach. Here we suppose that Y can take values between 1 and d + 1 and we denote {p i ; 1 ≤ i ≤ d} the probability that Y belongs to categories 1 through d (so that In the canonical form, the parameters θ are given by θ i = log The inverse mapping of the gradient of the log-partition function is given by 6.1.1. Verification of hypothesis (H). Let us assume that the probabilities of categories 1 to d are bounded away from 0 and 1. We will want to assume the same property on the normalizing category d + 1, so that there exists a > 1 such that − log a < θ i < log a for all 1 ≤ i ≤ d. Our compact set ν is therefore defined by [− log a; log a] d , and ν = 2 log a. This leads to: The Laplace transform of sufficient statistic T i is which is analytic in z provided z ≤ log 1 + 1 E i . Let R = log 2 and let z ∈ [−R; R]. Then the cumulants associated with T i can be obtained through the recurrence property c i 1 = p i and for k ≥ 2, where we have switched back to usual proportion notations for sake of readability. Let P k+1 (p) = ∂c i k ∂p with p = p i . We have P 2 (p) = 1, polynomial in p of degree 0 = k − 2 and with sum of absolute coefficients = 1 = 2 2−3 * 2!. Let us assume that for a given k > 1, P k (p) is a polynomial in p of degree k − 2 and with sum of absolute coefficients less than 2 k−3 k!. Then

and with sum of absolute coefficient less than
We therefore obtain by recurrence that P k (for k ≥ 2) is a polynomial in p of degree k − 2 and with sum of absolute coefficients less than 2 k−3 k!. From this we get: , 1/R}. Since κ i = 1 and R = log 2, we can take κ = 2. Note that considering the sufficient statistic T i independently of other categories can be interpreted as considering a surrogate random variable X with success X = 1 if Y = 1 and X = 0 otherwise. The distribution of X is then Bernoulli with parameter p = p i and the results claimed in the previous Section can be deducted from what precedes. 6.1.2. Computation of m ε and M ε . For a given vector of proportions (π 1 , . . . , π d ), the smallest eigen value of ∇ 2 A is greater than Π This can be seen by applying Sylvester's criterion to the symmetric matrix ∇ 2 A − ΠId. Considering the mapping between natural parameters and proportions, we obtain that Π can be defined as Π(η) = exp( η i ) 1 + e η i d+1 .
Therefore let z ∈ K(ε). There exists E such that ∇A(z) ∈ B(E, ε), that is there exists E such that We therefore have: and can conclude by taking m ε = (E min − ε) d+1 .
The greatest eigen-value of ∇ 2 A can be bounded by the trace of the gradient matrix, in this case the sum of the variance of each sufficient statistic. Therefore, for a distribution with proportions {q i }, one would get Therefore, on any set we have M ≤ 1 and in particular, on K(ε), we can take M ε = 1.

6.2.
Direct results for the categorical variables. This section is dedicated to the derivation of direct controls when studying the categorical distribution. As before, Y will take values in 1, . . . , d + 1 and p i will denote the probability of category i. Once again, it is possible to reduce the number of parameters to d, however keeping all d + 1 parameters leads to more tractable quantities to control and to smaller resulting constants. For sake of readability and to avoid confusions, we will denote r = d + 1.

6.2.1.
Notations and main result. The density s t of Y t can be decomposed in r terms by denoting s(t In this specific case, quantities such as the contrast or the Kullback-Leibler risk can be identified : • The model S m is: • the contrast is the log-likelihood defined by: • Associated to this contrast, the Kullback-Leibler information between s and u is:  By Cauchy-Schwarz inequality,γ where Finally, the set Ω m f (ε) defined on a minimal partition m f is defined in this context by All those quantities are defined in the same manner than in our general approach, as we can recover E i J = t∈J s (t, i) and T i J = N J (i). The main difference is that the dimension differs since r = d + 1 terms are considered in each sum. Control of K (s m ,ŝ m ) by V 2 m .
Lemma 6.2. For all positive densities p and q with respect to µ, one has if one notes f = log q p .
Applying this lemma (for which a proof can be found in [15]), we have on Ω m f (ε) Exponential bound for χ 2 m . We want to control the chi-square statistic around its expectation by using Bernstein's inequality as in Section 3. Here the required control of N J (i) can be obtained through a direct application of the bounded version of Bernstein since 1 {Y =i} is bounded by 1. We get: which, as in the general approach, can be used to obtain P Ω m f (ε) c ≤ C (Γ, ρ, a, r, ε) n a .
Then, since . and ∀t, i ρ ≤ s (t, i) ≤ 1, and r − 1 ≥ r/2 for r ≥ 2, we have the following bounds By controlling the moment of Z J (i) (as in Section 3) and using Bernstein's inequality again, we conclude to Control of K (s m ,ŝ m ) by χ 2 m . The term V 2 m can be written as follows: for all x > 0, on the set Ω m f (ε), we have: Combining this equation with relation (11) gives, on Ω m f (ε)

6.2.3.
Control of the two other terms. We explicit here the control ofγ (s m ) −γ (s) andγ(ŝ m ) −γ(s m ) which appeared in the negligible constant of the risk.
• Control of the termγ (s m ) −γ (s). We have to control and we bound this expectation by • Control of termγ(ŝ m ) −γ(s m ). We use the following proposition Proposition 6.3. For every u ∈ S and any positive x, and h 2 is the Hellinger distance.
The proof relies on the same arguments than for proposition 4.3.
6.2.4. Proof of theorem 6.1. The proof of this theorem is obtained by following the lines of Section 4.2.4: we introduce sets of large probability, Ω 1 (ξ) and Ω 2 (ξ) and gather the previous results on the control of each terms of the main decomposition. This results in the main inequality : {K (s,s m ) + pen (m)} + C (Σ, r, λ ε , ρ, Γ) .
Proof: According to inequalities (12) and (13), we get the following lower bound of the risk: Using the results of proposition 6.4 and theorem 6.1, we obtain the following oracle-type inequality: Corollary 6.5. Let pen : M n → R + be defined by (9), and assume s ≥ ρ for some ρ > 0. Then there exists some constant C such that for an exhaustive search, {K (s,ŝ m )} + C(Γ, ρ, a, r, λ ε , Σ).

6.3.
Comparison of the constants. First, we aim at comparing the bounds of the risk of our estimator between the general and direct approach (see equations (1) and (10)). In the general approach, the constant is expressed as C βε = Using the results of Section 6.1, we have E max < 1, κ = 2 and m ε = (E min − ε) d+1 , so that C 2 (ε) can be bounded by We therefore obtain: which can further be bounded by C βε = 2β In the direct approach, the constant is In both cases, β ε and λ ε are the constants from the penalty function that are tuned from the data. We can notice that the general shape of C βε and C λε are the same, except that the power of the penalty constant is directly related to the number of categories in the general case while it is fixed to 3 in the direct approach. This is compensated by, on one hand, a greater constraint on β ε since a > 1 and thus β ε > r r while λ ε > 1/2, and on the other hand by a different multiplicative factor in the penalty function, since we obtain pen(m) > β ε d|m|(1 + 4 √ L m ) 2 in the general approach and pen(m) > λ ε (d + 1)|m|(1 + 4 √ L m ) 2 in the direct approach.
Then comparing the oracle inequalities given by (4) and (15) in the general and direct cases respectively results in comparing the constants C that is: (14)) in the direct one. C βε β ε and λ ε C λε behaving as previously, one looses at least a constant 1 2ζ(E min ) d+1 between the general and the direct approaches.

Simulation study and application to DNA sequences
In this section, we apply our estimator in two scenarios. The first is a simulation study where the observations are sampled from the exponential distribution with piece-wise constant rate parameter. In this case the distribution is continuous and the sufficient statistic is unidimensional (and is the variable itself). The second is an application to a real data-set on the analysis of DNA sequences in terms of basecomposition. The objective is to apply a segmentation model in order to find homogeneous regions which can be related to structural and functional biological regions. In this case we model the observations with a categorical distribution with piece-wise constant proportion parameters. The distribution is therefore discrete and the sufficient statistic is multivariate (equal to the indicator of the variable taking a category value).
7.1. Simulation study with exponential distribution. We simulated 400 datasets of length n = 10 5 for which the number of segments K was drawn from a Poisson distribution with meanK = 50, and the K − 1 change-points were sampled uniformly on {2, . . . , n − 1} subject to the constraint that segments had to be of length at least 10. We considered 4 sets of values for the rate parameters. In all scenarios, odd segments had a rate of 0.01 while the rate on even segments was chosen randomly with probability 0.4, 0. In this framework, the segmentation was performed using the pruned dynamic programming algorithm [17,18] from which we obtained the optimal segmentation for every K up to Kmax = 200. The number of segments was then obtained using the penalty function proposed in (2) where the constant is calibrated using the slope heuristic (see [13]).
To assess the quality of our results, we introduce, as in [19], the quantities E(mŝ||m s ) and E(m s ||mŝ) between the partitions associated with the true and estimated distributions s andŝ, where The first quantity assesses how our estimated segmentation is able to recover the true change-points. Intuitively, the segmentation with the largest number of segments will have the greatest chance of yielding a small value of E(mŝ||m s ). On the contrary, the second quantity judges how relevant the proposed change-points are compared to the true partition: a segmentation with too many segments will necessarily have change-points far from the true ones. Note that the Hausdorff distance can then be recovered as sup{E(mŝ||m s ), E(m s ||mŝ)}.
The performance of our procedure is finally assessed via: • The difference between the true number of segments and the estimated, ∆ = K −K; • E(mŝ||m s ) and E(m s ||mŝ), between our estimator and the true segmentation; • The Kullback distance between true and estimated distributions, namely t To allow a fair assessment of our estimator, the Haussdorff, Kullback and Hellinger criteria are also computed for the optimal segmentation for the true number of segments, which we will denotem K , and the resulting estimated distributionŝm K .
Our method leads to a tendency to under-estimate the number of segments, as is shown in Figure 2 a). As in classical studies of model selection for segmentation, our estimator tends to under-estimate the number of segments in particular when the scenario becomes more difficult to segment. Indeed, the number of segments is reduced in order to avoid false detection. This phenomenon is illustrated in Figure  2 b) as our estimator (represented through the blue boxplots) yields high values of E(mŝ||m s ) due to the missed segments (subfigure b 1 ) but low values of E(m s ||mŝ) as the segments we propose tend to correspond to true segments (subfigure b 2 ). On the contrary, the segmentationŝm K corresponding to the true number of segments (yellow boxplots) has lower values of E(mŝ||m s ) as it has more change-points thus lower distances to the missed ones, but higher values of E(m s ||mŝ) as some of its segments are spurious. On a particular example, presented in Figure 1 by comparing the red line (true segmentation) to the blue one (corresponding to our estimator), and more so on the bottom subfigure illustrating a simulation from the fourth group, we observe that our method tends to fail to recover small segments with rate very close to surrounding ones, whereas the optimal segmentation (as represented by the yellow line) still fails to recover those small segments, thus proposing additional ones, typically very short (average length=9) with very different rate values. Finally, Figures 2 c) and d) show that in terms of Kullback-Leibler and Hellinger distances, our estimator performs at least as well than the one with the true number of segments, and significantly better when the profiles become more difficult to segment.
Evaluation of the performance criteria in the four simulation study frameworks. a) difference between the true number of segments and the estimated. Our method tends to underestimate the number of segments, with performance degradating as the segmentation difficulty increases. b 1 ) boxplot of the values of E(mŝ||m s ) and b 2 ) E(m s ||mŝ) over the hundred simulations in each framework. In each case, left boxplots (blue) assess our estimator, right boxplots (orange) assess the estimator corresponding to the true number of segments. c 1 ) and c 2 ) Boxplot of the Kullback-Leibler distance to the true distribution for the estimated distribution (blue), and the optimal distribution in K segments (orange) in each simulation framework. In c 2 ), onlyŝK is shown. d) Same as c 1 ) but for the Hellinger distance.
7.2. Application of a DNA sequence. The objective of this application is to find regions of a DNA sequence which are homogeneous in terms of base composition, that is which present a stability in the frequencies of the four nucleotide letters. These regions are thought to correspond to areas of the genome which are biologically significant. To this end, we apply our procedure modeling the data with categorical variables with d = 3 (see Section 6 for the model).
Here we consider the bacteriophage Lambda genome with length n = 48502 base pairs which is a parasite of the intestinal bacterium Escherichia coli. This genome has been used for the comparison of segmentation methods (see [20] and references therein) such as HMM ( [21], [22]) or penalized quasi-likelihood [23]. The data and its annotation are publicly available from the National Center for Biotechnology Information (NCBI) pages at the url adress http://www.ncbi.nlm.nih.gov/. From a computational point of view, the large size of the Lambda genome hampers the direct use of the classical Dynamic Programming (DP) algorithm. Here we propose a hybrid algorithm that consists in first selecting a small number of relevant change-points using the CART algorithm [24], and then using dynamic programming on this set of candidate change-points. As in the simulation study, the penalty constant is calibrated using the slope heuristic (see [13]). Four change-points (i.e. five segments) are selected by our criterion at positions 22546, 27829, 38004 and 46528. The associated regions are characterized by different base composition as shown in Figure 3 which represents the estimated probabilities of each base for the obtained segmentation.
These change-points are very close (and even on some occasion the precise same) as the one obtained in [23], which concluded to 3 more change-points. This reference also supposes bases to be independent and uses a penalized contrast procedure to perform the segmentation, and is in this sense the closest approach to ours. The segments we identify reflect changes in transcription direction. Indeed, this direction is forward up to base 20855, it then switches to reverse from base 22686 to 37940, switches back to forward between 38041 to 46427 and finally reverse again from 46459 to the end. Note that a refinement has been obtained when assuming a dependence relationship between bases (see [21] and [22]).

Conclusion
We have proposed a general approach to the selection of the number of segments in the general framework where the data can be modeled using a distribution from the exponential family. As expected, the log-partition function and its many properties are instrumental in the derivation of the bounds and the obtention of the oracle inequality.
The main drawback of our approach is that it can only be applied to distributions for which the sufficient statistics can be guaranteed to belong to a positive set. This is mainly due to the use of the chi-square statistic which was initially defined for the analysis of count (and thus positive) data. It is very likely that our result could be extended to overcome this issue by modifying decomposition 5 and controlling the fluctuation of the statistics around their expectation using different concentration inequalities. However, the main goal of this work, beside providing a general penalty function for model selection, was to underline the role of the log-partition function in our previous work. Similarly, this work should easily be extended to multivariate distributions from the exponential family. Here, using the particular case of categorical variables as a example, we have shown that the loss in tightness of the main constant is not a drastic issue by comparing the results obtained from the general approach to that from the direct one.
Moreover, we have shown in many examples through simulation and application studies (negative binomial and Poisson distributions in our previous work, exponential distribution in the simulation study and categorical distribution in the application to DNA sequences) that our approach is a powerful method to detect significant changes in the distribution of the data, which can often be related to phenomenon of interest. It outperforms existing criteria (see for instance [2]) with a behavior that is expected in classical studies of segmentation problems: it tends to under-estimate the number of segments in order to avoid false detection.