Exponential bounds for minimum contrast estimators

The paper focuses on general properties of parametric minimum contrast estimators. The quality of estimation is measured in terms of the rate function related to the contrast, which allows us to derive exponential risk bounds that are invariant with respect to the detailed probabilistic structure of the model. This approach works well for small or moderate samples and covers the case of a misspecified parametric model. Another important feature of the presented bounds is that they apply even when the parametric set is unbounded and non-compact. These bounds do not rely on entropy or covering numbers and can be easily computed. The most important statistical consequence of the exponential bounds is a concentration inequality stating that minimum contrast estimators concentrate with large probability on a level set of the rate function. In typical situations, every such set is a root-n neighborhood of the parameter of interest. We also show that the obtained bounds can be used to bound the estimation risk and to construct confidence sets for the underlying parameters. Our general results are illustrated for the case of an i.i.d. sample. We also consider several popular examples, including least absolute deviation estimation and the problem of estimating the location of a change point. What we obtain in these examples differs slightly from the usual asymptotic results presented in the statistical literature. This difference is due to the unboundedness of the parameter set and to possible model misspecification.


Introduction
One of the most fundamental ideas in statistics is to describe an unknown distribution IP of the observed data Y ∈ IR^n with the help of a simple parametric family (IP_θ, θ ∈ Θ), where Θ is a subset of a finite dimensional space, say IR^p. In this situation, the statistical model is characterized by the value of the parameter θ ∈ Θ, and statistical inference about IP reduces to recovering θ. The standard likelihood approach suggests estimating θ by maximizing the corresponding likelihood function. The maximum likelihood estimator can be generalized in several ways, resulting in the so-called minimum contrast and M-estimators; see Huber (1967) and Huber (1981). The main idea behind this generalization is to estimate the underlying parameter θ by minimizing over Θ a contrast function −L(Y, θ):

θ̂ = argmin_{θ∈Θ} {−L(Y, θ)} = argmax_{θ∈Θ} L(Y, θ). (1.1)

The negative sign in this notation comes from the main example which we have in mind, when L(Y, θ) is the log-likelihood or quasi log-likelihood. A natural condition on the contrast function is that its expectation under the true measure IP_{θ0} is minimized at the true parameter θ0, i.e.

θ0 = argmin_{θ∈Θ} {−IE_{θ0} L(Y, θ)}.
If L(Y, θ) is the log-likelihood, that is, if L(θ, θ0) = log (dIP_θ/dIP_{θ0})(Y), then the value −IE_{θ0} L(θ, θ0) coincides with the Kullback-Leibler divergence K(IP_{θ0}, IP_θ) between IP_{θ0} and IP_θ. It is well known that K(IP_{θ0}, IP_θ) is always non-negative and K(IP_{θ0}, IP_θ) = 0 if and only if IP_{θ0} = IP_θ. If the distribution IP does not belong to the parametric family (IP_θ, θ ∈ Θ), then the target of estimation can naturally be defined as the point of minimum of −IE L(Y, θ). We will see that this point θ0 indeed minimizes a special distance between the underlying measure IP and the measures IP_θ from the given parametric family.
The classical parametric statistical theory focuses mostly on asymptotic properties of the difference between θ̂ and the true value θ0 as the sample size n tends to infinity. There is a vast literature on this issue. We only mention the book Ibragimov and Khas'minskij (1981), which provides a comprehensive study of asymptotic properties of maximum likelihood and Bayesian estimators. Typical results claim that the maximum likelihood and Bayes estimators are asymptotically optimal under certain regularity conditions. Large deviation results about minimum contrast estimators can be found in Jensen and Wood (1998) and Sieders and Dzhaparidze (1987), while subtle small sample size properties of these estimators are presented in Field (1982) and Field and Ronchetti (1990).
Another stream of the literature considers minimum contrast estimators in a general i.i.d. situation, when the parameter set Θ is a subset of some functional space. We mention the papers Van de Geer (1993), Birgé and Massart (1993), Birgé and Massart (1998), Birgé (2006) and references therein. These studies mostly focused on the concentration properties of the maximum max_θ L(Y, θ) rather than on the properties of the estimator θ̂, the point of maximum of L(Y, θ). The established results are based on deep probabilistic facts from empirical process theory; see e.g. van der Vaart and Wellner (1996). In this paper we also focus on the properties of the maximum of L(Y, θ) over θ ∈ Θ. However, we do not assume any particular structure of the contrast. Our basic result claims that if for every θ ∈ Θ the difference L(Y, θ) − L(Y, θ0) has exponential moments, then under rather general and mild conditions, the maximum max_θ {L(Y, θ) − L(Y, θ0)} has similar exponential moments. In what follows, to keep notation shorter, we omit the argument Y in the contrast function L(Y, θ), writing L(θ) instead of L(Y, θ). However, one has to keep in mind that L(θ) is a random field that depends on the observed data Y. We also introduce the notion of a Gaussian contrast: with M(θ, θ′) = −IEL(θ, θ′) and D²(θ, θ′) = Var L(θ, θ′), the contrast is Gaussian if the random variable L(θ, θ′) is normal N(−M(θ, θ′), D²(θ, θ′)). Moreover, M(µ, θ, θ0) = −log IE exp{µL(θ, θ0)} = µM(θ, θ0) − µ²D²(θ, θ0)/2, and the values µ*(θ), M*(θ, θ0) defined in (1.3)-(1.4) can be easily computed:

µ*(θ) = M(θ, θ0)/D²(θ, θ0),  M*(θ, θ0) = M²(θ, θ0)/{2D²(θ, θ0)}.

The formula can be further simplified if L(θ) is a Gaussian log-likelihood.
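The Gaussian computation above is a one-line calculus exercise. As a quick sanity check (a sketch using sympy, treating M and D as fixed positive constants rather than functions of θ), one can maximize µM − µ²D²/2 over µ symbolically:

```python
import sympy as sp

mu, M, D = sp.symbols('mu M D', positive=True)

# Gaussian contrast: M(mu, theta, theta0) = mu*M - mu^2*D^2/2
rate = mu * M - mu**2 * D**2 / 2

# Maximize over mu: the optimizer mu* and the optimal value M*
mu_star = sp.solve(sp.diff(rate, mu), mu)[0]
M_star = sp.simplify(rate.subs(mu, mu_star))

print(mu_star)  # M/D**2
print(M_star)   # M**2/(2*D**2)
```

This reproduces µ*(θ) = M/D² and M*(θ, θ0) = M²/(2D²) stated above.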
Finally we consider a classical linear Gaussian regression.
Example 1.3. [Linear Gaussian model] Consider the linear model Y = Xθ0 + σε, where Y ∈ IR^n, θ0 ∈ IR^p, X is a known n × p matrix, and ε is a white Gaussian noise in IR^n, i.e. the ε_i are i.i.d. standard normal. Then

L(θ) = −‖Y − Xθ‖²_n/(2σ²) + R,

where ‖·‖_n denotes the standard Euclidean norm in IR^n and R does not depend on θ. Obviously

M(θ, θ0) = ‖X(θ − θ0)‖²_n/(2σ²),  D²(θ, θ0) = ‖X(θ − θ0)‖²_n/σ²,

and thus (see Example 1.2)

µ*(θ) = 1/2,  M*(θ, θ0) = ‖X(θ − θ0)‖²_n/(8σ²).

The log-likelihood ratio can be written as

L(θ, θ0) = σ^{−1}ε^⊤X(θ − θ0) − ‖X(θ − θ0)‖²_n/(2σ²).

Let k denote the rank of the matrix X^⊤X. Obviously k ≤ p, and the vectors X(θ − θ0) span a linear subspace X of IR^n of dimension k. Denote by Π the projector in IR^n onto X. Then, writing u = X(θ − θ0),

sup_θ {µ*(θ)L(θ, θ0) + M*(θ, θ0)} = sup_{u∈X} {(2σ)^{−1}ε^⊤u − ‖u‖²_n/(8σ²)} = ‖Πε‖²_n/2,

where the maximum is attained at any u ∈ IR^n such that Πu = 2σΠε. It is well known that ‖Πε‖²_n follows the χ²-distribution with k degrees of freedom, and IE exp(‖Πε‖²_n/2) = ∞. However, for any positive s < 1, it holds by the same argument that

sup_θ {µ*(θ)L(θ, θ0) + sM*(θ, θ0)} = ‖Πε‖²_n/{2(2 − s)},

and thus

IE exp[sup_θ {µ*(θ)L(θ, θ0) + sM*(θ, θ0)}] = {(2 − s)/(1 − s)}^{k/2}.

An important feature of this inequality is that it only involves the effective dimension k of the parameter space and does not depend on the design X, the noise level σ², the sample size n, etc. Later we show that such a behaviour of the log-likelihood is not restricted to Gaussian linear models and can be proved in a quite general statistical set-up.
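Example 1.3 is easy to verify numerically. The sketch below (assuming the closed form sup_θ {µ*(θ)L(θ, θ0) + M*(θ, θ0)} = ‖Πε‖²_n/2 from the example; the design and sample size are arbitrary illustrative choices) simulates the noise and checks that twice the maximized field behaves like a χ² variable with k degrees of freedom:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))
k = np.linalg.matrix_rank(X.T @ X)  # effective dimension, here k = p = 3

# Projector Pi onto the column space of X
Pi = X @ np.linalg.solve(X.T @ X, X.T)

# For each noise draw, the maximised field equals ||Pi eps||^2 / 2
eps = rng.standard_normal((20000, n))
proj = eps @ Pi                      # Pi is symmetric, so rows are Pi @ eps_i
sup_vals = 0.5 * np.sum(proj**2, axis=1)

# 2 * sup is chi-squared with k degrees of freedom: mean k, variance 2k
print(abs((2 * sup_vals).mean() - k) < 0.2)  # True
```

Note that the value depends on k only, not on n, p or the particular design X, as the example claims.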

Main result
The examples from Section 1.1 suggest considering, in the general situation, the maximum of the random field µ*(θ)L(θ, θ0) + sM*(θ, θ0) for s < 1. The main result of the paper shows that under some technical conditions this maximum is indeed stochastically bounded in a rather strong sense. Namely, for some ρ ∈ (0, 1),

IE exp[ρ sup_θ {µ*(θ)L(θ, θ0) + sM*(θ, θ0)}] ≤ C(ρ, s), (1.5)

where C(ρ, s) is a constant that can be easily controlled in typical examples. This result particularly yields that µ*(θ̂)L(θ̂, θ0) and M*(θ̂, θ0) have bounded exponential moments. Another corollary of this fact is that θ̂ concentrates on the sets A(z, θ0) = {θ : M*(θ, θ0) ≤ z} for sufficiently large z, in the sense that the probability IP(θ̂ ∉ A(z, θ0)) is exponentially small in z. Usually every such concentration set is a root-n vicinity of the point θ0. See Section 2.3 for precise formulations. Ibragimov and Khas'minskij (1981) stated a version of (1.5) for the i.i.d. case and used it to prove consistency of θ̂. We briefly comment on some useful features of the basic inequality (1.5). First of all, this bound is non-asymptotic and may be used even if the sample size is small or moderate. It is also applicable in the situation when the parametric modeling assumption is misspecified. Our results may be used in such cases as well, with the "true" parameter θ0 defined as the maximum point of the expected contrast: θ0 = argmax_θ IEL(θ).
Another interesting question concerns the accuracy of estimation when the parameter set Θ is not compact. The typical results in the classical parametric theory have been established for compact parameter sets, since this assumption considerably simplifies the conditions and the technical tools. There exist very few results for the case of non-compact sets; see Ibragimov and Khas'minskij (1981) for an example. Our conditions are quite mild; in particular, the parameter set can be non-compact and unbounded. Moreover, we present some examples in Section 4 illustrating that the quality of minimum contrast estimation can heavily depend on topological properties of Θ and on the behavior of the rate function M*(θ, θ0) for large θ. The corresponding accuracy of estimation can be different from the classical root-n behavior.
The paper is organized as follows. The main result is presented in Section 2. Section 2.3 presents some useful corollaries of (1.5) describing concentration properties of θ̂, some risk bounds, and confidence sets for the target parameter θ0 based on L(θ̂, θ). Section 2.4 specializes the approach to the important case of a smooth contrast; in this situation the main conditions ensuring (1.5) are substantially simplified. Section 3 illustrates how our approach applies to the classical i.i.d. case, while Section 4 presents some applications of the general exponential bound to three particular problems: estimation of the median, of the scale parameter of an exponential model, and of the location of a change point.
Although these examples have already been studied, the proposed approach reveals some new features of the classical least squares and least absolute deviation estimators in the cases when the parametric assumption is misspecified or the parameter set is not compact. In the case of median estimation the result applies even if the observations do not have a first moment. The last example in this section considers the prominent change point problem. We show in particular that when the size of the jump is completely unknown, the accuracy of estimating its location differs from the well-known parametric rate 1/n: it depends on the distance of the change point from the edge of the observation interval and involves an extra iterated-logarithm factor.

Risk bound for the minimum contrast
This section presents a general exponential bound on the minimum contrast value in a rather general set-up. Let −L(θ), θ ∈ Θ, be a random contrast function of a finite dimensional parameter θ ∈ Θ ⊂ IR^p given on some probability space (Ω, F, IP). We also assume that L(θ) is a separable random field and that IEL(θ) exists for all θ ∈ Θ. The minimum contrast estimator is defined as a minimizer of −L(θ), and the target of estimation is the value θ0 which minimizes the expectation −IEL(θ). With L(θ, θ′) def= L(θ) − L(θ′), it is clear that for any θ• ∈ Θ

θ̂ = argmax_{θ∈Θ} L(θ, θ•) and θ0 = argmax_{θ∈Θ} IEL(θ, θ•).
Our study focuses on the value of the maximum in θ of the random field L(θ, θ0):

L(θ̂, θ0) = sup_{θ∈Θ} L(θ, θ0).

By definition, L(θ̂, θ0) is a non-negative random variable.
Theorem 2.1. Assume (EG) and let Θ be a discrete set. Then for any s < 1 the bound (2.3) holds. Usually, the function M(θ, θ0) grows rapidly as θ moves away from θ0. This property is often sufficient to bound the sum on the right-hand side of (2.3) by a fixed constant.
Although Theorem 2.1 is a rather simple corollary of (2.1), the bound (2.3) yields a number of useful statistical corollaries; some of them are presented in Section 2.3. However, even in the discrete case, this bound may be too rough (see the example in Section 4.3). It is also clear that (2.3) is useless in the continuous case. The next section demonstrates how the bound (2.3) can be extended to the case of an arbitrary parameter set.

The general exponential bound
Here we aim to extend the exponential bound (2.3) from the discrete case to the case of an arbitrary finite dimensional parameter set. We apply the standard approach which evaluates the supremum over the whole parameter set Θ via a weighted sum of local maxima.
Define for any θ, θ′ ∈ Θ the centered contrast difference

ζ(θ, θ′) def= L(θ, θ′) − IEL(θ, θ′).

imsart-ejs ver. 2008/08/29 file: ejs_2009_352.tex date: January 6, 2009

Usually the local properties of the centered contrast difference ζ(θ, θ′) are controlled by the variance D²(θ, θ′) = Var ζ(θ, θ′), which defines a semi-metric on Θ; see, e.g., van der Vaart and Wellner (1996). However, in some cases it is more convenient to deal with a slightly different metric, which we denote by S(θ, θ′). This metric usually bounds the standard deviation D(θ, θ′) from above. Sections 2.4 and 3 present some typical examples of constructing such a metric. Below in this section we assume that the metric S(·, ·) is given. Define for any point θ• ∈ Θ and a radius ǫ > 0 the ball

B(ǫ, θ•) def= {θ ∈ Θ : S(θ, θ•) ≤ ǫ}.

To control the local behavior of the process L(θ) within any such ball B(ǫ, θ•), we impose the local exponential condition (EL). In fact, this condition only requires that every random increment ζ(θ, θ′) has a bounded exponential moment for some λ > 0; Lemma 5.8 from the Appendix then implies the prescribed quadratic behavior in λ for all λ below the constant fixed in (EL). For a fixed θ• ∈ Θ and ǫ′ ≤ ǫ, we denote by N(ǫ′, ǫ, θ•) the local covering number, defined as the minimal number of balls B(ǫ′, ·) required to cover the ball B(ǫ, θ•). With this covering number we associate the local entropy Q(ǫ, θ•), the logarithm of the corresponding covering number. We begin with a local result which bounds the maximum of the process L(θ) over a local ball B(ǫ, θ•).
The next theorem is the global bound which generalizes the upper bound from Theorem 2.1.
Theorem 2.3. Assume (EG) and (EL) for some λ, ν0, ǫ, and let π(·) be a σ-finite measure on Θ satisfying (2.4). As in Theorem 2.1, proper growth conditions on the function M(θ, θ0) ensure that the integral H_ǫ(ρ, s) in (2.6) is bounded by a fixed constant.

Some corollaries
This section demonstrates how Theorems 2.1-2.3 can be used in the statistical analysis of the minimum contrast estimator θ̂ = argmax_{θ∈Θ} L(θ). We show that probabilistic properties of this estimator may be easily derived from the following inequality: for prescribed ρ, s < 1,

IE exp[ρ{µ(θ̂)L(θ̂, θ0) + sM(θ̂, θ0)}] ≤ Q(ρ, s), (2.7)

which obviously follows from Theorem 2.3 and the definition (2.2) of Q(ρ, s).

A risk bound for the "natural" loss
A first corollary of Theorem 2.1 presents exponential bounds separately for the minimum contrast value L(θ̂, θ0) and for the "natural" loss M(θ̂, θ0).

Concentration properties of the estimator θ̂
The assertion (2.7) can be used to establish the concentration property of the estimator θ̂. Consider the sets

A(r, θ0) def= {θ ∈ Θ : M(θ, θ0) ≤ r}

for some r > 0. The next result shows that the estimator θ̂ leaves the set A(r, θ0) with an exponentially small probability of order exp(−ρsr).
Corollary 2.5. For any ρ, s < 1 and any r > 0, it holds

IP(θ̂ ∉ A(r, θ0)) ≤ Q(ρ, s) exp(−ρsr).
In typical situations, M(θ, θ0) is proportional to the sample size n, and each set A(r, θ0) corresponds to a root-n neighborhood of the point θ0. See Section 3 for applications related to the i.i.d. case.

Confidence sets based on L(θ̂, θ)
Next we discuss how the exponential bound (2.7) can be used for constructing confidence sets for the target θ0 based on the optimized contrast L(θ̂, θ). The inequality (2.8) claims that L(θ̂, θ0) is stochastically bounded. This justifies the following construction of confidence sets:

E(z) def= {θ ∈ Θ : L(θ̂, θ) ≤ z}.

To evaluate the coverage probability, consider first the case when µ(θ) ≥ µ* > 0 uniformly in θ ∈ Θ. The next result claims that E(z) fails to cover the true value θ0 with a probability which decreases exponentially in z.
Corollary 2.6. Assume that µ(θ) ≥ µ* > 0. Then for any z > 0 and any ρ < 1,

IP(θ0 ∉ E(z)) ≤ Q(ρ) exp(−ρµ*z).

Proof. The bound (2.8) implies the assertion.

In the case when the function µ(θ) cannot be uniformly bounded from below by a positive constant, we assume that such a bound exists on every set A(r, θ0). Denote

µ*(r) def= inf_{θ∈A(r,θ0)} µ(θ).

Then combining Corollaries 2.5-2.6 yields

Corollary 2.7. For any z > 0, any ρ, s < 1 and any r > 0,

IP(θ0 ∉ E(z)) ≤ Q(ρ, s){exp(−ρsr) + exp(−ρµ*(r)z)}.

A reasonable choice of r in this bound is given by the balance relation µ*(r)z = sr. With this choice the bound of Corollary 2.6 may be replaced by the corresponding balanced bound.
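To make the construction concrete, here is a small Monte Carlo sketch using the i.i.d. exponential model of Section 4 as an illustration (the level z and the sample size are arbitrary choices, not prescriptions from the text): the set E(z) collects all θ whose contrast is within z of the optimized contrast, so θ0 is covered unless L(θ̂, θ0) > z.

```python
import numpy as np

rng = np.random.default_rng(4)
theta0, n, z, trials = 1.5, 100, 3.0, 2000

covered = 0
for _ in range(trials):
    Y = rng.exponential(scale=1.0 / theta0, size=n)
    # Exponential log-likelihood contrast and its maximiser
    L = lambda t: n * np.log(t) - t * Y.sum()
    theta_hat = 1.0 / Y.mean()
    # theta0 lies in E(z) iff L(theta_hat) - L(theta0) <= z
    covered += L(theta_hat) - L(theta0) <= z

print(covered / trials)  # high coverage for moderate z
```

Raising z enlarges E(z) and pushes the coverage toward one, at the price of a wider confidence set, which is exactly the trade-off the exponential bound quantifies.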

Exponential bounds for smooth contrasts
This section deals with the case when the contrast L(θ) is a smooth function of θ. In this situation, the local condition (EL) is easy to verify. Moreover, the local balls B(ǫ, θ) nearly coincide with the usual Euclidean ellipsoids, and the local entropy can be bounded by an absolute constant depending only on the dimensionality p of the parameter space Θ. Suppose that Θ is a convex set in IR^p and that the function L(θ), along with the scaling factor µ(θ), is differentiable w.r.t. θ. Below, the symbol ∇ stands for the gradient w.r.t. θ. Define H(λ, γ, θ) for every unit vector γ ∈ IR^p. To simplify the presentation, here and in what follows we assume that every matrix V(θ) is non-degenerate. It is easy to see that H(0, γ, θ) = 0 and ∂H(0, γ, θ)/∂λ = 0; therefore, for small λ, H(λ, γ, θ) ≈ 2λ². Below we assume that this property is fulfilled uniformly in θ ∈ Θ and in γ over the unit sphere S_p in IR^p.
(ED) There exists λ > 0 such that (2.10) holds for some ν0. Now we define the metric S(θ, θ′) and, for every θ• ∈ Θ and ǫ > 0, the ellipsoid B′(ǫ, θ•). In what follows, we assume that the radius ǫ can be chosen in such a way that the functions V(θ) and M(θ, θ0) have bounded fluctuations within the ball B′(ǫ, θ•) for every θ• ∈ Θ. More precisely, for a given function f(·), define its magnitude A_ǫ f(·) over B′(ǫ, θ•); the magnitude of the matrix function V(θ) over B′(ǫ, θ•) is defined similarly. Notice that under the condition A_ǫ V(·) ≤ ν1, the topology induced by the metric S(·, ·) is (locally) equivalent to the Euclidean topology, the set B(ǫ, θ•) can be well approximated by the ellipsoid B′(ǫ, θ•), and computing the local entropy Q(ǫ, ·) reduces to the Euclidean case; see Lemma 5.4 for more detail. Now we are ready to state an exponential bound for the contrast process in the smooth case.
Theorem 2.8. Assume that (EG) and (ED) hold true with some ν0 and λ > 0. Suppose that there is a constant ǫ > 0 such that ǫρ/(1 − ρ) ≤ λ and that, for a fixed ν1 ≥ 1 and each θ ∈ Θ, the conditions (2.12) hold, where ω_p is the Lebesgue measure of the unit ball in IR^p. Then the exponential bound holds. Remark 2.1. The conditions of this theorem are very mild: (EG) only requires that L(θ, θ0) has exponential moments, and (ED) requires a similar condition for the centered and normalized gradient ∇L(θ). The inequalities (2.12) are equivalent to uniform continuity of the function V(θ).
Remark 2.2. The presented exponential bound requires that the value H ǫ (ρ, s) is finite. Fortunately it can be easily checked in typical situations. A typical example is given in Section 3 which deals with the i.i.d. case.

A risk bound for θ̂ − θ0
Our main result controls the risk of the minimum contrast estimator in terms of the rate function M(θ, θ0). In the case of a smooth contrast, this result may be used to bound the classical estimation loss θ̂ − θ0. The idea is to bound the rate function M(θ, θ0) from below by a quadratic function in a vicinity of the point θ0 and then to make use of the concentration property of θ̂. Note that for any µ, it obviously holds M(µ, θ0, θ0) = 0, and simple algebra shows that the gradient of M(µ, θ, θ0) vanishes at θ = θ0. So M(µ, θ, θ0) can be bounded from below and from above in a vicinity of θ0 by second-order Taylor expansions. The same behavior can be expected for the optimized rate function M(θ, θ0). This argument and the concentration property from Corollary 2.5 lead to the following result.

Corollary 2.9. Suppose that the conditions of Theorem 2.8 are satisfied and that, for some r > 0, the function M(θ, θ0) admits a quadratic lower bound on A(r, θ0) with some positive definite matrix V0. Then for any ρ, s < 1 and z > 0 the corresponding risk bound holds.

Proof. The result follows directly from Corollary 2.7.
In the case of i.i.d. observations, the function M(µ, θ, θ0), and hence the matrix V0, is proportional to the sample size n, and the result of Corollary 2.9 automatically yields the root-n consistency of θ̂; see Section 3 for more details.
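The quadratic behavior of the rate function near θ0 can be checked symbolically. The sketch below uses sympy and takes the Kullback-Leibler rate K(θ0, θ) = θ/θ0 − 1 − log(θ/θ0) of the exponential model from Section 4 as a stand-in for the rate function; both the vanishing of the value and gradient at θ0 and the quadratic second-order expansion come out exactly:

```python
import sympy as sp

t, t0 = sp.symbols('theta theta0', positive=True)

# Per-observation Kullback-Leibler rate of the exponential model
K = t / t0 - 1 - sp.log(t / t0)

# K vanishes at theta0 together with its gradient ...
assert K.subs(t, t0) == 0
assert sp.diff(K, t).subs(t, t0) == 0

# ... so near theta0 it is a quadratic with curvature 1/theta0^2
quad = sp.series(K, t, t0, 3).removeO()
print(sp.simplify(quad - (t - t0)**2 / (2 * t0**2)))  # 0
```

Multiplying by n (the i.i.d. case), the level set {nK(θ0, θ) ≤ z} is an interval of width of order sqrt(z/n) around θ0, which is the root-n neighborhood mentioned above.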

Quasi MLE for i.i.d. data
Let Y = (Y 1 , . . . , Y n ) be an i.i.d. sample from a distribution P . By IP we denote the joint distribution of Y . Let also P = (P θ , θ ∈ Θ ⊂ IR p ) be a parametric family. In contrast to the standard parametric hypothesis which assumes that P ∈ P , in this section, we focus on the quality of estimation in the case when the underlying measure P does not necessarily belong to the parametric family P . We will see that in this case the maximum likelihood method estimates the point θ 0 , which minimizes some special distance between P and P θ over θ ∈ Θ .
The integral in H_ǫ(ρ, s) can be easily bounded in typical situations. The result presented below involves some conditions on the marginal rate function m(θ, θ0). Namely, it is assumed that this function is bounded from below by a quadratic polynomial in a vicinity A1(r, θ0) def= {θ : m(θ, θ0) ≤ r} of the point θ0, for some fixed r > 0, and that it grows at least logarithmically in the norm ‖θ − θ0‖ outside this neighborhood.
In particular, it is shown in Section 5 that the required bound holds for n sufficiently large. Let the growth condition hold for some β > 0, and let n be sufficiently large. Then, for some C depending only on a_r, ν0, ν1 and C_r(β), the integral H_ǫ(ρ, s) is bounded. This bound together with Corollary 2.9 yields root-n consistency of θ̂ in a rather strong sense.

Examples
This section illustrates how the exponential bounds can be applied to some particular situations. To simplify technical details, we do not try to cover the most general case. Rather we aim to show that our basic conditions can be easily verified in typical situations.

Estimation in the exponential model
The exponential model assumes that the observations Y = (Y1, …, Yn) are i.i.d. from the exponential law P_θ with an unknown parameter θ ∈ IR+: P_θ(Yi > y) = exp(−θy). In this example we focus on the classical parametric set-up, assuming that the underlying measure IP coincides with the product of IP_{θ0} for some θ0 ∈ IR+. The corresponding maximum likelihood contrast is given by

L(θ) = n log θ − θ(Y1 + … + Yn),

so that −IE_{θ0}L(θ, θ0) = nK(θ0, θ), where K(θ, θ′) = θ′/θ − 1 − log(θ′/θ) is the Kullback-Leibler divergence between the exponential laws P_θ and P_θ′.
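A minimal numerical sketch of this example (the closed form θ̂ = 1/mean(Y) for the maximizer is the standard exponential MLE; the grid and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
theta0, n = 2.0, 400
Y = rng.exponential(scale=1.0 / theta0, size=n)

# Contrast L(theta) = n*log(theta) - theta*sum(Y), maximised at 1/mean(Y)
L = lambda t: n * np.log(t) - t * Y.sum()
theta_hat = 1.0 / Y.mean()

grid = np.linspace(0.5, 5.0, 200)
assert all(L(theta_hat) >= L(t) for t in grid)

# Rate function: Kullback-Leibler divergence between exponential laws
K = lambda t, t2: t2 / t - 1.0 - np.log(t2 / t)
assert all(K(theta0, t) >= 0 for t in grid)
assert K(theta0, theta0) == 0.0

print(abs(theta_hat - theta0) < 0.5)  # root-n accuracy: sd ~ theta0/sqrt(n)
```

The contrast is concave in θ, so the grid check simply confirms the closed-form maximizer, while K is non-negative and vanishes only at the true parameter, as a rate function should.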

LAD contrast and median estimation
Median or, more generally, quantile estimation is known to be more robust and stable against outliers, and it is frequently used in econometric studies; see Koenker (2005), Koenker and Xiao (2006). Suppose we are given a sample Y = (Y1, …, Yn). In the problem of median estimation, these random variables are assumed i.i.d. and we are interested in estimating the median θ0, which is a root of the equation

IP(Y1 ≤ θ) = 1/2.

Alternatively, the median minimizes the value IE|Y1 − θ|, provided that the expectation of |Y1| is finite. This remark leads to the natural estimator θ̂ of the median as the minimizer of the contrast −L(θ) = Σ_{i=1}^n |Yi − θ|:

θ̂ = argmin_θ Σ_{i=1}^n |Yi − θ|.

If the Yi's are i.i.d. with the Laplace density exp(−|y − θ0|)/2, then L(θ) coincides (up to a constant) with the log-likelihood. In the general case, L(θ) can be treated as a quasi log-likelihood contrast. Later we also briefly comment on the case when the Yi's are not i.i.d.
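A small numerical sketch of the LAD contrast. It uses a Cauchy sample, which has no first moment, to echo the remark that the result applies without moment assumptions; the grid search is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.standard_cauchy(501)  # heavy tails: E|Y_1| is infinite

# LAD contrast: -L(theta) = sum_i |Y_i - theta|
neg_L = lambda theta: np.abs(Y - theta).sum()

grid = np.linspace(-3.0, 3.0, 4001)
theta_hat = grid[np.argmin([neg_L(t) for t in grid])]

# For an odd sample size the minimiser is the sample median
print(abs(theta_hat - np.median(Y)) <= 0.01)  # True up to grid resolution
```

Since the contrast is convex and piecewise linear, the grid minimizer sits next to the middle order statistic, even though the sample mean of Cauchy data is useless as a location estimator.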
The case of independent but not identically distributed observations can again be reduced to the considered case by using the averaged measure P̄ = n^{−1} Σ_{i=1}^n P_i and defining the point θ0 as a root of the equation P̄((−∞, θ0]) = 1/2.

Estimation of the location of a change point
Suppose the observations Y = (Y1, …, Yn) follow the change point model

Yi = A·1(i ≤ θ0) + σξi, i = 1, …, n, (4.3)

where the ξi form a standard white Gaussian noise. Our goal is to estimate the change point location θ0 ∈ Θ = {1, …, n − 1}. The obtained results can be easily extended to the case of non-Gaussian errors under some exponential moment conditions.
We begin with the case when the amplitude A is known. To estimate θ0, we use the maximum likelihood estimator θ̂_A, where the maximum likelihood contrast L_A(θ) is, up to a constant, the Gaussian log-likelihood of the model (4.3). Note that L_A(θ, θ0) is a Gaussian random variable for every θ with

M(θ, θ0) = A²|θ − θ0|/(2σ²),  D²(θ, θ0) = A²|θ − θ0|/σ².

This yields M(µ, θ, θ0) for any µ ≥ 0, and the corresponding values µ*(θ), M*(θ, θ0) can be easily computed:

µ*(θ) = 1/2,  M*(θ, θ0) = A²|θ − θ0|/(8σ²).

Therefore, for ρ < 1, Theorem 2.1 implies an exponential bound with C(ρ) = exp{−ρ(1 − ρ)A²/(8σ²)}. By Lemma 5.7, IE|θ̂_A − θ0|^r ≤ C1(r)(σ²/A²)^r with some constant C1(r). Now we switch to the case when A > 0 is an unknown parameter. In this case we cannot use the contrast L_A(θ) because it strongly depends on A. To find a reasonable contrast, one can use the maximum likelihood principle. Considering A as a nuisance parameter and maximizing L_A(θ) w.r.t. A ≥ 0 leads to a profile estimator, where [x]+ = max(x, 0). In what follows we deal with a slightly modified version of this estimator which again corresponds to a Gaussian contrast L(θ). By the model equation (4.3), this contrast can be represented in a form whose drift M(θ, θ0) = −IEL(θ, θ0) and variance D²(θ, θ0) satisfy M(θ, θ0) = aD²(θ, θ0)/2; also D²(θ, θ0) ≤ 2 for all θ. As L(θ) is a Gaussian contrast, µ*(θ) and M*(θ, θ0) are given by Example 1.1. Note that for every θ ∈ Θ, the value M*(θ, θ0) is bounded by a²/8 = A²θ0/(8σ²). So this example is quite special in the sense that the Kullback-Leibler divergence between the measures IP_{θ0} and IP_θ does not grow to infinity with θ. We will see that this fact results in an extra loglog-factor in the bound for the minimum contrast. For given ǫ > 0 and θ• ∈ Θ, the local ball B(ǫ, θ•) = {θ : D(θ, θ•) ≤ ǫ} can be transformed into a usual symmetric interval around log θ• by using the parameter log θ instead of θ. This immediately implies that the local entropy Q(ǫ, θ•) is bounded by Q = 1 for all θ• ∈ Θ.
Let the measure π(·) assign mass 1 to each point θ = 1, …, n. Then π(B(ǫ, θ•)) is equal to the number Π_ǫ(θ•) of points θ in B(ǫ, θ•), and it obviously holds Π_ǫ(θ•) ≈ K(ǫ)θ• with K(ǫ) = (1 − ǫ²/2)^{−2} − (1 − ǫ²/2)² ≥ ǫ² for ǫ ≤ 1, so that (2.4) is fulfilled. Fix ǫ² = 1/2. The trivial lower bound M(θ, θ0) ≥ 0 yields for H_ǫ(ρ, s) from (2.5), for any s ≤ 1,

H_ǫ(ρ, s) ≤ log(C1 log n)

for some C1 > 0. This yields by Theorem 2.3 and its Corollary 2.4 that

IE exp{ρa²d(θ̂, θ0)/8} ≤ C2 log n. (4.4)

Combining this with Lemma 5.7 yields the corresponding risk bound. The extra loglog-factor in this bound is due to the unbounded parameter set. In the "classical" situation, when the size A of the jump is bounded away from zero and infinity and the true "relative" location θ0/n is bounded away from the edge 0, similar calculations (not presented here) lead to a bound

IE exp{C1ρ²A²|θ̂ − θ0|} ≤ C2,

which does not involve any extra log-term; see e.g. Csörgő and Horváth (1997) and references therein for asymptotic versions of this result. It is also interesting to compare this result with the accuracy of the maximum likelihood method in the case where the magnitude A of the jump is known. One can see that there is a price for adaptation to the nuisance parameter A, in the form of an extra loglog-factor. Another observation is that the accuracy of estimation strongly depends on the true location θ0, more precisely, on the value a² = A²θ0/σ². In the "classical" situation this value is of order n, leading to an accuracy of order n^{−1} log log(n). If the value a² is of smaller order than n, then the accuracy deteriorates by the same factor. In particular, if A²θ0/σ² is of order one, then even consistency of θ̂ cannot be claimed.
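A simulation sketch for the known-amplitude case. The explicit form Yi = A·1(i ≤ θ0) + σξi used below is our reading of the model (4.3), and all numeric settings are illustrative; the least-squares contrast is the Gaussian log-likelihood up to θ-free constants:

```python
import numpy as np

rng = np.random.default_rng(3)
n, A, sigma, theta0 = 200, 2.0, 1.0, 80

# Change point model: mean A up to theta0, zero afterwards
i = np.arange(1, n + 1)
Y = A * (i <= theta0) + sigma * rng.standard_normal(n)

def L_A(theta):
    # Known-amplitude Gaussian log-likelihood, up to theta-free constants
    return -np.sum((Y - A * (i <= theta)) ** 2) / (2 * sigma**2)

thetas = np.arange(1, n)
theta_hat = thetas[np.argmax([L_A(t) for t in thetas])]

# The error is of order sigma^2/A^2 and does not shrink with n (cf. Lemma 5.7)
print(abs(theta_hat - theta0))
```

Re-running with a larger n but the same A and σ leaves the typical error |θ̂_A − θ0| of the same size, which is the non-standard, sample-size-free accuracy discussed above.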

Proofs
This section collects proofs of the main theorems and some auxiliary facts.

Proof of Theorem 2.3
Theorem 2.2 implies a local bound for the process µ(θ)L(θ, θ0) + M(θ, θ0) over any ball B(ǫ, θ•). To derive a global bound we apply the following general fact. Lemma 5.2. Let f(θ) be a nonnegative function on Θ ⊂ IR^p, and for every point θ ∈ Θ let a vicinity U(θ) be fixed such that θ′ ∈ U(θ) implies θ ∈ U(θ′). Let also the measure π(U(θ)) of the set U(θ) fulfill the required lower bound for every θ. We are going to apply Lemma 5.2 with a suitable choice of f(θ) and U(θ). In view of the definition M_ǫ(θ•, θ0) = min_{θ∈B(ǫ,θ•)} M(θ, θ0), the global bound follows from the local bound of Theorem 5.1. Below, by C_p we denote a generic constant (not necessarily the same at each occurrence) which depends only on the dimensionality p. First we show that the differentiability condition (ED) implies the local moment condition (EL).
• sup_{θ∈Θ} Q(ǫ, θ) ≤ Q_p + p log ν1, where Q_p is the entropy of the unit ball in IR^p in the Euclidean topology.
Now we are ready to proceed with the proof of Theorem 2.8. We make use of the following technical result, which helps to bound the global supremum of a random function via an integral of local maxima.

Proof of Theorem 3.2
We start with some technical lemmas.

Auxiliary facts
Lemma 5.6. For any random variables ξ_k and any nonnegative λ_k such that Λ = Σ_k λ_k ≤ 1,

log IE exp(Σ_k λ_k ξ_k) ≤ Σ_k λ_k log IE e^{ξ_k}.