Auxiliary information: the raking-ratio empirical process

We study the empirical measure associated with a sample of size $n$ and modified by $N$ iterations of the raking-ratio method. This empirical measure is adjusted to match the true probability of the sets in a finite partition that changes at each step. We establish asymptotic properties of the raking-ratio empirical process indexed by functions as $n\rightarrow +\infty$, for $N$ fixed. We study nonasymptotic properties by using a Gaussian approximation which yields uniform Berry-Esseen type bounds depending on $n, N$ and provides estimates of the uniform quadratic risk reduction. A closed-form expression of the limiting covariance matrices is derived as $N\rightarrow +\infty$. In the two-way contingency table case the limiting process has a simple explicit formula.


The raking-ratio method
In survey analysis, statistics, economics and computer science, the raking-ratio iterative procedure aims to exploit the knowledge of one or several marginals of a discrete multivariate distribution to fit the data after sampling. Despite many papers from the methodological and algorithmic viewpoints, and chapters in classical textbooks for statisticians, economists or engineers, no probabilistic study is available that takes into account that the entries of the algorithm are random and the initial discrete measure is empirical. We intend to fill this gap. Let us first describe the algorithm, usually considered with deterministic entries, then recall the few known results and state the open question to be addressed.
The raking-ratio algorithm. A sample is drawn from a population $P$ for which $k \ge 2$ marginal finite discrete distributions are explicitly known. Initially, each data point has a weight $1/n$. The ratio step of the algorithm consists in computing new weights in such a way that the modified empirical joint distribution has the currently desired marginal. The raking step consists in iterating the correction according to another known marginal law, changing again all the weights. The $k$ margin constraints are usually treated in a periodic order, only one being fulfilled at a time. The raking-ratio method stops after $N$ iterations with the implicit hope that the previous constraints are still almost satisfied. See Section .1 for an elementary numerical example with $k = 2$ and Section 1.4 for the notation and the mathematical definition of the algorithm.
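As an illustration, the alternating ratio steps can be sketched for $k = 2$ known marginals of a two-way table; the joint frequencies, margins and iteration count below are hypothetical, not taken from the paper.

```python
import numpy as np

def rake(table, row_marg, col_marg, n_iter):
    """Alternate the 'ratio' step on rows and columns: each step rescales
    the current joint frequencies so that one margin is fitted exactly."""
    T = table.astype(float).copy()
    for _ in range(n_iter):
        T *= (row_marg / T.sum(axis=1))[:, None]   # fit the row margin
        T *= (col_marg / T.sum(axis=0))[None, :]   # fit the column margin
    return T

# hypothetical empirical joint frequencies (weights sum to 1)
T0 = np.array([[0.20, 0.10],
               [0.30, 0.40]])
T = rake(T0, row_marg=np.array([0.5, 0.5]),
         col_marg=np.array([0.4, 0.6]), n_iter=20)
```

After the last (column) step the column margin is matched exactly while the row margin is only almost satisfied, which is the "implicit hope" mentioned above; the residual row error decays geometrically in the number of iterations.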
The limit. This algorithm was called iterative proportions by Deming and Stephan [10] who first introduced it. They showed that the $k$ margins converge to the desired ones as $N \to +\infty$. They even claimed that if the frequencies of a multiway contingency table are raked periodically as $N \to +\infty$ then they converge to the frequencies minimizing the chi-square distance to the initial frequencies, under the $k$ margin constraints. Two years later, Stephan [25] observed that this claim is wrong and modified the algorithm accordingly to achieve the chi-square distance minimization. Lewis [18] and Brown [7] studied the case of Bernoulli marginals from the Shannon entropy minimization viewpoint. When $k = 2$ a two-way contingency table can be viewed as a matrix. Sinkhorn [22,23] proved that a unique doubly stochastic matrix can be obtained from each positive square matrix by alternately normalizing its rows and its columns, which shows that the algorithm converges in this special case. Finally Ireland and Kullback [16] generalized the previous arguments to rigorously justify that the raking-ratio converges to the unique projection of the empirical measure, in Shannon-Kullback-Leibler relative entropy, on the set of discrete distributions satisfying the $k$ margin constraints. From a numerical viewpoint, the rate of convergence of the algorithm is geometric, see Franklin and Lorentz [15].
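The Ireland-Kullback characterization can be checked numerically: every ratio step multiplies rows or columns by scalars, so the limit keeps the form $r_i c_j T^0_{ij}$, which is the stationarity condition of the relative-entropy projection onto the two margin constraints. A small sketch with hypothetical frequencies:

```python
import numpy as np

# Rake a hypothetical 2x2 table to convergence, then verify that the
# limit T satisfies T_ij = r_i * c_j * T0_ij, i.e. T / T0 has rank one,
# the form taken by the Shannon-Kullback-Leibler projection of T0.
T0 = np.array([[0.20, 0.10],
               [0.30, 0.40]])
row_marg, col_marg = np.array([0.5, 0.5]), np.array([0.4, 0.6])
T = T0.copy()
for _ in range(200):
    T *= (row_marg / T.sum(axis=1))[:, None]   # fit rows
    T *= (col_marg / T.sum(axis=0))[None, :]   # fit columns
ratio = T / T0                       # should be an outer product r c^t
rank = np.linalg.matrix_rank(ratio, tol=1e-8)
```

Both margins are matched at convergence and `rank` equals one, consistent with the projection interpretation.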
Remark A. When minimizing contrasts such as discrimination information, chi-square distance or likelihood, the minimizers are not explicit due to the nonlinearity of the sums of ratios showing up in the derivatives. This is why converging algorithms are used in practice. In the case of the iterative proportions algorithm each step is computed easily and quickly. What has been studied so far concerns the convergence as the number of iterations $N \to +\infty$, with $n$ fixed and the initial empirical frequencies treated as deterministic entries. When the sample size $n \to +\infty$, these entries are close to $P$ itself, which satisfies the marginal constraints, hence one expects that the number $N$ of iterations necessary to converge is small. We shall study the first $N_0$ iterations in the statistical setting $n \to +\infty$.
Non explicit bias and variance. The initial values being empirical frequencies, the converged solution of the algorithm as $N \to +\infty$ is a joint distribution fulfilling the marginal requirements that still deviates from the true population distribution $P$, and moreover in a rather complicated way. Since the modified empirical distribution satisfies only the marginal constraint of the current iteration, there is a permanent bias with respect to the other margins, and hence with respect to $P$. The exact covariance matrix and bias vector of the random weights after $N$ iterations are tedious to compute. For instance, estimates for the variance of cell probabilities in the case of a two-way contingency table are given by Brackstone and Rao [6] for $N \le 4$, and by Konijn [17] or Choudhry and Lee [8] for $N = 2$. Bankier [2] proposed a recursive linearization technique providing an estimator of the asymptotic variance of the weights. In Binder and Théberge [4] the variance of the converged solution requires calculating the weights at each iteration.
Open question. Since exact computations lead to intractable formulas for the bias and variance of frequencies and of statistics as simple as means, an important open problem is to identify the leading terms when $n$ is large compared to $N$. We derive comprehensive explicit formulas as $n \to +\infty$ for $N \le N_0$ with $N_0$ fixed, then for $N \to +\infty$. In order to further analyze the raking-ratio method it is moreover desirable to control simultaneously large classes of statistics and hence to work at the empirical measure level rather than with the empirical weights or a single statistic only. This is the main motivation for the forthcoming general study of empirical measures indexed by functions and modified through auxiliary information given by partitions.

Statistical motivation
Representative sample. In a survey analysis context, the raking-ratio method modifies the weights of a contingency table built from a sample of size $n$ in order to fit exactly given marginals. Such a strict margin correction is justified when a few properties of the finite population under study are known, like the size of sub-populations. The modified sample frequencies then reflect the marginal structure of the whole population. If the population is large or infinite the information may come from previous and independent statistical inference, from structural properties of the model or from various experts.

Remark B. Making the sample representative of the population is an ad hoc approach based on common sense. The mathematical impact is twofold. On the one hand all statistics are affected by the new weights in terms of bias, variance and limit law, so that practitioners may very well be using estimators, tests or confidence bands that have lost their usual properties. On the other hand, replacing marginal frequencies with the true ones may smooth sample fluctuations of statistics correlated to them while leaving the uncorrelated ones rather unaffected. These statements will be quantified precisely in Section 2.2.

Remark C. Fitting after sampling is a natural method that has been re-invented many times in various fields, and was probably used a long time ago. Depending on the setting it may be viewed as stratification, calibration, iterative proportional fitting or matrix scaling, and could be used to deal with missing data. Many fitting methods may be reduced to a raking-ratio type algorithm. We initially called it auxiliary information of partitions as we re-invented it as a special case of the nonparametric partial information problem stated in Section 1.3.
Remark D. An asymptotic approach is no longer relevant in survey analysis when the underlying population is rather small. In the small population case, the way the sample is drawn has such a deep impact that it may even become the main topic. A study of calibration methods for finite populations can be found in Deville and Särndal [11,12]. This is beyond the scope of our work.
Quadratic risk reduction. Modifying the marginal frequencies of a sample may induce serious drawbacks. One should ask whether or not the estimation risk can be controlled. Typically, a statistic has more bias when the sample weights are changed by using raking, calibration or stratification methods after sampling. In the spirit of Remark B, a variance reduction is expected if the statistic of interest is strongly correlated to the $k$ known discrete marginal variables. Now, evaluating the quadratic risk of a specific statistic requires tedious expansions for the bias, variance and correlations of the weights, whence the very small $N$ studied in the literature. Likewise, no global risk reduction property has been established as $n \to +\infty$, and no multivariate or uniform central limit theorem. These results are established in Propositions 4 to 9.
Contributions. In this paper we consider classes of empirical means raked $N$ times, sampled jointly from any population. We derive closed-form expressions of their Gaussian limits and their limiting covariances as $n \to +\infty$ then $N \to +\infty$. We also quantify the uniform risk reduction phenomenon and provide sharp statistical estimation tools such as uniform Berry-Esseen type bounds. In particular, a Donsker invariance principle for the raked empirical process provides joint limiting laws for additive statistics built from empirical means, and this can be extended to nonlinear estimators by applying the delta method, argmax theorems or plug-in approaches as in the classical setting - see [27,28].
Organization of the paper. In Section 1.3 we relate the raking-ratio problem to nonparametric auxiliary information. The raking-ratio empirical process $\alpha_n^{(N)}(\mathcal F)$ is defined in Section 1.4. The usual assumptions on an indexing class $\mathcal F$ of functions are given in Section 2.1. In Section 2.2 we state our results for $\alpha_n^{(N)}(\mathcal F)$ when the number $N$ of iterations is fixed. Our main theorem is a nonasymptotic strong approximation bound which yields the uniform central limit theorem with rate as $n \to +\infty$, as well as a uniform control of the bias and the covariances for fixed $n$. The approximating Gaussian process is studied in Section 2.3, which establishes the uniform risk reduction phenomenon provided the iterations are stopped properly. In Section 2.4, in the two partitions case, we characterize explicitly the limiting process as $N \to +\infty$. All statements are proved in Sections 3 and 4. The Appendix provides a few examples.

An auxiliary information viewpoint
Let $X_1, \dots, X_n$ be independent random variables with unknown law $P$ on some measurable space $(\mathcal X, \mathcal A)$. Assumptions like separability or the Hausdorff property are not necessary for this space. Let $\delta_x$ denote the Dirac mass at $x \in \mathcal X$ and consider the empirical measure $P_n = n^{-1}\sum_{i=1}^n \delta_{X_i}$ on $\mathcal A$.

Auxiliary information. Our interest in the raking-ratio method came while investigating how to exploit various kinds of partial information on $P$ to make $P_n$ closer to $P$. The auxiliary information paradigm is as follows. Usually what is assumed on $P$ is formulated in terms of technical or regularity requirements. Sometimes it is relevant to assume that $P$ satisfies simple properties that could be tested or estimated separately. Consider the following two extreme situations. First, a parametric model provides a tremendous amount of information by specifying $P = P_\theta$ up to a finite dimensional parameter $\theta$, so that $P_n$ can be replaced with the most likely $P_{\theta_n(X_1,\dots,X_n)}$ among the model. Notice that $P_n$ is used to maximize the empirical likelihood, but the resulting $P_{\theta_n(X_1,\dots,X_n)}$ is of a very different nature, far from the initial and discrete $P_n$, thanks to the valuable parametric information. At the opposite end, in a purely nonparametric setting the information mainly comes from the sample itself, so that only slight modifications of $P_n$ are induced by weak hypotheses on $P$ - like support, regularity, symmetry, log-concavity, a Bayesian model, a semi-parametric model, etc. In between, we would like to formalize a notion of a priori auxiliary information on $P$ based on partial but concrete clues to be combined with the knowledge of $P_n$. Such clues may come from experts, models, former inference, statistical learning or distributed data. A generic situation one can start with is when the probabilities $P(A_j)$ of a finite number of sets $A_j \in \mathcal A$ are known - which in a parametric setting already determines $\theta$ then $P$.
Information from one partition. If the $A_j$ form a finite partition $\mathcal A^{(1)} = \{A_1, \dots, A_{m_1}\}$ of $\mathcal X$ then the auxiliary information coincides with one discrete marginal distribution and a natural nonparametric redesign $P_n^{(1)}$ of $P_n$ is the following,
$$P_n^{(1)}(f) = \sum_{j=1}^{m_1} P(A_j)\, \frac{P_n(f \mathbf 1_{A_j})}{P_n(A_j)}. \qquad (1.1)$$
According to Proposition 1 below, the random measure $P_n^{(1)}$ satisfies the auxiliary information $P_n^{(1)}(A_j) = P(A_j)$, for $1 \le j \le m_1$, and is the relative entropy projection of $P_n$ on these $m_1$ constraints. The random ratios in (1.1) induce a bias between $P_n^{(1)}$ and $P$. We prove that the bias of $\alpha_n^{(1)} = \sqrt n\,(P_n^{(1)} - P)$ vanishes uniformly and that the limiting Gaussian process of $\alpha_n^{(1)}$ has a smaller variance than the $P$-Brownian bridge.

Extension to N partitions. If some among the sets $A_j$ are overlapping then the information comes from several marginal partitions. It is not obvious how to optimally combine these sources of information since there is no explicit modification of $P_n$ matching simultaneously several finite discrete marginals. In other words there is no closed-form expression of the relative entropy projection of $P_n$ on several margin constraints. An alternative consists in recursively updating the current modification $P_n^{(N-1)}$ of $P_n$ onto the partition $\mathcal A^{(N)} = \{A_1^{(N)}, \dots, A_{m_N}^{(N)}\}$, exactly as in (1.1) for $P_n^{(1)}$, starting from $P_n^{(0)} = P_n$ and using the known probabilities $P(A_j^{(N)})$. This coincides with Deming and Stephan's iterative procedure, that is the raking-ratio algorithm, as formalized in Section 1.4.
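As a sketch of (1.1), one can reweight a sample by one known partition: each point of cell $A_j$ receives the weight $P(A_j)/(n P_n(A_j))$, so the modified measure matches the $P(A_j)$ exactly. The distribution, partition and sample size below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(size=n)                    # sample from P = U(0,1)
cell = (x >= 0.3).astype(int)              # partition A_1 = [0,0.3), A_2 = [0.3,1)
true_p = np.array([0.3, 0.7])              # known probabilities P(A_j)

counts = np.bincount(cell, minlength=2)    # n * P_n(A_j)
w = true_p[cell] / counts[cell]            # raked weight of each sample point

p1_mean = (w * x).sum()                    # P_n^{(1)}(f) for f(x) = x
```

By construction the weights of each cell sum to the known $P(A_j)$, so $P_n^{(1)}(A_j) = P(A_j)$ holds exactly, while $P_n^{(1)}(f)$ remains a consistent estimator of $P(f)$.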

Information from N finite partitions
For $x \in A_j^{(N)}$, the measure $P_n^{(N)}$ allocates to the point $x$ the random weight $P(A_j^{(N)})\, P_n^{(N-1)}(\{x\}) / P_n^{(N-1)}(A_j^{(N)})$. Let us define recursively, for $N \in \mathbb N^*$, the $N$-th raking-ratio empirical measure
$$P_n^{(N)}(f) = \sum_{j=1}^{m_N} P(A_j^{(N)})\, \frac{P_n^{(N-1)}(f \mathbf 1_{A_j^{(N)}})}{P_n^{(N-1)}(A_j^{(N)})} \qquad (1.3)$$
and the $N$-th raking-ratio empirical process
$$\alpha_n^{(N)}(f) = \sqrt n\,\big(P_n^{(N)}(f) - P(f)\big). \qquad (1.4)$$
These quantities are well defined on the event where all the $P_n^{(N-1)}(A_j^{(N)})$ are positive, which almost surely holds for all $n$ large enough and $N$ fixed, by (1.2) and the law of large numbers. Given two probability measures $Q_n$ and $Q$ supported by $\{X_1, \dots, X_n\}$ we define the relative entropy of $Q_n$ and $Q$ - see for instance [9] - to be
$$K(Q_n, Q) = \sum_{i=1}^{n} Q(\{X_i\}) \log \frac{Q(\{X_i\})}{Q_n(\{X_i\})}.$$
As a consequence, the formula (1.3) means that the $N$-th iteration $P_n^{(N)}$ is the Shannon-Kullback-Leibler projection of $P_n^{(N-1)}$ under the constraints $P_n^{(N)}(A_j^{(N)}) = P(A_j^{(N)})$. Therefore the raking-ratio method is an iterated maximum likelihood procedure.
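A point-level sketch of the recursion (1.3), with hypothetical data: starting from the uniform weights $1/n$, each iteration rescales the weights cell-by-cell so that the current partition's probabilities are matched exactly, while the previously fitted partition is only almost satisfied.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
x = rng.uniform(size=n)

# two overlapping partitions of [0,1) with known probabilities (hypothetical)
partitions = [
    ((x >= 0.5).astype(int), np.array([0.5, 0.5])),   # used at odd steps
    ((x >= 0.3).astype(int), np.array([0.3, 0.7])),   # used at even steps
]

w = np.full(n, 1.0 / n)                    # weights of P_n^{(0)} = P_n
for N in range(6):                         # six raking iterations
    cell, true_p = partitions[N % 2]
    mass = np.bincount(cell, weights=w, minlength=2)   # P_n^{(N-1)}(A_j^{(N)})
    w *= true_p[cell] / mass[cell]         # ratio step of (1.3)

cell_last, p_last = partitions[(6 - 1) % 2]
matched = np.bincount(cell_last, weights=w, minlength=2)
other = np.bincount(partitions[0][0], weights=w, minlength=2)
```

The last partition is matched exactly (`matched` equals `p_last`), and the other partition's probabilities are already almost recovered, illustrating the "implicit hope" behind stopping after $N$ iterations.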
A mixture of conditional empirical processes. By introducing, for $A \in \mathcal A$ such that $P(A) > 0$ and $P_n^{(N)}(A) > 0$, the conditional expectations
$$P(f \mid A) = \frac{P(f \mathbf 1_A)}{P(A)}, \qquad P_n^{(N)}(f \mid A) = \frac{P_n^{(N)}(f \mathbf 1_A)}{P_n^{(N)}(A)},$$
we see that (1.3) further reads
$$P_n^{(N)}(f) = \sum_{j=1}^{m_N} P(A_j^{(N)})\, P_n^{(N-1)}(f \mid A_j^{(N)}),$$
so that $P_n^{(N)}$ is a mixture of conditional empirical measures and (1.4) can be formulated in terms of the conditional empirical processes of $P_n^{(N-1)}$. This unavoidable bias is induced by (1.5) to globally compensate for the local cancellation of the variance for many more functions $f$. Our results show that, uniformly over a large class of functions, the bias vanishes asymptotically and the variance decreases, as well as the quadratic risk $\mathbb E((P_n^{(N)}(f) - P(f))^2)$.

Let $\mathcal M$ denote the set of measurable real valued functions on $(\mathcal X, \mathcal A)$. Consider a class $\mathcal F \subset \mathcal M$ such that $\sup_{f \in \mathcal F} \sup_{x \in \mathcal X} |f(x)| \le M < +\infty$ and satisfying the pointwise measurability condition often used to avoid measurability problems. Namely, there is a countable subclass $\mathcal F^* \subset \mathcal F$ such that for every $f \in \mathcal F$ there exists $\{f_m\} \subset \mathcal F^*$, depending on $f$, with $\lim_{m \to +\infty} f_m(x) = f(x)$ for all $x \in \mathcal X$. With no loss of generality we also assume (2.1). In addition $\mathcal F$ is assumed to have either a small uniform entropy, like Vapnik-Chervonenkis classes - for short, VC-classes - or a small $P$-bracketing entropy, like many classes of smooth functions. These entropy conditions are defined below and named (VC) and (BR) respectively. For a probability measure $Q$ on $(\mathcal X, \mathcal A)$ and $f, g \in \mathcal M$ define $d_Q(f,g)^2 = Q((f-g)^2)$. Let $N(\mathcal F, \varepsilon, d_Q)$ be the minimum number of balls of $d_Q$-radius $\varepsilon$ needed to cover $\mathcal F$, and let $N_{[\,]}(\mathcal F, \varepsilon, d_P)$ be the least number of $\varepsilon$-brackets necessary to cover $\mathcal F$. The condition (VC) requires a uniform polynomial bound of the form $c_0/\varepsilon^{\nu_0}$ for $N(\mathcal F, \varepsilon, d_Q)$, where the supremum is taken over all discrete probability measures $Q$ on $(\mathcal X, \mathcal A)$.
If one modifies a class $\mathcal F$ satisfying (VC) or (BR) by adding the functions necessary to also satisfy the condition (2.1) then (VC) or (BR) still holds with a new constant $c_0$ or $b_0$ respectively. Many properties and examples of VC-classes or classes satisfying (BR) can be found in Pollard [19], Van der Vaart and Wellner [27] or Dudley [14]. Uniform boundedness is the least crucial assumption and could be replaced by a moment condition allowing some truncation arguments, at the cost of additional technicalities. Let $\ell^\infty(\mathcal F)$ denote the set of real-valued functions bounded on $\mathcal F$, endowed with the supremum norm $\|\cdot\|_{\mathcal F}$. The raking-ratio empirical process $\alpha_n^{(N)}(\mathcal F)$ is then a random element of $\ell^\infty(\mathcal F)$; recall (1.7). Let us introduce a new centered Gaussian process $G^{(N)}(\mathcal F)$ indexed by $\mathcal F$ that we call the $N$-th raking-ratio $P$-Brownian bridge, defined recursively at (2.3) for any $N \in \mathbb N^*$ and $f \in \mathcal F$. The distribution of $G^{(N)}$ is given in Proposition 7. Lastly, the following notation will be useful.

General properties
We now state asymptotic and nonasymptotic properties that always hold after raking $N_0$ times. The i.i.d. sequence $\{X_n\}$ is defined on a probability space $(\Omega, \mathcal T, \mathbb P)$ so that $\mathbb P$ implicitly drives all convergences when $n \to +\infty$, and $(\mathcal X, \mathcal A)$ is endowed with $P = P_{X_1}$. For all $N \le N_0$ the information $P(\mathcal A^{(N)})$ satisfies (1.2). Most of the subsequent constants can be bounded by using only $N_0$. The constant $\kappa_{N_0}$ is large, and possibly largely suboptimal, except for $N_0 = 0$ where $\kappa_0 = 1$ coincides with the classical law of the iterated logarithm - from which the proposition follows.
The next result shows that the nonasymptotic deviation probability of $\|\alpha_n^{(N_0)}\|_{\mathcal F}$ can be controlled by the deviation probability of $\|\alpha_n^{(0)}\|_{\mathcal F}$, which in turn can be bounded by using Talagrand [26], van der Vaart and Wellner [27] or more recent bounds from empirical process theory. However, since the partition changes at each step, the constants are penalized by factors similar to $\kappa_{N_0}$ above, involving (2.6).

Proposition 3. Assume $\mathcal F$ is pointwise measurable and bounded by $M$, and fix $N_0 \in \mathbb N$, $n \in \mathbb N^*$ and $\lambda > 0$. Under (BR) it holds, for $n > n_0$ and $\lambda_0 < \lambda < D_0 \sqrt n$, where the positive constants $D_0, D_1, D_2, n_0, \lambda_0$ are defined at (3.9). Under (VC) it holds, for $n > n_0$ and $\lambda_0 < \lambda < 2M\sqrt n$, where the positive constants $D_3, D_4, n_0, \lambda_0$ are defined at (3.10).
Remark F. Clearly, to avoid drawbacks, $N_0$ should be fixed as $n$ increases and $\mathcal F$ limited to the bare necessities of the actual statistical problem. In this case, Proposition 3 shows that $\|\alpha_n^{(N_0)}\|_{\mathcal F}$ exceeds $C\sqrt{\log n}$ with probability less than $1/n^2$, for some $C > 0$. Concentration of measure type probability bounds for $\alpha_n^{(N)}$ would involve unbounded random coefficients.
If $\mathcal F$ satisfies (BR) then write $v_n = 1/(\log n)^\gamma$. In both cases, one can define on the same probability space $(\Omega, \mathcal T, \mathbb P)$ a sequence $\{X_n\}$ of independent random variables with law $P$ and a sequence $\{G_n^{(N)}\}$ of versions of $G^{(N)}$ satisfying the following property. For any $N_0 \ge 0$ there exist $n_0 \in \mathbb N$ and $d_0 > 0$ such that we have, for all $n \ge n_0$,
$$\mathbb P\Big(\max_{0 \le N \le N_0} \big\|\alpha_n^{(N)} - G_n^{(N)}\big\|_{\mathcal F} > d_0 v_n\Big) \le \frac{1}{n^{\theta_0}}.$$
The limiting laws of functionals of $\alpha_n^{(N)}$ can then be derived by exploiting the properties induced by (2.3), as in Section 2.3. For instance the finite dimensional laws of $G^{(N)}$ are computed explicitly at Proposition 7. For nonasymptotic applications, given a class $\mathcal F$ of interest it is possible to compute crude bounds for $n_0$ and $d_0$ since most constants are left explicit in our proofs as well as in [3]. Indeed $d_0$ depends on $p^{(N_0)}$ from (2.5), on $P_{N_0}, M_{N_0}, S_{N_0}$ from (2.6), on $\nu_0, c_0, r_0, b_0$ from (VC) or (BR), on $N_0, M, \theta_0$ and on some universal constants from the literature. Clearly, Theorem 2.1 implies that the speed of weak convergence of $\alpha_n^{(N)}$ to $G^{(N)}$ is at most $v_n$ in the Prokhorov distance - see (3.11) and Section 11.3 of [13] for a definition of this metric. More deeply, from Theorem 2.1 we derive the following rates of uniform convergence for the bias and the variance.
where $v_n \to 0$ and $d_0$ are the same as in Theorem 2.1. By Proposition 5, the bias process $\mathbb E(\alpha_n^{(N)}(f)) = \sqrt n\,(\mathbb E(P_n^{(N)}(f)) - P(f))$ vanishes at the uniform rate $1/\sqrt n$. The covariance of $G^{(N)}$ is computed in Section 2.2 and the quadratic risk is estimated at Remark I. A second consequence of Theorem 2.1 is uniform Berry-Esseen type bounds. Let $\Phi$ denote the distribution function of the centered standardized normal law.

Proposition 6. Assume that $\mathcal F$ satisfies (VC) or (BR), fix $N_0 \in \mathbb N$ and let $d_0 > 0$, $v_n \to 0$ be defined as in Theorem 2.1. If $\mathcal F_0 \subset \mathcal F$ is such that the variances $\mathbb V(G^{(N)}(f))$ are bounded away from zero on $\mathcal F_0$, then for any $d_1 > d_0$ there exists $n_1 \in \mathbb N$ such that (2.7) holds for all $n \ge n_1$. Let $\mathcal L$ be a collection of real valued Lipschitz functions $\varphi$ defined on $\ell^\infty(\mathcal F)$ with Lipschitz constant bounded by $C_1 < +\infty$ and such that $\varphi(G^{(N)})$ has a density bounded by $C_2 < +\infty$ for all $0 \le N \le N_0$. Then (2.8) holds for all $\varphi \in \mathcal L$ and $n \ge n_1$.

Remark H. The formula (2.7) is a special case of the second one, (2.8). The functions $f \in \mathcal F$ overdetermined by the knowledge of $P(\mathcal A^{(N)})$ have a small $\mathbb V(G^{(N)}(f))$ and are excluded from $\mathcal F_0$. Proposition 6 is especially useful under (VC) since $v_n$ is then polynomially decreasing, thus allowing larger $C_1 C_2$ and $\mathcal L$.
An example is given in Section .3. Whenever the class $\mathcal F$ is finite, the density of the transform $\varphi(G(\mathcal F))$ of the finite dimensional Gaussian vector $G(\mathcal F)$ is easily computed. The conditions for (2.8) of Proposition 6 are fulfilled if, for example, all the random variables $\varphi(G(\mathcal F))$ can be controlled by discretizing the small entropy class $\mathcal F$, by bounding their densities and then by taking limits accordingly.

Limiting variance and risk reduction
In this section we study the covariance structure of $G^{(N)}(\mathcal F)$ from (2.3), for $N$ fixed. The following matrix notation is introduced to shorten formulas. The brackets $[\cdot]$ refer to column vectors built from the partition and, for $l \le k \le N$, define the matrix $P_{\mathcal A^{(k)} \mid \mathcal A^{(l)}}$ of conditional probabilities. Let $\cdot$ denote the product between a square matrix and a vector. Finally define the vectors $\Phi_k^{(N)}(f)$ at (2.9). An explicit expression for $G^{(N)}$ is given in Lemma 3 and the closed form for the covariance function of $G^{(N)}(\mathcal F)$ is as follows.
Proposition 7. For all $N \in \mathbb N$ the process $G^{(N)}(\mathcal F)$ is Gaussian, centered and linear, with the covariance function given at (2.10). Proposition 7 implies the following variance reduction phenomenon.
The asymptotic risk reduction after raking is quantified by combining Propositions 5 and 8. Given $\varepsilon_0 > 0$ and $0 < \sigma_0 < \sigma_{\mathcal F}$ there exists some $n_0 = n_0(\varepsilon_0, \mathcal F)$ such that if $n > n_0$ then any $f \in \mathcal F$ with initial quadratic risk $\sigma_f^2/n > \sigma_0/n$ has a new risk after raking $N$ times, where $v_n \to 0$ and $d_0$ are as in Theorem 2.1, so that the risk is reduced whenever $\Delta(f) < 1$ and $n$ is large enough.
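The variance reduction can be observed in a small Monte Carlo experiment: for a function strongly correlated to the partition, the raked mean has a visibly smaller variance than the plain empirical mean. All numerical choices below (law, partition, sample sizes) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
true_p = np.array([0.3, 0.7])              # known P(A_1), P(A_2)

def estimates(n):
    x = rng.uniform(size=n)                # sample from P = U(0,1)
    cell = (x >= 0.3).astype(int)          # A_1 = [0,0.3), A_2 = [0.3,1)
    counts = np.bincount(cell, minlength=2)
    w = true_p[cell] / counts[cell]        # weights after one raking step
    return x.mean(), (w * x).sum()         # plain vs raked mean of f(x) = x

reps = np.array([estimates(500) for _ in range(2000)])
var_plain, var_raked = reps.var(axis=0)
```

Here $f(x) = x$ is strongly correlated to the cell indicators, so raking removes the between-cell part of the variance, roughly $\sum_j P(A_j)(P(f \mid A_j) - P(f))^2$; for a function uncorrelated to the partition no reduction would occur.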
When $N_1 > N_0 > 0$ it is not automatically true that the covariance structure of $\alpha_n^{(N_1)}(\mathcal F)$ decreases compared to that of $\alpha_n^{(N_0)}(\mathcal F)$. According to the next statement, a simple sufficient condition is to rake twice along the same cycle of partitions.
Remark J. In Appendix .2 a counter-example with $N_1 = N_0 + 1$ shows that the variance does not decrease for all functions at each iteration. This case is excluded from Proposition 9 since $N_1 = N_0 + 1 < 2N_0$ if $N_0 > 1$ and, whenever $N_0 = 1$ and $N_1 = 2$, the requirement $\mathcal A^{(N_0)} = \mathcal A^{(N_1)}$ is not allowed.

The case of two marginals
We now consider the original method where k partitions are raked in a periodic order. Let us focus on the case k = 2 of the two-way contingency table. The Deming and Stephan algorithm coincides with the Sinkhorn-Knopp algorithm for matrix scaling [24].
The matrix $P_{\mathcal A\mid\mathcal B} P_{\mathcal B\mid\mathcal A}$ is $m_1 \times m_1$ and $P_{\mathcal B\mid\mathcal A} P_{\mathcal A\mid\mathcal B}$ is $m_2 \times m_2$. A sum with a negative upper index is null, a matrix with a negative power is also null, and a square matrix with power zero is the identity matrix. For $N \in \mathbb N^*$ define the matrices $S_{l,even}^{(N)}(f)$ and $S_{l,odd}^{(N)}(f)$, $l = 1, 2$, of Proposition 10.

Proposition 10. Let $m \in \mathbb N$.

Hypothesis (ER). The matrices $P_{\mathcal A\mid\mathcal B} P_{\mathcal B\mid\mathcal A}$ and $P_{\mathcal B\mid\mathcal A} P_{\mathcal A\mid\mathcal B}$ are ergodic.
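These transition-type matrices can be computed from any joint table of the two partitions. The sketch below builds $P_{\mathcal A\mid\mathcal B} P_{\mathcal B\mid\mathcal A}$ for a hypothetical $2 \times 2$ joint law and checks that its largest eigenvalue is $1$ while the second one, which drives the geometric rate, lies strictly inside the unit disc, as (ER) guarantees for strictly positive cells.

```python
import numpy as np

# hypothetical joint law P(A_i ∩ B_j) with strictly positive cells
P = np.array([[0.15, 0.25],
              [0.35, 0.25]])
pA, pB = P.sum(axis=1), P.sum(axis=0)      # margins P(A_i), P(B_j)

P_A_given_B = P / pB[None, :]              # entry (i,j) = P(A_i | B_j)
P_B_given_A = (P / pA[:, None]).T          # entry (j,i) = P(B_j | A_i)
M = P_A_given_B @ P_B_given_A              # m_1 x m_1 matrix P_{A|B} P_{B|A}

eigs = np.sort(np.abs(np.linalg.eigvals(M)))[::-1]
lam1 = eigs[1]                             # modulus of the second eigenvalue
```

Each column of $M$ sums to one, so $1$ is always an eigenvalue; under (ER) all other eigenvalues have modulus strictly less than one, and plugging empirical cell frequencies into $P$ estimates them at rate $1/\sqrt n$.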
Remark L. Notice that (ER) holds whenever the matrices have strictly positive coefficients. This is true whenever every cell $A_i \cap B_j$ has positive probability. The latter requirement is for instance met if $\mathcal X = \mathbb R^d$, $P$ has a positive density and the two partitions concern two distinct coordinates. Under (ER), the matrices $S_{l,even}^{(N)}(f)$ and $S_{l,odd}^{(N)}(f)$ for $l = 1, 2$ converge uniformly on $\mathcal F$ to limits $S_{l,even}(f)$ and $S_{l,odd}(f)$ involving the $m_l \times 1$ vectors $P_l[f] = (P(f), \dots, P(f))^t$. More precisely, given any vector norms $\|\cdot\|_{m_l}$ for $l = 1, 2$, there exist $c_l > 0$ and $0 < \lambda_l < 1$ such that the convergence is geometric. The main result of this section is the simple expression of the limiting process for a two partitions raking procedure. Let $d_{LP}$ denote the Lévy-Prokhorov distance. The matrices $S_{1,even}(f)$, $S_{2,odd}(f)$ and the scalars $\lambda_1, \lambda_2$ are as in Proposition 11.
Moreover we have, for all $N$ large and some $c_3 > 0$ depending on $\lambda_1, \lambda_2, P(\mathcal A), P(\mathcal B)$, a geometric bound on $d_{LP}(G^{(N)}, G^{(\infty)})$. The matrices $S_{1,even}(f)$ and $S_{2,odd}(f)$ are not known without additional information. They can be estimated uniformly over $\mathcal F$ as $n \to +\infty$ to evaluate the distribution of $G^{(N)}$ and $G^{(\infty)}$, thus giving access to adaptive tests or estimators. Since $\lambda_1, \lambda_2$ and $c_3$ are related to the eigenvalues of $P_{\mathcal A\mid\mathcal B} P_{\mathcal B\mid\mathcal A}$ and $P_{\mathcal B\mid\mathcal A} P_{\mathcal A\mid\mathcal B}$, they can be estimated adaptively at rate $1/\sqrt n$ in probability. This in turn provides an evaluation of $d_{LP}(G^{(N)}, G^{(\infty)})$.

Remark N. In the case of an auxiliary information reduced to $P(A), P(B)$ one should use $\mathcal A = \{A, A^c\}$, $\mathcal B = \{B, B^c\}$, estimate the missing $P(A \cap B)$ in $P_{\mathcal A\mid\mathcal B}, P_{\mathcal B\mid\mathcal A}$ and the conditional expectations on the four sets, then $S_{1,even}, S_{2,odd}$. If the probabilities of more overlapping sets are known, the above characterization of the limiting process $G^{(\infty)}(\mathcal F)$ can be generalized to a recursive raking among $k$ partitions $\{A_j, A_j^c\}$ in the same order.
3 Proofs of general results

Raking formulas
Write $B_{n,N_0}$ for the event that all the raked cell frequencies used up to step $N_0$ are positive, and $B_{n,N_0}^c = \Omega \setminus B_{n,N_0}$. By (1.2), the probability that $\alpha_n^{(N_0)}$ is undefined is $\mathbb P(B_{n,N_0}^c)$. On $B_{n,N_0}$ we have, by (1.3) and since $\mathcal A^{(N)}$ is a partition, the raking formulas below. By (1.7) this implies (1.8) since, for any $A \in \mathcal A$, summing over the refined finite partition of $\mathcal X$ gives the result.

Proof of Proposition 1.
The partition A (N ) ∩ is defined at (3.3). By using (3.4) it holds, for N 1, is a partition of X . Hence the contrast between P (N −1) n and P (N ) n is the same on {X 1 , ..., X n } as on A (N ) . Now, by convexity of − log(x) it follows, for any probability distribution Q supported by {X 1 , ..., X n }, where the final identification relies on (3.5) and P (N ) n = P on A (N ) .

Proof of Proposition 2.
The classical law of the iterated logarithm holds for the empirical process $\alpha_n^{(0)}$ indexed by $\mathcal F$ under (VC) and (BR). See Alexander [1], in particular Theorem 2.12 for (BR) and Theorem 2.13, based on Theorem 2.8, which uses in its proof the consequence of Lemma 2.7, which is indeed (VC). Namely, for any $\varepsilon > 0$, with probability one there exists $n(\omega)$ such that the stated bound holds for all $n > n(\omega)$.
Since P (N ) n is a probability measure we have P (N ) where, for N > 0,
We next establish Theorem 2.1. Fix $N_0 \in \mathbb N$.
Proof. If $\mathcal F$ is uniformly bounded by $M$ then, by backward induction from $N_0$ to $1$, the classes $\mathcal F^{(N)}$ and $\mathcal H^{(N)}$ are uniformly bounded by $(2M)^N/P_N$ for $N \le N_0$. It readily follows that $\mathcal F_0$ and $\mathcal H_0$ are bounded by $(2M)^{N_0}/P_{N_0}$. Assume that $f_k \in \mathcal F^*$ converges pointwise on $\mathcal X$ to $f \in \mathcal F$. By iterating this reasoning backward from $N$ to $1$ we obtain that $\mathcal F^{(N)}$ and $\mathcal H^{(N)}$ are pointwise measurable, by using the countable subclasses. Assume next that $\mathcal F$ satisfies (VC). By (3.12), if $\mathcal F$ can be covered by $N(\mathcal F, \varepsilon, d_Q)$ balls of $d_Q$-radius $\varepsilon$ centered at some functions $g$, then $\phi^{(N)}(\mathcal F)$ can be covered by the same number of balls, centered at the corresponding $\phi^{(N)} g$, and hence the same number of centers $\phi^{(1)} \circ \dots \circ \phi^{(N)} g$ suffices to cover $\mathcal F^{(N)}$. All the $\phi^{(j,k)} \circ \phi^{(k+1)} \circ \dots \circ \phi^{(N)} g$ are needed to cover $\mathcal H^{(N)}$, that is $S_N N(\mathcal F, \varepsilon, d_Q)$ balls. This shows that $\mathcal F_0$ (resp. $\mathcal H_0$) obeys (VC) with the same power $\nu_0$ and the constant $c_0(N_0 + 1)$ (resp. $c_0 \sum_{N=0}^{N_0} S_N$). Assume now that $\mathcal F$ satisfies (BR).
If g − f g + then we have N ) , .
Step 2. By (3.2) we have (3.14). Under the convention that $\phi^{(N+1)} \circ \phi^{(N)} = \mathrm{id}$, iterating (3.14) leads to a decomposition whose terms $\Gamma_n^{(k)}$ clearly carry some bias and variance distortion. However the following lemma states that these terms are uniformly controlled. Moreover, for any $\zeta > 0$ and $\theta > 0$ there exists $n_3(\zeta, \theta)$ such that the stated bound holds for all $n > n_3(\zeta, \theta)$.

Proof. (i) Let us apply Proposition 2 to $\mathcal F$ and, thanks to Lemma 1, to $\mathcal H_0$ and $\mathcal H^{(N)}$. So for all $\varepsilon > 0$ the corresponding bounds hold for all $n$ large enough. The following statements are almost surely true for all $n$ large enough, on the one hand for the chosen $\varepsilon$, and on the other hand since $\sigma_{\mathcal H_0} \le \sigma_{\mathcal F}$ by (3.13). By (3.15), the ratios $q_n(j, N)$ are controlled and the almost sure result then holds with $C_0 = 8\sigma_{\mathcal F}^2 \kappa_{N_0}^2 S_{N_0}/p^{(N_0)}$.

(ii) We now work on the event $B_{n,N_0}$ of (3.1). There obviously exists $n_1$ such that if $n > n_1$ then $S_{N_0}(1 - p^{(N_0)})^n \le 1/4n^\theta$. We can also find $\kappa > 0$ so small that $n^{2\kappa}/\sqrt n = o(v_n)$ as $n \to +\infty$. Therefore, whatever $\zeta > 0$, there exists $n_2(\kappa, S_{N_0}, \zeta, \mathcal F, P)$ such that $\zeta v_n > 2 S_{N_0} n^{2\kappa}/p^{(N_0)}\sqrt n$ for any $n \ge n_2$. Choosing $n \ge \max(n_1, n_2)$ we deduce, as for (3.16) and (3.17), that the deviation reduces to that of the initial process. By Proposition 3 we see that under (VC) or (BR) the latter probability can be made less than $1/8n^\theta$ for any $n > n_3(\zeta, \theta)$ with $n_3(\zeta, \theta)$ large enough. Clearly $n_3(\zeta, \theta)$ depends on $\zeta, \theta, n_1, n_2$ and on the entropy of $\mathcal H_0$, thus all the constants in Lemma 1 and Proposition 3 are involved.
Step 3. Fix $\theta > 0$. By Lemma 1 we can apply Propositions 1 and 2 of Berthet and Mason [3] to $\mathcal F_0$, which ensures the following Gaussian approximation. For some constants $c_\theta(\mathcal F_0, P) > 0$ and $n_\theta(\mathcal F_0, P) > 0$ we can build on a probability space $(\Omega, \mathcal T, \mathbb P)$ a version of the sequence $\{X_n\}$ of independent random variables with law $P$ and a sequence $\{G_n^{(0)}\}$ of coupling versions of $G^{(0)}$ in such a way that the bound (3.18) holds for all $n \ge n_\theta(\mathcal F_0, P)$. Keep in mind that the constants $n_\theta$ and $c_\theta$ only depend on the entropy of $\mathcal F_0$ through the constants $M, c_0, \nu_0, b_0, r_0$. By choosing $\theta > 1$ and $d_\theta > c_\theta(\mathcal F_0, P)$, then applying the Borel-Cantelli lemma to (3.18), the corresponding bound almost surely holds for all $n$ large enough.

Step 4. Let $\theta_0 > 0$. We work under the event $B_{n,N_0}$ of (3.1), with probability at least $1 - 1/4n^{\theta_0}$ provided that $n > n_1$. The process $G^{(0)}$ being linear on $\mathcal F$, we see that the recursive definition (2.3) applies to the version $G_n^{(0)}$. This combined with (3.14) readily gives the approximation of $\alpha_n^{(N)}$ by $G_n^{(N)}$. Recall that $v_n > L(n)/\sqrt n$ and that Lemma 2 holds. By (3.19) and (3.20) we almost surely have the stated bound for all $n$ large enough and $d_0 = 2 d_{\theta_0}$. By Lemmas 1 and 2, (3.18) and (3.20), for $n_0 > \max(n_1, n_3(\zeta, \theta_0), n_{\theta_0}(\mathcal F_0, P))$ and $d_0 > c_{\theta_0}(\mathcal F_0, P) + \zeta$ we have the conclusion for all $n \ge n_0$. To conclude, observe that the parameters $N_0, M, M_{N_0}, S_{N_0}, P_{N_0}, p^{(N_0)}, \theta_0, \nu_0, c_0, r_0, b_0$ have been used at one or several steps to finally define $n_0$ and $d_0$.

Proof of Proposition 5
Theorem 2.1 implies, for $f \in \mathcal F$, the decomposition $\alpha_n^{(N)}(f) = G_n^{(N)}(f) + R_n^{(N)}(f)$, where $\{G_n^{(N)}\}$ is a sequence of versions of the centered Gaussian process $G^{(N)}$ from (2.3) and the random sequence $R_n^{(N)} = \alpha_n^{(N)} - G_n^{(N)}$ is the coupling error. We have to be a little careful with the expectation, variance and covariance of the coupling error process $R_n^{(N)}$.
Step 1. Since G (N ) n (f ) is centered the bias is controlled by Write a n = √ K log n where K > 0 and θ 0 > 1 from Theorem 2.1 can be chosen as large as needed. Then, for θ > 1, ε > 0 and k ∈ N * consider the events n F a n , C n,k = θ k−1 a n < G (N ) n F θ k a n .
By Theorem 2.1, P(A c n ) < 1/n θ0 and v n > a n / √ n for all n large enough, hence By Propositions 7 and 8, G (N ) (f ) is a centered Gaussian process indexed by F such that, under (VC) or (BR), it holds Therefore we have, since θ > 1 and v n > 4M/ √ n > 2a n /n for n large enough, , and the following series is converging to an arbitrarily small sum, where δ < K/8C 2 F − 1. It follows that (3.22) is ultimately bounded by d 0 .
Step 2. Starting from (3.21) and the bias-variance decomposition, the quadratic risk is in turn controlled by the coupling error. Using v_n > a_n/√n and a_n = √(K log n) < √n we get, for θ_0 = 2, a bound in which the series is equal to its first term n² exp(−a_n²/8C_F²) times a convergent series, by using (3.24) as in Step 1 with K > 16C_F². We have shown that the limit superior is as stated. (ii) Concerning the covariance term in (3.25), we have the corresponding bounds by Propositions 7 and 8. Using again P(A_n^c) < 1/n², and lastly, for g ∈ F, K large and all n large enough, (3.23) and (3.24) control the remaining tail probabilities.
The above upper bounds imply the result, by (3.25), (3.26) and (2.4).
Step 3. Let us extend Step 2 to the covariance. By Step 1 we have, for all n large enough, the analogous bound.

Proof of Proposition 6.
Fix N_0 ∈ N. Let us apply Theorem 2.1 with θ_0 = 2, from which we also use n_0, d_0 and v_n. We have, for all 0 ≤ N ≤ N_0, ϕ ∈ L, x ∈ R and n ≥ n_0, the claimed bound. This establishes the second statement of Proposition 6 provided d_1 > d_0 and n ≥ n_1 ≥ n_0, where n_1 is taken large enough.

4 Proofs concerning the limiting process

4.1 Proof of Proposition 7

Step 1. Let us first relate G^(N)(F) from (2.3) to G(F) = G^(0)(F) from (2.2) by means of the vectors Φ_k^(N)(f) introduced at (2.9) before Proposition 7.
Lemma 3. For all N ∈ N* and f ∈ F the stated identity holds.

Proof. The formula is true for N = 0. Assume that it is the case for some N ≥ 0.

Recall that the sets and vectors are given by (2.9) when k = N + 1. It remains to show (4.1) where, for l = k, k + 1, ..., N, the vector is also the j-th column of P_{A^(N+1)|A^(l)}. Therefore, by turning L into L′ = L + 1, all the resulting terms are different from those already collected. Having collected all the terms of Φ_k^(N+1)(f), the induction is complete.
Step 2. To compute the covariance of G^(N)(F) we need the following properties. Recall that P_{A^(k)|A^(k)} = Id_{m_k} is the identity matrix of R^{m_k}.
By the definition (2.9) of the vectors Φ_k^(N)(f), this yields (4.4).
Step 3. Let us first compute the variance of G^(N)(f). By Lemma 3, Lemma 4 gives, through (4.2) and (4.3), an expression in which (4.5) holds by (4.4). The formula (2.10) is proved. It extends to the covariance since, by Lemma 3 and then (4.2) and (4.3) again, substituting Φ_k^(N)(g) according to (4.5), we obtain an expression which is symmetric in f and g. The covariance formula of Proposition 7 is proved.

Proof of Propositions 8 and 9
Since the covariance matrix V_G[A^(k)] is positive semi-definite, for all 1 ≤ k ≤ N and f ∈ F the displayed inequality holds, and the variance part (2.10) of Proposition 8 follows from Proposition 7. For any m ∈ N*, (f_1, ..., f_m) ∈ F^m and u ∈ R^m, the analogous matrix inequality further holds, by Proposition 7 again. Under the wrapping hypothesis of Proposition 9 the vectors coincide, since the corresponding (2.9) only involves A^(N_0−k) = A^(N_1−k) for 0 ≤ k < N_0. Assuming moreover N_1 ≥ 2N_0 we get the stated identities, by Proposition 7, for m ∈ N*, (f_1, ..., f_m) ∈ F^m and u ∈ R^m.
Lemma 5. Assume (ER). For l = 1, 2 there exist an invertible m_l × m_l matrix U_l and an upper triangular (m_l − 1) × (m_l − 1) matrix T_l such that the stated decomposition holds.

Proof. Since A is a partition, for 1 ≤ i ≤ m_2 the sum of the m_1 terms of row i of P_{A|B} is Σ_{j=1}^{m_1} P(A_j | B_i) = 1, hence P_{A|B} is stochastic. Likewise P_{B|A} is stochastic and, by stability, so are P_{A|B} P_{B|A} and P_{B|A} P_{A|B}. Let the column of 1's associated to their eigenvalue 1 be in first position in their respective matrices U_1, U_2 of eigenvectors. The announced decomposition is always true with some upper triangular matrices T_l having a Jordan decomposition, which proves that P_{B|A} P_{A|B} has invariant probability P(A). Similarly, P(B) is invariant for P_{A|B} P_{B|A}, and the first rows of U_1^{−1} and U_2^{−1} are P[A] and P[B] respectively. Under (ER) these matrices are ergodic, which ensures that the eigenvalues of Δ_l have moduli strictly less than the dominant eigenvalue 1, since it is the case for the eigenvalues of T_l and hence of D_l. Furthermore, since N_l and D_l commute, we conclude that lim_{k→+∞} T_l^k = 0_{m_l−1, m_l−1}. (4.8) The scalar product of P(A) with V_1(f) is null since P(A) · E[f|A] = P(f). Likewise we get P(B) · V_2(f) = 0.
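As a numerical sanity check of these spectral facts (stochasticity of the products, invariance of P(A), subdominant eigenvalues of modulus strictly less than 1 in the ergodic case), one may run the following sketch on a hypothetical joint law of two finite partitions; the sizes and the random seed are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical joint law of two finite partitions A (m1 = 3 sets) and B (m2 = 4 sets)
joint = rng.random((3, 4))
joint /= joint.sum()
PA, PB = joint.sum(axis=1), joint.sum(axis=0)   # margins P(A), P(B)

P_A_given_B = (joint / PB).T          # row i lists P(A_j | B_i): a stochastic (4, 3) matrix
P_B_given_A = joint / PA[:, None]     # row i lists P(B_j | A_i): a stochastic (3, 4) matrix

M = P_B_given_A @ P_A_given_B         # the stochastic square matrix P_{B|A} P_{A|B}
row_sums = M.sum(axis=1)              # all equal to 1 since M is stochastic
invariant_gap = np.abs(PA @ M - PA)   # P(A) is an invariant probability of M
moduli = np.sort(np.abs(np.linalg.eigvals(M)))[::-1]  # 1 = moduli[0] > moduli[1]
```

The same check applies symmetrically to P_{A|B} P_{B|A} with invariant probability P(B). Since the joint law drawn here has all cells positive, the ergodicity hypothesis (ER) holds and the subdominant modulus is strictly below 1.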
The following convergences hold for any matrix norm. By Lemma 5, the matrices Id_{m_1−1} − T_1 and Id_{m_2−1} − T_2 are nonsingular since 1 is a dominant eigenvalue of P_{B|A} P_{A|B} and P_{A|B} P_{B|A}. Recalling (2.11) and (2.12), by Lemmas 5 and 6 it follows that the displayed quantities converge, by Lemma 5, to the stated limits. Since we have already seen, by using (4.8) and the notations of Proposition 11, that the corresponding limits hold, the conclusion follows from (2.14) and (2.13).

Appendix .1 Elementary example
The raking-ratio algorithm changes the weights of the cells of a contingency table in such a way that given margins are respected, just as if the sample had matched the expected values of the known probabilities. Let us illustrate the method on the following basic two-way contingency table. The margins of this sample differ from the known margins, here called expected totals. First the weights of the rows are corrected: each cell is multiplied by the ratio of the expected total to the actual one; this is step N = 1. The totals of each column are similarly corrected at step N = 2. Typically the margins of the rows then no longer match the expected frequencies.
Here they move in the right direction, and some estimators based on P_n may be improved. The last two operations are repeated until stabilization. The algorithm converges to the Kullback projection of the initial joint law, and the rate depends only on the initial table compared to the desired marginals. In our case it takes only 7 iterations to match the expected margins. The final raked frequencies are slightly moved away from the initial ones; however, this has to be compared with the natural sampling oscillation of order 1/√n. For small samples such changes are likely to occur and may improve a large class of estimators while worsening others. Our theoretical results show that the improvement is uniform over a large class as n → +∞ with N fixed.
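The alternating ratio steps described above can be sketched in a few lines; the following is a minimal illustration on a hypothetical 2×2 table with hypothetical known margins (not the numerical table of the text):

```python
import numpy as np

def rake(table, row_targets, col_targets, n_iter):
    """Alternately rescale rows then columns so that the margins of the
    joint table match the known target margins (raking-ratio / IPF)."""
    t = table.astype(float).copy()
    for _ in range(n_iter):
        t *= (row_targets / t.sum(axis=1))[:, None]   # ratio step on the rows
        t *= col_targets / t.sum(axis=0)              # ratio step on the columns
    return t

# hypothetical observed joint frequencies and known true margins
sample = np.array([[0.30, 0.20], [0.10, 0.40]])
rows = np.array([0.6, 0.4])   # known P(A_1), P(A_2)
cols = np.array([0.5, 0.5])   # known P(B_1), P(B_2)
raked = rake(sample, rows, cols, n_iter=50)
```

After the last column step the column margins match exactly, and the row margins match up to the geometric convergence error, illustrating the remark that the previously enforced constraints remain almost satisfied.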

.2 Counterexample of Remark J
Let us assume that P satisfies the following probability values and that f has the following conditional expectations.

.3 Raked empirical means over a class
General framework. Many specific settings in statistics may be modeled through F. Typically X is of very large or infinite dimension and each f(X) is one variable with mean P(f) in the population. To control correlations between such variables one needs to extend F into F× = {fg : f, g ∈ F} and consider the covariance process indexed by F×. Random vectors Y = (f_1(X), ..., f_k(X)) can in turn be combined into real-valued random variables g_θ(X) = ϕ_θ(Y_1, ..., Y_k) through parameters θ and functions ϕ_θ, and the resulting g_θ should be included in F, and so on. Consider for instance g_θ(X) = θ_1 Y_1 + ... + θ_k Y_k + ε_σ(X) with a collection of possible residual functions ε_σ turning part of the randomness of X into a noise with variance σ². The (VC) or (BR) entropy of F rules the variety and complexity of models or statistics one can simultaneously deal with. We refer to Pollard [20], Shorack and Wellner [21] and Wellner [29] for classical statistical models where an empirical process indexed by functions is easily identified.
Direct applications. Since the limiting process G^(N) of α_n^(N) has less variance than G^(0), Theorem 2.1 can be applied to revisit the limiting behavior of classical estimators or tests by using P_n^(N) instead of P_n, and to prove that the induced asymptotic variances or risks decrease. For instance, in the case of goodness-of-fit tests, the threshold decreases at any given test level while the power increases against any alternative distribution Q that does not satisfy the margin conditions. As a matter of fact, forcing P_n^(N) to look like P instead of the true Q over all A^(N) makes P_n^(N) go very far from P on the sets where Q was already far from P.
Example: two raked distribution functions. Let (X, Y) be a real centered Gaussian vector with covariance matrix ((3, −1), (−1, 1)). We consider the raked joint estimation of the two distribution functions F_X, F_Y. Auxiliary information provides their values at the points −2 to 2, every 0.5. The class F we need contains, for all t ∈ R, the functions f_t^X(x, y) = 1_{(−∞,t]}(x) and f_t^Y(x, y) = 1_{(−∞,t]}(y), thus (VC) holds. For Z = X, Y let F_{Z,n}^(N)(t) = Σ_{Z_i ≤ t} P_n^(N)({Z_i}) be the N-th raked empirical distribution function and write Z_(1) ≤ ··· ≤ Z_(n) for the order statistics.
To exploit the information at best we use N = 2m raking steps. This shows some improvement, especially for N = 10. For n rather small it seems not always relevant to wait for the stabilization of the algorithm, here denoted N = ∞. Our theoretical results provide guarantees only for N ≤ N_0 with N_0 fixed and n ≥ n_0 for some n_0 > 0. We also observed on graphical representations that the way F_{Z,n}^(N) leaves F_{Z,n}^(0) to cross F_Z at the known points tends to accentuate the error on a few short intervals where F_{Z,n}^(0) is far from F_Z. This is less the case as the auxiliary information partition is refined or the sample size increases.
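The raking of the sample weights underlying F_{Z,n}^(N) can be sketched as follows; the sample size, random seed and number of iterations are arbitrary illustrative choices, and the partitions are the cells delimited by the known grid points:

```python
import numpy as np
from math import erf

def Phi(x, sd):
    # centered Gaussian cdf with standard deviation sd
    return 0.5 * (1.0 + erf(x / (sd * np.sqrt(2.0))))

rng = np.random.default_rng(1)
n = 2000
cov = np.array([[3.0, -1.0], [-1.0, 1.0]])
XY = rng.multivariate_normal([0.0, 0.0], cov, size=n)
X, Y = XY[:, 0], XY[:, 1]

grid = np.arange(-2.0, 2.5, 0.5)      # points where F_X, F_Y are known
cX = np.digitize(X, grid)             # index of the X-partition cell of each point
cY = np.digitize(Y, grid)             # index of the Y-partition cell of each point

def cell_probs(sd):
    # true probabilities of the cells delimited by the grid points
    F = np.concatenate(([0.0], [Phi(t, sd) for t in grid], [1.0]))
    return np.diff(F)

pX, pY = cell_probs(np.sqrt(3.0)), cell_probs(1.0)

w = np.full(n, 1.0 / n)               # initial weights 1/n
for N in range(10):                   # N = 10 raking iterations, X-margin first
    c, p = (cX, pX) if N % 2 == 0 else (cY, pY)
    mass = np.bincount(c, weights=w, minlength=len(p))
    w *= p[c] / mass[c]               # ratio step: cell masses become the true probabilities

# raked empirical distribution function of X at the known points
F_X_raked = np.array([w[X <= t].sum() for t in grid])
```

After the last (Y-margin) step the Y-cell masses match the true probabilities exactly, while F_X_raked agrees with F_X at the grid points up to the convergence error of the algorithm.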
Example: raked covariance matrices. Given d ∈ N* and f_1, ..., f_d, let V(Y) denote the covariance matrix of the random vector Y = (f_1(X), ..., f_d(X)), which we assume to be centered for simplicity. Instead of the empirical covariance matrix we use its raked version. Let ‖·‖ denote the Frobenius norm and define ϕ_Y accordingly. In the context of Proposition 6, observe that ϕ_Y is (‖·‖_F, ‖·‖)-Lipschitz with parameter C_1 = d. Clearly ϕ_Y(G^(N)) has a bounded density since ϕ_Y²(G^(N)) is a quadratic form with Gaussian components and has a modified χ² distribution. Choosing a finite collection of such ϕ_Y ensures that C_2 < +∞. More generally, by letting (f_1, ..., f_d) vary in a small-entropy infinite subset L_d of F^d and imposing some regularity or localization constraints on the f_i, one may have C_2 < +∞ while {f_i f_j : f_i, f_j ∈ F} satisfies (BR). The largest C_2 still works for L = ∪_{d ≤ d_0} L_d. Therefore Proposition 6 guarantees the stated bound, where it holds, for all N ≤ N_0, d ≤ d_0, (f_1, ..., f_d) ∈ L_d and x > 0, by the variance reduction property of Proposition 8. Hence we asymptotically have P(ϕ_Y(α_n^(N)) > x) < P(ϕ_Y(α_n^(0)) > x) − ε uniformly among Y such that P(ϕ_Y(G^(N)) > x) < P(ϕ_Y(G^(0)) > x) − 2ε, for any fixed ε > 0.
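A minimal sketch of the statistic ϕ_Y, assuming uniform weights in place of the raked weights P_n^(N)({X_i}) and a hypothetical standard Gaussian sample, so that the true covariance V(Y) is the identity:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 3
Y = rng.standard_normal((n, d))   # hypothetical centered sample, rows Y_i = (f_1(X_i), ..., f_d(X_i))
w = np.full(n, 1.0 / n)           # uniform weights; raked weights P_n^(N)({X_i}) would replace them

V_hat = (w[:, None] * Y).T @ Y    # weighted covariance estimate, sum_i w_i Y_i Y_i^T
phi = np.linalg.norm(V_hat - np.eye(d))   # phi_Y: Frobenius distance to the true covariance
```

With raked weights, the variance reduction of Proposition 8 translates into a stochastically smaller ϕ_Y, uniformly over the collection L of such statistics.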