An averaged projected Robbins-Monro algorithm for estimating the parameters of a truncated spherical distribution

The objective of this work is to propose a new algorithm to fit a sphere on a noisy 3D point cloud distributed around a complete or a truncated sphere. More precisely, we introduce a projected Robbins-Monro algorithm and its averaged version for estimating the center and the radius of the sphere. We give asymptotic results such as the almost sure convergence of these algorithms as well as the asymptotic normality of the averaged algorithm. Furthermore, some non-asymptotic results will be given, such as the rates of convergence in quadratic mean. Some numerical experiments show the efficiency of the proposed algorithm on simulated data for small to moderate sample sizes.


Introduction
Primitive shape extraction from data is a recurrent problem in many research fields such as archeology [Thom, 1955], medicine [Zhang et al., 2007], mobile robotics [Martins et al., 2008], motion capture [Shafiq et al., 2001] and computer vision [Rabbani, 2006, Liu andWu, 2014].This process is of primary importance since it provides a high level information on the data structure.
First works focused on the case of 2D shapes (lines, circles), but recent technologies enable to work with three dimensional data.For instance, in computer vision, depth sensors provide 3D point clouds representing the scene in addition to usual color images.In this work, we are interested in the estimation of the center µ ∈ R 3 and the radius r > 0 of a sphere from a set of 3D noisy data.In practical applications, only a discrete set of noisy measurements is available.Moreover, sample points are usually located only near a portion of the spherical surface.Two kinds of problem can be distinguished: shape detection and shape fitting.Shape detection consists in finding a given shape in the whole data without any prior knowledge on which observations belong to it.In that case, the data set may represent several objects of different nature and may therefore contain a high number of outliers.Two main methods are used in practise to solve this problem.The Hough transform [Abuzaina et al., 2013] performs a discretization of the parameter space.Each observation is associated to a set of parameters corresponding to all possible shapes that could explain the sample point.Then, a voting strategy is applied to select the parameter vectors of the detected shapes.The advantage of this method is that several instances of the shape can be detected.However, a large amount of memory is required to discretize the parameter space, especially in the case of three dimensional models.The RANSAC (RANdom SAmple Consensus) paradigm [Fischler andBolles, 1981, Schnabel et al., 2007] is a probabilistic method based on random sampling.Observations are randomly selected among the whole data set and candidate models are generated.Then, shapes can be detected thanks to an individual scoring scheme.The success of the method depends on a given probability related to the number of sampling and the fraction of points belonging to the shape.
The shape fitting problem assumes that all the data points belong to the shape.For example, spherical fitting techniques have been used in several domains such as industrial inspection [Jiang and Jiang, 1998], GPS localization [Beck and Pan, 2012], robotics [Von Hundelshausen et al., 2005] and 3D modelling [Tran et al., 2015].Geometric and algebric methods have been proposed [Landau, 1987, Rusu et al., 2003, Al-Sharadqah, 2014] for parameters estimation.Moreover, let us note that fitting methods are generally applied for shape detection as a post-processing step in order to refine the parameters of the detected shapes [Tran et al., 2015].
In a recent paper, Brazey and Portier [Brazey and Portier, 2014] introduced a new spherical probability density function belonging to the family of elliptical distributions, and designed to model points spread near a spherical surface.This probability density function depends on three parameters, namely a center µ ∈ R 3 , a radius r > 0 and a dispersion parameter σ > 0. In their paper, the model is formulated in a general form in R d .To estimate µ and r, a backfitting algorithm (see e.g.[Breiman and Friedman, 1985]) similar to the one used in [Landau, 1987] is employed.A convergence result is given in the case of the complete sphere.However, no result is established in the case of a truncated sphere while simulations showed the efficiency of the algorithm.
The objective of this work is to propose a new algorithm to fit a sphere on a noisy 3D point cloud distributed around a complete or a truncated sphere.We shall assume that the observations are independent realizations of a random vector X defined as (1.1) where W is a positive real random variable such that E [W ] = 1, U Ω is uniformly distributed on a measurable subset Ω of the unit sphere of R 3 , W and U Ω are independent.Parameters µ ∈ R 3 and r > 0 are respectively the center and the radius of the sphere we are trying to adjust to the point cloud.Random variable W allows to model the fluctuations of points in the normal direction of the sphere.When Ω coincides with the complete sphere, then the distribution of X is spherical (see e.g.[Muirhead, 2009]).Indeed, if we set Y = (X − µ)/r, then the distribution of Y is rotationally invariant.
We are interested in estimating center µ and radius r.As U Ω = 1, we easily deduce from (1.1) that It is clear that from these two equations, we cannot deduce explicit estimators of parameters µ and r using the method of moments since each parameter depends on the other.To overcome this problem, we can use a backfitting type algorithm (as in [Brazey and Portier, 2014]) or introduce a recursive stochastic algorithm, which seems well-suited for this problem since equations (1.2) and (1.3) can also be derived from the local minimization of the following quadratic criteria Stochastic algorithms, and more precisely Robbins-Monro algorithms, are effective and fast methods (see e.g.[Duflo, 1997, Kushner and Yin, 2003, Robbins and Monro, 1951]).They do not need too much computational efforts and can easily be updated, which make of them good candidates to deal with big data for example.However, usual sufficient conditions to prove the convergence of this kind of algorithm are sometimes not satisfied and it is necessary to modify the basic algorithm.We can, for example, introduce a projected version of the Robbins-Monro algorithm which consists in keeping the usual estimators in a nice subspace with the help of a projection.Such an algorithm has been recently considered in [Bercu and Fraysse, 2012] and [Lacoste-Julien et al., 2012].
In this paper, due to the non global convexity of function G, we estimate parameters µ and r using a projected Robbins-Monro algorithm.We also propose an averaged algorithm which consists in averaging the projected algorithm.In general, this averaged algorithm allows to improve the rate of convergence of the basic estimators, or to reduce the variance, or not to have to make a good choice of the step sequence, which can be as exhaustive as to estimate the parameters.It is widely used when having to deal with Robbins-Monro algorithms (see [Polyak and Juditsky, 1992] or [Pelletier, 1998] amoung others).This paper is organized as follows.In Section 2, we specify the framework and assumptions.After a short explanation on the non-convergence of the Robbins-Monro algorithm, the projected algorithm and its averaged version are introduced in Section 3. Section 4 is concerned with the convergence results.Some simulation experiments are provided in Section 5, showing the efficiency of the algorithms.Proofs of the different results are postponed in Appendix.

Framework and assumptions
We consider in this paper a more general framework than the one described in the introduction.Let X be a random vector of R d with d ≥ 2. Let F denotes the distribution of X.We assume that X can be decomposed under the form (2.1) where µ ∈ R d , r > 0, W is a positive real continuous random variable (and with a bounded density if d = 2), U Ω is uniformly distributed on a measurable subset Ω of the unit sphere of R d .Moreover, let us suppose that W and U Ω are independent.
Model (2.1) allows to model a point cloud of R d spread around a complete or truncated sphere of center µ ∈ R d and radius r > 0. Random vector U Ω defines the position of the points on the sphere and random variable W defines the fluctuations in the normal direction of the sphere.As mentioned in the introduction, when Ω is the complete unit sphere, then the distribution of X is spherical.
When W satisfies the condition E [W ] = 1, the radius r is identifiable and can be directly estimated.Indeed, since However, this condition is sometimes not satisfied (as in [Brazey and Portier, 2014]) and only r := r E [W ] can be estimated.Therefore, in what follows, we are interested in estimating θ := µ T , r T , which will be denoted by (µ, r ) for the sake of simplicity.
We suppose from now that the following assumptions are fulfilled: The random vector X is not concentrated around µ: • Assumption [A2].The random vector X admits a second moment: These assumptions ensure that the values of X are concentrated around the sphere and not around the center µ, without in addition too much dispersion.This framework totally corresponds to the real situation that we want to model.Moreover, using (2.1), Assumptions [A1] and [A2] reduce to assumptions on W .More precisely, Let us now introduce two examples of distribution allowing to model points spread around a complete sphere and satisfying assumptions [A1] and [A2].
Example 2.1.Let us consider a random vector X taking values in R d with a distribution absolutely continuous with respect to the Lebesgue measure, with a probability density function f δ defined for all δ > 0 by where C d is the normalization constant.Then, we can rewrite X under the form (2.1) with Example 2.2.Let us consider the probability density function introduced in [Brazey and Portier, 2014].
where K d is the normalization constant.Then, a random vector X with probability density function f can be rewritten under the form (2.1), with E [W ] = 1, but E [W ] is closed to 1 when the variance σ is negligible compared to the radius r.
To obtain points distributed around a truncated sphere, it is sufficient to modify the previous densities by considering densities of the form f where Ω is the set of points of R d whose polar coordinates are given by (ρ, θ 1 , . . ., θ d−1 ∈ R * + × Θ) where Θ defines the convex part Ω of the surface of the unit sphere of R d we want to consider.

The algorithms
We present in this section two algorithms for estimating the unknown parameter θ which can be seen as a local minimizer (under conditions) of a function.Indeed, let us consider the function where we denote by g the function defined for any x ∈ R d and y = (z, a) ∈ R d × R * + by g(x, y) := ( x − z − a) 2 .The function G is Frechet-differentiable and we denote by Φ its gradient, which is defined for all y = (z, From (2.1) and definition of θ = (µ, r ), we easily verify that ∇G(θ) = 0. Therefore, since θ is a local minimizer of function G (under assumptions) or a zero of ∇G, an idea could be to introduce a stochastic gradient algorithm for estimating θ.

The Robbins-Monro algorithm.
Let (X n ) n≥1 be a sequence of independent and identically distributed random vectors of R d following the same law as X and let (γ n ) n≥1 be a decreasing sequence of positive real numbers satisfying the usual conditions When the functional G is convex or verifies nice properties, a usual way to estimate the unknown parameter θ is to use the following recursive algorithm with θ 1 chosen arbitrarily bounded.The term ∇ y g (X n+1 , θ n ) can be seen as an estimate of the gradient of G at θ n , and the step sequence (γ n ) controls the convergence of the algorithm.
The convergence of such an algorithm is often established using the Robbins-Siegmund's theorem (see e.g.[Duflo, 1997]) and a sufficient condition to get it, is to verify that for any y ∈ R d × R * + , Φ(y), y − θ > 0 where ., .denotes the usual inner product and . the associated norm.However, we can show that this condition is only satisfied for y belonging to a subset of R d × R * + to be specified.Thus, if at time (n + 1), the update of θ n (using (3.4)) leaves this subset, then it does not necessarily converge.Therefore, we have to introduce a projected Robbins-Monro algorithm.

The Projected Robbins-Monro algorithm.
Let K be a compact and convex subset of R d × R * + containing θ = (µ, r * ) and let π : R d × R * + −→ K be a projection satisfying where ∂K is the frontier of K.An example will be given later.
Then, we estimate θ using the following Projected Robbins-Monro algorithm (PRM), defined recursively by where θ 1 is arbitrarily chosen in K, and (γ n ) is a decreasing sequence of positive real numbers satisfying (3.3).
Of course the choice of subset K and projector π is crucial.It is clear that if K is poorly chosen for a given projector, the convergence of the projected algorithm towards θ will be slower, even if from a theoretical point of view, we shall see in the next section dedicated to the theoretical results, that this algorithm is almost the same as the traditional Robbins-Monro algorithm since the updates of θ n , ie. the quantities θ n − γ n ∇ y g X n+1 , θ n , leave K only a finite number of times.
Let us now discuss the choice of K and π.The choice of K is directly related to the following assumption that we introduce to ensure the existence of a compact subset on which the scalar product Φ(y), y − θ is positive.

• Assumption [A3].
There are two positive constants R µ and R r such that for all y = (z, a) , and λ max (M ) denotes the largest eigenvalue of matrix M , and Remark 3.1.The less the sphere is troncated, the more E [U Ω ] is close to 0 and the constraints on R µ and R r are relaxed.In particular, when the sphere is complete, ie.U Ω = U where U denotes the random vector uniformly distributed on the whole unit sphere of R d , then E [U Ω ] = 0 and Assumption [A3] reduces to sup z∈B(µ,Rµ) The main consequence of Assumption [A3] is the following proposition which is one of the key point to establish the convergence of the PRM algorithm.
Proof.The proof is given in Appendix A.
Assumption [A3] is therefore crucial but only technical.It reflects the fact that the sphere is not too much truncated and that the points are not too far away from the sphere which corresponds to the real situations we want to model.In a general framework, this technical assumption is difficult to verify since it requires to specify the distribution of X.In the case of distribution of Example 2.1 with δ < 1/10, we can easily exhibit constant R µ and R r .Indeed taking R µ = R r = r * /10, then assumption [A3] holds.When the distribution of X is compactly supported with a support included in [1 − δ, 1 + δ], it is fairly easy to find the constants provided that δ is small enough.It is quite more difficult when dealing with distribution of Example 2.2.Nevertheless, topological results can ensure that these constants exist.
From constants R µ and R r of Assumption [A3], it is then possible to simply define a projector π which satisfies condition (3.5).Indeed, let us set K = K µ × K r with K µ = B(µ, R µ ) and K r = B(r * , R r ), and define for any Such projector satisfies the requested conditions.However, it is clear that this projector can not be implemented since µ and r * are unknown.We shall see in the simulation study how to overcome this problem.
We suppose from now that K is a compact and convex subset of where ∂K is the frontier of K, i.e there is a positive constant d min such that B (θ, d min ) ⊂ K.

The averaged algorithm.
Averaging is a usual method to improve the rate of convergence of Robbins-Monro algorithms, or to reduce the variance, or finally not to have to make a good choice of the step sequence (see [Polyak and Juditsky, 1992]), but for the projected algorithms, this method is not widespread in the litterature.In this paper, we improve the estimation of θ by adding an averaging step to the PRM algorithm.Starting from the sequence ( θ n ) n≥1 given by (3.6), we introduce for any n ≥ 1, which can also be recursively defined by We shall see in the following two sections, the gain provided by this algorithm.

Convergence properties
We now give asymptotic properties of the algorithms.All the proofs are postponed in Appendix B. The following theorem gives the strong consistency of the PRM algorithm as well as properties on the number of times we really use the projection.
Theorem 4.1.Let (X n ) be a sequence of iid random vectors following the same law as Moreover, the number of times the random vectors θ n − γ n ∇ y g X n+1 , θ n do not belong to K is almost surely finite.
The following theorem gives the rate of convergence in quadratic mean and the L p rates of convergence of the PRM algorithm (under conditions) as well as an upper bound of the probability that the random vector θ n − γ n ∇ y g X n+1 , θ n does not belong to K. Theorem 4.2.Let (X n ) be a sequence of iid random vectors following the same law as X.Assume that [A1] to [A3] hold and consider a step sequence (γ n ) of the form Moreover, for all positive integer p such that E X − µ 2p < ∞, there is a positive constant C p such that for all n ≥ 1, and for all n ≥ 1, , where d min := inf y∈∂K { y − θ } and ∂K is the frontier of K.
We now focus on the asymptotic behavior of the averaged algorithm.First of all, applying Theorem 4.1 and Toeplitz's lemma for example, we easily obtain the strong consistency of the averaged estimator θ n .Introducing the following assumption, we can specify its rate of convergence in quadratic mean as well as its asymptotic normality.
Note that thanks to topological results, this assumption also implies Proposition 3.1 but is not useful to obtain the constants R µ and R r .Nevertheless, this assumption is crucial to establish the results of the two following theorems but it is satisfied as soon as the sphere is not too much truncated and the dispersion around the sphere not too important which corresponds to the real situations encountered.Using model (2.1), Γ θ rewrites under the form (4.1) When the sphere is complete, ie.
(1/d)I d and λ min (Γ θ ) > 0 as soon as β < d/(d − 1).In the case of distribution of Example 2.1, we have β = (log(1+δ)−log(1−δ))/(2δ) and [A4] is satisfied as soon as δ is small enough.In the case of distribution of Example 2.2, [A4] is satisfied as soon as r > > σ.When the sphere is not complete, we can easily show that a sufficient condition to ensure [A4] is λ min (Var [U Ω ]) < 1 − 1/β, where Var [U Ω ] is the covariance matrix of the random variable U Ω .In the case of the half sphere and d = 3, we have λ min (Var [U Ω ]) = 1/12 and Γ θ is definite positive as soon as β < 12/11.This condition holds for distribution of Example 2.1 with δ < 0.4 for instance, and distribution of Example 2.2 as soon as r > > σ.
Theorem 4.3.Let (X n ) be a sequence of iid random vectors following the same law as X.Assume that [A1] to [A4] hold and consider a step sequence (γ n ) of the form Then there is a positive constant C such that for all n ≥ 1, With respect to results of Theorem 4.2, we clearly improve the rate of convergence in quadratic mean.Note that the computed rate is the optimal one for such stochastic algorithms.We finally give a central limit theorem which can be useful to build confidence balls for the different parameters of the sphere.
Theorem 4.4.Let (X n ) be a sequence of iid random vectors following the same law as X and let us choose the step sequence (γ n ) of the form From result (4.2) of Theorem 4.4, we easily derive that Therefore, in order to build confidence balls or statistical tests for the parameters of the sphere, matrices Γ θ and Σ must be estimated.
Let us decompose θ n under the form (Z n , A n ) where Z n ∈ R d estimates the center µ and A n ∈ R * + the radius r * , and let us denote Then we can estimate Γ θ and Σ by Γ n and Σ n iteratively as follows where Σ 1 = I d+1 and Γ 1 = I d+1 to avoid usual problems of invertibility.It is not hard to show that Γ n and Σ n respectively converge to Γ θ and Σ and then deduce that (4.5) The simulation study of the next section will illustrate the good approximation of the distribution of Q n by the standard gaussian for moderate sample sizes.

Some experiments on simulated data
We study in this section the behavior of the PRM and averaged algorithms on simulated data in the case d = 3, for small to moderate sample sizes.This section first begins with the specification of the compact set involved in the definition of the PRM algorithm which is of course a crucial point.We then study the performance of the two algorithms in the case of the whole sphere with the distributions of Examples 2.1 and 2.2.Finally, we consider the case of the truncated sphere (a half-sphere) and we compare our strategy with the one proposed by [Brazey and Portier, 2014].
In this simulation study, we shall always consider the same sphere defined by its center µ = (0, 0, 0) T and its radius r = 50.In addition, to reduce sampling effects, our results are based on 200 samples of size n.Finally, let us mention that simulations were carried out using the statistical software R (see R Core Team, 2013).5.1.Choice of the compact set and of the projection.We discuss here the crucial point of the choice of the compact set K and of the projection π involved in the definition of the PRM algorithm.The main problem is to find a compact set containing the unknown parameter θ.We propose to build a preliminary estimation of θ, using a geometric approach which consists in finding the center and the radius of a sphere of R 3 from 4 non-coplanar distinct points.We denote by (µ 0 , r 0 ) this initial estimate of θ.From this estimate, we define the compact set K by K := K µ 0 ×K r 0 with K µ 0 := B(µ 0 , r 0 /10) and K r 0 := B(r 0 , r 0 /10), where the choice of the value r 0 /10 for the radius of the balls is justified by the discussion about Assumption [A3] in Section 3.2.We then define the projector π as follows: for any y = (z, a) ∈ R 3 × R * + , we set π(y) := (π µ 0 (z), π r 0 (a)) with With this strategy, we can raisonnably hope that if our initial estimate is not too poor, then the true parameter belongs to K and the quadratic criteria G is convex on K.
We will see below that even if this preliminary estimation is rough, the true parameter belongs to K and the PRM algorithm improves the estimation of θ.
Let us now describe our strategy to obtain a preliminary estimation of the parameter θ = (µ, r ).Since the data points are spread around the sphere, the estimation of the parameters from only one quadruplet of points is not robust to random fluctuations.In order to make the estimation more robust, we consider instead N quadruplets sampled with replacement from the first K points of the sample X 1 , . . ., X n .For each quadruplet, we calculate the center of the sphere which passes through these four points, which gives a sequence of centers ( µ i ) 1≤i≤N .The initial estimate of the center, denoted by µ 0 , is then computed as the median point.Finally, we obtain an estimation of the radius by calculating the empirical mean of the sequence ( X i − µ 0 ) 1≤i≤50 .
A simulation study carried out for various values of K and N in the case of the whole and truncated sphere, shows that by taking K = 50 and N = 200, we obtain a preliminary estimation of θ sufficiently good to ensure that the compact K contains θ.
To close this section, let us mention that although the initial estimate is quite accurate, it is necessary to project the Robbins-Monro algorithm to ensure the convergence of the estimator.Indeed, taking a step sequence of the form γ n = c γ n −α , the results given in Table 1 show that for some values of c γ and α, the parameter θ is poorly estimated by the Robbins-Monro algorithm, while the PRM algorithm (Table 2) is less sensitive to the step sequence choice.α 0.51 0.6 0.66 0.75 0.99 1 0.27 0.14 0.09 0.06 0.23 c γ 5 10 8 10 6 10 5 10 4 10 5 10 10 31 10 18 10 14 10 10 10 6 In the sequel of the simulation study, we take a step sequence of the form γ n := n −2/3 (α = 2/3 is often considered as the optimal choice in the literature).

Case of the whole sphere.
In what follows, we are interested in the behavior of the PRM and averaged algorithms when samples are distributed on the whole sphere according to the distribution of Example 2.1 with δ = 0.1.
Figure 1 shows that the accuracy of the estimations increases with the sample size.In particular, as expected, the PRM algorithm significantly improves the initial estimations of the center and the radius (see the first boxplots which correspond to the initial estimations).Moreover, as expected in the case of the "whole sphere", we can see that the three components of the center µ are estimated with the same accuracy.Let us now examine the gain provided by the use of the averaged algorithm.Figure 2 shows that for small sample sizes, the performances of the two algorithms are comparable, but when n is greater than 500, the averaged algorithm is more accurate than the PRM algorithm.We can even think that by forgetting the first estimates of the PRM algorithm, we improve the behavior of the averaged algoritm when the sample size is small.q q q q q q q q q q q q q q q q 200 200 500 500 1000 1000 1500 1500 2000 2000 48 49 50 51 52 q q q q q q q q q q q q q q q q 200 200 500 Finally, let us study the quality of the Gaussian approximation of the distribution of Q n for a moderate sample size.This point is crucial for building confidence intervals or statistical tests for the parameters of the sphere.
Figure 3 shows that this approximation is reasonable when n = 2000.Indeed, we can see that the estimated density of each component of Q n is well superimposed with the density of the N (0, 1).To validate these approximations, we perform a Kolmogorov-Smirnov test at level 5%.The test enables us to conclude that the normality is not rejected for each component of Q n .

5.3.
Comparison with a backfitting-type algorithm in the case of a half-sphere.In this section, we compare the performances of the averaged algorithm with the ones of the backfitting algorithm introduced by [Brazey and Portier, 2014].In what follows, we consider samples coming from the distribution of Example 2.2, with σ = 1, in the case of the half sphere defined by the set of points whose y-component is positive.
Results obtained with the two algorithms are presented in Figure 4. We focus on parameter µ y for the center since it is the more difficult to estimate.We can see that even if the backfitting (BF for short) algorithm is better than the averaged algorithm, the performances are globally good, which validates the use of our algorithm for estimating the parameters of a sphere from 3D-points distributed around a truncated sphere.Recall that convergence results are available for our algorithm in the case of the truncated sphere, contrary to the backfitting algorithm for which no theoretical result is available in that case.q q q q q q q q q q q 500 500 1000 1000 1500 1500 2000 2000 −1.0 −0.5 0.0 0.5 1.0 q q q q q q q q q q q q q q 500 500 1000

Conclusion
We presented in this work a new stochastic algorithm for estimating the center and the radius of a sphere from a sample of points spread around the sphere, the points being distributed around the complete sphere or only around a part of the sphere.
We shown on simulated data that this algorithm is efficient, less accurate than the backfitting algorithm proposed in [Brazey and Portier, 2014] but for which no convergence result is available for the case of the truncated sphere.Therefore, our main contribution is to have proposed an algorithm for which we have given asymptotic results such as its strong consistency and its asymptotic normality which can be useful to build confidence balls or statistical tests for example, as well as non asymptotic results such as the rates of convergence in quadratic mean.
A possible extension of this work could be to extend the obtained results to the case of the finite mixture model.This framework has been considered in [Brazey and Portier, 2014] but no convergence result is established.Proposing a stochastic algorithm for estimating the different parameters of the model and obtaining convergence results would be a nice challenge.
Appendix A. Some convexity results and proof of proposition 3.1 The following lemma ensures that the Matrix in Assumption [A3] is well defined and that the Hessian of G exists for all y ∈ R d × R.
Moreover, suppose that W admits a bounded density, then for all d ≥ 2, there is a positive constant C such that for all Note that for the sake of simplicity, we denote by the same way the two constants.
Proof of Lemma A.1.
Step 1: d ≥ 3 By continuity and applying Assumption [A1], there are positive constants , C such that for all z ∈ B (µ, ), with M positive and defined later.Moreover, let t ≥ M , consists in measuring the intersection between a truncated sphere with radius bigger than /2 with a ball of radius 1 t , with 1 t ≤ 2 .This is smaller than the surface of the frontier of the ball (see the following figure).Thus, there is a positive constant k such that for all t ≥ M , (A.1) We conclude the proof taking Step 2: d = 2 and W admits a bounded density Let f max be a bound of the density function of W .As in previous case, let As in previous case, if t ≥ 2 , there is a positive constant k such that for all t ≥ 2 , Moreover, since f max is a bound of the density function of W , Thus, for all t ≥ 2 , , and in a particular case, and one can conclude the proof taking C = max {C , 2 −1 + krf max }.
In order to linearize the gradient in the decompositions of the PRM algorithm and get a nice decomposition of the averaged algorithm, we introduce the Hessian matrix of G, denoted, for all y = (z, a) ∈ R d × R, by Γ y : R d × R −→ R d × R and defined by : (A.5) Proof of Proposition A.1.Under Assumption [A1], by continuity, there are positive constants C , such that for z ∈ B (µ, ), Moreover, note that for all y ∈ K, Thus, with analogous calculus to the ones in the proof of Lemma 5.1 in [Cardot et al., 2015], one can check that there is a positive constant C such that for all y ∈ B (θ, ) ∩ K, Moreover, for all y = (z, a) ∈ K and y = (z , a Thus, applying Cauchy-Schwarz's inequality, Thus, applying Lemma A.1, there are positive constants A 1 , A 2 such that Note that since K is compact and convex, there is a positive constant C K such that for all y ∈ K and t ∈ [0, 1], θ + t(y − θ) ≤ C K , and in a particular case, Thus, for all y ∈ K such that y − θ ≥ , Thus, we conclude the proof taking

Appendix B. Proof of Section 4
Proof of Theorem 4.1.Let us recall that there is a positive constant c such that for all y ∈ K, Φ(y), y − θ ≥ c y − θ 2 .The aim is to use previous inequality and the fact that the projection is 1-lipschitz in order to get an upper bound of E θ n+1 − θ

2
|F n and apply Robbins-Siegmund theorem to get the almost sure convergence of the algorithm.
Almost sure convergence of the algorithm: Since π is 1-lipschitz, Applying Robbins-Siegmund's theorem (see [Duflo, 1997] for instance), θ n − θ 2 converges almost surely to a finite random variable, and in a particular case,

Number of times the projection is used
Let N n := n k=1 1 { θ k −γ k ∇y(X k+1 , θ k )/ ∈K} .This sequence is non-decreasing, and suppose by contradiction that N n goes to infinity.Thus, there is a subsequence ∈ K, and in a particular case, θ n k +1 ∈ ∂K, where ∂K is the frontier of K. Let us recall that θ is in the interior of K, i.e let d min := inf y∈∂K θ − y , we have d min > 0. Thus,

and, lim
k→∞ which leads to a contradiction.

Proof of Theorem 4.2. Convergence in quadratic mean
The aim is to obtain an induction relation for the quadratic mean error.Let us recall inequality (B.1), Then we have and one can conclude the proof with the help of an induction (see [Godichon, 2015] for instance) or applying a lemma of stabilization (see [Duflo, 1996]).
L p rates of convergence Let p ≥ 2, we now prove with the help of a strong induction that for all integer p ≤ p, there is a positive constant C p such that for all n ≥ 1, This inequality is already checked for p = 1.Let p ≥ 2, we suppose from now that for all integer k < p , there is a positive constant C k such that for all n ≥ 1, We now search to give an induction relation for the L 2p -error.Let us recall that We suppose from now that E [W 2p ] < +∞ (and in a particular case, E W k < +∞ for all integer k ≤ 2p) and let U n+1 := ∇ y g X n+1 , θ n .We have 3) The aim is to bound each term on the right-hand side of previous inequality.In this purpose, we first need to introduce some technical inequalities.
Thus, applying Lemma A.1 in [Godichon, 2015] for instance, for all integer k ≤ p , In a particular case, since for all k ≤ p, E W 2k < +∞, there are positive constants A 1,k , A 2,k such that for all n ≥ 1, We can now bound the expectation of the three terms on the right-hand side of inequality (B.3).First, since θ n is F n -measurable, applying inequality (B.4), let A 1,k , using previous inequality and by induction, In the same way, applying Cauchy-Schwarz's inequality, let .
With analogous calculus to the ones for inequality (B.5), one can check that there is a positive constant B such that for all n ≥ 1, Thus, Finally, applying Lemma A.1 in [Godichon, 2015] Thus, with analogous calculus to the ones for inequality (B.5), one can check that there is a positive constant B such that for all n ≥ 1, (B.7) Finally, applying inequalities (B.5) to (B.7), there are positive constants B 1 , B 2 such that for all n ≥ 1, Thus, with the help of an induction on n or applying a lemma of stabilization (see [Duflo, 1996] for instance), one can check that there is a positive constant C p such that for all n ≥ 1, which concludes the induction on p and the proof.
Proof of Theorem 4.3.The aim is, in a first time, to exhibit a nice decomposition of the averaged algorithm.In this purpose, let us introduce this new decomposition of the PRM algorithm (B.9) Remark that (ξ n ) is a sequence of martingale differences adapted to the filtration (F n ) and r n is equal to 0 when θ n − γ n ∇ y g X n+1 , θ n ∈ K.Moreover, linearizing the gradient, decomposition (B.9) can be written as where δ n := Φ θ n − Γ θ θ n − θ is the remainder term in the Taylor's expansion of the gradient.This can also be decomposed as As in [Pelletier, 2000], summing these equalities, applying Abel's transform and dividing by n, We now give the rate of convergence in quadratic mean of each term using Theorem 4.2.In this purpose, let us recall the following technical lemma.B.1 ([Godichon, 2015]).Let Y 1 , ..., Y n be random variables taking values in a normed vector space such that for all positive constant q and for all k ≥ 1, E [ Y k q ] < ∞.Thus, for all constants a 1 , ..., a n and for all integer p, (B.12)

Lemma
The remainder terms: First, one can check that In the same way, applying Theorem 4.2, Thanks to Lemma A.1, there is a positive constant C θ such that for all n ≥ 1, Thus, applying Lemma B.1 and Theorem 4.2, there is a positive constant C 2 such that Let U n+1 := ∇ y g X n+1 , θ n , note that if θ n − γ n U n+1 ∈ K, then r n = 0. Thus, applying Lemma B.1 and Cauchy-Schwarz's inequality, Moreover, since π is 1-lipschitz, Thus, applying inequality (B.4), there are positive constants A 1 , A 2 such that for all n ≥ 1, In a particular case, applying Theorem 4.2, there is a positive constant A 3 such that for all n ≥ 1, E r n 4 ≤ A 3 n 2α .Moreover, applying Theorem 4.2, there is a positive constant C 6 such that for all n ≥ 1, The martingale term: Since (ξ n ) is a sequence of martingale differences adapted to the filtration (F n ), Moreover, Finally, applying inequality (B.4) and Theorem 4.2, there is a positive constant M such that Then, which concludes the proof.
Proof of Theorem 4.4.Let us recall that the averaged algorithm can be written as follows (B.17 We now prove that the first terms on the right-hand side of previous equality converge in probability to 0 and apply a Central Limit Theorem to the last one.The martingale term Let θ n = (Z n , A n ) ∈ R d × R, then ξ n+1 can be written as ξ n+1 = ξ n+1 + n+1 + n+1 , with µ−X n+1 − E Zn−X n+1 X n+1 −Zn − µ−X n+1 µ−X n+1 F n 0 .
Note that (ξ n ) , ( n ) , ( n ) are martingale differences sequences adapted to the filtration (F n ).Thus, Moreover, Thus, one can check that Thus, applying Theorem 4.2, As a particular case, This last term is closely related to the gradient of the function we need to minimize to get the geometric median (see [Kemperman, 1987] for example) and it is proved in [Cardot et al., 2015] that since Lemma A.1 is verified, then Thus, Finally, since ( n ) is a sequence of martingale differences adapted to the filtration (F n ), applying Theorem 4.2 and Cauchy-Schwarz's inequality, Note that with more assumptions on W , we could get a better rate but this one is sufficient.Indeed, thanks to previous inequality, Finally, applying a Central Limit Theorem (see [Duflo, 1997] for example), we have the convergence in law which also can be written as

Figure 1 .
Figure 1.Whole sphere with distribution of Example 2.1.From the left to the right, boxplots of estimates of µ x , µ y , µ z and r obtained with the PRM algorithm for different sample sizes.

Figure 2 .
Figure 2. Whole sphere with distribution of Example 2.1.Boxplots of estimates of µ y (left) and r (right) obtained with the PRM algorithm (in red) and with the averaged algorithm (in blue) for different sample sizes.

Figure 3 .
Figure 3. From the left to the right, estimated densities of each components of Q 2000 superimposed with the standard gaussian density.

Figure 4 .
Figure 4. Comparison of averaged and BF algorithms.Boxplots of the estimates of µ y (on the left) and r (on the right), obtained with the BF algorithm (in blue) and with the averaged algorithm (in red) for the half sphere in the case of Example 2.2.

Table 1 .
Robbins-Monro algorithm.Errors in quadratic mean of the 200 estimations of the center µ for samples of size n = 2000 in the case of the distribution of Example 2.1.

Table 2 .
PRM algorithm.Errors in quadratic mean of the 200 estimations of the center µ for samples of size n = 2000 in the case of the distribution of Example 2.1.