Prediction by quantization of a conditional distribution

Abstract: Given a pair of random vectors (X, Y), we consider the problem of approximating Y by c(X) = {c_1(X), ..., c_M(X)}, where c is a measurable set-valued function. We give meaning to the approximation by using the principles of vector quantization, which leads to the definition of a multifunction regression problem. The formulated problem amounts to quantizing the conditional distributions of Y given X. We propose a nonparametric estimate of the solutions of the multifunction regression problem by combining the method of M-means clustering with the nonparametric smoothing technique of k-nearest neighbors. We provide an asymptotic analysis of the estimate and derive a convergence rate for its excess risk. The proposed methodology is illustrated on simulated examples and on a speed-flow traffic data set emanating from the context of road traffic forecasting.


Introduction
Regression analysis encompasses important statistical methods for exploring the relationship between a response variable Y and a predictor X. Most commonly, the focus is on estimating (or modeling) the regression function η(x) := E[Y|X = x] by methods of various sorts [see e.g. 41,20]. Over the years, alternatives to mean regression (that is, estimation of the regression function η) have been proposed and analyzed in the literature. Among these, median regression [as a special case of quantile regression 27] exhibits properties of robustness to outliers [22]. Another alternative is modal regression. In [28,29] [see also 25,46] the mode of the conditional distribution of Y given X is modeled as a linear function of x. A related setting is that considered in [42], where the dependence of the conditional mode on the predictor x is monotone. Typical nonparametric approaches to conditional mode estimation resort to first estimating the conditional densities using a nonparametric method, and then inferring the mode by maximization, as in [10] for instance, and in [47] for a generalization of this approach using local polynomials.
Yet in the situation where the data are heterogeneous, summarizing the conditional distribution of Y given X by a single measure of location (mean, median, or mode) may be inadequate. As an illustration, consider the scatterplot represented in Figure 1. The distribution of Y given X is a mixture of two Normal distributions with equal proportions, equal variances, and means η_1(x) < η_2(x), and X follows a uniform distribution over the unit interval. The difference in means η_2(x) − η_1(x) increases with x, so that the conditional distribution of Y given X is clustered into two distinct groups, all the more separate as x is large. By construction, η(x) = (1/2)[η_1(x) + η_2(x)] and η is an increasing function of x. Thus the regression function is well representative of the average trend in the data but provides a limited summary of the distribution of Y given X, since the latter is bimodal. Instead, the set-valued map, also referred to as a multi-valued function or a multifunction, defined by x → {η_1(x), η_2(x)}, would better capture the structure of the data than a real-valued map such as the conditional mean (or mode or median) function.
[Figure 1: X is uniform over [0, 1], and the distribution of Y given X is a mixture of two Normal distributions with weights equal to 1/2, variances equal to 0.01, and means η_1(x) = arctan(8x) (green curve) and η_2(x) = 2 arctan(6x) (red curve). The regression function E[Y|X = x] = (1/2)[η_1(x) + η_2(x)] is represented by the dashed curve.]
Fitting a finite mixture model is a popular approach for modeling such heterogeneous data. These models are typically studied in an estimation framework [see e.g. 16,45,34], where an application of the maximum likelihood principle defines the estimation method of choice. For purposes of regression analysis, a finite mixture regression model is obtained by conditioning a finite mixture distribution on a vector of covariates, as in [26]. For instance, the data represented in the scatterplot of Figure 1 are drawn from a Gaussian mixture regression model with two components, and the interest would be primarily in estimating the mean curves η_1(x) and η_2(x), in addition to the mixture proportions and the variances of the components of the model. Finite mixture regression models provide a flexible way of handling heterogeneous data and are receiving growing attention from the statistical community, with recent results giving performance bounds even in a high-dimensional setting [see e.g. 43,35,12]. These models are also known as mixtures of experts [23,24] in machine learning.
Another class of methods centered on modal regression has developed to estimate the set of all the modes of the conditional distribution, that is, the set of points of local maximum of the conditional density, called the modal set. Following [15], [8] propose a plug-in nonparametric estimate of the modal set based on a kernel density estimate. This contrasts with early works on modal regression [e.g. 28,29,42], which focused primarily on estimating the principal mode, motivated by concerns of robustness to outliers. Formally, [8] defines the modal set Mod(x) at some point x as the set of points y in R where the conditional density f_{Y|X=x}(y) of Y given X = x satisfies f′_{Y|X=x}(y) = 0 and f″_{Y|X=x}(y) < 0, and it is assumed that Mod(x) is finite (we note that, as defined, Mod(x) is a subset of the set of points of local maximum). Hence the map x → Mod(x) is a multivalued function. From an algorithmic standpoint, [15] propose to estimate the modal set with a conditional version of the mean-shift algorithm, which is a modified version of the mean-shift algorithm used in the context of density mode clustering [9,11,4], and [8] prove that the resulting modal set estimate is consistent.
Arguably, nonparametric modal regression may prove effective especially when the conditional distributions admit only a limited number of local modes, as in the speed-flow traffic data reported in [15]. There, the conditional distribution of the speed of vehicles on a Californian freeway given the traffic flow is found to be bimodal over a range of small flow values, and then unimodal for larger values of the flow. In this example, the modal set is composed of at most two points. However, in a situation where the conditional distributions admit a large number of local maxima, the modal set might be difficult to interpret (the modal set may even be uncountable, when the latter is defined as the set of points of local maximum). Therefore, there is a need for a regression methodology that can extract potentially more than one feature from the data, in a manner similar to modal regression, while keeping the number of features relatively small, or even under control, so as to preserve their interpretability.
In this paper, we propose to apply the principles of vector quantization [17,18,32] to the conditional distributions of Y given X in order to define, given an integer M, a set-valued function x → c(x) := {c_1(x), ..., c_M(x)} meant to capture the underlying structure of the data, thereby extending the regression problem to a multifunction regression problem. We define optimality in terms of the mean squared error, or predictive risk,

E(c) := E[ min_{1≤j≤M} ‖Y − c_j(X)‖² ],   (1)

and we focus on the estimation of a multifunction c* which achieves the infimum of E(c) over the set of measurable multifunctions of the form x → c(x) := {c_1(x), ..., c_M(x)}. When M = 1, E(c) reduces to the L² risk E‖Y − f(X)‖² of a real-valued function f used in regression analysis. Hence E(c) is a natural extension of the L² risk to multifunctions. We emphasize that, even when M = 1, our objective is not to quantize the regression function η, a problem studied in [21], but to estimate an optimal multifunction c*, which can be represented by the M measurable functions x → c*_j(x), for j = 1, ..., M. Given IID data (X_1, Y_1), ..., (X_n, Y_n) with the same distribution as (X, Y), we propose a nonparametric estimate ĉ_n(x) := {ĉ_{n,1}(x), ..., ĉ_{n,M}(x)} defined by combining the approach of k-means clustering [see e.g. 14, Chap. 10] with the smoothing technique of k-nearest neighbors averaging [see e.g. 20]. Underlying the definition of ĉ_n is the estimation of E(c) with k-nearest neighbors smoothing, followed by minimization of this estimate over the set of measurable multifunctions x → c(x) := {c_1(x), ..., c_M(x)}. As will be argued further in the paper, the minimization problem over c can be reduced to a collection of quantization problems indexed by x, which leads to a simple algorithm for evaluating the estimate ĉ_n(x) at any x. We measure the performance of the estimate by the excess risk

R(ĉ_n) := E[E(ĉ_n)] − inf_{c ∈ F_M} E(c).   (2)

Notice that when M = 1, ĉ_n reduces to a single-valued function ĉ_{n,1}, and the excess risk coincides with E‖ĉ_{n,1}(X) − η(X)‖², the expectation of the L² error of the estimate ĉ_{n,1} of the regression function η.
To summarize, we make the following contributions in the present paper:
• We state a multifunction regression problem and we study its solutions.
• We propose a nonparametric multifunction estimate ĉ_n defined by combining the method of M-means with the smoothing technique of k-nearest neighbors averaging. We propose a simple companion algorithm to compute the value of the estimate.
• We prove the consistency of the estimate and we derive convergence rates on the excess risk and on a pointwise version of the excess risk.
• We propose a heuristic for automatically selecting the number of neighbors of the estimate. We also study the automatic selection of the number of quantization points. We illustrate the methods on two simulated examples and on a data set of speed records versus location along an automobile path in the city of Toulouse, France.
The paper is organized as follows. In section 2, we summarize the foundational principles of vector quantization and of the design of empirical vector quantizers by minimization of the empirical risk. In section 3, we define the multifunction regression problem, emphasizing potential measurability issues that we address. In section 4, we define our proposed estimate ĉ_n, and in section 5, we provide an asymptotic analysis of the estimate. First, in Theorem 1, we obtain a bound on the pointwise version of the excess risk, under a local regularity condition on the conditional densities. Then we derive a convergence rate on the excess risk in Theorem 2 under mild regularity conditions. In Theorem 3, we prove a convergence result of the estimate towards elements of the solution set of the multifunction regression problem. Then in section 6, we report on practical implementation details and on numerical experiments, and we also apply the methodology to speed data along an automobile path in the city of Toulouse, France. The proofs of the theorems are given in section 7.

Vector quantization
In this section, we collect foundational materials on vector quantization. We start by formulating the quantization problem and by defining the notion of an optimal quantizer. Then we describe the application of the principle of empirical risk minimization to the design of an empirical quantizer from IID data.

The quantization problem
Vector quantization refers to the process of discretizing a random vector by a random variable that can take only a finite number of values [17,18,32]. Known as lossy data compression in information theory and signal processing, vector quantization forms the basic principle of the method of k-means for data clustering [38] and is also used in defining numerical integration schemes [36].
Let Y be a random vector in R^p with distribution P_Y. Given an integer M, an M-points quantizer is a map q : R^p → R^p whose image is a finite set {c_1, ..., c_M} of M points of R^p. Using the Euclidean norm ‖·‖ on R^p, the performance of a quantizer q is measured by the distortion

E(q; P_Y) := E‖Y − q(Y)‖².

An M-points nearest-neighbor quantizer is a quantizer q_c of the form q_c(y) = c_{j(y)}, with j(y) ∈ arg min_{1≤j≤M} ‖y − c_j‖, where ties are broken arbitrarily, and where c := (c_1, ..., c_M) is a configuration, or codebook, of M points in R^p. Any quantizer q defines a partition of R^p into the sets q^{−1}(c_j), for j = 1, ..., M. In the case of a nearest-neighbor quantizer q_c, the partition is called a Voronoi partition and, for any j = 1, ..., M, the (closed) Voronoi cell V_j(c) associated with c_j is defined by

V_j(c) := {y ∈ R^p : ‖y − c_j‖ ≤ ‖y − c_i‖ for all i = 1, ..., M}.
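To fix ideas, a nearest-neighbor quantizer and its empirical distortion can be sketched in a few lines of numpy. This is an illustration of the definitions above, not code from the paper; the function names `nn_quantize` and `distortion` are ours.

```python
import numpy as np

def nn_quantize(Y, c):
    """Assign each row of Y to its nearest codebook point.

    Y : (n, p) array of points; c : (M, p) array, the configuration/codebook.
    Returns the (n,) array of indices j(y) = argmin_j ||y - c_j|| (ties go
    to the smallest index, a particular arbitrary tie-breaking rule)."""
    d2 = ((Y[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)  # (n, M) squared distances
    return d2.argmin(axis=1)

def distortion(Y, c):
    """Distortion E||Y - q_c(Y)||^2 under the empirical measure of the rows of Y."""
    d2 = ((Y[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()
```

For instance, with a two-point codebook c = ((0.5), (10)) and the one-dimensional sample (0, 1, 9, 11), the first two points fall in the Voronoi cell of 0.5 and the last two in that of 10.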

Optimal quantizers
The search for an optimal quantizer minimizing the distortion can be restricted to the class of nearest-neighbor quantizers [18, Lemma 3.1]. In the present work, only nearest-neighbor quantizers are considered, and a nearest-neighbor quantizer q_c is referred to by the configuration c := (c_1, ..., c_M) from which it is defined. A configuration c := (c_1, ..., c_M) will simply be called a quantizer, and the distortion E(c; P_Y) of the quantizer c is defined by

E(c; P_Y) := E[ min_{1≤j≤M} ‖Y − c_j‖² ].   (5)

An optimal quantizer c* is any minimizer of E(c; P_Y) over R^p × ··· × R^p, and its existence is guaranteed; see e.g. Theorem 1 in [32] or Theorem 4.12 in [18].

Approximation of measures
The connection between vector quantization and the Wasserstein distance has long been recognized; in particular, we have

inf_{c ∈ R^p×···×R^p} E(c; P_Y) = inf_{Q : #supp(Q) ≤ M} W²_2(P_Y, Q),   (7)

where W_2(P_Y, Q) denotes the L² Wasserstein distance between the probability measures P_Y and Q [18, Lemma 3.4]. Hence finding an optimal quantizer for P_Y is equivalent to best approximating P_Y, in the Wasserstein distance, by a discrete measure with support of cardinality at most M. Under the regularity assumption that, for any optimal quantizer c*, P_Y does not charge the boundaries common to any two adjacent Voronoi cells, that is, if P_Y(V_i(c*) ∩ V_j(c*)) = 0 for all 1 ≤ i ≠ j ≤ M, the set of minimizers in (7) coincides with the set of optimal quantizers minimizing (5), by [18, Lemma 3.1] and [18, Lemma 4.4]. Following [18], any minimizer of (7) is called an M-optimal quantizing measure. By this equivalence, any M-optimal quantizing measure is of the form P_Y ∘ q_c^{−1}, that is, the image measure (pushforward measure) of P_Y by the quantizer map q_c, and it can be expressed as

P_Y ∘ q_c^{−1} = Σ_{j=1}^M P_Y(V_j(c)) δ_{c_j}.
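As an illustration, the empirical counterpart of a quantizing measure, namely the vector of Voronoi-cell masses P_Y(V_j(c)) attached to the codebook points, can be computed as follows (a sketch; the name `quantizing_measure` is ours).

```python
import numpy as np

def quantizing_measure(Y, c):
    """Empirical version of the pushforward P_Y o q_c^{-1}: for each codebook
    point c_j, return the empirical mass of its Voronoi cell V_j(c), i.e. the
    fraction of the rows of Y whose nearest codebook point is c_j."""
    d2 = ((Y[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)   # (n, M) squared distances
    labels = d2.argmin(axis=1)                                # nearest-neighbor assignment
    M = c.shape[0]
    return np.bincount(labels, minlength=M) / len(Y)
```

The returned weights sum to one, so the pair (c, weights) represents the discrete measure Σ_j P_Y(V_j(c)) δ_{c_j}.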

Empirical vector quantization
Empirical vector quantization refers to the quantization of the empirical measure of a random sample and forms the basis for data clustering by the method of k-means [38], where the goal is to automatically partition the data into dissimilar groups of similar items. The setting is that of a sequence (Y_i)_{i≥1} of independent random vectors with the same distribution as Y. For each sample size n, denote by P^{(n)}_Y := (1/n) Σ_{i=1}^n δ_{Y_i} the empirical measure associated with Y_1, ..., Y_n. An empirical quantizer c_n is any minimizer of the distortion E(c; P^{(n)}_Y) for P^{(n)}_Y, with E as in (5). Consistency of c_n is shown in [37,38]. It is shown in [33,6,2] that the excess risk E[E(c_n; P_Y)] − E*(P_Y), where E*(P_Y) denotes the infimum of the distortion, of an empirical quantizer decreases at a rate of order O(1/√n) under the assumption that P_Y has bounded support. This result is extended in [7] to quantization over a separable Hilbert space. Faster convergence rates have been reported in the literature under different kinds of assumptions [see e.g. 3,30]. We mention that these rates share the property of depending only on the sample size, and neither on the number of quantization points nor on the space dimension; the dependence on these parameters is only through the constant factors.
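A minimal sketch of an empirical quantizer obtained by Lloyd (k-means) iterations on a sample may be useful here. This is the standard heuristic for minimizing the empirical distortion; it returns a local, not necessarily global, minimizer, and the function name `lloyd` is ours.

```python
import numpy as np

def lloyd(Y, M, n_iter=100, rng=None):
    """Empirical M-points quantizer: Lloyd's (k-means) iterations on the
    sample Y of shape (n, p).  Returns a locally optimal codebook (M, p)."""
    rng = np.random.default_rng(rng)
    # initialize the codebook on M distinct data points
    c = Y[rng.choice(len(Y), size=M, replace=False)]
    for _ in range(n_iter):
        d2 = ((Y[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)          # nearest-neighbor assignment step
        for j in range(M):
            if (labels == j).any():         # keep the old center if the cell is empty
                c[j] = Y[labels == j].mean(axis=0)   # centroid update step
    return c
```

On a well-separated sample, the iterations recover the cluster centroids; random restarts are commonly used to mitigate bad local minima.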

The multifunction regression problem
Let (X, Y) be a pair of random vectors taking values in R^d × R^p. Following [40], a set-valued mapping, or multifunction, c : R^d ⇒ R^p is a map which to each x in R^d associates a subset c(x) of R^p. The double arrow notation is used to distinguish multifunctions from single-valued functions, and the Euclidean spaces under consideration are endowed with their Borel σ-fields. A multifunction c : R^d ⇒ R^p is said to be measurable if, for every open set O ⊂ R^p, the preimage c^{−1}(O) := {x ∈ R^d : c(x) ∩ O ≠ ∅} is a Borel subset of R^d [40]. Given an integer M, we consider the set F_M of measurable multifunctions taking at most M values, that is,

F_M := {c : R^d ⇒ R^p measurable : #c(x) ≤ M for all x ∈ R^d},   (9)

where # denotes the cardinality of a set.
Notice that each multifunction in F_M is closed-valued. By [40, Theorem 14.5], each closed-valued measurable multifunction admits a Castaing representation, meaning in our context that, for each c in F_M, there exist M measurable functions c_1, ..., c_M from R^d to R^p such that c(x) = {c_1(x), ..., c_M(x)} for all x. We define the multifunction regression problem as the problem of best approximating Y by c(X), for some c in F_M, in the sense of the predictive risk E(c) defined in (1) as

E(c) = E[ min_{1≤j≤M} ‖Y − c_j(X)‖² ].

Notice that E(c) does not depend on the choice of the functions (c_1, ..., c_M) used to represent c. Then we define a solution to the multifunction regression problem as any multifunction c* in F_M such that

E(c*) = inf_{c ∈ F_M} E(c).

A notable difference with the conventional regression setting (corresponding to M = 1) is that the solution set typically contains multiple solutions when M ≥ 2, while when M = 1, the solution is unique and coincides with the regression function η. As claimed in the Introduction, the minimization over c can be reduced to a collection of quantization problems indexed by x. This is true in the following sense. Denote by P_X the distribution of X and by S_X its support. By conditioning on X in the definition of E(c), let E(c; x) be the function defined, for any x in S_X and any c = (c_1, ..., c_M) in R^p × ··· × R^p, by

E(c; x) := E[ min_{1≤j≤M} ‖Y − c_j‖² | X = x ],   (10)

which is also equal to the conditional version of (5) (we use the same notation c to denote either a multifunction R^d ⇒ R^p or a point in R^p × ··· × R^p when this is clear from the context and there is no risk of confusion). For any c̃ in F_M, the following equivalence holds:

E(c̃) = inf_{c ∈ F_M} E(c)  if and only if  E(c̃(x); x) = inf_{c ∈ R^p×···×R^p} E(c; x)  for P_X-almost every x.   (12)

Thus minimizing E(c) over F_M is equivalent to minimizing E(c; x) over c ∈ R^p × ··· × R^p, up to measurability.
This measurability issue can be resolved by considering a measurable selection from the argmin sets, as follows. The multifunction x ⇒ arg min_{c ∈ R^p×···×R^p} E(c; x) is closed-valued and measurable by [40, Theorem 14.37], and so it admits a measurable selection [40, Corollary 14.6], that is, a measurable function c̃ : R^d → R^p × ··· × R^p such that c̃(x) belongs to the argmin set for all x. We conclude that any solution c* to the multifunction regression problem can be defined either directly as a minimizer of E(c) over F_M, or as a measurable selection (which exists) from the collection of argmin sets defined by

C*(x) := arg min_{c ∈ R^p×···×R^p} E(c; x),  x ∈ S_X.

The estimate
In this section, we define our estimate ĉ_n(x) := (ĉ_{n,1}(x), ..., ĉ_{n,M}(x)). We also describe an optimization algorithm to compute the value of ĉ_n(x) at any point x.

Definition of the estimate
Let (X_1, Y_1), (X_2, Y_2), ... be an IID sequence of random vectors with the same distribution as (X, Y). To define ĉ_n(x), we proceed by first estimating E(c; x) and next by minimizing the estimated distortion over c, for each x. Clearly there is ample leeway for the first step, and for computational reasons exposed in section 4.2, we consider a k-nearest neighbors local averaging estimate of E(c; x) of the form

E_n(c; x) := Σ_{i=1}^n W_{n,i}(x) min_{1≤j≤M} ‖Y_i − c_j‖²,   (13)

where {W_{n,i}(x), i = 1, ..., n} is the set of weights depending on the observations X_1, ..., X_n defined as

W_{n,i}(x) := 1/k if X_i is among the k nearest neighbors of x in {X_1, ..., X_n}, and W_{n,i}(x) := 0 otherwise.   (14)

As for E(c; x) defined in (10), E_n(c; x) is a Carathéodory integrand (hence a normal integrand), and so there exists a measurable selection from its argmin sets. Then we define our estimate ĉ_n as any measurable selection from the collection {arg min_{c ∈ R^p×···×R^p} E_n(c; x) : x ∈ S_X}, meaning that ĉ_n is measurable and satisfies

ĉ_n(x) ∈ arg min_{c ∈ R^p×···×R^p} E_n(c; x)  for all x ∈ S_X.   (15)

An optimization algorithm
Minimizing E_n(c; x), or E(c; P^{(n)}_Y) in the unconditional setting, is known to be computationally difficult (it is NP-hard). A popular and tractable optimization algorithm for this purpose is the k-means algorithm, which proceeds iteratively by constructing a sequence of quantizers converging to a local optimum. From a practical perspective, the local averaging estimate E_n(c; x) defined in (13) can be minimized by considering a weighted version of the k-means algorithm, as described in Algorithm 1. Naturally, weights other than the k-nearest neighbors weights could be used. But when using the k-nearest neighbors weights (14), the algorithm is equivalent to the standard M-means algorithm applied to the Y_i's which correspond to the k nearest neighbors of x among the X_i's. Thus, the algorithm is rather simple to implement.
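The equivalence just noted — with the uniform k-nearest neighbors weights, minimizing E_n(c; x) amounts to running plain M-means on the responses of the k nearest neighbors of x — can be sketched in numpy as follows. The function name and the random initialization are ours, and, as with any k-means procedure, the result is a local minimizer.

```python
import numpy as np

def conditional_quantizer(X, Y, x, k, M, n_iter=100, rng=None):
    """Evaluate the estimate at x: run M-means (Lloyd iterations) on the
    responses Y_i whose predictors X_i are among the k nearest neighbors
    of x.  X: (n, d), Y: (n, p), x: (d,).  Returns a codebook (M, p)."""
    rng = np.random.default_rng(rng)
    idx = np.argsort(((X - x) ** 2).sum(axis=1))[:k]   # k nearest neighbors of x
    Yk = Y[idx]
    c = Yk[rng.choice(k, size=M, replace=False)]       # initialize on neighbor responses
    for _ in range(n_iter):
        d2 = ((Yk[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                     # assignment step
        for j in range(M):
            if (labels == j).any():                    # keep old center if cell is empty
                c[j] = Yk[labels == j].mean(axis=0)    # centroid update step
    return c
```

Sweeping x over a grid and plotting the M returned points at each x produces curves such as those in the right panels of Figures 2 and 3.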

Asymptotic analysis
In this section, we study the convergence of the estimate ĉ_n defined in (15). We assume that (X, Y) admits a probability density f_{XY} with respect to the Lebesgue measure on R^d × R^p and that Y is bounded, that is, there is R > 0 such that ‖Y‖ ≤ R almost surely.
We shall need the following notation. The conditional density of Y given X = x at y is denoted by f Y |X=x (y). For any m, the closed ball of R m centered at x and of radius ρ is denoted by B m (x, ρ). The closed ball centered at the origin and of radius ρ is denoted by B m (ρ) and the volume of B m (1) is denoted by ω m .
Recall the excess risk defined in (2) by

R(ĉ_n) := E[E(ĉ_n)] − inf_{c ∈ F_M} E(c),   (16)

where the class F_M is defined in (9). We also consider the pointwise excess risk defined by

R(ĉ_n; x) := E[E(ĉ_n(x); x)] − inf_{c ∈ R^p×···×R^p} E(c; x).   (17)

We note that, by (12), inf_{c ∈ F_M} E(c) = E[ inf_{c ∈ R^p×···×R^p} E(c; X) ], and so R(ĉ_n) = E[R(ĉ_n; X)].

Bounds on the excess risk
In Theorem 1 and Theorem 2, we establish convergence rates on the pointwise excess risk (17) and on the excess risk (16) of a sequence of estimates ĉ_n. We consider local and global Lipschitz regularity conditions on the conditional densities, analogous to those used in [19] for the estimation of conditional distributions.
Theorem 1. Let x be a point in S_X. Assume that there exist δ > 0 and κ > 0 such that

P_X(B_d(x, ρ)) ≥ κ ω_d ρ^d  for all ρ ∈ (0, δ],   (18)

and that there exists an integrable function h : R^p → R_+ such that

|f_{Y|X=x̃}(y) − f_{Y|X=x}(y)| ≤ h(y) ‖x̃ − x‖   (19)

for all y ∈ R^p and for all x̃ with ‖x̃ − x‖ ≤ δ. Then there exists a constant C := C(δ, κ, h, R) > 0 such that, for all k and n satisfying k/n ≤ (1/C)[(log k / k)^{d/2} ∧ δ^d], and any sequence ĉ_n of estimates,

R(ĉ_n; x) ≤ C (pM + 1)² √(log k / k) + C (k/n)^{1/d}.   (20)

Condition (18) is a regularity condition on the support S_X in a neighborhood of the point x, which is used for instance in set estimation to define the notion of a standard set [see e.g. 5].

The term C(pM + 1)² √(log k / k) in the right-hand side of (20) corresponds to the excess risk of an empirical quantizer defined on a random sample of size k [33]. As pointed out in [6], the log k factor can be eliminated at the price of added technical difficulties, and we speculate that the same applies here, so that the first term in the right-hand side of (20) could be sharpened to a constant multiple of 1/√k. With the choice k ≍ n^{2/(d+2)}, Theorem 1 leads to the following bound on the pointwise excess risk.

Corollary 1. Under the conditions of Theorem 1, with k ≍ n^{2/(d+2)}, R(ĉ_n; x) = O(√(log n) · n^{−1/(d+2)}).

We note that the rate of Corollary 1 is slower than the O(√(log n / n)) rate that would be obtained in the quantization of a sample of size n without conditioning. Hence we see that a curse of dimensionality is at play here, as expected.
To bound the excess risk, we consider a global version of the Lipschitz condition (19). We also assume that the support S X is compact, which is a standard condition in regression estimation with nearest neighbors.

Theorem 2. Suppose that S_X is compact, and that there exists an integrable function h : R^p → R_+ such that

|f_{Y|X=x̃}(y) − f_{Y|X=x}(y)| ≤ h(y) ‖x̃ − x‖  for all y ∈ R^p,   (21)

for all x and x̃ in S_X. Let ĉ_n be a sequence of estimates. Then, with k ≍ n^{2/(d+2)},

R(ĉ_n) = O(√(log n) · n^{−1/(d+2)}).

We note that the rates in both Corollary 1 and Theorem 2 depend adversely on the dimension d of the predictor variable X. This results from the smoothing over x used to estimate the conditional distortions E(c; x). Note also that these rates depend neither on the number of quantization points M nor on the dimension p of the response Y; the dependence on p and M is only through the constants, as exhibited in the right-hand side of (20) for instance. This is consistent with known bounds on the excess risk in empirical vector quantization [33,6,2], as seen in Section 2.4.
In the real-valued regression setting (with M = 1 and p = 1), condition (21) entails the Lipschitz continuity of the regression function η(x) = E[Y|X = x], by the Lebesgue dominated convergence theorem combined with the assumption that Y is bounded. Therefore the rate in Theorem 2 is suboptimal for M = 1, since the minimax rate of convergence of the L² risk for a Lipschitz continuous regression function with bounded Y is n^{−2/(d+2)}. Yet a striking difference between the cases M = 1 and M ≥ 2 is that the equality R(ĉ_n) = E‖ĉ_{n,1}(X) − η(X)‖², valid in the case M = 1, which allows for the decomposition into the well-known bias/variance sum, with the bias depending on the regularity of the regression function, does not extend to the case M ≥ 2. Therefore, when M ≥ 2, the dependence, if any, of the convergence rate of the excess risk on the regularity of the solution(s) to the multifunction regression problem is unclear. In addition, when M ≥ 2, the solution set of a quantization problem may contain multiple solutions (and, as pointed out in [39], few conditions enforcing uniqueness exist). Thus one might expect that irregular solutions to the multifunction regression problem do exist. A deeper analysis of the convergence rate would therefore necessitate a fine examination of the regularity of the argmin sets of E(c; x) over x, a question that we leave open. That said, by using the connection between the quantization problem and the Wasserstein distance, we can state a convergence result on the solution, as shown in the next section.

Convergence in solution
For any x in S_X, let C*(x) = arg min_{c ∈ R^p×···×R^p} E(c; x), the argmin set of E(c; x). Given two closed-valued multifunctions c¹ and c², we measure the proximity between c¹(x) and c²(x) by their Hausdorff distance d_H(c¹(x), c²(x)), where the Hausdorff distance between two subsets A and B of R^p is defined by

d_H(A, B) := max{ sup_{a ∈ A} inf_{b ∈ B} ‖a − b‖, sup_{b ∈ B} inf_{a ∈ A} ‖a − b‖ }.

We state the following qualitative result under the local conditions used in Theorem 1.

Theorem 3. Let x be a point in S_X satisfying (18) and (19), and let k ≍ n^{2/(d+2)}. Then

min_{c ∈ C*(x)} d_H(ĉ_n(x), c) → 0  almost surely as n → ∞.

Thus, at any x in S_X, the accumulation points of ĉ_n(x) are optimal. In particular, if the multifunction regression problem admits a unique solution c*, then ĉ_n(x) converges almost surely to c*(x) for all x in S_X satisfying (18) and (19). We note that (19) holds for all x ∈ S_X when (21) is satisfied, and that (18) holds for all x when the support S_X is a standard set in the sense of [5].
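For finite sets such as the values of ĉ_n and the optimal configurations, the suprema and infima in the Hausdorff distance reduce to a max-min over pairwise distances, which is straightforward to compute (a small sketch; the name `hausdorff` is ours).

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between two finite subsets of R^p,
    given as arrays A: (m, p) and B: (l, p): the maximum of the
    two directed (one-sided) distances."""
    d = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))  # (m, l) pairwise distances
    return max(d.min(axis=1).max(),   # directed distance from A to B
               d.min(axis=0).max())   # directed distance from B to A
```

For instance, with A = {0, 1} and B = {0, 3} on the real line, the directed distance from A to B is 1 while that from B to A is 2, so d_H(A, B) = 2.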

Numerical experiments
In this section, we report on practical aspects of the implementation of the empirical conditional quantizer with k-nearest neighbors weights, as described in Algorithm 1. In particular, through two simulated examples, we discuss the choice of the parameter k, corresponding to the number of neighbors, and of the parameter M, corresponding to the number of quantization points. The methodology is then applied to a real-world data set of speed records as a function of location along a daily automobile path in the city of Toulouse, France. This data is provided by Mediamobile (http://www.mediamobile.com).

Example 1: Two-conditional clusters
In this example, we apply the methodology to a sample of n = 2,500 points simulated from the distribution represented in Figure 1. In detail, X follows a uniform distribution over [0, 1] and, given X, Y follows a mixture of two normal distributions with equal weights, both variances equal to 0.01, and mean functions η_1(x) = arctan(8x) and η_2(x) = 2 arctan(6x). The number of quantization points is set to M = 2 for all x in [0, 1].
To select the number of neighbors k, we propose a data-driven method based on the minimization of an estimate of the average prediction error E[E(ĉ_n(X); X)] used to define the excess risk. For this purpose, we split the data into two parts, of sizes ⌊2n/3⌋ and n − ⌊2n/3⌋. The first part is used to construct the model, while the second part is used to estimate the mean prediction error (fractions other than 2/3–1/3 could be used, so long as the size of the test set remains smaller than the size of the learning set). Specifically, given an integer k, for each ⌊2n/3⌋ + 1 ≤ i ≤ n, we determine an empirical quantizer ĉ_n(X_i) by minimization of the quantization error based on the k nearest neighbors of X_i among the (X_j, Y_j), for 1 ≤ j ≤ ⌊2n/3⌋; that is, ĉ_n(X_i) := (ĉ_{n,1}(X_i), ..., ĉ_{n,M}(X_i)) minimizes

(1/k) Σ_{j ∈ N_k(X_i)} min_{1≤l≤M} ‖Y_j − c_l‖²,

where N_k(X_i) denotes the set of indices of the k nearest neighbors of X_i among X_1, ..., X_{⌊2n/3⌋}. Using this, we set

Ê_{P,n}(k) := (1/(n − ⌊2n/3⌋)) Σ_{i=⌊2n/3⌋+1}^n min_{1≤l≤M} ‖Y_i − ĉ_{n,l}(X_i)‖²,

which is an estimate of the average prediction error. The data-driven value of the number of neighbors is then selected as a minimizer of Ê_{P,n}(k) over k.
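The splitting procedure above can be sketched as follows. The helper `local_mmeans` (our name) performs the M-means step on the responses of the selected neighbors, and the selected k is the minimizer of the estimated prediction error over a user-supplied grid; this is an illustration under our naming, not the paper's code.

```python
import numpy as np

def local_mmeans(Yk, M, n_iter=50, rng=None):
    """Plain M-means (Lloyd iterations) on the responses Yk of shape (k, p)."""
    rng = np.random.default_rng(rng)
    c = Yk[rng.choice(len(Yk), size=M, replace=False)]
    for _ in range(n_iter):
        lab = ((Yk[:, None] - c[None]) ** 2).sum(-1).argmin(1)
        for j in range(M):
            if (lab == j).any():
                c[j] = Yk[lab == j].mean(0)
    return c

def select_k(X, Y, M, grid, rng=None):
    """Data-driven k: fit on the first 2n/3 points and pick the k in `grid`
    minimizing the estimated mean prediction error on the remaining third."""
    n = len(X); m = (2 * n) // 3
    Xtr, Ytr, Xte, Yte = X[:m], Y[:m], X[m:], Y[m:]
    errs = []
    for k in grid:
        tot = 0.0
        for xi, yi in zip(Xte, Yte):
            idx = ((Xtr - xi) ** 2).sum(1).argsort()[:k]     # k-NN in the learning set
            c = local_mmeans(Ytr[idx], M, rng=rng)
            tot += ((yi - c) ** 2).sum(1).min()              # min_l ||Y_i - c_l(X_i)||^2
        errs.append(tot / len(Xte))
    return grid[int(np.argmin(errs))], errs
```

In practice the grid would be something like range(10, 151, 5), as in the experiment described below, possibly with several k-means restarts per evaluation.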
In this example, Ê_{P,n}(k) has been evaluated for values of k ranging from 10 to 150 in steps of 5. The graph of Ê_{P,n}(k) as a function of k is represented in the left panel of Figure 2. The minimum of the estimated prediction error is attained at k = 75. This value is then used to evaluate empirical conditional quantizers at 100 equally spaced x-values ranging from 0 to 1. The resulting conditional quantizers are represented as the green and red curves in the right panel of Figure 2.

Example 2: One or two conditional clusters
In this example, we consider a pair (X, Y) where X follows a uniform distribution over [−1, 1] and where, given X, Y follows a mixture of two normal distributions with equal proportions, both variances equal to 0.01, and mean functions η_1(x) = x² and η_2(x) = −x². A scatterplot of n = 1,200 points simulated from this distribution is represented in the right panel of Figure 3. It appears that the conditional distribution of Y given X is well concentrated around one cluster when x is approximately in the range [−0.4, 0.4], while it clusters into two groups outside this interval. This calls for an automatic selection of both k (the number of neighbors) and M (the number of quantization points). Here, the goal is to have k and M both depend on x.
The problem of selecting the number of quantization points is standard in cluster analysis, where it corresponds to the selection of the number of clusters. Several heuristics have been introduced for that purpose. We shall use the gap heuristic proposed by [44], whereby the number of clusters is selected by comparing the change in the within-cluster variability to that expected under a null reference distribution which is not clustered, such as a uniform or unimodal distribution. In the present setting, the difficulty in selecting both M and k lies in the lack of a global criterion to optimize. Indeed, for each k, the mean prediction error decreases with M, so this criterion cannot be used to simultaneously select k and M. Moreover, the use of an empirical heuristic, like the gap heuristic, for selecting M would require k to have been specified first.
To circumvent these issues, we propose the following method. First, for each M in a given range, a value of k is selected from the data by minimizing the mean prediction error, as described in Example 1. Denote this value by k̂(M). In this example, we let M vary between 1 and 8; the estimated k̂(M) are represented in Figure 3. Denoting by M̂_gap(k) the number of quantization points selected by the gap heuristic with k neighbors, it is interesting that, in these simulations, for each x, the values {M̂_gap(k̂(M)), M = 1, ..., 8} were all equal; the selection of M was therefore particularly robust to the initial value of k. It is also interesting to note that, in this example, the pair (k̂, M̂) selected at each x satisfies the stability relations M̂ = M̂_gap(k̂) and k̂ = k̂(M̂).
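A possible implementation of the gap heuristic of [44] is sketched below, using a uniform reference distribution drawn on the bounding box of the sample; the function names and the choice of reference box are ours, and the within-cluster dispersion is computed with plain Lloyd iterations.

```python
import numpy as np

def within_dispersion(Y, M, rng):
    """W_M: within-cluster sum of squares after M-means (Lloyd) on Y (n, p)."""
    c = Y[rng.choice(len(Y), size=M, replace=False)]
    for _ in range(50):
        lab = ((Y[:, None] - c[None]) ** 2).sum(-1).argmin(1)
        for j in range(M):
            if (lab == j).any():
                c[j] = Y[lab == j].mean(0)
    return ((Y - c[lab]) ** 2).sum()

def gap_select(Y, M_max=8, B=20, seed=0):
    """Gap heuristic: compare log W_M on the data with its average over B
    uniform reference samples drawn on the bounding box of Y, and select the
    smallest M such that gap(M) >= gap(M+1) - s_{M+1}."""
    rng = np.random.default_rng(seed)
    lo, hi = Y.min(0), Y.max(0)
    logW = np.array([np.log(within_dispersion(Y, M, rng)) for M in range(1, M_max + 1)])
    ref = np.zeros((B, M_max))
    for b in range(B):
        Yb = rng.uniform(lo, hi, size=Y.shape)          # unclustered reference sample
        ref[b] = [np.log(within_dispersion(Yb, M, rng)) for M in range(1, M_max + 1)]
    gap = ref.mean(0) - logW
    s = ref.std(0) * np.sqrt(1 + 1 / B)                 # simulation-error correction
    for M in range(1, M_max):
        if gap[M - 1] >= gap[M] - s[M]:
            return M
    return M_max
```

In the conditional setting, `gap_select` would be applied, for each x, to the responses of the selected neighbors of x.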
We applied this selection procedure to 100 x values equally spaced between −1 and 1. This resulted in either 1 or 2 clusters. The quantization points are represented as curves in the right panel of Figure 3.

Example: Speed data
We consider a data set of Floating Car Data (FCD) extracted from GPS devices which record the speed and location of cars at a frequency of 10 Hz. The raw data is map-matched to a network of roads. In this example, n = 70 vehicles have been monitored at different times and days while moving along a given path, 10 kilometers long, and composed of sections of inner-city roads and of a freeway. The data is represented in the top-left panel of Figure 4 as 70 curves giving the speed (in kilometers per hour) as a function of the distance along the path (normalized to unit length).
As an approach to road traffic forecasting, [1] propose to first cluster the speed-location curves to define prototypical speed patterns, and next to assign a new individual to the closest cluster. To implement the clustering approach, it is required that the data correspond to the same path. Yet when using FCD, vehicles may share only a small section of a trajectory, so that the amount of data for a given path may be limited, which may hamper the prediction in some cases.
To cope with this issue, we propose to strongly localize the determination of the speed patterns by inferring the cluster structure of the speed conditionally on the location. It can be noticed from the top-left panel of Figure 4 that drivers behave differently at high speeds, whereas low speeds exhibit less variability. This difference in variability may be explained by the presence of traffic jams, which have a stronger effect on a freeway ride, where high speeds can no longer be attained, than on an inner-city ride, where the traffic is already constrained by speed limits, traffic signals and stop signs. The cluster structure of the traffic flow is well revealed by the conditional quantization, as represented in the top-right panel of Figure 4. The analysis yields either one or two clusters conditionally on the location, which can be interpreted as corresponding to free-flow and congested-flow situations.
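The conditional quantization used here can be sketched in a few lines: at a query location x, collect the speeds of the k nearest observed locations and run M-means on them. The following NumPy sketch is illustrative only; the synthetic two-regime speed model in the usage below (a congested mode near 30 km/h and a free-flow mode near 90 km/h) is our assumption, not the paper's data.

```python
import numpy as np

def lloyd_1d(y, M, n_iter=50):
    """One-dimensional M-means (Lloyd's algorithm), initialized at quantiles."""
    c = np.quantile(y, (np.arange(M) + 0.5) / M)
    for _ in range(n_iter):
        labels = np.abs(y[:, None] - c[None, :]).argmin(1)
        for m in range(M):
            if np.any(labels == m):
                c[m] = y[labels == m].mean()
    return np.sort(c)

def conditional_quantizers(locations, speeds, x, k, M):
    """Quantization points of the conditional law of the speed at location x:
    M-means applied to the speeds of the k nearest observed locations."""
    idx = np.argsort(np.abs(locations - x))[:k]
    return lloyd_1d(speeds[idx], M)
```

On a bimodal conditional distribution such as the free-flow/congested mixture described above, the two quantization points land near the two modes of the traffic regime.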

Proofs
We start in Section 7.1 by establishing a uniform concentration inequality for the distortion. Theorems 1, 2 and 3 are proved in Sections 7.2, 7.3 and 7.4, respectively.

Concentration of the distortion
Proposition 1 below gives an upper bound on the uniform deviations of E_n(c; x) from E(c; x) for each fixed point x.

Proposition 1. In the context of Theorem 1, let x be a point in the support of the distribution of X, let κ > 0 satisfy (18), and let δ > 0 and h : R^p → R_+ integrable satisfy (19). There exists a constant C := C(δ, κ, h, R) > 0 such that, for any ε > 0, any k and n satisfying k

We shall need the following lemma, which is Lemma 6 in [13].

Lemma 1 ([13]). Let (U_i)_{i≥1} be a sequence of independent, zero-mean random variables such that |U_i| ≤ c almost surely. For all real numbers a_1, . . . , a_n ≥ 0 such that Σ_{i=1}^n a_i ≤ 1, and all ε > 0,

Proof. Given c in B_p(R)^M, let Ẽ_n(c; x) := Σ_{i=1}^n W_{n,i}(x) E(c; X_i) be a centering term. We proceed to bound the deviations |E_n(c; x) − Ẽ_n(c; x)| and |Ẽ_n(c; x) − E(c; x)| uniformly over c in B_p(R)^M. For any ε > 0, we have

Note that, for each i = 1, . . . , n, the weight W_{n,i}(x) depends on the distances from x to the X_i's and hence is σ(X_1, . . . , X_n)-measurable, the random variable Z_i is almost surely bounded by 4R², and E[Z_i − E(c_1, . . . , c_M; X_i) | X_1, . . . , X_n] = 0 almost surely. So, applying Lemma 1 with coefficients a_i = W_{n,i}(x) and random variables U_i = Z_i − E(c; X_i), conditionally on the sample X_1, . . . , X_n, we obtain that, for any ε > 0,

from which it follows that

To obtain a uniform bound, we consider a covering of the set {(c_1, . . . , c_M) : c_i ∈ B_p(R)} = B_p(R)^M with respect to the distance induced by the Euclidean norm of R^{pM}. Since B_p(R)^M is a compact subset of R^{pM}, the minimal number N(B_p(R)^M, η) of balls of radius η necessary to cover B_p(R)^M is of order η^{−pM}, i.e., by considering an η-packing of B_p(R)^M, we can prove that

since ‖Y − a_{ℓ,j}‖ ≤ 2R and using the fact that ‖c − a_ℓ‖ ≤ η implies ‖a_{ℓ,j} − c_j‖ ≤ η. Therefore, min

and, taking the expectation conditionally on X, we deduce that E(c; x) ≥ E(a_ℓ; x) − 4Rη for any x in the support of the distribution of X.
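For concreteness, the Hoeffding-type inequality invoked through Lemma 1 typically reads as follows (the exact constants in [13] may differ):

```latex
\mathbb{P}\Bigl(\Bigl|\sum_{i=1}^{n} a_i U_i\Bigr| > \varepsilon\Bigr)
  \;\le\; 2\exp\Bigl(-\frac{\varepsilon^{2}}{2c^{2}\sum_{i=1}^{n} a_i^{2}}\Bigr).
```

With the uniform k-nearest-neighbor weights a_i = W_{n,i}(x) = 1/k on the k nearest neighbors (and 0 otherwise), Σ_i a_i² = 1/k, so the bound is of order exp(−kε²/(2c²)); this is consistent with the requirement k/log n → ∞ used later for summability.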
By exchanging c_j with a_{ℓ,j} in (27), the same reasoning leads to the inequality E(a_ℓ; x) ≥ E(c; x) − 4Rη. Hence, for any c in B_p(a_ℓ, η) and for any x in S_X,

Similarly, by considering E_n in place of E in the steps above, we also have, for any c in B_p(a_ℓ, η),

Therefore, for any 1 ≤ ℓ ≤ N_η, sup_{c ∈ B_p(a_ℓ, η)}

We then deduce from (26) and (28) with η = ε/(16R), together with the exponential inequality in (24) and the bound on the covering number in (25), that P(sup

Now we proceed to bound the deviations of Ẽ_n(c; x) from E(c; x). Let δ > 0 and h : R^p → R_+ integrable satisfy the regularity condition (19). Since Y is bounded by R, for x̃ with ‖x̃ − x‖ ≤ δ and any c in B_p(R)^M,

where ‖h‖_1 denotes the L¹ norm of the function h.
Denote by X_{(k,n)}(x) the k-th nearest neighbor of x among the sample X_1, . . . , X_n. Then, for any c in B_p(R)^M,

where in the last inequality we used the fact that

Hence, with probability one,

Then,

For any 0 < η ≤ δ, let p_η = P(‖X − x‖ ≤ η). Note that p_η ≥ κη^d, where κ > 0 is defined in (18). Since

we deduce, by using Chernoff's bound, that for any 0 < η ≤ δ,

where the last inequality holds whenever (k − 1)/(np_η) ≤ 1/2, which is implied when

Therefore, in this case,

Hence, plugging (34) into (32), for any ε > 0, and any k and n such that

we have

with

Combining (29) and (36), we obtain that, for any ε > 0, and any k and n satisfying (35),

From this, and the fact that (35) is satisfied when k/n ≤ κ, we conclude.
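The Chernoff step above can be made explicit. The event {‖X_{(k,n)}(x) − x‖ > η} means that fewer than k sample points fall in the ball B(x, η), i.e., that a Binomial(n, p_η) variable is at most k − 1; a standard multiplicative Chernoff lower-tail bound then gives (up to the constants used in the text):

```latex
\mathbb{P}\bigl(\|X_{(k,n)}(x) - x\| > \eta\bigr)
  = \mathbb{P}\bigl(\mathrm{Bin}(n, p_\eta) \le k - 1\bigr)
  \le \exp\Bigl(-\frac{n p_\eta}{2}\Bigl(1 - \frac{k-1}{n p_\eta}\Bigr)^{2}\Bigr)
  \le \exp\Bigl(-\frac{n p_\eta}{8}\Bigr),
```

the last inequality holding precisely when (k − 1)/(np_η) ≤ 1/2, as required in the text.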

Proof of Theorem 1
Let c* be an optimal quantizer, meaning that E(c*; x) = E*. Denote by c_n := ĉ_n(x) the value of the estimate ĉ_n at the point x. Following standard arguments, we have

so that

Given a > 0, and since sup_{c ∈ B_p(R)^M} |E_n(c; x) − E(c; x)| ≤ 8R² almost surely, we have

By Proposition 1, there exist constants C_0 := C_0(R, p, M) > 0 and C_1 := C_1(δ, κ, h, R) > 0 such that, for any ε > 0, and any k and n with k

Let a > 0 and suppose that k/n ≤ (1/C_1)(a^d ∧ δ^d). Then, since

Taking a = c √(log k / k), with c a constant depending only on C_1, p and M,
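The "standard arguments" at the start of this proof are the usual excess-risk decomposition; in the paper's notation, and writing c_n for ĉ_n(x), since c_n minimizes the empirical distortion we have E_n(c_n; x) ≤ E_n(c*; x), whence:

```latex
E(c_n; x) - E^{\star}
  = \bigl[E(c_n; x) - E_n(c_n; x)\bigr]
  + \bigl[E_n(c_n; x) - E_n(c^{\star}; x)\bigr]
  + \bigl[E_n(c^{\star}; x) - E(c^{\star}; x)\bigr]
  \le 2 \sup_{c \in B_p(R)^{M}} \bigl|E_n(c; x) - E(c; x)\bigr|,
```

where the middle bracket is nonpositive by the definition of c_n.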

Proof of Theorem 2
For any x in S_X and c in B_p(R)^M, let Ẽ_n(c; x) be defined as in (23). Using (37), for any x in S_X, we have

and we proceed to bound the two terms on the right-hand side of (41).
To bound the first term, we use the concentration inequality (29) from the proof of Proposition 1 to deduce that, for any ε > 0 and for any x in S_X,

where C > 0 is a constant not depending on x. Hence, proceeding as in the first part of (40) and then integrating over x, we deduce that

for some constant C > 0.

Proof of Theorem 3
Let P_{n,k} = Σ_{i=1}^n W_{n,i}(x) δ_{Y_i}. By [18, Theorem 4.21], the result will hold if W_2(P_{n,k}, P_{Y|X=x}) → 0 almost surely.
Recall that a sequence (Q_n) of probability measures converges to Q in the (L²) Wasserstein distance if (Q_n) converges weakly to Q and if ∫‖y‖² dQ_n(y) → ∫‖y‖² dQ(y) as n → ∞. Thus, since ‖Y‖ ≤ R almost surely, it suffices to show that P_{n,k} = Σ_{i=1}^n W_{n,i}(x) δ_{Y_i} converges weakly to P_{Y|X=x} almost surely. Towards proving this, let g be a continuous and bounded function over R^p and let m(x) = E[g(Y)|X = x]. Proceeding as in the proof of (24), we get
Let δ > 0 and h : R^p → R_+ integrable satisfy the regularity condition (19). For any x̃ with ‖x̃ − x‖ ≤ δ,

Hence, for any ε > 0,

Using (34), for any k and n such that

where κ is defined in (18), we have

with

Combining (47) and (49), we deduce that, for any ε > 0 and any k and n satisfying (48),

Since k/n → 0, (48) is satisfied for all n large enough. Now, for any ε > 0, the last two terms on the right-hand side are summable over n, and the first term is summable if k/log n → ∞. So, by the Borel–Cantelli lemma, Σ_{i=1}^n W_{n,i}(x) g(Y_i) converges almost surely to m(x). Hence P_{n,k} converges weakly to P_{Y|X=x} almost surely, which implies that (46) holds.
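The convergence of Σ_i W_{n,i}(x) g(Y_i) to m(x) established above is easy to observe numerically. The following sketch, outside the formal argument, uses uniform k-NN weights W_{n,i}(x) = 1/k and a hypothetical model Y = 2X + 0.1Z with g = cos, for which m(x) = cos(2x) e^{−0.005}; the model and function names are our illustrative choices.

```python
import numpy as np

def knn_weighted_mean(X, Y, x, k, g):
    """Estimate m(x) = E[g(Y) | X = x] with the uniform k-NN weights
    W_{n,i}(x) = 1/k on the k nearest neighbors of x (0 otherwise)."""
    idx = np.argsort(np.abs(X - x))[:k]
    return g(Y[idx]).mean()

rng = np.random.default_rng(0)
n = 20000
X = rng.uniform(0.0, 1.0, n)
Y = 2.0 * X + rng.normal(0.0, 0.1, n)         # Y = f(X) + sigma * Z
est = knn_weighted_mean(X, Y, 0.3, 500, np.cos)
true = np.cos(0.6) * np.exp(-0.005)           # E[cos(f(x) + sigma Z)] = cos(f(x)) e^{-sigma^2/2}
```

With k = 500 and n = 20000, the condition k/n → 0 driving the bias term and the growth of k driving the variance term are both mimicked, and the estimate lands close to the true conditional expectation.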