Converting High-Dimensional Regression to High-Dimensional Conditional Density Estimation

There is a growing demand for nonparametric conditional density estimators (CDEs) in fields such as astronomy and economics. In astronomy, for example, one can dramatically improve estimates of the parameters that dictate the evolution of the Universe by working with full conditional densities instead of regression (i.e., conditional mean) estimates. More generally, standard regression falls short in any prediction problem where the distribution of the response is more complex with multi-modality, asymmetry or heteroscedastic noise. Nevertheless, much of the work on high-dimensional inference concerns regression and classification only, whereas research on density estimation has lagged behind. Here we propose FlexCode, a fully nonparametric approach to conditional density estimation that reformulates CDE as a non-parametric orthogonal series problem where the expansion coefficients are estimated by regression. By taking such an approach, one can efficiently estimate conditional densities and not just expectations in high dimensions by drawing upon the success in high-dimensional regression. Depending on the choice of regression procedure, our method can adapt to a variety of challenging high-dimensional settings with different structures in the data (e.g., a large number of irrelevant components and nonlinear manifold structure) as well as different data types (e.g., functional data, mixed data types and sample sets). We study the theoretical and empirical performance of our proposed method, and we compare our approach with traditional conditional density estimators on simulated as well as real-world data, such as photometric galaxy data, Twitter data, and line-of-sight velocities in a galaxy cluster.


Introduction
A challenging problem in modern statistical inference is how to estimate a conditional density of a random variable Z ∈ R given a high-dimensional random vector X ∈ R D , f (z|x). This quantity plays a key role in several statistical problems in the sciences where the regression function E[Z|x] is not informative enough due to multi-modality and asymmetry of the conditional density.
For example, several recent works in cosmology (Sheldon et al., 2012;Kind and Brunner, 2013;Rau et al., 2015) have shown that one can significantly reduce systematic errors in cosmological analyses by using the full probability distribution of photometric redshifts z (a key quantity that relates the distance of a galaxy to the observer) given galaxy colors x (i.e., differences of brightness measures at two different wavelengths). Other fields where conditional density estimation plays a key role are time series forecasting in economics (Kalda and Siddiqui, 2013) and approximate Bayesian methods (Fan et al., 2013;Izbicki et al., 2014;Papamakarios and Murray, 2016). Conditional densities can also be used to construct accurate predictive intervals for new observations in settings with complicated sources of errors (Fernández-Soto et al., 2001) or multimodal distributions (see Fig. 1 and Fig. 7 for examples). and return too wide prediction bands, whereas a nonparametric conditional density estimator automatically returns informative predictive bands. The left plot shows 95% predictions bands from local linear regression, and the right plot shows 95% highest predictive density (HPD) bands derived from FlexCode-SAM estimates of the conditional density.
Nevertheless, whereas a large literature has been devoted to estimating the regression E[Z|x], statisticians have paid far less attention to estimating the full conditional density f (z|x), especially when x is high-dimensional. Most attempts to estimate f (z|x) can effectively only handle up to about 3 covariates (see, e.g., Fan et al. 2009). In higher dimensions, such methods typically rely on a prior dimension reduction step which, as is the case with any data reduction, can result in significant loss of information.
Contribution. There is currently no general procedure for converting successful regression estimators (that is, estimators of the conditional mean E[Z|x]) to estimators of the full conditional density f (z|x) -indeed, this is a non-trivial problem. In this paper, we propose a fully nonparametric approach to conditional density estimation, which reformulates CDE as an orthogonal series problem where the expansion coefficients are estimated by regression. By taking such an approach, one can efficiently estimate conditional densities in high dimensions by drawing upon the success in high-dimensional regression. Depending on the choice of regression procedure, our method can exploit different types of sparse structure in the data, as well as handle different types of data.
For example, in a setting with submanifold structure, our estimator adapts to the intrin-3 sic dimensionality of the data with a suitably chosen regression method; such as, nearest neighbors, local linear, tree-based or spectral series regression (Bickel and Li, 2007;Kpotufe, 2011;Kpotufe and Dasgupta, 2012;. Similarly, if the number of relevant covariates (i.e., covariates that affect the distribution of Z) is small, one can construct a good conditional density estimator using lasso, SAM, Rodeo or other additive-based regression estimators (Tibshirani, 1996;Lafferty and Wasserman, 2008;Meier et al., 2009;Yang and Tokdar, 2015). Because of the flexibility of our approach, the method is able to overcome the the curse of dimensionality in a variety of scenarios with faster convergence rates and better performance than traditional conditional density estimators; see Sections 3-4 for specific examples and analysis. By choosing appropriate regression methods, the method can also handle different types of covariates that represent discrete data, mixed data types, functional data, circular data, and so on, which generally require hand-tailored techniques (e.g., Di Marzio et al. 2016). Most notably, Sec. 3.4 describes an entirely new area of conditional density estimation (here referred to as "Distribution CDE") where a predictor is an entire sample set from an underlying distribution.
We call our general approach FlexCode, which stands for Flexible nonparametric conditional density estimation via regression.
Existing Methodology. With regards to existing methods for estimating f (z|x), several nonparametric estimators have been proposed when x lies in a low-dimensional space.
Many of these methods are based on first estimating f (z, x) and f (x) with, for example, kernel density estimators (Rosenblatt, 1969), and then combining the estimates according . Several works further improve upon such an approach by using different criteria and shortcuts to tune parameters as well as creating fast shortcuts to implement these methods (e.g., Hyndman et al. 1996;Ichimura and Fukuda 2010). Other approaches to conditional density estimation in low dimensions include using locally polynomial regression (Fan et al., 1996), least squares approaches (Sugiyama et al., 2010) and density estimation through quantile estimation (Takeuchi et al., 2009); see Bertin et al. (2016) and references 4 therein for other methods.
For moderate dimensions, Hall et al. (2004) propose a method for tuning parameters in kernel density estimators which automatically determines which components of x are relevant to f (z|x). The method produces good results but is not practical for high-dimensional data sets: Because it relies on choosing a different bandwidth for each covariate, it has a high computational cost that increases with both the sample size n and the dimension D, with prohibitive costs even for moderate n's and D's. Similarly, Shiga et al. (2015) propose a conditional estimator that selects relevant components but under the restrictive assumption that f (z|x) has an additive structure; moreover the method scales as O(D 3 ), which is also computationally prohibitive for moderate dimensions. Another framework is developed by Efromovich (2010), who proposes an orthogonal series estimator that automatically performs dimension reduction on x when several components of this vector are conditionally independent of the response. Unfortunately, the method requires one to compute D + 1 tensor products, which quickly becomes computationally intractable even for as few as 10 covariates. More recently,  propose an alternative orthogonal series estimator that uses a basis that adapts to the geometry of the data. They show that their approach, called Spectral Series CDE, as well as the k-nearest neighbor method by Zhao and Liu (1985), work well in high dimensions when there is submanifold structure. These methods, however, do not perform well when x has irrelevant components.
FlexCode, on the other hand, is flexible enough to overcome the difficulties of other methods under a large variety of situations because it makes use of the many existing regression methods for high-dimensional inference. As an example, Fig. 2 shows the level sets of the estimated conditional density in a challenging problem that involves ≈500 covariates.
Here we estimate f (z|x), where x is the content of a tweet and z is the location where it was posted (latitude and longitude). FlexCode, based on sparse additive regression, is able to estimate the location of tweets even in ambiguous cases (there is a Long Beach both in In Section 2, we describe our method in detail, and present connections with existing literature on Varying Coefficient methods and Spectral Series CDE. Section 3 presents several applications of FlexCode. Section 4 discusses convergence rates of the estimator, and Section 5 concludes the paper.

Methods
Assume we observe i.i.d. data (X 1 , Z 1 ), . . . , (X n , Z n ), where the covariates x ∈ R D with D potentially large, and the response Z ∈ [0, 1]. 1 Our goal is to estimate the full density f (z|x) rather than, e.g., only the conditional mean E[Z|x] and conditional variance V[Z|x].
We propose a novel "varying coefficient" series approach, where we start by specifying an orthonormal basis (φ i ) i∈N for L 2 (R). This basis will be used to model the density f (z|x) as a function of z. As we shall see, each coefficient in the expansion can be directly estimated via a regression. Note that there is a wide range of (orthogonal) bases one can choose from to capture any challenging shape of the density function of interest (Mallat, 1999). For instance, a natural choice for reasonably smooth functions f (z|x) is the Fourier basis: Alternatively, one can use wavelets or related bases to capture inhomogeneities in the density (see Sec.3.4 for an example), and indicator functions to model discrete responses Sec. 4.2).
Smoothing using orthogonal functions is per se not a new concept (Efromovich, 1999;Wasserman, 2006). The novelty in FlexCode is that we, by using an orthogonal series approach for the response variable, can convert a challenging high-dimensional conditional density estimation problem to a simpler high-dimensional regression (point estimation) problem.
For fixed x ∈ R D , we write orthogonal to each other, the expansion coefficients are given by That is, each "varying coefficient" β i (x) in Eq. 1 is a regression function, or conditional expectation. This suggests that we, for fixed i, estimate β i (x) by regressing φ i (z) on x using the sample (X 1 , φ i (Z 1 )), . . . , (X n , φ i (Z n )).
We define our FlexCode estimator of f (z|x) as where the results from the regression, model how the density varies in covariate space. The cutoff I in the series expansion is a tuning parameter that controls the bias-variance tradeoff in the final density estimate.
Generally speaking, the smoother the density, the smaller the value of I; see Sec. 4 Theory for details. In practice, we use cross-validation or data splitting (Sec. 2.1) to tune parameters.
With FlexCode, the problem of high-dimensional conditional density estimation boils down to choosing appropriate methods for estimating the regression functions The key advantage of FlexCode is its flexibility: By taking advantage of new and existing regression methods, we can adapt to different structures in the data (e.g., manifolds, irrelevant covariates as well as different relationships between x and the response Z), and we can handle different types of data (e.g. mixed data, functional data, and so on).
We will further explore this topic in Secs. 3-4. 8

Loss Function and Tuning of Parameters
For a given estimator f (z|x), we measure the discrepancy between f (z|x) and f (z|x) via the loss function where C is a constant that does not depend on the estimator.
To tune the parameter I, we split the data into a training and a validation set. We use the training set to estimate each regression function β i (x). We then use the validation set (z 1 , x 1 ), . . . , (z n , x n ) to estimate the loss (4) (up to the constant C) according to: This estimator is consistent because of the orthogonality of the basis {φ i } i . We choose the tuning parameters with the smallest estimated loss L( f , f ). Algorithm 1 summarizes our procedure. In line 3, we split the training data in two parts to tune the parameters associated with the regression using the standard L 2 (R) regression loss, i.e., E[(W − β i (X)) 2 ].
In terms of computational efficiency, FlexCode is typically faster than existing methods for conditional density estimation (see Section 3), especially in high dimensions and for massive data sets. If the FlexCode estimator is based on a scalable regression procedure

Extension to Vector-Valued Responses
By tensor products, one can directly extend FlexCode to cases where the response variable Z is vector-valued. For instance, if Z ∈ R 2 , consider the basis where z = (z 1 , z 2 ), and {φ i (z 1 )} i and {φ j (z 2 )} j are bases for functions in L 2 (R). Then, let where the expansion coefficients Note that each β i (x) is a regression function of a scalar response. In other words, the FlexCode framework allows one to estimate multivariate conditional densities by only using regression estimators of scalar responses.
Remark: To avoid tensor products, one can alternatively compute a spectral basis {φ i (z)} i≥0 . This basis is orthonormal with respect to the density f (z) and adapts to the density's intrinsic geometry. The expansion coefficients are then given by , in which case one needs to estimate f (Z) as well.

Connection to Other Methods
Varying-Coefficient Models. The model f (z|x) = i∈N β i (x)φ i (z) can be viewed as a fully nonparametric varying-coefficient model. Varying-coefficient models (Hastie and Tibshirani, 1993) are often seen as semi-parametric models or as extensions of classical linear models, in which a function η is modeled as η Spectral Series CDE. FlexCode recovers the spectral series conditional density estimator of  if each β i (x) is estimated via a spectral series regression . Indeed, let {ψ j } j be a spectral basis for x, where by construction Sec. 2). In spectral series CDE, one writes the conditional density as Now, a spectral series regression for By inserting β i (x) into Eq.

Experiments
In what follows, we compare the following estimators: • FlexCode is our proposed series approach. We implement six versions of FlexCode, where we use different regression methods to compute the coefficients FlexCode-SAM is based on Sparse Additive Models (Ravikumar et al., 2009). 2 FlexCode-NN is based on Nearest Neighbors regression (Hastie et al., 2001). FlexCode-Spec uses Spectral Series regression  and is, as shown in Sec. 2.3, the same as Spectral Series CDE, the conditional density estimator in . For mixed data types, we implement FlexCode-RF, which estimates the regression functions via random forests (Breiman, 2001), and for functional data, we use FlexCode-fKR, where the coefficients in the model are estimated via functional kernel regression (Ferraty and Vieu, 2006). Finally, in Sec. 3.4, we illustrate how FlexCode-SDM can extend Support Distribution Machines (SDM; Sutherland et al. 2012) and other distribution regression methods to estimating conditional densities on sample sets or groups of vectors.
• KDE is the kernel density estimator f (z|x) := f (z, x)/ f (x), where f (z, x) and f (x) are standard multivariate kernel density estimators. We rescale the data to have the same mean and variance in each direction, and we assume an isotropic Gaussian kernel for both x and z, i.e., • KDE Tree is the multivariate kernel density estimator f (z|x) for data X i = (X i1 , . . . , X id ) and a bandwidth vector h = (h 1 , . . . , h d ). We use the R package np (Hayfield and Racine, 2008) with kd-trees and Epanechnikov kernels for computational efficiency (Gray and Moore, 2003;Holmes et al., 2007).
• kNN is a kernel nearest neighbors approach (Zhao and Liu, 1985; to conditional density estimation; it is defined as where N k (x) is the set of the k closest neighbors to x in the training set, and K is a 13 multivariate (isotropic) Gaussian kernel with bandwidth .
• fkDE is a nonparametric conditional density estimator for functional data (Quintela-del Río et al., 2011). It is defined as where d is a distance measure in the (functional) space of the data, K and K 0 are isotropic kernel functions, and h x and h z are tuning parameters.
Note that for regression, SAM is designed to work well when there is a small number of relevant covariates, and both Spectral Series Regression and Nearest Neighbors Regression perform well when the covariates exhibit a low intrinsic dimensionality. To our knowledge, KDE Tree is the only CDE method that can handle mixed data types.

Toy Examples
By simulation, we create toy versions of common scenarios with different structures in data and different types of data. We use 700 data points for training, 150 for validation and 150 for testing the methods. Each simulation is repeated 200 times.
Different structures in data.
• Irrelevant Covariates. In this example, we generate data according to Z|x ∼ , that is, only the first covariate influences the response.
• Data on Manifold. Here we let Z|x ∼ N (θ(x), 0.5), where x = (x 1 , . . . , x d ) lie on a unit circle embedded in a D-dimensional space, and θ(x) is the angle corresponding to the position of x. For simplicity, we assume that the data are uniformly distributed on the manifold; i.e., we let θ(x) ∼ U nif (0, 2π).
Different types of data.
We also consider spectrometric data for finely chopped pieces of meat. These high-resolution spectra are available 3 as a benchmark for functional regression models (see, e.g., Ferraty et al. (2007)), where the task is to predict the fat content of a meat sample on the basis of its near infrared absorbance spectrum. In our study, we use 215 samples to estimate conditional densities. The covariates are spectra of light absorbance as functions of the wavelength, and the response is the fat content of a piece of meat. We compare the functional kernel density estimator (fKDE ) with a FlexCode approach (FlexCode-fKR), where the coefficients in the model are estimated via functional kernel regression (Ferraty and Vieu, 2006). We follow Ferraty et al. (2007) and implement both methods with the kernel function K(u) = 1 − u 2 and the L 2 (R) norm between the second derivatives of the spectra as a distance measure. We use 70% of the data points for training, 15% for validation and 15% for testing; the 3 http://lib.stat.cmu.edu/datasets/tecator; the original data source is Tecator AB 15 experiment is repeated 100 times by randomly splitting the data. Figures 3-4 show the results for the toy data. Our main observations are:

Results
• Irrelevant Covariates. In terms of estimated loss (Fig. 3, top left), both FlexCode-SAM and KDE Tree outperform the other methods. However, in terms of computational time (Fig. 3, bottom left), FlexCode-SAM is clearly faster than KDE Tree as the dimension D of the data grows. When D = 17, each fit with KDE Tree already takes an average of 240 seconds (4 minutes) on an Intel i7-4800MQ CPU 2.70GHz processor, compared to 22 seconds for FlexCode-SAM. Fig. 4, left, shows that the loss of FlexCode-SAM remains the same even for large D ∼ 1000, although fitting the estimator becomes computationally more challenging in high dimensions. Nevertheless, fitting KDE Tree would be unfeasible for D > 50.
• Data on Manifold. FlexCode-Spec has the best statistical performance, followed by FlexCode-NN and KDE Tree (Fig. 3, top center). As before, the computational time of KDE Tree increases rapidly with the dimension (Fig. 3, center bottom). For these data, FlexCode-SAM is slow as well even for moderate D, perhaps because SAM cannot find sparse representations of the regression functions. On the other hand (see Fig. 4, right), FlexCode-Spec has a computational time that is almost constant as a function of D and the statistical performance remains the same even for large D. The latter result is consistent with our previous findings that spectral series adapt to the intrinsic dimension of the data (Izbicki et al., 2014;.
• Non-Sparse Data. For this example, FlexCode-Spec and FlexCode-SAM are the best estimators.
• Mixed Data Types. FlexCode-RF yields better results than KDE Tree (its only competitor in this setting) both in terms of estimated loss and computational time; see

Photometric Redshift Estimation
Our first application is photometric redshift estimation. Redshift (a proxy for a galaxy's distance from the Earth) is a key quantity for inferring cosmological model parameters.
Redshift can be estimated with high precision via spectroscopy but the resource considerations of large-scale sky surveys call for photometry -a much faster measuring technique, where the radiation from an astronomical objects is generally coarsely recorded via ∼5-10 broad-band filters. In photometric redshift estimation, the goal is to estimate the redshift z of a galaxy based on its observed photometric covariates x, using a sample of galaxies with spectroscopically confirmed redshifts. Because of degeneracies (two galaxies with different redshifts can have similar photometric signatures) and because of complicated observational noise, probability densities of the form f (z|x) better describe the relationship between x and z than the regression E(z|x) does.
In this example, we test our CDE methods on n = 752 galaxies from COSMOS, with

Twitter Data
Twitter is a social network where each user is able to post a small text (a tweet) containing at most 140 characters. Information about the location of the post is available upon user permission, but only a few users allow this information to be publicly shared. Here we use samples with known locations to estimate the location of tweets where this information has not been shared publicly.
Note that most literature on the topic concerns creating point estimates for locations (see, e.g., Rodrigues et al. 2015 and references therein). In this work, we estimate the full conditional distribution of latitude and longitude given the content of the tweet; that is, we estimate f (z|x), where x are covariates extracted from the tweets and z = (z 1 , z 2 ) is the pair latitude/longitude.
Our data set contains ≈ 8000 tweets in the USA from July 2015 with the word "beach".
We extract 500 covariates via a bag-of-words method with the most frequent unigrams and bigrams (Manning et al., 2008). As we only expect a few of the 500 covariates to be relevant to locating the tweets, we implement FlexCode via sparse additive models. To our knowledge, no other fully nonparametric conditional density estimation method can be directly applied to these types of data where there are many irrelevant variables.
which covariates are most relevant for predicting location. For the example in Fig.2, left, the expressions "beachin","boardwalk", and "daytona" are included in at least 33% of the estimated regression functions. For the example to the right, the relevant covariates are "long beach", "island", "long", and "haven".

From Distribution Regression to "Distribution CDE": Estimating the Mass of a Galaxy Cluster from Sample Sets of Galaxy Velocities
Distribution regression and classification is a recent emerging field of machine learning.
Instead of treating individual data points (or feature vectors) as covariates, these methods operate on sample sets, where each set is a sample from some underlying feature distribution; see Sutherland et al. (2012) and references within. Here we show that FlexCode extends to sample sets as well; our application is estimation of the mass of a galaxy cluster given the line-of-sight velocities of the galaxies in the cluster.
Galaxy clusters, the most massive gravitationally bound systems in the Universe, can contain up to ∼1000 galaxies. These structures are a rich source of information on astrophysical processes and cosmological parameters, but to use galaxy clusters as cosmological probes one needs to accurately measure their masses. A standard approach is to employ the classical virial theorem and directly relate the mass of a cluster to the line-of-sight (LOS) galaxy velocity dispersion, i.e., the variance of the measured galaxy velocities in the cluster (Evrard et al., 2008). Recently, Ntampaka et al. (2015a) and Ntampaka et al. (2015b) have shown that one can significantly improve such mass predictions by taking advantage of the entire LOS velocity distribution of galaxies instead of only the dispersion (i.e., a summary of the distribution). Here we show that FlexCode can further improve these results.
The general set-up is that we observe data of the form (x where z i is the mass of the i-th cluster for i = 1, . . . , I; and x (j) i is a vector of galaxy observables (such as LOS velocity and the projected distance from the cluster center) for the j-th galaxy in the i-th cluster. Note that different clusters i contain different numbers J i of galaxies. The key idea behind Support Distribution Machines (SDMs; proposed for this application by Ntampaka et al. 2015a) as well as other "distribution regression" methods (Sutherland et al., 2012), is to treat each sequence x as a sample from a probability distribution p i , and to construct an appropriate kernel matrix on these sample sets.
The task is then to predict a scalar (z i ) from a distribution ( k nearest neighbors method for k = 2. That is, let X A denote the set of LOS velocities associated with the n galaxies of cluster A, and let X B denote the set of velocities associated with the m galaxies of cluster B. The estimated KL divergence from p A to p B is given by where ν k (i) is the Euclidean distance from the covariates (in this case, the LOS velocity) of 22 the i-th galaxy in X A to its k-th nearest neighbor in X B , ρ k (i) is the Euclidean distance from the covariates (the LOS velocity) of the i-th galaxy in X A to its k-th nearest neighbor in X A , and d is the number of galaxy observables (in this example, d = 1). As the computed kernel matrix k(X A , X B ) = exp (−KL n,m (X A , X B )/σ 2 ) may not be positive semi-definite (PSD), we project the matrix to the closest PSD matrix in Frobenius norm (Higham, 2002).
Using the PSD kernel matrix, we then estimate the conditional density f (z|p). We compare four approaches to conditional density estimation on distributions, which as in the rest of the paper use a Fourier basis in z; • Functional KDE: the functional kernel density estimator (Quintela-del Río et al., 2011), • FlexCode-NN: FlexCode with Nearest Neighbors regression, • FlexCode-Spec: FlexCode with Spectral Series regression, • FlexCode-SDM: FlexCode with SDM regression.
In the experiments, we also include a FlexCode estimator that use a wavelet basis in z; • FlexCode W -SDM: FlexCode with SDM regression in x, and Daubechies wavelets with 3 vanishing moments in z.
Our data consist of simulations of n = 5028 unique galaxy clusters with minimum mass of 1×10 14 M h −1 ; see Ntampaka et al. (2015a) for details. All four methods above are based on the same distance computation KL n,m (X A , X B ) with k = 2, and we use data splitting and the loss (5) for selecting tuning parameters. For simplicity, we only consider one LOS for each cluster (the x-axis LOS in the catalog).
It is clear from Table 1   The top left panel of Figure 7 shows  Finally, we notice that both the mean and the mode of FlexCode-SDM as well as FlexCode W -SDM densities improve upon plain SDM regression.   Figure 7 shows some examples of multimodal densities and their 95% HPD regions. In many cases, returning a predictive region for the cluster mass is a better alternative to just taking the mean or mode of the density. The coverage plot in the bottom right panel also indicates that the empirical coverage of these regions is indeed close to 95%.

Theory
In this section, we derive bounds and rates for FlexCode; that is, the conditional density estimator in Eq. 3. We use the notation f I (z|x) to indicate its dependence on the cutoff I.
We assume that f belongs to a set of functions which are not too "wiggly". For every denote the Sobolev space. For the Fourier basis {φ i } i , this is the standard definition of Sobolev space (Wasserman, 2006); it is the space of functions that have their s-th weak derivative bounded by c 2 and integrable in L 2 (R). We enforce smoothness in the z-direction by requiring f (z|x) to be in a Sobolev space for all x. This is formally stated as Assumption 1, where β and C are used to link the Sobolev spaces at different x.
viewed as a function of z, and s x and c x are such that inf x s x def = β > 1 2 and X c 2 We also assume that each function β i (x) is estimated using a regression method with 27 convergence rate O(n −2α/(2α+d) ), where typically α is a parameter related to the smoothness of the β i (x) function, and d is either the number of relevant covariates or the intrinsic dimension of x. In other words, we assume that each regression adapts to sparse structure in the data. This is formally stated as Assumption 2.
Assumption 2 (Regression convergence). For every i ∈ N, there exists some d ∈ N and α > 0 such that Note that the smoothness parameter α must be the same for every i ∈ N. Typically this assumption will hold because in many applications it is reasonable to assume that (i) if x 1 is close to x 2 , then f (z|x 1 ) is also close to f (z|x 2 ) for every z ∈ R (in other words, f (z|x) is smooth as a function of x), and (ii) there is some structure in x (e.g., low intrinsic dimensionality) or in the relationship between x and z (e.g., sparsity), which the regression method for estimating β i takes advantage of. Here are some examples where Assumption 2 holds: (E1) β i is the k-nearest neighbors estimator (Kpotufe, 2011), d is the intrinsic dimension of the covariate space and, for every z ∈ [0, 1], f (z|x) is L-Lipschitz in x (in this case, α = 1); (E2) β i is a local polynomial regression (Bickel and Li, 2007), d is the intrinsic dimension of the covariate space and, for every z ∈ [0, 1], f (z|x) is α times differentiable with all partial derivatives up to order α in x are bounded; (E3) β i is the Rodeo estimator (Lafferty and Wasserman, 2008), d is the number of variables that affect the distribution of Z and, for every z, all partial derivatives of f (z|x) up to fourth order in x are bounded (in this case, α = 2); (E4) β i is the regression estimator from Bertin and Lecué (2008), d is the number of variables that affect the distribution of Z, 5 and, for every z ∈ [0, 1], f (z|x) is α-Hölderian in x; 5 That is, there exists a subset R ⊆ {1, . . . , D} with |R| = d such that f (z|x) = f (z|(x i ) i∈R )

28
(E5) β i is the Spectral series regression , d is the intrinsic dimension of the covariate space and, for every z ∈ [0, 1], f (z|x) is smooth with respect to P X according to ||∇f (z|x)|| 2 dS(x) < ∞ for a smoothed version S(x) of f (in which case α = 1); (E6) β i is a local linear functional regression (Baíllo and Grané, 2009), the predictor X is a function talking values in L 2 ([0, 1]), X is fractal of order τ , and, for every z ∈ [0, 1], f (z|x) is twice differentiable with a continuous second derivative (yielding rates with α = 2 and d = τ .) In essence, Assumption 2 holds for examples E1-E6 because smoothness in f (z|x) (seen as a function of x) implies smoothness of the β i (x) functions in FlexCode. We refer to Appendix A1 for details and proofs. (See also, e.g., Yang and Tokdar (2015) and references therein for other adaptive regression methods.) We also note that the converge rates may vary depending on the choice of basis.
Under Assumptions 1-2, we bound the bias and variance of f I (z|x) separately.
To summarize: The convergence rate of FlexCode only depends on d, the "true" dimension of the problem. Moreover, the rate is near minimax with regards to d: In the isotropic setting where x and z have the same degree of smoothness, i.e., α = β, the rate becomes O n − 2α 2α+d 2α+1 2α +1 , which is close to the minimax rate O n − 2α (2α+1+d) of a conditional density estimator with d covariates . The difference is the multiplicative factor 2α+1 2α , which gets closer to 1, the smoother f is. Although FlexCode's rate is slightly slower than the optimal rate, 6 the estimator is still considerably faster than O n − 2α (2α+1+D) , the usual minimax rate of a nonparametric conditional density estimator in R D . In other words, even though there are D covariates, our estimator can overcome the curse-of-dimensionality and behave as if there are only d D covariates.
Finally, note that although we here restrict our examples to cases where either (i) the intrinsic dimension is small or (ii) several covariates are irrelevant, the theory we develop can easily be applied to other settings for high-dimensional regression estimation. For instance, Yang and Tokdar (2015) introduce a third type of sparse structure: in their paper, r may depend on all D covariates, but admits an additive structure r = k s=1 r s , where each component function r s depends on a small number d s of predictors. The authors then show that an additive Gaussian process regression achieves good rates of convergence in such a setting. It follows that FlexCode can achieve good rates under a similar additive setting f (z|x) = k s=1 f s (z|x) if one estimates the expansion coefficients via additive Gaussian process regression. computation for structured data and complex simulation models. Another interesting direction for future work is variable selection via FlexCode. For example, FlexCode-Forest and FlexCode-SAM currently perform a separate variable selection for each coefficient β i (x) in FlexCode (Eq. 2), but one can unify these results to define a common support for the final FlexCode estimate.
interior of Spain. Our final example, corresponding to the right plot in the figure, is a tweet in Portuguese about cold weather. Our FlexCode model here assigns high probability to big cities in Brazil, which is consistent with July being a winter month in the south hemisphere.
We also notice that it in the winter rains a lot in Recife, the northernmost city that are colored red in the density plot. This is why FlexCode assigns a high probability to this location despite the city being much smaller than Sao Paulo and Rio de Janeiro.

C Proofs and Additional Results
To prove that the estimators in examples E1-E6 in Sec. 4 satisfy Assumption 2, we only need to show that smoothness in the conditional density f (z|x) (seen as a function of x) implies smoothness for each varying coefficient β i (x). Assumption 2 then follows directly from known convergence results for regression. For E3 and E4, note that if there exists a subset R ⊆ {1, . . . , D} with |R| = d such that f (z|x) = f (z|(x i ) i∈R ) (i.e., there are only d relevant covariates), then β i (x) = β i ((x i ) i∈R ).
The notion of smoothness in Bertin and Lecué (2008) is based on Hölderian classes. Hence: Lemma 5. Let {φ i } i be the Fourier basis and P l (f )(·, x) be Taylor polynomial of order l associated with f at the point x. If, for every fixed z ∈ R, f z (x) := f (z|x) belongs to Σ(α, L), the α-Hölderian class, i.e., |f z (x) − P l (f z )(t, x)| ≤ L||t − x|| α 1 where l = α , then β i (x) belongs to Σ(α, √ 2L) for all i ∈ N.
Proof. Because β i (x) = φ i (z)f (z|x)dz, then Finally, the local linear functional regression estimator (Baíllo and Grané, 2009) assumes that the regression function has continuous second derivatives. Hence: