Robust Regression with Compositional Covariates

Many high-throughput sequencing data sets in biology are compositional in nature. A prominent example is microbiome profiling data, including targeted amplicon-based and metagenomic sequencing data. These profiling data comprises surveys of microbial communities in their natural habitat and sparse proportional (or compositional) read counts that represent operational taxonomic units or genes. When paired measurements of other covariates, including physicochemical properties of the habitat or phenotypic variables of the host, are available, inference of parsimonious and robust statistical relationships between the microbial abundance data and the covariate measurements is often an important first step in exploratory data analysis. To this end, we propose a sparse robust statistical regression framework that considers compositional and non-compositional measurements as predictors and identifies outliers in continuous response variables. Our model extends the seminal log-contrast model of Aitchison and Bacon-Shone (1984) by a mean shift formulation for capturing outliers, sparsity-promoting convex and non-convex penalties for parsimonious model selection, and data-driven robust initialization procedures adapted to the compositional setting. We show, in theory and simulations, the ability of our approach to jointly select a sparse set of predictive microbial features and identify outliers in the response. We illustrate the viability of our method by robustly predicting human body mass indices from American Gut Project amplicon data and non-compositional covariate data. We believe that the robust estimators introduced here and available in the R package RobRegCC can serve as a practical tool for reliable statistical regression analysis of compositional data, including microbiome survey data.

Microbiome profiling surveys are often accompanied by measurements of other covariates, including physicochemical properties of the underlying habitats, variables related to the phenotypic status of the host, or those coming from orthogonal high-throughput protocols, such as metabolomics or flow cytometry. An important step in exploratory analysis of such data sets is to infer parsimonious and robust statistical relationships between the microbial compositions and habitat-or host-specific measurements. Standard linear regression modeling can, however, not be applied in this context because the microbial count data only carry relative or compositional information. Several regression techniques have been introduced to handle compositional data, including Dirichlet multinomial mixture modeling (Holmes, Harris and Quince, 2012) and kernel penalized regression (Randolph et al., 2018). A popular approach to regression modeling with compositional covariates is the (quadratic) log-contrast regression model, as proposed by Aitchison and Bacon-Shone (1984) in the context of experiments with mixtures. When only linear log-contrasts are considered, this model expresses the continuous response as linear combination of the log-transformed compositions subject to a zero-sum constraint on the regression vector. This model allows the intuitive interpretation of the response as a linear combination of log ratios of the original compositions. An alternative equivalent lowdimensional approach considers linear regression after applying an isometric log-ratio transform to the compositions (Hron, Filzmoser and Thompson, 2012).
In the context of statistical microbiome data analysis, the linear logcontrast model has been brought to the high-dimensional setting via regularization. The sparse linear log-contrast model, introduced in Lin et al. (2014), considers variable selection via 1 regularization and has been extended to multiple linear constraints for sub-compositional coherence across predefined groups of predictors (Shi et al., 2016), to sub-composition selection via tree-structured sparsity-inducing penalties (Wang and Zhao, 2017), and to longitudinal data modeling via a constraint group lasso penalty (Sun et al., 2018). While these approaches can lead to parsimonious models linking (microbial) compositions to responses of interest, the present methods are sensitive to outliers or high-leveraged data points in the response. In this contribution, we alleviate these shortcomings by introducing a robust framework for log-contrast regression.
Main contributions. We introduce a mean shift formulation for log-contrast regression that enables the modeling of outliers in the response variable. We achieve parsimonious model identification through sparsity-promoting convex and non-convex regularizers, including a novel adaptive 1 penalty, resulting in a family of robust estimators. All estimators allow for simultaneous outlier detection and variable selection with compositional covariates. We formulate the associated (non-)convex optimization problems using an augmented Lagrangian framework and present an iterative thresholding/proximal algorithm for efficient numerical minimization. Furthermore, we equip the estimators with a novel sampling-based robust initialization procedure and a cross-validation scheme which is specifically tailored to robust model selection. We derive prediction error bounds for the introduced estimators and illustrate the behavior of the estimators in a wide range of simulation scenarios. The robust estimators enable a large-scale analysis of the challenging task of body mass index (BMI) prediction from ampliconbased and other covariate data from the American Gut Project (McDonald et al., 2018). Our analysis provides evidence that the BMI of lean and normal participants can be accurately predicted by linear log-contrasts. Overweight to obese BMI values are, to a large extent, identified as outlier, thus suggesting that more complex models are necessary. The introduced estimators as well as the data analysis examples are available in the R package RobRegCC accompanying this paper.
Notation. We denote the abundances of p compositions or counts across n observations by W = [w 1 , . . . , w n ] T = (w ij ) ∈ R n×p . Let A be an index set.
The matrices W .A and W A. denote sub-matrices of W with column and row index A, respectively. The symbol −j denotes exclusion of the jth column. For instance, W i,−j = w i,−j denotes the ith row of W without the jth column. Similarly, w A represent a sub-vector of w ∈ R p with indices A. We denote the projection matrix of W by P W = W(W T W) † W T where M † denotes the Moore-Penrose (or generalized) inverse of M. The orthogonal complement of P W is P ⊥ W = I − P W . The column space of W is given by CS(W). The notation rep(a, b) refers to a1 b . The operator J (w) denotes the support of a vector w with cardinality |J (w)|.
2. Log-contrast regression for compositional data. Our logcontrast modeling framework is motivated by regression problems arising in microbiome data analysis. There, we are given a measurement matrix D = [d 1 , . . . , d n ] T ∈ R n×p which comprises n observations of a pdimensional vector of counts. The counts correspond to the number of OTUs, ASVs, or genes that have been identified in a biological sample with targeted amplicon-based or metagenomic sequencing and analysis protocols. Due to experimental limitations, however, the counts only carry relative or proportional information and do not represent absolute abundances. A common normalization approach is to divide each sample by its total sum, resulting in a matrix W = [w 1 , . . . , w n ] T ∈ R n×p where each w i = d i /1 T p d i represents a p-dimensional vector of proportions or compositions. Prior to the normalization, any zero count is replaced by a small constant pseudocount (Aitchison, 1982) or a small random count generated from an appropriate probability distribution (Friedman and Alm, 2012). Any compositional vector w i is thus constraint to the (p − 1)-dimensional simplex S (p−1) = {[s 1 , . . . , s p ] T : 0 < s k ≤ 1, p k=1 s k = 1}. The problem of interest is to find associations between the measured compositions W and a continuous response or outcome variable of interest y = [y 1 , . . . , y n ] T ∈ R n that has been jointly collected with the microbiome data. Aitchison and Bacon-Shone (1984) provide a useful framework to model such association via log-contrast regression.
2.1. The standard log-contrast regression model. The principle idea of log-contrast regression is to model the outcome y as linear combination of log-ratios derived from the compositional covariate data W. A common transform is the additive log-ratio (alr) transform (Aitchison, 1982) which requires the choice of a reference. When considering the kth predictor as reference, the alr-transformed data are U = [u 1 , . . . , u n ] T , where u i = [u i1 , . . . , u ip ] with u ij = log(w ij /w ik ). The log-contrast regression model can then be written as T is the coefficient vector, and = [ 1 . . . n ] ∈ R n is independent and identically distributed (IID) noise with mean E( i ) = 0 and variance Var( i ) = σ 2 . A major drawback of model (2.1) is its loss of permutation invariance due to the choice of a reference (Aitchison, 1982). By expressing b k = − i =k b i , we can reformulate model (2.1) into a symmetric permutation-invariant form as (Aitchison and Bacon-Shone, 1984). The linear constraint in (2.2) ensures that, after model fitting, the response can be equivalently expressed as linear combinations of log-ratios of the original compositions (Aitchison, 2003;Sun et al., 2018;Bates and Tibshirani, 2018;Combettes and Müller, 2019). The model also ensures subcompositional coherence, a key principle in compositional data analysis. This principle states that the analysis should be coherent even if we had only selected subcompositions out of the full compositions, or if the analyzed compositions are only parts of larger compositions containing other parts.
2.2. Extended log-contrast regression models. The linear log-contrast model (2.2) can be conveniently extended in several ways. When the linear model is not expressive enough, Aitchison and Bacon-Shone (1984) suggest to include higher-order log-ratios terms. For instance, the quadratic log-contrast model reads (2.3) Given p compositional covariates, the quadratic part of the model comprises p q = p 2 squared log-ratio covariates per sample and p q unconstrained coefficients q kl . These non-compositional covariates form a matrix S ∈ R n×pq with coefficients q = [q 11 , q 12 , . . . , q p−1p ] ∈ R n . To reduce the number of the covariates prior to model estimation, a subset of the quadratic covariates can be selected via correlation screening. When additional m habitat and host-associated covariates H ∈ R n×m are available, we propose to form the predictor matrix N = [S H] and introduce the extended linear log-contrast model where Z = [z 1 , . . . , z n ] T , b ∈ R p is the coefficient vector for the compositional covariates, and a ∈ R pq+m is the coefficient vector for the noncompositional covariates, respectively.
The zero-sum constraint in (2.4) can be generalized when grouping information about the predictors is available. In the microbiome context, each predictor represents an OTU which can be associated with taxonomic or phylogenetic information. This information is typically encoded in a taxonomic or phylogenetic tree T K,p with p leaves and K levels. Following Shi et al. (2016), we can include this information in (2.4) via a generalized linear constraint. Let us assume that we are interested in analyzing the microbiome data at a fixed (taxonomic or phylogenetic) level of the tree, e.g., at the phylum level. This induces a grouping of the p taxa into k disjoint sets with column index set A r such that |A r | = p r for r = 1, . . . , k and k r=1 p r = p. Each set represents the taxa in the respective phylum. If the goal of the analysis is to be subcompositionally coherent with respect to the tree grouping, we define the subcomposition matrix C: where c j accounts for the composition in the subgroup with index set A j such that (c j ) A j = 1 p j . The model in (2.4) can thus be generalized by including the subcomposition matrix C: where {Z Ar , b Ar } are the covariates and unknown coefficients corresponding to the rth sub-group. The model in (2.4) is a special case of model (2.6) with C = 1 p .
3. Robust log-contrast regression. Many biological datasets, including microbiome profiling data, contain outliers or other forms of data corruptions that can hamper statistical estimation. For example, the extended log-contrast model in (2.6) assumes errors to be well behaved, i.e., free from outliers in [y, Z, N]. It has long been established (see, e.g., Huber (1964Huber ( , 1981) that robust statistics can offer concepts and techniques that allow to relax strict model assumptions on the data. While several robust statistics concepts have been proposed for low-dimensional compositional data analysis (see (Filzmoser and Hron, 2011) for an overview), few approaches are available in the high-dimensional setting. In , robust estimators are used for microbiome data normalization prior to differential abundance analysis. In (Shi, Zhou and Zhang, 2018), an error-invariable approach for high-dimensional microbiome data is proposed which considers data corruption in W. We introduce a family of robust extended log-contrast model estimators that can handle outliers in observations. 3.1. Formulation of the robust log-contrast regression model. We equip the extended log-contrast model in (2.6) with a mean shift vector γ = [γ 1 , . . . , γ n ] T resulting in the model: The mean shift vector γ = [γ 1 , . . . , γ n ] T can account for grossly corrupted observations with respect to the linear model. The support set J (γ) thus comprises the outliers. Extending a linear model with a mean shift vector has been proposed for wavelet estimation in partial linear models (Antoniadis, 2007;Gannaz, 2007) and for robust linear regression in (She and Owen, 2011;Lee, MacEachern and Jung, 2012;Nasrabadi, Tran and Nguyen, 2011).
For ease of presentation, we fuse the compositional and non-compositional covariates into the general design matrix X = [Z N]. The corresponding model coefficients are denoted by β = (b a) ∈ R p . The linear constraint matrix C is augmented by a k × (m + p q ) zero-matrix leading to C = [C T 0 k×(m+pq) ] T . The model in (3.1) thus simplifies to The model in (3.2) forms the basis for the Robust log-contrast Regression estimators with Compositional Covariates (RobRegCC) introduced next.
3.2. Robust log-contrast regression estimation. The RobRegCC model is over-specified even in the low-dimensional setting as it comprises (p+p q +m+ n) unknown parameters. We thus consider sparsity-inducing penalization for proper model parameter identification. The proposed class of estimators for the parameters (γ, β) are associated with the following general optimization problem: where P 1 λ 1 (γ) and P 2 λ 2 (β) are sparsity-inducing regularizers with tuning parameters λ 1 and λ 2 , respectively. The regularization framework involves solving the optimization problem (3.3) over a grid of tuning parameters {λ 1 , λ 2 }. When the same penalty functions are used for regularizing β and γ, i.e., P λ (·) = P 1 λ (·) = P 2 λ (·), we can simplify the optimization problem in (3.3). Normalizing the 2 -norm of the columns of X to length √ n, scaling the mean shift vector γ by the factor √ n, and concatenating the unknowns where P λ (δ) is a sparsity-inducing penalty with a single tuning parameter λ. This allows joint sparse variable selection in β and sparse outlier detection in γ. We focus, in theory and practice, on three different choices for the penalty function P λ (δ): T represents the model parameters, w a vector of non-negative weights, and α ∈ [0, 1] the mixing weight between the sparsityinducing 0 / 1 -norm and the 2 -norm, respectively. The symbol • denotes the element-wise product. Penalty function I comprises a mixture of the nonconvex 0 "norm" and ridge (or Tikhonov) regularization via the squared 2 -norm. Following She and Owen (2011), we refer to this penalty as the hard-ridge penalty P H . Penalty II considers a convex relaxation of penalty I and comprises the "Elastic-Net" penalty (Zou and Hastie, 2005) as mixture of 1 -norm and squared 2 -norm and is denoted by P S . Penalty III augments penalty II by a non-negative weight vector w in the 1 -norm leading to a weighted or "adaptive" penalty P A , similar to the adaptive lasso framework (Zou, 2006). This convex penalty is novel in the context of mean shift estimation and requires the construction of appropriate weights w via a robust data-driven initialization procedure, as detailed in 4.2. The choice of the penalty function determines the properties of the corresponding robust estimator. In the low-dimensional setting, Antoniadis (2007) and Gannaz (2007) showed the equivalence between the Huber's M-estimator and mean shift model estimation with 1 norm penalization. She and Owen (2011) extended these results to equivalence relationships between non-convex penalties and non-convex M-estimators that can handle gross outliers. We extend these results to the compositional setting.

4.
A unifying computational framework for robust log-contrast regression. We introduce a complete computational framework for parameter estimation of the robust extended log-contrast regression model in (3.2). The framework comprises three parts: (i) a general algorithm for solving the optimization problem in (3.4) that can encompass any of the introduced penalty functions, (ii) a novel robust initialization procedure that is instrumental when penalty functions I or III are used, and (iii) a new cross-validation-based model selection strategy specifically tailored to robust estimation. 4.1. An general optimization algorithm. We describe an algorithmic framework that can handle the optimization problem in (3.4) with any of the penalty functions I-III. Without regularization, the objective function belongs to the class of quadratic programs with linear constraints and can be solved efficiently with a wide variety of standard solvers. With regularization, however, the problem requires specialized optimization strategies. As the considered penalty functions allow for the efficient computation of their corresponding proximity or thresholding operators (see, e.g., Antoniadis and Fan (2001); Combettes and Pesquet (2011); She and Owen (2011)), several algorithms are available for the estimators with the convex penalties II-III, including modern splitting methods (Combettes and Pesquet, 2012;Briceño-Arias and Rivera, 2018). For ease of simplicity, however, we propose an iterative thresholding algorithm which is derived from an augmented Lagrangian formulation and can encompass all penalty functions.
A fundamental building block for the proposed algorithm is the use of the proximity or thresholding operator Θ(·) associated with a penalty function P (·): For the penalty functions considered here, we can derive explicit formulations for the operators (see also, e.g., Antoniadis and Fan (2001); She and Owen (2011); Combettes and Pesquet (2011)). For any scalar a, the soft thresholding operator is defined as Θ S λ (a) = sign(a)(|a| − λ) + , and the hard threshold operator is Θ H λ (a) = a1 |a|>λ . Table 1 summarizes the parameterized scalar thresholding operators, associated with the penalty functions I-III. Note that for vector-valued input to the penalty functions, the thresholding Table 1 Penalty function and corresponding thresholding/proximity operator.
operators are applied element-wise. To solve the optimization problem in (3.4), we first handle the linear constraint as follows. The linear constraint C T β = 0 implies β = P ⊥ C θ for any θ ∈ R p . We thus reformulate the optimization problem in (3.4) as . When only compositional covariates as predictors are considered, the subcompositional constraint matrix C implies that the matrix XP ⊥ C carries the centered logratio (clr) transformed variables of the original compositional covariates W. Hence, sparse estimates of θ facilitate variable selection in clr-transformed covariates.
We solve the constraint optimization problem in (4.1) by an augmented Lagrangian approach. The standard augmented Lagrangian for the problem reads where ζ ∈ R k are the Lagrange multipliers and µ > 0 is a regularization parameter. By reparameterizing η = ζ/µ and completing the "square", the augmented Lagrangian simplifies to We consider an iterative dual descent approach for solving the associated optimization problem.
The primal update requires solving an unconstrained optimization problem for fixed Lagrange multipliers η (i) . By grouping all terms in the Lagrangian appropriately, we can rewrite this subproblem in standard form For the penalty functions I-III, this problem formulation is amenable to iterative shrinkage/thresholding algorithms (ISTA) (see, e.g., Daubechies, Defrise and De Mol (2004); Beck and Teboulle (2009)) or, equivalently, to the thresholding-based iterative selection procedure (TISP) (She, 2009). Convergence guarantees, however, depend on the specific properties of the penalty function. ISTA algorithms comprise a (forward) gradient step and a (backward) proximal/thresholding step. To solve the primal update at the (i+1)th stage for the objective in (4.2), the (j + 1)th iteration in ISTA reads is the thresholding operator corresponding to the considered penalty function (see Table 1). The operator is applied element-wise to the entries of the vectors. The iterative algorithm is stopped when a prescribed convergence criterion on the consecutive iterates is reached. To ensure monotone decrease in the objective function, the scaling constant needs to satisfy (2016)). Global and local convergence of the iterates can be proven for convex and non-convex penalties, respectively (Bauschke and Combettes, 2011;Bayram, 2016;She, 2009). In order to solve the primal update fast and robustly, we provide penalty-dependent initial parameter estimates θ (i,0) . For the convex penalties II-III, we employ a "warm start" strategy and set . When the non-convex penalty I is used, we set θ (i,0) =δ, which is the solution of our robust initialization procedure, detailed in Section 4.2. This robust solution is also used to construct weights for penalty II. Following Zou (2006), we set w = |δ| −ν with ν = 1.
The convergence properties of the entire dual descent approach naturally depends on the convergence properties of the algorithms for the subproblems. We refer to Bertsekas (1982), Proposition 2.1 -2.3, for convergence guarantees regarding the Lagrangian approach. In practice, we see fast and robust performance of the algorithm under a wide range of simulation and application scenarios. We summarize the described algorithm in pseudo-code notation in Algorithm 1.
Before outlining the robust initialization scheme, we briefly describe a simplification of Algorithm 1 when we do not assume sparsity in the coefficient vector β in the generative model (3.2), thus removing the penalty P 2 λ 2 (β) in (3.3). We refer to this setting as the non-sparse model. In this setting, the parameter β can be effectively marginalized out of the model as follows. For a fixedγ, the estimateβ solving the linearly constrained least-squares problem (3.3) is given by Consider {β (i) , γ (i) } be the parameter estimate in the (i + 1)th iteration. The update rule for β is then given by β which automatically ensures the linear constraint satisfaction C T β (i+1) = 0. To marginalize the model parameter β, we substitute β with its estimate, thus leading to an optimization problem for the shift parameter γ: Using the thresholding operators from Table 1, we can solve the problem in (4.5) explicitly for all penalty functions I-III using ISTA-type schemes.
For penalty functions I-III, the updates read respectively. Note again that for penalty I and III, a robust initialization (see Section 4.2) is necessary for the initial estimate and the construction of the weights w, respectively. The pseudo-code for estimation in the non-sparse model is summarized in the Supplementary material in Algorithm 2.
Robust initialization. Robust estimation procedures for linear models, including S-estimators (Rousseeuw and Yohai, 1984), MM-estimators (Yohai, 1987), the Θ-IPOD (She and Owen, 2011), and the Penalized Elastic-Net S-Estimator (PENSE) (Cohen Freue et al., 2017), are multi-stage esti-mators that comprise an initialization stage and one or several improvement stages. A common theme for the initialization stage is the use of resamplingbased approaches in combination with robust loss functions (Maronna, Martin and Yohai, 2006;Salibian-Barrera and Yohai, 2006). While most methods assume the model coefficients to be dense, a variant of the Θ-IPOD as well as the PENSE encourage sparse coefficients. However, the latter methods operate under the standard linear model and are not suited for the compositional setting.
When solving the optimization problem in (4.1) with penalty I or III, RobRegCC also requires good initial estimates of the coefficient and the mean shift vector, respectively. One initialization strategy could have been to extend ideas from fast S-estimators (Salibian-Barrera and  to the compositional setting. However, we found the procedure to be prone to masking and swapping effect, especially for high leverage outliers in higher dimensions. We thus propose to use the concept of principal sensitivity components (PSC) (Peña and Yohai, 1999) for RobRegCC.
In the linear model, PSC analysis has been originally introduced for ordinary least squares (Peña and Yohai, 1999) and extended to robust ridge regression (Maronna, 2011) and robust sparse regression (Cohen Freue et al., 2017). Briefly, PSC analysis relies on the idea of leave-one-out sensitivities of the following form. Given all samples, let y i be the estimated prediction of the model under consideration for observation i, and y i(j) the corresponding prediction value with the jth observation removed. The sensitivity of the ith observation is then defined as The sensitivity matrix of all observation is defined as R = [r 1 , . . . , r n ] ∈ R n×n . To identify potential outliers in the data, PSC analysis proceeds by first computing an eigenvalue decomposition of the matrix RR T . The eigenvectors u i of that matrix are called the principal sensitivity components (PSCs) of the matrix R (Peña and Yohai, 1999). Observations that comprise extreme values with respect to the PSCs are deemed outliers and removed from the samples. In RobRegCC, we mainly follow the protocol that has been designed for PENSE (see (Cohen Freue et al., 2017) 2.2.), and propose a PSC based analysis for the sparse log-contrast model (as detailed in Shi et al. (2016)) and obtain an estimate of the coefficient on the subsequent potentially clean subsamples. The exact computational protocol is given in the Supplementary Material 1.2. The final outcome of the robust initialization procedure is the estimateδ = [β Tγ T ] T which is used in RobRegCC as initial estimate in the optimization when the hard-ridge penalty is used, and for weight construction for the adaptive Elastic Net penalty, respectively.
4.3. Model selection. In practical applications, the robust log-contrast estimators require data-driven tuning of the regularization parameter λ. Despite the extensive literature on model selection for penalized regression models, few methods are available that tackle the challenge of joint sparsification of regression vectors and outlier detection. One exception is the model selection procedure for the Θ-IPOD (She and Owen, 2011). There, model selection is performed via a modified Bayesian Information Criterion (BIC). The empirical BIC curve over the regularization path is smoothed via a spline, and the local minimum over a subset of the curve is selected. We found this BIC criterion to be unreliable and inconsistent in practice and thus devise a novel cross-validation-based model selection scheme.
After model initialization (resulting in a good initialβ estimate for the estimator with hard-ridge penalty and weights w for the adaptive Elastic net penalty, respectively), we divide the training data into k = 5 folds and solve, for each fold, the optimization problem associated with the estimator along the λ-path. We consider l = 50 λ values log-linearly spaced in the interval [λ min , λ max ] where λ max = max(|P ⊥ X y|/w, |y|) and λ min is the multiple of λ max at which at most the mean shift parameterγ comprises n/2 non-zeros, i.e., half of the data are considered outliers. For penalties I and III, the weights are set to w = 1 n . We compute solutions along the path for penalties II and III via warm starts, by using the solution from larger λ values as initial value for the next λ value. For the non-convex problem with penalty I, we always useβ from the initialization procedure as initial estimate. For a given fold, denote by {y tr , X tr } the training data with n tr data points, and by {y te , X te } the test data with n te data points. For a given λ, denote by θ λ and γ λ the regression and mean shift estimates, respectively. We compute the residuals and scale as follows: For the test data, we compute the residuals As the test data may also contain outlier observations, we require a robust scale estimation procedure. Setting the scaled test error to r te,λ = te,λ /s tr,λ , we compute its median absolute deviation (MAD) (Rousseeuw and Hubert, 2011) as the robust estimate of scale denoted byŝ te,λ . After removing the outliers from r te,λ identified by |r te,λ | > 1.345ŝ te,λ , we calculate standard deviation σ te,λ of the "clean" test sample error and use λ = |σ te,λ − 1| as test statistics for the cross-validation procedure. We select the λ CV that minimizes the average 5-fold cross-validation error λ over the regularization path. The empirical performance of the scheme is evaluated in Section 6.
5. Non-asymptotic Analysis. We observe that the number of unknown parameters in the robust log-contrast regression model increases linearly with the sample size n. Hence, a finite sample analysis is desirable to understand the effect of the number of samples n, number of predictors p, and linear constraints k on the model prediction error. For simplicity, we perform the analysis of the robust model (3.2) with only compositional covariates Z, i.e, where β * ∈ R p is the true coefficient, γ * ∈ R n is the true mean shift, and is the IID sub-Gaussian error with mean zero and variance σ 2 . T * = J (γ * ) and S * = J (β * ) denote the support index sets of γ * and β * such that |T * | = t * and |S * | = s * , respectively.
From the general RobRegCC model formulation (3.3), the optimization problem for the reduced model (5.1) is given by In order to focus on the core issue, we consider dropping the quadratic component of the penalty functions (equivalent to setting α = 1 as defined in Table 1) with separate tuning parameters {λ 1 , λ 2 } for penalizing {γ, β}, respectively. Hereafter, for the ease of notation, we drop the subscript and denote the optimal solution of (5.2) by { β, γ}.
We define the model prediction error as M( β−β * , γ−γ * ) where M(a, b) = Za + b 2 2 . Our analysis upper-bounds the prediction error in terms of the model bias and variance. For the unrestricted predictor matrix Z, Theorem 5.1 provides a slow rate bound on the prediction error regardless of the type of sparsity-inducing penalty functions. Consequently, remark following the theorem states the oracle bound in case of 0 norm penalty (case I). Theorem 5.3 provides the result to attain the required oracle bound in case of 1 penalty under a compatibility condition on Z, also referred as fast rate bound. Moreover, under some additional regularity assumptions, the finite sample analysis of the prediction error can be extended i) to perform the asymptotic analysis; ii) to obtain the estimation error bound in various norms; and iii) to establish selection consistency of the parameter estimates (Lounici et al., 2011).
The proofs of our theorems rely on and extend prior work, in particular She (2016); She and Chen (2017); She (2017).
Remark Consider P 1 λ 1 (γ) λ 2 1 and P 2 λ 2 (β) λ 2 2 . In case of β = β * and γ = γ * , it follows from Theorem 5.1 that The oracle bound suggests that, with moderate number of outlier, the dependence of variance on t * allows the parameter estimate to reduce model bias.
Corollary 5.2. Let us assume 0 log 0 = 0. Following the proofs of Theorem 5.1 and Lemma 1.1 in the supplementary material, we have where t is cardinality of J (γ).
The parameter estimate obtained by solving the optimization problem (5.2) with 1 penalty attains the required oracle bound under the following compatibility conditions: for any suitable dimension γ , β β , and the projection matrix P ζ mapping the column space of ζ ⊆ Z. Here, parameters κ 1 , κ 2 and ν are positive compatibility constants .
6. Simulation benchmarks. We evaluate the performance of the family of robust log-contrast regression estimators under a wide range of simulation scenarios. For simplicity, we focus on the n > p setting and consider only compositional covariates as predictors. We consider the robust log-contrast regression estimators for all penalty functions I-III. We closely follow the simulation setting, outlined in Shi et al. (2016), and extend it to the robust setting. We primarily focus on the correct identification of outliers in data sets.
6.1. Benchmark setup. We generate count data W = [w ij ] ∈ R n×p by simulating n instances of the multivariate random variable w ∼ Lognormal(µ, Σ) with mean µ = {log(p/2) 1 5 , 0 p−5 } ∈ R p and covariance matrix Σ ∈ R p×p such that Σ ij = 0.5 |i−j| . Relative abundances X = [x ij ] ∈ R n×p are derived from W via total sum normalization x ij = w ij / p j=1 w ij , and the log-transformed covariates Z = [ z ij ] are given by z ij = log(x ij ).
To evaluate the ability of the robust log-contrast estimators to detect outliers in the response we consider the following scenarios. We generate mean shift vectors γ with O = {20%, 15%, 10%, 5%}n outliers. Outliers are either considered (i) moderate, i.e., generated by adding a shift of 6σ to the corresponding true response y, or (ii) high, i.e, generated by adding a shift of 8σ to the corresponding true response y. We also consider the challenging setting where half of the O outliers are leveraged. The leveraged instances in Z are obtained by modifying the entries in the count data W. The first O/2 instances of W are replaced with leveraged observations. A leveraged observation comprises a feature (taxon) in each subgroup that is inflated to a large value while the remaining taxa in the subgroup are deflated to small values. For each subgroup identified by C, we begin by identifying the corresponding column subset matrix of W, then arrange its first column in descending order after adding the constant 4, and then append the remaining columns in ascending order. The first O/2 instances of the rearranged matrix are the leveraged observations. 6.2. Performance metrics. We evaluate the performance of the estimators in terms of their ability to recover outliers and inliers. Here, false positives (FP) are inlier observations marked as outlier and false negatives (FN) are outlier observations, misidentified as inliers. In the robust statistics literature, these misidentifications are referred to as swapping and masking, respectively. The Hamming (HM) distance between true and identified outliers measures total misclassifications and is defined as HM = FN+FP. Across the different simulation scenarios, we report both the best possible performance of the estimators when the tuning parameter λ is chosen by an oracle, and the actual performance when λ is chosen by our robust cross-validation procedure. We refer to the former as achievability of the estimator (Wainwright, 2009) which is instrumental in identifying fundamental limitations of estimators. Performance in terms of model estimation error is evaluated using err( β) = β * − β . Each simulation scenario is repeated R = 100 times, and average and standard error of the different performance metrics are reported for all estimators.
6.3. Simulation Results. We first report achievability results for the robust log-contrast estimators. We evaluate the robust log-contrast estimators with penalty functions I-III and denote the variants in the Figures by the labels {H, S, A}, representing the hard-ridge, the Elastic Net, and the adaptive Elastic Net penalty, respectively. Figure 1 reports achievability results for a p = 40-dimensional test case with a non-sparse coefficient vector. We report the probability of misclassification in terms of Hamming distance. The sample size is n = {150, 200, 300, 500}, the number of outlier O = {5%, 10%, 15%, 20%} with standard and leveraged outlier at two different noise levels. We observe that both the adaptive Elastic Net and the hard-ridge penalty achieve near-optimal identification of outliers when n > 200 across all scenarios. The standard Elastic Net penalty performs well only for standard outliers and when the number of outliers are below 15%. Its performance drastically deteriorates for leveraged outliers even in the large sample regime. Near-identical behavior is observed for the probability of false negative identification (Supplementary material, Figure S2). We next probe the ability of the Algorithm 1 to jointly identify model parameters and outliers. We also investigate the influence of sparsity in the regression vector on proper identification of outliers when using Algorithms 1 and 2, respectively. We use the sparse β * vector (with seven non-zero entries) in the generative model at  Figure 2 shows the performance of the different estimators in terms of probability of false negative identification as the number of outliers increases. We observe several noteworthy effects in this simulation scenario. First, the inclusion of the sparsity-promoting penalty for the regression vector uniformly improves the performance of all estimators. Both the estimators with hard-ridge and the adaptive Elastic net penalty are able to recover outliers across all settings. The standard Elastic net penalty is unable to recover all outliers when the number of outliers > 10. Second, we observe that the robust estimator with Elastic Net penalty is less sensitive to the sparsity level in β * in the setting without leveraged observations and without sparsity-promoting penalty, as compared to the estimators with hard-ridge and adaptive penalty (see the first two left panels in Figure 2). We now evaluate the full robust log-contrast regression framework with data-driven selection of the tuning parameter λ, using the novel 5-fold robust cross-validation strategy. We consider the generative model with sparse β * and measure outlier detection in terms of Hamming distance(HM). We consider the settings p = {80, 150}, n = 300, and vary the number of outlier O = {15, 30, 45, 60} under the usual settings of leverage and shift.
Across all settings, we observe excellent performance of the estimators with hard-ridge and adaptive penalty, with the hard-ridge having slightly superior performance. The data-driven model selection scheme shows excellent ability to recover near-optimal tuning parameters and is particularly effective for large number of outliers. The estimator with Elastic Net penalty is outperformed by the other estimators across all settings. For completeness, we also report performance results for outlier identifications when the model parameter β * is not sparse (see Supplementary material Figure S1). The results are similar to the one for the sparse setting, showing the superior performance of the estimators with hard-ridge and the adaptive penalty.
7. Application to gut microbiome data. A key objective in hostassociated microbiome research is to find robust and reproducible relationships between the microbial species inhabiting the host, host life style, host genotype, and host phenotype. In the gut microbiome context, such links have been established between host diet and changes in gut microbial compositions (Wu et al., 2011;David et al., 2014). However, it has been proven to be particularly difficult to find gut microbial signatures that are predictive of host phenotype, including body mass index (BMI). For instance, using the COMBO dataset from Wu et al. (2011) with n ≤ 98 participants, Lin et al. (2014) and Shi et al. (2016) identified a small set of microbial genera that are only moderately predictive of host BMI (R 2 < 0.25) with a non-robust sparse log-contrast models. We considerably extend these efforts by predicting BMI from microbial abundance data and non-compositional covariate data from a large publicly available dataset, the American Gut Project (AGP) data (McDonald et al., 2018), using RobRegCC. This appli-q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q cation can also serve as a general template for prediction tasks that involve compositional and non-compositional explanatory covariates. The entire AGP data comprises samples from more than 11, 000 human participants and > 40, 000 unique 16S rRNA fragments. We consider here a filtered gut microbial subset of the data with n = 1358 samples comprising 8120 OTUs. Using the taxonomic information associated with the OTUs we aggregate the data at the genus level resulting in p = 192 microbial genus compositions. We impose subcompositional coherence on the order level, resulting in k = 19 subcompositions. In addition, we also consider m = 42 host-related covariates, including different age categories, sex, milk and cheese frequency, red meat frequency, sugar-sweetened drink frequency, fruit frequency, vegetable frequency, and whole grain frequency in the analysis. The overall RobRegCC model (3.2) thus comprises the relative abundance of OTUs as compositional covariates in Z ∈ R n×p , the host associated features as non-compositional covariates in N ∈ R n×m , and the subcomposition constraint matrix is C ∈ R (p+m)×k . Several subcompositions only comprise a single genus and are thus treated as non-compositional covariates. The task is to predict n = 1358 mean-variance standardized BMI measurements y of the participants. 7.1. Predicting BMI with RobRegCC. We use Algorithm 1 to estimate the parameters of the RobRegCC model with penalty functions I-III and select the regularization parameter via our robust CV procedure. For comparison, we also consider the standard non-robust (NR) log-contrast model without mean shift.
On the full data set, the RobRegCC model identifies 21%, 40%, and 25% of the observations as outliers when using the hard-ridge (H), the Elastic Net (S), and the adaptive Elastic Net (A) penalty, respectively. This reveals that the convex model (S) discards considerably more measurements, thus reducing the efficiency of the estimator. Irrespective of the penalty function, RobRegCC deems many of the larger BMI values (in the overweight to obese range) as outliers with respect to the model in (3.2). The overall R 2 between observed y and predictedŷ BMI values is 0.35 for the non-robust model. Using the RobRegCC model, the R 2 on the inlier data is considerably improved to 0.71 (H), 0.87 (S), and 0.72 (A). Scatter plots between y and y for all models are shown in Figure S6 in the Supplementary Material.
We also evaluated out-of-sample predictive performance as follows. We replicated the model fitting procedure by considering 200 random splits of the data into 80% (n tr = 1086) training data and 20% (n te = 272) out-ofsample test data. The training data was used for model fitting and model selection. Suppose θ λcv and γ λcv be the parameter estimates on the training data. Then, we follow Section 4.3 to compute the model evaluation statistics λcv using s tr,λcv , te,λcv , and r te,λcv for the test data. Table 2 reports mean and standard deviation (in parenthesis) of the scaled test error te,λcv , average sample size n te,r after outlier removal in the test data, and the percentage of outliers O tr % identified in the training phase for the RobRegCC models and the non-robust model, respectively. We observe that all Ro-bRegCC models lead to excellent test errors as compared to the non-robust model, and that the hard-ridge removes the fewest number of outliers, followed by adaptive and standard Elastic Net. All models deem about 75% of the test data as inliers (n te,r in the range 204 to 214 out of 278). We next analyze the sparse set of compositional and non-compositional BMI predictors on the full AGP data. Figure 4 reports the microbial (left panel) and host-related covariates (right panel) identified by the robust and non-robust regression models. Note that, as the RobRegCC models mainly identify people with higher BMI as outliers, the robust predictors largely reflect the population with lean to normal BMI whereas the non-robust predictors reflect the entire population. We make several key observations. Firstly,  the different methods select a very diverse set of features. As expected, the non-robust and RobRegCC with Elastic-Net (S) tend to select similar features whereas RobRegCC with adaptive and hard-ridge penalty show more consistent joint selection. The non-robust model selects the fewest microbial predictors (33), followed RobRegCC with the hard-ridge (42), adaptive (40), and standard Elastic-Net (66) penalty. Few features are selected by all methods, including five age-related covariates and three microbial features. The effect size of the host-related covariates are considerably larger than the microbial features. In particular, age-related covariates are large, with "age-child" and "age-unspecified" being strongly negatively associated with BMI, as well as the feature "vegetable-never". All methods predict the genus Alistipes composition to be positively related to BMI which is in strong contrast to results on the COMBO data (Lin et al., 2014;Shi et al., 2016;Combettes and Müller, 2019). The Clostridium genus composition is negatively associated with BMI in all models. Overall, we do not see sign inconsistency across the methods, i.e., none of the methods predict any of the compositions to be both negatively and positively associated. Compared to the non-robust procedure, our robust procedures identify several genera that have been previously shown to be related to BMI, including {TM7-3, Coriobacteriales} (Clarke et al., 2012), YS2 (Zeng et al., 2015), Rikenellaceae , and {Actinomyces, Odoribacter} (Goffredo et al., 2016). In contrast to RobRegCC (H,A), the non-robust procedure and RobRegCC(S) detect the genus Acidaminococcus as potential positive predictor of BMI, consistent with prior analysis (Lin et al., 2014;Shi et al., 2016;Combettes and Müller, 2019). In Yun et al. (2017), Acidaminococcus is observed to be positively associated with overweight/obese people but disappears with fiber intake. As our robust analysis treats samples with higher BMI largely as outliers, Acidaminococcus is thus not detected as predictive.

Non-compositional Compositional
A similar argument holds for the experimentally observed negative association of Akkermansia (Louis et al., 2016) which is detected in the non-robust analysis but not with RobRegCC (H, A).

7.2.
Identifying a sparse log-ratio model from RobRegCC. As the microbial features are compositional in nature, an interpretation of individual regression coefficients is misleading and should be replaced by log-ratios of compositions, i.e., the outcome y should be modeled as linear combination of log-ratios of compositions. However, the solution of the RobRegCC model (as well as the standard log-contrast model) leads to arbitrarily many equivalent log-ratio models. To achieve a unique yet parsimonious log-ratio solution, Bates and Tibshirani (2018) proposed a two-stage estimator that i) identifies a small set of predictors among all p predictors using a sparse logcontrast model and ii) forms all log-ratios from the set of selected predictors and solves an unconstrained lasso with the log-ratios as predictors. We here extend this idea to RobRegCC (with hard-ridge and adaptive penalty) with subcompositional constraints. Firstly, for each of the k subcompositional groups, we denote the subset of predictors identified by RobRegCC by the index set A [1] , . . . , A [k] and form all possible log-ratios of these predictors restricted to each group. Secondly, as the RobRegCC analysis also identifies a "clean" set of samples denoted by [I], we propose a log-ratio model restricted to subcompositions and "clean" samples of the form ij ]'s and a are the model parameters. For this model, we apply the standard lasso with 5−fold cross-validation and the "one standard error" rule (the default setting in the glmnet R package (Friedman, Hastie and Tibshirani, 2010)) to obtain a sparse set of log-ratios. Figure 5 shows the identified set of log-ratios from RobRegCC with hard-ridge (leading to 21 log-ratio terms) and the adaptive penalty (leading to 17 log-ratio terms), respectively. The models agree on seven log-ratios, including a negative Odoribactor to Alistipes ratio, a negative Megasphaera to Ph2 ratio, and a positive Collinsella to Eggerthella ratio. The sparse log-ratios identified here can serve as direct interpretable hypotheses for what may constitute a standard (healthy) gut microbiome for a population fed on a Western diet with lean to normal BMI.

Burkholderiales Clostridiales
Coriobacteriales Erysipelotrichales Actinomycetales Figure 5. Stacked bar plot of the coefficients identified by the log-ratio model on the clean data with subset of predictors identified by the RobRegCC model (with hard-ridge (H) and adaptive penalty (A)), restricted to the respective subcompositions on the order level.
8. Discussion and conclusion. In this contribution, we have developed RobRegCC, a robust log-contrast regression framework that allows simultaneous outlier identification and sparse model coefficients for regression problems with compositional and non-compositional covariates. The approach combines the idea of mean shift estimation in linear regression with robust initialization and penalization for linear log-contrast regression (Aitchison and Bacon-Shone, 1984). We have tackled the resulting overspecified model parameter estimation problem via a regularization approach with suitable sparsity-inducing penalty functions. We have explored the performance of RobRegCC with the hard-ridge, the Elastic Net, and the adaptive Elastic Net penalty function. The latter two cases lead to a convex optimization problem. While the estimation approach with the Elastic Net penalty lacks the ability to handle masking and swapping effect (She and Owen, 2011), the adaptive Elastic Net penalty alleviates this problem but re-quires an initial robust estimate of the parameters to construct appropriate weights. RobRegCC with the hard-ridge penalty function leads to a nonconvex optimization problem, thus also requiring a suitable initial estimator of the parameters to converge to a good locally optimal solutions. To achieve good initial estimate, we have brought the idea of principal sensitivity component analysis (Peña and Yohai, 1999) from robust statistics to the realm of robust log-contrast regression. Finally, we have shown how to estimate interpretable log-ratio models for compositional data from robust log-contrast models, extending recent ideas by Bates and Tibshirani (2018). We have implemented a generic Lagrangian-based optimization procedure that uses Iterative Shrinkage/Thresholding algorithms (ISTA) as key ingredient that can handle all introduced penalty functions. All methods are implemented and freely available in the RobRegCC R package. We have shown, on simulated and real compositional microbiome data, the validity and generality of our approach. Our modeling and computational efforts are also accompanied by novel theoretical results that give prediction error bounds for the RobRegCC estimators in the finite sample setting.
On the algorithmic side, future efforts will include exploring and implementing other computationally efficient optimization strategies, including recent path-based algorithms for log-contrast regression (Gaines, Kim and Zhou, 2018). On the theoretical side, we will analyze the variable selection properties of the RobRegCC model estimators. A natural extension of our modeling framework is robust logistic regression when responses are given as indicators rather than continuous variables. In summary, we believe that the RobRegCC estimators presented in this contribution provide a useful tool for statisticians and data analysts that are faced with regression tasks in the presence of compositional covariates.
Supplementary material.

Details about the robust initialization.
Here, we discuss the principal sensitive component (PSC) based analysis for the sparse log-contrast model (S-LCM) (Shi et al., 2016). Computing the sensitivity R (4.6) for the least square estimator is trivial (Peña and Yohai, 1999) as it avoids the separate model fitting to obtainŷ i 's andŷ i(j) 's. Interestingly, the log-contrast model (LCM) follows the linear model. But the same is not true for the S-LCM, hence, computing R is nontrivial. To overcome the challenge, we identify the support of the S-LCM coefficient estimate, and compute R for the subsequent LCM (non-sparse). Please refer Algorithm 3 for the formula of calculating R.
For the analysis, consider setting the parameter τ ∈ (0, 0.5) and obtain m = nτ . Suppose [u 1 , . . . , u q ] denote the principal components of R. Peña and Yohai (1999) characterized the extreme observations in terms of the value of entries in a principal component of R. Following PENSE, for each u i , we generate three candidate subsamples by removing m observations corresponding to the: I) largest u i ; II) smallest u i ; III) largest |u i |. Including the one with all observations, the protocol results in total 3 * q + 1 candidate samples. For each candidate sample, we estimate the coefficient of S-LCM using the default procedure specified in Shi et al. (2016). Now, using the coefficient estimate, we evaluate the candidate samples in terms of the Mestimator of scale (Rousseeuw and Hubert, 2011) of the residuals obtained on full sample. Suppose the chosen candidate sample attains the minimum scale value s 1 . A potentially "clean subsample" is then obtained after discarding observations with the residuals magnitude (on the full data) greater than some C s 1 . See Cohen Freue et al. (2017) for the choice of C .
The analysis may not detect the low-leveraged outliers. To solve, Peña and Yohai (1999) suggested to iterate the process several times or until convergence. Final S-LCM coefficient estimate on the clean subsample, and the residuals on full data, are used as initial estimator of the RobRegCC model, i.e.,δ = [β Tγ T ] T . We have summarized the initialization procedure in the Algorithm 3. 1.3. Non-asymptotic analysis proofs.
On applying the above inequality results in the union bound (1.5), we prove that P sup q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 1.5. American gut data analysis. We present the additional plots from the analysis of AGP data.
1.5.1. Model fitting RobRegCC. Model selection and outliers detection using RobRegCC.  Figure S7. Model diagnostic after removing outliers (green) obtained from the robust analysis (Algorithm 1) in (case I and III). Fitted vs observed BMI using log-ratio model on the subset of predictors and outlier removed observations. CORR(y,ŷ) with {A, H} = {0.87, 0.84}.