HB-PLS: An algorithm for identifying biological process or pathway regulators by integrating Huber loss and Berhu penalty with partial least squares regression

Gene expression data features high dimensionality, multicollinearity, and the presence of outliers or non-Gaussian noise, which make the identification of the true regulatory genes controlling a biological process or pathway difficult. In this study, we embedded Huber-Berhu (HB) regression into the partial least squares (PLS) framework and created a new method, called HB-PLS, for predicting biological process or pathway regulators through the construction of regulatory networks. PLS is an alternative to ordinary least squares (OLS) for handling multicollinearity in high dimensional data. The Huber loss is more robust to outliers than the square loss, and the Berhu penalty achieves a better balance between the ℓ2 penalty and the ℓ1 penalty. HB-PLS therefore inherits the advantages of the Huber loss, the Berhu penalty, and PLS. To solve the Huber-Berhu regression, a fast proximal gradient descent method was developed; the HB regression runs much faster than CVX, a Matlab-based modeling system for convex optimization. Application of HB-PLS to real transcriptomic data from Arabidopsis and maize led to the identification of many pathway regulators that had previously been identified experimentally. In terms of its efficiency in identifying positive biological process or pathway regulators, HB-PLS is comparable to sparse partial least squares (SPLS), a very efficient method developed for variable selection and dimension reduction in handling multicollinearity in high dimensional genomic data. However, HB-PLS is able to identify some distinct regulators and, in one case, identify more positive regulators at the top of the output list, which can reduce the burden of experimentally testing the identified candidate targets. Our study suggests that HB-PLS is instrumental for identifying biological process and pathway regulators.


Introduction
In a gene regulatory network (GRN), a node corresponds to a gene and an edge represents a directional regulatory relationship between a transcription factor (TF) and a target gene. Understanding the regulatory relationships among genes in GRNs can help elucidate the various biological processes and underlying mechanisms in a variety of organisms. Although experiments can be conducted to acquire evidence of gene regulatory interactions, these are labor-intensive and time-consuming. In the past two decades, the advent of high-throughput techniques, including microarray and RNA-Seq, has generated an enormous wealth of transcriptomic data. As the data in public repositories grow exponentially, computational algorithms and tools utilizing gene expression data offer a more time- and cost-effective way to reconstruct GRNs. To this end, efficient mathematical and statistical methods are needed to infer qualitative and quantitative relationships between genes.

Many methods have been developed to reconstruct GRNs, each employing different theories and principles. The earliest methods involved differential equations [1], Boolean networks [2], stochastic networks [3], Bayesian [4,5] or dynamic Bayesian networks (BN) [6,7], and ordinary differential equations (ODE) [8]. Some of these methods require time series datasets with short time intervals, such as those generated from easily manipulated single-cell organisms (e.g., bacteria, yeast) or mammalian cell lines [9]. For this reason, most of these methods are not suitable for gene expression data, especially time series data involving time intervals on the scale of days, from multicellular organisms like plants and mammals (except cell lines).
In general, the methods that are useful for building gene networks with non-time series data generated from higher plants and mammals include ParCorA [10], GGM [11], and mutual information-based methods such as Relevance Network (RN) [12], Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) [13], C3NET [14], maximum relevance/minimum redundancy Network (MRNET) [15], and random forests [16,17]. Most of these methods use the information-theoretic framework. For instance, Relevance Network (RN) [18], one of the earliest methods, infers a network in which a pair of genes are linked by an edge if the mutual information is larger than a given threshold. The context likelihood relatedness (CLR) algorithm [19], an extension of RN, derives a score from the empirical distribution of the mutual information for each pair of genes and eliminates edges with scores that are not statistically significant.

Regression-based methods model the expression of a target gene as a linear function of the expression of TF genes:

$$y_i = \beta_0^* + \sum_{j=1}^{p} x_{ij}\beta_j^* + \varepsilon_i, \quad i = 1, \dots, n,$$

where $n$ is the number of samples, $x_i = (x_{i1}, \dots, x_{ip})^T$ is the expression level of the TF genes, and $y_i$ is the expression level of the target gene in sample $i$. $\beta_0^*$ is the intercept and $\beta^* = (\beta_1^*, \dots, \beta_p^*)^T$ are the associated regression coefficients; if $\beta_j^* \neq 0$, then TF gene $j$ regulates the target gene. $\{\varepsilon_i\}$ are independent and identically distributed random errors with mean 0 and variance $\sigma^2$. The method to get an approximation $\hat{\beta}$ for $\beta^*$ is to transform this statistical problem into a convex optimization problem, e.g., the LASSO:

$$\hat{\beta} = \arg\min_{\beta} \frac{1}{2}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^T\beta\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|.$$

LASSO, however, has two limitations: when predictors are highly correlated it tends to select only one of them, and there are cases where a given $\lambda$ that leads to the optimal estimation rate ends up with an inconsistent selection of variables. For the first limitation, Zou and Hastie [29] proposed the elastic net, in which the penalty is a mixture of the LASSO and ridge penalties: $\lambda \sum_{j=1}^{p}\left(\alpha|\beta_j| + (1-\alpha)\beta_j^2\right)$, where $\alpha$ ($0 < \alpha < 1$) is called the elastic net mixing parameter. When $\alpha = 1$, the elastic net penalty becomes the LASSO penalty; when $\alpha = 0$, the elastic net penalty becomes the ridge penalty.
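The limiting behavior of the elastic net mixing parameter can be checked numerically; the following is a minimal numpy sketch for illustration (the function name `elastic_net_penalty` is ours, not from the paper):

```python
import numpy as np

def elastic_net_penalty(beta, lam, alpha):
    # Elastic net penalty: lam * sum_j(alpha * |b_j| + (1 - alpha) * b_j**2).
    beta = np.asarray(beta, dtype=float)
    return lam * np.sum(alpha * np.abs(beta) + (1.0 - alpha) * beta ** 2)

beta = [0.5, -2.0, 0.0]
lasso_like = elastic_net_penalty(beta, 0.1, 1.0)  # alpha = 1: pure L1 penalty
ridge_like = elastic_net_penalty(beta, 0.1, 0.0)  # alpha = 0: pure L2 penalty
```

Intermediate values of `alpha` interpolate between the two penalties, which is what lets the elastic net keep groups of correlated predictors that the pure L1 penalty would thin out.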
For the second limitation, adaptive LASSO [28] was proposed as a regularization method that enjoys the oracle properties. The penalty function for adaptive LASSO is $\lambda \sum_{j=1}^{p} \hat{w}_j |\beta_j|$, where the adaptive weight $\hat{w}_j = 1 / |\hat{\beta}_{init,j}|^{\gamma}$, and $\hat{\beta}_{init}$ is an initial estimate of the coefficients obtained through ridge regression or LASSO; $\gamma$ is a positive constant, and is usually set to 1. It is evident that adaptive LASSO penalizes more heavily those coefficients with lower initial estimates.

It is well known that the square loss function is sensitive to heavy-tailed errors or outliers. Therefore, adaptive LASSO may fail to produce reliable estimates for datasets with heavy-tailed errors or outliers, which commonly appear in gene expression datasets. One possible remedy is to remove influential observations from the data before fitting a model, but it is difficult to differentiate true outliers from normal data. The other remedy is to use robust regression. Previous studies [33,34] combined a robust loss with the Berhu penalty and demonstrated that this procedure encourages a grouping effect. In [33,34], the authors solved a Huber-Berhu optimization problem using CVX software [35], a Matlab-based modeling system for convex optimization. CVX turns Matlab into a modeling language, allowing constraints and objectives to be specified using standard Matlab expression syntax. However, since CVX is slow for large datasets, a proximal gradient descent algorithm was developed for the Huber-Berhu regression in this study, which runs much faster than CVX.

Reconstruction of gene regulatory networks often involves ill-posed problems due to high dimensionality and multicollinearity. Partial least squares (PLS) regression has been an alternative to ordinary regression for handling multicollinearity in several areas of scientific research. PLS couples a dimension reduction technique and a regression model.
Although PLS has been shown to have good predictive performance in dealing with ill-posed problems, it is not particularly tailored for variable selection. Chun et al. [36] first proposed an SPLS regression for simultaneous dimension reduction and variable selection. Cao et al. [37] also proposed a sparse PLS method for variable selection when integrating omics data. They added sparsity into PLS with a LASSO penalization combined with singular value decomposition (SVD) computation. In this study, the Huber-Berhu regression was embedded into a PLS framework. Real gene expression data were used to demonstrate that this approach is applicable for the reconstruction of GRNs.

Materials and Methods
High-throughput gene expression data

The lignin pathway analysis used an Arabidopsis wood formation compendium dataset containing 128 Affymetrix microarrays pooled from six experiments (accession identifiers: GSE607, GSE6153, GSE18985, GSE2000, GSE24781, and GSE5633 in the NCBI Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/)). These datasets were originally obtained from hypocotyledonous stems under short-day conditions known to induce secondary wood formation [38]. The original CEL files were downloaded from GEO, preprocessed using the affy package in Bioconductor (https://www.bioconductor.org/), and then normalized with the robust multi-array analysis (RMA) algorithm in the affy package. This compendium data set was also used in our previous studies [39,40]. The maize B73 compendium data set used for predicting photosynthesis light reaction (PLR) pathway regulators was downloaded from three NCBI databases: (1)

In estimating regression coefficients, the square loss function is well suited if $\varepsilon_i$ follows a Gaussian distribution, but it gives a poor performance when $\varepsilon_i$ follows a heavy-tailed distribution or there are outliers. On the other hand, the LAD loss function is more robust to outliers, but its statistical efficiency is low when there are no outliers in the data. The Huber function, introduced in [43], is a combination of linear and quadratic loss functions. For any given positive real $\delta$ (called the shape parameter), the Huber function is defined as:

$$H_\delta(r) = \begin{cases} \dfrac{r^2}{2}, & |r| \le \delta, \\[4pt] \delta |r| - \dfrac{\delta^2}{2}, & |r| > \delta. \end{cases}$$

This function is quadratic for small $r$ but grows linearly for large values of $r$. The parameter $\delta$ determines where the transition from quadratic to linear takes place (see Figure 1, top left). In this study, the default value of $\delta$ was set to be one tenth of the interquartile range (IQR), as suggested by [44].
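For concreteness, the Huber loss and the IQR-based default for the shape parameter can be sketched in a few lines of numpy (function names are ours; this is an illustration, not the paper's implementation):

```python
import numpy as np

def huber(r, delta):
    # Quadratic for |r| <= delta, linear (slope delta) beyond the transition.
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * np.abs(r) - 0.5 * delta ** 2)

def default_delta(residuals):
    # Default shape parameter: one tenth of the interquartile range (IQR).
    q1, q3 = np.percentile(residuals, [25, 75])
    return (q3 - q1) / 10.0
```

The two branches agree at |r| = delta (both give delta**2 / 2), so the loss and its first derivative are continuous across the transition.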
The Huber function is a smooth function with derivative:

$$H'_\delta(r) = \begin{cases} r, & |r| \le \delta, \\ \delta \cdot \mathrm{sign}(r), & |r| > \delta. \end{cases}$$

Ridge regression uses a quadratic penalty on the regression coefficients, which is equivalent to putting a Gaussian prior on the coefficients. LASSO uses a linear penalty on the regression coefficients, which is equivalent to putting a Laplace prior on the coefficients. The advantage of LASSO over ridge regression is that it implements regularization and variable selection simultaneously. The disadvantage is that, if a group of predictors is highly correlated, LASSO picks only one of them and shrinks the others to zero. In this case, the prediction performance of ridge regression dominates that of LASSO. The Berhu function, introduced in [33], is a hybrid of these two penalties. It gives a quadratic penalty to large coefficients while giving a linear penalty to small coefficients, as shown in Figure 1 (top right). The Berhu function is defined as:

$$B_\delta(\beta) = \begin{cases} |\beta|, & |\beta| \le \delta, \\[4pt] \dfrac{\beta^2 + \delta^2}{2\delta}, & |\beta| > \delta. \end{cases}$$

The shape parameter $\delta$ was set to be the same as that in the Huber function. As shown in Figure 1 (top right), the Berhu function is convex, but it is not differentiable at $\beta = 0$. Figure 1 (bottom) also shows the 2D contours of the Huber and Berhu functions. Combining the Huber loss function and the Berhu penalty yields the objective function, referred to as Huber-Berhu:

$$\min_{\beta} \sum_{i=1}^{n} H_\delta\left(y_i - x_i^T\beta\right) + \lambda \sum_{j=1}^{p} B_\delta(\beta_j).$$

Next, $X$ was deflated as $X = X - \xi c^T$ and $Y$ was deflated as $Y = Y - \xi d^T$, and this process was continued until enough components were extracted.

As mentioned above, the Huber function is more robust to outliers and has higher statistical efficiency than the LAD loss, and the Berhu penalty achieves a better balance between the ℓ1 and ℓ2 penalties. The Huber loss and the Berhu penalty were therefore adopted to extract each component for PLS. With $M = X^T Y$, the optimization problem becomes:

$$\min_{u,v} \sum_{i}\sum_{j} H_\delta\left(M_{ij} - u_i v_j\right) + \lambda \left(\sum_{i} B_\delta(u_i) + \sum_{j} B_\delta(v_j)\right). \tag{13}$$

The objective function in (13) is not convex in $u$ and $v$ jointly, but it is convex in $u$ when $v$ is fixed and convex in $v$ when $u$ is fixed.
For example, when $v$ is fixed, each $u_i$ can be solved in parallel by:

$$u_i = \arg\min_{u_i} \sum_{j} H_\delta\left(M_{ij} - u_i v_j\right) + \lambda B_\delta(u_i). \tag{14}$$

Similarly, when $u$ is fixed, each $v_j$ can be computed in parallel by:

$$v_j = \arg\min_{v_j} \sum_{i} H_\delta\left(M_{ij} - u_i v_j\right) + \lambda B_\delta(v_j). \tag{15}$$

Equations (14) and (15) can be solved using Algorithm 1. Therefore, (13) can be solved iteratively by updating $u$ and $v$ alternately. Note that it is not cost-efficient to spend a lot of effort optimizing over $v$ in line 6 before a good estimate for $u$ is computed. Since Algorithm 1 is an iterative algorithm, it may make sense to stop the optimization over $v$ early before updating $u$. In the implementation, one step of proximal mapping was used to update $u$ and $v$.
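The alternating one-step proximal updates can be sketched in numpy. This is an illustrative sketch, not the paper's Algorithm 1: `huber_grad` is the Huber derivative given earlier, `berhu_prox` is a closed-form proximal operator of the Berhu penalty obtained by minimizing over its two branches (soft-thresholding where the penalty is linear, ridge-like shrinkage where it is quadratic), and the assembly into a gradient-plus-prox step on a generic regression objective is our own simplification:

```python
import numpy as np

def huber_grad(r, delta):
    # Derivative of the Huber loss: r for |r| <= delta, delta * sign(r) beyond.
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

def berhu_prox(x, t, delta):
    # prox of t * Berhu_delta: soft-threshold when the minimizer lands in the
    # linear region (|x| <= delta + t), otherwise shrink by delta / (delta + t).
    x = np.asarray(x, dtype=float)
    soft = np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
    ridge = x * delta / (delta + t)
    return np.where(np.abs(x) <= delta + t, soft, ridge)

def prox_gradient_step(beta, X, y, lam, delta, step):
    # One proximal gradient step on: sum_i Huber(y_i - x_i'beta) + lam * Berhu(beta).
    grad = -X.T @ huber_grad(y - X @ beta, delta)
    return berhu_prox(beta - step * grad, step * lam, delta)
```

The two prox branches meet continuously at |x| = delta + t, where both return sign(x) * delta, mirroring the continuity of the Berhu penalty itself at |beta| = delta.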

Tuning criteria and choice of the PLS dimension
The Huber-Berhu PLS has two tuning parameters, namely, the penalization parameter $\lambda$ and the number of hidden components $K$. To select the best penalization parameter $\lambda$, a common k-fold cross-validation (CV) procedure that minimizes the overall prediction error is applied over a grid of possible values. If the sample size is too small, CV can be replaced by leave-one-out validation; this procedure is also used in [36,49] for tuning penalization parameters.
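The grid-search CV described above can be sketched as follows; `fit` and `predict` are placeholders standing in for the HB-PLS fitting and prediction routines, and all names here are ours, for illustration:

```python
import numpy as np

def cv_select_lambda(X, y, lam_grid, fit, predict, k=5, seed=0):
    # k-fold CV: pick the lambda minimizing mean squared prediction error.
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    cv_err = []
    for lam in lam_grid:
        fold_err = []
        for fold in folds:
            train = np.setdiff1d(idx, fold)
            model = fit(X[train], y[train], lam)
            pred = predict(model, X[fold])
            fold_err.append(np.mean((y[fold] - pred) ** 2))
        cv_err.append(np.mean(fold_err))
    return lam_grid[int(np.argmin(cv_err))]
```

With k set to the sample size, the same loop reduces to leave-one-out validation.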

To choose the dimension of PLS, the $Q^2$ criterion was adopted. $Q^2$ criteria were first proposed by Tenenhaus [50]; these criteria characterize the predictive power of the PLS model by performing cross-validation computation. $Q^2$ for the $h$-th component is defined as:

$$Q_h^2 = 1 - \frac{\mathrm{PRESS}_h}{\mathrm{RSS}_{h-1}},$$

where $\mathrm{PRESS}_h$ is the cross-validated predicted residual sum of squares of the model with $h$ components and $\mathrm{RSS}_{h-1}$ is the residual sum of squares of the model with $h-1$ components.
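Under the Tenenhaus definition, Q2 for the h-th component compares the cross-validated PRESS of the h-component model to the residual sum of squares of the (h-1)-component model; a minimal sketch follows (the function name is ours). A component is commonly retained when its Q2 exceeds a small positive threshold (0.0975 is a conventional default in PLS software):

```python
def q2_criterion(press_h, rss_prev):
    # Q2_h = 1 - PRESS_h / RSS_{h-1}: positive values mean the h-th
    # component improves cross-validated prediction over the h-1 model.
    return 1.0 - press_h / rss_prev
```

For example, a component that halves the cross-validated error gives Q2 = 0.5, while one that does not improve prediction gives Q2 <= 0.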

Validation of Huber-Berhu PLS with lignin biosynthesis pathway genes and regulators
The HB-PLS algorithm was examined for its accuracy in identifying lignin pathway regulators using the A. thaliana microarray compendium data set produced from stem tissues [39]. TFs identified by HB-PLS were compared to those identified by SPLS. The 50 most relevant TFs in the lignin biosynthesis pathway were identified using HB-PLS (Figure 3A) and compared to those identified by SPLS (Figure 3B).

The results indicate that the HB-PLS and SPLS methods, in many cases, are much more efficient in recognizing positive regulators of a pathway gene than the PLS method. For most pathway genes, like PAL1, C4H, CCR1, C3H, and COMT1, the HB-PLS method could identify more positive regulators than SPLS when the top cut-off lists contained fewer than 20 regulators. For HCT, CCoAOMT1, CAD8 and F5H, HB-PLS was almost always more efficient than SPLS when the top cut-off lists contained fewer than 40 regulators. For the pathway gene CAD8, SPLS and PLS both failed to identify positive regulators while HB-PLS performed efficiently.

Prediction of photosynthetic pathway regulators using Huber-Berhu PLS
Photosynthesis is mediated by the coordinated action of about 3,000 different proteins, commonly referred to as photosynthesis proteins [56]. In this study, we used genes from the photosynthesis light reaction (PLR) pathway to study which regulatory genes can potentially control photosynthesis. Analysis was performed using HB-PLS, with SPLS as a comparative method. The compendium data set we used is comprised of 63 and 36 RNA-seq data sets from maize leaves and seedlings, respectively. Expression data for 2616 TFs and 30 PLR pathway genes were extracted from the above compendium data set and used for analyses. The results of the HB-PLS and SPLS methods are shown in Figure 5A.

In this study, we developed a PLS regression that incorporates the Huber loss function and the Berhu penalty for identification of pathway regulators using gene expression data. Although the Huber loss function and the Berhu penalty have been proposed in regularized regression models [43,77], this is the first time that both have been used in PLS regression at the same time. The Huber function is a combination of linear and quadratic loss functions. In comparison with other loss functions (e.g., square loss and least absolute deviation (LAD) loss), Huber loss is more robust to outliers and has higher statistical efficiency than the LAD loss function in the absence of outliers. The Berhu function [33] is a hybrid of the ℓ2 penalty and the ℓ1 penalty. It gives a quadratic penalty to large coefficients and a linear penalty to small coefficients. Therefore, the Berhu penalty has the advantages of both the ℓ2 and ℓ1 penalties: smaller coefficients will tend to shrink to zero, while the coefficients of a group of highly correlated predictive variables will not change much if their coefficients are large.
In this regression, the Huber function was used as the loss function and the Berhu function was used as the penalty function. An optimal one-dimensional clustering algorithm was adopted to cluster the regression coefficients, and then the elbow point was used to determine the non-zero variables. The Huber function is more robust in dealing with outliers and non-Gaussian errors, while the Berhu function integrates the advantages of both the ℓ1 and ℓ2 penalties. The grouping effect of the Huber-Berhu regression makes it suitable for modeling transcriptional regulatory relationships. Simulation results showed that the Huber-Berhu regression has better performance in identifying non-zero variables. When modeling the regulatory relationships from TFs to a pathway, HB-PLS is capable of dealing with the high multicollinearity of both TFs and pathway genes. Application of HB-PLS to Arabidopsis and maize data showed that HB-PLS can identify comparable numbers of positive TFs in the two pathways tested. However, there were differences in the pathway regulators identified and their rankings; in particular, positive TFs tended to be present in highly ranked positions in output lists. This is an advantage for selecting candidate regulators for experimental validation. Our results indicate that HB-PLS will be instrumental for identifying novel biological process or pathway regulators from high dimensional gene expression data.