TGPred: efficient methods for predicting target genes of a transcription factor by integrating statistics, machine learning and optimization

Abstract Four statistical selection methods for inferring transcription factor (TF)–target gene (TG) pairs were developed by coupling mean squared error (MSE) or Huber loss function, with elastic net (ENET) or least absolute shrinkage and selection operator (Lasso) penalty. Two methods were also developed for inferring pathway gene regulatory networks (GRNs) by combining Huber or MSE loss function with a network (Net)-based penalty. To solve these regressions, we ameliorated an accelerated proximal gradient descent (APGD) algorithm to optimize parameter selection processes, resulting in an equally effective but much faster algorithm than the commonly used convex optimization solver. The synthetic data generated in a general setting was used to test four TF–TG identification methods, ENET-based methods performed better than Lasso-based methods. Synthetic data generated from two network settings was used to test Huber-Net and MSE-Net, which outperformed all other methods. The TF–TG identification methods were also tested with SND1 and gl3 overexpression transcriptomic data, Huber-ENET and MSE-ENET outperformed all other methods when genome-wide predictions were performed. The TF–TG identification methods fill the gap of lacking a method for genome-wide TG prediction of a TF, and potential for validating ChIP/DAP-seq results, while the two Net-based methods are instrumental for predicting pathway GRNs.


INTRODUCTION
Construction and delineation of transcriptional regulatory networks are essential for a systematic understanding of how various biological processes and complex traits are regulated and how plants grow and de v elop in response to environmental cues.Although biological experiments can be carried out to obtain gene regulatory relationships, they are labor-intensi v e and time-consuming, and can ther efor e only be used to infer a small number of true regulatory relationships.In the last two decades, the advent of high throughput technologies, including microarray and RNA-Seq, has made it easier to generate a terabyte of transcriptome data for inferring gene r egulatory r elationships and gen regulatory networks (GRNs).As the high-throughput data in public repositories increases exponentially, computational algorithms and tools that utilize high-throughput transcriptome data and ChIP / DAP-seq data provide an alternate approach to infer gene regulatory relationships and GRNs.Howe v er, the validity of this approach relies on the accuracy of the methods.
Initially, high-throughput transcriptome data sets were primarily generated from the single-celled organisms like bacteria and yeast, or the eukaryotic cell lines.These organisms and cell lines allowed for the generation of timecourse microarray data with small time intervals (e.g.5-10 min), which spurred the de v elopment of dynamic methods, including differential equations ( 1 ), finite state linear model ( 2 ), dynamic Bayesian ( 3 ), Boolean network ( 4 ), and stochastic networks ( 5 ) and ordinary differential equations (ODE) ( 6 ).These methods r equir e time-course data sets with very small-time intervals for inferring gene regula tory rela tionships and GRNs due to the use of temporal variables.When microarray and RNA-seq were applied to multicellular organisms such as mammals and plants, 2 NAR Genomics and Bioinformatics, 2023, Vol. 5, No. 3 acquisition of data from specific cells or tissues in a time series became challenging and time-consuming.As a result, most transcriptome data sets generated from multicellular organisms are static data, which are either in a non-timecourse (e.g.treatments vs controls) or in a small time-course with very large time intervals (e.g. on a scale of many hours, days or weeks).This kind of data can be classified into static data because implementation of dynamic methods as mentioned above to this kind of data is inappropriate.Since it is not appropriate to analyze static data using dynamic methods, a number of static algorithms have been developed that do not rely on any temporal variables to simulate gene r egulatory r elationships.These methods includes ParCorA ( 7 ), Maxim um Relevance / Minim um Redundancy Network (MRNET) ( 8 ), Mutual Information Based Relevance Networks ( 9 ), Trustful Inference of Gene Regulation using Stability Selection (TIGRESS) ( 10 ), Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) ( 11 ), Context Likelihood of Relatedness (CLR) ( 12 ), Mutual Information 3 (MI3) ( 13 ), Probabilistic-Based Bayesian Network ( 14 ) and Random For ests ( 15 ).Recently, mor e methods have been developed for constructing multilayered hierarchical gene regulatory networks (ML-hGRNs), such as top-down ( 16 , 17 ), and bottom-up GGM algorithms ( 18 ), and BWERF ( 19 ), and GRNs that controls a pathway or a biological process, for instance, TGMI ( 20 ) and HB-PLS ( 21 ).In addition, the methods for constructing multiple joint GRNs using data from multiple tissues or conditions hav e been de v eloped, for instance, JGL ( 22 ) and JRmGRN ( 23 ).Some recent studies ( 21 , 23 ) have shown that the integration of machine learning, statistics and optimization for inferring transcription factor (TF)-target gene (TG) relationships is promising and may open a ne w av enue for identifying regulatory relationships and constructing GRNs.
Plant r esear chers often produce a large number of transgenic lines in which a TF is up-/ down-regulated, or use a transient expression system, for instance, CRISPR-CAS9 ( 24 ) or dTALE ( 25 ), to perturb a TF and then gener ate tr anscriptome data after the TF is ov ere xpressed, suppressed or perturbed.Even with this kind of da ta, identifica tion of the true target genes of the TF for further experimental studies is still a challenging task, e v en after a ChIP-seq or DAP-seq experiment of this TF is conducted.This is because ChIP-seq and DAP-seq experiments often yield hundred up to twenty thousand putati v e target genes.In addition, the ChIP / DAP-seq results show if a TF can bind to the promoters of candidate target genes but do not provide the information if the putati v e target genes are actually activ ated / suppressed b y the TF.Ther efor e, in-silico methods that enable identification of the true target genes of a TF using the above-mentioned transcriptomic data is desperately demanding.The aforementioned dynamic and static network construction methods are often not specifically tailored for inferring the true TGs of a TF from large number of candidate TGs.
In this study, four statistical selection methods were developed to infer the potential TGs for a gi v en TF by combining two loss functions, mean squared error (MSE) or Huber loss function (Huber), with two penalty functions, elastic net (ENET) or least absolute shrinkage and selec-tion operator (Lasso).These four methods ar e r eferr ed to as Huber-ENET, Huber-Lassso, MSE-ENET and MSE-Lasso.The MSE and Huber loss functions were used to measure the errors between the predicted values and the observed values.Huber can mor e r eadily avoid the sensitivity of heavy-tailed errors or outliers than MSE.The penalty functions, ENET and Lasso contain the l 1 norm of the estima ted ef fect sizes which can control the sparsity of the selected TGs.In addition, a network-based penalty (Net) was proposed, and combined with Huber or Lasso loss function to de v elop two Net-based methods, r eferr ed to as Huber-Net and MSE-Net, which can be used to identify pathway GRNs.Net penalty can incorporate prior annotated pathway or biological process information into the prediction ( 26 ).
To solve the r egr essions for the methods described above, an accelerated proximal gradient descent (APGD) algorithm was de v eloped for the parameter optimization in all six methods.Our simulations showed that the APGD was equally effecti v e but m uch faster than a commonl y used method called conv e x optimization solv er (CVX).To obtain stable selection results, we applied a stability selection method, the half-sample a pproach, w hich does not need to choose the optimal tuning parameters in selection methods.All the methods were tested using simulated data, and the four TF-TG identification methods were also tested with the real transcriptomic data from SND1 and gl3 ov ere xpression studies.In addition, the two Net methods were tested with the real transcriptome data of all metabolic pathway genes from the maize B73 line from public repository.Our stud y showed tha t the four TF-TG identifica tion methods had higher efficacy in genome-wide prediction than the three comparison methods, CLR, MRNET and TIGRESS, implying that the methods can be used to validate TGs of a TF resulting from ChIP-seq or DAP-seq experiments, while the two Net-based methods can identify pathway GRNs.

Simulated gene expression data
The simulated data were generated in three settings: (i) a general setting; (ii) two network settings: a hierarchical network setting and a Bar abasi-Albert (B A) network setting.In the general setting, p TGs were independent with each other and the first 50 TGs were regulated by a gi v en TF (details in Supplemental Text S1).In the network settings, we simulated p TGs with two biological network structur es, the hierar chical network and Barabasi-Albert network.For the hierarchical network, there were 5 disjointed subnetworks and each of them consisted of 100 T Gs.The subnetw ork was constructed as previously described ( 26 ) (Supplementary Figure S1).For the Barabasi-Albert (BA) network, there were 50 subnetworks and each of them was a BA-based network comprising of 10 TGs ( 27 ).Ther e wer e 45 TGs and 40 TGs that wer e r egulated by a gi v en TF for the hierarchical networ k and Barabasi-Albert networ k, respecti v ely (details in Supplemental Text S2).Bioinformatics, 2023, Vol. 5, No. 3 3

Populus trichocarpa SND1 o ver expr ession tr anscriptomic data and analysis
The poplar data used for simulation was from our previous study ( 16 ).The data can be retrie v ed from Gene Expression Omnibus (GEO) with accession number GSE49911.Briefly, the data was generated and then analyzed as following: Poplar protoplasts isolated from stem de v eloping xylem were transfected with the plasmid vector harboring poplar SND1, a TF that is known to control lignocellulosic biosynthesis, under the control of 35S promoter, and were then harvested for RNA-seq at 7, 12 and 25 h.Three samples of transfected protoplasts (35S-SND1) and three control samples (control vector without SND1) at each time point were harvested.The raw read counts of all genes of each sample were used for identification of differentially expressed genes (DEGs) at each time point using the edgeR package ( 28 ).Meanwhile, the raw reads of all genes of each sample were normalized with trimmed mean of M -values (TMM), a scaling method contained in the edgeR package.The normalized data was used for real data simulation to validate the methods we de v eloped in this study.

Maize gl3 o ver expr ession tr anscriptomic data and analysis
Two transcriptional-activator like effectors (dTALes) that target two non-overlapping 16-bp regions of the gl3 promoter for ov ere xpression were constructed.The two regions targeted are located 5 and 48 bp upstream of the gl3 's transcription start site.The 14-day-old seedlings were used to test the dTALes-mediated induction of gl3 .Three samples and three controls, upon being infected with Xv1601 bacteria carrying dTALes, dT1 or dT2, were harvested in a time-series with four time points: 18 and 24 h.Sequencing data were trimmed by Trimmomatic (version 0.38) ( 29 ) and trimmed r eads wer e aligned to the maize B73 r efer ence genome (B73Ref4) using STAR (2.7.3a) ( 30 ).Fragments per kilobase of transcript per million mapped reads FPKM values were generated with Cufflink package ( 31 ), and DEGs were identified with Cuf fdif f package ( 32 ).FPKM da ta were used for simulation with gl3 as a TF and all DEGs or all genomic genes as candidate target genes.

Maize B73 transcriptomic data for validation of net-based methods
In total, the expression levels of 736 RNA-seq data of B73 were downloaded from NCBI Sequence Read Archi v e (SRA) repository.The accession numbers are shown in Table S1.The sequence reads were preprocessed as described for gl3 data as described above.2539 unique pathway genes were extracted from the Plant Metabolic Network (PMN) ( 33 ) and 23 lignin pathway genes as well as 23 transcription factors (TFs) that are known to regulate lignin pathway (34)(35)(36)(37)(38) were used for validating the two Net-based methods, Huber-Net and MSE-Net, with three network construction methods, CLR, MRNET and TIGRESS used as comparison.

Rationale for methods
Consider that the expression levels of a TF y and the expression le v els of the TGs x in the whole-genome fit a linear relationship of the following: where n is the number of samples, is the e xpression le v els of p target genes in sample i , and y i is the expression level of the TF gene in sample i .β 0 is the intercept and β = ( β 1 , • • • , β p ) T are the regulated regression coefficients.The TF gene regulates target gene j if β j = 0 ( j = 1 , • • • , p ) ; the tar get gene j and tar get gene k ar e co-r egulated by TF if both β j = 0 and β k = 0 .ε i is independent and identically distributed random errors with mean 0 and variance σ 2 .

Statistical selection methods
Based on the above statistical model, we de v eloped four statistical selection methods to infer the potential TGs for a gi v en TF and two methods to infer pa thway regula tory networks based on the penalized r egr ession model.The general objecti v e function of the penalized r egr ession model was defined as where L ( β; y i , x i ) is the loss function according to the observ ed e xpression le v els of TGs and TF and P ( β; λ, α) is the penalty function which can control the sparsity of the selected TGs.
Loss functions.In the above general objecti v e function of the penalized r egr ession model, we consider ed the following two loss functions, mean squared error (MSE) and Huber.The MSE loss function is defined as 2 , which is very sensitive to outliers.The Huber loss function is defined as L Hub e r ( β; y, x ) , where H M ( z) is the Huber function for an input value z, which is quadratic function for small z values but grows linearly for large values of z.In this study, the parameter M is defaulted to be one-tenth of the interquartile range (IRQ), as suggested by Deng et al. ( 21 ).For any gi v en positi v e real M (called shape parameter), the Huber function is defined as Penalty functions.All of the three penalty functions we considered, Lasso, ENET and Net, contained the l 1 norm of the estima ted ef fect sizes ( β 1 ).The ENET penalty is the combination of the l 1 norm and squared λ > 0 and α ∈ [ 0 , 1 ] are the tuning parameters, where λ controls the sparsity and α is the mixing proportion between l 1 norm and l 2 norm.The Lasso penalty is the special case of ENET ( α = 1 ) and P Las s o ( β; λ, α) = λβ 1 , w hich onl y contains one tuning parameter λ > 0 .For the Net penalty, we assume that the genes involved in the same pathway are often coregulated by a TF or the same regulatory mechanism, which is supported by previous studies (39)(40)(41).The Net penalty function can utilize prior biological network knowledge such as genetic pathways ( 26 ), which is a combination of the l 1 norm and squared l 2 penalty using the genetic network structure.As introduced in Kim and Sun ( 26 ), the P Net ( β; λ, α) is defined as In formula ( 4), S = diag( s 1 , • • • , s p ) is a diagonal matrix whose diagonal entries are the signs of estimated r egr ession coefficients, which can be obtained from either the ordinary r egr ession when plt; n , or the ridge r egr ession when p ≥ n .It has been shown that the matrix S can accommodate the problem of failure of local smoothness between linked genes ( 42 ).For example, if two nearby target genes are negati v ely regulated by TF, the signs in their regression coefficients are expected to be different.In formula ( 4 ), L is a symmetric normalized La placian matrix, w here the elements of L , L jk , are gi v en by , where j ∼ k means that the target genes j and k are linked in the genetic network and d j is the total number of genes linked with the target gene j .Note that the genetic network information L are considered as the functional relationships among the target genes, which can be obtained from the existing annotation.For example, we can construct an association network using the pathways or biological processes information, where the targets genes are associated with each other if they are within a metabolic pathway or a biological process.
Based on the above two loss functions along with three penalty functions, we de v eloped six statistical selection methods, MSE-Lasso, MSE-ENET, MSE-Net, Huber-Lasso, Huber -ENET and Huber -Net.For a gi v en pair of λ and α, we can estimate the r egr ession coefficients of p target genes, β, by minimizing the objecti v e function f ( β; λ, α) introduced in formula ( 2 ).In other words, β = argmi n β f ( β; λ, α) .The penalty function P ( β; λ, α) is conv e x ( 26 , 43 ), so the solution to β can be obtained via one of the conv e x optimization algorithms.

Algorithm to solve the penalized r egr ession models
Since | β j | is conv e x but not dif ferentiable a t β j = 0 for j = 1 , • • • , p, it is difficult to use the gradient descent method to find β = argmi n β f ( β) .Although we can use the general conv e x optimization solv er CVX ( 44 ), it is too slow for real biological applications especially when there are a large number of genes involved in the analysis.Therefore, we adapted an accelerated proximal gradient descent (APGD) algorithm which is an effecti v e algorithm when the objecti v e function can be decomposed as a sum of a conv e x differentiab le function and a conv e x non-differentiab le function.In the six methods we de v eloped, the objecti v e function f ( β) can be decomposed as f ( β) = g( β) + h ( β) , where g( β) is a conv e x differentiab le function and h ( β) is a conv e x non-differentiab le function.The idea behind APGD method is to make a quadratic approximation to g( β) and leave h ( β) unchanged ( 45 ), then use the iterations to solve β = argmi n β f ( β) (Details in the Supplemental Texts S3-S8).

Selection probability
To obtain a stable selection result, we applied the stability selection method, namely, half-sample approach, to each target gene, which does not need to choose the optimal tuning parameters in selection methods.For a pair of fixed values of λ and α ( α = 1 for Lasso penalty), n/ 2 samples are selected at random without replacement and then the regr ession coefficients ar e estimated based on this subset of samples.This process is repeated B times for each pair of α and λ over a grid set of α and λ.Let ˆ β j ( S b ; α, λ) denote the estimated r egr ession efficient for the bth sample ( S b , b = 1 , • • • , B), the selection probability of target gene j , S P j , is the maximum portion of non-zero ˆ β j ( S b ; α, λ) over all pairs of α and λ.In other words, where Ther e ar e two major advantages for the use of selection probability.First, we avoid selecting the optimal tuning parameters λ and α, which is challenging in penalized r egr ession analysis.Second, it has been shown that the results obtained from the half-sample approach and the selection probabilities ar e mor e stable than those obtained from the cross-validation ( 26 , 46 ).The main challenge of the stability selection method is how to a ppropriatel y choose the grid sets of the two parameters λ and α.For a gi v en α, the smallest λ such that all estimated coefficients are zer os fr om two loss functions, MSE and Huber, can be defined as λ Hub e r max = max where is the gradient of Huber function.Ther efor e, the grid set of λ can be set as a log 10 -scale from ratio * λ max to λ max , where the ratio = 0 .01 as suggested by R package glmnet.

Implementation
Six statistical selection methods based on the penalized r egr ession models and the APGD algorithm for solving these six statistical methods had been implemented in both Python3 and R and then packed into TGPred.Both of them used commonly used libraries for scientific computing.For Python3 version of TGPred, we used numpy, scipy and sklearn to support efficient mathematical and dataframe computing, cvxp y to compar e the runtime and estimated results of APGD with commonly used CVX, and networkx to generate synthetic data based on the BA network setting.For R version of TGPred, we used Matrix and MASS to support the ef ficient ma thema tical computing, and mvtnorm and igraph to generate synthetic data.TGPred can be directly used within Python and R.Both regulation effect β j and selection probability S P j of target gene j can be calculated by TGPred for j = 1 , • • • , p.Note that the large-scale genetic data set is acceptable to APGD and the computation time was evaluated on the high-performance computing (HPC) cluster (Intel Xeon E5-2670 2.6 GHz, 16 GB RAM).For example, when the number of target genes is > 30 000 ( p > 30 000) and the half-sample approach with B = 500 times of resamplings was used, the computation times of ENET-based methods were about 12h CPU time with 90 pairs of tuning parameters α and λ; the computation times of Lasso-based methods were about 8h CPU time with 50 tuning parameters λ; and the computation times of Net-based methods were about 26 h CPU time with 90 pairs of tuning parameters α and λ.TGPred package has been made publicly available on GitHub as open-sour ce softwar e for downloading ( https://github.com/xueweic/TGPr ed ); mor e detailed information on how to install and run the tool was enclosed in the packages.

Validation of the methods with simulated data
Simulation studies were used to evaluate the performance of the six statistical selection methods we de v eloped based on the penalized r egr ession models.We considered three simulation settings, the general setting and two network settings, and we used n = 300 samples and p = 500 TGs in all simulation settings.For each simulation setting, the regulation effects for all genes based on each method were estimated by the improved APGD, and the selection probabilities were calculated by the half-sample approach with B = 500 times.Then, the true positi v e rates (TPRs) were used to e valuate the selection performance, which is defined as the number of the truly regulated genes among the selected top-ranked genes divided by the total number of truly regulated genes.
In the general setting, TGs were independent with each other.Ther efor e, we only compared the performances of Huber-Lasso , MSE-Lasso , Huber-ENET and MSE-ENET with the three comparison methods, CLR, MRNET and TIGRESS.Figure 1 showed the TPRs of these for methods based on the number of selected top-ranked genes.As it is known, the bigger pre-set regulation effect of β may result in the higher TPRs of all methods, while the lower pr e-set r egula tion ef fect may result in the lower TPRs.In both cases, we cannot dif ferentia te the performances of differ ent methods.Ther efor e, we pr e-set the r egula tion ef fect β = 0 . 2 or 0 .3 , and 50 TGs wer e r egulated by a gi v en TF in this simulation setting.When β = 0 .3 , all four methods performed equivalently well as we cut 40 top-ranked genes or less, achie v ed ov er 80% TPRs as we cut 50 top-ranked genes, and 95% TPRs as we cut 85 top-ranked genes.MSE-ENET and Huber-ENET performed better than the other fiv e methods.
For the network settings, we considered two network structur es, the hierar chical network (Supplementary Figur e S1) and the Barabasi-Albert network (not shown).Figure 2 showed how TPRs varied with the different numbers of the top-ranked genes for different methods.For the hierarchical network where 45 TGs (out of 500 genes) were truly regulated by a gi v en TF, we pre-set the regulation effects β = 0 .3 or 0 .4 .Since the Net penalty function incorporated the network structure, TPRs of Huber-Net and MSE-Net were higher than the other se v en methods.For the Barabasi-Albert network setting where 40 true TGs (out of 500 genes) wer e r egulated by a gi v en TF, we pre-set the regula tion ef fect β = 0 . 1 or 0 . 2 .Huber-Net and MSE-Net also had the highest TPRs as expected, indicating that Huber-Net and MSE-Net can incorporate the functionally associated genes to increase the probability of these genes to be selected as the TGs for a gi v en TF.Based on both TPRs, we concluded that Huber-Net and MSE-Net performed slightly better than MRNET and CLR and better than all other methods.Compared to the general setting, it is obvious that the four TF-TG identification methods performed less differentially in the two network settings, as shown in Figure 2 .
We also compared the computational efficiency and the r egr ession coefficients estimated by APGD and CVX, a commonly used package for conv e x optimization, for several pairs of tuning parameters λ and α.Figures S2-S4 showed that the computation times of CVX and APGD among all grid sets of α and λ based on B = 500 subsamples drawn with the half-sample approach.Supplementary Figure S2 showed the computation times of Huber-Lasso, Huber-ENET, MSE-Lasso and MSE-ENET under the general setting with β = 0 . 2 .For ENET penalty function, n λ = 1 , • • • , 10 indicated the order of selected λ in a log 10 -scale from ratio * λ max to λ max , where λ max is related to α = 0 . 1 , • • • , 0 .9 .For Lasso penalty, n λ = 1 , • • • , 100 indicated the order of selected λ in a log 10 -scale from ratio * λ max to λ max , where λ max is rela ted to α = 1 .The da ta sets were simulated under the same setting (Supplemental Text S1).All analyses were performed on a macOS (2.7 GHz Quad-Cor e Intel Cor e i7, 16 GB memory).APGD had much lower computational complexity than CVX since the running time of APGD was usually less than one fifth of that of CVX f or f our TF-TG identification methods (Supplementary Figure S2).A disadvantage of CVX is that all of the estimated r egr ession coefficients wer e not equal to 0 (around 10 −22 for non-zero r egr ession coefficients).Ther efor e, the stability selection method may not be applicable to the CVX method since it is difficult to find a cut-off threshold for the r egr ession coefficients.The APGD algorithm was also evaluated under the hierarchical network and Barabasi-Albert network settings for all six methods.As shown in Figures S3-S4, the computation times of APGD were much shorter than those of CVX no matter which methods (Huber-Lasso, Huber -ENET, Huber -Net, MSE-Lasso, MSE-ENET and MSE-Net) it was applied to.The results manifested the consistent lower computational complexity of APGD than CVX, as we had observed for the general setting (Supplementary Figure S2).The true positi v e rates (TPRs) of the four statistical selection methods for identifying transcription factor (TF)-target gene (TG) relationships in the general setting.The selection probabilities were calculated using the half-sample approach method with B = 500 times of resampling.Each curve r epr esents the mean of 100 simulations.The standard errors for all methods were too small, and were thus not plotted.
We compared the regression coefficients estimated by APGD and CVX for se v er al pairs of tuning par ameters λ and α.Figures S5-S7 showed that the QQ plots of the regression coef ficients estima ted by both CVX and APGD.Supplementary Figure S5 showed the estimation of regula tion ef fects of Huber -Lasso, Huber -ENET, MSE-Lasso and MSE-ENET under the general setting with β = 0 . 2 .The values lied along the diagonal line as the Huber loss function was used, indicating the r egr ession coefficients estimated by CVX and APGD were identical.When the MSE loss function was used, the non-zero estimations of regula tion ef fects of CVX wer e gr eater than those of APGD (Supplementary Figure S7).Howe v er, there were only 50 true TGs (out of 500 genes) that wer e r egulated by a gi v en TF in this simulation setting.That is, CVX obtained more false positi v es than APGD.Except for those false positi v es estimated by CVX, the r egr ession coefficients estimated by these two methods were nearly identical.Figures S6-S7 showed that the estimation of regulation effects of our proposed six statistical selection methods under the network setting, where we used β = 0 .4 in the hierarchical network setting (Supplementary Figure S6) and β = 0 . 1 in the Barabasi-Albert network setting (Supplementary Figure S7).We observed that the patterns of the estimation performance were similar to those shown in Supplementary Figure S5.

Validation of the four TF-TG identification methods with SND1 o ver expr ession tr anscriptomic data
178 DEGs in response to SND1 ov ere xpression were identified in our early publication ( 16 ).These 178 DEGs were classified into two groups, 84 direct target genes and 94 indirect target genes of SND1 using Top-down GGM Algorithm.Of these 84 direct target genes, 16 randomly drawn genes were tested with ChIP-PCR assay using SND1 antibodies, all of them were proven to be the true direct tar-get genes of SND1( 16).Sixteen genes r andomly dr awn from these 94 indirect target genes were also subjected to ChIP-PCR using SND1 antibodies and 15 were proven to be indirect target genes of SND1, indicating the 84 genes can be assumed to be true target genes of SND1 for testing our methods.Using the three time-point SND1 ov ere xpression transcriptomic data sets, we tested our methods for identifying these 84 direct target genes of SND1 from 178 DEGs and all 33691 genomic genes based on the selecti v e probabilities yielded from each method.The results, as shown in shown in Figures 3 A and B, demonstrated the following: (i) When applied to 178 DEGs, Huber-ENET and MSE-ENET in overall identified less positive TGs than CLR and MRNET methods but more than Huber-Lasso, MSE-Lasso and TI-GRESS; Huber-Lasso, MSE-Lasso identified more positi v e genes than TIGRESS.Based on ROCs, Huber-ENET and MSE-ENET appeared to rank more positi v e genes to the very top of TG list than any other methods.(ii) When applied to all 33 691 genomic genes, Huber-ENET and MSE-ENET identified more positi v e genes than Huber-Lasso, and MSE-Lasso, while Huber-Lasso and MSE-Lasso identified more positi v e TGs than CLR, MRNET and TIGRESS.

Validation of the four TF-TG identification methods with glossy (gl3) o ver expr ession tr anscriptomic data
We also validated our methods with gl3 o ver expr ession tr anscriptomic data that was generated with Transcriptional-Activator Like effectors (TALes) system.Two TALes, dT1 and dT2, were constructed to target two non-overlapping 16-bp regions (4 and 48 bp upstream of the transcription start site) in the gl3 's promoter to activate it.Analysis of RNA-seq data yielded at 24 h re v ealed 144 DEGs (93 upregulated and 51 downregulated genes), that were activ ated b y both dT1 and dT2 ( 25 ).From these 144 genes, we identified 93 tightly responsi v e genes to gl3 and all identified less positi v e genes than TIGRESS but more than CLR and MRNET.Huber-Lasso , MSE-Lasso , underperformed all other methods; (ii) w hen a pplied to all 30 263 genomic genes, Huber-ENET and MSE-ENET outperformed all other methods because they identified more positi v e genes than Huber-Lasso and MSE-Lasso, while Huber-Lasso and MSE-Lasso identified more positi v e TGs than CLR, MRNET and TIGRESS.

Validation of the two net-based methods for identifying lignin pathway genes and GRN
Maize B73 compendium transcriptomic data of 736 samples was used for predicting the regulatory relationships between transcription factor (TFs) and pathway genes (PWGs).A total of 2539 PWGs belonging to 446 pathways were obtained after the PWGs with 90% or more expression values being zero were removed.The Laplacian matrix L of 2539 PWGs was constructed based on 446 pathways, that is, two PWGs were associated together if they belong to at least one of 446 pathways.Since these 23 TFs are the known TFs that regulate lignin pathway in multiple plant species (34)(35)(36)(37)(38).We specifically examined 21 pathway genes in maize which were curated by Plant Metabolic Pathway ( 47 ) as lignin pathway genes.
We applied Huber-Net and MSE-Net to two input subsets of transcriptomic data : one contains 2539 PWGs × 736 samples, and the other contains 23 TFs × 736 sam-  S2 and S3, respecti v ely, and the results by the three comparison methods, CLR, MRNET and TIGRESS, were shown in Tables S4, S5 and S6, respecti v ely.We then e xtracted the selection probabilities of the 21 lignin PWGs × the 23 TFs resulting from all methods, and were shown in Table S7.Since the comparison methods, CLR, MRNET and TIGRESS, use different statistics to evaluate the regulatory strengths, we could not use the same criterion to cut off the ranked PWGs to each TF.We hypothesized that the top 100 genes identified from 2539 PWGs for each TF by each method are its putati v e TGs, and summarized the TGs for all 23 TFs for each method.We then extracted the TF-TG pairs where the TGs are lignin pathway genes and compared which method could identify more regulatory relationships between the lignin pathway genes and the 23 TFs.The r esults ar e shown in Figur e 4 , wher e TFs wer e ranked clockwise based on the number of their connectivity to pathway genes; the TFs with higher connectivity are assumed to regulate more pathway genes and / or have larger impact on pathway genes and thus were ranked highly.The results showed that Huber-Net and MSE-Net methods identified 50 and 49 regulatory relationships, respecti v ely, more than that of TIGRESS.The high number of regulatory relationships is the suggesti v e of a potential recognition of known regulatory relationships.Howe v er, both our methods identified less regulatory relationships compared to MRNET and CLR, which identified 84 and 70 r egulatory r elationships, r especti v el y.Currentl y, we do not know which method is better than another.This is because we know these 23 TFs are lignin pathway regulators but exactly which pathway genes are the true targets of which TF are mostly unknown, and we thus cannot draw further and more firm conclusions that our methods are better or worse than comparison methods.Ne v ertheless, if the objecti v e is to identify pa thway regula tors, the top TFs ranked for lignin pathway by different methods shared many TFs in common.We also cut off the ranked PWG list to each TF generated by two Net-based methods using a selection probability threshold of 0.9, and the r esults ar e shown in Supplementary Figur e S8, Huber-Net and MSE-Net identified 76 and 28 regulatory relationships, respecti v ely.The ranking or der of TFs changed slightly as compared to same methods as shown in Figure 4 .
Huber-Net identified the unique pathway genes that were not identified by any other methods including the three comparison methods.For example, C4H regulated by MYB103-2, and HCT-1 by KN AT7.MSE-Net uniquel y identified CCoAOMT1-2 regulated by MYB69, 4CL3 by MYB42, HCT-1 / 2 / 3 by MYB42.Huber-Net and MSE-Net to gether uniquel y identified F AH1-1, F AH1-2 by MYB63.To show the overlaps of the regulatory relationships identified by different methods, we generated a Venn diagram (Figure 5 ) based on the regulatory relationships shown in Figure 4 .The results indica ted tha t Huber-Net and MSE-Net are very similar methods because the gene regulatory relationships identified by the two methods had 42 in common.Similarly, MRNET and CLR are very similar methods because the gene regulatory relationships identified by two methods had 62 in common (Figure 5 ).In addition, of the 36 r egulatory r elationships identified by TIGRESS, 24 overlapped those identified by Huber-Net and / or MSE-Net, w hile onl y 17 overla pped those identified by MR-NET and / or CLR, indicating it is more similar to Huber-Net and MSE-Net methods rather than MRNET and CLR (Figure 5 ).

Solving convex optimization problem by implementing APGD
The loss functions and the penalty functions we used in this study are all conv e x functions ( 26 , 39 ).Although CVX is the software commonly used for solving conv e x optimization prob lems ( 44 ), but one ov ert prob lem of CVX is its slowness when being used for large datasets.In this study, we implemented an APGD algorithm ( 45 ) to replace the CVX.APGD is an effecti v e algorithm to solv e an optimization problem with a decomposable objective function.Our simulation studies have shown that APGD was not only capable of obtaining the similar estima ted regula tion ef fects of all TGs for a gi v en TF, but it also shortened the computation time to 1 / 5 of that by using CVX, enabling the prediction of true TGs of a TF from a large number of candidate TG genes (e.g.> 30 000 as demonstrated in Supplementary Figure S2).In principle, CVX cannot be used to calculate the stable selection probability.Stable selection probability is calculated based on the ratio of the number of non-zero estima ted regula tion ef fects of a TG to the number of times we resampled in the half-sample approach, and all candidate tuning parameters.When using APGD, we can obtain a subset of TGs with non-zero regulation effects, and the rest subsets of TGs with zero regulation effects.Ther efor e, we do not need to choose threshold by a ppl ying APGD to the half-sample approach.

Development and elucidation of six novel methods for identifying TGs of a TF
With the improved new APGD algorithm, we set out to de v elop nov el methods to predict the TGs of a TF of interest using omics data, an important issue that has not been resolved in current bioinformatics.Using two loss functions, Huber and MSE, and three penalty functions, Lasso, ENET and Net, we de v eloped four statistical selection methods, MSE-ENET, Huber-ENET, MSE-Lasso and Huber-Lasso for identifying TF-TG relationships, and two additional methods, MSE-Net and Huber-Net, for building pathway GRNs.The Huber loss function, which is a hybrid of squared errors for relati v ely small errors and absolute errors for relati v ely large errors (see Formula ( 3 )), has been shown to be more robust than MSE loss function when there are outliers ( 48 ).To test the four TF-TG identification methods, we used the synthetic data generated from the general setting and found that ENET-based methods performed better than Lasso-based methods if all TGs are independent (Figure 1 ).When the two network settings were used to test all six methods and the three comparison methods, the MSE-Net and Huber-Net outperformed the other four non-Net methods since the Net penalty could incorporate the network structure of TGs for enhancing the prediction.Notably, one tuning parameter λ from Lasso penalty and two tuning parameters α and λ from ENET or Net penalty were obtained from the crossv alidation b y minimizing the predicted accuracy ( 21 , 49 ).Howe v er, the results are not stable since the samples were randomly split in the cross-validation ( 26 ).Fortunately, a stability selection method has been de v eloped by Meinshausen and B ühlmann ( 46 ) that uses a subsampling approach to obtain a stable selection result; this subsampling approach can be used to determine the amount of regularization.In this study, we employed the selection probabilities to evaluate and select candidate TGs of a gi v en TF.
Plant biolo gists frequentl y employ a transient system or de v elop transgenic lines to perturb a TF, as shown in Figure 6 ; (i) this is followed by RNA-seq experiments to obtain the transcriptomic data; after the DEGs pertaining a gi v en TF are identified, biolo gists usuall y selected some DEGs based on their significant le v els (e.g.corrected P -values or q -values) to validate their functions; (ii) using a correlation method ( 50 ), a dependence-based method ( 11 , 51 ) or a modeling method to identify candidate genes to validate ( 52 ); (iii) using top-down GGM algorithm ( 16 , 17 , 53 , 54 ) to predict TGs of the TF from the DEGs; Howe v er, correlation and independence-based methods usually have a low accuracy and top-down GGM algorithm suffers from constraint to scaling up due to the high computational cost of searching the space of a complete combination of a subset of candidate genes.For this r eason, ther e is a pressing need to de v elop efficient methods for identifying the true TGs of a gi v en TF.In addition, ther e ar e some other circumstances where we need new methods to identify or validate the TGs of a TF.For example, when genome-wide experiments like ChIP-seq and DAP-seq are conducted to identify putati v e TGs of a TF, a few hundred to e v en twenty thousand putati v e TGs can be identified, indicating that their promoters contain TF binding site.Howe v er, the presence of a TFbinding site of the TF does not necessarily guarantee an activation.We need highly efficient methods to validate the existence of an effect-and-r esponse r elationship between the TF and the putati v e TGs identified by ChIP-/ DAP-seq.In this sense, our methods, Huber-ENET, Huber-Lasso, MSE-ENET and MSE-Lasso, fill in a gap of lacking an effecti v e method for predicting and / or validating TGs of a TF of interest using large-scale omics data.Such methods are sought by many biologists.Our methods resampled a large number of subsets of data (e.g.500) to compute the selection probabilities of all genes to one TF sim ultaneousl y, and then selected top-ranked TGs based on the stabilities of selection probabilities across all subsets.Ther efor e, our methods augmented the selection process and increased the reliability of TGs.Even if each time we computed linear relationships of one TF with all genomic genes or the DEGs resulting from the TF ov ere xpression with one re-sampled subset, the aggregation of the selection probabilities from all sampled subsets could increase the chance of the nonlinear true relationships being captured.
Instead of identifying TGs of a TF independently, we sometimes need to investigate if a TF regulates a pathway or a biological process.In this case, we can determine if a TF's TGs contain multiple genes belonging to a pathway or a gene ontology that r epr esents a biological process.Toward this goal, we de v eloped Huber-Net and MSE-Net methods based on a network-based penalty.In our e xtensi v e simulation studies based on the network setting, we showed that Huber-Net and MSE-Net performed better than the other four methods in terms of the true positi v e discovery rates.We then applied these two methods to identify true TGs of 23 TFs, which are known to regulate lignin pathway (34)(35)(36)(37)(38), from all 2539 pathway genes (PWGs) of maize.

The power of statistics, machine learning and optimization combined approaches
In this study, we combined statistics (half-sample approachderi v ed selection probability), machine learning (regularization in unsupervised learning) and conv e x optimization (solving regularization with APGD) to identify TGs of a TF of interest, the flowchart depicting the methods is illustrated in Figure 6 .Our results showed that this kind of combined approach has higher efficacy in identifying the true TGs, as we shown early ( 21 ).
In our methods, we utilized two loss functions.The Huber loss function is a combination of linear and quadratic loss functions, and the MSE loss function, which measures the average of the squared errors, ensures that our trained model has no outlier predictions with huge errors.MSE puts more weights on these errors due to the squared portion of the function.The ma thema tical benefit of MSE is particularly evident in its use at analyzing the performance of linear r egr ession, as it allows one to partition the variation in a dataset into variation explained by the model and v ariation explained b y randomness.Huber loss is more robust to outliers than MSE loss and least absolute deviation (LAD) loss, and has higher statistical efficiency than the LAD loss function in the absence of outliers ( 48 ).
In addition, we utilized three different penalty functions: Lasso, ENET and Net.Lasso penalty adds a penalty for non-zero coefficients to penalize the sum of their absolute values ( l 1 penalty).As a result, for high values of λ, many coefficients are exactly zero under Lasso.ENET penalty was proposed in response to the critique that the the variable selection of Lasso only considers the absolutely value of estimated effects, resulting in instability.ENET penalty combines the penalties of ridge r egr ession and Lasso to gain 'super-penalty'.Net penalty can incorporate a set of genes like a pathway or a biological process as r epr esented by a gene ontology, and enables us to investigate if a TF regulates multiple genes involved in a pathway or a biology process.When TGs of multiple TFs are predicted, we can then use the results to screen the TFs for regulation of a specific metabolic pathway, biological process and complex trait.
We demonstra ted tha t the improved APGD had less computational complexity for solving the conv e x optimization problem with both differentiable and undifferentiable functions.Traditional regularization methods need to choose optimal tuning parameters.One limitation of traditional regularization methods with cross-validation is that they depend on the sa tura tion of the da ta; dif ferent da ta sets may obtain different tuning parameter sets, leading to different or unstable results.APGD is a highly efficient approach to solve our proposed methods as well as the other penalized r egr essions.The incorporation of half-sample-based selection probability allows for more stable results and avoids the choice of optimal tuning parameters.Ther efor e, integration of statistics, machine learning and optimization enables us to take the advantage of all methods and combines them to generate a powerful approach to identify true TGs of a TF with high efficacy.
Due to the disadvantage of the feature selection procedure, we cannot check if the selected genes have strong evidence related to the outcome.For future studies, we plan to integra te sta tistical inference in the selection procedure and further investigate the selection performance by integrating both selection and statistical inference.

CONCLUSIONS
Four new statistical selection methods, r eferr ed to as Huber-ENET , MSE-ENET , Huber-Lasso and MSE-Lasso for identifying TGs of a TF of interest and two new methods, Huber-Net and MSE-Net, for inferring pathway GRNs hav e been de v eloped by integra tion of sta tistics, machine leaning and conv e x optimization approaches.An APGD algorithm was de v eloped to solv e conv e x optimization with significantly reduced computation times.Comprehensi v e simulations and analyses of the four TF-TG identification methods using synthetic data under a general setting indica ted tha t Huber-ENET , MSE-ENET , Huber-Lasso and MSE-Lasso could be used to identify true TGs of a TF with high efficacy, especially in genome-wide predictions.In simulations using the data from two network settings, Huber-Net and MSE-Net outperf ormed an y other non-Net methods for identifying true TGs involved in a pathway or biological process.For real data, the Huber optimization has a noticeable contribution to the identification of true TGs of a gi v en TF by increasing the selection probabilities as compared to MSE, and Huber-Net and MSE-Net could identify many unique r egulatory r elationships as compared to CLR, MRNET and TIGRESS.The ENET penalty function contributed greatly to enhancement of the method efficacy as compared to Lasso.AuROC plotting showed that all six methods could rank more positi v e known TGs to top of TG lists.The TF-TG identification methods de v eloped will fill a methodolo gical ga p for genome-wide TF-TG prediction and have a great potential for being used to validate ChIP / DAP-seq results, while the Net-based methods will be instrumental for inferring pathway GRN.

Figure 1 .
Figure1.The true positi v e rates (TPRs) of the four statistical selection methods for identifying transcription factor (TF)-target gene (TG) relationships in the general setting.The selection probabilities were calculated using the half-sample approach method with B = 500 times of resampling.Each curve r epr esents the mean of 100 simulations.The standard errors for all methods were too small, and were thus not plotted.

Figure 2 .
Figure2.The true positi v e rates (TPRs) of the six statistical selection methods in the two network settings, the hierarchical network and the Barabasi-Albert network.The selection probabilities were calculated using the half-sample approach method with B = 500 times of resampling.Each curve r epr esents the mean of 50 simulations.The standard errors for all methods were too small, and were not plotted.

Figure 3 .
Figure 3.The performance of four transcription factor (TF)-target gene (TG) identification methods.( A ) ROCs generated with the SND1 ov ere xpression transcriptomic data set of 178 differ entially expr essed genes (DEGs) (resulting from SND1 ov ere xpression) from Populus trichocarpa .( B ) ROCs generated with the SND1 ov ere xpression transcriptomic data set of all genes (33 691) from Populus trichocarpa .( C ) ROCs generated with the gl3 ov ere xpression transcriptomic data set of 144 DEGs (resulting from gl3 ov ere xpression) from Zea mays .( D ) ROCs generated with the gl3 ov ere xpression transcriptomic data set of all genes (30 263) from Zea mays .AuROC, area under the recei v er-oper ating char acteristic curve.

Figure 4 .
Figure 4.The gene regulatory networks of lignin pathway produced by Huber-Net and MSE-Net methods, with three network construction methods, CLR, MRNET and TIGRSS, used as comparison.The transcription factors (TFs) were ranked based on their connectivity to lignin pathway genes in clockwise.The number shown near each network is the number of edges identified by each method.

3 Figure 5 .
Figure 5.The Venn diagram that shows the overlaps of regulatory relationships between 23 transcription factors and lignin pathway genes identified by Huber-Net and MSE-Net, CLR, MRNET and TIGRESS methods.

Figure 6 .
Figure 6.An integrati v e frame wor k for identifying target genes (TGs) of a transcription factor (TF) of interest using transcriptomic data by integration of statistics, machine leaning and conv e x optimization.