Enabling Studies of Genome-Scale Regulatory Network Evolution in Large Phylogenies with MRTLE

Zhang, Shilu; Knaack, Sara; Roy, Sushmita

doi:10.1007/978-1-0716-2257-5_24

Shilu Zhang³,
Sara Knaack³ &
Sushmita Roy^3,4

Part of the book series: Methods in Molecular Biology ((MIMB,volume 2477))

1554 Accesses
1 Altmetric

Abstract

Transcriptional regulatory networks specify context-specific patterns of genes and play a central role in how species evolve and adapt. Inferring genome-scale regulatory networks in non-model species is the first step for examining patterns of conservation and divergence of regulatory networks. Transcriptomic data obtained under varying environmental stimuli in multiple species are becoming increasingly available, which can be used to infer regulatory networks. However, inference and analysis of multiple gene regulatory networks in a phylogenetic setting remains challenging. We developed an algorithm, Multi-species Regulatory neTwork LEarning (MRTLE), to facilitate such studies of regulatory network evolution. MRTLE is a probabilistic graphical model-based algorithm that uses phylogenetic structure, transcriptomic data for multiple species, and sequence-specific motifs in each species to simultaneously infer genome-scale regulatory networks across multiple species. We applied MRTLE to study regulatory network evolution across six ascomycete yeasts using transcriptomic measurements collected across different stress conditions. MRTLE networks recapitulated experimentally derived interactions in the model organism S. cerevisiae as well as non-model species, and it was more beneficial for network inference than methods that do not use phylogenetic information. We examined the regulatory networks across species and found that regulators associated with significant expression and network changes are involved in stress-related processes. MTRLE and its associated downstream analysis provide a scalable and principled framework to examine evolutionary dynamics of transcriptional regulatory networks across multiple species in a large phylogeny.

You have full access to this open access chapter, Download protocol PDF

Reconstruction and Analysis of the Evolution of Modular Transcriptional Regulatory Programs Using Arboretum

Improving Gene Regulatory Network Inference by Incorporating Rates of Transcriptional Changes

Article Open access 08 December 2017

Gene expression of functionally-related genes coevolves across fungal species: detecting coevolution of gene expression using phylogenetic comparative methods

Article Open access 20 May 2020

Key words

1 Introduction

Transcriptional regulatory networks define interactions between regulators, such as transcription factors (TFs) and signaling proteins and target genes [1], and control context-specific gene expression patterns important in various developmental, environmental, evolutionary, and disease processes. Recently, several studies have measured genome-wide expression levels in multiple species for comparative functional genomic analyses [2,3,4], facilitating a comprehensive study of how regulatory networks vary across species in response to environmental stimulus. However, computational tools to define and compare regulatory networks across multiple species in a complex phylogeny are in their early stages. To address this need, we developed Multi-species Regulatory neTwork LEarning (MRTLE), a computational approach, based on probabilistic graphical models [5] and multi-task learning to infer regulatory networks for multiple species in a phylogeny. The multi-task learning framework treats each species network inference as a separate task and uses the phylogenetic relationships as a probabilistic prior to constrain the inferred network topologies. A multi-task learning framework enables us to share information between tasks, which is beneficial for species with limited samples. MRTLE’s probabilistic framework also handles complex orthology relations arising due to loss and gene duplications that occur in large phylogenies spanning millions of years [6].

The inputs to MRTLE are a phylogenetic tree, branch-specific probabilities for gain or loss of a regulatory edge, gene expression data for each species, and finally gene orthology relationships. One optional input is sequence-specific transcription factor binding motifs for each species, which can be incorporated as prior knowledge to improve the quality of the inferred networks. The outputs of MRTLE are the regulatory networks of each species. The regulatory network of each species is modeled as a probabilistic graphical model [7]. The phylogenetic information is incorporated as a model structure prior to capture the evolutionary dynamics of gaining or losing edges from the ancestral species to the extant species. In this chapter, we describe how to apply MRTLE and several downstream analyses to examine network evolutionary dynamics across large phylogenies in detail. We demonstrate MRTLE’s application with a case study of six ascomycete yeast species and how these networks can be validated and analyzed to identify patterns of change that can be associated to different phylogenetic points as well as species-specific traits.

2 Materials

MRLTE is developed in C++, and the source code is publicly available at https://github.com/Roy-lab/mrtle. To compile MRLTE, a C++ compiler (e.g. GCC) and the GNU scientific library (https://www.gnu.org/software/gsl/) are needed to be installed with the option setting to C++14 with GNU extensions (-std=gnu++14). To run MRTLE on expression data in three species, with 1000 genes and 100 regulators, with 30 measurements per species, a laptop or desktop with adequate memory (<1 GB) and disk space (<1 GB) is sufficient. For large datasets, high-throughput computing resources are recommended. The memory requirements increase with the number of species, number of genes and regulators. For each species, MRTLE needs the expression data, the set of target genes, regulators, an orthology mapping (OGIDs file, below), and an optional collection of regulator-gene relationships based on sequence-specific motifs.

The following section provides the steps for installing and applying MTRLE using real expression data from six ascomycete yeasts species: Saccharomyces cerevisiae (S. cerevisiae), Candida glabrata (C. glabrata), Saccharomyces castellii (S. castellii), Candida albicans (C. albicans), Kluyveromyces lactis (K. lactis), and Schizosaccharomyces (S. pombe). The expression data measure the transcriptional response of genes to four stress conditions over time: glucose depletion, heat shock, osmotic stress, and oxidative stress and were originally presented in the MRLTE paper [6]. This dataset has 30 measurements for S. cerevisiae, C. glabrata, S. castellii, K. lactis, and C. albicans and 21 measurements for the sixth species, S. pombe. Species-specific sequence motifs identified by the Cladeoscope algorithm by Habib et al. [8] were used as prior information.

3 Methods

We first introduce some theory of the MRTLE model to help better understand the outputs of MRTLE (Subheading 3.1) and then describe the main steps of using MRTLE (Subheadings 3.2–3.5). To use MRTLE, there are four main steps: downloading and installing MRTLE (Subheading 3.2), preparing input files for MRTLE (Subheading 3.3), application of MRTLE to multi-species datasets (Subheading 3.4), evaluation and validation of results (Subheading 3.5).

3.1 The Theory of MRTLE Model

MRTLE is based on a probabilistic framework for multi-task learning [9], where each task is to learn the network for each species. Each network is modeled as a dependency network [10], which is a class of probabilistic graphical models for representing predictive statistical relationships among random variables. Each gene is modeled as a random variable X_j which encodes the expression level of gene j. A conditional probability distribution P(X_j | X_i) models the strength of the interaction between a regulator i and a gene j. In MRTLE, these distributions are estimated using conditional Gaussian distributions. MRTLE incorporates two priors, one is the per-species prior that controls the sparsity of the network and favors a regulatory edge if there exists a species-specific motif instance of the regulator in the promoter region, and another one is the phylogenetic prior. The per-species prior probability is modeled using a logistic function:

$$ P\left({R}_{ij}=1\right)=\frac{1}{1+{e}^{-\left({\beta}_0+{\beta}_1\ast {m}_{ij}\right)}} $$

where R_ij is a binary variable defining whether there is an edge between regulator i and target j (1 means present, 0 means absent). The β₀ parameter is a sparsity prior that controls the penalty of adding of a new edge to the network, which takes a negative value (β₀ < 0). A smaller value of β₀ will result in a higher penalty on adding new edges. The β₁ parameter controls how strongly motifs are incorporated as prior (β₁ ≥ 0). A higher value of β₁ will result in motif presence being valued more strongly to select an edge. β₁ is set to 0 when there is no species-specific motif information available. m_ij represents the motif instance score if gene j is bound by a motif of regulator i in its promoter region.

The phylogenetic prior is modeled using a continuous-time Markov process, parameterized by the gain and loss probabilities for each branch on the tree defined as:

$$ P\left({R}_{ij}^Z\right|{R}_{ij}^A\Big)={e}^{-Q{t}_z} $$

$ {R}_{ij}^Z $ denotes the state of the edge between regulator i and target gene j in species Z. A is the ancestor of species Z, and $ {R}_{ij}^A $ denotes the state of the edge between regulator i and target gene j in ancestor A. t_z denotes the branch length for species Z. The gain and loss probabilities for each branch on the tree depend upon the branch length (t_z) and the rate matrix Q which specifies the rates at which we expect regulators to gain or lose targets per unit time. To estimate the rate matrix Q, expectation-maximization-based approaches can be applied, such as SiPhy developed by Garber et al. [11, 12].

3.2 Downloading and Installing MRTLE

Below are the Unix commands to download and install MRTLE in a standard command line terminal. There are three steps: (1) Download the source code from GitHub. (2) Enter the MRTLE code directory using cd. (3) Compile the code.

1) git clone https://github.com/Roy-lab/mrtle.git 2) cd mrtle/code/ 3) make

3.3 Preparing Inputs for MRTLE

MRTLE needs several input files. Below we list these and describe its contents and how it can be created. All the files mentioned are available at https://github.com/Roy-lab/mrtle/tree/master/data.

3.3.1 Configuration File

This file specifies the location of the species-specific input data files and the location of the outputs. Each line corresponds to a species and should have the following tab-delimited species-specific entries: 1) species name, 2) filename of the expression data for each species, 3) the output directory, 4) filename of a list of regulators to be considered for a given target gene, 5) filename of a list of target genes to be used, 6) filename of prior network for each species. Here is an example of three lines in the configuration file (horizontal and vertical lines are just for clarity):

Scer	data/expression/Scer_geneexp.txt	out/Scer/	data/TFs/Scer_tf_mrna_binding.txt	data/genelists/Scer_allGenes.txt	data/motifs/Scer_motifs.txt
Cgla	data/expression/Cgla_geneexp.txt	out/Cgla/	data/TFs/Cgla_tf_mrna_binding.txt	data/genelists/Cgla_allGenes.txt	data/motifs/Cgla_motifs.txt
Scas	data/expression/Scas_geneexp.txt	out/Scas/	data/TFs/Scas_tf_mrna_binding.txt	data/genelists/Scas_allGenes.txt	data/motifs/Scas_motifs.txt

3.3.2 Expression Data File

In the file of the expression data for each species, each row represents one gene, and each column represents one condition. The first row is the header. Pre-processing of the expression data is described in Note 1. Here is an example of the expression data file for the S. cerevisiae species, Scer_geneexp.txt mentioned above.

Name	LAG	LL	DS	PS	PLAT	HS5
YAL016W	−2.57316	0.18674	0.351552	0.884193	0.461141	0.305003
YAL042W	−0.827777	−0.068987	−0.223257	−1.07268	−1.00647	0.0169562
YAL053W	0.981654	0.421282	0.0732861	−0.274487	0.344917	0.0213995
YBL030C	−0.956715	0.738636	1.99264	2.5738	2.57844	−0.077775
YKL112W	0.91353	−0.441429	−0.512388	−2.79075	−1.76492	0.264341

3.3.3 Species-Specific Regulators File

This file contains the list of the genes to be considered as potential regulators for each species. A regulator per line is specified. Example lines from the species-specific regulators file Scer_tf_mrna_binding.txt:

YKL112W YAL021C YBR130C YAL011W

3.3.4 Species-Specific Targets File

This file contains the list of the genes to be considered as potential targets for each species. Each line lists the name of the target gene. Example lines from the species-specific target genes file Scer_allGenes.txt:

YAL016W YAL042W YAL053W YBL030C

3.3.5 Motif-Based Prior Network File (Optional)

The motif-based prior network file has three tab-separated columns, listing the regulator, target gene, and motif score. Example lines from the prior network file Scer_motifs.txt:

YKL112W	YAL016W	0.868431
YKL112W	YAL042W	0.902258
YKL112W	YAL053W	0.836423
YKL112W	YBL030C	0.807526

3.3.6 Tree File

This file specifies the phylogenetic tree structure and can be obtained from public resources like the Ensembl Compara database (http://useast.ensembl.org/info/genome/compara/index.html). For the six yeasts species example, the species tree was obtained from Wapinski et al. [13]. To generate the species tree, multiple sequence alignments from orthogroups were concatenated [13], and the tree topology was learned using the Phylip package [14]. The PAML [15] package is another tool that can be used for the species-tree-inference task from gene orthology data. The tree file should have five columns: (1) the child species, (2) the relation of the child species to the parent species, whether it is a left or right child, (3) the parent species, (4) branch-specific gain probability p_g, ranging from 0 to 1, (5) branch-specific loss probability p_l, ranging from 0 to 1. In Note 2, we discuss how to determine branch-specific gain probability p_g and branch-specific loss probability p_l. Example lines from the file yeast_tree_rates.txt are as below:

Anc11	left	Anc14	0.0529786	0.882977
Spom	right	Anc14	0.0538444	0.897406
Anc9	left	Anc11	0.0465452	0.775753
Calb	right	Anc11	0.0543971	0.906619
Anc5	left	Anc9	0.022713	0.378551
Klac	right	Anc9	0.0515299	0.858832
Anc4	left	Anc5	0.0139051	0.231751
Scas	right	Anc5	0.0423569	0.705948
Scer	left	Anc4	0.0454312	0.757187
Cgla	right	Anc4	0.0471082	0.785137

3.3.7 Gene Orthology Mappings File (OGID File)

This file has the orthogroup information for genes. The gene trees can be obtained from public resources like the Ensembl Compara database (http://useast.ensembl.org/info/genome/compara/index.html), and the Inparanoid eukaryotic ortholog database (http://inparanoid.sbc.su.se/cgi-bin/index.cgi) [16]. For the six yeasts species example, the gene orthology mappings were created using the SYNERGY algorithm [17]. Each row represents an orthogroup of genes. The first column of this file is of the format OG{ID}_{DUP}. Each orthogroup is denoted by an ID, which takes a numeric value. For orthogroups with duplications, DUP denotes the i^th duplication. If there are no duplications in the dataset, DUP is set to 1. The second column of this file is the gene name for each orthogroup from the respective species and separated by a comma across all species. The gene names should be in the same order as in the species list file described next (Subheading 3.3.8). If an orthogroup does not exist in one species, NONE is used as the gene name. Gene names must be unique across species. Example lines from the file OGid_members.txt are as follows:

Gene_OGID	NAME
OG16_1	YLL043W,CAGL0C03267g,Scas400.1,KLLA0E00550g,NONE,SPAC977.17
OG16_2	YLL043W,CAGL0C03267g,Scas400.1,KLLA0E00550g,NONE,SPAC977.17
OG23_1	NONE,NONE,NONE,NONE,NONE,SPAC1F8.07c

3.3.8 Species List File

This file contains the list of all the species with orthology relationships in the OGID file. The order should be the same as the species listed in the second column of the OGID file. Example lines from the file specorder_allclade.txt are as below:

Scer Cgla Scas Klac Calb Spom

3.3.9 Potential Regulator Orthogroups ID File

This file contains the list of IDs of the orthogroups to be considered as potential regulators. The ID is the same number as in the OGID file (Subheading 3.3.7). A regulator must also be present in the species-specific list of regulators given in the species-specific configuration file (Subheading 3.3.1). Example lines from the file TFs_OGs.txt are as below:

10261 1033 104 105

3.3.10 Potential Target Orthogroups ID File

This file specifies the list of IDs of the orthogroups to be considered as potential targets. This option can be used as a way to parallelize MRTLE on separate gene sets (see Note 4). The ID is same number as in the OGID file describing the orthology relationships (Subheading 3.3.7). A target must also be present in the species-specific list of targets given in the species-specific configuration file (Subheading 3.3.1). Example lines from the file AllGenes.txt are as below:

100 1000 1001 10013

3.4 Applying MRTLE to Multi-Species Data

MRTLE takes the following inputs: (1) expression data for each species, (2) a phylogenetic tree relating the species, (3) branch-specific probabilities for gain or loss of a regulatory edge, (4) gene orthology relationships specified in an OrthoGroup ID (OGID) file, (5) list of regulators in each species, (6) optional species-specific regulatory information, for example, from motif instances in gene promoters (Fig. 1a). Example inputs are available from the website https://github.com/Roy-lab/mrtle/tree/master/data. These are specified through various MRTLE arguments and file contents described in Subheading 3.3.

The arguments of MRTLE are: (1) a configuration file, (2) maximum number of regulators to be considered for a given target gene, (3) the number of cross-validation, (4) the gene orthology relationship file, (5) a list of the orthogroups to be considered as potential regulators, (6) a list of orthogroups to be considered as target genes, (7) the phylogenetic tree, (8) a list of the species, (9) the sparsity parameter, and (10) the parameter that controls how strongly the motif networks are incorporated as prior. An example usage of running MRTLE using the example data is as below:

./mrtle -f data/speciesconf.txt -x300 -v1 -m data/OGid_members.txt -l data/TFs/TFs_OGs.txt -n data/genelists/AllGenes.txt -d data/yeast_tree_rates.txt -s data/specorder_allclade.txt -b -0.9 -q 4.0

Below we describe each of these arguments to the MRTLE program in detail.

1.
-f takes the configuration file in which each row represents information of one species. See Subheading 3.3.1.
2.
-x specifies the maximum number of regulators to be considered for a given target gene. This can be a large number, e.g., 200, to enable prediction of as many regulators as possible.
3.
-v specifies the number of cross-validation, which can be set to 1 as well. If it is set to 1, the model is trained on the entire data.
4.
-m specifies the name of the OGID file, describing the gene orthology relationships. See Subheading 3.3.7 for the description of how to make this file.
5.
-l specifies the list of IDs of the orthogroups to be considered as potential regulators. The ID number is the ID in the OGID file (specified by -m). See Subheading 3.3.9 for details.
6.
-n specifies the list of IDs of the orthogroups to be considered as potential targets. This option can be used as a way to run MRTLE in parallel on separate gene sets (see Note 4). The ID number is the ID in the file describing the orthology relationships (specified by -m). See Subheading 3.3.10 for details.
7.
-d takes the file of the phylogenetic tree. This file provides the phylogenetic relationship of the species, and branch-specific gain and loss probabilities. See Subheading 3.3.6 for how to prepare the tree file.
8.
-s specifies the list of all the species present in the file describing the orthology relationships (specified by -m). See Subheading 3.3.8 for how to prepare the species list file.
9.
-b specifies the β₀ parameter which controls the sparsity of the network, a penalty for adding new edges. In Note 3, we discuss how to choose the β₀ parameter.
10.
-q specifies the β₁ parameter which controls how strongly motif instances are incorporated as prior. A higher value will result in motif network being valued more strongly. See Note 3 about how to choose the β₁ parameter.

3.5 MRTLE Outputs and their Interpretation

Assuming MRTLE is run according to the usage provided in Subheading 3.4, the MRTLE outputs will be available in the out/ directory where the command is executed. These results include (1) inferred regulatory network for each species within its own species-specific folder: var_mb_pw_k299.txt, (2) model parameters: modelparams.txt, (3) the data likelihood of the MRTLE model: scoreFile.txt. Most of the interpretation is done on the network file.

In the regulatory network file, the first column denotes the regulator, the second column denotes the target gene, and the third column denotes the regression edge weight capturing the strength of the interaction between the regulator and the target gene (see Subheading 3.1). Example lines from the regulatory network file var_mb_pw_k299.txt are given below:

YDR443C	Q0045	−0.0392211
YPL128C	Q0045	0.16192

In order to get reliable results, we typically run MRTLE algorithm in a stability selection framework. We randomly subsample the expression data for each species 25 times keeping 2/3 of the samples each time, and subsequently apply MRTLE algorithm on each subsample to infer a regulatory network. More subsamples could be beneficial if there are more measurements in the expression data. A confidence score is then calculated for each edge as the fraction of how frequently that edge is present in the inferred networks across all subsamples.

MRTLE outputs can be interpreted using several downstream analyses:

1.
Compare MRTLE inferred regulatory networks to gold standard networks, such as transcription factor knockout network and ChIP-chip-based regulatory network (Fig. 1b).
2.
Identify target genes of some TFs of interest, visualize the regulatory subnetworks in each species, and compare the dynamics of targets (Fig. 2).
3.
Compare MRTLE inferred regulatory networks across species to get patterns of conservation and divergence across a phylogenetic tree (Fig. 3a).
4.
Visualize the inferred regulatory networks in each species to identify important hubs (Fig. 3b).
5.
Identify transcriptional modules using algorithms such as Arboretum [18] and combine MRTLE inferred regulatory networks with gene modules to get context-specific networks associated with different gene modules.
6.
Assess the role of a regulator as an activating or repressive factor by computing the significance of overlap between the targets of this regulator in MRTLE inferred networks and the targets in activating module versus repressive module based on hypergeometric distribution. We demonstrate the first four analyses as the examples in this chapter.

First, we evaluate how well MRTLE inferred networks recover edges from gold standard networks, derived from large-scale experiments in S. cerevisiae as well as smaller regulator specific datasets in other species. We constructed a ChIP-chip-based regulatory network using datasets from MacIsaac et al. [19] and a knockout network obtained by deleting regulators and assessing the effects on downstream expression levels from Hu et al. [20]. We compared the MRTLE-inferred network for S. cerevisiae to these two gold standard networks, computing precision and recall for each confidence threshold from 0 to 1. The precision and recall curves shown in Fig. 1b visualize the area under the precision–recall curve (AUPR) for comparison to the MacIsaac et al. (0.0607) and Hu et al. (0.0206) networks [19, 20]. The greater the AUPR, the better the inferred networks recover the experimental validated edges. Here the AUPR is not high and is understood to be dominated by the small size of the two gold standard networks available for comparison. In addition, we compared the target genes inferred for specific regulators with ChIP-chip data in species other than S. cerevisiae. We obtained binding gene targets for a set of TFs measured with ChIP-chip experiments by Tuch et al. in S. cerevisiae, K. lactis, and C. albicans [21], and tested how well the MRTLE-inferred networks recapitulate these experimentally derived target sets. We compared how many targets are recovered in each species using AUPR, F-score, and fold enrichment metrics. Among 227 targets of MCM1 in S. cerevisiae, 99 target genes are recovered in S. cerevisiae MRTLE inferred network considering all edges (AUPR: 0.148, Fig. 2a). Among 562 targets of MCM1 in K. lactis, 183 target genes are found in K. lactis MRTLE inferred network (AUPR: 0.187, Fig. 2a). 225 target genes out of 680 targets of MCM1 in C. albicans are recovered in MRTLE inferred network (AUPR: 0.153, Fig. 2a). Among all predicted MCM1 targets, we picked top 500, 1000 MCM1 targets with highest confidence and obtained an F-score ranging from 0.130 to 0.196 compared to ChIP-chip MCM1 targets (Fig. 2a). Moreover, we selected the top 30k edges inferred in each species ranking by the confidence score and calculated the fold enrichment of the gene targets of MCM1 from ChIP-chip data in the predicted MCM1 targets (Fig. 2a). We found that the predicted targets were enriched for MCM1 targets from ChIP-chip data in all three species.

MRTLE inferred networks can be compared to assess the global pattern of conservation and divergence based on the similarity of the networks of each pair of species (Fig. 3a). We picked the top 30k edges based on confidence in each species, and computed the F-score of similarity for each pair of networks to quantify the variation in the networks across species. We found that the inferred networks exhibit patterns of conservation and divergence consistent with the phylogenetic tree. For example, S. cerevisiae is most similar to C. glabrata (F-score 0.578), its similarity to S. castellii (F-score 0.536), K. lactis (F-score 0.480), and C. albicans (F-score 0.279) decreases and is least similar to S. pombe (F-score 0.188). Similarly, the network similarity of C. glabrata with S. castellii, K. lactis, C. albicans, and S. pombe decreases along the phylogenetic tree. This indicates that the networks inferred by MRTLE obey a phylogenetic trend.

MRTLE networks can also be compared by visualizing the subnetworks of known and important regulators across species (e.g., in our study, regulators involved in stress response are useful to compare across species). As an example, we visualize the MRTLE subnetworks for targets genes of the transcription factor MCM1 in S. cerevisiae, K. lactis, and C. albicans (Fig. 2b). The top 100 target genes of MCM1 are shown and the nodes supported by the MCM1 targets from ChIP-chip experiments are named. To compare how the targets change across species, we picked top 1000 target genes from each inferred network and computed the F-score similarity. We see that target genes in S. cerevisiae and K. lactis are more conserved (F-score 0.405), compared to S. cerevisiae and C. albicans (F-score 0.362), indicating the phylogenetic trend is also recapitulated at a local factor-specific level (Fig. 2c). Additionally, we visualize the MRTLE networks of top 1k edges in S. cerevisiae and C. glabrata (Fig. 3b) and identify top important hubs such as MSN2, MSN4, STP1, KRE33, and BAS1 which are conserved in both the species. MSN2 and MSN4 have been found to be general stress response regulators and are key transcriptional regulators of yeast glycolysis by regulating the expression of glycolytic enzymes [22]. STP1 plays a key role in pre-tRNA processing in yeast, regulating the expression of genes encoding amino-acid permeases and controlling the uptake of several amino acids [23].

4 Notes

1.
Pre-processing of Gene Expression Data.

The input expression data for MRTLE can be Microarray data or RNA-sequencing (RNA-seq) data. MRTLE model assumes that the expression values are from a multivariate Gaussian distribution, so we typically perform some normalization and transformation on the expression data. For microarray data, the values are log ratio of gene expression level from an experimental sample and gene expression level in the reference sample, which can be used as input expression data without further transformation. The transcriptomic data in the six yeast species example are microarray data. For RNA-seq data, we recommend normalizing the data across all samples for each species by applying a log transformation and doing quantile normalization.
2.
Estimating the Parameters: Branch-Specific Gain and Loss Probabilities.

Setting the appropriate prior probability of gaining or losing an edge for each branch along the phylogenetic tree may need some tunings and knowledge about the phylogeny. For our example data with six yeasts species, the rate matrix Q was estimated using the average rate of gaining or losing a motif binding site as in the Habib et al. paper [8]. The branch-specific gain probability $ {p}_g=P\left({R}_{ij}^Z=1\right|{R}_{ij}^A=0\Big) $ and branch-specific loss probability $ {p}_l=P\left({R}_{ij}^Z=0\right|{R}_{ij}^A=1\Big) $ were then calculated based on this rate matrix and the branch length for each branch in the phylogenetic tree.
3.
Setting the Parameters: β₀ and β_1.

The β₀ parameter is a sparsity prior that controls the penalty of adding of a new edge to the network, which takes a negative value (β₀ < 0). We suggest trying β₀ for values between −5 and −0.1. A smaller (more negative) value will add a greater cost to edge addition and end up with a sparser network. The β₁ parameter controls how strongly motifs are incorporated as prior (β₁ ≥ 0). We suggest setting β₁ to a value between 0 and 5. A higher value will increase the weight of the motif prior network. MRTLE can be run with different configurations of β₀ and β₁, and one can choose the configuration that gives the highest data likelihood score or recapitulates the known phylogenetic trends.
4.
Running MRTLE in Parallel for Multiple Gene Subsets.

To make MRTLE run faster, we can split the target gene set into subsets, e.g., of 50 genes and run MRTLE in parallel on each set. This can be implemented by splitting the potential target orthogroups into non-overlapping sets of 50 orthogroups per set and storing the corresponding OGID numbers into AllGenes0.txt, AllGenes1.txt, etc. (Subheading 3.3.10). Next, MRTLE can be run in parallel using each of the potential target orthogroup ID file (AllGenes${i}.txt) while keeping the rest of the arguments the same. After all of the parallel runs are finished, we merge the inferred networks of each 50 target genes together as the final inferred network.

References

Kim HD, Shay T, O’Shea EK et al (2009) Transcriptional regulatory circuits: predicting numbers from alphabets. Science 325:429–432
Article CAS PubMed PubMed Central Google Scholar
Brawand D, Soumillon M, Necsulea A et al (2011) The evolution of gene expression levels in mammalian organs. Nature 478:343–348
Article CAS PubMed Google Scholar
Brawand D, Wagner CE, Li YI et al (2014) The genomic substrate for adaptive radiation in African cichlid fish. Nature 513:375–381
Article CAS PubMed PubMed Central Google Scholar
Thompson DA, Roy S, Chan M et al (2013) Evolutionary principles of modular gene regulation in yeasts. elife 2:e00603
Article PubMed PubMed Central Google Scholar
Markowetz F, Spang R (2007) Inferring cellular networks—a review. BMC Bioinformatics 8:S5
Article PubMed PubMed Central Google Scholar
Koch C, Konieczka J, Delorey T et al (2017) Inference and evolutionary analysis of genome-scale regulatory networks in large phylogenies. Cell Syst 4:543–558.e8
Article CAS PubMed PubMed Central Google Scholar
Friedman N (2004) Inferring cellular networks using probabilistic graphical models. Science 303:799–805
Article CAS PubMed Google Scholar
Habib N, Wapinski I, Margalit H et al (2012) A functional selection model explains evolutionary robustness despite plasticity in regulatory networks. Mol Syst Biol 8:619
Article PubMed PubMed Central Google Scholar
Caruana R (1997) Multitask Learning. Mach Learn 28:41–75
Article Google Scholar
Heckerman D, Chickering DM, Meek C et al (2001) Dependency networks for inference, collaborative filtering, and data visualization. J Mach Learn Res 1:49–75
Google Scholar
Hobolth A, Jensen JL (2005) Statistical inference in evolutionary models of DNA sequences via the EM algorithm. Stat Appl Genet Mol Biol 4:18
Article Google Scholar
Garber M, Guttman M, Clamp M et al (2009) Identifying novel constrained elements by exploiting biased substitution patterns. Bioinformatics 25:i54–i62
Article CAS PubMed PubMed Central Google Scholar
Wapinski I, Pfeffer A, Friedman N et al (2007) Natural history and evolutionary principles of gene duplication in fungi. Nature 449:54–61
Article CAS PubMed Google Scholar
Felsenstein J (1989) PHYLIP-phylogeny inference package (version 3.2). Cladistics 5:164–166
Google Scholar
Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24:1586–1591
Article CAS PubMed Google Scholar
O’Brien KP, Remm M, Sonnhammer ELL (2005) Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res 33:D476–D480
Article PubMed Google Scholar
Wapinski I, Pfeffer A, Friedman N et al (2007) Automatic genome-wide reconstruction of phylogenetic gene trees. Bioinformatics 23:i549–i558
Article CAS PubMed Google Scholar
Roy S, Wapinski I, Pfiffner J et al (2013) Arboretum: reconstruction and analysis of the evolutionary history of condition-specific transcriptional modules. Genome Res 23:1039–1050
Article CAS PubMed PubMed Central Google Scholar
MacIsaac KD, Wang T, Gordon DB et al (2006) An improved map of conserved regulatory sites for Saccharomyces cerevisiae. BMC Bioinformatics 7:113
Article PubMed PubMed Central Google Scholar
Hu Z, Killion PJ, Iyer VR (2007) Genetic reconstruction of a functional transcriptional regulatory network. Nat Genet 39:683–687
Article CAS PubMed Google Scholar
Tuch BB, Galgoczy DJ, Hernday AD et al (2008) The evolution of combinatorial gene regulation in fungi. PLoS Biol 6:e38
Article PubMed PubMed Central Google Scholar
Kuang Z, Pinglay S, Ji H et al (2017) Msn2/4 regulate expression of glycolytic enzymes and control transition from quiescence to growth. elife 6:e29938
Article PubMed PubMed Central Google Scholar
Jorgensen MU, Gjermansen C, Andersen HA et al (1997) STP1, a gene involved in pre-tRNA processing in yeast, is important for amino-acid uptake and transcription of the permease gene BAP2. Curr Genet 31:241–247
Article CAS PubMed Google Scholar

Download references

Acknowledgments

This work is supported by the National Institutes of Health (R01-HG010045-01), James McDonnell Foundation and the National Science Foundation (DBI-1350677).

Author information

Authors and Affiliations

Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, WI, USA
Shilu Zhang, Sara Knaack & Sushmita Roy
Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
Sushmita Roy

Authors

Shilu Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Sara Knaack
View author publications
You can also search for this author in PubMed Google Scholar
Sushmita Roy
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Sushmita Roy .

Editor information

Editors and Affiliations

Institut de Biologie Paris Seine, Sorbonne Université, Paris, France
Frédéric Devaux

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Reprints and permissions

Copyright information

About this protocol

Cite this protocol

Zhang, S., Knaack, S., Roy, S. (2022). Enabling Studies of Genome-Scale Regulatory Network Evolution in Large Phylogenies with MRTLE. In: Devaux, F. (eds) Yeast Functional Genomics. Methods in Molecular Biology, vol 2477. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2257-5_24

Download citation

DOI: https://doi.org/10.1007/978-1-0716-2257-5_24
Published: 01 May 2022
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-2256-8
Online ISBN: 978-1-0716-2257-5
eBook Packages: Springer Protocols

Publish with us

Policies and ethics