Key words

1 Introduction

Transcriptional regulatory networks define interactions between regulators, such as transcription factors (TFs) and signaling proteins and target genes [1], and control context-specific gene expression patterns important in various developmental, environmental, evolutionary, and disease processes. Recently, several studies have measured genome-wide expression levels in multiple species for comparative functional genomic analyses [2,3,4], facilitating a comprehensive study of how regulatory networks vary across species in response to environmental stimulus. However, computational tools to define and compare regulatory networks across multiple species in a complex phylogeny are in their early stages. To address this need, we developed Multi-species Regulatory neTwork LEarning (MRTLE), a computational approach, based on probabilistic graphical models [5] and multi-task learning to infer regulatory networks for multiple species in a phylogeny. The multi-task learning framework treats each species network inference as a separate task and uses the phylogenetic relationships as a probabilistic prior to constrain the inferred network topologies. A multi-task learning framework enables us to share information between tasks, which is beneficial for species with limited samples. MRTLE’s probabilistic framework also handles complex orthology relations arising due to loss and gene duplications that occur in large phylogenies spanning millions of years [6].

The inputs to MRTLE are a phylogenetic tree, branch-specific probabilities for gain or loss of a regulatory edge, gene expression data for each species, and finally gene orthology relationships. One optional input is sequence-specific transcription factor binding motifs for each species, which can be incorporated as prior knowledge to improve the quality of the inferred networks. The outputs of MRTLE are the regulatory networks of each species. The regulatory network of each species is modeled as a probabilistic graphical model [7]. The phylogenetic information is incorporated as a model structure prior to capture the evolutionary dynamics of gaining or losing edges from the ancestral species to the extant species. In this chapter, we describe how to apply MRTLE and several downstream analyses to examine network evolutionary dynamics across large phylogenies in detail. We demonstrate MRTLE’s application with a case study of six ascomycete yeast species and how these networks can be validated and analyzed to identify patterns of change that can be associated to different phylogenetic points as well as species-specific traits.

2 Materials

MRLTE is developed in C++, and the source code is publicly available at https://github.com/Roy-lab/mrtle. To compile MRLTE, a C++ compiler (e.g. GCC) and the GNU scientific library (https://www.gnu.org/software/gsl/) are needed to be installed with the option setting to C++14 with GNU extensions (-std=gnu++14). To run MRTLE on expression data in three species, with 1000 genes and 100 regulators, with 30 measurements per species, a laptop or desktop with adequate memory (<1 GB) and disk space (<1 GB) is sufficient. For large datasets, high-throughput computing resources are recommended. The memory requirements increase with the number of species, number of genes and regulators. For each species, MRTLE needs the expression data, the set of target genes, regulators, an orthology mapping (OGIDs file, below), and an optional collection of regulator-gene relationships based on sequence-specific motifs.

The following section provides the steps for installing and applying MTRLE using real expression data from six ascomycete yeasts species: Saccharomyces cerevisiae (S. cerevisiae), Candida glabrata (C. glabrata), Saccharomyces castellii (S. castellii), Candida albicans (C. albicans), Kluyveromyces lactis (K. lactis), and Schizosaccharomyces (S. pombe). The expression data measure the transcriptional response of genes to four stress conditions over time: glucose depletion, heat shock, osmotic stress, and oxidative stress and were originally presented in the MRLTE paper [6]. This dataset has 30 measurements for S. cerevisiae, C. glabrata, S. castellii, K. lactis, and C. albicans and 21 measurements for the sixth species, S. pombe. Species-specific sequence motifs identified by the Cladeoscope algorithm by Habib et al. [8] were used as prior information.

3 Methods

We first introduce some theory of the MRTLE model to help better understand the outputs of MRTLE (Subheading 3.1) and then describe the main steps of using MRTLE (Subheadings 3.2–3.5). To use MRTLE, there are four main steps: downloading and installing MRTLE (Subheading 3.2), preparing input files for MRTLE (Subheading 3.3), application of MRTLE to multi-species datasets (Subheading 3.4), evaluation and validation of results (Subheading 3.5).

3.1 The Theory of MRTLE Model

MRTLE is based on a probabilistic framework for multi-task learning [9], where each task is to learn the network for each species. Each network is modeled as a dependency network [10], which is a class of probabilistic graphical models for representing predictive statistical relationships among random variables. Each gene is modeled as a random variable Xj which encodes the expression level of gene j. A conditional probability distribution P(Xj | Xi) models the strength of the interaction between a regulator i and a gene j. In MRTLE, these distributions are estimated using conditional Gaussian distributions. MRTLE incorporates two priors, one is the per-species prior that controls the sparsity of the network and favors a regulatory edge if there exists a species-specific motif instance of the regulator in the promoter region, and another one is the phylogenetic prior. The per-species prior probability is modeled using a logistic function:

$$ P\left({R}_{ij}=1\right)=\frac{1}{1+{e}^{-\left({\beta}_0+{\beta}_1\ast {m}_{ij}\right)}} $$

where Rij is a binary variable defining whether there is an edge between regulator i and target j (1 means present, 0 means absent). The β0 parameter is a sparsity prior that controls the penalty of adding of a new edge to the network, which takes a negative value (β0 < 0). A smaller value of β0 will result in a higher penalty on adding new edges. The β1 parameter controls how strongly motifs are incorporated as prior (β1 ≥ 0). A higher value of β1 will result in motif presence being valued more strongly to select an edge. β1 is set to 0 when there is no species-specific motif information available. mij represents the motif instance score if gene j is bound by a motif of regulator i in its promoter region.

The phylogenetic prior is modeled using a continuous-time Markov process, parameterized by the gain and loss probabilities for each branch on the tree defined as:

$$ P\left({R}_{ij}^Z\right|{R}_{ij}^A\Big)={e}^{-Q{t}_z} $$

\( {R}_{ij}^Z \) denotes the state of the edge between regulator i and target gene j in species Z. A is the ancestor of species Z, and \( {R}_{ij}^A \) denotes the state of the edge between regulator i and target gene j in ancestor A. tz denotes the branch length for species Z. The gain and loss probabilities for each branch on the tree depend upon the branch length (tz) and the rate matrix Q which specifies the rates at which we expect regulators to gain or lose targets per unit time. To estimate the rate matrix Q, expectation-maximization-based approaches can be applied, such as SiPhy developed by Garber et al. [11, 12].

3.2 Downloading and Installing MRTLE

Below are the Unix commands to download and install MRTLE in a standard command line terminal. There are three steps: (1) Download the source code from GitHub. (2) Enter the MRTLE code directory using cd. (3) Compile the code.

1) git clone https://github.com/Roy-lab/mrtle.git

2) cd mrtle/code/

3) make

3.3 Preparing Inputs for MRTLE

MRTLE needs several input files. Below we list these and describe its contents and how it can be created. All the files mentioned are available at https://github.com/Roy-lab/mrtle/tree/master/data.

3.3.1 Configuration File

This file specifies the location of the species-specific input data files and the location of the outputs. Each line corresponds to a species and should have the following tab-delimited species-specific entries: 1) species name, 2) filename of the expression data for each species, 3) the output directory, 4) filename of a list of regulators to be considered for a given target gene, 5) filename of a list of target genes to be used, 6) filename of prior network for each species. Here is an example of three lines in the configuration file (horizontal and vertical lines are just for clarity):

Scer

data/expression/Scer_geneexp.txt

out/Scer/

data/TFs/Scer_tf_mrna_binding.txt

data/genelists/Scer_allGenes.txt

data/motifs/Scer_motifs.txt

Cgla

data/expression/Cgla_geneexp.txt

out/Cgla/

data/TFs/Cgla_tf_mrna_binding.txt

data/genelists/Cgla_allGenes.txt

data/motifs/Cgla_motifs.txt

Scas

data/expression/Scas_geneexp.txt

out/Scas/

data/TFs/Scas_tf_mrna_binding.txt

data/genelists/Scas_allGenes.txt

data/motifs/Scas_motifs.txt

3.3.2 Expression Data File

In the file of the expression data for each species, each row represents one gene, and each column represents one condition. The first row is the header. Pre-processing of the expression data is described in Note 1. Here is an example of the expression data file for the S. cerevisiae species, Scer_geneexp.txt mentioned above.

Name

LAG

LL

DS

PS

PLAT

HS5

YAL016W

−2.57316

0.18674

0.351552

0.884193

0.461141

0.305003

YAL042W

−0.827777

−0.068987

−0.223257

−1.07268

−1.00647

0.0169562

YAL053W

0.981654

0.421282

0.0732861

−0.274487

0.344917

0.0213995

YBL030C

−0.956715

0.738636

1.99264

2.5738

2.57844

−0.077775

YKL112W

0.91353

−0.441429

−0.512388

−2.79075

−1.76492

0.264341

3.3.3 Species-Specific Regulators File

This file contains the list of the genes to be considered as potential regulators for each species. A regulator per line is specified. Example lines from the species-specific regulators file Scer_tf_mrna_binding.txt:

YKL112W YAL021C YBR130C YAL011W

3.3.4 Species-Specific Targets File

This file contains the list of the genes to be considered as potential targets for each species. Each line lists the name of the target gene. Example lines from the species-specific target genes file Scer_allGenes.txt:

YAL016W YAL042W YAL053W YBL030C

3.3.5 Motif-Based Prior Network File (Optional)

The motif-based prior network file has three tab-separated columns, listing the regulator, target gene, and motif score. Example lines from the prior network file Scer_motifs.txt:

YKL112W

YAL016W

0.868431

YKL112W

YAL042W

0.902258

YKL112W

YAL053W

0.836423

YKL112W

YBL030C

0.807526

3.3.6 Tree File

This file specifies the phylogenetic tree structure and can be obtained from public resources like the Ensembl Compara database (http://useast.ensembl.org/info/genome/compara/index.html). For the six yeasts species example, the species tree was obtained from Wapinski et al. [13]. To generate the species tree, multiple sequence alignments from orthogroups were concatenated [13], and the tree topology was learned using the Phylip package [14]. The PAML [15] package is another tool that can be used for the species-tree-inference task from gene orthology data. The tree file should have five columns: (1) the child species, (2) the relation of the child species to the parent species, whether it is a left or right child, (3) the parent species, (4) branch-specific gain probability pg, ranging from 0 to 1, (5) branch-specific loss probability pl, ranging from 0 to 1. In Note 2, we discuss how to determine branch-specific gain probability pg and branch-specific loss probability pl. Example lines from the file yeast_tree_rates.txt are as below:

Anc11

left

Anc14

0.0529786

0.882977

Spom

right

Anc14

0.0538444

0.897406

Anc9

left

Anc11

0.0465452

0.775753

Calb

right

Anc11

0.0543971

0.906619

Anc5

left

Anc9

0.022713

0.378551

Klac

right

Anc9

0.0515299

0.858832

Anc4

left

Anc5

0.0139051

0.231751

Scas

right

Anc5

0.0423569

0.705948

Scer

left

Anc4

0.0454312

0.757187

Cgla

right

Anc4

0.0471082

0.785137

3.3.7 Gene Orthology Mappings File (OGID File)

This file has the orthogroup information for genes. The gene trees can be obtained from public resources like the Ensembl Compara database (http://useast.ensembl.org/info/genome/compara/index.html), and the Inparanoid eukaryotic ortholog database (http://inparanoid.sbc.su.se/cgi-bin/index.cgi) [16]. For the six yeasts species example, the gene orthology mappings were created using the SYNERGY algorithm [17]. Each row represents an orthogroup of genes. The first column of this file is of the format OG{ID}_{DUP}. Each orthogroup is denoted by an ID, which takes a numeric value. For orthogroups with duplications, DUP denotes the ith duplication. If there are no duplications in the dataset, DUP is set to 1. The second column of this file is the gene name for each orthogroup from the respective species and separated by a comma across all species. The gene names should be in the same order as in the species list file described next (Subheading 3.3.8). If an orthogroup does not exist in one species, NONE is used as the gene name. Gene names must be unique across species. Example lines from the file OGid_members.txt are as follows:

Gene_OGID

NAME

OG16_1

YLL043W,CAGL0C03267g,Scas400.1,KLLA0E00550g,NONE,SPAC977.17

OG16_2

YLL043W,CAGL0C03267g,Scas400.1,KLLA0E00550g,NONE,SPAC977.17

OG23_1

NONE,NONE,NONE,NONE,NONE,SPAC1F8.07c

3.3.8 Species List File

This file contains the list of all the species with orthology relationships in the OGID file. The order should be the same as the species listed in the second column of the OGID file. Example lines from the file specorder_allclade.txt are as below:

Scer Cgla Scas Klac Calb Spom

3.3.9 Potential Regulator Orthogroups ID File

This file contains the list of IDs of the orthogroups to be considered as potential regulators. The ID is the same number as in the OGID file (Subheading 3.3.7). A regulator must also be present in the species-specific list of regulators given in the species-specific configuration file (Subheading 3.3.1). Example lines from the file TFs_OGs.txt are as below:

10261 1033 104 105

3.3.10 Potential Target Orthogroups ID File

This file specifies the list of IDs of the orthogroups to be considered as potential targets. This option can be used as a way to parallelize MRTLE on separate gene sets (see Note 4). The ID is same number as in the OGID file describing the orthology relationships (Subheading 3.3.7). A target must also be present in the species-specific list of targets given in the species-specific configuration file (Subheading 3.3.1). Example lines from the file AllGenes.txt are as below:

100 1000 1001 10013

3.4 Applying MRTLE to Multi-Species Data

MRTLE takes the following inputs: (1) expression data for each species, (2) a phylogenetic tree relating the species, (3) branch-specific probabilities for gain or loss of a regulatory edge, (4) gene orthology relationships specified in an OrthoGroup ID (OGID) file, (5) list of regulators in each species, (6) optional species-specific regulatory information, for example, from motif instances in gene promoters (Fig. 1a). Example inputs are available from the website https://github.com/Roy-lab/mrtle/tree/master/data. These are specified through various MRTLE arguments and file contents described in Subheading 3.3.

Fig. 1
figure 1

(a) Overview of the MRTLE framework. MRTLE takes the following inputs: (1) expression data for each species, (2) a phylogenetic tree relating the species, (3) branch-specific probabilities for gain or loss of a regulatory edge, (4) gene orthology relationships, and, (5) optional motif-based prior network for each species. The outputs of MRTLE are regulatory networks for each species. (b) The precision and recall curves of comparing the MRTLE-inferred network for S. cerevisiae to two gold standard networks from ChIP-chip [19] and regulators knockout [20] assays

The arguments of MRTLE are: (1) a configuration file, (2) maximum number of regulators to be considered for a given target gene, (3) the number of cross-validation, (4) the gene orthology relationship file, (5) a list of the orthogroups to be considered as potential regulators, (6) a list of orthogroups to be considered as target genes, (7) the phylogenetic tree, (8) a list of the species, (9) the sparsity parameter, and (10) the parameter that controls how strongly the motif networks are incorporated as prior. An example usage of running MRTLE using the example data is as below:

./mrtle -f data/speciesconf.txt -x300 -v1 -m data/OGid_members.txt -l data/TFs/TFs_OGs.txt -n data/genelists/AllGenes.txt -d data/yeast_tree_rates.txt -s data/specorder_allclade.txt -b -0.9 -q 4.0

Below we describe each of these arguments to the MRTLE program in detail.

  1. 1.

    -f takes the configuration file in which each row represents information of one species. See Subheading 3.3.1.

  2. 2.

    -x specifies the maximum number of regulators to be considered for a given target gene. This can be a large number, e.g., 200, to enable prediction of as many regulators as possible.

  3. 3.

    -v specifies the number of cross-validation, which can be set to 1 as well. If it is set to 1, the model is trained on the entire data.

  4. 4.

    -m specifies the name of the OGID file, describing the gene orthology relationships. See Subheading 3.3.7 for the description of how to make this file.

  5. 5.

    -l specifies the list of IDs of the orthogroups to be considered as potential regulators. The ID number is the ID in the OGID file (specified by -m). See Subheading 3.3.9 for details.

  6. 6.

    -n specifies the list of IDs of the orthogroups to be considered as potential targets. This option can be used as a way to run MRTLE in parallel on separate gene sets (see Note 4). The ID number is the ID in the file describing the orthology relationships (specified by -m). See Subheading 3.3.10 for details.

  7. 7.

    -d takes the file of the phylogenetic tree. This file provides the phylogenetic relationship of the species, and branch-specific gain and loss probabilities. See Subheading 3.3.6 for how to prepare the tree file.

  8. 8.

    -s specifies the list of all the species present in the file describing the orthology relationships (specified by -m). See Subheading 3.3.8 for how to prepare the species list file.

  9. 9.

    -b specifies the β0 parameter which controls the sparsity of the network, a penalty for adding new edges. In Note 3, we discuss how to choose the β0 parameter.

  10. 10.

    -q specifies the β1 parameter which controls how strongly motif instances are incorporated as prior. A higher value will result in motif network being valued more strongly. See Note 3 about how to choose the β1 parameter.

3.5 MRTLE Outputs and their Interpretation

Assuming MRTLE is run according to the usage provided in Subheading 3.4, the MRTLE outputs will be available in the out/ directory where the command is executed. These results include (1) inferred regulatory network for each species within its own species-specific folder: var_mb_pw_k299.txt, (2) model parameters: modelparams.txt, (3) the data likelihood of the MRTLE model: scoreFile.txt. Most of the interpretation is done on the network file.

In the regulatory network file, the first column denotes the regulator, the second column denotes the target gene, and the third column denotes the regression edge weight capturing the strength of the interaction between the regulator and the target gene (see Subheading 3.1). Example lines from the regulatory network file var_mb_pw_k299.txt are given below:

YDR443C

Q0045

−0.0392211

YPL128C

Q0045

0.16192

In order to get reliable results, we typically run MRTLE algorithm in a stability selection framework. We randomly subsample the expression data for each species 25 times keeping 2/3 of the samples each time, and subsequently apply MRTLE algorithm on each subsample to infer a regulatory network. More subsamples could be beneficial if there are more measurements in the expression data. A confidence score is then calculated for each edge as the fraction of how frequently that edge is present in the inferred networks across all subsamples.

MRTLE outputs can be interpreted using several downstream analyses:

  1. 1.

    Compare MRTLE inferred regulatory networks to gold standard networks, such as transcription factor knockout network and ChIP-chip-based regulatory network (Fig. 1b).

  2. 2.

    Identify target genes of some TFs of interest, visualize the regulatory subnetworks in each species, and compare the dynamics of targets (Fig. 2).

  3. 3.

    Compare MRTLE inferred regulatory networks across species to get patterns of conservation and divergence across a phylogenetic tree (Fig. 3a).

  4. 4.

    Visualize the inferred regulatory networks in each species to identify important hubs (Fig. 3b).

  5. 5.

    Identify transcriptional modules using algorithms such as Arboretum [18] and combine MRTLE inferred regulatory networks with gene modules to get context-specific networks associated with different gene modules.

  6. 6.

    Assess the role of a regulator as an activating or repressive factor by computing the significance of overlap between the targets of this regulator in MRTLE inferred networks and the targets in activating module versus repressive module based on hypergeometric distribution. We demonstrate the first four analyses as the examples in this chapter.

Fig. 2
figure 2

(a) The Area Under the Precision Recall Curve (AUPR) comparing the MRTLE-inferred targets of MCM1 to the ChIP-chip MCM1 targets (left). The F-score comparing the top 500, 1000 predicted MCM1 targets to the ChIP-chip MCM1 targets (middle). The fold enrichment of MCM1 ChIP-chip gene targets in the MRTLE-inferred MCM1 targets among the top 30k edges (right). (b) The predicted targets of MCM1 in S. cerevisiae, K. lactis, and C. albicans. The top 100 target genes of MCM1 are shown and nodes that are also ChIP-chip hits are labeled with common names. If a common name does not exist, it is replaced with the gene name of S. cerevisiae. For genes with more than one orthologs, the one with the maximum confidence score among all orthologs is shown. (c) The patterns of how MRTLE-inferred MCM1 targets change across species. Shown are the F-score similarities between the top predicted MCM1 targets for every two species

Fig. 3
figure 3

(a) The patterns of conservation and divergence of MRTLE inferred regulatory networks across a phylogenetic tree. Shown are the F-score similarities for all pairs of species’ networks. (b) The MRTLE inferred networks for S. cerevisiae and C. glabrata. Shown are the top 1k edges in each species. The size of the node is proportional to the number of targets. Gene names of top 10 hubs are shown in the network

First, we evaluate how well MRTLE inferred networks recover edges from gold standard networks, derived from large-scale experiments in S. cerevisiae as well as smaller regulator specific datasets in other species. We constructed a ChIP-chip-based regulatory network using datasets from MacIsaac et al. [19] and a knockout network obtained by deleting regulators and assessing the effects on downstream expression levels from Hu et al. [20]. We compared the MRTLE-inferred network for S. cerevisiae to these two gold standard networks, computing precision and recall for each confidence threshold from 0 to 1. The precision and recall curves shown in Fig. 1b visualize the area under the precision–recall curve (AUPR) for comparison to the MacIsaac et al. (0.0607) and Hu et al. (0.0206) networks [19, 20]. The greater the AUPR, the better the inferred networks recover the experimental validated edges. Here the AUPR is not high and is understood to be dominated by the small size of the two gold standard networks available for comparison. In addition, we compared the target genes inferred for specific regulators with ChIP-chip data in species other than S. cerevisiae. We obtained binding gene targets for a set of TFs measured with ChIP-chip experiments by Tuch et al. in S. cerevisiae, K. lactis, and C. albicans [21], and tested how well the MRTLE-inferred networks recapitulate these experimentally derived target sets. We compared how many targets are recovered in each species using AUPR, F-score, and fold enrichment metrics. Among 227 targets of MCM1 in S. cerevisiae, 99 target genes are recovered in S. cerevisiae MRTLE inferred network considering all edges (AUPR: 0.148, Fig. 2a). Among 562 targets of MCM1 in K. lactis, 183 target genes are found in K. lactis MRTLE inferred network (AUPR: 0.187, Fig. 2a). 225 target genes out of 680 targets of MCM1 in C. albicans are recovered in MRTLE inferred network (AUPR: 0.153, Fig. 2a). Among all predicted MCM1 targets, we picked top 500, 1000 MCM1 targets with highest confidence and obtained an F-score ranging from 0.130 to 0.196 compared to ChIP-chip MCM1 targets (Fig. 2a). Moreover, we selected the top 30k edges inferred in each species ranking by the confidence score and calculated the fold enrichment of the gene targets of MCM1 from ChIP-chip data in the predicted MCM1 targets (Fig. 2a). We found that the predicted targets were enriched for MCM1 targets from ChIP-chip data in all three species.

MRTLE inferred networks can be compared to assess the global pattern of conservation and divergence based on the similarity of the networks of each pair of species (Fig. 3a). We picked the top 30k edges based on confidence in each species, and computed the F-score of similarity for each pair of networks to quantify the variation in the networks across species. We found that the inferred networks exhibit patterns of conservation and divergence consistent with the phylogenetic tree. For example, S. cerevisiae is most similar to C. glabrata (F-score 0.578), its similarity to S. castellii (F-score 0.536), K. lactis (F-score 0.480), and C. albicans (F-score 0.279) decreases and is least similar to S. pombe (F-score 0.188). Similarly, the network similarity of C. glabrata with S. castellii, K. lactis, C. albicans, and S. pombe decreases along the phylogenetic tree. This indicates that the networks inferred by MRTLE obey a phylogenetic trend.

MRTLE networks can also be compared by visualizing the subnetworks of known and important regulators across species (e.g., in our study, regulators involved in stress response are useful to compare across species). As an example, we visualize the MRTLE subnetworks for targets genes of the transcription factor MCM1 in S. cerevisiae, K. lactis, and C. albicans (Fig. 2b). The top 100 target genes of MCM1 are shown and the nodes supported by the MCM1 targets from ChIP-chip experiments are named. To compare how the targets change across species, we picked top 1000 target genes from each inferred network and computed the F-score similarity. We see that target genes in S. cerevisiae and K. lactis are more conserved (F-score 0.405), compared to S. cerevisiae and C. albicans (F-score 0.362), indicating the phylogenetic trend is also recapitulated at a local factor-specific level (Fig. 2c). Additionally, we visualize the MRTLE networks of top 1k edges in S. cerevisiae and C. glabrata (Fig. 3b) and identify top important hubs such as MSN2, MSN4, STP1, KRE33, and BAS1 which are conserved in both the species. MSN2 and MSN4 have been found to be general stress response regulators and are key transcriptional regulators of yeast glycolysis by regulating the expression of glycolytic enzymes [22]. STP1 plays a key role in pre-tRNA processing in yeast, regulating the expression of genes encoding amino-acid permeases and controlling the uptake of several amino acids [23].

4 Notes

  1. 1.

    Pre-processing of Gene Expression Data.

    The input expression data for MRTLE can be Microarray data or RNA-sequencing (RNA-seq) data. MRTLE model assumes that the expression values are from a multivariate Gaussian distribution, so we typically perform some normalization and transformation on the expression data. For microarray data, the values are log ratio of gene expression level from an experimental sample and gene expression level in the reference sample, which can be used as input expression data without further transformation. The transcriptomic data in the six yeast species example are microarray data. For RNA-seq data, we recommend normalizing the data across all samples for each species by applying a log transformation and doing quantile normalization.

  2. 2.

    Estimating the Parameters: Branch-Specific Gain and Loss Probabilities.

    Setting the appropriate prior probability of gaining or losing an edge for each branch along the phylogenetic tree may need some tunings and knowledge about the phylogeny. For our example data with six yeasts species, the rate matrix Q was estimated using the average rate of gaining or losing a motif binding site as in the Habib et al. paper [8]. The branch-specific gain probability \( {p}_g=P\left({R}_{ij}^Z=1\right|{R}_{ij}^A=0\Big) \) and branch-specific loss probability \( {p}_l=P\left({R}_{ij}^Z=0\right|{R}_{ij}^A=1\Big) \) were then calculated based on this rate matrix and the branch length for each branch in the phylogenetic tree.

  3. 3.

    Setting the Parameters: β0 and β1.

    The β0 parameter is a sparsity prior that controls the penalty of adding of a new edge to the network, which takes a negative value (β0 < 0). We suggest trying β0 for values between −5 and −0.1. A smaller (more negative) value will add a greater cost to edge addition and end up with a sparser network. The β1 parameter controls how strongly motifs are incorporated as prior (β1 ≥ 0). We suggest setting β1 to a value between 0 and 5. A higher value will increase the weight of the motif prior network. MRTLE can be run with different configurations of β0 and β1, and one can choose the configuration that gives the highest data likelihood score or recapitulates the known phylogenetic trends.

  4. 4.

    Running MRTLE in Parallel for Multiple Gene Subsets.

    To make MRTLE run faster, we can split the target gene set into subsets, e.g., of 50 genes and run MRTLE in parallel on each set. This can be implemented by splitting the potential target orthogroups into non-overlapping sets of 50 orthogroups per set and storing the corresponding OGID numbers into AllGenes0.txt, AllGenes1.txt, etc. (Subheading 3.3.10). Next, MRTLE can be run in parallel using each of the potential target orthogroup ID file (AllGenes${i}.txt) while keeping the rest of the arguments the same. After all of the parallel runs are finished, we merge the inferred networks of each 50 target genes together as the final inferred network.