NEMix: Single-cell Nested Effects Models for Probabilistic Pathway Stimulation

Nested effects models have been used successfully for learning subcellular networks from high-dimensional perturbation effects that result from RNA interference (RNAi) experiments. Here, we further develop the basic nested effects model using high-content single-cell imaging data from RNAi screens of cultured cells infected with human rhinovirus. RNAi screens with single-cell readouts are becoming increasingly common, and they often reveal high cell-to-cell variation. As a consequence of this cellular heterogeneity, knock-downs result in variable effects among cells and lead to weak average phenotypes on the cell population level. To address this confounding factor in network inference, we explicitly model the stimulation status of a signaling pathway in individual cells. We extend the framework of nested effects models to probabilistic combinatorial knock-downs and propose NEMix, a nested effects mixture model that accounts for unobserved pathway activation. We analyzed the identifiability of NEMix and developed a parameter inference scheme based on the Expectation Maximization algorithm. In an extensive simulation study, we show that NEMix improves learning of pathway structures over classical NEMs significantly in the presence of hidden pathway stimulation. We applied our model to single-cell imaging data from RNAi screens monitoring human rhinovirus infection, where limited infection efficiency of the assay results in uncertain pathway stimulation. Using a subset of genes with known interactions, we show that the inferred NEMix network has high accuracy and outperforms the classical nested effects model without hidden pathway activity. NEMix is implemented as part of the R/Bioconductor package ‘nem’ and available at www.cbg.ethz.ch/software/NEMix.


Supporting Information
Estimating the hidden signal For a given structure Φ and fixed parameters θ, we obtain the likelihood of the observed data by marginalizing over the hidden variable Z, P (d ekc | Φ, θ e = s, Z kc = j). (1) Apart from the now fixed feature connections θ, this is equivalent to the likelihood (6), where all the summations over the possible states of Z are taken out of the products over all the observations. Maximizing this equation is computationally difficult and we therefore develop an EM algorithm [3] using the complete data likelihood Let Z (j) kc be the indicator variable that Z kc is in state j for experiment k in cell c.
Then we can rewrite (2) as P (D, Z | Φ, θ, p) = As introduced in the first section, the connections of observed features to the genes (θ) are often integrated out. However, apart from being a time consuming step, the introduction of the sum over all signaling genes into the complete data likelihood (2) makes this function hard to optimize Another possibility is to learn the probability of each edge θ e = s jointly with p using an EM algorithm. In the E step, we calculate the expected complete data log likelihood w.r.t. the posterior of the hidden variable for the current parameter setting denoted with p old , θ old . We obtain where for each cell c in experiment k γ (j) kc is the responsibility of pathway (in-)activation for observation d ekc , In fact, this expected value has to be calculated only for cells that were not infected. For infected cells, Z kc = 1 is observed and the E step is not needed.
In the M step Q(p, θ; p old , θ old ) is first maximized w.r.t. p j . This leads tô where N = Since the θ e are independent and discrete, maximization of Q w.r.t. θ is achieved by iteratively selecting the signaling gene s that maximizeŝ kc log (P (d ekc | Φ, θ e = s, Z kc = j)) . (10)

Maximum likelihood estimates for p j and θ
Given the expected hidden log-likelihood in the M step, Q(p, θ; p old , θ old ) is maximized w.r.t. each p j . This involves only the left term in (11) and the constraint p 0 + p 1 = 1 needs to be obeyed Solving for p j leads to multiplying by λ and taking the sum over j on both sides results in for the Lagrange multiplier λ. Defining N = K k=1 c k c=1 1 as the total number of observations (summing the cells of all experiments), the final solution for p j then is Since the θ e are independent and discrete, to maximize Q w.r.t. θ, we need to solve argmax θ1,...,θm∈n m This equation is maximized by iteratively finding for each feature the gene that maximizes the above term.

Unidentifiable parameters
Proposition 1. For a given NEMix model, the pathway activation probability p of its hidden pathway activity Z is not identifiable, if and only if either of the following holds 1. There are no observables from E attached downstream of Z.
Z is connected only to sub-components of a network that are always perturbed.
Proof. It needs to be shown that the likelihood of a model is independent of p,if and only if one of the two cases is true. Let S be the set of genes and partition it into subsets S d which are always perturbed and the rest S u . Then, S d contains nodes downstream of all other nodes in S, and if |S d | > 1 then S d is a fully connected component. Let furthermore K be the set of perturbation experiments and E = {E u , E d } the observed features attached to the according group of genes.
⇒: Assuming the case that Z is only connected to S d for a given NEMix with signal graph Φ and feature connections θ, the likelihood function can be partitioned in the following way: In the first part of the product, the local effect probabilities for all e ∈ E u are independent of Z since there are no edges connecting Z to features in E u . In the second part of the product, we expand the local effect likelihoods by the marginal probability of the states of the node, which the feature is attached to (as introduced in equation 7 in the paper). Since S sk ∈ S d , we have P (S sk = 0) = 1 for all k ∈ K. Together with the constraint p0 + p1 = 1, it follows which is independent of p. Given the case that there are no observables downstream of Z, all e ∈ E u and the above equation reduces to only the first product. ⇐: Assuming there is a connection from Z to e i ∈ E u , we expand equation 19 using the state variables For the gene S s , connecting Z to the feature e i , there must be at least one experiment k for which P (S sk = 0) = 1, because otherwise S s ∈ S d . Since Z ∈ pa(S s ), P (d eik | θ eis , S sk ) is depending on Z. p would only be unidentifiable if P (d eik | θ eis , S sk = 1) is equal to P (d eik | θ eis , S sk = 0), which is generally not the case.
Even if p 0 is identifiable, there exist some cases where p 0 = P (Z = 0) can only poorly be estimated. When Z is connected to genes that have only few features attached, and the true p 0 is high, there is not enough evidence in the data to recover the true states of Z. Also, if most features are downstream of almost always silenced genes, states of Z are badly recovered.

Modeling the effect likelihoods
The set of local effect likelihoods used for scoring the NEMix structures, consists of a set of matrices, one per knock-down k. Each matrix has one column per per observed cell with one entry for each measured feature, denoted with D k . These local effect likelihoods can be pre-computed from the data.
To estimate the local effect likelihoods we used log odds scores [7]. As mentioned in the simulation study, no controls are available for the experimental data. For each measured image feature, we obtained a negative control distribution, F 0 , by sampling from all cells of the plate containing the knock-down well, under the assumption that the majority of cells over a plate will not show any effects. For the effect distribution, F E , of a feature, the values of all cells within a knock-down experiment were taken. The control and effect distributions were estimated with the 'kernel cumulative function estimate' from the R CRAN package 'ks' [8]. Then, for each cell c in experiment k and each feature e, we calculate the log odds scores If a knock-down has a strong effect on a feature, then R ekc will be much larger than zero. If there is no overall effect, control and effect distribution will be very similar and R ekc ≈ 0.

Structure learning
Structure inference to find an optimal network for transitively closed networks is performed using a greedy heuristic as described in [1]. In this approach, edges are incrementally added until the likelihood score cannot be increased anymore. An algorithm already implemented in the R package nem is used for NEMix inference, with slight adjustments. The procedure is initiated with a given network structure. Then, all networks with one additional edge are built and their transitive closure is computed. Finally, the structure with the highest likelihood score is selected and the procedure is repeated. The algorithm terminates when no edge insertion can increase the likelihood anymore. Edge deletion and reversal steps are not performed since these may result in structures that are not transitively closed.
In addition to the existing procedure, our approach is restricted to structures without incoming edges into the hidden root vertex Z. We furthermore initialize the algorithm with a set of initial networks. Each of these networks consists of the empty signal graph and one of all possible combinations of connecting Z to the knock-down genes. This approach is only feasible for a small number of nodes, because it results in 2 n different starting networks. For larger networks, we only initialize with Z attached to each of the genes individually. Additionally, the out-degree of Z can be limited, where out-degree refers here only to the non-transitive edges from Z to any of the other nodes.
In our examples, we limited the out-degree of Z to two edges. This regularization reduces the search space and prevents that too many dependencies between genes are explained by Z alone.

Dataset preparation and description
Data of the HRV infection example stem from a genome wide RNAi HRV infection screen (unpublished data). We extracted the data for the 8 genes used in this study. It is available in the supplementary data set. Genes are named by their Entrez ID. The genome wide RNAi screen followed the protocol in [2], which is summarized in the following. HeLa cells were infected with HRV to study their entry mechanism and to identify host cell genes involved in HRV infection. The screen was performed using a genome wide pooled siRNA library (Dharmacon Smart Pool). On 384-well plates, approximately 1000 cells per well were transfected with siRNAs and after transfection cells were infected with HRV for 7 hours. Afterwards, an assay of automated microscopy and automated image analysis was used, taking nine images per well. In the image analysis cells were identified by the cell nuclei (measured by DAPI staining of the DNA). Viruses were identified using intensity of GFP expressed by the viruses, and the Actin skeleton of a cell was stained with Cy5. Images were analyzed using the software CellProfiler [5]. The resulting data sets consisted of 360 cell features from 9 images per knock-down experiment. They are arranged in three rows and three columns. To avoid too many out-of-focus cells, we used only the middle image, as it showed to have highest quality. In this way we avoided too many out-of-focus cells. After filtering we each knock-down had values between 200 to 300. The screen lacked reliable controls and therefore for each gene, we generated a negative control distribution as a random sample of cells, distributed over the whole plate. This is done under the assumption that the majority of cells on a plate will not show an effect in a certain feature. For the NEM model knock-down populations were compared feature wise to the control distributions, using a Wilcoxon test. From the resulting p-values effect strength was estimated using a Beta-Uniform-Mixture model [4]. For the NEMix and sc-NEM model we used log likelihood ratios for the individual cells as described in the main text.
siRNA filtering for off-targets siRNAs are known to be very prone to off-target effects as they tend to mimic microRNAs and reproduce their effects [6]. To reduce this confounding source as much as possible, we ranked the genes according to their siRNAs amount of off-target effects. The ranking was conducted using the microRNA target prediction matrix from the tool 'Target Scan' [9]. For each siRNA, it provides weights for every gene, indicating how much the siRNA contributes to the knock-down of the gene. The weights are calculated as the percentage of transcript repression that the siRNA induces for a gene. Since the screens were performed with four pooled siRNAs, we averaged over the individual siRNA vectors. From the resulting vectors of weights we calculated an L 2 siRNA score for each gene as where n is the total number of genes that can be targeted and w i the transcript repression percentage for gene i. In this way, strong off-targets are penalized by receiving a high score.

NEMix implementation in NEM package
The NEMix procedure is implemented in the existing R Bioconductor package NEM. The main function to produce a NEM takes a data matrix, a set of options and an inference type. NEMix is invoked by setting the inference type type='NEMix'. Additionally, several other options can be given. First, the input dataset D must be loaded. It needs to be a matrix with the observables as rows and each experiment (i.e., each 'cell') is a column. The 5 gene HRV infection dataset is already contained in the package.
HRV_MAPK_RNAi.llr contains the single cell log odds ratios used for the NEMix inference. In the following the NEMix inference is lined out on the HRV data.

D<-HRV_MAPK_RNAi.llr
Now the experiment names and gene names are set. In the following example, column names are the names of the knock-down genes followed by the index of the cell (e.g.  To run the 'NEMix' version, the correct inference type and a name for the hidden signal needs to be set.
Since there is more than one observation per knock-down, and there could be several knock-downs per cell, we need to map knock-downs to experiments.

control$map <-as.list(knockdowns)
Additionally, the following parameters, concerning EM-restarts, a prior for p 0 , known states for Z and maximal out-degree for Z can be given.
NEMix <-nem(D=D, inference='NEM.greedy',control=control,verbose=FALSE) The resulting list contains the results for the used initial attachments of Z. To get the optimal network on needs to look at the likelihood. Hidden-root networks can be generated exhaustively up to 4 nodes using models <-enumerate.hidden.models(Sgenes) For simulation purposes, we additionally implemented a function to attach a random hidden signal to a network with a desired out-degree.