Selection of Optimal Cell Lines for High-Content Phenotypic Screening

High-content microscopy offers a scalable approach to screen against multiple targets in a single pass. Prior work has focused on methods to select “optimal” cellular readouts in microscopy screens. However, methods to select optimal cell line models have garnered much less attention. Here, we provide a roadmap for how to select the cell line or lines that are best suited to identify bioactive compounds and their mechanism of action (MOA). We test our approach on compounds targeting cancer-relevant pathways, ranking cell lines in two tasks: detecting compound activity (“phenoactivity”) and grouping compounds with similar MOA by similar phenotype (“phenosimilarity”). Evaluating six cell lines across 3214 well-annotated compounds, we show that optimal cell line selection depends on both the task of interest (e.g., detecting phenoactivity vs inferring phenosimilarity) and distribution of MOAs within the compound library. Given a task of interest and a set of compounds, we provide a systematic framework for choosing optimal cell line(s). Our framework can be used to reduce the number of cell lines required to identify hits within a compound library and help accelerate the pace of early drug discovery.


Microscopy
Plates were imaged in confocal mode on the Operetta CLS high-content imaging system (Perkin Elmer) using a 20× water immersion objective (NA 1.0; effective resolution 0.66 µm). Nine fields of view were captured per well using a 4.7 MP 16-bit sCMOS sensor (6.5 µm pixel size).

Feature extraction
Harmony software (version 4.9, Perkin Elmer) was used to segment individual cells and extract features. First, images were flatfield-corrected and background-subtracted. Next, individual nuclei were segmented using the Hoechst channel, and the cytoplasm was segmented using the Alexa 568 channel (WGA, Phalloidin). Cells touching image borders were filtered out. We then calculated groups of features encompassing morphology, intensity, and texture for each channel, totaling 77 features. For a full list of features, see Supporting Information Table 2.

Phenotypic profiles
Phenotypic profiles transform the measured features of single cells within a treated well into a population-level measure of the deviation from the negative controls contained in the same plate, using a non-parametric signed Kolmogorov-Smirnov (KS) statistic. 2
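As an illustration of how a signed KS statistic can summarize one feature, the following minimal sketch compares the single-cell values of a treated well with those of the negative controls. The function names and the sign convention (positive when treated cells are shifted toward higher values) are assumptions made for this example, not details taken from the profiling method cited above.

```python
from bisect import bisect_right


def signed_ks(treated, control):
    """Signed two-sample KS statistic for a single feature.

    Returns the largest-magnitude difference between the control ECDF and the
    treated ECDF, keeping its sign. Assumed convention: the value is positive
    when treated cells are shifted toward higher feature values.
    """
    t_sorted = sorted(treated)
    c_sorted = sorted(control)
    best = 0.0
    # The ECDF difference can only change at observed values, so it is
    # enough to evaluate it on the pooled set of observations.
    for x in sorted(set(t_sorted) | set(c_sorted)):
        f_t = bisect_right(t_sorted, x) / len(t_sorted)
        f_c = bisect_right(c_sorted, x) / len(c_sorted)
        diff = f_c - f_t
        if abs(diff) > abs(best):
            best = diff
    return best


def phenotypic_profile(treated_cells, control_cells):
    """Build a profile with one signed KS value per feature.

    Both arguments map a feature name to the list of single-cell values
    measured in the treated well and in the same-plate controls.
    """
    return {
        name: signed_ks(treated_cells[name], control_cells[name])
        for name in treated_cells
    }
```

Each profile entry lies in [-1, 1], so features measured on very different scales become directly comparable.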

Plate position normalization
Preliminary data analysis showed positional effects for a subset of plates in our dataset. That is, certain rows or columns of a plate showed systematically higher or lower feature levels among DMSO controls. These effects were also observable in PCA projections of DMSO controls (Supporting Information Fig. S3). To address these effects, we regressed phenotypic profiles against row and column IDs within each plate and subtracted the predicted values, effectively removing the portion of a phenotypic profile that could be explained by plate position (Supporting Information Fig. S1-S3). Following this regression, each feature was centered at the median DMSO value within each plate.
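The normalization can be sketched for one feature on one plate as follows. This is a simplified illustration under stated assumptions: row and column IDs are treated as categorical effects, the fit uses every well on a full rectangular plate (in which case the additive least-squares prediction reduces to row mean + column mean − plate mean), and the function and argument names are hypothetical.

```python
from statistics import mean, median


def remove_position_effects(values, is_dmso):
    """Remove additive row/column effects, then center on the DMSO median.

    values:  dict mapping (row, col) -> feature value for one plate.
    is_dmso: dict mapping (row, col) -> True for DMSO control wells.
    Assumes every (row, col) combination is present on the plate.
    """
    rows = sorted({r for r, _ in values})
    cols = sorted({c for _, c in values})
    plate_mean = mean(list(values.values()))
    row_mean = {r: mean([values[(r, c)] for c in cols]) for r in rows}
    col_mean = {c: mean([values[(r, c)] for r in rows]) for c in cols}

    # Residual after subtracting the value predicted from plate position.
    residual = {
        (r, c): v - (row_mean[r] + col_mean[c] - plate_mean)
        for (r, c), v in values.items()
    }

    # Center the feature at the median of the DMSO wells on this plate.
    dmso_median = median([residual[w] for w in residual if is_dmso[w]])
    return {w: v - dmso_median for w, v in residual.items()}
```

Because this sketch fits the position model on all wells, a strongly active well pulls its own row and column means toward itself; fitting on control wells only is an alternative design that avoids this at the cost of fewer data points per row and column.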

Correlation feature-weighting
Our distance-based analyses assessed the similarity of samples collectively across a range of phenotypic features. Highly correlated features capture similar information, and Euclidean distance weights that information more heavily as more features pick it up. To address this bias, we re-weighted l2-normalized features within each cell line based on the sum of their absolute correlations. Here $C_{j,j'}$ denotes the Pearson correlation between features $j$ and $j'$, and $X_{\cdot j}$ denotes feature vector $j$ for all samples from a given cell line after plate position normalization.
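One way to realize this kind of re-weighting is sketched below. The specific form used here, dividing each l2-normalized feature by the sum of its absolute Pearson correlations with all features (itself included), is an assumption chosen for illustration and may differ from the exact weighting used in this study. The function names are hypothetical, and features are assumed to be non-constant.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_weight(columns):
    """Re-weight feature columns so redundant features count less.

    columns: list of feature vectors (one list of sample values per feature)
    for a single cell line. Each feature is l2-normalized and then divided by
    the sum of its absolute correlations with every feature (assumed form).
    """
    weighted = []
    for col in columns:
        l2_norm = sqrt(sum(v * v for v in col))
        total_abs_corr = sum(abs(pearson(col, other)) for other in columns)
        weighted.append([v / l2_norm / total_abs_corr for v in col])
    return weighted
```

Under this assumed form, a feature that is uncorrelated with all others keeps unit norm, while a feature duplicated twice has each copy scaled to norm 1/2, so redundant measurements contribute less to Euclidean distances.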

Replicates and multiple dose levels
When the experimental data contained multiple doses of a compound, only the highest dose was used. For each compound, we averaged the position-normalized, correlation-weighted KS scores across replicates at that highest dose.
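The aggregation rule can be sketched as follows. The input layout (a list of `(compound, dose, profile)` tuples) and the function name are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def aggregate_highest_dose(wells):
    """Average replicate profiles at each compound's highest dose.

    wells: list of (compound, dose, profile) tuples, where profile is a list
    of per-feature scores of equal length for every well.
    Returns a dict mapping compound -> averaged profile.
    """
    # Find the highest dose tested for each compound.
    max_dose = {}
    for compound, dose, _ in wells:
        max_dose[compound] = max(dose, max_dose.get(compound, dose))

    # Keep only the replicates measured at that highest dose.
    replicates = defaultdict(list)
    for compound, dose, profile in wells:
        if dose == max_dose[compound]:
            replicates[compound].append(profile)

    # Average feature-wise across the retained replicates.
    return {
        compound: [sum(vals) / len(vals) for vals in zip(*profiles)]
        for compound, profiles in replicates.items()
    }
```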

Quantitative definitions of phenoactivity and phenosimilarity
For simplicity, the following notations are defined relative to a single cell line unless explicitly stated otherwise. Our full dataset can be viewed as K = 6 (cell lines) instantiations of the data described below, with the same compound library screened across each cell line. All distances described throughout this section are evaluated between phenotypic profiles as described above.

Let $X \in \mathbb{R}^{n \times p}$ denote phenotypic profiling data for a given cell line, with rows $X_i \in \mathbb{R}^p$ being the phenotypic profile for sample $i$. Samples $i = 1, \ldots, n$ represent different compounds from a library of interest (i.e., a single replicate of a compound perturbation, which is the same as a single well on an imaging plate). Each sample is associated with a unique compound. Our goal is to select the optimal cell line(s) $S^* \subset \{1, \ldots, K\}$ relative to an analytical criterion of interest. Toward this end, we consider the distance between compounds in phenotypic space, with $d_{ij} := d(X_i, X_j)$ denoting the Euclidean distance between samples $i$ and $j$. The approach described below can also be used with another distance metric of interest in place of Euclidean distance. Our criteria compare the distribution of distances from a query population of interest (e.g., compounds with the same MOA) to that of a reference population (e.g., DMSO controls). Comparing different query and reference populations allows us to address different questions. In the following sections, we show how optimal cell line selection for both phenoactivity and phenosimilarity can be framed within this context.

Phenoactivity
Let $\bar{X}_{I_C}$ denote the median (i.e., centroid; computed feature-wise) phenotypic profile of the DMSO samples. For a population of samples indexed by $I \subset \{1, \ldots, n\}$, consider their distances to the DMSO centroid, and denote the empirical cumulative distribution function (ECDF) of these distances by $\hat{F}_I$. We define the phenoactivity of an MOA $m$ by comparing the query population $I = I_m$ with the reference population $I = I_C$:

$$PA_m := h\left(\hat{F}_{I_m}, \hat{F}_{I_C}\right),$$

where $h$ is a function measuring the deviation between ECDFs. In this study, we set $h$ to be a signed variant of the earth mover's distance, which measures the difference between ECDFs.

We cast cell line selection as an optimization problem,

$$k^* = \arg\max_{k \in \{1, \ldots, K\}} \frac{\sum_m w_m \, PA_m(k)}{\sum_m w_m},$$

where $PA_m(k)$ is the phenoactivity score for MOA $m$ in cell line $k$ and $w_m \in [0, 1]$ denotes a user-provided weight associated with MOA $m$. In other words, our optimization criterion is simply a weighted average of phenoactivity scores across MOAs. Weights allow us to prioritize compound classes of interest. For instance, setting all weights equal will select a "generalist" cell line that performs well across all MOAs. In contrast, setting higher weights for particular MOAs will select cell lines that are specialists within those classes.

Figure S9: Phenotypic profiles (a) and representative images (b) for proteasome inhibitors. Features are ordered by average (across compounds and cell lines) signed KS value in the proteasome inhibitor category. Images show vehicle (DMSO 0.1%) and compound-treated cells (A540 or FB) after 48 hours of exposure. Scale bar represents 100 µm. Indicators i-iv highlight features referenced in the text.
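The phenoactivity score and the weighted cell line selection described in this section can be sketched as follows. The signed earth mover's distance is implemented here as the signed area between the reference and query ECDFs, which is one possible signed variant and an assumption of this example. All function names and the cell line and MOA labels in the usage note are hypothetical.

```python
from bisect import bisect_right
from math import dist
from statistics import median


def centroid(profiles):
    """Feature-wise median of a list of phenotypic profiles."""
    return [median(column) for column in zip(*profiles)]


def signed_emd(query, reference):
    """Signed area between the reference ECDF and the query ECDF.

    Positive when query values tend to be larger than reference values
    (assumed sign convention).
    """
    q, r = sorted(query), sorted(reference)
    grid = sorted(set(q) | set(r))
    area = 0.0
    # Both ECDFs are constant between consecutive pooled observations.
    for left, right in zip(grid, grid[1:]):
        f_q = bisect_right(q, left) / len(q)
        f_r = bisect_right(r, left) / len(r)
        area += (f_r - f_q) * (right - left)
    return area


def phenoactivity(moa_profiles, dmso_profiles):
    """Compare distances to the DMSO centroid for an MOA versus DMSO wells."""
    center = centroid(dmso_profiles)
    d_query = [dist(x, center) for x in moa_profiles]
    d_reference = [dist(x, center) for x in dmso_profiles]
    return signed_emd(d_query, d_reference)


def select_cell_line(scores, weights):
    """Pick the cell line with the highest weighted-average phenoactivity.

    scores:  dict cell line -> dict MOA -> phenoactivity score.
    weights: dict MOA -> user-provided weight in [0, 1] (not all zero).
    """
    total = sum(weights.values())

    def weighted_average(cell_line):
        return sum(w * scores[cell_line][m] for m, w in weights.items()) / total

    return max(scores, key=weighted_average)
```

For example, with `scores = {"line_a": {"moa_1": 0.9, "moa_2": 0.1}, "line_b": {"moa_1": 0.4, "moa_2": 0.5}}`, equal weights favor `line_a` as the generalist, whereas placing all weight on `moa_2` selects `line_b` as the specialist for that class.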