Iterative data multiplexing (IDM) supports elucidation of drug targets from functional genomics screening approaches

The value of conducting high throughput functional genomic screening campaigns has been the subject of some debate over recent years, where poor reproducibility, and the masking of genuine genes of interest beneath a blanket of non-specific off-target effects, have undermined confidence in the technology. High hopes for RNAi-based screening technologies to illuminate a suite of drug targets within different biological scenarios have arguably fallen short of our somewhat unrealistic expectations. But in the age of ‘Big Data’, where many of us are battling with large and complex datasets, it transpires that where RNAi screening may struggle to work independently, it thrives as a member of a larger team.

Initial excitement over a new technology is often superseded by the realisation that it fails to deliver on our (most likely unrealistic) expectations, and we are forced to contemplate why this may be. The resulting unravelling of the origins of these issues is, arguably, the most important phase of technology uptake, as it provides real insight into the underlying mechanisms and advances biological knowledge.
Much effort has subsequently been invested (largely by companies manufacturing large scale RNAi libraries) in improving our understanding of the mechanisms underlying RNAi and off-target effects (OTEs) [1][2][3][4], and the development and incorporation of novel features into the design of knockdown reagents has had a substantial impact on reducing OTEs. The act of transfecting cells can in itself be damaging and intrusive, and the resulting cellular response to such an assault can be aggressive, resulting in toxicity. The innate immune component of the response can identify nucleic acids, recognising sequences through pattern recognition receptors, including Toll-like receptors (TLRs), specifically TLR3 [5,6] and TLR7-9 [7][8][9]. Those looking to develop RNAi as a therapeutic tool have embraced this dual function with the aim of exploiting its immunostimulatory effects; when RNAi is used as an experimental tool, however, these effects are to be avoided. Chemically modifying the siRNA has proved a somewhat successful strategy in this regard. Modifying the guide strand of the siRNA duplex shifts the dependency for target recognition away from the interaction between the siRNA seed sequence and the target mRNA, towards the target-specific region of the oligonucleotide, thereby increasing the 'on-target' effect. This approach can also be extended to the passenger strand [10,11]. The concept of pooling siRNAs is most commonly associated with providing an enhanced knockdown of the target gene by incorporating three or four siRNAs along the length of the corresponding mRNA. However, this approach is also beneficial in reducing OTEs: because the likelihood of generating an OTE correlates with the concentration of siRNA used, a total concentration comprised of lower concentrations of four individual siRNAs, rather than a higher concentration of a single siRNA, is likely to reduce OTEs.
Therefore, using pools of siRNAs may not only enhance the efficacy of target knockdown, but simultaneously reduce putative OTEs.
Understanding the contribution of each component of the pool to the observed phenotype is an important part of evaluating genes of interest for putative OTEs, and manufacturers recommend including this step. Indeed, large scale siRNA libraries are now available in arrayed formats such that three or four individual siRNAs can be tested in a screening assay, thus providing data on multiple individual siRNAs against the same target gene. This approach works extremely well for small, bespoke libraries (e.g. <100 genes). From a large scale screening perspective, this may in the past have been considered too great an investment for the return with regard to screening consumables, data storage and data handling. But with the introduction of liquid handling technologies such as acoustic dispensing, and of higher density microtitre plates, throughput becomes less of a limitation, and screening in this format is more accessible. The contribution of each siRNA to the phenotype is typically scored, with scoring conducted according to different criteria. Statistical analyses used in this context include the H score [12] ((number of active siRNAs/total number of siRNAs)*100), Redundant siRNA Analysis [13] (RSA, which assesses the statistical robustness of replicate wells based on rank order of effect), and collective Strictly Standardised Mean Difference [14] (cSSMD). cSSMD is an adaptation of SSMD [15], which assesses the statistical robustness of replicate wells, but importantly, it does this in the absence of a null hypothesis. We employ the H score and cSSMD for collective assessment of OTEs and replicate robustness. This is noteworthy because statistics such as RSA (and indeed the commonly implemented Z-score) assume an overall null effect hypothesis, which in many high throughput screens may not hold true, e.g. in targeted screens or in validation/deconvolution screens.
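As a concrete illustration, the H score and a simple SSMD estimate can be computed as below. This is a minimal sketch: the function names, the toy replicate values and the method-of-moments form of the SSMD estimate are illustrative assumptions, not the exact implementations described in refs [12,14,15].

```python
import statistics

def h_score(active_sirnas: int, total_sirnas: int) -> float:
    """H score: percentage of the siRNAs targeting a gene that score as active."""
    return (active_sirnas / total_sirnas) * 100

def ssmd(sample: list, control: list) -> float:
    """Method-of-moments SSMD estimate: difference in group means, scaled by
    the square root of the summed sample variances of the two groups."""
    diff = statistics.mean(sample) - statistics.mean(control)
    return diff / (statistics.variance(sample) + statistics.variance(control)) ** 0.5

# Example: 3 of 4 deconvoluted siRNAs reproduce the phenotype of interest
print(h_score(3, 4))  # 75.0

# Toy replicate-well viability readouts: knockdown vs negative control
print(round(ssmd([0.31, 0.28, 0.35], [0.92, 0.88, 0.95]), 2))
```

A large negative SSMD here reflects a strong, reproducible reduction in viability relative to control; unlike a Z-score against the plate median, no overall null effect across the plate is assumed.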
Other computational approaches involve evaluating seed sequences for enrichment across all siRNAs screened: again, while useful for large scale (i.e. whole genome) campaigns, this is less suitable for targeted approaches. Similarly, collating a database of identified OTEs has also proved ineffective, as OTEs are often dependent on the biology being evaluated. As such, while modifications to the reagents used have made progress in reducing off-target effects, and tools have been developed to identify sequence enrichments, it is unlikely that this issue will ever be entirely resolved. We must, therefore, reconcile ourselves to fully validating our genes of interest experimentally by testing and overlaying data derived from multiple platforms, reagents and targeted mutants [16]. With our expanding working knowledge of the caveats of the system, we are now in an increasingly strengthened position to employ RNAi successfully.

Target identification from large datasets using iterative data multiplexing (IDM).
These days, when we screen, we screen big. Our libraries encompass millions of small molecules, developed drugs, natural products and functional genomic reagents, and we often conduct in vitro screening campaigns using several cell types, under multiple conditions and with combinations of agents, which multiplies the number of outputs. Our screening assays have become increasingly data-rich too. The embracing of High Content Imaging (HCI) and Analysis (HCA), albeit slow to encompass the full extent of the capability [17], has resulted in massively detailed phenotypic characterisation of our cells of choice, where a suite of parameters can be quantified from a simple single stain, and moreover multiple markers and indicators can be used to provide detailed insight into the underlying mechanisms in question. With the implementation of low attachment plates and spheroid technologies, we are no longer restricted to a somewhat artificial 2D platform, further escalating the data volume. Nor are we constrained within the limits of our own datasets. There is now a wealth of publicly accessible data available pertaining to genomic characterisation and its correlation with patient prognosis: this is currently most evident when engaging in cancer research. The age of "Big Data" has infiltrated the screening community in an unprecedented and somewhat unexpected way. So how do we identify the most apposite targets from such a volume of data?
When conducting any type of high throughput screen there is always a primary output, and in target-discovery based screening approaches for cancer this often constitutes a measure of viability, for example cell number. The inclination with any large dataset is to create a list based on rank order, and set an arbitrary or statistically defined threshold based on the overall dataset. While some genes may be extremely effective in reducing cell number, they may not present the best targets for further study from either a mechanistic or a putative therapeutic perspective, as they frequently encompass components essential to the viability of all cells. There are a number of structured approaches which can be iteratively applied to such datasets to define a shortlist of genes, and these are summarised in Figure 1. Three complementary approaches demonstrate utility in deriving a preliminary target 'hit' list: i) filtering based on selectivity for the disease model of interest, for example by cross referencing the effects of knocking down candidate genes in other cell lines/disease types studied. This can be achieved through exploration of the literature, or through database mining, either using internal databases if screens are conducted within a dedicated facility or unit, or through mining external databases such as GenomeRNAi [18]; ii) filtering based on internal conditions included within the screen. RNAi approaches are (arguably) at their best when one asks a specific question of them. For example, instead of asking 'knockdown of which genes kills cell line X?', we may ask 'knockdown of which genes kills cell line X under low oxygen conditions in the presence of drug Y?'.
This could include, for example, screening in the presence or absence of an agent of interest (such as a known drug), screening in a drug sensitive cell line compared with a drug insensitive/resistant cell line, or screening under different environmental conditions (normal oxygen compared with low oxygen, or radiation treatment compared with no radiation exposure).
iii) implementing pathway analysis software, such as MetaCore (Thomson Reuters) or Ingenuity Pathway Analysis (IPA), to identify effective genes which may indicate that a known pathway is affected.
This approach can be extremely useful for identifying genes associated with a specific pathway, which provides information as to whether a cell line may be dependent on a core hub or an extended network, rather than on discrete genes, and may work well using a subset of genes defined from i) and ii). Where screening has been conducted in the presence of a known drug, identification of an affected pathway can also support validating that the drug may be acting via the predicted mechanism of action. Reciprocally, a drug which is known to inhibit a gene candidate of interest may prove useful as a tool agent for exploring the underlying mechanism, whilst holding potential for development as a therapeutic candidate.
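The selectivity filter in i) can be sketched as a simple comparison of per-gene viability scores between a disease model and a comparator line. The gene names, viability values and cutoffs below are entirely hypothetical, chosen only to show the shape of the filtering step.

```python
# Hypothetical per-gene viability after knockdown (fraction of untreated control)
disease = {"GENE_A": 0.30, "GENE_B": 0.85, "GENE_C": 0.25}
normal  = {"GENE_A": 0.90, "GENE_B": 0.80, "GENE_C": 0.35}

def selective_hits(disease, normal, hit_cutoff=0.5, sparing_cutoff=0.7):
    """Keep genes whose knockdown kills the disease model (viability below
    hit_cutoff) while largely sparing the comparator line (viability above
    sparing_cutoff); genes absent from the comparator screen are assumed spared."""
    return sorted(g for g, v in disease.items()
                  if v < hit_cutoff and normal.get(g, 1.0) > sparing_cutoff)

print(selective_hits(disease, normal))  # ['GENE_A']
```

Here GENE_C is excluded despite its strong effect, because it also kills the comparator line: exactly the 'essential to all cells' class of gene the filter is designed to remove.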
In addition to the aforementioned, inclusion of additional information on the cell line(s) themselves is a powerful approach to guiding target selection. Genomic characterisation of cell lines is becoming increasingly commonplace, and arms us with a much broader understanding of the genetic backdrop within which we are screening. This provides great insight into the relevance of identified genes and pathways of interest, even more so when the screening conditions are replicated in the genomic analysis, i.e. in the presence/absence of a drug or environmental condition.
Here, we can apply our hypothesis across two independent data platforms and interrogate the datasets for genes which are relevant in both contexts: for example, genes which support viability and which are upregulated under low oxygen conditions may present as promising targets. This approach is, of course, not limited to genomic characterisation, and the use of global analytical techniques to profile essential cellular processes is also hugely informative [19]. The iterative use of pathway analysis also finds utility here, whereby it can support identification of affected pathways in this wider context.
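The cross-platform step described above reduces, in its simplest form, to intersecting gene sets from independent datasets. The gene names and set memberships below are hypothetical placeholders.

```python
# Hypothetical gene sets from two independent platforms
viability_hits = {"GENE_A", "GENE_C", "GENE_D"}  # knockdown reduces viability (RNAi screen)
hypoxia_up = {"GENE_A", "GENE_B", "GENE_D"}      # upregulated under low oxygen (expression profiling)

# Genes relevant in both contexts are prioritised as promising targets
candidates = sorted(viability_hits & hypoxia_up)
print(candidates)  # ['GENE_A', 'GENE_D']
```

In practice each set would itself be the product of the statistical and selectivity filters described earlier, and further platforms (e.g. proteomic profiling) simply add further intersections.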
Further to identifying putative targets/pathways which may be relevant in the cell lines of study, scientists now also have access to a wealth of data to support determination of clinical relevance. Cross-referencing candidates with literature and microarray databases [20] remains a valuable and informative approach, but in the present day, targets identified using an IDM approach can subsequently be cross referenced with large scale genomic characterisation studies, such as those published by The Cancer Genome Atlas (TCGA), the International Cancer Genome Consortium (ICGC) and the Catalogue of Somatic Mutations in Cancer (COSMIC), and with excellent data visualisation tools such as cBioPortal and Oncomine. These projects have supported comprehensive genomic characterisation of thousands of patient samples across multiple cancer types, with subsequent analysis of the frequency and clinical impact of the observed alterations. As such, they provide an invaluable resource to support prioritisation of a refined list of putative candidates. Equally, these datasets can themselves be a starting point from which global bioinformatics analysis can be conducted, with functional genomic approaches providing the biological validation [21,22].
Collectively, use of IDM in the context described supports translation of multi-platform global datasets towards clinically relevant therapeutics for patient benefit.

RNAi screening approaches to target identification in predictive pre-clinical disease models.
Use of cell lines in the laboratory environment has supported fundamental discoveries of pathways and mechanisms. However, when the aim is to identify innovative therapeutics, translation from in vitro efficacy to clinical efficacy has a disappointingly low success rate. Furthermore, the majority of clinical failures occur in the latter Phase trials, with between 5% (2004, [23]) and 13.4% (2013, [24]) of cancer drugs progressing to registration. A refocusing of investment on the preclinical validation stages [25], and on improving the predictivity of preclinical and pharmacological models, has attempted to address this shortfall. In this scenario, it may prove more fruitful to generate models by taking tumour mass directly from the patient and transplanting it into immunocompromised mice, followed by serial passage (patient derived xenografts, PDXs), and/or to generate cell lines to support in vitro experimentation (patient derived cell lines, PDCLs). Both examples have advantages within their remit. PDXs are gaining credence as a superior model for predicting pharmacological effects in patients [26,27], and are attractive because they can recapitulate essential components of both the tumour, encompassing its heterogeneity and intra-tumoural hierarchy, and its interactions with the stromal microenvironment. However, they cannot exclude the effects of interactions between the murine host and the human sample, and offer little information regarding the immune response to certain agents. There is also debate as to the clinical relevance of doses which can be administered to mice. The use of Genetically Engineered Mouse Models (GEMMs), whereby transgenics result in the spontaneous development of tumours, goes some way to addressing some of the issues arising in PDXs, namely around the role of the immune system in testing therapeutic candidates.
However, GEMMs are reliant on tumours arising due to the alteration in one or two key driver oncogenes, and it is not yet fully understood how representative of the heterogeneity and pathology observed in patient tumours these are [28].
As such, combining these two preclinical approaches may provide a complementary platform to support therapeutic validation [29].
Cell lines derived from patient tumours (PDCLs, generated directly from patient samples or from PDX models) and from GEMMs offer an extension to these preclinical models, one which can support the integration of in vitro functional genomics-based target discovery platforms into this pipeline: for example, in identifying synthetically lethal candidates, and in improving the efficacy of existing therapeutics, since lack of sensitivity, whether innate or acquired, is a fundamental failing of cancer treatments.
Traditionally, in vitro RNAi screening approaches constitute the initiating point for such endeavours, but increasingly they are finding their place alongside mouse models. Here, use of RNAi techniques can support target discovery within a more clinically tractable model, underscored by genomic and proteomic characterisation: collectively, these approaches are useful for identifying targets in, for example, drug sensitive versus drug resistant tumours, and moreover for the potential identification of tumour biomarkers. Furthermore, shRNA approaches have incorporated genetic knockdown screening into in vivo models, both through transplantation of transfected cell lines into mice [30] and through delivery of pooled shRNA libraries directly into murine tumours [31], bridging a long-recognised gap between in vitro and in vivo techniques. Use of genetic knockdown/out tools can also constitute an iterative process, whereby target identification approaches can be revisited if a patient develops resistance or presents with disease recurrence. In an age of cheaper and more sensitive genomic sequencing [32], initiatives are underway to apply these approaches across large numbers of patients, with the overarching aim of identifying signatures which will indicate the most apposite treatment to give a patient as a first line therapy [33][34][35]. Collective use of these preclinical tools will provide a much broader, more comprehensive understanding of the cancer genome through genomic characterisation, while sequencing of hundreds and thousands of different tumours across multiple cancer types builds our global understanding of tumour biology and of the adaptations made when challenged with therapeutic agents, which may underlie drug resistance.

Conclusion.
Translating functional genomics screening campaigns towards clinically relevant therapeutics has proved extremely challenging. The discovery of off-target effects, and the extent to which they can preclude the identification of genuine targets of interest, has led to considerable effort to appreciate the underlying biology of this phenomenon, while refining reagents to reduce the prevalence of such effects. While considerable progress has been made in this regard, it is unlikely that the occurrence of OTEs will be fully resolved, and stratified deconvolution of putative genes of interest will always remain an essential component of hit validation.
Identifying the most apposite targets in the context of the experimental system presents further challenges, both through the volume of data generated from a large scale functional genomics screen and through the multiple parameters which are increasingly implemented to quantify the resulting cellular phenotypes. Cross-referencing with other datasets and platforms provides a useful tool to refine lists based on selectivity for the biology or experimental conditions being studied, and this can be approached using data generated in-house, or by cross-referencing against databases collating reported phenotypic effects. The application of pathway and network mapping software is particularly useful on these preliminary 'hit' lists, where enrichment for genes within a known pathway may be evident, indicating it as a biologically relevant avenue for further study.
Genomic characterisation of the cell lines being evaluated provides an excellent background within which to assess the relevance of genes within a refined list, especially when that characterisation is carried out under the experimental conditions included within the screen itself, for example oxygen level or exposure to radiation. Moreover, large efforts to conduct broad scale genomic analyses across thousands of patients with multiple cancer types have provided a mechanism by which identified gene candidates can be assessed for putative clinical relevance. This is a truly outstanding resource, and provides much needed insight as to which targets to progress from hit discovery into hit validation, based on those which will benefit a majority, or even a stratified group, of cancer patients.
Collectively, the strategies described provide an IDM toolkit to refine lists of effective genes from functional genomics screening datasets and other large scale datasets, and provide essential information as to which genes may be the most pertinent to prioritise for further study. As such, RNAi screening approaches find their place as players on a carefully strategised team, providing collaborative insight into cancer biology and cancer target discovery.

Acknowledgements.
Figure 1. Multiple sequential and cyclical steps support identification of the most promising preclinical candidates from functional genomic screens. Primary screening data are filtered based on statistically defined rank order, selectivity analysis, pathway mapping and network analysis, and cross referenced with genomic characterisation of cell lines (derived from PDCLs, PDXs, or GEMMs) and additional global analytical techniques. The preclinical relevance of a refined list of putative candidates can then be evaluated against large scale, multi-patient, multi-cancer studies to ensure the progression of the most pertinent targets for subsequent validation studies.