A computational integrative approach based on alternative splicing analysis to compare immortalized and primary cancer cells

https://doi.org/10.1016/j.biocel.2017.07.010

Abstract

Immortalized cell lines are widely used to study the effectiveness and toxicity of anticancer drugs and to assess phenotypic characteristics of cancer cells, such as proliferation and migration ability. Unfortunately, cell lines often show properties that differ markedly from those of tumor tissues. Primary cells, once deprived of their in vivo environment, may also adapt to artificial conditions and diverge from the tissue they are meant to represent. Despite these considerations, cell lines remain one of the most widely used cancer models because of their availability and their capacity to expand without limitation, and the clinical relevance of their use is still a major open issue in cancer research. Many studies have addressed this problem by comparing cell lines and tumor samples in terms of their genomic and transcriptomic differences, mostly using nucleotide variation or gene expression data. Here we introduce a different strategy, based on alternative splicing detection and on the integration of DNA and RNA sequencing data, to explore the differences between immortalized and tissue-derived cells at the isoform level. Furthermore, to better investigate the heterogeneity of both cell populations, we took advantage of a publicly available dataset obtained with a new single-cell sequencing methodology that profiles multiple omics layers simultaneously. The proposed pipeline allowed us to identify, through a computational and predictive approach, putative mutated and alternatively spliced transcripts responsible for the dissimilarity between immortalized and primary hepatocellular carcinoma cells.

Introduction

Human-derived cancer cell lines have for decades been the model of choice for studying cancer biology and testing new anticancer therapies (Goodspeed et al., 2016). Despite rapid scientific progress, cell line-based assays still represent an important resource for the pharmaceutical, chemical, medical and cosmetic industries. Their low cost, easy-to-handle culture methods and high reproducibility ensure their extensive use. Primary tumors represent a reliable but more expensive and less available resource. Furthermore, they consist of a highly heterogeneous cell population that also contains non-cancerous cells, which might affect experimental results. However, the relevance of cell lines as tumor models strongly depends on the type of experimental approach and on how close their properties are to those of tumor tissue (Gillet et al., 2013). Thus, investigating and defining this closeness is an important issue for biologists and might lead to the development of new in vitro pre-clinical models. Many studies have aimed to determine the functional differences between primary cells and cell lines (Ertel et al., 2006, Pan et al., 2009). The approaches used are commonly based on the comparison of gene expression (Vincent et al., 2015, Tyakht et al., 2014, Chen et al., 2015) or genomic (Domcke et al., 2013) profiles. The advent and continuous development of high-throughput technologies have provided a huge amount of omics data aimed at understanding the biological processes and functions of living organisms and the tight regulation existing among their constituent molecules. In particular, their application has led to the molecular classification of many diseases and to the identification of biomarkers and therapeutic targets involved in the mechanisms responsible for the onset, progression and outcome of the pathology under study.
Recent research has shown that the "one gene, one protein, one function" dogma, postulated by Beadle and Tatum (Beadle and Tatum, 1941), is outdated. The latest GENCODE release (version 25) (Harrow et al., 2012) reports almost 80,000 transcript variants encoded by about 20,000 protein-coding genes in humans, suggesting an average of four transcripts per gene. The number of transcripts per gene nevertheless varies across databases, which highlights the current limitations in fully characterizing the transcriptome. More than 90% of human genes are alternatively spliced, with a role in many physiological functions. Dysregulation of the splicing machinery has been associated with a wide variety of human diseases (Hsu and Hertel, 2009, Scotti and Swanson, 2016). The proteins translated from alternatively spliced transcripts can have similar, different or even opposing functions. Five major alternative splicing events are distinguished: exon skipping (SE), also called cassette exon; use of alternative donor and/or acceptor sites (A5SS, A3SS); mutually exclusive exons (MXE); and intron retention (RI). How the spliceosome recognizes alternative exons and decides how to splice the mRNA is still not fully understood. The aberrant use of alternative mRNA isoforms has been linked to cancer formation. It is well known that several oncogenes and tumor-suppressor genes (for example, LEF1, TP63, TP73, HNF4A, RASSF1, and BCL2L1) have multiple promoters and alternative splice variants (Zhang et al., 2013, Hovanes et al., 2001, Wilhelm et al., 2010, Nekulova et al., 2011, Tomasini et al., 2008). These findings highlight the importance of focusing on isoform-level expression profiles and on understanding the tightly regulated splicing machinery in order to better define the signature of cancer cells.
The huge amount of omics data, ranging from DNA and RNA sequencing to proteomic data, makes it possible to develop analysis pipelines based on integration approaches. One of the challenges faced by these approaches is the combination of single nucleotide polymorphisms (SNPs) and splicing events in order to identify the genetic variants affecting the splicing machinery. A large fraction of DNA variants occurs within splice site sequences at the intron-exon junction, or within enhancer and silencer sequences. As a consequence, they may alter the activity of the splicing machinery and its tight regulation. The dysfunction introduced by these nucleotide variations in pre-mRNA splicing could lead both to novel transcripts and to an abnormal ratio of alternative splicing patterns. Based on these considerations, we designed and used a bioinformatics pipeline to compare the molecular properties of immortalized and primary cells through an integrative approach based on the study of alternative splicing. Furthermore, to take into account the high heterogeneity of primary cells, we used as a case study a publicly available single-cell sequencing dataset containing both hepatocellular cell line samples (HepG2) and related cancer tissue cells. DNA and RNA were sequenced simultaneously through a novel technique, called scTrio-seq, developed by the authors of the dataset (Hou et al., 2016). This work highlights the importance and the possible involvement of differentially spliced isoforms in determining the differences between immortalized and tissue-derived cells.
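The idea of combining nucleotide variants with splicing can be illustrated with a minimal sketch that flags variants lying next to an intron-exon junction. This is not the pipeline used in this work: the function name, the forward-strand assumption, the 1-based inclusive coordinates, the 2 bp window (covering the canonical GT donor and AG acceptor dinucleotides) and the toy coordinates are all assumptions chosen for illustration.

```python
def splice_site_variants(exons, variant_positions, window=2):
    """Flag variants within `window` intronic bases of an exon boundary.

    Assumptions (hypothetical, for illustration only):
    - exons are (start, end) pairs, 1-based inclusive, on the forward strand;
    - the donor site starts right after an exon end (not for the last exon);
    - the acceptor site ends right before an exon start (not for the first exon).
    """
    exons = sorted(exons)
    hits = []
    for pos in sorted(variant_positions):
        for i, (start, end) in enumerate(exons):
            is_first = i == 0
            is_last = i == len(exons) - 1
            if not is_last and end < pos <= end + window:
                hits.append((pos, "donor", i))
            elif not is_first and start - window <= pos < start:
                hits.append((pos, "acceptor", i))
    return hits


# Toy transcript with three exons and six variant positions (made-up values).
exons = [(100, 200), (300, 400), (500, 600)]
variants = [150, 201, 299, 350, 498, 601]
hits = splice_site_variants(exons, variants)
for pos, site, exon_index in hits:
    print(pos, site, exon_index)
```

Variants inside exons (150, 350) and beyond the transcript end (601) are ignored, while those in the first or last two intronic bases are reported. A realistic analysis would also need strand handling and the enhancer and silencer regions mentioned above, which this sketch leaves out.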


Data

To test our approach and develop an appropriate pipeline, we downloaded a publicly available dataset from the Gene Expression Omnibus (GEO) portal. This dataset (GSE65364) was selected because it was generated through a novel triple-omics sequencing protocol developed by its authors (Hou et al., 2016). It is a single-cell sequencing technique, called scTrio-seq, able to analyze the genome, DNA methylome, and transcriptome simultaneously. In particular, 6 single human HepG2 cell line and

Clustering of cell populations based on transcriptomics and genomics data

Based on the concordant pair alignment rate (>60%), we selected 6 HepG2 and 11 HCC single-cell samples from the full dataset (the samples under study are listed in Table S3) for downstream analyses.
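The selection step described above amounts to a simple threshold filter. The sketch below shows the idea on made-up values; the sample names, the rates and the function name are hypothetical and are not taken from the dataset or from Table S3.

```python
def select_samples(alignment_rates, threshold=60.0):
    """Keep samples whose concordant alignment rate (in %) exceeds the threshold."""
    return sorted(
        name for name, rate in alignment_rates.items() if rate > threshold
    )


# Hypothetical per-sample concordant alignment rates, in percent.
rates = {
    "HepG2_cell_A": 72.5,
    "HepG2_cell_B": 41.0,
    "HCC_cell_A": 60.0,
    "HCC_cell_B": 88.3,
}
selected = select_samples(rates)
print(selected)
```

The comparison is strict, so a sample at exactly 60% is excluded, which matches the ">60%" criterion as written.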

After normalization of the transcript expression matrix by the FPKM (Fragments Per Kilobase of Exon Per Million Fragments Mapped) method, clustering analysis of these 17 samples was performed by Multidimensional Scaling (MDS) and Principal Component Analysis (PCA). The results
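As a reminder of what the FPKM normalization does, the following minimal sketch computes FPKM values from raw fragment counts and then a pairwise distance between two samples, which is the kind of input an MDS embedding works on. It is not the authors' implementation: the function names, the log2(FPKM + 1) transform, the Euclidean distance and the toy counts are assumptions, and the MDS/PCA step itself is omitted.

```python
import math


def fpkm(fragment_counts, transcript_lengths_bp):
    """Convert one sample's fragment counts to FPKM.

    FPKM = fragments * 1e9 / (transcript length in bp * total mapped fragments)
    """
    total = sum(fragment_counts.values())
    if total == 0:
        raise ValueError("sample has no mapped fragments")
    return {
        tid: count * 1e9 / (transcript_lengths_bp[tid] * total)
        for tid, count in fragment_counts.items()
    }


def log_distance(fpkm_a, fpkm_b):
    """Euclidean distance between two samples on log2(FPKM + 1) values."""
    ids = sorted(set(fpkm_a) | set(fpkm_b))
    return math.sqrt(sum(
        (math.log2(fpkm_a.get(t, 0.0) + 1) - math.log2(fpkm_b.get(t, 0.0) + 1)) ** 2
        for t in ids
    ))


# Toy data: two transcripts, two samples (made-up counts and lengths).
lengths = {"tx1": 1000, "tx2": 2000}
sample_a = {"tx1": 100, "tx2": 100}
sample_b = {"tx1": 300, "tx2": 100}

fpkm_a = fpkm(sample_a, lengths)
fpkm_b = fpkm(sample_b, lengths)
print(fpkm_a)
print(log_distance(fpkm_a, fpkm_b))
```

With equal counts, the longer transcript receives half the FPKM of the shorter one, which is the length correction the method is designed to apply.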

Conclusion

In conclusion, we proposed an integrative approach, based on the detection of splicing events, to investigate the closeness between immortalized and primary hepatic cancer cells. This study highlights the importance of focusing on isoform-level signatures to characterize cells and their behavior. By integrating the results of the sequencing data analyses, we showed the differences, at both the RNA and the DNA level, between the cell line and primary cells. Further computational and biological studies are needed

Acknowledgements

We would like to acknowledge the efforts of Mr. Giuseppe Trerotola for technical support, and Dr. Gennaro Oliva for the deployment of the hardware and software infrastructure used in the analysis. This work has been partially funded by MIUR PON02_00619 projects. Mario Guarracino's work has been partially conducted under an ETT fellowship from the Graduate School of Evolution, Institute for Evolution and Biodiversity of the University of Münster.

References (40)

  • S. Domcke et al. Evaluating cell lines as tumour models by comparison of genomic profiles. Nat. Commun. (2013)
  • D.M. Endres et al. A new metric for probability distributions. IEEE Trans. Inf. Theory (2003)
  • A. Ertel et al. Pathway-specific differences between tumor cell lines and normal and tumor tissue cells. Molec. Cancer (2006)
  • S. Gao et al. BS-SNPer: SNP calling in bisulfite-seq data. Bioinformatics (2015)
  • J.-P. Gillet et al. The clinical relevance of cancer cell lines. J. Natl. Cancer Inst. (2013)
  • L. Goff et al. cummeRbund: Analysis, exploration, manipulation, and visualization of Cufflinks high-throughput sequencing data, R package version 2 (2013)
  • A. Goodspeed et al. Tumor-derived cell lines as molecular models of cancer pharmacogenomics. Molec. Cancer Res. (2016)
  • I. Granata et al. Var2GO: a web-based tool for gene variants selection. BMC Bioinformatics (2016)
  • B.R. Graveley. Sorting out the complexity of SR protein functions. RNA (2000)
  • J. Harrow et al. GENCODE: the reference human genome annotation for the ENCODE project. Genome Res. (2012)

    1

    Equal contributors.
