of Computer Science & Systems Biology

Abstract

In gene therapy the integration process of the viral DNA genome into the host cell genome is a necessary step for virus integration. Just a few years ago, retrovirus integration was believed to be random and the chance of accidentally activating a gene was considered remote. It has since been observed that this process is not random and that different viruses may show different preferences to integrate in specific areas of the genome. Tumorigenesis associated with some gene therapy studies is suspected to be caused by the insertion process. Depending on whether the provirus integrates into or in the vicinity of genes (Transcription Start Sites, TSS), normal transcription can be enhanced or disrupted, thus inducing oncogenic mutations. This is called "insertional mutagenesis". Investigating whether an area of the genome could be favoured by retrovirus integration is a crucial aspect of gene therapy. These areas are called "Common Integration Sites" (CIS) or "hotspots". In this paper we stress the importance of developing statistical procedures leading to a unique definition of CIS rather than a "problem related" definition. We propose some statistical solutions for the search for hotspots based on the "Peaks-Height distribution", which account within the null hypothesis for the possible non-random behaviour of the integrations.

Background
Gene therapy is a form of molecular medicine which treats genetic diseases by replacing a defective gene, responsible for the pathology, with a functional one. The basic principle is to introduce a piece of genetic material into cells via a virus, which acts as the vector for gene therapy. The virus integrates with the cell DNA and thus delivers the genetic material into the cell nucleus. This process is called integration and may alter the host cell's DNA. Recent studies based on cellular and animal models (Bushman, 2005) reported empirical evidence that certain retroviral vectors, i.e. those derived from Moloney Murine Leukemia Virus (MLV), preferentially integrate near the start of transcriptional units, whereas others (such as Simian Immunodeficiency Virus (SIV)- and Human Immunodeficiency Virus (HIV)-based vectors) did not show the same tendency. The mutation may alter the expression of genes in the vicinity of the insertion or, when inserted within a gene, alter the gene product. When the affected gene is a cancer gene (either a proto-oncogene or a tumor suppressor gene), activation of the proto-oncogene or inactivation of the tumor suppressor gene can cause uncontrolled proliferation (cell division) of cells. Eventually this may give rise to tumors. These cancer-causing insertions are referred to as insertional mutagenesis or oncogenic integration. A tumor could develop when an accumulation of oncogenic insertions causes uncontrolled proliferation of a cell. This has been seen in both animal and human models.
The related problem of the safety of a vector is a major hurdle (Montini et al., 2006). It has been observed that in retroviral integration different vectors show distinct target site preferences; finding a unique statistical criterion to detect accumulation of integrations is therefore a fundamental tool within the debate on the safety of a vector (Recchia et al., 2006; Cassani et al., 2006). Some approaches provided statistical and mathematical modelling to test the hypothesis of randomness (Abel et al., 2007; Ambrosi et al., 2008). Moreover, the recent literature (Cattoglio et al., 2007) reports that analysis of MLV integration patterns in natural or experimentally induced leukemias/lymphomas showed the existence of insertion sites recurrently associated with a malignant phenotype. These "common insertion sites" (CIS), also called "hotspots", which include proto-oncogenes or other genes associated with cell growth and proliferation, may, when activated, have a causal relationship with the establishment and/or progression of cancer.
The definition of a hotspot or CIS is, however, not unique and crucially "problem related". A first model to define a CIS was provided by Suzuki et al. (2002), who compared the mapped locations of the proviruses in the isolated tumors with randomly generated integrations from 100,000 Monte Carlo trials. This was done to determine cutoffs for deciding when two or more integrations in close proximity were significant enough to assume that they did not occur by chance. The resulting cutoffs were 30 kb for 2 integrations, 50 kb for 3 integrations and 100 kb for 4 integrations. In terms of the null hypothesis for CIS analysis this is problematic: the definition of hotspots is based on a comparison with a random set, but there is a clear preference in integration that should be taken into account in the null hypothesis.
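The Monte Carlo reasoning behind window cutoffs of this kind can be illustrated with a minimal sketch. It is not the procedure of Suzuki et al.; it only assumes that integrations are placed uniformly at random on a genome of a given length and estimates how often k integrations fall within a given window by chance. The function names, the genome length and the number of integrations are hypothetical choices made for illustration.

```python
import numpy as np

def has_cluster(positions, k, window):
    """Return True if any k positions lie within `window` bp of each other."""
    pos = np.sort(np.asarray(positions))
    if len(pos) < k:
        return False
    # Span covered by every run of k consecutive sorted positions.
    spans = pos[k - 1:] - pos[:len(pos) - k + 1]
    return bool(np.any(spans <= window))

def chance_cluster_rate(n_integrations, genome_length, k, window,
                        n_trials=1000, seed=0):
    """Fraction of uniform-random trials containing at least one k-cluster."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        # Assumption: purely uniform placement (the "random set" null).
        sim = rng.integers(0, genome_length, size=n_integrations)
        hits += has_cluster(sim, k, window)
    return hits / n_trials

if __name__ == "__main__":
    # Cutoffs quoted in the text; genome length and sample size are hypothetical.
    for k, window in [(2, 30_000), (3, 50_000), (4, 100_000)]:
        rate = chance_cluster_rate(1000, 3_000_000_000, k, window, n_trials=200)
        print(f"k={k}, window={window} bp: chance rate ~ {rate:.3f}")
```

Because the simulated set is uniform, a rate computed this way ignores any integration preference, which is exactly the limitation of the null hypothesis discussed above.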

Wu et al. (2005) showed that a pre-established role in cancer is not sufficient support for the efficacy of the Suzuki CIS technique. For instance, expression level may also play a role in MLV integration, since MLV integrations are biased towards genes with higher expression levels. A first interesting way to account for this non-randomness in the null hypothesis is the Wu et al. model, which allows 75% of the integrations to occur randomly and 25% to integrate according to a Poisson distribution within ±5 kb of the transcriptional start site.
This introduction was aimed at highlighting how statistics and probability must play a fundamental role in establishing a criterion to detect "accumulation" sites and preferences of integration, for a better understanding of how safe a vector is in gene therapy. To do this we address the following question: how can we distinguish the preference of viruses to integrate close to TSS from their "accumulation" due to some other reason (for instance the presence of some particular gene)?
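A null model of the 75%/25% kind attributed to Wu et al. above can be sketched as a simple mixture simulation. This is only an illustration under stated assumptions: the text does not specify how the Poisson distribution is parameterised, so the sketch draws a Poisson-distributed distance (with a hypothetical mean `mean_offset`), gives it a random sign and truncates it at 5 kb. The function name and all default values other than the 25% fraction and the 5 kb limit are assumptions.

```python
import numpy as np

def simulate_mixture_null(n, genome_length, tss_positions, frac_tss=0.25,
                          max_offset=5_000, mean_offset=1_000, seed=0):
    """Simulate n integrations: (1 - frac_tss) uniform, frac_tss near a TSS."""
    rng = np.random.default_rng(seed)
    tss = np.asarray(tss_positions)
    n_tss = int(round(frac_tss * n))
    n_random = n - n_tss

    # Random component: uniform over the genome.
    random_part = rng.integers(0, genome_length, size=n_random)

    # TSS component: pick a TSS, then a signed distance from it.
    # Assumption: Poisson-distributed distance, truncated at max_offset.
    centres = rng.choice(tss, size=n_tss)
    dist = np.minimum(rng.poisson(mean_offset, size=n_tss), max_offset)
    sign = rng.choice([-1, 1], size=n_tss)
    tss_part = np.clip(centres + sign * dist, 0, genome_length - 1)

    return random_part, tss_part

if __name__ == "__main__":
    hypothetical_tss = [100_000, 500_000, 900_000]
    rand, near = simulate_mixture_null(400, 1_000_000, hypothetical_tss)
    print(len(rand), "random integrations,", len(near), "near a TSS")
```

Integration sets simulated in this way could replace the purely uniform set when deriving cutoffs, so that the TSS preference is already built into the null hypothesis.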

Data and Experimental Design
In this paper we analyzed data derived from retroviral transduction in T cells from leukemic patients treated with allogeneic stem cell transplantation and donor lymphocytes genetically modified with a suicide gene (HSV-TK). Retroviral vectors integrate preferentially within or near transcribed regions of the genome, with a preference for sequences around promoters and for genes active in T cells at the time of transduction. For details on the whole data set see Ambrosi et al. (2008). The following information is reported:
• nucleotide (integration position)
• chromosomes
• integration distance from the TSS of genes in a window of 100 kb
• expression data for genes involved in the integration (hotspots and all)
• gene density in 1 Mb neighbourhood
In this paper we provide some statistical proposals to investigate the real distributional "nature" of a hotspot. As mentioned above, we want to address the following question: how can we distinguish the preference of viruses to integrate close to TSS from their "accumulation" due to some other reason? To explore this idea we compare the integration density distribution with the TSS density distribution (which reflects the gene density), as in Figure 3 (which refers to chromosome 1). Since the focus of this paper is on the statistical procedure, we illustrate our analysis with a small sample of integrations on chromosome 1 only.
It can easily be observed that the two distributions show similarities. This is a natural consequence of the fact that this virus integrates preferentially close to the TSS and thus more frequently in areas of high gene density. We next focus on those areas that attract insertions of the retrovirus even where gene density is not high.
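The comparison between integration density and TSS density can be sketched with simple binned counts. This is a rough illustration, not the analysis behind the figure: the 1 Mb bin size echoes the gene-density neighbourhood mentioned above, while the function names and the quantile thresholds (`int_q`, `tss_q`) used to call a bin "integration rich" or "gene poor" are hypothetical.

```python
import numpy as np

def binned_counts(positions, chrom_length, bin_size=1_000_000):
    """Count positions in consecutive bins covering the chromosome."""
    edges = np.arange(0, chrom_length + bin_size, bin_size)
    counts, _ = np.histogram(positions, bins=edges)
    return counts, edges

def integration_rich_gene_poor_bins(integrations, tss, chrom_length,
                                    bin_size=1_000_000, int_q=0.9, tss_q=0.5):
    """Start coordinates of bins with many integrations but few TSS."""
    int_counts, edges = binned_counts(integrations, chrom_length, bin_size)
    tss_counts, _ = binned_counts(tss, chrom_length, bin_size)
    # Assumption: "many" and "few" are defined by quantiles of the bin counts.
    many_integrations = int_counts >= np.quantile(int_counts, int_q)
    few_tss = tss_counts <= np.quantile(tss_counts, tss_q)
    mask = many_integrations & few_tss & (int_counts > 0)
    return edges[:-1][mask]

if __name__ == "__main__":
    # Toy data: integrations cluster in two places, TSS only in the first.
    integrations = np.concatenate([np.full(20, 2_500_000), np.full(20, 8_500_000)])
    tss = np.full(40, 2_500_000)
    print(integration_rich_gene_poor_bins(integrations, tss, 10_000_000))
```

In the toy data only the second cluster is flagged, since the first one coincides with a gene-dense bin and is therefore explained by the TSS preference.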

Statistical Procedure: the Peaks-Height (P-H) Method
A natural way to provide a statistical approach for the identification of CIS in distributional terms is a kernel estimation procedure (Ridder et al., 2006) to find the regions in the genome that show a significant increase in insertion density.
For any position over the genome, an estimate of the number of insertions is obtained by summing all the kernel functions (rectangular, Bartlett-Epanechnikov, etc.).
The basic idea is to model non-parametrically the probability that an observation x will fall into a certain region R, that is

$$F = \int_{R} f(x')\,dx',$$

with F a smoothed (or averaged) version of the density function f(x). The kernel density estimator $\hat{f}_b(x)$ of the density value f(x) at target point x is a local average smoother that, for observations $x_i$, $i = 1, \dots, n$, computes an average of the observations in a neighbourhood of the target point:

$$\hat{f}_b(x) = \frac{1}{nb} \sum_{i=1}^{n} k\!\left(\frac{x - x_i}{b}\right),$$

where k(⋅) denotes the kernel function and b is the bandwidth parameter, which determines how large the neighbourhood of the target point is. A large bandwidth generates a smoother curve, while a small bandwidth generates a wigglier curve; the choice of b is thus fundamental and much more important than the choice of kernel (Hastie and Tibshirani, 1990). We use here a standard Gaussian kernel.
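The kernel density estimator with a Gaussian kernel described above can be written in a few lines. This is a minimal sketch using the standard form of the estimator; the function name and the toy data are illustrative, and the bandwidths are arbitrary values chosen to show the smoothing effect.

```python
import numpy as np

def gaussian_kde(x_grid, observations, b):
    """Kernel density estimate (1 / (n b)) * sum_i k((x - x_i) / b), Gaussian k."""
    x_grid = np.asarray(x_grid, dtype=float)
    obs = np.asarray(observations, dtype=float)
    # Scaled distance between every grid point and every observation.
    u = (x_grid[:, None] - obs[None, :]) / b
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)  # standard Gaussian kernel
    return k.sum(axis=1) / (obs.size * b)

if __name__ == "__main__":
    obs = np.array([0.0, 5.0, 10.0])   # toy positions
    grid = np.linspace(-20.0, 30.0, 1001)
    for b in (0.5, 5.0):               # small vs large bandwidth
        dens = gaussian_kde(grid, obs, b)
        print(f"b={b}: maximum density {dens.max():.4f}")
```

With the small bandwidth each observation produces its own peak, whereas the large bandwidth merges them into a single smooth bump, which is why the choice of b matters more than the choice of kernel.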
The analysis is based on discrete data points indicating the integration positions. We are trying to establish how unusual the spatial patterning of these points is. By turning the discrete points into a continuous surface using kernel estimation, the data can then be explored. In particular, we focus on the maximum of the observed integration distribution, estimated by means of a Gaussian kernel. This is now our new random variable, X, indicating the integration peaks height (P-H).
In the same estimation context we set the null hypothesis, $H_0$: "integrations occur randomly over the genome". Correction for multiple comparisons is then applied, and significant P-H values are extracted. The neighbourhood of each of these identifies a "P-H hotspot". In this contribution we present results on chromosome 1 only. The results reported in Table 1 identify the nucleotide positions where a "real" hotspot occurs (that is, a hotspot according to the P-H based definition). Seven hotspots can be identified after correction for multiple testing. This is a first step towards examining the corresponding genomic areas in terms of expression and properties.
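One way to make the P-H test concrete is a simulation-based sketch. It is not the authors' implementation: it assumes the null distribution of the peak height is approximated by simulating integrations uniformly over the chromosome, that an observed peak is compared with the simulated maxima through an empirical p-value, and that a Bonferroni correction is used for multiple comparisons (the text does not say which correction was applied). All function names and numerical settings are hypothetical.

```python
import numpy as np

def peak_height(positions, grid, b):
    """Maximum over `grid` of the Gaussian kernel density estimate."""
    pos = np.asarray(positions, dtype=float)
    u = (grid[:, None] - pos[None, :]) / b
    dens = np.exp(-0.5 * u**2).sum(axis=1) / (pos.size * b * np.sqrt(2 * np.pi))
    return dens.max()

def null_peak_heights(n, chrom_length, grid, b, n_sim=500, seed=0):
    """Peak heights of n integrations placed uniformly at random (H0)."""
    rng = np.random.default_rng(seed)
    return np.array([peak_height(rng.uniform(0, chrom_length, n), grid, b)
                     for _ in range(n_sim)])

def empirical_p_value(observed, null_values):
    """Share of null peak heights at least as large as the observed one."""
    # Add-one estimate, so the p-value is never exactly zero.
    return (1 + np.sum(null_values >= observed)) / (1 + len(null_values))

def bonferroni(p_values):
    """Bonferroni-adjusted p-values (assumed correction, capped at 1)."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)

if __name__ == "__main__":
    chrom_length, b = 10_000_000, 100_000          # hypothetical settings
    grid = np.linspace(0, chrom_length, 501)
    # Toy sample: 30 evenly spread integrations plus 30 piled up at one site.
    observed = np.concatenate([np.linspace(0, chrom_length, 30),
                               np.full(30, 5_000_000)])
    null = null_peak_heights(observed.size, chrom_length, grid, b, n_sim=200)
    print("p-value:", empirical_p_value(peak_height(observed, grid, b), null))
```

To reflect a non-random integration preference, the uniform draw inside `null_peak_heights` could be swapped for a preference-aware simulation, leaving the rest of the procedure unchanged.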

Final Remarks
The goal of this contribution was to provide some statistical considerations on the real nature of a hotspot. Statistical criteria for the identification of regions which are favoured by integrations (CIS or hotspots) are needed. We approached this problem by considering a CIS not just as an area with very close integrations but as an area with very high integration density. Thus, we provide the null hypothesis based on kernel estimation of the P-H distribution when integrations are uniformly distributed over the genome. This criterion can be extended by considering a rectangular kernel, which better resembles the finite support of the integration distribution. Moreover, based on the proposed criterion, we can compare the "P-H hotspots" with the transcription start site distribution to distinguish which "P-H hotspots" reflect high gene density areas and which can be regarded as real "hotspots", thus leading to genetic investigations.