Illumina next generation sequencing data and expression microarrays data from retinoblastoma and medulloblastoma tissues

Retinoblastoma (Rb) is a pediatric intraocular malignancy and probably the most robust clinical model in which genetic predisposition to cancer has been demonstrated. Since deletions in chromosome 13 have been described in this tumor, we performed next generation sequencing to test whether recurrent losses could be detected in low coverage data. We used the Illumina platform for 13 tumor tissue samples: two pools of 4 retinoblastoma cases each and one pool of 5 medulloblastoma cases (raw data can be found at http://www.ebi.ac.uk/ena/data/view/PRJEB6630). We first created an in silico reference profile from a sequenced human genome (GRCh37.p5). From these data we calculated an integrity score to get an overview of gains and losses in all chromosomes; we next analyzed each chromosome in windows of 40 kb, calculating for each window the log2 ratio between the reads from the tumor pool and the reads from the in silico reference. Finally, we generated panoramic maps showing every lost or gained window along each chromosome, aligned to its cytogenetic bands to facilitate interpretation. Expression microarrays were performed for the same samples, and a list of over- and under-expressed genes is presented here. For this detection, a significance analysis was performed and a log2 fold change threshold was used to call significance (raw data can be found at http://www.ncbi.nlm.nih.gov/geo/, accession number GSE11488). The complete research article can be found in the journal Cancer Genetics (Garcia-Chequer et al., in press) [1]. In summary, we provide a chromosome-by-chromosome graphical overview of gains and losses in retinoblastoma and medulloblastoma, together with the integrity score analysis and a list of genes with relevant associated expression. This material can be useful to researchers who want to explore gains and losses in other malignant tumors with this approach or to compare their data with retinoblastoma.

Value of the data
Treatment-naïve retinoblastoma tissues are scarce and thus difficult to obtain and study.
With the graphics presented, the data provide an overview of the structure of the retinoblastoma and medulloblastoma genomes that is easy to comprehend, analyze and compare.
The data facilitate merging cytogenetic information and vocabulary with next generation sequencing information and vocabulary.
The format of data presentation, together with the pooling experimental design, facilitates navigation through complex data and allows recurrent events to be explored in an intuitive manner.
The simplicity of these analyses can guide researchers toward genomic areas of consistent recurrent losses for further study in these genomes.
The data can help in the design of future analyses, such as cytogenetic studies, selection of genes to study by immunohistochemistry, gene hunting, etc.
The data make it easy to compare chromosomal changes with those of other types of tumors.

Data
Next generation sequencing was performed for retinoblastoma and medulloblastoma samples. The resulting sequences, or "reads", were mapped and compared to a human reference genome, and the coverage obtained was determined to be low. This comparison allowed the construction of chromosomal maps, with positions associated to cytogenetic bands, for an overview of the data. An integrity score was used for a panoramic view of the sequencing data. Expression microarrays were also performed for the same samples, and lists of over- and under-expressed genes related to regions of gains and losses are presented [1].

Experimental design
Tumor tissue samples were collected from 8 retinoblastoma patients and 5 medulloblastoma patients. These 13 samples were deep sequenced as three pools, two for retinoblastoma and one for medulloblastoma. The in silico references were generated from the whole human genome sequence GRCh37.p5 from the Genome Reference Consortium.

Patients and DNA extraction
Eight patients diagnosed with retinoblastoma (Rb) and five with medulloblastoma (Mb) were included. DNA was isolated from snap-frozen tumor tissues. Three DNA pools were sequenced: two pools of four Rb patients each, considered biological replicates, and one pool of five Mb patients, considered a biological contrast. The original raw data are deposited at the European Nucleotide Archive and can be found at http://www.ebi.ac.uk/ena/data/view/PRJEB6630. All patients were newly diagnosed, and tumor samples were collected at the time of surgery, prior to any adjuvant therapy. Tumor tissues were collected with written informed consent from the patients' parents and as part of studies approved by the Scientific and Ethics Review Boards of each participating institution.

Illumina sequencing
Sequencing was performed at the sequencing core facility of the National University of Mexico (UNAM), located at the Biotechnology Institute in Cuernavaca, Mexico, using the Illumina Genome Analyzer IIx (Illumina). For sequencing, 5 μg of DNA per pool (1.2 μg from each retinoblastoma sample, or 1 μg from each medulloblastoma sample) was used to make libraries of approximately 200 bp using Illumina's Genomic DNA sample prep kit.

In silico genome references
A reference profile of sequencing data for each sample was obtained using an in silico technique. The whole human genome sequence GRCh37.p5 was downloaded from the Genome Reference Consortium. This reference genome was then randomly sampled to generate reference sets of reads.
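The random sampling step can be illustrated with a minimal Python sketch. It is not the authors' program: the function name `sample_reads`, the read length, the number of reads and the uniform choice of start positions are all assumptions made for illustration, since the source does not specify them.

```python
import random


def sample_reads(sequence, n_reads, read_length, seed=0):
    """Draw n_reads substrings of read_length from sequence.

    Start positions are chosen uniformly at random (an assumption;
    the source only states that the reference was randomly sampled).
    Returns a list of (start, read) tuples with 0-based starts.
    """
    if read_length > len(sequence):
        raise ValueError("read_length exceeds sequence length")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    last_start = len(sequence) - read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # inclusive on both ends
        reads.append((start, sequence[start:start + read_length]))
    return reads


# Toy usage on a made-up sequence, not on GRCh37.p5.
toy_reference = "ACGTTGCA" * 25
for start, read in sample_reads(toy_reference, n_reads=3, read_length=36):
    print(start, read)
```

In practice the same idea would be applied per chromosome of the reference FASTA, and the simulated reads would then go through the same filtering and mapping steps as the tumor reads.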

Reads mapping and symmetric relation treatment
The same treatment was applied to both the sequencing datasets and the in silico datasets. After filtering out low quality and low complexity reads, both the in silico and the experimental read sets were mapped to the GRCh37.p5 genome sequence with Bowtie2 [2]. A Perl program was written to extract the reads that map only once to the human genome. The mapped in silico reads and the mapped in vitro (tumor) reads were compared using log2 ratio values:

\[ \text{log ratio} = \log_2 \left( \frac{\text{Number of reads in Sample X}}{\text{Number of reads in Reference X}} \right) \]

The log2 ratio gives a symmetric relation between two sets of values, which facilitates interpretation of the data. For example, a value of 2 and a value of −2 are equivalent in magnitude: they correspond to ratios of 4:1 and 1:4 respectively, that is, a fold change of 4 in either direction.
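The symmetry of the log2 ratio can be checked with a short Python sketch. The function name `log2_ratio` and the choice to return `None` when either count is zero are assumptions for illustration; the source does not say how empty regions were handled.

```python
import math


def log2_ratio(n_sample, n_reference):
    """Return log2(n_sample / n_reference).

    Returns None when either count is zero, because the ratio is then
    undefined (this handling is an assumption, not from the source).
    """
    if n_sample <= 0 or n_reference <= 0:
        return None
    return math.log2(n_sample / n_reference)


# A 4:1 relation and a 1:4 relation give values of equal magnitude.
print(log2_ratio(400, 100))  # 2.0
print(log2_ratio(100, 400))  # -2.0
```

Both values correspond to a fold change of 4, which is why gains and losses of the same size appear at the same distance from zero on the plots.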

ChromDraw
ChromDraw.exe is a computer program that draws each chromosome with all its cytogenetic bands. This visualization software was written by one of the authors of the present work (Méndez-Tenorio, PhD) to facilitate the identification of losses and gains in each chromosome by comparing one sample against a reference using the log2 ratio value. The software is available upon request.

Integrity score
The integrity score was calculated as the relation between the number of reads in a chromosome and the length of that chromosome, in units of reads per kilobase. The integrity score gives an overview of losses and gains in all the chromosomes. It is calculated as follows:

\[ \mathrm{Score}_{\mathrm{Chr}_i} = \frac{\text{Number of reads in chromosome } i}{\text{Length of chromosome } i} \quad [=] \quad \frac{\text{reads}}{\text{kb}} \]

The linearity of the sequencing and in silico data can be explored by plotting the size of each chromosome, in millions of bp, against the number of reads obtained for that chromosome. Because we found a linear relationship between filtered reads and chromosome size, we developed the integrity score. Using this score it is possible to determine, at low resolution, whether any chromosome has fewer or more reads than its reference. The integrity score was calculated for all the chromosomes in the six datasets: three in silico references and three pooled tumor samples. Figs. 1 and 2 show the plots for all references and samples with and without chromosome Y, respectively. The r² value can be seen in each individual plot. Fig. 3 shows the integrity score plotted as bars; a marked difference for chromosome Y is observed in all tumor and reference samples.
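The integrity score and the linearity check can be sketched in a few lines of Python. The helper names `integrity_score` and `r_squared`, and the toy chromosome lengths and read counts, are hypothetical; they are not values from the datasets, and the r² here is the squared Pearson correlation, which is an assumption about how the reported r² was obtained.

```python
def integrity_score(n_reads, length_bp):
    """Reads per kilobase for one chromosome."""
    return n_reads / (length_bp / 1000.0)


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Toy values only: chromosome name -> (length in bp, mapped reads).
toy_counts = {
    "chrA": (1_000_000, 5_000),
    "chrB": (2_000_000, 10_100),
    "chrC": (3_000_000, 14_800),
}

for name, (length_bp, n_reads) in toy_counts.items():
    print(name, integrity_score(n_reads, length_bp))

lengths = [length for length, _ in toy_counts.values()]
reads = [n for _, n in toy_counts.values()]
print("r^2:", r_squared(lengths, reads))
```

When reads scale linearly with chromosome length, the score is roughly constant across chromosomes, so a chromosome whose score departs from that of its reference stands out.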

Chromosomes analyses by windows
Each chromosome sequence was divided into fragments of the same length (40 kb) called 'windows'. The mapped reads in each window were counted for both the in silico reference and the tumor samples and, using ChromDraw, the data were plotted as log2 ratios against the cytogenetic maps (see Supplemental Maps File and Table 1).
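A minimal Python sketch of the window analysis follows. It assumes that each mapped read is represented by a single 0-based start position on one chromosome and that windows with no reads in either dataset are reported as `None`; both choices, and the function names, are illustrative assumptions rather than the authors' implementation.

```python
import math
from collections import Counter

WINDOW_SIZE = 40_000  # 40 kb windows, as in the text


def count_reads_per_window(positions, window_size=WINDOW_SIZE):
    """Count reads per window, keyed by window index (position // size)."""
    return Counter(pos // window_size for pos in positions)


def window_log2_ratios(tumor_positions, reference_positions,
                       window_size=WINDOW_SIZE):
    """Return {window index: log2(tumor / reference)} for one chromosome.

    Windows where either count is zero map to None (an assumption).
    """
    tumor = count_reads_per_window(tumor_positions, window_size)
    reference = count_reads_per_window(reference_positions, window_size)
    ratios = {}
    for window in sorted(set(tumor) | set(reference)):
        t = tumor.get(window, 0)
        r = reference.get(window, 0)
        ratios[window] = math.log2(t / r) if t > 0 and r > 0 else None
    return ratios


# Toy positions, not real data.
tumor_pos = [10, 20, 30, 40, 40_001]
reference_pos = [5, 40_500, 79_999, 80_000]
print(window_log2_ratios(tumor_pos, reference_pos))
```

Each window index can be converted back to genomic coordinates as `index * WINDOW_SIZE`, which is what allows the ratios to be laid over the cytogenetic bands.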

Expression microarrays
Expression microarray data were generated at the microarray facility of the National University of Mexico (UNAM), located at the Instituto de Fisiología Celular in Mexico City. Total RNA was obtained using Trizol reagent (Ambion), according to the manufacturer's directions, from short-term primary cultures established from enucleated eyes and propagated for seven days in RPMI with 15% FBS. An in-house double-channel platform was used, and microarrays were printed with a human 50-mer oligo library set "A" from MWG BiotechOligo. Significant over- or under-expression of genes was determined using the Significance Analysis of Microarrays (SAM) algorithm from the TM4 suite of programs [3]. Genes with a log2 fold change greater than 2 or lower than −2 were selected as over- and under-expressed, respectively, at a false discovery rate of 0.05 [4]. A list of the over- and under-expressed genes is shown in Table 2.
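The selection thresholds can be illustrated with a small Python sketch. It does not implement SAM; it only applies the cutoffs stated above to values assumed to be already computed. The function name `classify_gene`, the use of a per-gene q-value as the false discovery rate estimate, the inclusive comparison at 0.05 and the gene names are all hypothetical.

```python
def classify_gene(log2_fc, q_value, fc_cutoff=2.0, fdr_cutoff=0.05):
    """Label a gene as 'over', 'under' or 'not significant'.

    log2_fc is the log2 fold change; q_value is assumed to be the
    false discovery rate estimate reported by the significance analysis.
    """
    if q_value > fdr_cutoff:
        return "not significant"
    if log2_fc > fc_cutoff:
        return "over"
    if log2_fc < -fc_cutoff:
        return "under"
    return "not significant"


# Made-up genes and values, for illustration only.
toy_results = [
    ("GENE_A", 3.1, 0.01),
    ("GENE_B", -2.6, 0.03),
    ("GENE_C", 2.8, 0.20),
    ("GENE_D", 1.2, 0.01),
]
for gene, log2_fc, q_value in toy_results:
    print(gene, classify_gene(log2_fc, q_value))
```

Requiring both conditions means that a large fold change with weak statistical support, such as GENE_C in the toy data, is not reported.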