Comprehensive omics analyses profile genesets related with tumor heterogeneity of multifocal glioblastomas and reveal LIF/CCL2 as biomarkers for mesenchymal subtype

Rationale: Around 10%-20% patients with glioblastoma (GBM) are diagnosed with more than one tumor lesions or multifocal GBM (mGBM). However, the understanding on genetic, DNA methylomic, and transcriptomic characteristics of mGBM is still limited. Methods: In this study, we collected nine tumor foci from three mGBM patients followed by whole genome sequencing, whole genome bisulfite sequencing, RNA sequencing, and immunohistochemistry. The data were further examined using public GBM databases and GBM cell line. Results: Analysis on genetic data confirmed common features of GBM, including gain of chr.7 and loss of chr.10, loss of critical tumor suppressors, high frequency of PDGFA and EGFR amplification. Through profiling DNA methylome of individual tumor foci, we found that promoter methylation status of genes involved in detection of chemical stimulus, immune response, and Hippo/YAP1 pathway was significantly changed in mGBM. Although both CNV and promoter methylation alteration were involved in heterogeneity of different tumor foci from same patients, more CNV events than promoter hypomethylation events were shared by different tumor foci, implying CNV were relatively earlier than promoter methylation alteration during evolution of different tumor foci from same mGBM. Moreover, different tumor foci from same mGBM assumed different molecular subtypes and mesenchymal subtype was prevalent in mGBM, which might explain the worse prognosis of mGBM than single GBM. Interestingly, we noticed that LIF and CCL2 was tightly correlated with mesenchymal subtype tumor focus in mGBM and predicted poor survival of GBM patients. Treatment with LIF and CCL2 produced mesenchymal-like transcriptome in GBM cells. Conclusions: Together, our work herein comprehensively profiled multi-omics features of mGBM and emphasized that components of extracellular microenvironment, such as LIF and CCL2, contributed to the evolution and prognosis of tumor foci in mGBM patients.

High-Fidelity DNA polymerase, Universal PCR primers and Index (X) Primer. At last, PCR products were purified (AMPure XP system) and library quality was assessed on the Agilent Bioanalyzer 2100 system.

(2) Clustering and sequencing
The clustering of the index-coded samples was performed on a cBot Cluster Generation System using TruSeq PE Cluster Kit v3-cBot-HS (Illumia) according to the manufacturer's instructions. After cluster generation, the library preparations were sequenced on an Illumina Hiseq platform and 125 bp/150 bp paired-end reads were generated.

(3) Quality control
Raw data (raw reads) of fastq format were firstly processed through in-house perl scripts. In this step, clean data (clean reads) were obtained by removing reads containing adapter, reads containing ploy-N and low-quality reads from raw data.
At the same time, Q20, Q30 and GC content the clean data were calculated. All the downstream analyses were based on the clean data with high quality.

(4) Reads mapping to the reference genome
Reference genome and gene model annotation files were downloaded from genome website directly. Index of the reference genome was built using Bowtie v2.2.3 and paired-end clean reads were aligned to the reference genome using Hisat2 v2.0.5. We selected Hisat2 as the mapping tool for that Hisat2 can generate a database of splice sets based on the gene model annotation file and thus a better mapping result than other non-splice mapping tools.

(5) Quantification of gene expression level
HTSeq was used to count the reads numbers mapped to each gene. And then FPKM of each gene was calculated based on the length of the gene and reads count mapped to this gene. FPKM, expected number of Fragments Per Kilobase of transcript sequence per Millions base pairs sequenced, considers the effect of sequencing depth and gene length for the reads count at the same time, and is currently the most used method for estimating gene expression levels [1].

(6) Differential expression analysis
Differential expression analysis was performed using the edgeR package. edgeR is one of the most popular Bioconductor packages for assessing differential expression in RNA-seq data. It is based on the negative binomial (NB) distribution and it models the variation between biological replicates through the NB dispersion parameter. This method is immediately able to handle complex experimental designs. Genes with an adjusted P-value <0.05 found by DESeq were assigned as differentially expressed [2]. Differential expression analysis of two conditions was performed using the DEGSeq R package. The P values were adjusted using the Benjamini & Hochberg method. Corrected P-value of 0.005 and log2(Fold change) of 1 were set as the threshold for significantly differential expression.

Whole Genome Sequencing (1) DNA Quantification & Qualification
The quality of isolated genomic DNA was verified by using these two methods in combination: DNA degradation and contamination were monitored on 1% agarose gels. DNA concentration was measured by Qubit® DNA Assay Kit in Qubit® 2.0 Flurometer (Invitrogen, USA).

(2) Library Preparation
A total amount of 0.5 μg DNA per sample was used as input material for the DNA library preparations. Sequencing library was generated using Truseq Nano DNA HT Sample Prep Kit (Illumina USA) following manufacturer's recommendations and index codes were added to each sample. Briefly, genomic DNA sample was fragmented by sonication to a size of 350 bp. Then DNA fragments were endpolished, A-tailed, and ligated with the full-length adapter for Illumina sequencing, followed by further PCR amplification. After PCR products were purified (AMPure XP system), libraries were analyzed for size distribution by Agilent 2100 Bioanalyzer and quantified by real-time PCR (3nM).

(3) Clustering & Sequencing
The clustering of the index-coded samples was performed on a cBot Cluster Generation System using Hiseq X PE Cluster Kit V2.5 (Illumina) according to the manufacturer's instructions. After cluster generation, the DNA libraries were sequenced on Illumina Hiseq platform and 150 bp paired-end reads were generated.

(4) Quality Control
The original fluorescence image files obtained from Hiseq platform are transformed to short reads (Raw data) by base calling and these short reads are recorded in FASTQ format, which contains sequence information and corresponding sequencing quality information. Sequence artifacts, including reads containing adapter contamination, low-quality nucleotides and unrecognizable nucleotide (N), undoubtedly set the barrier for the subsequent reliable bioinformatics analysis. Hence quality control is an essential step and applied to guarantee the meaningful downstream analysis.
The steps of data processing were as follows:  Discard paired reads if either one read contains adapter contamination (>10 nucleotides aligned to the adapter, allowing ≤ 10% mismatches);  Discard paired reads if more than 10% of bases are uncertain in either one read;  Discard paired reads if the proportion of low quality (Phred quality <5) bases is over 50% in either one read.
All the downstream bioinformatics analyses were based on the high-quality clean data, which were retained after these steps. At the same time, QC statistics including total reads number, raw data, raw depth, sequencing error rate and percentage of reads with Q30 (the percent of bases with phred-scaled quality scores greater than 30) were calculated and summarized.

(4) Reads Mapping to Reference Sequence
Valid sequencing data was mapped to the reference human genome (UCSC hg19) by Burrows-Wheeler Aligner (BWA) software [3] to get the original mapping results stored in BAM format. If one or one paired read(s) were mapped to multiple positions, the strategy adopted by BWA was to choose the most likely placement. If two or more most likely placements presented, BWA picked one randomly. Then, SAMtools [4] and Picard (http://broadinstitute.github.io/picard/) were used to sort BAM files and do duplicate marking, local realignment, and base quality recalibration to generate final BAM file for computation of the sequence coverage and depth. Mapping step was very difficult due to mismatches, including true mutation and sequencing error, and duplicates resulted from PCR amplification. These duplicate reads were uninformative and shouldn't be considered as evidence for variants. We used Picard to mark these duplicates for follow up analysis.

(6) Functional Annotation
Functional annotation was very important because the link between genetic variations and diseases would be clarified in this step. ANNOVAR [7] was performed to do annotation for VCF (Variant Call Format) obtained in the previous effort.
dbSNP, 1000 Genome and other related existing databases were applied to characterize the detected variants. Given to the significance of exonic variants, gene transcript annotation databases, such as Consensus CDS, RefSeq, Ensembl and UCSC, were also included to determine amino acid alternation. Annotation content contained the variant position, variant type, conservative prediction, etc. These annotation results would help to locate disease causal mutant. The details of annotation were exhibited in supplemented material.

Whole genome bisulfite sequencing (1) DNA quantification and qualification
Genomic DNA degradation and contamination was monitored on agarose gels. DNA purity was checked using the NanoPhotometer® spectrophotometer (IMPLEN, CA, USA). DNA concentration was measured using Qubit® DNA Assay Kit in Qubit® 2.0 Flurometer (Life Technologies, CA, USA).

(2) Library preparation and quantification
A total amount of 100 nanogram genomic DNA spiked with 0.5 ng lambda DNA were fragmented by sonication to 200-300bp with Covaris S220. These DNA fragments were treated with bisulfite using EZ DNA Methylation-GoldTM Kit (Zymo Research), and the library kit was Accel -NGS Methyl -Seq DNA Library Kit for illumina. Library concentration was quantified by Qubit® 2.0 Flurometer (Life Technologies, CA, USA) and quantitative PCR, and the insert size was assayed on Agilent Bioanalyzer 2100 system.

(3) Quality control
The library preparations were sequenced on an Illumina Hiseq XTen or Novaseq platform and 125 bp/150 bp paired -end reads were generated. Image analysis and base calling were performed with Illumina CASAVA pipeline, and finally 125bp/150bp paired-end reads were generated. First, we use FastQC (fastqc_v0.11.5) to perform basic statistics on the quality of the raw reads. Then, those reads sequences produced by the Illumina pipleline in FASTQ format were pre-processed through Trimmomatic (Trimmomatic-0.36) software use the parameter (SLIDINGWINDOW: 4:15 ; LEADING:3, TRAILING:3 ; ILLUMINACLIP: adapter.fa: 2: 30: 10 ; MINLEN:36) The remaining reads that passed all the filtering steps was counted as clean reads and all subsequent analyses were based on this. At last, we use FastQC to perform basic statistics on the quality of the clean data reads.

(4) Reference data preparation before analysis
Before the analysis, we prepare the reference data for the species we study, which includes the reference sequence fasta file, the annotation file in gtf format, the GO annotation file, the description file and the gene region file in bed format. As for the bed files, we predict repeats through RepeatMasker, followed by getting CGI track from a genome use cpg IslandExt.

(5) Reads mapping to the reference genome
Bismark software (version 0.16.3) [10] was used to perform alignments of bisulfite-treated reads to a reference genome (-X 700 --dovetail). The reference genome was firstly transformed into bisulfite-converted version (C-to-T and G-to-A converted) and then indexed using bowtie2 [11]. Sequence reads were also transformed into fully bisulfite-converted versions (C-to-T and G-to-A converted) before they are aligned to similarly converted versions of the genome in a directional manner. Sequence reads that produce a unique best alignment from the two alignment processes (original top and bottom strand) are then compared to the normal genomic sequence and the methylation state of all cytosine positions in the read is inferred. The same reads that aligned to the same regions of genome were regarded as duplicated ones.
The sequencing depth and coverage were summarized using deduplicated reads. The results of methylation extractor (bismark_methylation_extractor, -no_overlap) were transformed into bigWig format for visualization using IGV browser.
The sodium bisulfite non-coversion rate was calculated as the percentage of cytosine sequenced at cytosine reference positions in the lambda genome.

(6) Estimating methylation level
To identify the methylation site, we modeled the sum Mc of methylated counts as a binomial (Bin) random variable with methylation rate r:

mC ~ Bin (mC + umC * r)
In order to calculate the methylation level of the sequence, we divided the sequence into multiple bins, with bin size is 10kb. The sum of methylated and unmethylated read counts in each window were calculated. Methylation level (ML) for each window or C site shows the fraction of methylated Cs, and is defined as:

ML(C) = reads(mC) / [reads(mC) + reads(C)]
Calculated ML was further corrected with the bisulfite non-conversion rate according to previous studies [12]. Given the bisulfite nonconversion rate r, the corrected ML was estimated as:

ML (corrected) = (ML -r) / (l -r) (7) Differentially methylated analysis
Differentially methylated regions (DMRs) were identified using the DSS software [13][14][15]. According to the distribution of DMRs through the genome, we defined the genes related to DMRs as genes whose gene body region (from TSS to TES) or promoter region (upstream 2kb from the TSS) have an overlap with the DMRs.

Cell culture
Human GBM cell line LN18 and LN229 was obtained from the ATCC (www.ATCC.org) and maintained at 37°C in a 5% CO2 incubator DMEM/F12 plus 10% fetal bovine serum. The cell line was characterized as mycoplasma negative and validated by STR DNA fingerprinting using the AmpFLSTR Identifiler kit (#4322288, ThermoFisher) according to the manufacturer's instructions semiannually. The STR profiles were compared with known ATCC fingerprints and matched known DNA fingerprints. Addition of LIF and CCL2 was performed according to product datasheet.

TMZ IC 50 calculation
5,000 cells were seeded in 96-well plate for overnight growth and then TMZ with series concentrations was added into the wells with the presence of LIF/CCL2 or PBS as vehicle. After 72 h growth, the cell viability was measured through CCK-8 assay and the IC 50 was calculated using GraphPad Prism 5 software.

ELISA for serum LIF and CCL2 from GBM patients
The kits were purchased from R&D systems (Quantikine ® ELISA Human LIF Immunoassay #DLF00B and Quantikine ® ELISA Human CCL2/MCP-1 Immunoassay #DCP00), and the serum from seven patients (mGBM1, mGBM2, and five uGBM patients) (Table S1) were pre-treated followed by ELISA according to the manufacturer's instructions. Standard curve was established with recombinant human LIF and recombinant human CCL2.

Immunohistochemistry
The immunohistochemistry (IHC) was performed according to our previous study [16]. The evaluation of IHC staining was performed as below. Five most characteristic high-power fields (× 400 magnification) per tissue section were manually selected using an Olympus BX51 microscope (Olympus, Tokyo, Japan). The staining results were calculated using ImageJ software based on H-Score method, i.e., signal intensity × positive cell percentage. Anti-YAP (#14074) was from