Prediction and Identification of Signature Genes Expressed in Different Brain Regions through RNA-Seq Data Analysis

.

The Brain is one of the most complex, multifunctional, sensitive organs in the human body that coordinates with billions of nerves and cell and respond to different signals like environmental, stress etc. Brain has very complex structure that is composed of Cerebrum, cerebellum and brain stem which is protected by skull 1 . To understand the gene expression in brain in its various region is very important to uncover complex functioning of brain at the molecular level. A brain tumor emerges when there is uncontrolled division of cells shaping a strange gathering of cells around or inside the cerebrum. 2 That gathering of cells can influence the typical usefulness of the cerebrum action and decimate the solid cells. Cerebrum tumors ordered to generous or second rate (grade I and II) and harmful tumors or high (grade III and IV). Benign tumors are non-progressive (non-cancerous) so considered to be less aggressive, they originated in the brain and grows gradually. Nonetheless, threatening tumors are dangerous and develop quickly with vague limits they can be started in the brain itself which is called essential malignant tumor, or it is initiated somewhere else in the body and spread to the cerebrum which is called auxiliary harmful tumor. 3 Primary brain tumors represent more than 100 diverse tumor types with broadly unique sciences and clinical results, however these neoplasms often present comparable difficulties to neuro-oncologists. Dangerous gliomas are the most well-known sort of essential characteristic cerebrum tumor in grown-ups and remain incredibly deadly. Current norm of-care treatments for these diseases incorporate a medical procedure, radiation and palliative cytotoxic, which have critical results and restricted adequacy. 4 Advances in our comprehension of the atomic underpinnings of malignancy have prompted focused on sub-atomic treatments that may allow improvement in helpful adequacy and diminished harmfulness; these treatments, nonetheless, still face numerous difficulties. 5 Signal transduction pathways that are improperly directed in cerebrum diseases incorporate development factors and their receptors (for example epidermal development factor receptor, vascular endothelial development factor receptor and platelet-determined development factor receptor), which manage cell co-operations with the microenvironment and intracellular oncogenic pathways. 6 Monoclonal antibodies may have more prominent explicitness, yet face conveyance limitations preferential tumor delivery of chemotherapies, conjugated toxins and radioisotopes has been achieved through convection-enhanced delivery, intra tumoral implants and intra-arterial infusion. 7 Regardless of these advances, not many atomically focused on treatments have shown critical antineoplastic action for an expansive scope of patients conceivably because of tumor and patient heterogeneity. It is plausible, in any case, that focused treatments will be best in blend either with each other or with cytotoxic treatments. 8 Despite ongoing advances in molecular diagnostics and current individual treatment, most patients experiencing glioma actually face restricted endurance. Wide proof exists that other than age, adjuvant radio-chemotherapy, the degree of surgical tumor resection significantly affects patients' endurance. 9 In any case, patients experiencing gliomas that attack the corpus callosum infrequently go through careful tumor resection. These tumors are viewed as more forceful because of corpus callosum attack and dissemination to both hemispheres. 10 Neurodegenerative illnesses are the maximum predominate brain disorders around the globe and the affected populations are hastily increasing. lately, these diseases had been addressed the usage of the facts acquired from RNAsequencing technology to reveal the changes in gene/transcript expression, impact of variants, and pathways concerned in sickness mechanisms. 11 The corpus callosum is the biggest inter hemispheric commissure giving an association between homologous cortical territories. Consequently, authorized coordination of cerebral handling between the two halves of the globe is ascribed to the corpus callosum, upheld by the perception of expanded corpus callosum size in people equipped for finishing complex undertakings and in people with less social laterality. Recently, left mind arranged people, being believed to be more lateralized because of the regular ipsilateral area of language. 12 High-throughput sequencing technology is swiftly turning into the exceptional method for measuring RNA expression stages also referred to as RNA-seq. The advent of fast sequencing technology together with reduced prices has enabled detailed profiling of organic phenomenon ranges, impacting almost each area in lifestyles sciences and is now being adopted for scientific use. 13 RNA-seq technology permits the specific identity of gene isoforms, translocation events, and nucleotide versions and put up-transcriptional base changes. One many of the most dreams of these experiments are to spot the differentially expressed genes in two or more situations. 14 Differential gene expression analysis of RNA-seq records usually includes 3 components: normalization of counts, parameter estimation of the statistical version and checks for differential expression. Finding genes which are differentially expressed among situations is a critical part of understanding the molecular basis of phenotypic version. From decades, DNA microarrays were used notably to quantify the abundance of mRNA corresponding to different genes but with advancement in sequencing technologies and computational algorithms to solve complex biological problem excessive-throughput sequencing of cDNA (RNA-seq) has emerged as a robust competitor 15 . Since, in Microarray technology quantification of gene expression is measured based on fluorescent signals, there is huge chance of accumulation of errors in microarray result. Comparisons between different microarray result is also difficult since variation in gene expression can arise due to biasness in fluorescent dyes, laser scanners etc. Along with these limitation, computational analysis of microarray data is also very tedious and time taking, and several samples cannot be compared.
With the emergence of technique where transcripts can directly be measured without there dependency on fluorescent probes and laser scanners, has made RNAseq technology as most promising tool to study transcriptome. RNAseq technique can be used to compare several samples and to identify differentially expressed genes without any biasness. Most important factor that makes RNAseq as more efficient tool to study gene expression than traditional approach is that RNAseq can be used to identify novel genes and even minimally expressed genes can also be identified. Since, high throughput technologies like genome seq, exome seq, ChIpSeq, RNAseq are dependent on sophisticated and sensitive software's efficient tools has been developed to analyze these results. Huge number of computational techniques have been evolved for studying differential gene expression in RNA-seq records. In current research R and Bioconductor software's, Galaxy tools along with functional databases like Gene Ontology database, KEGG pathways database, were used for extensive RNAseq analysis of samples under study. These tools and software's were used for current research that is discussed further in materials and method section.

RNA-Seq Data retrieval
The RNA-Seq data was retrieved in the fastq format from European Nucleotide Archive (ENA) database (https://www.ebi. ac.uk/ena/browser/home ). The samples were of human brain from frontal cortex (fc) and corpus callosum (cc) (Accession ID-PRJNA294929 https://www.ebi.ac.uk/ena/browser/view/ PRJNA294929?show=reads ). Five samples were selected for current study in which 4 samples were of corpus callosum (cc) and 1 sample was of frontal cortex (fc) selected samples along with their run accession has been shown in table 1. RNA-seq used in this data has been done using illumina platform. Current data has been selected based on several factors like importance of RNAseq in understanding brain function, availability of data, recent trends in research in transcriptomics studies.

Software's/ Tools and Databases
R software version 4.0.2 16 along with Bioconductor packages were used for RNA-Seq analysis. R is a programming language most widely used for statistical computing and graphical visualization of large datasets. It is user friendly programming language which can be used for high throughput data analysis of various genomics and proteomics experiment. Bio-conductor (http:// bioconductor.org/) provides various packages that are most widely used for experimental data analysis like microarray, RNA Seq, NGS analysis etc. Bioconductor is a project based on and extending R that provides tools for bioinformatics analysis, visualization, comparison, statistical analysis etc. of huge datasets. In this study various packages like Rsamtools version 3.12, Genomic Alignments version 3.12, Variant Annotations version 3.12, DESeq2 version 3.12, Annotate version 3.12, GO.db version 3.12 were used at different steps of RNA-Seq analysis. [17][18][19] Galaxy tools (https://usegalaxy.org/) were also used for RNA-Seq analysis. Various tools like FastQC, Trimmomatic, RNA Star, MULTIQC, Free Bayes, Feature counts, DESeq2, Deeptools, plotPCA, compute Matrix, plot heatmap, AnnotateMyId were used for prediction of Differentially expressed genes and functional annotation of genes 20 .
Firstly, quality control analysis was done using FastQC tool of Galaxy server that generates several plots and quality estimation values which used to verify suitability of data for current study. Secondly, preprocessing of samples was done using trimmomatic tool which is used to trim reads sequence to remove adapter contamination and improve the quality of reads files for further analysis. Reads sequence were mapped to reference genome using RNA star tool and variants were also identified using Free Bayes tools of Galax server. Lastly, Differential gene expression analysis was done using DESeq2 Bioconductor package and DESeq2 tool of Galaxy also for verification and confirmation of results that is explained in detailed in result section. Functional analysis of predicted genes was done using online databases and genes were studied using references and literatures. Different databases like Gene Ontology, David database, KEGG pathway databases were used for functional enrichment of genes.
Functional Enrichment of Differentially Expressed Genes was done using Database for Annotation, Visualization, and Integrated Discovery (DAVID) database (https://david. ncifcrf.gov/) and Gene Ontology (GO) database (http://geneontology.org/). Galaxy server tools and R software were used to find the differentially expressed genes in the brain tumor samples and results were compared for final analysis and interpretation of results. The methodology used for RNA-Seq analysis has been shown in figure 1 this methodology has been partly adopted from workflow for RNA-Seq analysis provide by Galaxy server (https://training.galaxyproject.org/trainingmaterial/topics/transcriptomics/tutorials/ref-based/ tutorial.html).
The raw RNA-seq data was downloaded and uploaded at galaxy server in fastq format.
To estimate the quality control of the sequence FASTQC tool 21 was performed. Trimming of raw files was done before any statistical and mapping analysis to increase the quality and reliability of the results. Trimming filters, the low-quality regions of the reads sequence and preserves the high-quality portions of the reads sequence after FastQC Trimmomatic tool was used for trimming.
Alignment of read sequence to reference genome (mapping) For mapping the read sequence to the reference genome (hg19) RNA Star 22 tool was used. After performing RNA Star as an output, STAR log file, splice juctions.bed file and mapped. bam file was generated. To aggregate the RNA Star results MultiQC was performed and visual analysis was done using Integrative Genomics Viewer (https://software.broadinstitute.org/software/igv/ ) and the UCSC genome browser (https://genome. ucsc.edu/) . Then VCF file (Variant Calling Format) was generated using Free Bayes 23 tool. Similarly, analysis was also done using Rsamtools package 24 which reads the bam file in R software for further analysis. Variant Annotation package of R and Bioconductor was used for annotation of variants obtained from VCF file. Further annotations of these variants were done using the vcf file generated through Free Bayes tool.
Genomic Alignments package was used for alignment of RNA-Seq data with genomic sequence. It provides efficient alignemtn file for storing and manipulating short genomic alignments (typically obtained by aligning short reads to a reference genome). The result file includes, read counting, coverage value and junction detection, between RNA-Seq and genomic sequence Feature Count of the bam file the alignment produces binary alignment map file (BAM) where there is read alignment for each file. The mapped reads (mapped.bam) were used to identify number of same reads sequence using feature Counts tool and this tool is also used for annotation by using in-built annotations for human (hg19). It produces a count, summary and a length file that describes the number of reads sequence that are aligned to same genes etc. After doing feature counts of the bam file MultiQC was performed for combining all the counts result into one file 25 .
Differentially expressed genes (DEG) analysis between corpus callosum (cc) and frontal cortex (fc): Differentially expressed genes are the genes that show observable difference in their expression levels/ index or in read counts. This comparison can be used between the set of experimental conditions and DEGs can be identified that have statistical and biological significance. DESeq2 was used for DEG prediction and statistical analysis. FeatureCounts file was used as an input for DESeq2 tool. After DESeq2 tool deeptools tool like PCA and heatmap plot were generated for visualization and statistical analysis of DEGs. Heatmap was also generated by using SummaryBam tool and plot heatmap tool of galaxy 26 .
The heatmap gives an overview of similarities and dissimilarities between samples: the color represents the distance between the samples. Also, the DEG's are expressed in the plot. PCA plot is useful for visualizing the overall effect of experimental covariates and batch effects.
DESeq2 package of R and Bioconductor was also used for the prediction of DEGs and results were compared with the Galaxy tools for verification and confirmation of list of DEGs. Functional enrichment of DEG's was done for biological intervention and pathway analysis. Annotation of the DESeq2 result was done using annotateMyId tool. Likewise, in R software annotations of the identified DEG's was done using Annotate and GO.db package. Annotated list was then searched in the DAVID and Gene Ontology database for functional analysis .

RESULTS AND DISSCUSSION
Quality control analysis the quality of RNA-Seq data plays a crucial role particularly in sequence assembly and gene expression studies. The FastQC tool provides a raw data and a HTML output file which include basic statistics, sequence count, per sequence quality scores, sequence length distribution, mean quality scores, status check, per base N content, sequence duplication level, sequence GC content, overrepresented sequence, adaptor content.
FastQC compares actual and estimated scores of all parameters and highlights problematics area, pass parameters, and area of concern. All samples' files used for this study were subjected to FastQC tools with default parameters as in Galaxy server. All the samples' files pass the quality parameters when compared to actual parameters as mentioned in FastQC tool. This result shows that all files can be used for further analysis. To compare FastQC all results were compiled and compared using MultiQC tool.
Binary Alignment Mapping Analysis (BAM) and BAI (index) file of cc and fc was read in R software through which output file was obtained. Result genomic alignment positions across all the files across human genome. Genome visualization of the binary alingment map file and variant file was done using the UCSC browser and Integrative Genomics Viewer as shown in figure 2 for all the corpous callosum samples. Figure 2 shows the variation in the samples, red lines mean alternative allele whereas blue lines mean reference allele. The light blue line in the figure 2 means homozygous variants and dark blue lines means heterozygous variants. Similarly, VCF analysis was also done for fc samples.  Variation in PHGDH gene was identified in the VCF results analysis and visualized using UCSC genome browser. This alignment shows that the variation of PHGDH gene was located on chromosome 1(chr1). 3 set of variation was shown, out of which CAAGAAGG-GAATTCTC and GATCAT-TCTCGG was synonymous variant. Variant alignment of NRCAM gene was also identified which shows that the variation of NRCAM gene located on chromosome 7(chr7) only one set of variation was shown A-T which was a missense variant. Variant alignment of NKX3-1 gene was also identified which shows variation of NKX3-1 gene located on chromosome 8(chr8) one variation was identified CTC-ATT which was a missense variant.
Variant analysis was also done using Variant Annotation package of Bioconductor. VCF text files contain meta-information lines, a header line with names of column, data lines with information about a genome position, and optional genotype information on samples for each position. Data was imported and explored by using various commands of Variant Annotation package. Header information was extracted by using header () for the genomic positions rowRanges () was performed, which gives the information of the VCF file such as chromosome, position, ID fields. Location of the variants in and around the genes was done using locate Variant () method of Variant Annotation package. Result shows that 93 reads sequences were found from hg 19 in all the regions using All Variants () command.  Bam file was used to count both gDNAseq and RNA-seq reads for genomic features in in SAM/BAM files. Aggregate results of all the count files were unmapped or unassigned. In frontal cortex sample, there is large number of unmapped genes whereas compared to all other corpus callosum sample, they have comparatively less unmapped genes.
DESeq2 tool was used for DEG analysis, count table generated from the feature Count was   Figure 3 shows the estimation of variance-mean dependence from high throughput RNA-Seq data. The blue colored dot shown in the plot represents the corpus callosum dataset whereas the red colored dot represents the dataset of frontal cortex. DESeq2 tool generates three output i) normalized count for each gene in the sample ii) A graphical summary of the results and iii) a summary file for each gene which includes gene ID, log of fold change, mean of normalized counts, the pvalue, p-value adj. Figure 4 shows the heatmap of sample-to-sample distance matrix that was generated using normalized count file. Heatmap gives the overall similarities and dissimilarities between the sample. The dark blue color represents shorter distance and dark blue color shows that maximum distance in gene expression between the samples.
Dispersion plot was also visualized to study the amount of variation between the samples as shown in figure 5. The amount of shrinkage can be analyzed depending on the size of the sample, the number of coefficients, the mean of the row and the variability of the gene-wise estimates.
The MA plot (M: log ratio and A: mean average) was shown in figure 6 MA plot represents the log of fold change v/s mean of the normalized count. MA plot analysis shows that no significant genes were regulated in the range -2 to +2. The prominent result was shown in the range -4 to -8 for down regulating genes represented by the red dots in the lower half of the plot. Whereas number of up-regulating genes was present in the range +4 to +6 represented by the red dots in the upper half of the plot. Figure 7 is a PCA plot in which the colored triangular symbols shown in figure 7(a) represents sample datasets for corpus callosum, and the black encircled red dot symbol in the extreme top represents the frontal cortex sample dataset. The presence of the sample to the extreme ends of the PCA plot suggest that the sample (a) shows upregulation of the genes whereas sample (b) present at the extreme end showing negative i.e., the down regulation of the genes. Table 2 shows that list of DEGs and with statistical value. The top ten differentially expressed genes as mentioned in table 2 Table 3 shows the up-regulating genes out of which according to the research (26227) PHGDH, (4504) MT3, TUBB4A (10382), were the gene that were related to the cancer-causing pathway.
The PHGDH gene provides instructions for making the parts (subunits) that make up the phosphoglycerate dehydrogenase enzyme. The enzyme converts a substance called 3-phosphoglycerate to 3-phosphohydroxypyruvate in the first step in serine production. Serine is necessary for the development and function of the brain and spinal cord (central nervous system) 27 .
Serine is a part of chemical messengers called neurotransmitters that transmit signals in the nervous system. Proteins that form cell membranes and the fatty layer of insulation (myelin) that surrounds many nerves also contain serine. It can Fig. 8. Heatmap generated using DESeq2 package in the R be obtained from the diet, but brain cells must produce their own serine because dietary serine cannot cross the protective barrier that allows only certain substances to pass between blood vessels and the brain (the blood-brain barrier). 28 The lack of serine likely prevents the production of proteins and neurotransmitters in the brain and impairs the formation of normal cells and myelin. These disruptions in normal brain development lead to microcephaly, severe developmental delay, and the other signs and symptoms of phosphoglycerate dehydrogenase deficiency. 29 In the mutations TUBB4A gene, which provides instructions for making a protein called beta-tubulin (â-tubulin). The TUBB4A gene is found primarily in the brain, particularly in the putamen, cerebellum, and white matter. [30] The researchers suspect that problems with microtubules impair neuronal migration or the transport of important substances within neurons, leads to dysfunction and loss of these cells in the brain, particularly in the putamen, cerebellum, and white matter. Abnormalities in these brain regions underlie the movement, speech, and learning problems that can occur in TUBB4A-related leukodystrophy. 31 The DEG's were also analyzed using R and Bio-conductor package. The package used here is the DESeq2 package. This package provides methods to test for the differential expressed using negative binomial generalized linear model. As the input, it expects the count file which is generated using the galaxy feature count tool.
For the analysis of DEG's the count file was taken as an input and DESeqDataSet was created by specifying gene count data frame. After creating DESeqDataSet, the algorithm of DESeq2 was performed and the result was extracted. The result shows the base mean value, p-value, log (FC), padj value and the geneIDs.
In this figure 8, heatmap plot enlists the top 20 set of genes that were expressed in the samples. By analyzing the plot, it was seen that from gene 4538 to 284076 (shown in figure 8) shows down regulation of genes in the cc samples, whereas the same set of genes up regulated in the fc807 sample. From the gene 406936 to 2670 variation was seen between cc804 and fc807 sample.

Functional annotation and data enrichment of differentially expressed genes
For the intervention and pathway analysis on the differentially expressed genes annotate package was used along with the GO.db package for describing the entire gene ontology. Annotate package was used to associate the experimental data with the available metadata. The GO.db is a set of annotation maps describing the entire Gene Ontology assembled using data from GO. It provides detailed information about the latest version of the Gene Ontologies.
Annotation of the genes was done using the DAVID and Gene Ontology database. The geneID were annotated in DAVID database to retrieve gene symbols. Then gene symbols were further annotated using GO Database. Pathway enrichment was done to annotate which gene was involved in tumor/ cancer causing pathways. Pathway enrichment was performed over the differentially expressed gene. Out of which most effective genes are listed in Table 4. The genes expressed were ND4, COX3, B3GNT5, NKX3-1, HSPA2, PHGDH, DAAM, NRCAM and GATM. All the gene were studied in the GO database for the pathway enrichment. The table 4 shows the pathway involved in DE Gene.
Gene ontology and pathway analysis shows that 9 genes have function in cancer and these genes are associated with cancer causing pathway. This cancer-causing pathway were metabolic pathway, pathway in cancer, MAPK signalling pathway, glycine, serine and threonine metabolism (P02776) and Wnt signaling pathway. According to various studies (geneID-26227) PHGDH 32 gene plays major role in cancer metabolism, (gene Id-4538), ND4 33 gene mutation also has been identified in a small number of people with Leigh syndrome, a progressive brain disorder that typically appears in infancy or early childhood and (geneID-3306) HSPA2 34 gene is expressed in various tumors. These results highlight important genes that can be taken into consideration to explore the biology of brain tumor. Important genes that can be further used for wet lab verification are NADH, PHGDH, DAAM2, NRCAM and GATM. These genes have function in neurodevelopment, neuronal signaling, neuron development and are associated with neurological disorders like Alzheimer's disease, Parkinson's, schizophrenia, cognitive disorder, and muscular dystrophy 35 .

CONCLUSION
The formation of abnormal groups of cells inside the brain can trigger and lead to the activation of a brain tumor. The abnormal cells instantaneously affect the processing of the brain and health of a patient. In this study analysis was done to find the differentially expressed genes using R and bioconductor packages over the count file. As a result 20 genes were found that were differentially expressing and out of 20 genes, 9 genes were found involved in cancer causing pathway. By the final analysis of the results, it was seen that genes i.e., PHGDH, and HSPA2 were found involved in the cancer-causing pathways. From the pathway enrichment done using DAVID and GO database and it was found that these genes have significant role in causing brain tumor. It has been seen that the PHGDH, TUBB4A, and MTT3 gene also involved in regulation of gene expression in the samples. Finally, by comparative analysis between both R and Bioconductor results and Galaxy results, it was found that the PHGDH, COX3, MT3 gene were differentially expressed in brain tumor. Functional enrichment of selected genes and extensive literature study highlights important genes that can be considered for further wet lab experiments. Important genes that have function in neurological related disorder are, ND4 (NADH dehydrogenase 4) gene that is encoded by mitochondrial genome and has function in electron transport chain, and it is hereditary 36 . ND4 gene is found to be associated with neurological disorders like Parkinson's, schizophrenia, muscular dystrophy, neurological diseases, and it is also used as Biomarker of Alzheimer's disease. PHGDH (phosphoglycerate dehydrogenase) gene it is involved in amino acid synthesis since this gene encodes enzymes that is used to synthesize L-serine in cells. Mutation in PHGDH is associated with Neurological disorder, microcephaly, and cognitive disorders 37 .
DAAM2 (dishevelled associated activator of morphogenesis 2) have important function in development of nervous system and in Wnt signaling pathway. DAAM2 gene is involved in neurological related disorders 38 , NRCAM (neuronal cell adhesion molecule) it is involved in neuron signaling and have function in development of neuron, cell communication and allelic variant this gene is linked with autism 39 . GATM (glycine amidinotransferase) is related to the biosynthesis of creatine and any error in functioning of this gene is related to several neuro and muscular related diseases 40 . Result shows that NADH, PHGDH, DAAM2, NRCAM and GATM have important functions in brain function. Further with wet lab verification these genes can be used as drug targets, biomarkers or can have important role in understanding brain tumor biology and can have application in terms of clinical perspectives.