OpenPBTA: The Open Pediatric Brain Tumor Atlas

Summary Pediatric brain and spinal cancers are collectively the leading disease-related cause of death in children; thus, we urgently need curative therapeutic strategies for these tumors. To accelerate such discoveries, the Children’s Brain Tumor Network (CBTN) and Pacific Pediatric Neuro-Oncology Consortium (PNOC) created a systematic process for tumor biobanking, model generation, and sequencing with immediate access to harmonized data. We leverage these data to establish OpenPBTA, an open collaborative project with over 40 scalable analysis modules that genomically characterize 1,074 pediatric brain tumors. Transcriptomic classification reveals universal TP53 dysregulation in mismatch repair-deficient hypermutant high-grade gliomas and TP53 loss as a significant marker for poor overall survival in ependymomas and H3 K28-mutant diffuse midline gliomas. Already being actively applied to other pediatric cancers and PNOC molecular tumor board decision-making, OpenPBTA is an invaluable resource to the pediatric oncology community.


In brief
The OpenPBTA is a global, collaborative open-science initiative that brought together researchers and clinicians to genomically characterize 1,074 pediatric brain tumors and 22 patient-derived cell lines. Shapiro et al. create over 40 opensource, scalable modules to perform cancer genomics analyses and provide a richly annotated somatic dataset across 58 brain tumor histologies. The OpenPBTA framework can be used as a model for large-scale data integration to inform basic research, therapeutic target identification, and clinical translation.
INTRODUCTION Pediatric brain and spinal cord tumors are collectively the second most common malignancy in children after leukemia, repre-senting the leading disease-related cause of death in children. 1 Five-year survival rates vary widely across different histologic and molecular classifications of brain tumors. For example, most high-grade gliomas carry a universally fatal prognosis, while children with pilocytic astrocytoma have an estimated 10-year survival rate of 92%. 2 Recent estimates suggest that children and adolescents aged 0-19 with brain tumors in the United States lose an average 47,631 years of life. 3 The low survival rates for some pediatric tumors are multifactorial, explained partly by our lack of comprehensive understanding of ever-evolving brain tumor molecular subtypes, by difficulty drugging these tumors, and by the shortage of drugs specifically labeled for pediatric malignancies. Historically, fatal inoperable brain tumors, such as diffuse intrinsic pontine gliomas (DIPGs), were not routinely biopsied due to perceived biopsy risks and the paucity of therapeutic options. Thus, combined with rare incidences of pediatric tumors in the first place, limited availability of tissue for developing patient-derived cell lines and mouse models has hindered research.
To address these barriers, multiple national and international consortia have collaborated to uniformly collect clinically annotated surgical biosamples and associated germline materials through both observational and interventional clinical trials. The Pediatric Brain Tumor Atlas (PBTA) initiative, established in 2018 by the Children's Brain Tumor Network (CBTN, https:// cbtn.org) 4 and the Pacific Pediatric Neuro-Oncology Consortium (PNOC, https://pnoc.us), built upon 12 years of enrollment, sample collection, and clinical followup across over 30 institutions. Just as cooperation accelerates specimens and data sharing, collaboration among computational researchers, bench scientists, clinicians, and pathologists is critical for rigorous genomic analysis.
Although there has been significant progress elucidating genomic bases of pediatric brain tumor formation and progression, translating therapeutic agents to phase II or III clinical trials and subsequent FDA approvals have not kept pace. Within the last 20 years, the FDA has approved only seven targeted agents for treating pediatric brain tumors. 5 This is partly due to pharmaceutical company priorities, posing challenges for researchers to obtain therapeutic agents for pediatric clinical trials. Critically, since August 2020, an amendment to the Pediatric Research Equity Act called the ''Research to Accelerate Cures and Equity (RACE) for Children Act'' mandates that all new adult oncology drugs also be tested in children when the molecular target exists in a childhood cancer. The RACE Act, coupled with genomics advances to identify putative molecular targets in pediatric cancers, will accelerate identification of previously overlooked but effective therapeutic options for pediatric diseases.
We anticipated that a model of open collaboration would enhance the PBTA's value and provide a framework for ongoing analysis of pediatric brain tumor datasets. Leveraging diverse scientific and analytical expertise, we established the OpenPBTA, which employs an open science model with features such as analytical code review 6,7 and continuous integration, 7,8 thereby ensuring reproducibility throughout the project's lifetime. Through the OpenPBTA, we present a comprehensive, collaborative, open genomic analysis of 1,074 tumors and 22 cell lines, comprised of 58 distinct brain tumor histologies from 943 patients. The data and containerized infrastructure of the OpenPBTA have already supported discovery and translational research studies, [9][10][11][12] are actively integrated into PNOC molecular tumor board decision-making, and have provided a foundational layer for the Childhood Cancer Data Initiative's (CCDI) recently established pediatric Molecular Targets Platform (https://moleculartargets.ccdi.cancer.gov/). We anticipate that the OpenPBTA will continue to be invaluable to the pediatric oncology community.

Crowd-sourced somatic analyses to create an open PBTA
We previously performed whole-genome sequencing (WGS), whole-exome sequencing (WXS), and RNA sequencing (RNAseq) on matched tumor/normal tissues and selected cell lines 13 from 943 patients from the PBTA, consisting of 911 patients from the CBTN 4 and 32 patients from PNOC 10,14 ( Figure 1A) across various histologies phrases of therapy ( Figure 1B). We harnessed and extended the benchmarking efforts of the Gabriella Miller Kids First Data Resource Center to develop robust and reproducible data analysis workflows within the CAVATICA platform for comprehensive somatic analyses ( Figure S1; STAR Methods) of the PBTA.
A key innovative feature of the OpenPBTA is its open contribution framework used for analytical code and manuscript writing. We created a public GitHub analysis repository (https:// github.com/AlexsLemonade/OpenPBTA-analysis) to hold all analysis code downstream of Kids First workflows and a GitHub manuscript repository (https://github.com/AlexsLemonade/Open PBTA-manuscript) with Manubot 15 integration to enable realtime manuscript creation. As all analyses and manuscript writing were conducted in public repositories, any researcher in the world could contribute to the OpenPBTA following the process outlined in Figure 1C. First, a potential contributor proposed an analysis by filing an issue in the GitHub analysis repository. Next, project organizers or other contributors with expertise provided feedback about the proposed analysis ( Figure 1C). The contributor formally requested to include their analytical code and results-written in their own copy (fork) of repository-in the OpenPBTA analysis repository by filing a GitHub pull request (PR). All PRs underwent peer review to ensure scientific accuracy, maintainability, and readability of code and documentation ( Figures 1C and 1D).
Beyond peer review, we implemented additional checks to ensure consistent results for all collaborators over time (Figure 1D). To provide a consistent software development environment, we created a monolithic image with all OpenPBTA dependencies using Docker 16 and the Rocker project. 17 We used the continuous integration (CI) service CircleCI to run analytical code in PRs on a test dataset before formal code review, allowing us to detect code bugs or sensitivity to data release changes.
We followed a similar process in our Manubot-powered 15 repository for proposed manuscript additions ( Figure 1C); peer reviewers ensured clarity and scientific accuracy, and Manubot performed spellchecking.

Molecular subtyping of OpenPBTA CNS tumors
Since 2000, neuro-oncology experts and the WHO have collaborated to iteratively redefine central nervous system (CNS) tumor classifications. 18,19 In 2016, 20 molecular subtypes driven by genetic alterations were integrated into these classifications. Since CBTN specimen collection began in 2011, most tumors lacked molecular subtype information when tissue was collected. Moreover, the PBTA does not yet feature methylation arrays, which are increasingly used to inform molecular subtyping and cancer diagnosis. Therefore, we created analysis modules to systematically consider key genomic features of tumors described by the WHO in 2016 or Ryall and colleagues. 21 Coupled with clinician and pathologist review, we generated high-confidence, research-grade integrated diagnoses for 60% (644/1,074) of tumors (Table S1) without methylation data, a major innovation of this project. We then aligned OpenPBTA specimen diagnoses with whom classifications (e.g., tumors formerly ascribed primitive neuro-ectodermal tumor [PNET] diagnoses) discovered rarer tumor entities (e.g., H3-mutant ependymoma, meningioma with YAP1::FAM118B fusion), as well as identified and corrected data entry errors (e.g., an embryonal tumor with multilayer rosettes (ETMRs) incorrectly entered as a medulloblastoma) and histologically mis-identified specimens (e.g., Ewing sarcoma sample labeled as a craniopharyngioma). Uniquely, we used transcriptomic classification to subtype 122 medulloblastomas into SHH, WNT, group 3, or group 4 with MedulloClassifier 22 and MM2S, 23 with 95% (41/43) and 91% (39/43) accuracy, respectively.
We assessed the contributions of eight adult CNS-specific mutational signatures from the RefSig database 40 across tumors ( Figures 3E and S4A). Signature 1, which reflects normal spontaneous deamination of 5-methylcytosine, predominated in stage 0 and/or 1 tumors characterized by low TMBs ( Figure S2H) such as pilocytic astrocytomas, gangliogliomas, other LGGs, and craniopharyngiomas ( Figure S4A). Signature 1 exposures were generally higher in tumors sampled at diagnosis (pretreatment) compared with tumors from later phases of therapy (Figure S4B). This trend may have emerged from therapy-induced mutations that produced additional signatures (e.g., temozolomide treatment has been suggested to drive signature 11 41 ), subclonal expansion, and/or acquisition of additional driver mutations during tumor progression, leading to detection of additional signatures. We observed the CNS-specific signature N6 in nearly all tumors. Signature 18 drivers (TP53, APC, NOTCH1; found at https://signal.mutationalsignatures.com/explore/ referenceCancerSignature/31/drivers) are also canonical medulloblastoma drivers, and indeed, signature 18 had the highest signature weight in medulloblastomas. Finally, signatures 3, 8, 18, and MMR2 were prevalent in HGGs, including DMGs. selection (n = 58). Since batch correction was not feasible (see limitations of the study and Figure S7A), the following transcriptomic analyses considered only stranded samples.

Prediction of TP53 oncogenicity and telomerase activity
We applied a TCGA-trained classifier 42 to calculate a TP53 score, a proxy for TP53 gene or pathway dysregulation, and subsequently infer tumor TP53 inactivation status. We identified ''true positive'' TP53 alterations from high-confidence SNVs, CNVs, SVs, and fusions in TP53, annotating tumors as ''activated'' if they harbored one of the p.R273C or p.R248W gainof-function mutations 43 or ''lost'' if (1) the patient had a Li-Fraumeni syndrome (LFS) predisposition diagnosis, (2) the tumor harbored a known hotspot mutation, or (3) the tumor contained two hits (e.g., both SNVs and CNVs), suggesting that both alleles were affected. If the TP53 mutation did not reside within the DNA-binding domain or no alterations in TP53 were detected, we annotated the tumor as ''other,'' indicating an unknown TP53 alteration status. The classifier achieved a high accuracy (area under the receiver operating characteristic curve [AUROC] = 0.86) for rRNA-depleted, stranded tumors, but it did not perform as well on the poly-A tumors in this cohort (AUROC = 0.62; Figure S5A). We observed that ''activated'' and ''lost'' tumors had similar TP53 scores ( Figure 4B; Wilcoxon p = 0.92), contrasting our expectation that ''lost'' tumors would have higher TP53 scores. This difference suggests that classifier scores >0.5 may actually represent an oncogenic, or altered, TP53 phenotype rather than solely TP53 inactivation, as interpreted previously. 42 However, ''activated'' tumors showed higher TP53 expression compared with those with TP53 ''loss'' mutations (Wilcoxon p = 0.006; Fig-ure 4C). DMGs, medulloblastomas, HGGs, DNETs, ependymomas, and craniopharyngiomas, all known to harbor TP53 mutations, had the highest median TP53 scores ( Figure 4D). By contrast, gangliogliomas, LGGs, meningiomas, and schwannomas had the lowest median scores.
We hypothesized that tumors (n = 10) from patients with LFS (n = 8) would have higher TP53 scores, which we indeed observed for 8/10 tumors (Table S3). Although two tumors had low TP53 scores (BS_DEHJF4C7 at 0.09 and BS_ZD5HN296 at 0.28), pathology reports confirmed that both patients were diagnosed with LFS and harbored a TP53 pathogenic germline variant. These two LFS tumors also had low tumor purity (16% and 37%, respectively), suggesting that accurate classification may require a certain level of tumor content. We suggest that this classifier could be generally applied to infer TP53 function in the absence of a predicted oncogenic TP53 alteration or DNA sequencing.
We used gene expression data to predict telomerase activity using expression-based telomerase enzymatic activity detection (EXTEND) 44 as a surrogate measure of malignant potential, 44,45 where higher EXTEND scores indicate higher telomerase activity. Aggressive tumors such as DMGs, other HGGs, and medulloblastoma (MB) had high EXTEND scores ( Figure 4D), and lowgrade lesions such as schwannomas, gangliogliomas (GNGs), DNETs, and other LGGs had among the lowest scores (Table S3), supporting previous reports that aggressive tumor phenotypes have higher telomerase activity. 46-49 While EXTEND scores were not significantly higher in tumors with TERT promoter (TERTp) mutations (n = 6; Wilcoxon p value = 0.1196), scores were significantly correlated with TERC (R = 0.619, LGG, NF1-somatic 2 2 LGG, NF1-somatic, FGFR 1 1 LGG, NF1-somatic, NF1-germline, CDKN2A/B 1 1 LGG, other MAPK 11 12 LGG, RTK 8 10 LGG, RTK, CDKN2A/B 1 1 LGG, wild type 33 34  Figure 4E). Five tumors were HGGs and one was a brain metastasis of an MYCN non-amplified neuroblastoma tumor. Signature 11, which is associated with exposure to temozo-lomide plus MGMT promoter and/or mismatch repair deficiency, 51 was indeed present in tumors with previous exposure to the drug (  (B) Co-occurrence and mutual exclusivity of mutated genes. The co-occurrence score is defined as IðÀlog 10 ðPÞÞ where P is Fisher's exact test and I is 1 when mutations co-occur more often than expected or À1 when exclusivity is more common.   Table S2). This mutational process plasticity highlights the importance of careful genomic characterization and model selection for preclinical studies. Signature 18, which has been associated with high genomic instability and can induce a hypermutator phenotype, 40 was uniformly represented among hypermutant solid tumors. Additionally, all hypermutant HGG tumors or cell lines had dysfunctional TP53 (Table 2), consistent with previous findings that tumors with high genomic instability signatures require TP53 dysregulation. 40 With one exception, hypermutant and ultra-hypermutant tumors had high TP53 scores (>0.5) and telomerase activity. Interestingly, none of the hypermutant tumors showed evidence of signature 3 (present in homologous recombination-deficient tumors), 8 (arises from double nucleotide substitutions/unknown etiology), or N6 (a universal CNS tumor signature). The mutual exclusivity of signatures 3 and MMR2 corroborates previous suggestions that tumors do not generally feature both deficient homologous repair and MMR. 42 Next, we asked whether transcriptomic classification of TP53 dysregulation and/or telomerase activity recapitulate these oncogenic biomarkers' known prognostic influence. We identified several expected trends, including a significant overall survival benefit following full tumor resection (hazard ratio [HR] = 0.35, 95% CI = 0.2-0.62, p < 0.001) or if the tumor was an LGG (HR = 0.046, 95% CI = 0.0062-0.34, p = 0.003), and a significant risk if the tumor was an HGG (HR = 6.2, 95% CI = 4.0-9.5, p < 0.001) ( Figure 4F; STAR Methods). High telomerase scores were associated with poor prognosis across brain tumor histologies (HR = 20, 95% CI = 6.4-62, p < 0.001), demonstrating that EXTEND scores calculated from RNA-seq are an effective rapid surrogate measure for telomerase activity. Higher TP53 scores were associated with significant survival risks (Table S4) within DMGs (HR = 6436, 95% CI = 2.67-1.55e7, p = 0.03) and ependymomas (HR = 2003, 95% CI = 9.9-4.05e5, p = 0.005). Given this result, we next assessed whether different HGG molecular subtypes carry different survival risks if stratified by TP53 status. We found that DMG H3 K28 tumors with TP53 loss had significantly worse prognosis (HR = 2.8, CI = 1.4-5.6, p = 0.003) than those with WT TP53 (Figures 4G and 4H), recapitulating results from two recent restrospective analyses of DIPG tumors. 10,53 Histologic and oncogenic pathway clustering Uniform manifold approximation and projection (UMAP) visualization of gene expression variation across brain tumors (Figure 5A) showed expected histological clustering of brain tumors. We further observed that, except for three outliers, C11orf95:: RELA (ZFTA::RELA) fusion-positive ependymomas fell within distinct clusters ( Figure S6A). MB tumors clustered by molecular subtype, with WNT and SHH in distinct clusters and groups 3 and 4 showing some expected overlap ( Figure S6B). Notably, two MB tumors annotated as SHH did not cluster with the other MB tumors, and one clustered with group 3/4 tumors, suggesting potential subtype misclassification or different underlying biology of these two tumors. BRAF-driven LGGs ( Figure S6C) fell into three separate clusters, suggesting additional shared biology within each cluster. Histone H3 G35-mutant HGGs generally clustered together and away from K28-mutant tumors ( Figure S6D). Interestingly, although H3 K28-mutant and H3 WT tumors have different biological drivers, 54 they did not form distinct clusters. This pattern suggests that these subtypes may be driven by common transcriptional programs or have other, much stronger biological drivers than their known distinct epigenetic drivers or that we lack power to detect transcriptional differences.
We performed GSVA for Hallmark cancer gene sets ( Figure 5B) and quantified immune cell fractions using quanTIseq ( Figures 5C and S6E), results from which recapitulated previously described tumor biology. For example, HGG, DMG, MB, and ATRT tumors are known to upregulate MYC, 55 which in turn activates E2F and S phase. 56 Indeed, we detected significant (Bonferroni-corrected p < 0.05) upregulation of MYC and E2F targets, as well as G2M (cell cycle phase following S phase) in MBs, ATRTs, and HGGs compared with several other cancer groups. In contrast, LGGs showed significant downregulation (Bonferroni-corrected p < 0.05, multiple cancer group comparisons) of these pathways. Schwannomas and neurofibromas, which have an inflammatory immune microenvironment of T and B lymphocytes and tumor-associated macrophages (TAMs), are driven by upregulation of cytokines such as interferon g (IFNg), interleukin-1 (IL-1), IL-6, and tumor necrosis factor a (TNF-a). 57 GSVA revealed significant upregulation of these cytokines in hallmark pathways (Bonferroni-corrected p < 0.05, multiple cancer group comparisons) ( Figure 5B), and monocytes dominated these tumors' immune cell repertoire ( Figure 5C). We also observed significant upregulation of pro-inflammatory cytokines IFNa and IFNg in both LGGs and craniopharyngiomas when compared with either MBs or ependymomas (Bonferronicorrected p < 0.05) ( Figure 5B). Together, these results support previous proteogenomic findings that aggressive MBs and ependymomas have lower immune infiltration compared with BRAF-driven LGGs and craniopharyngiomas. 58 Although CD8 + T cell infiltration across all cancer groups was minimal ( Figure 5C), we observed a signal in specific cancer molecular subtypes (group 3 and 4 MBs) as well as outlier tumors (BRAF-driven LGG, BRAF-driven and WT GNG, and CNS embryonal tumors, not otherwise specified (NOS); Figure S6E). Surprisingly, the classically immunologically cold HGGs and DMGs 59,60 contained higher overall fractions of immune cells, primarily monocytes, dendritic cells, and natural killer (NK) cells (Figure 5C). Thus, quanTIseq might have actually captured microglia within these immune cell fractions.
While we did not detect notable prognostic effects of immune cell infiltration on overall survival in HGGs or DMGs, we found that high levels of macrophage M1 and monocytes were   Figure 5D). We further reproduced previous findings (Figure 5E) that MBs typically have low expression of CD274 (PD-L1). 61 We also found that higher expression of CD274 was significantly associated with improved overall prognosis for MB tumors, although marginal (HR = 0.0012, 95% CI = 7.5eÀ06 to 0.18, p = 0.008, multivariate Cox) ( Figure 5D). This result may be explained by the higher expression of CD274 observed in WNT subtype tumors by us and others, 62 as this diagnosis carries the best prognosis of all MB subgroups ( Figure 5E). We additionally explored the ratio of CD8 + to CD4 + T cells across tumor subtypes. This ratio has been associated with better immunotherapy response and prognosis following PD-L1 inhibition in non-small cell lung cancer or adoptive T cell therapy in multiple stage III or IV cancers. 63,64 While adamantinomatous craniopharyngiomas and group 3 and 4 MBs had the highest ratios ( Figure S6F), very few tumors had ratios greater than 1, highlighting an urgent need to identify novel therapeutics for pediatric brain tumors with poor prognosis.
Finally, we explored the potential influence of tumor purity by repeating selected transcriptomic analyses restricted to only samples with high tumor purity (see STAR Methods). Results from these analyses were broadly consistent ( Figures S7D-S7I) with results derived from all stranded RNA-seq samples.

DISCUSSION
The CBTN released the PBTA raw genomic data in September 2018 without embargo, allowing researchers immediate access to begin making discoveries on behalf of children with CNS tumors everywhere. Since this publication, the CBTN has approved over 200 data research projects 4 from 69 different institutions, with 60% from non-CBTN sites. We created the OpenPBTA as an open, real-time, reproducible analysis framework to genomically characterize pediatric brain tumors, bringing together basic and translational researchers, clinicians, and data scientists. We provide reusable code and data resources, paired with cloudbased availability of source and derived data resources, to the pediatric oncology community, encouraging interdisciplinary collaboration. To our knowledge, this initiative represents the first large-scale, collaborative, open analysis of genomic data coupled with open manuscript writing, wherein we comprehensively analyzed the PBTA cohort. Using available WGS, WXS, and RNA-seq data, we generated high-confidence consensus SNV and CNV calls, prioritized putative oncogenic fusions, and established over 40 scalable and rigorously reviewed modules to perform common downstream cancer genomics analyses. We detected expected patterns of genomic lesions, mutational signatures, and aberrantly regulated signaling pathways across multiple pediatric brain tumor histologies.
Assembling large, pan-histology cohorts of fresh frozen samples and associated clinical phenotypes and outcomes requires a multiyear, multiinstitutional framework, like those provided by CBTN and PNOC. As such, uniform clinical molecular subtyping was largely not performed for this cohort at the time of sample collection. Since DNA methylation data for these samples were not yet available to classify molecular subtypes, we created RNA-and DNA-based subtyping modules aligned with molecularly defined diagnoses. We worked closely with pathologists and clinicians to assign research-grade integrated diagnoses for 60% of tumors while discovering incorrectly diagnosed or misidentified samples in the OpenPBTA cohort. For example, we subtyped MB tumors, of which only 35% (43/122) had prior subtype information from pathology reports, using MMS2 or Me-dulloClassifier 22,23 and subsequently applied the consensus of these methods to subtype all MBs.
We advanced the integrative analyses and cross-cohort comparison via a number of validated modules. We used an expression classifier to determine whether tumors have dysfunctional TP53 42 and the EXTEND algorithm to determine their degree of telomerase activity using a 13-gene signature. 44 Interestingly, we found that hypermutant HGGs universally displayed TP53 dysregulation, unlike adult cancers like colorectal cancer and gastric adenocarcinoma, where TP53 dysregulation in hypermutated tumors is less common. 65,66 Furthermore, high TP53 scores were a significant prognostic marker for poor overall survival for patients with tumor types including H3 K28-mutant DMGs and ependymomas. We also show that EXTEND scores are a robust surrogate measure for telomerase activity in pediatric brain tumors. By assessing TP53 and telomerase activity prospectively from expression data, information usually only attainable with DNA sequencing and/or qPCR, we incorporated oncogenic biomarker and prognostic knowledge, thereby expanding our biological understanding of these tumors.
We identified enrichment of hallmark cancer pathways and characterized the immune cell landscape across pediatric brain tumors, demonstrating that tumors in some histologies, such as schwannomas, craniopharyngiomas, and LGGs, may have an inflammatory tumor microenvironment. Notably, we observed upregulation of IFNg, IL-1, IL-6, and TNFa in craniopharyngiomas, tumors difficult to resect due to their anatomical location and critical surrounding structures. Neurotoxic side effects have been reported in response to IFNa immunotherapy, 67,68 leading researchers to propose additional immune vulnerabilities, such as IL-6 inhibition and immune checkpoint blockade, as cystic adamantinomatous craniopharyngiomas therapies. [69][70][71][72][73] Our results support this endeavor. Finally, we reproduced the overall known poor infiltration of CD8 + T cells and general low expression of CD274 (PD-L1) in pediatric brain tumors, highlighting that we urgently need novel therapeutic strategies for tumors unlikely to respond to immune checkpoint blockade therapy.
While large-scale collaborative efforts may take a longer time to complete, adoption an open science framework substantially mitigated this concern. By maintaining all data, analytical code, and results in public repositories, we ensured that such logistics did not hinder progress in pediatric cancer research. Indeed, the OpenPBTA is already a foundational data analysis and processing layer for several discovery research and translational projects and will continue to add other genomic modalities and analyses, including germline, epigenomic, single-cell, splicing, imaging, and model drug response data. For example, the OpenPBTA RNA fusion filtering module led to the development of the R package annoFuse 74  Dang and colleagues showed that SHH tumors are enriched with monocyte and microglia-derived macrophages, which may accumulate following radiation therapy. 9 Expression and CNV analyses demonstrated that GPC2 is a highly expressed and copy-numbergained immunotherapeutic target in ETMRs, MBs, choroid plexus carcinomas, H3 WT HGGs, and DMGs. Foster and colleagues therefore developed a chimeric antigen receptor (CAR) directed against GPC2, which shows preclinical efficacy in mouse models. 11 Another study harnessed the OpenPBTA to integrate germline variants, discovering that pediatric patients with HGG with alternative telomere lengthening are enriched for pathogenic or likely pathogenic germline variants in the MMR pathway, possess oncogenic ATRX mutations, and have increased TMB. 12 Moreover, the OpenPBTA has enabled a framework to support real-time integration of clinical trial subjects as they enrolled in the PNOC008 HGG clinical trial 75 or the PNOC027 MB clinical trial, 76 allowing researchers and clinicians to link tumor biology to translational impact through clinical decision support during tumor board discussions. Finally, as part of the NCI's CCDI, the OpenPBTA was recently expanded into OpenPedCan, a pan-pediatric cancer effort (https://github.com/ PediatricOpenTargets/OpenPedCan-analysis) that enabled the creation of the pediatric Molecular Targets Platform (https:// moleculartargets.ccdi.cancer.gov/) in support of the RACE Act. An additional, large-scale cohort of >1,500 tumor samples and associated germline DNA is undergoing harmonization as part of CBTN CCDI-Kids First NCI and Common Fund project (https:// commonfund.nih.gov/kidsfirst/2021X01projects#FY21_Resnick) and will be immediately integrated with OpenPBTA data through OpenPedCan. The OpenPBTA has paved the way for new modes of collaborative data-driven discovery using open, reproducible, and scalable analyses that will continue to grow over time. We anticipate that this foundational work will have an ongoing, longterm impact for pediatric oncology researchers, ultimately accelerating translation and leading to improved outcomes for children with cancer.
All code and processed data are openly available through GitHub, CAVATICA, Zenodo, and PedcBioPortal (see STAR Methods).

Limitations of the study
Notably, PBTA brain tumor samples were collected over decades, and RNA samples were prepared using two distinct library preparations (stranded or poly-A; Figure S7A) by multiple sequencing centers. While we noted a strong library preparation batch effect ( Figure S7B) and a possible sequencing center batch effect ( Figure S7C), cancer groups are highly unbalanced across library preparations ( Figure S7A). We did not perform batch correction because removing batch effects across unbalanced groups may induce false differences among groups. 77,78 Instead, we circumvent batch effects by grouping only stranded RNA-seq expression data, which comprise the vast majority of the PBTA cohort, for the transcriptomic analyses presented in Figures 4 and 5. As the batch correction strategy depends highly on research goals, 78 we provide library preparation-specific expression matrices in the OpenPBTA data release for others to adapt to their needs. A second potential limitation is that performing analyses with all samples, rather than samples with high tumor purity, might result in loss of information, such as subclonal variants or low-level oncogenic pathway expression. To this end, we reperformed transcriptomic analyses using only samples with high tumor purity (see STAR Methods for details), and indeed, results were broadly consistent with those derived from the full cohort ( Figures S7D-S7I). To enable more robust statistical analysis and presentation of results, we randomly selected one independent specimen from patients with duplicate sequenced samples per tumor event rather than combining the data. This practice did not induce notable differences if the selected specimen changed over time, e.g., with a new data release. Finally, because this initial PBTA cohort mostly contains samples collected at diagnosis from one tumor section/punch, we could not reliably perform systematic intratumoral and/or longitudinal analyses, though we expect nearly 100 paired longitudinal tumors from the NIH X01 CA267587-01 pediatric brain tumor cohort to be released through OpenPedCan for future exploration.

ACKNOWLEDGMENTS
We graciously thank the patients and families who have donated tumors to the CBTN and/or the PNOC, without which this research would not be possible. Philanthropic support has ensured the CBTN's ability to collect, store, manage, and distribute specimens and data. The authors thank the following collaborators who contributed or supervised analyses present in the analysis repository that were not included in the manuscript: William Amadio, Holly

INCLUSION AND DIVERSITY
We worked to ensure ethnic or other types of diversity in the recruitment of human subjects. We worked to ensure sex balance in the selection of non-human subjects. We worked to ensure diversity in experimental samples through the selection of the cell lines. We worked to ensure diversity in experimental samples through the selection of the genomic datasets. One or more of the authors of this paper self-identifies as an underrepresented ethnic minority in their field of research or within their geographical location. One or more of the authors of this paper self-identifies as a gender minority in their field of research. One or more of the authors of this paper self-identifies as a member of the LGBTQIA+ community. One or more of the authors of this paper self-identifies as living with a disability. One or more of the authors of this paper received support from a program designed to increase minority representation in their field of research.

Lead contact
Requests for access to OpenPBTA raw data and/or specimens may be directed to and will be fulfilled by Jo Lynne Rokita (rokita@ chop.edu).

Materials availability
This study did not create new, unique reagents.
Data and code availability d Raw and harmonized WGS, WXS, and RNA-Seq data derived from human samples are available within the KidsFirst Portal 79 upon access request to the CBTN (https://cbtn.org/) as of the date of the publication. In addition, merged summary files are openly accessible at https://cavatica.sbgenomics.com/u/cavatica/openpbta or via download script in the https:// github.com/AlexsLemonade/OpenPBTA-analysis repository. Summary data are visible within PedcBioPortal at https:// pedcbioportal.kidsfirstdrc.org/study/summary?id=openpbta. Associated DOIs are listed in the key resources table. Data underlying manuscript figures are available on Zenodo. 81 d All original code was developed within the following repositories and is publicly available as follows. Primary data analyses can be found at https://github.com/d3b-center/OpenPBTA-workflows. Downstream data analyses can be found at https:// github.com/AlexsLemonade/OpenPBTA-analysis. Manuscript code can be found at https://github.com/AlexsLemonade/ OpenPBTA-manuscript. Associated DOIs are listed in the key resources table. Software versions are documented in Table S5 as an appendix to the key resources table. d Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request. d Data releases: We maintained a data release folder on Amazon S3, downloadable directly from S3 or our open-access CAVATICA project, with merged files for each analysis (See data and code availability section). As we produced new results (e.g., tumor mutation burden calculations) that we expected to be used across multiple analyses, or identified data issues, we created new data releases in a versioned manner. We reran all manuscript-specific analysis modules with the latest data release (v23) prior to submission and subsequently created a GitHub repository-tagged release to ensure reproducibility.

EXPERIMENTAL MODEL AND STUDY PARTICIPANT DETAILS
The Pediatric Brain Tumor Atlas specimens are comprised of samples from Children's Brain Tumor Network (CBTN) 4

DNA WGS alignment
We used BWA-MEM 84 to align paired-end DNA-seq reads to the version 38 patch release 12 of the Homo sapiens genome reference, obtained as a FASTA file from UCSC (see key resources table). Next, we used the Broad Institute's Best Practices 85 to process Binary Alignment/Map files (BAMs) in preparation for variant discovery. We marked duplicates using SAMBLASTER, 86 and we merged and sorted BAMs using Sambamba 87 We used the BaseRecalibrator submodule of the Broad's Genome Analysis Tool Kit GATK 88 to process BAM files. Lastly, for normal/germline input, we used the GATK HaplotypeCaller 89 submodule on the recalibrated BAM to generate a genomic variant call format (GVCF) file. This file is used as the basis for germline calling, described in the SNP calling for B-allele frequency (BAF) generation section. We obtained references from the Broad Genome References on AWS bucket with a general description of references at https://s3. amazonaws.com/broad-references/broad-references-readme.html.

Quality control of sequencing data
To confirm sample matches and remove mis-matched samples from the dataset, we performed NGSCheckMate 83 on matched tumor/normal CRAM files. Briefly, we processed CRAMs using BCFtools to filter and call 20k common single nucleotide polymorphisms (SNPs) using default parameters. We used the resulting VCFs to run NGSCheckMate. Per NGSCheckMate author recommendations, we used % 0.61 as a correlation coefficient cutoff at sequencing depths >10 to predict mis-matched samples. We determined RNA-Seq read strandedness by running the infer_experiment.py script from RNA-SeQC 90 on the first 200k mapped reads. We removed any samples whose calculated strandedness did not match strandedness information provided by the sequencing center. We required that at least 60% of RNA-Seq reads mapped to the human reference for samples to be included in analysis. During OpenPBTA analysis, we identified some samples which were mis-identified or potentially swapped. Through collaborative analyses and pathology review, these samples were removed from our data releases and from the Kids First portal. Sample removal and associated justifications were documented in the OpenPBTA data release notes.

Germline variant calling SNP calling for B-allele frequency (BAF) generation
We performed germline haplotype calls using the GATK Joint Genotyping Workflow on individual GVCFs from the normal sample alignment workflow. Using only SNPs, we applied the GATK generic hard filter suggestions to the VCF, with an additional requirement of 10 reads minimum depth per SNP. We used the filtered VCF as input to Control-FREEC and CNVkit (below) to generate B-allele frequency (BAF) files. This single-sample workflow is available in the D3b GitHub repository. References can be obtained from the Broad Genome References on AWS bucket, and a general description of references can be found at https://s3.amazonaws.com/ broad-references/broad-references-readme.html. Assessment of germline variant pathogenicity For patients with hypermutant samples, we first added population frequency of germline variants using ANNOVAR 91 and pathogenicity scoring from ClinVar 92 using SnpSift. 93 We then filtered for variants with read depth R 15, variant allele fraction R 0.20, and which were observed at < 0.1% allele frequency across each population in the Genome Aggregation Database (see key resources  table). Finally, we retained variants in genes included in the KEGG MMR gene set (see key resources table), POLE, and/or TP53 which were ClinVar-annotated as pathogenic (P) or likely pathogenic (LP) with review status of R 2 stars. All P/LP variants were manually reviewed by an interdisciplinary team of scientists, clinicians, and genetic counselors. This workflow is available in the D3b GitHub repository.

Somatic mutation calling SNV and indel calling
We used four variant callers to call SNVs and indels from paired tumor/normal samples with Targeted Panel, WXS, and/or WGS data: Strelka2, 94 Mutect2, 95 Lancet, 96 and VarDictJava. 97 VarDictJava-only calls were not retained since $39M calls with low VAF were uniquely called and may be potential false positives. ($1.2M calls were called by Mutect2, Strelka2, and Lancet and included consensus CNV calling as described below.) We used only Strelka2, Mutect2 and Lancet to analyze WXS samples from TCGA. TCGA samples were captured using various WXS target capture kits and we downloaded the BED files from the GDC portal. The manufacturers provided the input interval BED files for both panel and WXS data for PBTA samples. We padded all panel and WXS BED files were by 100 bp on each side for Strelka2, Mutect2, and VarDictJava runs and by 400 bp for the Lancet run. For WGS calling, we utilized the non-padded BROAD Institute interval calling list wgs_calling_regions.hg38.interval_list, comprised of the full genome minus N bases, unless otherwise noted below. We ran Strelka2 94 using default parameters for canonical chromosomes (chr1-22, X,Y,M), as recommended by the authors, and we filtered the final Strelka2 VCF for PASS variants. We ran Mutect2 from GATK according to Broad best practices outlined from their Workflow Description Language (WDL), and we filtered the final Mutect2 VCF for PASS variants. To manage memory issues, we ran VarDictJava 97 using 20 kb interval chunks of the input BED, Cell Genomics 3, 100340, July 12, 2023 e4 Resource ll OPEN ACCESS padded by 100 bp on each side, such that if an indel occurred in between intervals, it would be captured. Parameters and filtering followed BCBIO standards except that variants with a variant allele frequency (VAF) R 0.05 (instead of R 0.10) were retained. The 0.05 VAF increased the true positive rate for indels and decreased the false positive rate for SNVs when using VarDictJava in consensus calling. We filtered the final VarDictJava VCF for PASS variants with TYPE=StronglySomatic. We ran Lancet using default parameters, except for those noted below. For input intervals to Lancet WGS, we created a reference BED from only the UTR, exome, and start/stop codon features of the GENCODE 31 reference, augmented as recommended with PASS variant calls from Strelka2 and Mutect2. We then padded these intervals by 300 bp on each side during Lancet variant calling. Per recommendations for WGS samples, we augmented the Lancet input intervals described above with PASS variant calls from Strelka2 and Mutect2 as validation. 98

VCF annotation and MAF creation
We normalized INDELs with bcftools norm on all PASS VCFs using the kfdrc_annot_vcf_sub_wf.cwl subworkflow, release v3 (See Table S5). The Ensembl Variant Effect Predictor (VEP), 99 reference release 93, was used to annotate variants and bcftools was used to add population allele frequency (AF) from gnomAD. 100 We annotated SNV and INDEL hotspots from v2 of Memorial Sloan Kettering Cancer Center's (MSKCC) database (See key resources table) as well as the TERT promoter mutations C228T and C250T. 101 We annotated SNVs by matching amino acid position (Protein_position column in MAF file) with SNVs in the MSKCC database, we matched splice sites to HGVSp_Short values in the MSKCC database, and we matched INDELs based on amino acid present within the range of INDEL hotspots values in the MSKCC database. We removed non-hotspot annotated variants with a normal depth less than or equal to 7 and/or gnomAD allele frequency (AF) greater than 0.001 as potential germline variants. We matched TERT promoter mutations using hg38 coordinates as indicated in ref. 101 : C228T occurs at 5:1295113 is annotated as existing variant s1242535815, COSM1716563, or COSM1716558, and is 66 bp away from the TSS; C250T occurs at Chr5:1295135, is annotated as existing variant COSM1716559, and is 88 bp away from the TSS. We retained variants annotated as PASS or HotSpotAllele=1 in the final set, and we created MAFs using MSKCC's vcf2maf tool.

Gather SNV and INDEL hotspots
We retained all variant calls from Strelka2, Mutect2, or Lancet that overlapped with an SNV or INDEL hotspot in a hotspot-specific MAF file, which we then used for select analyses as described below. Consensus SNV calling Our SNV calling process led to separate sets of predicted mutations for each caller. We considered mutations to describe the same change if they were identical for the following MAF fields: Chromosome, Start_Position, Reference_Allele, Allele, and Tumor_ Sample_Barcode. Strelka2 does not call multinucleotide variants (MNV), but instead calls each component SNV as a separate mutation, so we separated MNV calls from Mutect2 and Lancet into consecutive SNVs before comparing them to Strelka2 calls. We examined VAFs produced by each caller and compared their overlap with each other ( Figure S2). VarDictJava calls included many variants that were not identified by other callers (Figure S2C), while the other callers produced results that were relatively consistent with one another. Many of these VarDictJava-specific calls were variants with low allele frequency ( Figure S2B). We therefore derived consensus mutation calls as those shared among the other three callers (Strelka2, Mutect2, and Lancet), and we did not further consider VarDictJava calls due to concerns it called a large number of false positives. This decision had minimal impact on results because VarDictJava also identified nearly every mutation that the other three callers identified, in addition to many unique mutations.

Somatic copy number variant calling (WGS samples only)
We used Control-FREEC 102,103 and CNVkit 104 for copy number variant calls. For both algorithms, the germline_sex_estimate (described below) was used as input for sample sex and germline variant calls (above) were used as input for BAF estimation. Control-FREEC was run on human genome reference hg38 using the optional parameters of a 0.05 coefficient of variation, ploidy choice of 2-4, and BAF adjustment for tumor-normal pairs. Theta2 105 used VarDictJava germline and somatic calls, filtered on PASS and strongly somatic, to infer tumor purity. Theta2 purity was added as an optional parameter to CNVkit to adjust copy number calls. CNVkit was run on human genome reference hg38 using the optional parameters of Theta2 purity and BAF adjustment for tumornormal pairs. We used GISTIC 106 on the CNVkit and the consensus CNV segmentation files to generate gene-level copy number abundance (Log R Ratio) as well as chromosomal arm copy number alterations using the parameters specified in the (run-gistic analysis module in the OpenPBTA Analysis repository).

Consensus CNV calling
For each caller and sample, we called CNVs based on consensus among Control-FREEC, 102,103 CNVkit, 104 and Manta. 107 We specifically included CNVs called significant by Control-FREEC (p value <0.01) and Manta calls that passed all filters in consensus calling. We removed sample and consensus caller files with more than 2,500 CNVs because we expected these to be noisy and derive poor quality samples based on cutoffs used in GISTIC. 106 For each sample, we included the regions in the final consensus set: 1) regions with reciprocal overlap of 50% or more between at least two of the callers; 2) smaller CNV regions in which more than 90% of regions are covered by another caller. We did not include any copy number alteration called by a single algorithm in the consensus file. We defined copy number as NA for any regions that had a neutral call for the samples included in the consensus file. We merged CNV regions within 10,000 bp of each other with the same direction of gain or loss into single region. We filtered out any CNVs that overlapped 50% or more with immunoglobulin, telomeric, centromeric, segment duplicated regions, or that were shorter than 3000 bp.

Somatic structural variant calling (WGS samples only)
We used Manta 107 for structural variant (SV) calls, and we limited to regions used in Strelka2. The hg38 reference for SV calling used was limited to canonical chromosome regions. We used AnnotSV 108 to annotate Manta output. All associated workflows are available in the workflows GitHub repository.

Gene expression Abundance estimation
We used STAR 109 to align paired-end RNA-seq reads, and we used the associated alignment for all subsequent RNA analysis. We used Ensembl GENCODE 27 ''Comprehensive gene annotation'' (see key resources table) as a reference. We used RSEM 110 for both FPKM and TPM transcript-and gene-level quantification. Gene expression matrices with unique HUGO symbols To enable downstream analyses, we next identified gene symbols that map to multiple Ensembl gene identifiers (in GENCODE v27, 212 gene symbols map to 1866 Ensembl gene identifiers), known as multi-mapped gene symbols, and ensured unique mappings (collapse-rnaseq analysis module in the OpenPBTA Analysis repository). To this end, we first removed genes with no expression from the RSEM abundance data by requiring an FPKM >0 in at least 1 sample across the PBTA cohort. We computed the mean FPKM across all samples per gene. For each multi-mapped gene symbol, we chose the Ensembl identifier corresponding to the maximum mean FPKM, using the assumption that the gene identifier with the highest expression best represented the expression of the gene. After collapsing gene identifiers, 46,400 uniquely-expressed genes remained in the poly-A dataset, and 53,011 uniquely-expressed genes remained in the stranded dataset.

Gene fusion detection
We set up Arriba 111 and STAR-Fusion 112 fusion detection tools using CWL on CAVATICA. For both of these tools, we used aligned BAM and chimeric SAM files from STAR as inputs and GRCh38_gencode_v27 GTF for gene annotation. We ran STAR-Fusion with default parameters and annotated all fusion calls with the GRCh38_v27_CTAT_lib_Feb092018.plug-n-play.tar.gz file from the STAR-Fusion release. For Arriba, we used a blacklist file blacklist_hg38_GRCh38_2018-11-04.tsv.gz from the Arriba release to remove recurrent fusion artifacts and transcripts present in healthy tissue. We provided Arriba with strandedness information for stranded samples, or we set it to auto-detection for poly-A samples. We used FusionAnnotator on Arriba fusion calls to harmonize annotations with those of STAR-Fusion. The RNA expression and fusion workflows can be found in the D3b GitHub repository. The FusionAnnotator workflow we used for this analysis can be found in the D3b GitHub repository.

QUANTIFICATION AND STATISTICAL ANALYSIS
All p values are two-sided unless otherwise stated. Z-scores were calculated using the formula z = ðx--mÞ=s where x is the value of interest, m is the mean, and s is the standard deviation.
Tumor purity (tumor-purity-exploration module) Estimating tumor fraction from RNA directly is challenging because most assume tumor cells comprise all non-immune cells, 113 which is not a valid assumption for many diagnoses in the PBTA cohort. We therefore used Theta2 (as described in the ''somatic copy number variant calling section'' method details section) to infer tumor purity from WGS samples, further assuming that co-extracted RNA and DNA samples had the same tumor purity. We then created a set of stranded RNA-Seq data thresholded by median tumor purity of the cancer group to rerun selected transcriptomic analyses: telomerase-activity-prediction, tp53_nf1_score, transcriptomic-dimension-reduction, immune-deconv, and gene set enrichment analysis. Note that these thresholded analyses, which only considered stranded RNA samples that also had co-extracted DNA, were performed in their respective OpenPBTA analyses modules (not within tumor-purity-exploration).
Recurrently mutated genes and co-occurrence of gene mutations (interaction-plots analysis module) Using the consensus SNV calls, we identified genes that were recurrently mutated in the OpenPBTA cohort, including nonsynonymous mutations with a VAF >5% among the set of independent samples. We used VEP 99 annotations, including ''High'' and ''Moderate'' consequence types as defined in the R package Maftools, 114 to determine the set of nonsynonymous mutations. For each gene, we then tallied the number of samples that had at least one nonsynonymous mutation.
For genes that contained nonsynonymous mutations in multiple samples, we calculated pairwise mutation co-occurrence scores. This score was defined as IðÀlog 10 ðPÞÞ where I is 1 when the odds ratio is >1 (indicating co-occurrence), and À1 when the odds ratio is <1 (indicating mutual exclusivity), with P defined by Fisher's Exact Test. Focal copy number calling (focal-cn-file-preparation analysis module) We added the ploidy inferred via Control-FREEC to the consensus CNV segmentation file and used the ploidy and copy number values to define gain and loss values broadly at the chromosome level. We used bedtools coverage 115 to add cytoband status using the UCSC cytoband file 116 (See key resources table). The output status call fractions, which are values of the loss, gain, and callable fractions of each cytoband region, were used to define dominant status at the cytoband-level. We calculated the weighted means of each status call fraction using band length. We used the weighted means to define the dominant status at the chromosome arm-level.
Cell Genomics 3, 100340, July 12, 2023 e6 Resource ll OPEN ACCESS A status was considered dominant if more than half of the region was callable and the status call fraction was greater than 0.9 for that region. We adopted this 0.9 threshold to ensure that the dominant status fraction call was greater than the remaining status fraction calls in a region.
We aimed to define focal copy number units to avoid calling adjacent genes in the same cytoband or arm as copy number losses or gains where it would be more appropriate to call the broader region a loss or gain. To determine the most focal units, we first considered the dominant status calls at the chromosome arm-level. If the chromosome arm dominant status was callable but not clearly defined as a gain or loss, we instead included the cytoband-level status call. Similarly, if a cytoband dominant status call was callable but not clearly defined as a gain or loss, we instead included gene-level status call. To obtain the gene-level data, we used the IRanges package in R 117 to find overlaps between the segments in the consensus CNV file and the exons in the GENCODE v27 annotation file (See key resources table). If the copy number value was 0, we set the status to ''deep deletion''. For autosomes only, we set the status to ''amplification'' when the copy number value was greater than two times the ploidy value. We plotted genome-wide gains and losses in ( Figure S3C) using the R package ComplexHeatmap. 118 Breakpoint density (WGS samples only; chromosomal-instability analysis module) We defined breakpoint density as the number of breaks per genome or exome per sample. For Manta SV calls, we filtered to retain ''PASS'' variants and used breakpoints from the algorithm. For consensus CNV calls, if |log2 ratio| > log2(1), we annotated the segment as a break. We then calculated breakpoint density as: breakpoint density = N breaks Size in Mb of effectively surveyed genome Chromothripsis analysis (WGS samples only; chromothripsis analysis module) Considering only chromosomes 1-22 and X, we identified candidate chromothripsis regions in the set of independent tumor WGS samples with ShatterSeek, 119 using Manta SV calls that passed all filters and consensus CNV calls. We modified the consensus CNV data to fit ShatterSeek input requirements as follows: we set CNV-neutral or excluded regions as the respective sample's ploidy value from Control-FREEC, and we then merged consecutive segments with the same copy number value. We classified candidate chromothripsis regions as high-or low-confidence using the statistical criteria described by the ShatterSeek authors. Immune profiling and deconvolution (immune-deconv analysis module) We used the R package immunedeconv 120 with the method quanTIseq 121 to deconvolute various immune cell types in tumors using collapsed FPKM RNA-seq, with samples batched by library type and then combined. The quanTIseq deconvolution method directly estimates absolute fractions of 10 immune cell types that represent inferred proportions of the cell types in the mixture. Therefore, we utilized quanTIseq for inter-sample, intra-sample, and inter-histology score comparisons. Gene Set Variation Analysis (gene set enrichment analysis analysis module) We performed Gene Set Variation Analysis (GSVA) on collapsed, log2-transformed RSEM FPKM data for stranded RNA-Seq samples using the GSVA Bioconductor package. 122 We specified the parameter mx.diff=TRUE to obtain Gaussian-distributed scores for each of the MSigDB hallmark gene sets. 123 We compared GSVA scores among histology groups using ANOVA and subsequent Tukey tests; p values were Bonferroni-corrected for multiple hypothesis testing. We plotted scores by cancer group using the Complex-Heatmap R package ( Figure 5B). 118 Transcriptomic dimension reduction (transcriptomic-dimension-reduction analysis module) We applied Uniform Mani-fold Approximation and Projection (UMAP) 124 to log2-transformed FPKM data for stranded RNA-Seq samples using the umap R package (See key resources table). We considered all stranded RNA-Seq samples for this analysis, but we removed genes whose FPKM sum across samples was less than 100. We set the UMAP number of neighbors parameter to 15. Fusion prioritization (fusion_filtering analysis module) We performed artifact filtering and additional annotation on fusion calls to prioritize putative oncogenic fusions. Briefly, we considered all in-frame and frameshift fusion calls with at least one junction read and at least one gene partner expressed (TPM >1) to be true calls. If a fusion call had a large number of spanning fragment reads compared to junction reads (spanning fragment minus junction read greater than ten), we removed these calls as potential false positives. We prioritized a union of fusion calls as true calls if the fused genes were detected by both callers, the same fusion was recurrent within a broad histology grouping (>2 samples), or the fusion was specific to the given broad histology. If either 5 0 or 3 0 genes fused to more than five different genes within a sample, we removed these calls as potential false positives. We annotated putative driver fusions and prioritized fusions based on partners containing known kinases, oncogenes, tumor suppressors, curated transcription factors, 125 COSMIC genes, and/or known TCGA fusions from curated references. Based on pediatric cancer literature review, we added MYBL1, 126 SNCAIP, 127 FOXR2, 128 TTYH1, 129 and TERT [130][131][132][133] to the oncogene list, and we added BCOR 128 and QKI 134 to the tumor suppressor gene list.

Oncoprint figure generation (oncoprint-landscape analysis module)
We used Maftools 114 to generate oncoprints depicting the frequencies of canonical somatic gene mutations, CNVs, and fusions for the top 20 genes mutated across primary tumors within broad histologies of the OpenPBTA dataset. We collated canonical genes from the literature for low-grade gliomas (LGGs), 25  Mutational signatures (mutational-signatures analysis module) We obtained weights (i.e., exposures) for signature sets using the deconstructSigs R package function whichSignatures() 147 from consensus SNVs with the BSgenome.Hsapiens.UCSC.hg38 annotations (see key resources table). Specifically, we estimated signature weights across samples for eight signatures previously identified in the Signal reference set of signatures (''RefSig'') as associated with adult central nervous system (CNS) tumors. 40 These eight RefSig signatures are 1, 3, 8, 11, 18, 19, N6, and MMR2. Weights for signatures fall in the range zero to one inclusive. deconstructSigs estimates the weights for each signature across samples and allows for a proportion of unassigned weights referred to as ''Other'' in the text. These results do not include signatures with small contributions; deconstructSigs drops signature weights that are less than 6%. 147 We plotted mutational signatures for patients with hypermutant tumors ( Figure 4E) using the R package ComplexHeatmap. 118 Tumor mutation burden (snv-callers analysis module) We consider tumor mutation burden (TMB) to be the number of consensus SNVs per effectively surveyed base of the genome. We considered base pairs to be effectively surveyed if they were in the intersection of the genomic ranges considered by the callers used to generate the consensus and where appropriate, regions of interest, such as coding sequences. We calculated TMB as: TMB = # of coding sequence SNVs Size in Mb of effectively surveyed genome We used the total number coding sequence consensus SNVs for the numerator and the size of the intersection of the regions considered by Strelka2 and Mutect2 with coding regions (CDS from GENCODE v27 annotation, see key resources table) as the denominator.
Clinical data harmonization WHO classification of disease types Table S1 contains a README, along with sample technical, clinical, and additional metadata used for this study.

Molecular subtyping
We performed molecular subtyping on tumors in the OpenPBTA to the extent possible. The molecular_subtype field in pbtahistologies.tsv contains molecular subtypes for tumor types selected from pathology_diagnosis and pathology_free_text_diagnosis fields as described below, following World Health Organization 2016 classification criteria. 20 We further categorized broad tumor histologies into smaller groupings we denote ''cancer groups.'' Medulloblastoma (MB) subtypes SHH, WNT, Group 3, and Group 4 were predicted using the consensus of two RNA expression classifiers: MedulloClassifier 22 and MM2S 23 on the RSEM FPKM data (molecular-subtyping-MB analysis module). The 43 ''true positive'' subtypes were manually curated from pathology reports by two independent reviewers.
corresponding chromosomal location for reads to the X and Y chromosomes. We used the fraction of total normalized X and Y chromosome reads that were attributed to the Y chromosome as a summary statistic. We manually reviewed this statistic in the context of reported gender and determined that a threshold of less than 0.2 clearly delineated female samples. We marked fractions greater than 0.4 as predicted males, and we marked samples with values in the inclusive range 0.2-0.4 as unknown. We performed this analysis through CWL on CAVATICA. We added resulting calls to the histologies file under the column header germline_sex_estimate.
Selection of independent samples (independent-samples analysis module) Certain analyses required that we select only a single representative specimen for each individual. In these cases, we identified a single specimen by prioritizing primary tumors and those with whole-genome sequencing available. If this filtering still resulted in multiple specimens, we randomly selected a single specimen from the remaining set.
Quantification of telomerase activity using gene expression data (telomerase-activity-prediction analysis module) We predicted telomerase activity of tumor samples using the recently developed EXTEND method, 44 with samples batched by library type. Briefly, EXTEND estimates telomerase activity based on the expression of a 13-gene signature. We derived this signature by comparing telomerase-positive tumors and tumors with activated alternative lengthening of telomeres pathway, a group presumably negative of telomerase activity.

Survival models (survival-analysis analysis module)
We calculated overall survival (OS) as days since initial diagnosis and performed several survival analyses on the OpenPBTA cohort using the survival R package. We performed survival analysis for patients by HGG subtype using the Kaplan-Meier estimator 163