Damage-Net: A program for DNA repair meta-analysis identifies a network of novel repair genes that facilitate cancer evolution

Highlights • Damage-Net is a downloadable program for quick and easy meta-analysis of large datasets.• Users can use the datasets in the curated database of DNA-repair based studies and also upload their own data for analysis.• Built in pan-cancer mutation and survival analysis allows genes of interest to be queried for DNA-repair roles in cancer.• Analysis of the database found a novel group of DNA repair genes and a major role for DNA-repair in adrenocortical carcinoma.


Introduction
A variety of DNA damaging agents assault our genomes every day, leading to the formation and subsequent repair of a range of DNA lesions. These different lesions are repaired by various repair pathways, each optimised for the resolution of a different form of DNA damage and each required for the successful maintenance of genomic integrity.
The major DNA repair pathways include base excision repair (BER), nucleotide excision repair (NER), single-strand break repair (SSBR) and double-strand break repair (DSBR) [1,2]. BER repairs non-bulky DNA adducts, such as oxidised nucleotides [3], NER repairs bulky, helix distorting DNA adducts [4], SSBR repairs single-strand cuts in the double-helix [5] and DSBR functions via multiple downstream repair pathways, predominantly non-homologous end-joining (NHEJ) and homologous recombination (HR) [6]. Despite these highly complex and efficient repair pathways, DNA repair is imperfect and can therefore lead to the formation of mutations throughout our genomes [7,8]. The accumulation of mutations can disrupt cellular functions ultimately leading to disease and is most notably the central driver of carcinogenesis [9,10]. These mutations can alter protein function, disrupting normal cellular function and, in some cases, acquiring characteristics which drive tumorigenesis. Mutagenesis can be accelerated via mutation of the genes encoding DNA repair factors, leading to impaired protein function and therefore reduced repair fidelity [11,12]. This mechanism constitutes the basis for cancer pre-disposing conditions, such as hereditary breast and ovarian cancer syndrome [11]. A broad understanding of the processes governing DNA repair is therefore critical for the development of new therapeutic strategies for cancer.
High-throughput investigations are often used to identify novel factors involved in DNA repair. These studies include proteomics of chromatin associated proteins [13,14], interactomes of known components of the DNA repair machinery [15,16] and the study of protein modifications following exposure to DNA damaging agents [17,18]. These methods have proven effective at identifying new participants in a variety of DNA repair pathways and there is therefore a wealth of published data available utilising these techniques. However, only a small fraction of the proteins identified have been investigated in depth meaning there is a substantial amount of information left unused in these datasets. In addition, the integration of multiple datasets investigating the same processes magnifies their power as successful hits can be identified with greater confidence, outlier removal is more accurate and a far greater understanding of the results can be obtained [19]. Despite Abbreviations: TCGA, The Cancer Genome Atlas; ACC, adrenocortical carcinoma; DNet-genes, Damage-Net identified genes; ACC-Net genes, ACC specific Damage-Net identified genes. these benefits, and the abundance of available data, integration of multiple published datasets or even comparisons between new data and published data are rare. We believe this is likely due to the barriers of searching for data in the literature and the bioinformatic analysis of the results.
To address this, we wrote a graphical program named Damage-Net (All downloads and information available at www.damage-net.co.uk), that simplifies the integration of large datasets, mainly proteomics and genetic screens, allowing for comparisons to be made between datasets and for commonalities to be found. Users can easily investigate the curated database of 39 published results that investigate DNA damage and upload their own results to interrogate. In addition, data from The Cancer Genome Atlas (TCGA) database has been analysed and incorporated into Damage-Net to allow users to easily determine if genes of interest correlate with mutational burden or alter survival in 33 cancer subtypes compiled by the TCGA. In this study, we showcase the powerful capabilities of Damage-Net by using it to conduct large scale analysis on all 39 of its datasets, identifying a group of genes that are potentially novel DNA repair factors. We then conduct further analysis on these genes, finding a network of genes cooperating in adrenocortical carcinoma (ACC) that associates with mutational burden and substantially reduce patient survival.

An overview of Damage-Net functionality and features
Damage-Net was written in the Python coding language with the database backend written in SQL and was compiled to function as an independent program on Windows, Linux and Mac OS. It currently contains the results of 39 publications that investigated various aspects of DNA repair on a large-scale (Supplementary Table 1) , as well as the gene expression and mutation data for 33 cancer sub-types from the TCGA database. The proteomic datasets were all selected from peer-reviewed publications that investigated aspects of the DNA damage-response. These datasets had to be generated in a way that identified DNA-damage specific results, e.g. via comparing a DNA damage treated sample to an untreated sample or immunoprecipitation of a known repair factor. In addition, we chose to not impose selection criteria to prevent bias within the database, instead it is designed to grant easier access to currently available data. We will continue to search for and add datasets to the database, those currently available are all the datasets we have identified so far. We designed Damage-Net with simplicity and ease of use as our primary focus, to make data availability and analysis as accessible as possible. To achieve this, there are four main functions; "Gene search", "Pan-Cancer Mutation and Survival Analysis", "Top Hits" and "Compare" (Fig. 1A).
The "Gene search" function takes a single gene symbol, e.g. LIG3, and returns a list of all studies in the database that identified this gene, or its encoded protein, as a significant hit. In the case of ligase 3, encoded by LIG3, we find interactomes for MDC1, REV7, KU70 and the NHEJ complex, a study on chromatin binding in response to UV and a number of modification studies (Fig. 1B). This function can instead be used to search for protein families by searching the common identifier of the gene family you wish to search, e.g. BRCA, LIG, SF3B etc. The result is a list of all studies that identified members of this family followed by which members were identified (Fig. 1C). For the investigation of a specific gene, the user can also conduct pan-cancer mutation and survival analysis. This function opens a results window containing a heatmap of the association of the gene's expression with point mutations, insertions and deletions across the genomes of all cancer sub-types (Fig. 1D). It also creates a table of survival statistics for all cancer subtypes, highlighting the significant results, and generates Kaplan-Meier curves, the sub-type of which can be switched using a dropdown selection (Fig. 1D).
Both the "Top Hits" and "Compare" functions require a list of at least 2 datasets to compare which can either be selected from the dropdown lists or be copied over from the "Gene search" results ( Fig. 1E). Analysing a group of datasets with "Top Hits" opens a results window containing a table of 3 columns: Rank, Rate and Gene, which correspond to the rank of the gene in the results, the number of the studies the gene was identified in and the gene symbol respectively (Fig. 1F). The "Compare" function determines the overlap in the results of all possible combinations of the datasets searched and opens a results window containing a table and a Venn diagram (Fig. 1G). The table has 3 columns: comparison, overlap and genes, corresponding to the datasets being compared, the number of genes in the overlap between the compared datasets and the genes in this overlap respectively. The Venn diagram is only computed for up to 6 studies as greater than this becomes very difficult to plot. The results of "Cancer mutation and survival analysis", "Top Hits" and "Compare" can all be saved in multiple table formats and all plots can be saved in a variety of image formats.
Although simple in their design, these functions provide a very effective basis of analysis for the investigation of both specific targets and large datasets. No other available tools provide genomic mutation association based on gene expression, instead other tools such as cBio-Portal only provide analysis of mutations within a gene of interest, not across the genome. Other tools do provide survival curve creation, such as kmplot, however at the point of this publication these tools have a more limited set of cancer sub-types. Importantly, as well as these additional features, no tool currently provides a database similar to Damage-Net allowing users to quickly and easily query published results. The integration of different datasets can be used to uncover additional aspects of the role a particular protein may play in DNA repair, for example by finding its interacting partners and by determining the modifications that are made to it. The mutation analysis made possible by the integration of TCGA data can also significantly expand on this by providing the types of mutation associated with this gene and the cancer sub-types/tissues in which this occurs. Correlating expression of a gene with the frequency of mutations across the genome is a rarely used approach for helping to define a proteins involvement in DNA repair but one of significant benefit, both inside and outside the context of cancer.

Damage-Net identifies a group of potentially novel DNA repair factors
Initially we tested the capability of Damage-Net to selectively identify DNA repair factors. To do this we took 3 groups of genes; a group of classical DNA repair factors, a group of randomly selected genes and a group of DNA repair genes identified since 2013 to represent newly discovered DNA repair genes that would be the target of these investigations. Each group consisted of 40 genes which were searched using "Gene Search" for the number of datasets that identified them and the average number of datasets were then compared ( Fig. 2A). This showed that both classical DNA repair and newly discovered DNA repair genes are similarly likely to be identified in ~7 studies, whereas random genes on average were identified in <1 study. This suggests that the gene search function is selective for DNA repair factors while unrelated genes are not enriched.
Comparing the results of multiple published studies also has significant advantages for data analysis. By combining studies, we can reduce off-target hits that are often specific to a particular methodology, while true hits can be enriched [17,20]. Multiple DNA-repair studies have previously conducted and combined several proteomic approaches in order to elucidate results shared between the approaches [17,20,25]. Target genes by these approaches tend to be more accurate as they have been consistently identified across multiple approaches [38][39][40][41][42]. In addition, the combination of different methodologies can provide additional information as to the processes these genes may be involved in with different experimental phenotypic outputs, resulting in more in-depth conclusions. For example, 53BP1 and BRCA1 have an antagonistic relationship to promote either non-homologous end-joining (NHEJ) or homologous recombination (HR) repair of DSBs respectively. However, interactomes of both 53BP1 and BRCA1 identify each other as well as the 53BP1 co-operator RIF1 and the BRCA1 co-operator BARD1. From this, it is not possible to determine the relationship between these proteins or their interactors, however by comparing these results with that of a genetic screen investigating positive regulators of HR, only BRCA1 and BARD1 are identified as common between all three studies ( Supplementary Fig. 1A). This indicates that BRCA1 and BARD1 interact to promote HR and although BRCA1 interacts with 53BP1 and RIF1 they are dispensable for the process of HR. Therefore, by combining datasets of differing methodologies, a more in depth understanding of the results can be obtained. Interestingly, the other genes identified in all three of the studies in this example included the HR factors BRCA2 and RPA2 alongside HMBOX1 and ZNF207, both of which have recently been showed to have roles associated with the DNA damage response [43][44][45][46].
To demonstrate the capabilities of Damage-Net, we chose to integrate all 39 datasets together to identify a high-confidence group of novel DNA repair factors. Initially, we ranked all genes by the number of studies they were identified in to find the most enriched genes across all datasets. This found that the genes with the strongest enrichment in these studies were a variety of well known DNA repair factors, such as BRCA1, MSH6 and the full MRN complex (MRE11, RAD50, NBN) (Fig. 2B). We believe that in conjunction with our previous investigation of the gene search function, this suggests that Damage-Net is effective at selectively identifying DNA repair genes and that DNA repair genes are enriched within the database. However, we also noticed that whereas over 9000 genes were identified in total, over 60 % of these were identified in only one of the 39 studies investigating DNA repair (Fig. 2B). There are likely a number of contributors to this, though since Damage-Net contains such a wide variety of datasets, a wide range of results is to be expected. To further analyse this phenomenon, we paired datasets that used comparable approaches to see if this variation was due to the inherent differences between the designs of our 39 studies. Comparing two studies that used Stable Isotope Labeling by/with Amino acids in Cell culture (SILAC) to identify proteins binding to chromatin in response to UV irradiation found only a 5-8 % overlap and comparing two BRCA1 interactome studies found only a 13-21 % overlap (Supplementary Fig. 1B). Pairing studies that investigated proteins that are either phosphorylated, ubiquitylated or acetylated in response to DNA damage fared slightly better, though still only had overlaps of 15-38 % ( Supplementary Fig. 1C). A pair of genetic screens showed the lowest overlap of only 1-2 % ( Supplementary Fig. 1D), however this is likely due to the different targets screened by these two studies since both derived their targets through separate processes. These results suggest that differing experimental approaches partly contribute to the results of Fig. 2B, though there may also be an element of random variation from false-positive results since the paired studies also showed significant variability (Supplemental Fig. 1B-D). However, it should be noted that even these paired studies, although examining similar processes, can have a number of experimental differences such as cell lines, treatment types and doses, incubation periods and sample preparation. This further supports our hypothesis for integrating multiple datasets, as by comparing analogous studies we can isolate the high-confidence, commonly identified results and exclude those that are cell-type or method specific.
Therefore to extract a high confidence group of hits from these datasets, we categorised each study into three groups; DNA-damage response (DDR)-related, which comprises chromatin association and repair factor interactome studies, modified, which comprises all damage induced modification studies, and genetic screens, which includes genome-wide CRISPR sensitivity screens and RNAi based reporter assay screens (Fig. 2C). To remove the large number of low confidence hits, each group of genes was characterised as the genes identified in at least 2 studies of the groups from Fig. 2C.
Comparing these gene groups found that there is a significant overlap between the DDR-related and modified groups, but minimal overlap with the genetic screens group (Fig. 2D). In addition, correlation analysis of the frequency each gene is identified in these groups found a significant correlation between the DDR-related and modified groups, but not between either of these groups and the genetic screens group LGG, LUAD, MESO, PAAD, SARC, SKCM and UCEC, in the cross-section of (G). Statistical testing was done using an unpaired, directional Wilcox test with Bonferroni correction, ns indicates p-value > 0.05, *** indicates p-value < 0.001. (Fig. 2E). Due to these low correlations and the lack of consistent results with this group and the others we decided not to include genetic screens in the analysis, as they limit our ability to combine datasets. However, it should be noted that this reduced overlap with the other groups is likely due to the significant experimental differences between them and that these genetic screens do produce accurate results. For example, comparing the group of genetic screen results with the DNA repair gene ontology gene group found that ~26 % of the genetic screen results were known DNA repair genes ( Supplementary Fig. 1E).
We then took the comparison of the DDR-related and the modified groups and to only focus on the genes that are novel repair genes, we compared these with the DNA-repair gene ontology group (Fig. 2F). We found a reassuring overlap of 101 genes between the 680 Damage-Net identified genes and the 483 known DNA repair genes. We therefore carried forward these 579 genes in the overlap between the DDR-related and modified groups for further analysis. For simplicity, we termed this group of Damage-Net identified genes "DNet-genes".

DNA repair factors and DNet-genes are associated with high mutational burden and poor-prognosis in cancer
We first conducted gene ontology enrichment to understand the composition of the 579 DNet-Genes. This found a significant enrichment for genes related to various chromatin organisation and RNA-related processes (Fig. 3A). This suggests that the genes may be more peripheral to the repair process, facilitating chromatin remodelling around damage sites rather than specifically contributing to DNA processing.
To investigate possible DNA repair roles for these DNet-genes in cancer we conducted a mutation association study. To do this for a gene we first split patients into two groups; low and high expression for the gene of interest, then calculate the log2 fold change of the average mutation frequency in the high/low expression group. This allows us to determine how expression of a gene correlates with the mutational burden in a given cancer sub-type and can be very mechanistically informative. For example, breast cancer is commonly deficient in homologous recombination (HR) resulting in increased mutations due to repair of double-strand breaks via the more error-prone non-homologous end joining (NHEJ) [47,48]. When we look at the association of the HR factor RAD50 with deletions in breast cancer, we find a strong negative association indicating that loss of RAD50 promotes mutations whereas the NHEJ factor KU80 shows a strong positive association indicating higher KU80 expression promotes mutation (Fig. 3B). This analysis is a powerful tool in investigating DNA repair in cancer as we can also characterise tissue and mutation specific mechanisms of repair factors, e.g. in contrast to our previous example, the UV-response gene RAD18 associates positively with point mutations in skin cutaneous melanoma whereas P53 is negatively associated (Fig. 3C).
We calculated the average association of all DNet-genes with tumour mutational burden in all cancer sub-types and compared this to the averages for canonical DNA-repair genes and a randomly selected group of genes (Fig. 3D). This found that both the canonical DNA repair and DNet-gene groups were associated with increased tumour mutational burden in most cancer sub-types, whereas random genes showed few significant changes. To get an overall view on the effect these genes are having on tumour mutational burden, we compared the effect of each gene on mutational burden in each of our 3 gene groups (Fig. 3E). Compared to random genes, both DNA repair and DNet-genes show a significant increase in tumour mutational burden association. Interestingly, comparing DNet-genes to DNA repair genes shows no statistical difference, suggesting the DNet-genes cause a comparable increase in tumour mutational burden to canonical DNA repair genes. An association between DNA-repair gene activity and altered tumour mutational burden is well documented [49,50], though interestingly this is commonly identified as reduced activity increasing mutations, unlike our observation here.
Next, we used a similar analysis to investigate the effect of DNet-genes on patient survival. Comparing the average survival change for each group across the different cancer sub-types found that there is significant variation between the sub-types, but that DNA-repair genes and DNet-genes are more commonly associated with an average decrease in survival (Fig. 3F). An overall analysis, similar to that done previously, found that both DNA repair genes and DNet-genes generally show a significant decrease in survival whereas the random control group shows a subtle skew to increased survival (Fig. 3G). Again, we found no statistical difference between the repair and DNet-genes, suggesting that these groups have very similar effects on both mutational burden and patient survival. All mutation and survival associations for all gene groups are available Supplementary Table 2.
Interestingly, comparing the sub-types where DNA-repair genes cause increased mutational burden with those in which they cause decreased survival found a large overlap between the two groups ( Fig. 3H-I). This indicates that these two features may be linked, such that the increased mutational burden drives the decreased survival.

A sub-group of DNet-genes lower cancer prognosis by increasing mutational burden
To investigate this link, we used an integrated analysis of all DNetgenes in all cancer sub-types for their effect on mutational burden and survival simultaneously. This allowed us to characterise 4 groups of coupled change: group 1 where genes associate with reduced mutations and increased survival, group 2 where genes associate with increased mutations and increased survival, group 3 where genes associate with decreased mutations and decreased survival and group 4 where genes associate with increased mutations and decreased survival (Fig. 4A). The DNet-genes show a distinct shift into group 4, demonstrating that the increase in tumour mutational burden and the decrease in patient survival shown previously are directly linked. The survival curves for a group 4 example, KIF23 in adrenocortical carcinoma, is shown in Supplementary Fig. 2C and for a group 2 example, USP34 in rectal adenocarcinoma, is shown in Supplementary Fig. 2D. When conducting this analysis on randomly selected genes, it showed very few hits in all 4 groups (Supplementary Fig. 2A), whereas the DNA repair genes showed a very significant shift into group 4 ( Supplementary Fig. 2B). This follows our previous results that found DNet-genes and DNA repair genes to have comparable effects of significantly increased mutation association and significantly decreased survival association (Fig. 3E, G). This result therefore suggests that both DNA repair and DNet-gene expression are commonly associated with increased tumour mutational burden that drives a decrease in patient survival in cancer. Additionally, the strong similarity between the effects of DNA repair genes and DNet-genes support the hypothesis that DNet-genes are previously uncharacterised DNA repair genes, either directly involved in DNA processing or are factors associated with the repair process.
Calculating the size of each of the 4 groups from Fig. 4A showed that compared to random genes, DNA repair and DNet-genes have a substantial enrichment for group 4 (Fig. 4B). A notable feature of the DNetgene hits in group 4 is that they appear to be dominated by hits in adrenocortical carcinoma (ACC) (Fig. 4C) which was also found to be one of the most significantly changing sub-types with DNet-genes for both mutational burden and survival (Fig. 3D, F). This means that there are large groups of DNet-genes whose expression collectively correlates with both mutational burden and decreased survival in specific cancer sub-types. We therefore conducted functional analysis on the group 4 ACC genes using gene-ontology enrichment and network analysis to determine their biological roles. The group 4 ACC genes primarily consist of genes relating to cytoskeleton/chromatin organisation and gene expression/ribosomes (Fig. 4D) which we found particularly interesting in light of the burgeoning fields of RNA binding proteins and chromatin remodellers in DNA repair [51][52][53][54][55][56][57].
The group 4 enrichment of canonical DNA repair genes (Supplementary Fig. 2B) suggests that it is common for these genes to associate with high mutational burden that drives decreases in survival. Given that the group 4 DNet-genes seem to almost entirely consist of an individual cancer sub-type and a specific set of genes, we chose to investigate this gene network further.

Network analysis in ACC discovers DNet-genes are part of a larger co-expression network
Since the group 4 ACC genes all show similar results in our analysis, we used hierarchical clustering on their expression in ACC to determine their co-expression (Fig. 5A). This found that over half of these genes strongly co-express in ACC, which we were able to validate via principal component analysis (Fig. 5B). The tight co-expression of these genes suggests they form a network in ACC that could encompass genes beyond the original set identified by Damage-Net. Since we originally identified these genes through proteomic investigations, we may have only identified a small fraction of this ACC network due to the identification biases of these proteomic approaches [38][39][40][41][42]58]. It is therefore possible that this network is much larger and consists of a significant number of genes that are elusive to proteomic investigations.
To address this, we used the same hierarchical clustering approach as before to assess the co-expression of our genes with the entire protein coding transcriptome (Fig. 5C). This indeed identified a surprising number of genes that strongly co-express with our correlating group 4 ACC genes. We filtered this matrix for the strongest correlations to gain a group of 204 genes which, combined with the 34 genes with strong correlations from the group 4 ACC genes, gave us 238 genes in this network, termed ACC-Net. It is possible that this network collectively coordinates the high mutational burden and low survival that we observed in ACC. To determine if this co-expression network is only a feature of ACC or is a more general biological feature, we calculated the average co-expression between ACC-Net genes in all cancer sub-types (Fig. 5D). This showed that these genes co-express to varying levels in all sub-types, but that their co-expression is strongest in ACC, suggesting they are part of a consistent biological pathway that is of particular importance in ACC.
To gain a general understanding of the representation of DNA repair genes in the ACC-Net group, we calculated the percentage of ACC-Net genes that are in the DNA repair gene ontology group (Fig. 5E). Over 25 % of the ACC-Net genes were found to be known DNA repair genes compared to <2% of the total transcriptome. Gene ontology of the 238 genes found that most genes were associated with cell cycle regulation with some associated with DNA conformation/biosynthesis and a substantial enrichment for DNA double-strand break repair (Fig. 5F). The large enrichment and variety of cell-cycle related processes was particularly striking and so a more in-depth gene ontology investigation was conducted to further understand this. This found that the cell-cycle related genes were primarily associated with the processes of DNA replication, chromatin organisation and cytoskeletal/mitotic spindle regulation, rather than regulation of cell-cycle progression or  Fig. 3A-B). This gives us valuable insight into the possible functions this network may be facilitating, suggesting they may be a part of DNA repair processes specifically regarding DNA replication and via regulating chromosome segregation.

The ACC-Net gene network strongly associates with mutational burden and decreased survival in ACC
To assess the effect of the ACC-Net gene expression on ACC we employed a similar analysis to that used for the DNet-genes to investigate mutational burden and survival. Comparing the mutational burden association of these genes with that of all DNA repair genes and our randomly selected control group we found that in ACC, DNA repair genes cause a significant increase in mutational burden, but that the ACC-Net genes cause an even greater increase (Fig. 6A). In fact, the expression of every gene in this group correlated with an increase in mutational burden. Conducting a similar comparison to assess changes in ACC patient survival showed that every ACC-Net gene is associated with significantly decreased survival and overall, they show an even greater decrease in survival than DNA repair genes (Fig. 6B). This unsurprisingly results in all ACC-Net genes being simultaneously associated with increased mutational burden and reduced survival; however, their association appears to be more continuous than we have seen previously as there is a significant negative correlation between mutational burden and survival (Fig. 6C). In comparison, the results from canonical DNA repair genes showed more variable results and our randomly selected control genes only had a few very variable hits ( Supplementary Fig. 4A-B). Expanding our analysis to all cancer subtypes found that ACC-Net genes still showed a strong simultaneous association with increased mutational burden and decreased survival ( Supplementary Fig. 4C), and was far greater than that found for DNetgenes (Fig. 4A) or DNA-repair genes (Supplementary Fig. 3B). Combined with our previous finding that these genes co-express in all sub-types (Fig. 5D), this suggests that although high expression of this gene network is most significant in ACC, that it still has an important role in general and also supports the accuracy of our findings in ACC. ACC-Net mutation and survival associations are available in Supplementary  Table 2.
These results are indicative of the individual effect of ACC-Net genes; however, since we believe these genes to be operating in conjunction with each other, we needed to assess the impact of their collective expression. To determine the cumulative effect of ACC-Net genes, we grouped ACC patients into high and low expression groups based on the cumulative expression of all these genes (Fig. 6D). These groups allow us to assess the overall effect expression of the whole network has on ACC. High expression of the network results in a highly significant drop in survival, reducing the 5-year survival rate from ~80 % in low expression patients to ~30 % in high expression patients (Fig. 6E). This is a greater decrease than compared to the cumulative effect of DNA repair gene expression (Supplementary Fig. 4D) and overall expression of our random control genes showed no change in survival rate (Supplementary Fig. 4E). In addition, overall ACC-Net expression strongly associates with increased rates of point mutations, insertions and deletions (Fig. 6F), an increase that was slightly greater than the overall effect of all DNA repair genes. As a final piece of analysis, we asked the question of what effect ACC-Net gene expression has on tumour stage. High expression of ACC-Net genes strongly skews towards late stage, metastatic ACC, whereas low expression skews towards early-stage ACC (Fig. 6G). This indicates that ACC-Net gene expression not only reduces survival, but also advances cancer progression to metastasis.
We then tested these findings by replicating this analysis using an alternative dataset that is independent of the TCGA [59]. This dataset includes gene expression quantification via microarray, mutation quantification via exome sequencing and patient outcome data. We found strikingly similar results with this dataset as with the TCGA data (Supplemental Fig. 4F-H); increased expression of individual ACC-Net genes commonly associates with increased mutational burden and reduced patient survival (Supplemental Fig. 4F). In additional, patients with a combined high expression of all ACC-Net genes showed significantly reduced survival (Supplemental Fig. 4G), comparable to that found in the TCGA dataset, and showed increased association with point mutations, insertions and deletions (Supplemental Fig. 4H).
As an experimental validation of ACC-Net genes playing a role in the DNA damage response, we conducted apoptosis assays in response to DNA damage along with siRNA mediated depletion of two ACC-Net target genes, KIF4A and MKI67. BJ-5ta fibroblasts were treated with 25μM etoposide, a common ACC chemotherapeutic, after knockdown of our targets alongside a negative control scramble siRNA and to a positive control KU70 siRNA (Fig. 6H). Both KIF4A and MKI67 knockdown resulted in significantly increased apoptosis in response to etoposide treatment over the control siRNA (Fig. 6I). Interestingly, KIF4A depletion showed an increase in apoptosis similar to that of KU70 depletion, whereas MKI67 depletion resulted in an even greater increase than this, however these differences may be due to depletion efficiency (Fig. 6H). In addition, we repeated these experiments using 2μM cisplatin instead of etoposide as this is also a common ACC chemotherapeutic. This experiment gave very similar results to the etoposide treatment, with all siRNA showing significantly increased apoptosis compared to control siRNA and MKI67 siRNA showing an even greater increase than KIF4A or KU70 siRNA (Fig. 6I).
These results demonstrate that in ACC, this network of genes coordinates processes that lead to an increase in genomic mutations being maintained in the cancer genomes, which leads to progression of the cancer and, ultimately, an extraordinary decrease in patient survival. In addition, the corroboration of results for canonical DNA repair gene with those for the ACC-Net genes further supports that the ACC-Net genes operate via a DNA repair based mechanism.

Discussion
Here we showed the various features of Damage-Net, a program we built for meta-analysis and DNA repair investigation, along with further analysis that highlighted its effectiveness at identifying novel DNA repair genes. In addition, we found that there is an inherent aspect of irreproducibility in many datasets. The reason for this could simply be the large variation between experiments in different labs, as even results using very similar methodologies commonly give significantly different results. However, some publications have shown that even between biological replicates there is significant variability. For example, a recent publication conducted 3 replicates of affinity-purification massspec (AP-MS) and BioID of the Ku70 interactome and found a surprising lack of overlap between the AP-MS replicates [20]. Their BioID results were significantly more reproducible and identified a far greater number of significant hits suggesting that the variation in the AP-MS results may be method specific. This highlights the necessity of data integration for the selection of significant hits, as comparing your own datasets to comparable publishes datasets can filter out a majority of the irreproducible noise. Damage-Net provides a simple way to conduct such a comparison as well as other analytical techniques to aid the user's projects.
To delve deeper into our database and to further assess the capabilities of Damage-Net, we integrated every Damage-Net dataset to identify a high confidence group of potential DNA repair genes which we termed DNet-genes. We found these genes were mainly associated with chromatin organisation and RNA-related processes, indicating that they may be required for chromatin remodelling in response to DNA repair, facilitating access to damage sites for repair factors. Expression of these genes was associated with increased tumour mutational burden and reduced cancer survival. These features were found to also be associated with canonical DNA repair gene expression and that in these features DNet-genes were very comparable to canonical repair genes. Given the expanding research implicating RNA processes and chromatin organisation genes in DNA repair, we hypothesise that these genes represent a large group of novel DNA repair genes that are either integral to canonical repair processes, or form distinct sub-processes that function alongside canonical repair processes. In addition, association with high mutational burden and poor cancer prognosis, indicates expression of these facilitates mutational burden via the repair of high levels of DNA damage. These mutations then confer an evolutionary advantage to the cancer cells, advancing their progression and providing the genetic heterogeneity necessary to adapt to environmental challenges, such as therapeutics, thus reducing patient survival.
Further investigation concluded that a majority of DNet-genes responsible for these changes do so primarily in adrenocortical carcinoma (ACC). We used hierarchical clustering to identify a co-expressing group of these ACC related genes and subsequently identified a network of genes that co-express with this entire group in ACC, termed ACC-Net genes. These genes all strongly co-express and contain cell-cycle and chromatin regulators as well as over 25 % being DNA repair factors. Every ACC-Net gene was associated with increased mutational burden and reduced survival in ACC and overall the expression of these genes had a greater effect on these features than canonical repair genes. Survival analysis found that overall expression of all these genes has a strong effect in ACC as high expression reduced 5-year survival from ~80 % to ~30 %, while increasing the average mutational burden by almost 2-fold and caused patients to be far more likely to have late-stage, metastatic ACC. Interestingly, multiple reports have implicated DNA repair and ACC-Net gene groups in ACC metastasis and prognosis [60][61][62][63]. Metastatic ACC was previously shown to have substantially higher mutation rates than primary ACC and a hypermutation phenotype was identified to be associate with mutations in the DNA repair genes ATM, MSH3 and MSH6 [61]. Another study also identified mismatch repair genes, particularly MSH6, to be associated with high mutation frequency in ACC [60]. In addition, a smaller investigation found alterations in DNA repair genes, including ATM and BRCA2, to be associated with increased mutations and poor prognosis [63]. All these studies highlighted DNA repair, chromatin remodelling and cell cycle genes to be important markers for ACC prognosis and key to ACC metastasis, with some also suggesting the use of DNA repair inhibitors as a novel therapeutic strategy [60,62,63]. These studies further support our identification of the ACC-Net gene network as key prognostic markers and potential therapeutic targets.
We therefore hypothesise a mechanism by which these genes are required for DNA repair, and their elevated expression leads to increased mutation frequency at repair sites, overcoming the normally toxic levels of DNA damage to prevent apoptosis, but increasing mutational load. These mutations provide a large source of genetic variation, increasing the rate of evolution of the cancer, advancing its progression, and providing a greater ability for the cancer cells to adapt to and overcome therapeutics. All this contributes to the remarkable decrease in survival observed with high expression of these genes.
ACC is a relatively rare cancer but has quite poor prognosis for reasons that are not yet fully understood. Treatment options are limited and inadequate, leading to all treatments of unresectable and metastatic ACC to be solely palliative [64]. Prognosis has also been shown to be tightly linked to tumour grade, with 5-year survival of stage I patients being ~82 % while for stage 4 patients it is as low as 6-13 % [60, [65][66][67]. Some studies have found individual genes to be significant drivers of ACC, such as β-catenin and insulin-like growth factor 2 [68,69], however these individual genes often only apply to very few cases and offer limited mechanistic understanding and therapeutic possibilities. More recently, genomic investigations identified a number of genes associated with cell-cycle regulation and DNA replication repair as ACC prognostic markers [60,70] which were the first indications of dysregulation of these processes in ACC. Our findings here are therefore a significant step forward in our understanding of ACC pathogenesis. The original identification of these genes through Damage-Net and their correlations with mutational burden links their role directly to DNA repair, giving us the foundation for the likely mechanism by which they act, and the tumour stage and survival analysis presents a remarkable clinical result of the expression of these genes. Expression of these genes is therefore not only an effective prognostic marker but could also represent a large source of potential therapeutic targets, since their inhibition would theoretically reduce the cancers ability to progress and would improve patient survival. The large network of genes identified here gives multiple avenues for targeted therapeutics increasing the chance of success and allowing for combination therapy to reduce side-effects [71][72][73]. In addition, therapeutic targeting of these genes would likely be very effective as a combination therapy with traditional genotoxic therapeutics, such as cisplatin or radiotherapy, potentially increasing their efficacy at lower doses and thus reduced side-effects.

Conclusion
Damage-Net is a powerful tool in the investigation of DNA damage that greatly enhances users' analysis of their large-scale results and also increases their ability to distinguish novel DNA repair factors from previously published datasets. With Damage-Net, we identified a network of novel DNA repair genes that are collectively responsible for a substantial decrease in adrenocortical carcinoma survival via facilitating increased mutational burden and therefore advanced cancer progression and treatment resistance. These represent a promising group of potential targets for therapeutics that could greatly enhance the efficacy of DNA damaging therapies, improving the currently dismal prognosis of ACC.

Damage-Net access and installation
Damage-Net is available to download for all operating systems at www.damage-net.co.uk under downloads. This site also has information regarding installation and use and has a page for contacting us regarding problems, suggestions and questions.
The installers and source code are also available at https://source forge.net/projects/damage-net and the source is additionally available at https://github.com/aldob/Damage-Net

Damage-Net design
Damage-Net was written using Python 3.6.5. Multiple packages were used to create the variety of features such as SQL for writing the database and tkinter for writing the graphical interface. The following is the full list of all the packages used: The scripts were compiled into an executable file using pyinstaller v3.5. The Windows installer was created using Inno Setup v6.0.3 and the Linux and MacOS installers were created using makeself v2.4.2.

Gene ontology enrichment analysis
Gene ontology enrichment was completed using PANTHER [74,75]. Term extraction was done by taking the lowest branch of each enrichment that had greater than 10 proteins identified and a p-value below 0.01. This was done to provide an informative and accurate balance between term specificity and statistical confidence.
The DNA repair gene ontology group was obtained from the MGI database [76].

Cancer mutation analysis
TCGA mutation and gene expression data were queried and downloaded using the TCGAbiolinks R package v2.10.5 [77]. The files obtained were of harmonised mutation data analysed using the MuTect2 pipeline in MAF file format. Gene expression data were downloaded in FPKM-UQ file format and combined per-patient with the mutation rates to create a database containing transcriptome wide gene expression and mutation frequencies per patient.
Tumour mutational burden was calculated as the total number of identified genomic mutations for a sample, i.e. point mutations + insertions + deletions. Mutation associations were calculated by dividing patients into two groups of high and low expression for a gene of interest, calculating the mean mutation frequency for each group and then calculating the log2 fold change of the high/low expression groups' mutational frequencies.

Cancer survival analysis
TCGA survival data and cancer stage metadata were queried and downloaded using the TCGAbiolinks R package v2.10.5 [77]. Cox proportional hazards statistical analysis was used to calculate survival coefficients and p-values via the survival R package v3.1.8. Survival hazard ratios were generated by comparing groups of patients with either high or low expression of a gene of interest. This was then log2 transformed to make the data easier to understand by creating a value where positive values denote increased survival and negative values denote decreased survival.

Network analysis
Protein interaction networks were modelled in Cytoscape v3.7.1 [78] using interaction strengths calculated using the STRING database [79]. Groups were designated based on the gene ontology biological process enrichment of the genes.

Co-expression analysis
Expression correlation was calculated between the expression of every gene being queried using the Pearson correlation model and the subsequent matrix was clustered via hierarchical clustering using the complete linkage model. The same matrix was also used for principal component analysis. The following base R functions were used for this analysis: cor.test, dist, hclust and prcomp. In addition, heatmap.2 from gplots v3.0.1.2 was used for heatmap plotting.
Strong correlation partners were extracted from transcriptome-wide correlation analysis by filtering the matrix for all genes that have a correlation value of > = 0.5 with over half of the target genes.

Cell-culture and transfection
BJ-5ta fibroblast cells were obtained from the ATCC and cultured in Dulbecco Modified Eagle ′ s Medium (DMEM, GibCo) fortified with 10 % Fetal bovine serum and 2 mM L-glutamine.

Annexin-V apoptosis assay
BJ-5ta cells were seeded on six well plates at a density of 25,000 cells per well and incubated for 24 h. Cells were then transfected and incubated for 48 h in transfection media before this was replaced with treatment media containing either 25μM etoposide (Merck, E2600000) or an equivalent amount of DMSO, or in the case of cisplatin treatment media containing 2μM cisplatin or an equivalent amount of water. Cells were then incubated for a further 24 h and then cells were harvested with trypsin along with their media and PBS wash, pelleted at 300 g for 5 min, washed once in PBS and re-pelleted. Cells were then resuspended in 500 μL Annexin V binding buffer (10 mM HEPES pH 7.4, 140 mM NaCl, 2.5 mM CaCl) supplemented with 1 u L Annexin V-FITC (ab14082) and 5 u L 7-AAD (BioLegend, 420403) and incubated for 10 min in the dark before scanning on an Attune NxT flow cytometer.