Associations between DPP9 expression, survival and gene ex- pression signature in human hepatocellular carcinoma: Com- prehensive in silico analyses

Dipeptidyl peptidase (DPP) 9, DPP8, DPP4 and fibroblast activation protein (FAP) are the four enzymatically active members of the S9b protease family. Associations of DPP9 with human liver cancer, exonic single nucleotide polymorphisms (SNPs) in DPP9 and loss of function (LoF) variants have not been explored. Human genomic databases including The Cancer Genome Atlas (TCGA) were interrogated to identify DPP9 LoF variants and associated cancers. Survival and gene signature analyses were performed on hepatocellular carcinoma (HCC) data. We found that DPP9 and DPP8 are intolerant to LoF variants. DPP9 LoF variants were most often associated with uterine carcinoma. Two DPP9 intronic SNPs that have been associated with lung fibrosis and COVID-19 were not associated with liver fibrosis or cancer. All four DPP4-like genes were overexpressed in liver tumours and their joint high expression was associated with poor survival in HCC. Increased DPP9 expression was associated with obesity in HCC patients.. High expression of genes that positively correlated with overexpression of DPP4, DPP8, and DPP9 were associated with very poor survival in HCC. Enriched pathways analysis of these positively correlated genes featured Toll-like receptor and SUMOylation pathways. This comprehensive data mining suggests that DPP9 is essential for human survival and the DPP4 protease family is important in cancer pathogenesis.


Introduction
Dipeptidyl peptidase (DPP) 9, DPP4, DPP8 and fibroblast activation protein (FAP) are the enzymatic members of the DPP4 family of serine proteases and have been implicated in cancer pathogenesis [1]. DPP9 is ubiquitously expressed in tissues [2] and has diverse roles in cell behaviours [3], immune regulation [4][5][6] and cancer [7][8][9]. In nonsmall-cell lung cancer, poor 5-year survival has been associated with greater expression The amino acid sequences of the short and long forms of DPP9 were downloaded from the UniProt Consortium (https://www.uniprot.org/uniprot/Q86TI2) on 7th April 2020. The long form of DPP9 and its amino acid numbering was used in this study.

Public databases for accessing loss of function variants
Public databases assessed in this study are listed in Table 1.

Loss of function variants
The exonic variants that are predicted to cause LoF of DPP9 were exported from gnomAD, TCGA and COSMIC. Only the LoF variants that are annotated as either stop gained or frameshift were studied. The numbering of amino acid sequence of DPP9 in GnomAD is of the short form of DPP9. Therefore, to use the DPP9 long form, 29 was added to all protein sequence numbers for the gnomAD DPP9 LoF analyses. Variants having sequence number mismatch with the long form of DPP9 were excluded. In addition, if frameshift variants exported from gnomAD and TCGA occurred where there is no corresponding structural residue, these frameshift variants were considered to be unlikely to be DPP9 exonic LoF variants and thus were excluded from this study.

Intronic variants
The two DPP9 intronic variants rs12610495 and rs2109069 have the smallest p values that have been associated with critical illness caused by COVID-19 [33]. More genetic information about these two intronic variants was exported from PhenoScanner V2 [40].

Genetic variant tolerance parameters
The observed to expected ratio (o:e) of LoF, the probability of LoF intolerance (pLI) and haploinsufficiency score were retrieved from gnomAD and DECIPHER [38]. Tolerance percentile and GDI values were retrieved from GeneCards human gene database.

Gene correlation analysis
UALCAN is a comprehensive web database for investigating complete genetic or molecular data of cancers and allows in silico validation of genes of interest [41]. It is available online at http://ualcan.path.uab.edu/index.html. Both positively and negatively correlated genes of DPPs and FAP were accessed using liver hepatocellular carcinoma (LIHC; HCC) dataset of TCGA. The expression level of each gene was displayed as Transcripts Per Million (TPM).

Survival analysis
OncoLnc database contains TCGA survival data to mRNA expression level for 8,647 patients from 21 cancer projects [42]. Survival data for 360 patients with HCC and 540 patients with UCEC were exported from OncoLnc. Normal patients' data (n=50) were exported from UALCAN [41]. HCC patient survival was also analysed based on patients' BMI (Table 2). SurvExpress is an online database for cancer gene expression data using survival analysis, which is able to access the influence of co-expression of multiple genes on survival in a specific cancer type [43]. The genes that correlate in common amongst the DPP4 gene family were imported into SurvExpress and analysed for their associations with HCC patient survival (June 2016 dataset). Patient survival time and status were exported from SurvExpress and Kaplan-Meier curves were plotted.

Reactome pathways enrichment analysis
Reactome enrichment pathways of positively or negatively correlated genes in common between DPP9 and DPP8 were analysed from STRING database. The positively correlated genes in common amongst DPP9, DPP8 and DPP4 were imported into Concen-susPathDB (CPDB) [44] and identified as gene symbols. The analysis was performed under gene set analysis and over-representation analysis.

Data and statistical analyses
Figures and Kaplan-Meier curves were generated in GraphPad Prism (version 8.4.2, San Diego, CA, USA). Data statistics were analysed in GraphPad Prism. Survival analysis was performed in Prism using Log-rank (Mantel-Cox) to assess the significance between the groups. The Mann-Whitney U test was used to compare differences between two independent groups. The Kruskal-Wallis one-way analysis of variance with Tukey's posthoc test was used to compare differences amongst multiple groups. Significance was assigned to p-values *<0.05, **<0.01, ****<0.0001.

DPP9 loss of function (LoF) variants and disease associations
Large-scale human genetic databases provide useful tools to study protein LoF variants and their associations with diseases. DPP9 LoF has not been explored previously. Therefore, we assessed data from several human genomic databases to identify DPP9 LoF and potentially associated diseases. Gaining a premature termination codon (Ter), or stop codon, anywhere in a gene of the DPP4 gene family is expected to ablate enzyme activity [45]. We have shown previously that mature termination mutants of DPP8 are inactive, even if lacking only the C-terminal alpha-helix [45]. Similarly, any frameshift variant is almost certain to lack activity due to nonsense sequence causing a Ter. Therefore, the exonic gene variants considered to be LoF are very likely true LoF.
In gnomAD, there were 18 exonic variants of DPP9 that were tagged as LoF. However, examining the sequence of long-form DPP9 in the UniProt Consortium database revealed that some variants were mismatched with UniProt and thus were excluded. Only exonic LoF variants that were annotated as either stop-gained or frameshift variants were included in this study. We identified three frameshift variants and seven stop gained variants that were exonic in gnomAD (Table 3). There was no record of associated diseases for each mutation variant. In TCGA, we identified eight DPP9 LoF variants (Table 4), and none of these variants were the same as found in gnomAD. Four DPP9 variants were designated as stop gained, with three of them associated with primary uterine corpus endometrial carcinoma (UCEC). Another four DPP9 variants were designated as frameshift mutation, with half of them found in patients diagnosed with stomach adenocarcinoma ( Table 4). None of the DPP9 variants in TCGA were found to be associated with HCC. There were many more DPP8 than DPP9 LoF gene variants in TCGA, found in diverse types of cancer, mostly in UCEC and one in HCC (Supplementary Table 1). In a separate database, COSMIC, eleven DPP9 variants had been tagged as nonsense mutation and were substitution mutations resulting in Ter codons (Table 5). These variants were associated with different types of carcinoma, most often lung carcinoma. All variants that were likely to cause LoF for DPP8 or DPP9 have a Ter. All exonic DPP9 variants found in each of the three databases were mapped to the DPP9 protein structure (Supplementary Figure  1). Stop gained 1.2E-5 1 Gln461 and Arg111: no corresponding residue in the primary structure of DPP9. Table 4. DPP9 LoF variants in TCGA. Simple somatic mutation (SSM) affected frequency was calculated as the number of cases affected by a specific mutation in a TCGA disease project divided by the number of cases tested for SSM in that disease project in TCGA. *Termination codon (Ter).  There were 22 DPP8 nonsense mutations in COSMIC, most often in colon adenocarcinoma and one in HCC (Supplementary Table 2). DPP9 variants Glutamic acid (Glu) 154 to Ter and Arginine (Arg) 303 to Ter were both reported in both TCGA and COSMIC, but with different nucleotide substitutions. Another anomaly was that in COSMIC, both endometrioid carcinoma and colon adenocarcinoma were associated with the Arg303* variant, but in TCGA only endometrioid carcinoma was recorded.

Intronic SNPs in DPP9
Intronic SNPs rs12610495 and rs2109069 in DPP9 are genome-wide significant loci for severe COVID-19, idiopathic pulmonary fibrosis and idiopathic interstitial pneumonia [29,30,32,33]. These SNPs are located 5' to the open reading frame, at nucleotides 4717672 and 4719443 of chromosome 19. They are important DPP9 SNPs because they are relatively common, with allelic frequencies of 0.706 to 0.828 for rs12610495 and 0.186 to 0.321 for rs2109069 (Supplementary Table 3). These SNPs have been associated with lower DPP9 expression levels in human lung [33]. We found that these SNPs are not associated with liver fibrosis or cancer in the PhenoScanner database.

Genetic variation tolerance in DPP9 and the DPP4 gene family
Some databases incorporate scoring systems that determine tolerance towards genetic variations. We analysed genetic variation tolerance in the DPP4 gene family using five scores: observed to expected ratio (o:e), probability of LoF intolerance (pLI), haploinsufficiency score, Tolerance Percentile, and gene damage index (GDI) scores. When o:e is approximately 0.1, the gene is haploinsufficient and heterozygous LoF is very probably not tolerated; when o:e is between 0.1 and 0.5, the heterozygous LoF may be tolerated; when o:e is close to 1, LoF is tolerated [46]. DPP8 and DPP9 had low o:e, little more than 0.1, DPP4 had a moderate o:e less than 0.5 and FAP has a high o:e ( Table 6). The probability of LoF intolerance (pLI) is the probability of a given gene being haploinsufficient; genes with pLI ≥ 0.9 are extremely intolerant to LoF variation and genes with pLI ≤ 0.1 are LoF tolerant [46]. The pLI scores for DPP9, DPP8, DPP4 and FAP (Table  6) suggested that DPP8 and DPP9 LoF variants are poorly tolerated. Haploinsufficient genes only contain a single copy of a diploid genome that changes the organism phenotype from normal to a disease state, and many of the LoF associated haploinsufficient genes contribute to metabolic disorders and tumorigenesis [47]. The haploinsufficiency score was greater for DPP9 than for DPP8, DPP4 or FAP, and, like the o:e and pLI, indicated that DPP9 is likely to exhibit haploinsufficiency whereas DPP4 and FAP are unlikely to exhibit haploinsufficiency. The gene damage index (GDI) scores for these four enzymes were low ( Table 6), suggesting that LoF variants are likely to be disease-causing [48]. Genes in the 25th Tolerance Percentile and below are considered relatively intolerant to variation. Unlike the other haploinsufficiency indications, GDI and Tolerance Percentile did not discriminate amongst the four genes (Table 6). Table 6. Genetic variation tolerance in the DPP4 gene family. The o:e ratio, pLI and haploinsufficiency score were retrieved from gnomAD and DECIPHER. Tolerance percentile and GDI values were retrieved from the GeneCards human gene database. o:e, the observed to expected ratio. pLI, the probability of LoF intolerance. GDI, gene damage index score.

Association of DPP9 expression with patient survival in HCC and UCEC
Previous studies reported associations of DPP9 expression with patient survival in non-small-cell lung cancer [7], colorectal cancer [8] and oral squamous cell carcinoma [9]. Many DPP8 and DPP9 variants were identified in patients with UCEC and HCC in our study (Table 4 and Supplementary Tables 1-2). Therefore, we analysed whether DPPs expression is associated with patient survival in HCC and UCEC. Clinical parameters and mRNA expression of DPPs in patients with HCC and UCEC were exported from TCGA. Patient demographics are presented in Supplementary Table 4. Kaplan-Meier analysis for DPP9, DPP8, DPP4 and FAP mRNA suggested that expression of none of these four genes was associated with HCC overall survival (OS) ( Figure 1A-D). In contrast, patients with low DPP9 mRNA expression had significantly shorter survival time for UCEC ( Figure 1E). Less than 15% of UCEC patients with low DPP9 expression had survived for 10 years after diagnosis. Unlike DPP9, DPP8, DPP4 and FAP expression did not associate with OS in UCEC ( Figure 1F-H). This data indicates that DPP9 might have an important role in UCEC.

Association of DPP9 expression with clinical parameters and survival outcome in HCC
We next explored closely the association of DPP9 with clinical parameters relevant to HCC, including gender, age, BMI and cancer stages. In HCC patients, all four DPP genes were significantly upregulated in tumours compared to normal tissue (Figure 2A-D). Significantly more expression of DPP9 was detected in female than male ( Figure 2E). DPP9 expression did not differ significantly amongst six age groups ( Figure 2F) and different cancer stages ( Figure 2G). Obesity is a well-known risk factor for HCC [16], so we examined DPP9 expression with HCC patient survival based on body mass index (BMI; BMI categories in Table 2). We stratified HCC patients based on BMI and found that DPP9 mRNA was significantly upregulated in the obese and extreme obese patients compared to patients with normal body weight ( Figure 3A). Survival analysis suggested that DPP9 was not significantly associated with patient survival when patients are overweight ( Figure 3B), or obese/extreme obese ( Figure 3C), however there was trend to poor prognosis in the overweight group.

Association of expression of DPPs with poor survival of patients with HCC
To further investigate DPP4 gene family expression in liver cancer prognosis, we stratified DPP9, DPP8 and DPP4 based on median expression in HCC. High co-expression of DPP9, DPP8 and DPP4 was not associated with OS ( Figure 4A). However, high coexpression of DPP9, DPP8, DPP4 and FAP in the liver tumour was significantly associated with poor survival in HCC patients ( Figure 4B). This data strengthens evidence that FAP might contribute to poor outcomes, possibly due to its role in liver cirrhosis and steatosis, which is strongly associated with poor liver function and liver cancer [49,50].

Genes correlated with DPPs significantly associated with poor prognosis in HCC and multiple oncogenic and epigenetic pathways
Having shown that DPP9, DPP8, DPP4 and FAP were overexpressed in liver tumours (Figure 2A-D), and high co-expression of the DPP4 gene family members was significantly associated with shorter survival time in HCC ( Figure 4B), we next assessed the gene expression signature correlated with the DPP4 gene family in HCC patients to gain mechanistic insights. Firstly, we analysed the two homologs DPP8 and DPP9 and found that 661 genes were positively correlated in-common between DPP8 and DPP9 ( Figure 5A). Reactome pathway enrichment analysis revealed that these genes were associated with tumour suppressor TP53 regulation and with chromatin remodeling pathways in HCC ( Figure 5B). In contrast, only 10 negatively correlated genes in-common between DPP8 and DPP9 were identified and they were associated with metabolism related pathways (Supplementary figure 2B). A similar approach was undertaken to study the gene expression signature correlated with all three genes, DPP9, DPP8 and DPP4. We found that 268 genes were positively correlated in-common amongst DPP9, DPP8 and DPP4 in HCC patients ( Figure  6A). Most importantly, high expression of these positively correlated gene signatures was significantly associated with poor prognosis in HCC patients ( Figure 6B, Logrank ****p<0.0001). To gain a better understanding of how DPPs may influence survival, we analysed the biological pathways that were enriched in genes that were positively correlated with DPPs. There were 46 significantly enriched pathways according to Reactome annotations for those 268 genes (Supplementary Table 5). Amongst the 15 top-ranked pathways ( Figure 6C), Toll-like receptors (TLRs) cascade, especially TLR9 and TLR7/8 cascade, MyD88 dependent cascade initiated on endosome, N-glycosylation and SUMOylation pathways are dominant. When FAP was added to this analysis, only 7 in-common genes emerged, most of which are the zinc finger gene regulators LIMS1, PHF20L1, ZEB1 and ZNFX1 (Supplementary Figure 3).

Discussion
This is the first analysis of DPP9 exonic variants and LoF variants in human genomic databases and their associations with survival and cancers. There were 27 unique DPP9 LoF variants found from our analyses of gnomAD, TCGA and COSMIC. Notably, patients with DPP9 LoF variants were most commonly diagnosed with cancers. This differs from the intronic SNPs of DPP9, which lower DPP9 expression [33], but have been associated with pulmonary fibrosis and severe COVID [29,30,32,33]. Uterine corpus endometrial carcinoma (UCEC) and lung carcinoma were the most common diagnosed cancers in patients with DPP9 LoF variants and low DPP9 expression was associated with poor survival in UCEC. Therefore, DPP9 might be protective in UCEC. Our datamining evidence that DPP9 LoF is associated with lung carcinoma does not support a previous study that linked high DPP9 expression with poor survival in human lung cancer [7]. Perhaps the dominant role of DPP9 differs in different cancer types and differs depending upon whether high or low DPP9 expression is a recent response or is life-long.
We observed that intrahepatic DPP9 expression is greater in females and that DPP8 and DPP9 LoF variants are more commonly diagnosed with a gynaecological cancer, such as UCEC, cervical cancer or ovarian cancer. These data might suggest that DPP9 is a suppressor of gynaecological cancers, which would align with the recently discovered role of DPP9 in BRCA2 homeostasis in DNA repair [51]. However, in both TCGA and COSMIC, there were limited clinical data concerning patient outcomes, especially in COSMIC where many cases were filed as unknown gender, age and diseases. This is a common limitation of in-silico investigations.
Our investigation found that DPP9 and DPP8 are very likely essential for life as they are LoF intolerant in human. Genetic variation tolerance parameters universally suggested that DPP9 was LoF intolerant (Table 7). LoF intolerance has been associated with successful targets of medicinal therapies for cancer [26]. The necessity of DPP9 is conserved in mice, where constitutive homozygous ablation of DPP9 enzyme activity is neonate lethal [25,[52][53][54]. People who are homozygous for a DPP9 LoF variant would not survive to be included in the searched databases and any that do survive possibly possess a defect in a currently unknown [25] molecular pathway that causes death in DPP9 deficiency. Another implication of this information is that the LoF DPP9 variants in genomic databases are expected to be heterozygous and would therefore have lifelong low DPP9 expression throughout the body.
The databases contained some data mismatches about DPP9 sequences, presumably due to automation of data importation. There were more LoF variants in gnomAD than in COSMIC, but some of the listed LoF locations were not found in either long form or short form DPP9. Moreover, in both TCGA and COSMIC, variants of Arg303* and Glu154 * were reported, but the mutations of nucleotides differed between the two databases. Also, in patients with the Arg303* variant, COSMIC reported both endometrial carcinoma and large intestinal adenocarcinoma, while TCGA only reported endometrial carcinoma. Hence, using a single database can be misleading and unmatched information is inevitable when using multiple databases. Therefore, database derived information needs to be carefully and critically assessed.
Our TCGA datamining on the DPP4 gene family showed that in HCC patients, DPP9, DPP8, DPP4 and FAP were all overexpressed in liver tumours. Most notably, high expression of all four genes showed poorer survival in HCC patients ( Figure 4B). Therefore, DPP targeted therapy might need to target all four enzymes together. Supporting this concept, VbP (Talabostat; PT-100; BXCL-701), which exerts potent inhibition of DPP9 and DPP8 and moderate inhibition of DPP4 and FAP, has been shown to lessen tumour burden in lung cancer and sarcoma models [55][56][57].
Enriched biochemical pathways analysis provided insights into differentially expressed genes correlated with DPPs in human HCC. Positively correlated genes in-common between DPP8 and DPP9 were associated with oncogenic and chromatin remodeling pathways in HCC, whilst 10 negatively correlated genes were associated with metabolism related pathways (Supplementary figure 2A). Mutation of the tumour suppressor TP53 is associated with poor outcome for HCC patients [58]. The top ranking TP53 regulation pathway in our analyses indicates that DPP8 and DPP9 overexpression could be oncogenic in human HCC. The downregulation of metabolism genes with DPP8 and DPP9 upregulation concords with previous data showing that DPP9 deficiency dysregulates metabolic pathways in neonatal liver and gut [53], and that DPP9 hydrolyses metabolism related substrates including adenylate kinase 2 [59,60] and nucleobindin1 [61] and regulates preadipocyte differentiation [62]. Moreover, people with obesity are more likely to develop HCC [63,64]. DPP9 expression was significantly upregulated in HCC patients who were obese and extreme obese compared to those with normal weight. Taken together, this information suggests that DPP9 has an important role in liver manifestations of the metabolic syndrome, which drives progression to NASH, fibrosis and HCC.
The high expression of positively correlated gene signatures in-common amongst DPPs were associated with considerably shorter survival time in HCC patients. Amongst the identified Reactome pathways, TLR cascade, glycosylation and SUMOylation were the most frequent pathways enriched with increased DPP9 mRNA expression. No association between DPP9 and glycosylation is known. SUMOylation is mostly associated with SUMOylation of chromatin organization proteins, and SUMOylation of DNA damage response and repair proteins, which could have roles in cancer development. SUMO1 is a DPP9 interacting protein, acting as an allosteric regulator of DPP9 and thereby influencing its enzymatic activity [23]. TLR7/8 and TLR9 signalling pathways have not been linked to DPP9 previously. The expression of TLRs is low in healthy liver [65]. TLRs are upregulated in HCC tissues [66], and inhibition of TLR7 and TLR9 eliminates the proliferation of liver cancer cells [66,67]. Therefore, overexpression of DPPs might activate these TLR cascades, thereby inducing overactivation of the innate immune system and pathological changes.

Conclusions
This study reports the first comprehensive data mining of associations between DPP9 expression, survival and gene expression signature in human HCC. DPP9 was found to be very LoF intolerant and its LoF exonic variants were mostly associated with cancers. Two intronic variants that lower DPP9 expression and have been linked to lung fibrosis and COVID-19 were not associated with HCC or liver fibrosis. In HCC patients, DPP9 and many closely related genes were significantly upregulated in liver tumours. High co-expression of genes that were positively correlated in-common amongst DPP4, DPP8 and DPP9 were associated with poor survival of HCC patients, and included Tolllike receptors (TLRs), glycosylation and SUMOylation. These findings strongly implicate the DPP4 gene family, especially DPP9, in the pathogenesis of human HCC and therefore requires future functional studies.