Exomes of Ductal Luminal Breast Cancer Patients from Southwest Colombia: Gene Mutational Profile and Related Expression Alterations

Cancer is one of the leading causes of mortality worldwide. Breast cancer is the most frequent cancer in women, and in recent years it has become a serious public health problem in Colombia. The development of large-scale omic techniques allows simultaneous analysis of all active genes in tumor cells versus normal cells, providing new ways to discover the drivers of malignant transformations. Whole exome sequencing (WES) was obtained to provide a deep view of the mutational genomic profile in a set of cancer samples from Southwest Colombian women. WES was performed on 52 tumor samples from patients diagnosed with invasive breast cancer, which in most cases (33/52) were ductal luminal breast carcinomas (IDC-LM-BRCA). Global variant call was calculated, and six different algorithms were applied to filter out false positives and identify pathogenic variants. To compare and expand the somatic tumor variants found in the Colombian cohort, exome mutations and genome-wide expression alterations were detected in a larger set of tumor samples of the same breast cancer subtype from TCGA (that included DNA-seq and RNA-seq data). Genes with significant changes in both the mutational and expression profiles were identified, providing a set of genes and mutations associated with the etiology of ductal luminal breast cancer. This set included 19 single mutations identified as tumor driver mutations in 17 genes. Some of the genes (ATM, ERBB3, ESR1, TP53) are well-known cancer genes, while others (CBLB, PRPF8) presented driver mutations that had not been reported before. In the case of the CBLB gene, several mutations were identified in TCGA IDC-LM-BRCA samples associated with overexpression of this gene and repression of tumor suppressive activity of TGF-β pathway.


Introduction
Among females, breast cancer is the most commonly diagnosed cancer and the leading cause of cancer death both in developed and developing regions [1]. Worldwide, 2.1 million female breast cancer cases were diagnosed in 2018, accounting for almost one in four cancer cases among women. The disease is the most frequently diagnosed cancer in the vast majority of the countries (154 of 185) and is also the leading cause of cancer death in over 100 countries; the main exceptions are Australia/New Zealand, Northern Europe, Northern America (where it is preceded by lung cancer), and many countries in sub-Saharan Africa (because of elevated cervical cancer rates) [2]. In Colombia, this disease is the second most frequently diagnosed malignancy, representing the leading cause of death in women according to statistics from the National Cancer Institute (INC) from Colombia [3] that estimated around 7600 new breast cancer cases diagnosed annually in the period 2007-2011, with 2226 annual breast cancer deaths [4].
Breast cancer is a heterogeneous complex of pathology that includes multiple tumor subtypes with distinct biological features that lead to differences in response to treatment and in clinical outcome [5]. According to the cellular classification, the invasive ductal carcinoma (IDC) is the most common subtype of breast cancer, accounting for about 80% of breast cancer diagnoses [6]. Moreover, considering the molecular classification, the luminal-like tumors (LM) are the most common subtypes among breast cancer [7]. Since cancer is a disease of complex genetic origin, it cannot be characterized from the study of a single gene or gene product. The genetic complexity inherent to cancer is primarily attributable to variation across patients, who suffer different somatically acquired alterations in different genes and present different rates of accumulation of such alterations [8]. In this scenario, the development of large-scale omic techniques, allowing the simultaneous analysis of all active genes in tumor cells versus normal cells, provides a new comprehensive way to discover the genetic alterations that can drive the expression and regulatory changes in the complexity of malignant transformations [9]. Currently, in large genomic studies such as The Cancer Genome Atlas (TCGA) project, DNA sequencing (DNA-seq) is the main technique utilized for mutation detection, either using a gene panel sequencing approach or whole exome sequencing approach, while RNA sequencing (RNA-seq) is performed to measure gene expression (looking for coding or noncoding genes) and transcript use (sometimes including splicing analyses to detect isoforms) [10].
In this work, we applied some of the omic technologies to the study of breast neoplasms. In particular, we used complete exome sequencing (whole exome sequencing (WES)) to provide a deep view of the mutational genomic profile in a set of cancer samples from Southwest Colombian women. Furthermore, we combined this information with the analysis of WES (DNA-seq) and genome-wide expression (RNA-seq) data from a subset of samples from the TCGA project with the same subtype of breast tumors as the Colombian cohort to infer the activation or alteration profile of genes and identify common pathogenic mutations. The integration of DNA-seq and RNA-seq data from the TCGA samples was also used to find expression quantitative trait loci (eQTLs), which allowed the identification of certain genomic loci that explained variation in expression levels of mRNAs based on allele modification. As a whole, the objective of this study was to find a set of gene-centered alterations identified as pathogenic somatic mutations in exomes from invasive breast carcinomas of ductal luminal subtype, in a cohort of patients of Southwest Colombia, and compare that set with a similar, but larger, cohort of TCGA patients. Pathogenic mutations were detected as somatic tumor variants corresponding to nonsynonymous single nucleotide polymorphisms (nsSNPs). The results provide valuable information on the characteristics of this type of breast cancer and allow us to identify novel associations between genetic mutations and related genes in ductal luminal breast carcinomas. The work revealed well-known cancer driver genes, along with some new cancer driver mutations associated with the patient subpopulation we studied.

Ethical Approval
Tumor samples were collected from the volunteer participants with informed consent, following ethical guidelines approved by the University of Valle, the University of Cauca and the Imbanaco Medical Center, all based in Cali (Colombia).

Samples Collection and DNA Sequencing
A total of 52 breast cancer (BRCA) patients and seven controls from Southwest Colombia were considered for this study. Samples were taken from breast tumor tissue in stages I to IV. No chemotherapy or radiotherapy had been applied to the patients before the collection of the tumor biopsies. The anatomopathological diagnosis of the breast cancer samples indicated that they were invasive ductal carcinomas (IDC) (42/52 samples) and invasive lobular carcinomas (ILC) (10/52 samples). DNA was extracted from the samples with Invitrogen PureLink Genomic DNA Mini Kit and sequenced by Macrogen Inc., using an Illumina HiSeq 4000 System at a 100× depth. Seven other breast samples were collected from healthy tissue, to be used as controls in the study. The exomes of these control samples provided a large set of germline variants that were used to filter and clean out the exomes from the tumor samples to enrich them for somatic mutations.

Exome Mapping and Genetic Variant Calling
Sequencing datasets were mapped to the reference human genome (hg19/NCBI GRCh37) with BWA-MEM 0.7.8-r455, and Picard 1.115 was used to remove duplicates. Sequences were mapped using Seqmule 1.2.6 (locally adapted to run with the Slurm scheduler), and then a consensus of variants was obtained running HaplotypeCaller from GATK-lite 2.3.9, SAMtools 0.1.19 and FreeBayes 0.9.14 with default parameters.
In the analysis of the raw exome sequences, a coverage threshold of 100× was applied in order to have a clear mapping of each genomic locus and to identify well all the variants found. Statistical analysis and quality scores were computed for each single nucleotide variant, and variants with low scores were removed. The consensus of the three tools indicated above (GATK, SAMtools and FreeBayes) was taken into account for each variant, and if there was not complete agreement between them, after manual inspection, the variant was not considered. Variant annotation (looking for nonsynonymous single nucleotide polymorphisms (nsSNPs)) was performed with ANNOVAR. Data from 1000 Genomes, dbSNP, ExAC (Exome Aggregation Consortium) and specially data from COSMIC (Catalogue of Somatic Mutations in Cancer) were used in the annotation of variants. In this way, after all these steps, we found a first raw set of 60,026 variants (SNPs) in 14,634 genes. Variants present in any of the seven exomes of the control breast tissue samples were discarded, producing a second set of 41,404 variants (see steps 1 to 4, in Figure 1, which presents the workflow that we used to select variants). In a third filter step, each of the 41,404 variants were evaluated for the damaging/pathogenic effect on the corresponding gene using six different tools: SIFT [11], PolyPhen-2 [12], MutationTaster 2 [13], FATHMM [14], CADD [15] and GERP++ [16]. Each of them was used with the following pathogenicity thresholds: SIFT <0.05; PolyPhen-2 ≥ 0.98; MutationTaster A or D; FATHMM D; CADD ≥ 20; and GERP++ ≥ 2. This provided a strict filtering that resulted in the identification of 1079 pathogenic variants present in 845 genes.

Variant Prioritization Based on Greater Harmful Impact on Protein
In order to focus on the best predictions about the functional effects of the nsSNPs in the corresponding gene-products and their identification as somatic mutations, we applied some more stringent filters. Once the 1079 variants were identified, the least-pathogenic 10% of the variants for each of four quantitative tools (SIFT, PolyPhen-2, CADD and GERP++) were discarded (step 4 in Figure 1). In an alternative fourth step (step 4' in Figure 1), we removed the least-pathogenic 20% of the variants for each of the four quantitative pathogenicity predictor methods (also starting from the 1079 pathogenic variants). This last step provides only the consensus variants that were above the threshold for all four methods. This led to the identification of 508 highly pathogenic mutations finally selected, identified in the 52 exomes analyzed and present in 432 genes. For each variant, we had the specific pathogenicity values provided by the tools; therefore, these 508 variants were ranked from more to less confident. This final set represents the most significant signal obtained.

Selection of Samples from TCGA for Comparative Analysis with Colombian Samples
We assessed the clinical characteristics of our cohort of Colombian patients to select a similar cohort of patients from TCGA and study them together. In this selection, we pursued a series of clinical and phenotypic similarities between these two cohorts of patients to allow the in-house Colombian WES data to be compared with WES and RNA-seq data from TCGA. As indicated above, in order to do this comparative analysis of genetic mutations (somatic variants) and expression data, it was very important to previously select specific cancer subtypes. Since most of the Colombian patients were ductal and luminal (33/52, 63.5%), our study focused on the analysis of this specific cancer subtype: invasive ductal luminal breast carcinoma (IDC-LM-BRCA). Therefore, considering the characteristics of the Colombian samples, we selected a similar set of samples from TCGA. These similarities were the following: (i) All patients from Colombia and from TCGA selection were women of similar age, presenting an average of 61.6 years old at diagnosis (standard deviation ± 12.6) for the Colombian cohort and an average of 57.3 years (SD ± 13.2) for the TCGA patients. (ii) Both cohorts of patients were mostly White. A recent genetic study by Norris et al. (2017) [17] stated that the population from Antioquia, a close Colombian state culturally very similar to the patient's region (Valle del Cauca), shows averages of 64% European ancestry, 29% Native American ancestry and 7% African ancestry. The majority of the selected TCGA patients were

Selection of Samples from TCGA for Expression Calculation
We were able to obtain genome-wide expression data from TCGA for 859 breast samples that were downloaded from the GDC DataPortal (https://portal.gdc.cancer.gov): 770 breast tumor samples and 89 controls. The data corresponded to the expression signal measured by RNA-seq of 60,423 genes per sample, provided as normalized counts. The TCGA biolinks tool [18,19] was used to retrieve the expression data and information about the samples. Before making a more specific selection within this cohort of breast samples, we performed a global analysis to obtain a comparative overview of the transcriptomic signal of the 859 samples (770 tumors + 89 controls). To do this, we applied a multidimensional scaling analysis (MDS) to the global gene expression data. MDS is a nonlinear dimensionality reduction method that allows visualization of the level of similarity between samples in a dataset. The results of this analysis, presented in Figure S1, show the distribution of the samples within the two main dimensions, indicating a clear separation between the healthy breast samples (green dots in Figure S1) and the breast tumor samples. The analysis also reveals a clear separation between the Luminal and Basal breast cancer subtypes. Furthermore, in Figure S1, we appreciate a clear separation of Luminal from healthy Controls and a fair separation of Luminal from Others (i.e., all other subtypes of breast tumors). This analysis supported the separation and specific selection of luminal breast tumors for the main study in this work.
The RNA-seq expression counts from the 859 samples from TCGA were processed with the expression filter defined by Chen et al. (2016) [20], using the filterByExpr function of the algorithm edgeR (used as an R package). Then, a procedure was applied to recover 780 genes that were filtered out by filterByExpr but that have significant expression in some specific subtype (as explained in the next section). In this way, the data from 27,603 genes (26,823 genes that passed filterByExpr + 780 recovered genes) were normalized with the calcNormFactors function from edgeR. This function uses the weighted trimmed mean of M-values method proposed by Robinson and Oshlack (2010) [21] to normalize the expression data and calculate counts per million (CPM).

Recovery of Some Genes Expressed Only in Some Groups
The filterByExpr function is applied to filter all genes or genetic entities that have very low expression levels in most samples of the different groups or subtypes compared. We considered that some of these genes may be relevant just for some groups, and therefore a protocol was developed to recover the fraction of genes with significant expression only in one or two of the groups considered (Luminal, Others and Control). As a recovery threshold, first we calculated for each gene (in 60,423 genes) the mean expression (in counts) across the 859 samples. Second, we calculated the median of the mean expression distribution, also in raw counts. This median was 2.256 counts. Finally, we chose 3 times this median as the selected threshold for recovery. Accordingly, genes filtered by filterByExpr that had average expression counts > 6.77 in one or two of the groups were recovered. So, we found that 159 genes had average expression counts > 6.77 for the Luminal group and average counts < 6.77 in the other two groups (Others and Control); 224 genes had average counts > 6.77 only for the Control group; and 285 genes had average counts > 6.77 only for the Others group. Finally, we also found 22, 79 and 11 genes that had average counts < 6.77 only in the Luminal, Control or Others groups, respectively. These genes were also recovered since they had average expression counts > 6.77 in the other two groups. In total, a set of 780 genes (159 + 224 + 285 + 22 + 79 + 11) were recovered to be included in the differential expression analysis.

Differential Expression Analysis of Ductal Luminal Breast Cancer (Idc-Lm-Brca) Subtype
Differential expression analysis was carried on the normalized data with two independent methods, namely Limma-Voom [22] and DESeq2 [23], selecting the subtype of samples that was the objective of this work: ductal luminal. These samples (i.e., 510 from TCGA, 339 luminal A plus 171 luminal B), were compared with healthy controls (89 samples from TCGA) and also with all the other subtypes of breast cancer (260 Others). Only the results of ductal luminal samples versus controls were considered for the comparison with the variants derived from the exomes. The differential expression thresholds applied to select the most significant genes were adjusted p-value < 0.001 and |log2FC| > 2.5 (i.e., absolute value of the log2 of the fold change). The p-values were adjusted by the Benjamini and Hochberg [24] procedure. A less stringent threshold (adjusted p-value < 0.05) was applied in order to find and annotate all the significant expression changes of the genes detected in the analysis of the exomes. In this way, we could combine expression data with genetic variants.

Functional Analysis and Annotation of the Variants
In order to record the clinical and biological relevance of the genes found and their genetic alterations, obtained after prioritization, these alterations were analyzed with the Cancer Genome Interpreter (GCI) platform [25]. This systematizes the interpretation of cancer genomes, since it normalizes and automatizes the whole process. CGI identifies all the genomic alterations known as tumorigenic through the tool OncodriveMUT [26,27], which includes the analysis of alterations of unknown clinical significance, and it uses all the available clinical evidence for the annotation of tumor variants that could act as biomarkers. CGI contains information of 5314 validated mutations and 1624 genomic biomarkers of response (sensitivity, resistance or toxicity) to 310 drugs in 130 cancer types. The selected genes were also mapped to eight databases that include annotations for cancer genes and variants: CancerMine [28], UniProt (Universal Protein Repository, UniProt Consortium 2019) [29], COSMIC (Catalogue of Somatic Mutations in Cancer) [30], CIVic (Clinical Interpretation of Variants in Cancer) [31], DoCM (Database of Cured Mutations in Cancer) [32], ClinVar (Clinically Relevant Variants) [33], OncoKB (Precision Knowledge Base in Oncology) [34] and NCG6.0 (Cancer Gene Network) [35]. A protein-protein interaction analysis was performed using STRING [36] and APID [37] for each of the prioritized candidates predicted as driver mutations, in order to verify the interactions or associations between the selected genes. Finally, a functional enrichment analysis was performed using GeneTerm Linker [38] for the most significant differentially expressed genes.

Combined Analysis of Wes Data From the Colombian and Tcga Cohorts
To complement the approach illustrated in Figure 1, we used the WES data available for breast cancer patients from TCGA (https://www.cancer.gov/tcga) that were included in the expression analysis. Using this WES data from TCGA, we searched and identified mutations that were present also in the Colombian populations. The WES data (hg38/NCBI GRCh38) including somatic mutations of the primary tumors of 713 patients with invasive ductal carcinoma from TCGA project were downloaded from the GDC data portal. The data contained 79,508 different mutation sites. As indicated above, the dataset was filtered to include only the 476 samples from the patients of ductal luminal breast cancer subtype (IDC-LM-BRCA) (476/713, 67%). Within these samples, the number of mutations sites found was 43,213. Moreover, all these 476 patients were within the set of 510 ductal luminal breast cancer patients used in the differential expression analysis (Figure 1). After defining the specific tumor subtype studied, we combined the Colombian and the TCGA datasets (i.e., 33 and 476 WES samples, respectively) to search for their common mutation sites. The WES data from Colombia were prepared as explained in Sections 2.2 and 2.3. However, to achieve a better comparison of both WES data, the pathogenicity filters were not taken into account, since such pathogenic information was not available in the same way as for the samples from TCGA. Therefore, we took the Colombian WES data prior to the application of the filters (which contained 45,454 mutation sites) and combined them with the 43,213 mutation sites found in the TCGA WES data in order to find the intersection of both sets. The results of the differential expression obtained in the comparison of 510 ductal luminal samples versus 89 controls were checked to determine which of the genes with variants may suffer a concomitant expression alteration. Despite the fact that the experiments were done with different samples, we found that 81 of the 304 genes (26.44%) showed differential expression for Luminal vs. Control (adjusted p-value < 0.05 for Limma-Voom and DESeq2). In addition, 17 of these genes were upregulated and 64 were downregulated, indicating enrichment in the suppression signal. Some relevant genes that included variants and were found in the differential expression analysis were: ESR1 and ERBB3 (overexpressed); NOTCH4 and CD36 (repressed). The identification of the estrogen receptor (ESR1) as a genetically activated and mutated gene in patients with ductal luminal breast cancer is very consistent with the fact that ESR1 is a well-known marker and driver of luminal breast cancer.

Analysis of the Whole Exome Sequencing Data to Identify Relevant Genetic Variants
Of the highly pathogenic SNPs found in genes that showed differential expression (i.e., 18 SNPs present in upregulated genes and 72 SNPs present in downregulated genes (Figure 1)); 19 SNP variants were identified as driver mutations by the Cancer Genome Interpreter (4 known, 13 reported and 2 new). Therefore, two SNP variants considered as somatic mutations in the tumors had never been reported. These 19 driver mutations are listed in Table 1.
The differential frequency distribution of genetic variants detected in our study shared with other populations of the world showed different overlaps: 26.7% with European (non-Finnish) population, 20% with Latino population and 13% with African population. This reflects a high level of ethnic background miscegenation of the Colombian population. In fact, in the whole country, around 20% of Colombians can be identified with African ancestry, showing the second largest population of Afro-descendants in continental Latin America. However, these proportions change quite a lot in different regions. For example, the Chocó region shows mostly African ancestry (76%) with an almost uniform division among European fractions (13%) and Native Americans (11%). By contrast, the Medellin region has mainly European ancestry (75%), followed by Native Americans (18%) and Africans (7%) [39]. The Colombian population of Valle del Cauca, which is the region of the patients in this study, shows genetic characteristics very similar to the population of Antioquia in Colombia, which in a recent genetic study showed around 65% European ancestry, around 30% Native American ancestry and 5-9% African ancestry [17]. The majority of the selected patients of the TCGA cohort used in our study were also White and of European ancestry (64%). The exomes studied here initially provided a large set of 60,026 SNP variants. As indicated in Figure 1 and described in Materials and Methods, several consecutive steps were applied to identify and select pathogenic mutations that reach a set of 508 alterations, verified using four different quantitative methods. These 508 alterations were furtherly filtered to include only the ones present in the ductal luminal patients (IDC-LM-BRCA subtype), reaching the number of 339 SNP alterations included in 304 protein-coding genes (pcg) (Figure 1). The complete list of these 339 variants, including complete details on the location of the mutation (aa position in the protein), as well as the patients who have each alteration, is provided as Table S2. The variants were classified, using the Cancer Genome Interpreter, into population polymorphisms (that are population variants not currently related to cancer) and two types of somatic mutations due to cancer: transient mutations predicted as passenger and driver mutations identified as causal in cancer (known, reported or new). In this analytical separation of mutations, priority was given to sequence alterations whose prediction corresponded to conductive cancer mutations (i.e., to driver mutations). As indicated above, Table 1 includes the SNP variants that were identified as driver mutations: 17 known or already reported, plus 2 newly reported in this work.

Genes Including Genetic Variants Considered Driver Mutations
Among the 17 selected genes with at least one driver mutation, some canonical cancer genes were found, such as: Additionally, some genes were found that had rarely been reported as altered in cancer studies, like UPF3B (regulator of nonsense mediated mRNA decay), which encodes a protein that is part of a post-splicing multiprotein complex involved in both mRNA nuclear export and mRNA surveillance, and DPDY (dihydropyrimidine dehydrogenase) enzyme involved in the breakdown of nucleotides pyrimidines (uracil and thymine) when they are not needed. Finally, as shown in Table 1, we found two new driver mutations in two genes that have already been associated with breast cancer: PRPF8 and CBLB. PRPF8 (pre-mRNA processing factor 8) is a component of both U2-and U12-dependent spliceosomes, found to be essential for the catalytic step II in pre-mRNA splicing process. PRPF8 is a cancer-related gene with different effect in different tissues, and it may affect how RNA-binding proteins mediate cancer-specific phenotypes [40]. CBLB (Cbl proto-oncogene B) encodes an E3 ubiquitin-protein ligase which promotes proteasome-mediated protein degradation by transferring ubiquitin from an E2 ubiquitin-conjugating enzyme to a substrate. It also functions as a negative regulator of T-cell activation. The CBLB gene can block the TGF-β pathway and has been associated with breast cancer [41]. In our study, we investigated the relationship between CBLB and the TGF-β pathway by analyzing the mutations and the expression levels found for this gene. This is explained in Section 3.7.

Functional Involvement in Cancer of Genes Found with Driver Mutations
UPF3B encodes a protein that is part of a post-splicing multiproteic complex involved both in nuclear mRNA export and mRNA control, detecting mRNA with a defective reading frame and initiating nonsense-mediated mRNA decay (NMD). UPF3B has been linked to cancer because some tumor cells use NMD to destroy mRNAs from key tumor suppressor genes [42]. This is the case, for example, in breast and ovarian cancer, where the nonsense-mediated mRNA decay pathway triggers degradation of most BRCA1 mRNAs [43]. Another gene eventually linked to cancer is DPDY. As indicated above, the protein encoded by DPDY is a catabolic enzyme of pyrimidines (dihydropyrimidine dehydrogenase) that participates in the first step of the decomposition of pyrimidines, converting uracil into another molecule called 5,6-dihydrouracil and thymine into 5,6-dihydrotothiamine. The molecules generated from this process can be used in other cellular processes. Cancer cells exhibit a very active and dynamic metabolic control to have sufficient supply of nucleotides and other macromolecules to grow and proliferate. In fact, cancer cells tune the signaling pathways to empower de novo synthesis of nucleotides. This enables that the metabolic requirements for cell growth to be satisfied and allows the synthesis of nucleic acids and proteins to take place [44]. Alterations in the DPDY gene could alter the normal functioning of the enzyme encoded by this gene, contributing to the proliferation of cancer cells. Similarly, other defects in pyrimidine metabolism can increase the toxicity risk in cancer patients that receive chemotherapy with the drug 5-fluorouracil (5-FU), which is a pyrimidine analog.
The alteration in DPDY was identified in a patient positive for estrogen receptor (ER), progesterone receptor (PR) and HER2 receptor (triple positive) (Table S2). This finding is very consistent with the literature, since there are reports of several studies related to the low response of ER+ tumors to conventional chemotherapy. In fact, patients with ER-tumors have more complete pathological responses to neoadjuvant chemotherapy than ER+ [45]. Luminal tumors have only a 6% complete pathological response to preoperative chemotherapy based on paclitaxel followed by 5-fluorouracil, doxorubicin and cyclophosphamide, contrary to 45% complete pathological response in the basal (ER-PR-) and HER2+ subtypes [46]. The same studies confirm that the luminal B subtype has even a worse response than luminal A. In this context, we can consider that the negative response to chemotherapy in the evaluated cases can be related to the alteration in DPDY. As indicated above, PRPF8 is a central RNA splicing factor, essential for catalytic passage II in the splicing process prior to mRNA [47]. The disruption of RNA splicing causes genome instability, and the factors involved in this process have been related to tumor suppression. Recurrent mutations in the PRPF8 gene have been observed in malignant myeloid tumors and are correlated with an increased proliferative capacity [48].
Two of the alterations predicted as cancer drivers in the ductal luminal subtype are well-known breast cancer biomarkers: ESR1 mutation (E380Q) and NOTCH1 mutation (G995S) ( Table 1). Biomarkers have many potential applications in oncology, including risk assessment, screening, differential diagnosis, prognosis determination, prediction of treatment response and disease progression monitoring [49]. Therefore, confirmation of specific biomarkers will have a very positive impact on the management of disease for patients with specific cancers. With regard to the effect of these mutations in pharmacological treatments, ESR1 mutation (E380Q) is sensitive to fluvestrant (hormone therapy) [50] and resistant to tamoxifen (hormone therapy) [51]; NOTCH1 mutation (G995S) is sensitive to the gamma-secretase inhibitors (GSI) that block NOTCH signaling [52].

Global Differential Expression of Ductal Luminal Breast Cancer Samples
Differential expression analyses were carried with the Limma-Voom [22] and DESeq2 [23] methods, as described in Section 2, using RNA-seq data from TCGA, comparing 510 ductal luminal samples (339 luminal A and 171 luminal B) with 89 healthy controls. The differential expression thresholds applied to select the most significant genes obtained with these two methods were adjusted p-value < 0.001 and |log2FC| > 2.5. The genes that were significantly differentially expressed with both methods were selected. In this way, a significant set of 840 genes was identified, including 263 overexpressed genes and 577 repressed genes (Figure 2). The complete list of these genes, with their description, the corresponding p-values and the fold-change given by each method, is provided as Supplementary Material in Table S3.
As indicated above, the complete list of 840 genes corresponding to the overlapping differential expression signature (shown in Figure 2) is provided in Table S3. This list includes upregulated genes that are clear cancer markers, like the aurora kinases A and B (AURKA and AURKB), which are frequently amplified and overexpressed in cancer; they are also related to proliferation, as is gene Ki-67 (MKI67). Other upregulated genes are CEACAM5 and CEACAM6, associated with carcinogenesis, and many genes involved in the stimulation of mitosis and the cell cycle: CCNB2 (cyclin B2), CDK1, CDC6, CDC20, CDC20B and CDC25C. One of the most altered pathways, according to a functional enrichment assignment in KEGG database, is transcriptional misregulation in cancer (i.e., KEGG pathway hsa05202), which includes genes like WT1 and MMP9, as well as several other highly overexpressed matrix metallopeptidases (MMP11 and MMP13). As a whole, we obtained a large gene differential expression signature characteristic of ductal luminal breast cancer samples derived from TCGA, even using fairly strict statistical thresholds and considering only the results of the superposition of two methods. In the next section, we looked for any gene that had significant differential expression in the ductal luminal breast cancer samples and that also showed some alteration or mutation in the exome sequencing data.

Differential Expression of Ductal Luminal Breast Cancer Samples in Genes That Suffer Mutations
The differential expression results corresponding to the comparison of 510 ductal luminal samples versus 89 controls (i.e., the same samples as in the previous section), using a threshold of adjusted p-value < 0.05, were crossed with the genes identified after all the WES data analyses (i.e., the 304 protein-coding genes found for ductal luminal breast cancer). With this approach, a set of 81 genes were identified. The complete list of the 304 protein-coding genes that include variants, combined with the differential expression data obtained for 81 genes derived from the comparison of ductal luminal samples versus controls, are provided in Table S4. Figure 3 shows the chromosome location (in the X-axis) of these 81 genes, together with their differential expression significance (in the Y-axis) measured as -log10(adjusted p-value) (taking the data from the values calculated with DESeq2). The concentration of these 81 genes along the genome was highest in chromosomes 6, 7 and 15. The genes marked as red dots in Figure 3 were upregulated in the RNA-seq data analysis of cancer samples with respect to healthy controls, and the genes marked as blue dots in Figure 3 were downregulated in cancer samples with respect to healthy controls. The four genes that were identified to have known driver mutations, according to our analysis of WES, are marked in this plot with a green box (Figure 3). These genes were also identified in the differential expression analysis: NOTCH4 and CD36 were found to be repressed in the RNA-seq data, while ESR1 and ERBB3 were found to be overexpressed in the RNA-seq data.
The identification of the estrogen receptor (ESR1) as an overexpressed gene, as well as a mutated gene in the data of our patients with ductal luminal breast cancer is worth noting, since it confirms and provides validation to the methodological approach of this study. ESR1 is a well-known positive biomarker of luminal breast cancer. Furthermore, several studies suggest that alteration of the estrogen receptor alpha gene (ESR1) can contribute to therapeutic resistance and metastasis in breast cancer [53,54]. ERBB3 is a member of the human epidermal growth factor receptor (EGFR) family. ERBB3 is an important molecule in estrogen-receptor-positive breast cancers (ER+), which represent approximately 80% of all breast cancers. A high expression of this gene has been detected in ER+ and luminal tumors [55]. Furthermore, elevated levels of ERBB3 correlate with progression of several solid tumors [56,57]. In these reports, it has also been observed that ERBB3 mutations can activate MAPK and HER signaling in ER+ breast cancer cells. Additionally, ERBB3 activates the PI3K pathway through its association with ERBB2 (HER2). In many cases, the effectiveness of hormonal therapy is nullified by the PI3K pathway, which remains very active along with high levels of ERBB2. This cancellation of hormonal therapy results in the activation of transcription factors that disrupts the epithelial polarity and leads to hyperproliferation [47].
NOTCH4, a member of the NOTCH signaling pathway and the NOTCH family, plays an important role in cell development pathways, including proliferation, differentiation and apoptosis. NOTCH4 expression is inversely associated with the estrogen receptor (ER) and/or progesterone receptor (PR) and is positively associated with a large tumor size, lymph node involvement and a more advanced stage of tumor lymph node metastasis. Its overexpression is more related to basal molecular subtypes. Thus, it is reasonable that NOTCH4 is downregulated in the luminal subtype.
The identification of the CD36 repressed gene in this study is consistent with the finding by Sun et al. (2018) [58], who reported that the repression of the CD36 gene in lung tumor samples inhibits cell proliferation, blocks the cell cycle in the G0/G1 phase and inhibits cell migration.

Functional View of the Genes Altered in Ductal Luminal Breast Cancer
A functional analysis carried out with Gene Term Linker [36] for the 81 deferentially expressed genes with highly pathogenic mutations in IDC-LM-BRCA showed a significant enrichment in processes associated with carcinogenesis and tumor progression, indicating that this set of genes constitutes a significant molecular signature of the malignant state of the samples studied.
In particular, genes DLL1, FOXS1, GJB5, KRT15, LAMA1, LAMA3, NOTCH4 and TGM5 were all downregulated and participate in the enrichment of the NOCTH4 signaling pathway (GO: 0007219) and cell adhesion regulation (GO: 0030155). Genes CD36, COL4A4, COL5A1, CTSG, FBN3, FLNC, LAMA1 and LAMA3 participate in the signaling pathways WNT, PI3K-AKT, calcium and MAPK. The MAPK pathway is the most frequently mutated signaling pathway in human cancer, currently considered a promising target for cancer therapy. This pathway plays a central role in the induction of responses such as cell proliferation, differentiation, growth, migration and apoptosis [59]. This pathway is initiated by an extracellular mitogenic stimulus that leads to the activation of RTK or GPCR. The MAPK/ERK pathway leads to phosphorylation and subsequent translocation of ERK in the nucleus. ERK activation plays a central role in the induction of cell cycle input and the suppression of negative regulators of the cell cycle [60]. The PI3K-AKT signaling pathway also regulates many normal cellular processes, including cell proliferation, survival, growth and motility. These processes are critical for tumorigenesis, and the role of this pathway in oncogenesis has been extensively investigated. Many components of this pathway have been related to human cancer in studies that analyze both mutation and expression changes [61].
The identification of genes enriched in the WNT and calcium signaling pathways in the analysis of ductal luminal samples is an expected result, since both these pathways have been directly related to the tissue and the subtype evaluated in our work. The WNT signaling pathway is important for the development and remodeling of the breasts during pregnancy and lactation, and modifications in the components have an impact on oncogenic transformation. Likewise, alterations in calcium homeostasis occur frequently in some pathological conditions such as malignant proliferation, and the entry of Ca has a decisive part in determining the concentration of Ca in the epithelial breast cells. Glandular breast proliferation, differentiation and lactation are regulated by several local and systemic hormones, of which estrogen is one of the most important ones. Estrogen regulators and their receptors are modulators of proliferation and differentiation of breast epithelial cells [62]. The effect of estrogen on the epithelial breast cells is mainly done through genomic regulation, but nongenomic mechanisms depend particularly on Ca signaling [63].
Another group of genes found in our set (constituted by ABCA13, ABCA8, ABCB5, ABCC9, ATAD2, ATP13A5, CFTR and DNA2) was enriched in ABC transporters and ATPase activity coupled to transmembrane movement of substances. The expression of these proteins is related to drug resistance and is an important obstacle for successful chemotherapy. Genes CFTR, CHRNA7, CLCNKB, CNGA1, KCNA2, KCNH8, SCN4A, SCN7A and SLC26A4 were associated with ion channels activated by voltage (GO: 0005244). In breast cancer, different types of ion channels other than Ca have been associated with tumorigenesis. Recently, voltage-dependent Na channels (VGSC) have been implicated in processes that lead to increased tumor aggressiveness [64]. This may be due to the fact that alteration in the proteins involved in the cell processes described can also contribute significantly to cellular mitotic biochemical signaling, cell cycle progression and cell volume regulation [65].

Mutations Found in CBLB, a Gene That Inhibits the TGF-β Pathway
The CBLB gene was one of the two genes for which we identified novel mutations in the Colombian cohort. CBLB and its paralog CBL are called proto-oncogenes and encode E3 ubiquitin-protein ligases. These genes are known to block the TGF-β pathway. Indeed, it has been reported that CBL genes enhance breast tumor formation by inhibiting tumor suppressive activity of TGF-β signaling pathway [41]. In our analysis of the 476 IDC-LM-BRCA samples from TCGA, the CBL and CBLB genes showed no significant change in expression compared to normal controls. However, two of the genes of the TGF-β pathway (TGF-β receptors TGFBR2 and TGFBR3) presented a very significant downregulation in expression (adjusted p-value < 0.001 with Limma-Voom and DESEq2), which indicated a probable inhibition of the TGF-β signaling pathway in the ductal luminal breast carcinomas. Furthermore, analysis of the whole exomes of the 476 IDC-LM-BRCA samples from the TCGA project revealed 13 distinct mutations in the CBLB gene, 12 of which had been reported as confirmed somatic tumor mutations (in COSMIC v90) and 6 of them corresponding to missense variants that are likely to damage the protein (Table S5). Based on these exomes, we also evaluated whether there was any relationship between the mutations and the gene expression of CBLB. Thus, we found, in our TCGA set of IDC-LM-BRCA samples, six patients with one or more of the 13 CBLB reported mutations. Comparison of the expression of CBLB in these mutated patients with the average expression in the nonmutated patients detected an overexpression of CBLB (adjusted p-value = 0.0343) and a repression of a TGF-β receptor called TGFBR3L (adjusted p-value = 0.0613). Therefore, these 13 mutations may be contributing to a more acute blockage of the TGF-β pathway in this group of breast cancer patients. Information on these 13 mutations is provided in Table S5.

Common Mutated Genes in Ductal Luminal Breast Cancer Patients from Colombia and TCGA
As described in Section 2.10, we performed the intersection of 43,213 mutation sites found in the 476 TCGA ductal luminal patients and the 45,454 mutation sites in the 33 Colombian ductal luminal patients. This combined analysis of WES samples from Colombia and TCGA provided a set of 29 common genes that were found mutated in both cohorts of ductal luminal breast carcinoma patients. These genes include 35 single nucleotide mutations, with the following three genes exhibiting multiple mutations: PIK3CA with four mutations, TP53 with three and MUC4 with two. PIK3CA mutations represent one of the most common genetic aberrations in breast cancer. They have been reported to be present in over one-third of cases, with enrichment in the luminal subtypes. The tumor suppressor gene TP53 is the most frequently mutated gene in somatic cells of human cancers. All the information on these mutations is included in an additional Supplementary Table (Table S6). Along with the genes that overlap between the Colombian and TCGA datasets, we also looked for the specific mutations that matched between this list of 35 selected variants and the list of 339 SNPs derived from our comprehensive analysis of the Colombian ductal luminal breast cancer cohort (339 SNPs included in Table S2). In this matching, we found five common SNPs present in both sets: rs766301333 (in gene EPHA1, site chr7_143091418 change G to A); rs762605878 (in gene PLEKHG1, site chr6_151125863 change G to A); rs758321674 (in gene STAB2, site chr12_104100711 change G to A); rs121912656 (in gene TP53, site chr17_7577547 change C to A); and rs1057519997 (also in gene TP53, site chr17_7579355 change A to T). Along with information on SNPs, in Table S6 we also included information on the differential expression analysis performed for all these genes with the RNA-seq data from the 476 TCGA samples. A group of 10 genes out of 29 showed significant changes in differential expression, considering the Limma-Voom algorithm; 25 genes out of 29 showed significant changes in differential expression, considering the DESeq2 algorithm. Many of these gene alterations observed in our analysis have been previously reported. For example, PLEKHG1 is a gene located in a breast cancer risk locus on chromosome 6, and it has been found to be downregulated in breast cancer samples compared to adjacent normal tissue samples [66]. We found this gene to be mutated and repressed. Another relevant result in our analysis was the detection of the tumor suppressor TP53 presenting three mutations that were conserved in both the Colombian and TCGA datasets. This gene as a whole did not have a significant change in the level of expression, but when we measured eQTL, which detects changes in expression between the two alleles of a mutation site, we observed that two SNPs in TP53 (rs587781288 and rs1057519997) presented a change in expression associated with the mutation: site chr17_7578508 change C to T, p-value = 0.0634; and site chr17_7579355 change A to T, p-value = 0.0926 (Table S6). The changes are not very significant but indicate a trend. In both cases, the mutation corresponded to upregulation of the gene, indicating that the mutation may have a positive effect by enhancing the tumor suppressor activity of TP53 in ductal luminal breast carcinomas. We calculated the eQTLs for all the 35 mutations included in this study, and found only two other cases in which the mutation is associated with a change in expression: AKT1 gene, rs121434592 mutation (site chr14_105246551 change C to T) p-value = 0.0068; and PIK3CA gene, rs121913273 mutation (site chr3_178936082 change G to A) p-value = 0.0435. These two genes, AKT1 and PIK3CA, are well-known cancer genes and we report their double alteration in sequence and in expression detected in both ductal luminal breast cancer cohorts.

Conclusions
In the present study, we found a set of gene-centric alterations identified as pathogenic mutations in exomes from invasive breast carcinomas of ductal luminal subtype in a cohort of patients of Southwest Colombia. The pathogenic mutations were detected as somatic tumor variants corresponding to nonsynonymous single nucleotide polymorphisms (nsSNPs). The mutations were correlated with exome mutations and genome-wide expression alterations detected in a larger set of tumor samples of the same breast cancer subtype from TCGA (that included DNA-seq and RNA-seq data). The results provide a refined list of genes and mutations associated with the etiology of the ductal luminal breast cancer. This list includes 19 single mutations identified as tumor driver mutations in 17 genes. Some of the genes (such as ATM, ERBB3, ESR1 or TP53) are very well known cancer genes altered in breast cancer and therefore were expected, while other genes (such as CBLB and PRPF8) presented driver mutations that had not been reported before. Moreover, in the case of the CBLB gene, 13 mutations were identified in TCGA ductal luminal samples associated with overexpression of their host gene and repression of tumor suppressive activity of TGF-β pathway. Our study also reports a combined analysis of WES samples from the Colombian and TCGA patients, providing a set of 29 common genes that were found mutated in both ductal luminal breast carcinoma cohorts. These genes include 35 single nucleotide mutations. Using TCGA data, we also calculated the eQTLs for all these 35 mutations, finding only four mutations that showed a significant change in expression associated with the modified allele, corresponding to mutations in three cancer genes: AKT1, PIK3CA and TP53. The functional relevance of each of these mutations in these genes and the molecular effect on specific tumors and individual patients needed further investigation and are beyond the scope of this work. In any case, we present a neat collection of driver genetic mutations and expression alterations associated with a specific subtype of breast cancer and linked to a Colombian cohort of patients.
Supplementary Materials: The following files are available online at http://www.mdpi.com/2218-273X/10/5/698/s1 associated to this article: Figure S1-Multidimensional scaling (MDS) analysis of 859 samples of breast ductal cells (770 from invasive ductal breast carcinomas and 89 from normal healthy controls) using RNA-seq to obtain global expression (measuring 60,423 genes). Table S2-List and descriptive parameters of the 339 SNP variants (corresponding to 304 protein-coding genes) found in the tumor samples of the ductal luminal breast cancer patients studied in this work. Table S3-List and descriptive parameters of the 840 genes that presented significant differential expression in the set of 510 tumor samples from TCGA that corresponded to ductal luminal breast cancer subtype. Table S4-List and descriptive parameters of the 304 protein-coding genes that had mutation variants in the tumor samples of the ductal luminal breast cancer patients studied in this work. The genes also include the differential expression parameters and data corresponding to the comparison of 510 ductal luminal versus 89 control samples from TCGA. Table S5-List and descriptive parameters of the 13 somatic mutation sites that were found in gene CBLB in the 476 exomes of IDC-LM-BRCA TCGA samples. Table S6