Data set of the protein expression profiles of Luminal A, Claudin-low and overexpressing HER2+ breast cancer cell lines by iTRAQ labelling and tandem mass spectrometry

Breast cancer is the most common cancer and the leading cause of cancer mortality in women worldwide. There is a pressing need to identify novel molecules useful in diagnosis and prognosis. In this work we determined the differential expression profiles of four breast cancer cell lines compared to a control cell line. We identified 1020 iTRAQ-labelled polypeptides with more than 95% confidence. We analysed the proteins common to all breast cancer cell lines with the IPA software (IPA Core and Biomarkers). In addition, we selected the specific overexpressed and subexpressed proteins of the different molecular classes of breast cancer cell lines, and classified them according to protein class and biological process. The data in this article are related to the research article “Determination of the protein expression profiles of breast cancer cell lines by Quantitative Proteomics using iTRAQ Labelling and Tandem Mass Spectrometry” (Calderón-González et al. [1], in press).


Comparative analysis of putative biomarkers found in common in all breast cancer cell lines.
Identification of specific over- and subexpressed proteins for each breast cancer cell line.
Classification of specific proteins in each breast cancer cell line according to protein class and biological process with PANTHER.

Data
We identified 1020 iTRAQ-labelled proteins, each with at least one peptide at a minimum of 95% confidence, using the ProteinPilot software (Table 1).
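The acceptance rule above (at least one peptide at 95% confidence or higher) can be sketched as a simple filter. This is only an illustration: the `(protein, confidence)` tuples and the protein names are hypothetical and do not reproduce the ProteinPilot export format.

```python
MIN_CONFIDENCE = 95.0  # percent; threshold stated in the text


def proteins_passing(peptides, min_conf=MIN_CONFIDENCE):
    """Return the proteins that have at least one peptide at or above min_conf.

    peptides: iterable of (protein_id, peptide_confidence_percent) pairs.
    """
    return {protein for protein, conf in peptides if conf >= min_conf}


# Hypothetical peptide-level records, for illustration only.
peptide_hits = [
    ("PROT_A", 99.0),
    ("PROT_A", 80.0),
    ("PROT_B", 94.9),
    ("PROT_C", 95.0),
]

accepted = proteins_passing(peptide_hits)
```

Here `PROT_B` is rejected because its only peptide falls below the threshold, whereas `PROT_A` is kept because one of its two peptides passes.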

IPA analysis of the differentially expressed proteins found in common in all breast cancer cell lines
To determine whether over- or subexpressed polypeptides have been involved in diseases and biofunctions, or localised in networks, we used the IPA software to perform three different core analyses: (a) Core I considered all tissues, primary cells and all cell lines; (b) Core II was performed for breast cancer cell lines and mammary gland; (c) Core III was performed for all cancer cell lines excluding breast cancer cell lines (Table 2). Results were obtained for different categories of diseases and biofunctions, using a p-value equal to or lower than 0.05, calculated with Fisher's Exact Test. Core analyses I and III gave similar results. Core I included 78 categories of diseases and biofunctions, whilst 11 categories were found in Core II and 72 in Core III. The most representative categories in Core I were Nucleic Acid Metabolism (16 proteins, p-values between 4.46 × 10⁻¹⁰ and 2.53 × 10⁻²), Small Molecule Biochemistry (36 polypeptides, p-values from 4.46 × 10⁻¹⁰ to 3.77 × 10⁻²), DNA Replication, Recombination and Repair (19 proteins, p-values between 7.24 × 10⁻¹⁰ and 3.77 × 10⁻²), and Energy Production (10 polypeptides, p-value of 7.24 × 10⁻¹⁰). In Core II we obtained the following categories: Cellular Response to Therapeutics (only 2 proteins, p-value of 2 × 10⁻³), Cellular Movement (9 polypeptides, p-values from 1.72 × 10⁻² to 3.67 × 10⁻²), and Cell Morphology and Cellular Assembly and Organisation (2 polypeptides, p-value of 1.85 × 10⁻²).
Finally, the most representative categories in Core III were the same as in Core I: Nucleic Acid Metabolism (16 polypeptides, p-values from 4.6 × 10⁻¹⁰ to 4.02 × 10⁻²), Small Molecule Biochemistry (34 proteins, p-values between 4.6 × 10⁻¹⁰ and 4.02 × 10⁻²), DNA Replication, Recombination and Repair (21 proteins, p-values from 7.46 × 10⁻¹⁰ to 3.78 × 10⁻²), and Energy Production (12 proteins, p-values from 7.46 × 10⁻¹⁰ to 4.02 × 10⁻²).
The two top networks obtained in the Core I analysis were related to: (1) Cancer; Reproductive System Disease; Dermatological Diseases and Conditions; Haematological Diseases, with a score of 38 and 25 target molecules; and (2) Connective Tissue Disorders; Neurological Disease; Skeletal and Muscular Disorders; DNA Replication, Recombination and Repair; Cancer; Gastrointestinal Diseases, with a score of 27 and a network focused on 20 molecules (Fig. 1A and B). In the Core II analysis, the two most representative networks were involved in: (1) Cellular Development, Cellular Growth and Proliferation, and Cell Cycle, with a score of 15 and a network of 15 polypeptides; and (2) Cell Death and Survival, Tumour Morphology, and Cellular Development, with a score of 9 and a network focused on 11 molecules (Fig. 1C and D). Finally, the top networks in the Core III analysis contained polypeptides playing a role in: (1) Cell Cycle, Cancer, Hereditary Disorders, and Haematological Diseases, with a score of 41 and 26 proteins (Fig. 1E); and (2) Developmental Disorders, Hereditary Disorders, Metabolic Diseases, and Cell Death and Survival, with a score of 24 and 18 polypeptides (Fig. 1F).
In the search for candidates in the PubMed database, we obtained 35 overexpressed and 53 subexpressed polypeptides not previously reported in breast cancer, with one exception, ANXA8, which has only one report [2]. Finally, we compared the biomarkers found with IPA against those from the PubMed search. This comparison revealed only 3 specific overexpressed polypeptides in Biomarkers I, and 34 in the PubMed search (Fig. 2A). For subexpressed proteins, we obtained 6 polypeptides for Biomarkers I, and 51 in PubMed (Fig. 2B). The Biomarkers II and III modules did not find exclusive markers in either case.

Panel of putative biomarkers exclusively found in Luminal A, Claudin-low and HER2+ breast cancer cell lines
We identified sets of specific proteins for Luminal A, Claudin-low and HER2+ breast cancer cell lines. We obtained a set of proteins shared by the Luminal A cell lines containing 34 overexpressed (Fig. 3A) and 22 subexpressed polypeptides (Fig. 4A). In the case of Claudin-low, we identified a set with 55 overexpressed and 44 subexpressed polypeptides (Figs. 3B and 4B). The HER2+ cell line showed 74 overexpressed (Fig. 3B) and 49 subexpressed proteins (Fig. 4B). However, the expression levels of these proteins in tumour tissues and/or sera still need to be determined to validate them as biomarkers.

Functional classification of the exclusive proteins differentially expressed in each breast cancer cell line
The specific overexpressed and subexpressed proteins of each breast cancer cell line (MCF7, T47D, MDA-MB-231, SK-BR-3) were classified with PANTHER. For Protein Class of overexpressed proteins, we obtained 26 categories, with Nucleic Acid Binding (XIV) the most representative category for MCF7, T47D and MDA-MB-231, and Hydrolase (VIII) and Oxidoreductase (XV) the most representative for SK-BR-3 (Fig. 5A). For Biological Process, we found 12 categories, with Metabolic process (IX) the category with the most genes involved (Fig. 5B). The subexpressed proteins specific for each cell line, in turn, fell into 25 categories in Protein Class and 12 in Biological Process. The Protein Class categories with the largest number of genes were Hydrolase (VII) for MCF7 and T47D, and Nucleic Acid Binding (XIII) for MDA-MB-231 and SK-BR-3 (Fig. 6A). The most affected Biological Process category was Metabolic process (IX) in every cell line (Fig. 6B).

IPA analysis of proteins in common in all breast cancer cell lines
To identify diseases and biological functions, canonical pathways, molecular networks, and putative biomarker candidates among the proteins commonly expressed in all breast cancer cell lines, we used the full version of the Ingenuity Pathway Analysis software (IPA; QIAGEN, Redwood City, www.qiagen.com/ingenuity). We performed three different analyses, named Cores I, II and III. In all cases we used the stringent filter. Fisher's Exact Test was used to calculate p-values, and values ≤0.05 were considered significant. Core I classified the identified proteins in all tissues, primary cells and all cell lines, including cervical, central nervous system, colon, hepatoma, immune, kidney, leukaemia, lung, lymphoma, macrophage, melanoma, myeloma, neuroblastoma, osteosarcoma, ovarian, pancreatic, prostate, teratocarcinoma, breast cancer and other cell lines. Core II was restricted to mammary gland and breast cancer cell lines. Core III covered all cancer cell lines, excluding breast cancer cell lines. The parameters used in the complete analysis were: (i) Ingenuity Knowledge Base (genes only), considering direct and indirect relationships; (ii) interaction networks including endogenous chemicals, with the default values of 35 molecules per network and 25 networks per analysis; (iii) all data sources; (iv) confidence: experimentally observed, highly and moderately predicted; (v) human species with stringent filter; (vi) all mutations.
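To make the significance criterion concrete, the sketch below computes an enrichment p-value with a right-tailed Fisher's Exact Test, written directly as a hypergeometric tail sum. It is an illustration only, not IPA's implementation: the choice of a right-tailed test and the reference-set sizes (`K`, `N`) are assumptions, while 206 (commonly changed polypeptides) and 16 (proteins in one category) are taken from the text.

```python
from math import comb


def fisher_right_tail(k, n, K, N):
    """Right-tailed Fisher's Exact Test p-value, P(X >= k).

    k: dataset proteins annotated to the category
    n: total proteins in the dataset
    K: reference proteins annotated to the category
    N: total proteins in the reference set
    """
    upper = min(n, K)
    total = comb(N, n)  # number of ways to draw n proteins from the reference
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# 16 of 206 dataset proteins fall in a category; K and N are hypothetical.
p_value = fisher_right_tail(k=16, n=206, K=300, N=20000)
is_significant = p_value <= 0.05  # threshold stated in the text
```

With these illustrative numbers about 3 proteins would be expected by chance (206 × 300 / 20000), so observing 16 gives a p-value well below the 0.05 threshold.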

Selection of putative biomarkers common in all breast cancer cell lines
We used the IPA Biomarker module to prioritise protein biomarker candidates in our data. Three analyses were carried out: Biomarkers I included a filter for all tissues, primary cells and all cell lines; Biomarkers II, for breast cancer cell lines and mammary gland; and Biomarkers III, for cancer cell lines excluding breast cancer. The parameters used were: (1) human species; (2) all molecule types; (3) all biofluids; (4) all diseases for Biomarkers I, and cancer disease for Biomarkers II and III; (5) for Biomarkers I, all biomarkers and disease application; for Biomarkers II, all biomarkers application and breast cancer disease (including breast cancer, breast carcinoma, ductal carcinoma, ductal carcinoma in situ, infiltrating ductal breast carcinoma, invasive ductal breast cancer, lobular breast carcinoma, and metastatic breast cancer); and for Biomarkers III, all biomarkers application and cancer disease excluding breast cancer. Moreover, we searched the PubMed database for each of the 206 identified polypeptides whose expression levels changed in common in all breast cancer cell lines, using key words such as biomarkers, cancer and breast cancer. Finally, we compared the results obtained with the IPA Biomarker modules and PubMed using the Venny program (developed by Oliveros JC and available at http://bioinfogp.cnb.csic.es/tools/venny/).

Selection of biomarker panels for the different molecular classes of breast cancer cell lines
To select the exclusive sets of overexpressed and subexpressed polypeptides for the Luminal A, Claudin-low and HER2+ cell lines, we performed a comparative analysis between them with the Venny program.
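The Venn-style comparison can be illustrated with plain set operations. This is a sketch under stated assumptions, not the Venny program itself: the protein identifiers are hypothetical placeholders, and the Luminal A set is assumed to be the intersection of the two Luminal A cell lines (MCF7 and T47D), with MDA-MB-231 taken as Claudin-low and SK-BR-3 as HER2+.

```python
def exclusive_sets(groups):
    """Return, for each group, the members that appear in no other group."""
    exclusive = {}
    for name, members in groups.items():
        # Union of every group except the current one.
        others = set().union(*(m for other, m in groups.items() if other != name))
        exclusive[name] = set(members) - others
    return exclusive


# Hypothetical overexpressed proteins per cell line, for illustration only.
mcf7 = {"P1", "P2", "P3", "P7"}
t47d = {"P1", "P2", "P3", "P8"}
mda_mb_231 = {"P2", "P4"}
sk_br_3 = {"P3", "P5", "P6"}

overexpressed = {
    "Luminal A": mcf7 & t47d,  # proteins shared by both Luminal A lines
    "Claudin-low": mda_mb_231,
    "HER2+": sk_br_3,
}

panels = exclusive_sets(overexpressed)
```

In this toy input, `P2` and `P3` are dropped from the Luminal A panel because they also appear in another class, leaving only `P1` as exclusive to Luminal A.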

Functional classification of the exclusive sets of proteins for each breast cancer cell line according to their corresponding molecular classification with PANTHER
The specific sets of polypeptides for each breast cancer cell line, grouped according to their corresponding molecular classification, were analysed with the PANTHER classification system v 9.0 (http://www.pantherdb.org/) [3].