High-throughput mass spectrometry and bioinformatics analysis of breast cancer proteomic data

The data presented here describe a comparative proteomic analysis among malignant tissues [primary breast tumor (PT) and axillary metastatic lymph nodes (LN)] and non-tumor breast tissues [contralateral (NCT) and adjacent (ANT)]. Protein identification and quantification were performed by label-free mass spectrometry using nano-liquid chromatography coupled to electrospray ionization tandem mass spectrometry (nLC-ESI-MS/MS). The mass spectrometry proteomic data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifier PXD012431. A total of 462 differentially expressed proteins were identified among these tissues and analyzed in six group comparisons (named NCTxANT, PTxNCT, PTxANT, LNxNCT, LNxANT and PTxLN). Proteins meeting the 1.5 log2 fold change cutoff were submitted to the Ingenuity® Pathway Analysis (IPA) software version 2.3 (QIAGEN Inc.) to identify biological pathways, disease and function annotations, and interaction networks related to cancer biology. The detailed data presented here provide information about proteome alterations and their role in breast tumorigenesis, and can lead to novel biological insights in cancer research. For further interpretation of these data, please see our research article ‘Quantitative label-free mass spectrometry using contralateral and adjacent breast tissues reveal differentially expressed proteins and their predicted impacts on pathways and cellular functions in breast cancer’ [2].

Value of the data
A differential proteome between tumor and non-tumor tissues is described, highlighting the use of a valuable biological sample as control, the contralateral non-tumor breast tissue.
The non-tumor breast tissues (NCT and ANT) present high similarity in their proteome profiles.
The common alterations in the proteomes of the malignant tissues (PT and LN) point to cancer-associated proteins and pathways that can be explored in tumor progression studies.
The complete lists of differentially expressed proteins and their biological context are a rich source of potential targets to be investigated in further studies.

Data
The differential proteomic profiling of the breast cancer-related tissues was obtained using a high-throughput mass spectrometry platform and appropriate statistical methods. A total of 462 identified proteins presented significant differences in protein expression among these tissues (Supplementary File S1). Six different comparisons were performed: contralateral non-tumor breast tissue versus adjacent non-tumor breast tissue (NCTxANT); primary breast tumor versus contralateral non-tumor breast tissue (PTxNCT); primary breast tumor versus adjacent non-tumor breast tissue (PTxANT); axillary metastatic lymph node versus contralateral non-tumor breast tissue (LNxNCT); axillary metastatic lymph node versus adjacent non-tumor breast tissue (LNxANT); and primary breast tumor versus axillary metastatic lymph node (PTxLN). The differentially expressed proteins of each group comparison were distinctly grouped by hierarchical cluster analysis using the Perseus software version 1.5.6.0 (Fig. 1). Proteins meeting the 1.5 log2 fold change cutoff were analyzed with IPA's tools to identify significant canonical pathways, biological functions, diseases and interaction networks for each group comparison (Supplementary File S2). A detailed data interpretation is available in [2].
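
As a rough illustration of how the six group comparisons arise from the four tissue types, the following Python sketch enumerates every unordered pair of tissues and computes a log2 fold change for one protein. The function names are hypothetical, and the fold change is assumed here to be the difference between the group means of the log2-transformed intensities; neither is taken from the authors' scripts, and the order of the tissues within each label may differ from the naming used in this article.

```python
from itertools import combinations
from statistics import mean

# The four tissue groups described in this article.
TISSUES = ["PT", "LN", "NCT", "ANT"]


def pairwise_comparisons(groups):
    """Return every unordered pair of groups as (a, b) tuples."""
    return list(combinations(groups, 2))


def log2_fold_change(log2_a, log2_b):
    """Difference of mean log2 intensities (group a minus group b).

    Assumption: fold change is computed on the log2 scale as a
    difference of group means.
    """
    return mean(log2_a) - mean(log2_b)


comparisons = pairwise_comparisons(TISSUES)
labels = [f"{a}x{b}" for a, b in comparisons]
print(labels)
```

Four groups give C(4,2) = 6 pairs, which matches the number of comparisons reported above.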

Protein extraction and digestion
Tissue samples were collected during the surgical procedure at Hospital Nossa Senhora das Graças in Curitiba, Parana, Brazil, and stored in RNAlater solution. The samples were prepared as described in [2], according to a protocol adapted from Ostasiewicz and coworkers [3] and Tyanova and coworkers [4]. Briefly, the samples were lysed in 4% SDS, 0.1 M Tris-HCl pH 7.6 and 0.1 M DTT and homogenized in a TissueLyser II sample disruptor (Qiagen Corp. MD, USA), followed by heating to 95 °C for 5 minutes. Samples were submitted to an ultrasonic bath, centrifuged to remove cellular debris and processed by the filter-aided sample preparation (FASP) method [5]. Proteins were briefly separated in a 1D-PAGE 10% (v/v) acrylamide gel, reduced with 10 mM DTT, alkylated with 50 mM iodoacetamide and digested for 18 h with 12.5 ng/ml trypsin at 37 °C. The resulting peptides were submitted to LC-ESI-MS/MS analysis.

LC-ESI-MS/MS
Tryptic peptides were separated on an online EASY-nLC 1000 chromatograph (Thermo Scientific) and analyzed in an LTQ Orbitrap XL ETD mass spectrometer (Thermo Scientific). The runs were performed in triplicate for each sample. Full MS scans were acquired in the Orbitrap analyzer and MS2 scans in the ion trap analyzer, using collision-induced dissociation (CID) fragmentation in data-dependent acquisition (DDA) mode. The acquired data were analyzed in the MaxQuant software version 1.5.8.3 [6] using the Andromeda search engine [7] and the human UniProt protein database (UniProtKB [8]

Data analysis
The "proteinGroups.txt" file generated by the MaxQuant software was processed and analyzed in Perseus v. 1.5.6.0 [9]. Tissue samples were categorized into their respective groups (PT, LN, NCT and ANT). The LFQ intensity values (which represent the protein expression levels) were log2-transformed, and only proteins quantified in at least 70% of the samples of each tissue were used for further analysis. Normalization was performed by width adjustment prior to the imputation of missing values (downshift = 1.8 and width = 0.3) [10,11]. The processed data were exported to the R platform and analyzed in RStudio version 3.4.2 (http://www.R-project.org), using in-house scripts containing Bartlett's test, ANOVA and Duncan's test, all at a significance level of 5%. Proteins that presented homogeneous variances (assessed by Bartlett's test) were submitted to ANOVA at p < 0.05 and an FDR of 0.05. The resulting differentially expressed proteins were analyzed to identify significant differences in the mean values among the sample pairs (Duncan's test), providing lists of the differential proteome for the six group comparisons (NCTxANT, PTxNCT, PTxANT, LNxNCT, LNxANT and PTxLN). Euclidean distances were used for the hierarchical cluster analyses performed with the differentially expressed proteins of each group comparison. The 1.5 fold change cutoff was applied to the log2-transformed data.
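
The preprocessing steps described above (log2 transformation, the 70% valid-value filter and imputation from a downshifted normal distribution) can be sketched in plain Python as follows. This is a minimal illustration and not the Perseus implementation: the function names are hypothetical, the imputation is assumed to be applied per sample column using the mean and standard deviation of the observed values in that column, and the width-adjustment normalization and the statistical tests are not shown.

```python
import math
import random
from statistics import mean, stdev


def log2_transform(intensities):
    """log2-transform LFQ intensities; zero or missing values become None."""
    return [math.log2(x) if x and x > 0 else None for x in intensities]


def passes_valid_filter(values_by_group, min_fraction=0.7):
    """True if every tissue group has at least min_fraction valid values."""
    for values in values_by_group.values():
        if not values:
            return False
        valid = sum(v is not None for v in values)
        if valid / len(values) < min_fraction:
            return False
    return True


def impute_downshifted(column, downshift=1.8, width=0.3, rng=None):
    """Replace None with draws from N(mean - downshift*sd, width*sd).

    Assumption: mean and sd are taken from the observed values of the
    same column (one sample), which is how we read the stated parameters.
    """
    rng = rng or random.Random(0)  # fixed seed only for reproducibility here
    observed = [v for v in column if v is not None]
    mu, sd = mean(observed), stdev(observed)
    return [
        v if v is not None else rng.gauss(mu - downshift * sd, width * sd)
        for v in column
    ]


# Toy example: one sample column with a single missing value.
filled = impute_downshifted([20.0, 22.0, None, 24.0])
```

With a downshift of 1.8 and a width of 0.3, imputed values are drawn from a narrow distribution centered below the observed mean, which mimics proteins whose abundance fell below the detection limit.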

Ingenuity Pathway Analysis
Proteins meeting the 1.5 log2 fold change cutoff in each comparative group were separately analyzed in the IPA software version 2.3 (QIAGEN Inc.) [12]. The NCTxANT group comparison was not included because no protein was observed at this cutoff. The gene symbols of the differentially expressed proteins and their fold change values were uploaded to IPA. The Core Analysis was performed under the following parameters: the expression fold change was set as the type of Core Analysis; direct and indirect relationships were considered to generate the networks; the prediction of these networks included endogenous chemicals, 35 molecules per network and a total of 25 networks per analysis; the confidence setting considered only relationships based on experimentally observed data; only the human species, as well as all tissues and cell lines, was set in this analysis. The cutoff values applied to all datasets were a fold change of 1.5 for up-regulated and -1.5 for down-regulated proteins. Adjusted p values (Benjamini-Hochberg, FDR) of <0.05 were considered significant. Based on the IPA analysis, significant canonical pathways, biological functions and diseases, and interaction networks were algorithmically generated, including z-score values to predict the activation status of these processes.
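
For readers who wish to reproduce the two numerical criteria mentioned above outside IPA, the sketch below shows a standard Benjamini-Hochberg adjustment and a split of proteins into up- and down-regulated lists at the ±1.5 cutoff. It is an illustration only: the function names and gene labels are hypothetical, IPA's internal implementation is not reproduced, and whether the cutoff is inclusive (>= 1.5) is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def split_by_fold_change(records, cutoff=1.5):
    """Split (gene, log2_fold_change) records into up and down lists.

    Assumption: the cutoff is inclusive on both sides.
    """
    up = [gene for gene, fc in records if fc >= cutoff]
    down = [gene for gene, fc in records if fc <= -cutoff]
    return up, down


# Placeholder gene labels, not taken from the dataset.
example = [("GENE_A", 2.0), ("GENE_B", -1.7), ("GENE_C", 1.0)]
up, down = split_by_fold_change(example)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

An adjusted p value below 0.05 would then be treated as significant, in line with the threshold stated above.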

Funding
This study was financed by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001 and the CNPq/Araucaria Research Foundation of Parana State (PRONEX/2012).