Dataset of protein species from human liver

This article contains data related to the research article entitled “Zipf's law in proteomics” (Naryzhny et al., 2017) [1]. The protein composition of extracts of human liver or hepatocarcinoma (HepG2) cells was estimated using a filter-aided sample preparation (FASP) protocol. The protein species/proteoform composition of the human liver was determined by two-dimensional electrophoresis (2-DE) followed by Electrospray Ionization Liquid Chromatography-Tandem Mass Spectrometry (ESI LC-MS/MS). For 2-DE, the gel was stained with Coomassie Brilliant Blue R350, and image analysis was performed with ImageMaster 2D Platinum software (GE Healthcare). Ninety-six sections of the 2-DE gel were selected and cut out for subsequent ESI LC-MS/MS and protein identification. If the same protein was detected in different sections, it was considered to exist as different protein species/proteoforms. A list of human liver proteoforms detected in this way is presented.


Value of the data
The data allow the estimation of the distribution of proteins and protein species/proteoforms in human liver cells according to their abundance.
It is possible to easily extract information about sets of proteoforms that are encoded by the same genes and the abundance of these protein species/proteoforms as well.
The data could be a starting point for quantitative research on protein species/proteoforms.

Data
The extracts of human liver or HepG2 cells were treated with trypsin using the FASP protocol. The peptides produced were analyzed by ESI LC-MS/MS. The lists of proteins detected are presented in Supplementary Table 1. The extracts of human liver tissue (300 µg of protein) were also separated by 2-DE (Fig. 1). The resulting gel was stained with Coomassie Brilliant Blue R350. Image analysis was performed with ImageMaster 2D Platinum software (GE Healthcare, Pittsburgh, PA, USA). Next, 96 sections were selected, assigned pI/Mw coordinates, and cut out for subsequent ESI LC-MS/MS analysis (Fig. 1). A list of all proteins detected by Mascot in the human liver extracts, excluding hemoglobin, is presented in Supplementary Table 2. Hemoglobin was removed as a major contaminant of blood plasma proteins. If the same protein was identified in different sections, it was considered to exist as different proteoforms. According to this rule, a total of 14,667 proteoforms were identified.
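The counting rule above can be illustrated with a minimal sketch. It assumes the identifications are available as (section, protein) pairs; the function name, the section numbers, and the protein labels are hypothetical placeholders, not values from the Supplementary Tables.

```python
from collections import defaultdict


def count_proteoforms(identifications):
    """Count proteoforms per protein under the rule used in the text:
    a protein identified in N distinct gel sections counts as N proteoforms.

    `identifications` is an iterable of (section_id, protein_id) pairs.
    Repeated hits of one protein within the same section are counted once.
    """
    sections_by_protein = defaultdict(set)
    for section_id, protein_id in identifications:
        sections_by_protein[protein_id].add(section_id)
    return {protein: len(sections) for protein, sections in sections_by_protein.items()}


# Hypothetical identifications (placeholder labels, not real data).
hits = [
    (1, "PROT_A"),
    (2, "PROT_A"),
    (2, "PROT_A"),  # duplicate hit in the same section, counted once
    (5, "PROT_B"),
    (7, "PROT_A"),
]

per_protein = count_proteoforms(hits)
total_proteoforms = sum(per_protein.values())
print(per_protein, total_proteoforms)
```

Storing the sections in a set means that duplicate identifications within one section (for example, from the duplicate MS/MS runs) do not inflate the proteoform count.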

In-gel digestion and mass spectrometry
Gel-free sample treatment of cell or tissue lysates was performed according to the FASP assay [10]. In short, cysteines were reduced with 100 mM dithiothreitol (DTT). Excess reagent was removed by ultrafiltration in Microcon filters (YM-10) followed by a wash with washing buffer (8 M urea, 100 mM Tris, pH 8.5). Cysteines were carboxyamidomethylated with 50 mM iodoacetamide (IAA), and excess reagent was removed by washing first with washing buffer and then with digestion buffer (50 mM ammonium bicarbonate, pH 8.5). The proteins were digested with trypsin ("Trypsin Gold", 10 mg/ml, in digestion buffer) for at least 4 h at 37°C, and the resulting peptides were collected as a filtrate.
The treatment of gel pieces was performed according to the protocol described elsewhere [4,11,12]. An Agilent 1100 Series HPLC system and columns were used (Agilent Technologies, USA). In short, the tryptic peptides were dissolved in 5% (v/v) formic acid and injected into a Zorbax 300SB-C18 trap column, 5 × 0.3 mm. After washing (5% ACN, 0.1% formic acid), the peptides were resolved on a 150 mm × 75 µm Zorbax 300SB-C18 reverse phase analytical column using a 30-min 5-60% ACN gradient in 0.1% formic acid with a flow rate of 300 nL/min. The peptides were ionized by nanoelectrospray at 2.0 kV using a fused silica emitter with an internal diameter of 8 µm (New Objective, USA). MS/MS analysis was performed in duplicate using an Orbitrap Q-Exactive Plus (Thermo Scientific, USA). Mass spectra were acquired in the positive ion mode. High resolution data were acquired with a resolution of 30,000 (m/z 400) for MS and 7500 (m/z 400) for MS/MS scans. Each survey MS scan was followed by MS/MS spectra of the five most abundant precursors. For peptide fragmentation, higher energy collisional dissociation (HCD) was applied at 35 eV, the signal threshold was 5000 for an isolation window of 2 m/z, and the first mass of HCD spectra was 100 m/z. Fragmented precursors were dynamically excluded from targeting for 90 s. Singly charged ions and ions with unassigned charge state were excluded from triggering MS/MS scans. The automatic gain control target value was set at 1 × 10^6 with a maximum injection time of 100 ms and at 1 × 10^7 with a maximum injection time of 250 ms for MS and MS/MS scans, respectively. The data were searched by Mascot 2.4.1 (www.matrixscience.com).
The following parameters were applied. Enzyme: trypsin, allowing cleavage before proline; maximum missed cleavages: 2; fixed modifications: carbamidomethylation of cysteine; variable modifications: oxidation of methionine, phosphorylation of serine and threonine, acetylation of lysine; precursor mass tolerance: 20 ppm; product mass tolerance: 0.01 Da. NeXtProt (October 2014) was used as the protein sequence database. A separate decoy database was generated for the false discovery rate (FDR) evaluation. A false-positive rate of 1% was allowed for protein identification [13]. Protein abundance was estimated with the exponentially modified protein abundance index (emPAI) [14,15]. It is derived from the protein abundance index (PAI), defined as the number of identified peptides divided by the number of theoretically observable tryptic peptides for each protein, as emPAI = 10^PAI − 1.
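As a worked illustration of the abundance estimate, the sketch below computes emPAI from peptide counts using the commonly cited relation emPAI = 10^PAI − 1, where PAI is the ratio of identified to theoretically observable peptides. The function names and the peptide counts are hypothetical, and how the search engine counts observed and observable peptides is not specified here, so this is a sketch under stated assumptions rather than a reproduction of the values reported by Mascot.

```python
def pai(n_identified, n_observable):
    """Protein abundance index: identified peptides divided by
    theoretically observable tryptic peptides for one protein."""
    if n_observable <= 0:
        raise ValueError("n_observable must be positive")
    return n_identified / n_observable


def empai(n_identified, n_observable):
    """Exponentially modified PAI: emPAI = 10**PAI - 1."""
    return 10 ** pai(n_identified, n_observable) - 1


def molar_percent(empai_by_protein):
    """Relative molar content (%) of each protein: emPAI / sum(emPAI) * 100."""
    total = sum(empai_by_protein.values())
    if total <= 0:
        raise ValueError("sum of emPAI values must be positive")
    return {name: 100 * value / total for name, value in empai_by_protein.items()}


# Hypothetical peptide counts: (identified, theoretically observable).
counts = {"PROT_A": (10, 10), "PROT_B": (5, 10), "PROT_C": (0, 12)}

empai_values = {name: empai(seen, possible) for name, (seen, possible) in counts.items()}
shares = molar_percent(empai_values)
for name in counts:
    print(f"{name}: emPAI = {empai_values[name]:.3f}, mol % = {shares[name]:.1f}")
```

Subtracting 1 makes a protein with no identified peptides score exactly zero, and the exponential form is intended to make the index scale roughly with protein amount rather than with its logarithm.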