Elsevier

Journal of Proteomics

Volume 73, Issue 6, 18 April 2010, Pages 1163-1175

Meta sequence analysis of human blood peptides and their parent proteins

https://doi.org/10.1016/j.jprot.2010.02.007

Abstract

Sequence analysis of the blood peptides and their qualities will be key to understanding the mechanisms that contribute to error in LC–ESI-MS/MS. Analysis of peptides and their proteins at the level of sequences is much more direct and informative than the comparison of disparate accession numbers. A portable database of all blood peptide and protein sequences with descriptor fields and gene ontology terms might be useful for designing immunological or MRM assays from human blood. The results of twelve studies of human blood peptides and/or proteins identified by LC–MS/MS and correlated against a disparate array of genetic libraries were parsed and matched to proteins from the human ENSEMBL, SwissProt and RefSeq databases by SQL. The reported peptide and protein sequences were organized into an SQL database with full protein sequences and up to five unique peptides in order of prevalence along with the peptide count for each protein. Structured query language or BLAST was used to acquire descriptive information in current databases. Sampling error at the level of peptides is the largest source of disparity between groups. Chi Square analysis of peptide to protein distributions confirmed the significant agreement between groups on identified proteins.

Introduction

There is a need to discover and assay proteins from blood. Blood likely contains the proteins of many different tissue and cell types [1] and these proteins were all first expressed as mRNAs from genes. The mRNA transcripts may be captured as cDNAs, or the parent genome sequenced, and the proteins inferred from the nucleotide sequence [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]. Re-arrangements at the level of nucleic acid, together with post-translational processing and modifications, give rise to an immense number of possible protein sequences expressed in different tissues and cell types [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27]. The International Protein Index (IPI), conceived as an integrated database for proteomics experiments, was built as a complete and non-redundant (NR) compilation of the Swiss-Prot, TrEMBL, ENSEMBL and RefSeq databases [28]. RefSeq is a largely non-redundant database of human transcripts together with their gene ID numbers and gene ontology (GO) annotations [29] that were parsed into an SQL database [30]. Protein sequence databases from these three organizations differ in format and content and contain both overlapping and unique information, with some proteins differing by only one, a few, or many amino acids. In some cases there are extensive data regarding the functions of the protein sequence, but the functions of many proteins are unknown, and some predicted proteins inferred from genomic sequences may be only hypothetical and might not be expressed in cells.

Twelve sets of publicly available proteins and/or peptides identified by LC–MS/MS of blood that were fractionated and analyzed by different methods were assembled together in SQL to be summarized and compared. Adkins et al. (2002) used protein A/G depletion and tryptic digestion followed by 2D polysulfoethyl/C18 separation of peptides via nanospray into a Thermo Deca XP with correlation by SEQUEST without enzyme limitations and searching beyond 30 aa in length to yield about 585 proteins. Tirumalai et al. (2003) used ultrafiltration with a 30,000 NMWL cut-off in 5% ACN followed by separation of tryptic peptides with polysulfoethyl A/C18 via nanospray into a Thermo Deca XP correlated with SEQUEST without enzyme limitations and searching beyond 30 aa in length to yield about 317 proteins. Marshall et al. (2004) used DEAE and other chromatography resins or PAGE separation of proteins with trypsin or chymotrypsin digestion prior to C18 separation, sometimes with prior QA & PS separation of peptides, into a Thermo Deca XP with correlation by SEQUEST of fully tryptic peptides of 14 aa or less, yielding 650 proteins with highly stringent correlation scores [31]. Shen et al. [40] digested neat serum and separated the peptides by UPLC with C18 alone or SCX followed by C18 using nanospray into a Thermo Deca XP with correlation via SEQUEST using no enzyme limitations and searching beyond 30 aa in length to yield 953 proteins with modest correlation scores. Omenn et al. [44] used depletion of albumin, IgG, IgA, IgM, transferrin, haptoglobin, A1AT and/or separation of proteins by CH2O affinity, reversed phase, PAGE, SAX, gradiflow/tca, free flow electrophoresis, IEF and/or separation of peptides by SCX and/or C18 via MALDI and nanospray ESI into Paul ion traps, linear ion traps, Qq-TOF and FTICR with correlation to peptides by SEQUEST, MASCOT, PepMiner, Digger, Sonar, X!TANDEM, VIPER with various enzyme rules, yielding 9303 proteins. Shen et al. [41] depleted albumin & IgG prior to digestion before UPLC separation of peptides by SCX/C18 with correlation by SEQUEST without enzyme limitation, yielding a set of 2258 higher confidence and 2704 lower confidence proteins. Zhu et al. [70] used a small amount of the data that Marshall et al. [31] had previously calculated discretely (i.e. one experiment correlated individually) but instead calculated the subset jointly (i.e. many experiments contributing to the calculation of protein confidence) with a cut-off score of 2400 [32], [33] to yield 2571 proteins with an estimated 5% error rate [32]. Faca et al. [49] depleted serum of albumin, IgG, IgA, transferrin, haptoglobin, A1AT followed by separation of intact proteins with Mono Q, reversed phase or PAGE followed by tryptic digestion and separation of peptides by C18 with analysis via nanospray into a Thermo LTQ-FTICR and correlation revealing fully, or occasionally, semi-tryptic peptides from 2254 proteins. Zhang et al. [77] captured N-linked glycopeptides and separated them using SCX followed by C18 via nanospray into a Thermo LTQ with correlation by SEQUEST, generating 523 distinct proteins. Sennels et al. [48] used a random hexapeptide library on methacrylate beads, eluted with thiourea, urea, detergents, acids or organic solvents prior to digestion of PAGE slices and analysis via nanospray into a Thermo LTQ with correlation using SEQUEST and MASCOT, yielding 1559 proteins with no peptide information. Tucholska et al. [34] compared propyl sulfate, quaternary amine, diethylaminoethanol, Cibacron blue, phenol sepharose, carboxy methyl sepharose, hydroxyapatite, heparin, concanavalin A and protein G chromatography to yield some 4396 proteins by X!TANDEM.

Mass spectra may be correlated to peptides that sometimes exist in more than one protein [33], [35]. The peptide coverage from LC–MS/MS of blood may not be sufficient to correspond to only one protein, even when multiple peptides are observed, due to multiple related genes, protein domains and splice variants in eukaryotes. Peptide fragmentation spectra from the LC–MS/MS of blood samples were correlated to disparate protein databases inferred from DNA sequences or transcripts from humans and other species [31], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49], [50]. A meaningful review, summary and comparison of large scale LC–MS/MS data [51] created at different times and correlated to different genetic libraries [44], [52], [53] cannot be undertaken without the use of computation [54], [55]. In contrast to accession numbers that may change over time, the blood peptide and protein sequences themselves are immutable and portable identifiers that may be compared between libraries and datasets. One way to avoid ambiguity is to map the sequences identified to the full-length proteins of a relatively non-redundant set of transcripts. At present the most important computing tools for comparing biopolymer sequence data are Structured Query Language (SQL) [30], [56] and the Basic Local Alignment Search Tool (BLAST) [57]. Algorithms such as BLAST or SQL databases may be used to make unbiased comparisons of peptide or protein sequences from correlation analysis of different populations of LC–MS/MS runs [54], [56], [58]. SQL forms the basis of many laboratory data automation and analysis systems [56], [59], [60]. The use of a protein or peptide sequence as a unique identifier or “database key” has been shown to improve comparisons between independent groups [54].
An efficient method for labeling protein and peptide sequences such as the Secure Hash Algorithm [61] is ideal for internal database purposes; the term SEQUID is a standard field name representing the protein identifiers generated. Protein or peptide sequences that are not identical between two populations of data can thus be derived and examined in detail. Populations of LC–MS/MS runs may contain redundant peptides and proteins. To ensure that redundancy is reduced to a defined standard prior to comparison, the actual peptide or protein sequences may be collapsed to non-redundant (NR) sets in SQL or BLAST and then summarized and compared [54]. BLAST may thus be used to collapse many proteins into a single representative protein molecule and to compare populations of LC–MS/MS protein lists [54]. SQL and BLAST may also be used to obtain available annotation from databases [54], [56], [58]. The available protein data identified from normal human blood using mass spectrometry were mapped to the RefSeq and ENSEMBL databases. The database provides up to five peptides in order of greatest representation, the source laboratory, the accession numbers, descriptor fields, GO terms and full-length sequences, all of which may be directly sorted, searched or copied using a database tool such as Microsoft Access or SQL Server 2005. A database of all blood protein and peptide sequences with descriptor fields and gene ontology terms might be useful for designing immunological or MRM assays from human blood [62], [63], [64].
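
The idea of using a hash of the sequence itself as a database key, and of collapsing redundant lists to NR sets before comparison, can be illustrated with a minimal sketch. This is not the authors' SQL code: the choice of SHA-1 among the Secure Hash Algorithms, the function names `sequid` and `collapse_nr`, the whitespace and case normalization, and the example peptide lists are all assumptions made for illustration.

```python
import hashlib


def normalize(sequence: str) -> str:
    """Strip whitespace and upper-case a sequence (assumed normalization)."""
    return "".join(sequence.split()).upper()


def sequid(sequence: str) -> str:
    """Return a SHA-1 hex digest of the normalized sequence as a key.

    SHA-1 is an assumption; the text only names the Secure Hash Algorithm.
    """
    return hashlib.sha1(normalize(sequence).encode("ascii")).hexdigest()


def collapse_nr(sequences):
    """Collapse a redundant list into a non-redundant dict: SEQUID -> sequence."""
    nr = {}
    for seq in sequences:
        nr.setdefault(sequid(seq), normalize(seq))
    return nr


# Hypothetical peptide lists reported by two groups (illustrative only).
group_a = ["LVNEVTEFAK", "lvnevtefak", "AEFAEVSK", "YLYEIAR"]
group_b = ["AEFAEVSK", "DLGEENFK", "YLYEIAR", "YLYEIAR"]

nr_a = collapse_nr(group_a)
nr_b = collapse_nr(group_b)

# Compare the two populations on the hashed keys rather than accession numbers.
shared = sorted(nr_a[k] for k in nr_a.keys() & nr_b.keys())
unique_a = sorted(nr_a[k] for k in nr_a.keys() - nr_b.keys())
unique_b = sorted(nr_b[k] for k in nr_b.keys() - nr_a.keys())

print("shared:", shared)
print("only in A:", unique_a)
print("only in B:", unique_b)
```

Because the key is derived from the sequence alone, the same peptide receives the same identifier regardless of which library or accession scheme it was originally reported against, which is what allows the set operations above to stand in for a join on SEQUID in a relational database.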

There is a concern that the results of tandem mass spectrometry of peptides are largely random or incorrect [65], [66], [67] and therefore it is a challenge to derive high confidence identifications from blood [68]. In this regard, the relative distribution of peptides to proteins from random matching alone was determined using 120,000 spectra matched to 40,000 random protein sequences [69]. Under the conditions used by Cargile et al. [69], if the data are merely a random fit of the spectra then about 87% of proteins should have only one peptide, about 11% of proteins should have two peptides and less than about 1% of proteins should have three peptides. The results of the distribution of peptides to proteins were unified on the basis of direct sequence examination and compared to Cargile's [69] estimates of the random proportions using the Chi Square test to determine if the discovery of blood proteins by LC–ESI-MS/MS was merely a random or false discovery.
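
The comparison against the random proportions can be sketched as a goodness-of-fit calculation. The following is an illustration under stated assumptions, not the authors' calculation: the observed counts are hypothetical, the third bin is taken as "three or more peptides" and assigned the remainder (0.02) so that the proportions sum to one, and the function name is invented for the example. The critical value 5.991 is the standard Chi Square threshold for 2 degrees of freedom at p = 0.05.

```python
def chi_square_statistic(observed, expected_proportions):
    """Pearson Chi Square statistic for observed counts vs expected proportions."""
    total = sum(observed)
    expected = [p * total for p in expected_proportions]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Random-matching proportions quoted in the text for 1 and 2 peptides per
# protein; the third bin (3 or more) takes the remainder, which is an assumption.
RANDOM_PROPORTIONS = [0.87, 0.11, 0.02]

# Hypothetical observed numbers of proteins with 1, 2 and >= 3 peptides.
observed_counts = [500, 250, 250]

# Critical value of the Chi Square distribution, 2 degrees of freedom, p = 0.05.
CRITICAL_VALUE_DF2_P05 = 5.991

statistic = chi_square_statistic(observed_counts, RANDOM_PROPORTIONS)
differs_from_random = statistic > CRITICAL_VALUE_DF2_P05

print(f"Chi Square statistic: {statistic:.1f}")
print("Distribution differs from random matching:", differs_from_random)
```

A statistic far above the critical value indicates that the observed peptide-to-protein distribution is unlikely to have arisen from random matching alone; a data set that reproduced the random proportions exactly would give a statistic near zero.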

Section snippets

Databases

The available serum and plasma peptide and protein data were parsed into SQL databases and the accession numbers and protein sequences obtained [31], [36], [37], [39], [40], [41], [42], [44], [48], [53], [70]. The ENSEMBL human protein database of 47,509 proteins was downloaded in 2008. The 2008 Human RefSeq database of human transcripts, together with their genomic loci and gene ontology (GO) annotations [29], was parsed into SQL and used for these calculations and contained a total of 33,492

BLAST analysis

Many of the previously reported accession numbers and/or protein sequences from the archives failed to find an exact match in the current RefSeq or ENSEMBL databases. However, in most cases an identical or similar protein that contains the same MS/MS peptides can be obtained by BLAST with accession numbers, sequences and annotation of identical or similar proteins [54]. Database entries where BLAST was used to obtain sequence descriptors or other annotation are indicated alongside by the

Peptide summary

A total of 75,432 redundant peptides were parsed and of these 57,784 were distinct (Table 1). Only 3483 distinct peptides were agreed upon by two groups or more and only 379 distinct peptide sequences were common to three groups or more. Double wild card searching around each peptide sequence (i.e. *peptide*) yielded 5769 peptides shared by two or more groups, with 3 groups in agreement on 947 peptides, and 5 groups agreeing on 173 peptides (Table 1). Hence greater levels of agreement were

Discussion

Discovering the peptides and proteins from blood is a challenging but not prohibitively difficult task. Comparing the results of experimental approaches to date may help in the design of cost efficient strategies. However, comparing a single parameter such as the number of proteins reported, number of peptides per protein, or quality of peptides reported [38] is not meaningful in itself since some groups used stringent correlation parameters, and algorithms, while others used more inclusive

Acknowledgements

The work was supported by a grant from the Ontario Cancer Biomarker Network to JGM.

References (89)

  • N.L. Anderson et al.

    Mol Cell Proteomics

    (2003)
  • D. Brutlag et al.

    Biochem Biophys Res Commun

    (1969)
  • T.R. Cech et al.

    Cell

    (1981)
  • K. Machida et al.

    Mol Cell

    (2007)
  • D. Schmucker et al.

    Cell

    (2000)
  • R.E. Moore et al.

    J Am Soc Mass Spectrom

    (2002)
  • W. Yan et al.

    Mol Cell Proteomics

    (2004)
  • J.N. Adkins et al.

    Mol Cell Proteomics

    (2002)
  • R.S. Tirumalai et al.

    Mol Cell Proteomics

    (2003)
  • J.V. Olsen et al.

    Mol Cell Proteomics

    (2004)
  • T. Liu et al.

    Mol Cell Proteomics

    (2006)
  • N.L. Anderson et al.

    Mol Cell Proteomics

    (2004)
  • L. Anderson et al.

    Mol Cell Proteomics

    (2006)
  • H. Keshishian et al.

    Mol Cell Proteomics

    (2007)
  • X. Liu et al.

    J Am Soc Mass Spectrom

    (2007)
  • H. Zhang et al.

    Mol Cell Proteomics

    (2007)
  • L. Belov et al.

    J Immunol Methods

    (2005)
  • J.A. Falkner et al.

    J Am Soc Mass Spectrom

    (2007)
  • M. Tucholska et al.

    Anal Biochem

    (2007)
  • N.L. Anderson et al.

    Mol Cell Proteomics

    (2002)
  • S.N. Cohen et al.

    Mol Gen Genet

    (1974)
  • F. Sanger et al.

    Proc Natl Acad Sci U S A

    (1977)
  • K.B. Mullis

    Ann Biol Clin (Paris)

    (1990)
  • W.R. McCombie et al.

    Nat Genet

    (1992)
  • M.D. Adams et al.

    Nat Genet

    (1993)
  • S. Brenner et al.

    Nat Biotechnol

    (2000)
  • M. Sanchez-Carbayo et al.

    Cancer Res

    (2002)
  • L.M. Smith et al.

    Nature

    (1986)
  • J.C. Venter et al.

    Science

    (2001)
  • D. Soll et al.

    Proc Natl Acad Sci U S A

    (1965)
  • M.Q. Zhang

    Proc Natl Acad Sci U S A

    (1997)
  • K. Knapp et al.

    Nucleic Acids Res

    (2007)
  • G.B. Hutchinson et al.

    Nucleic Acids Res

    (1992)
  • T. Pawson et al.

    Science

    (2003)
  • B. McClintock

    Genetics

    (1941)
  • A.A. Strekalov

    Sov Genet

    (1973)
  • J. Urbain

    Rev Fr Etud Clin Biol

    (1969)
  • S. Ono

    J Med Genet

    (1972)
  • S. Tonegawa et al.

    Proc Natl Acad Sci U S A

    (1974)
  • P. Lonai et al.

    Embo J

    (1983)
  • J. Kavaler et al.

    Nature

    (1984)
  • M. Hagen et al.

    Embo J

    (1999)
  • P.J. Kersey et al.

    Proteomics

    (2004)
  • D.R. Maglott et al.

    Nucleic Acids Res

    (2000)

Contributions: P.B. performed the calculations shown, in part using the modified code of V.P. and P.Z.; V.P. wrote the original SQL codes and approaches; P.Z. wrote the original BLAST parsing code and GO terms; J.G.M. conceived the study, supervised the work, and wrote the paper with P.B.
