Enhanced Missing Proteins Detection in NCI60 Cell Lines Using an Integrative Search Engine Approach

The Human Proteome Project (HPP) aims to decipher the complete map of the human proteome. In the past few years, the HPP teams have dedicated significant efforts to the experimental detection of missing proteins, which lack reliable mass spectrometry evidence of their existence. In this endeavor, an in-depth analysis of shotgun experiments might represent a valuable resource for selecting a biological matrix when designing validation experiments. In this work, we used all the proteomic experiments from the NCI60 cell lines and applied an integrative approach based on the results obtained from Comet, Mascot, OMSSA, and X!Tandem. This workflow benefits from the complementarity of these search engines to increase the proteome coverage. Five missing proteins were identified in compliance with the C-HPP guidelines, although further validation is needed. Moreover, 165 missing proteins were detected with only one unique peptide, and their functional analysis supported their participation in cellular pathways, as was also proposed in other studies. Finally, we performed a combined analysis of the gene expression levels and the proteomic identifications from the cell lines common to the NCI60 and the CCLE project to suggest alternatives for further validation of missing protein observations.


False discovery rate (FDR) calculation
Several methods have been reported in the literature to calculate the FDR using the target-decoy search strategy [3]. There are basically two ways of performing the search with the selected search engine against these proteomic databases: using a concatenated database, in which the decoy and target databases are combined and searched together, or using separate searches against the decoy and the target database. We used the second method, in which the FDR at PSM level is calculated as:

$$\mathrm{FDR} = \frac{D}{T}$$

where D is the number of decoy PSMs passing a given threshold (ion score), which estimates the number of false positives (FP), and T is the number of target PSMs above that threshold (the sum of the false positives and true positives). For the calculation of the peptide and protein FDR we used an in-house implementation that calculated the protein identification FDR on the basis of peptide identifications (PSM FDR < 1%) with the best ion score per peptide or protein. Following the HPP guidelines, only those proteins with pepFDR and protFDR < 1% and with two or more unique peptides were considered positive identifications.

R COMMANDS:
## PSM FDR
# Input: a matrix ("protPep") with all the PSMs of the searches
# with target and decoy databases, their corresponding score and
# the database used in the search for each PSM. Columns of the
# matrix are PSMid (PSM identifier), PepSeq (assigned peptide
# sequence), PEPID (peptide identifier of the database),
# PSMscore (search engine score for the assignment),
# ProtID (Protein identifier of the database), Database (Target
# or Decoy database used in the search).
# Output: the input matrix ordered by score and with an
# additional column including the PSM FDR.
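The header above describes the input and output of the PSM FDR step. A minimal sketch of such a function is given here for illustration only; it is not the in-house implementation. It assumes that protPep is held as a data frame with the columns named in the header, that a higher PSMscore means a better match, and that the Database column contains the strings "Target" and "Decoy". The function name psm_fdr and the output column name PSMFDR are hypothetical.

```r
# Hypothetical sketch: psm_fdr and the PSMFDR column name are assumptions.
psm_fdr <- function(protPep) {
  # Order PSMs from best to worst score (assumes higher score = better match)
  protPep <- protPep[order(protPep$PSMscore, decreasing = TRUE), ]
  is_decoy <- protPep$Database == "Decoy"
  # Running counts of decoy (D) and target (T) PSMs at or above each score
  n_decoy <- cumsum(is_decoy)
  n_target <- cumsum(!is_decoy)
  # FDR = D / T, capped at 1; set to 1 while no target PSM has been seen
  protPep$PSMFDR <- ifelse(n_target > 0, pmin(n_decoy / n_target, 1), 1)
  protPep
}

# Toy input with made-up identifiers and scores
psms <- data.frame(
  PSMid    = paste0("psm", 1:6),
  PepSeq   = c("AAK", "CCR", "DDK", "EEK", "FFR", "GGK"),
  PSMscore = c(60, 55, 50, 45, 40, 35),
  Database = c("Target", "Target", "Target", "Decoy", "Target", "Decoy"),
  stringsAsFactors = FALSE
)
res <- psm_fdr(psms)
print(res[, c("PSMid", "PSMscore", "Database", "PSMFDR")])
```

Tied scores are counted in row order in this sketch, so PSMs with identical scores may receive slightly different FDR values.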

Estimation of the number of false-positive identifications
The FDR at the different levels of the analysis (PSMs, peptides and proteins) was estimated using the number of decoys identified as a measure of the number of false positives included in the selection, so that at each level the FDR is the ratio of decoy to target identifications passing the threshold.
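The same decoy-counting idea can be carried from PSMs to peptides by keeping only the best-scoring PSM per peptide before counting. The sketch below illustrates this under stated assumptions and does not reproduce the in-house implementation: the input is a data frame of PSMs with PepSeq, PSMscore and Database columns, a higher score is better, and any PSM-level filtering has already been applied. The function name peptide_fdr and the column name pepFDR as used in the code are hypothetical.

```r
# Hypothetical sketch: peptide_fdr is an assumed name, not the authors' code.
peptide_fdr <- function(psms) {
  # Sort so that the first occurrence of each peptide is its best-scoring PSM
  psms <- psms[order(psms$PSMscore, decreasing = TRUE), ]
  # Keep one row per peptide sequence within each database (target or decoy)
  best <- psms[!duplicated(psms[, c("PepSeq", "Database")]), ]
  is_decoy <- best$Database == "Decoy"
  n_decoy <- cumsum(is_decoy)
  n_target <- cumsum(!is_decoy)
  # Peptide-level FDR = decoy peptides / target peptides above each score
  best$pepFDR <- ifelse(n_target > 0, pmin(n_decoy / n_target, 1), 1)
  best
}

# Toy input with made-up sequences and scores
toy <- data.frame(
  PepSeq   = c("AAK", "AAK", "CCR", "XXK", "DDK", "XXK"),
  PSMscore = c(60, 58, 55, 50, 45, 30),
  Database = c("Target", "Target", "Target", "Decoy", "Target", "Decoy"),
  stringsAsFactors = FALSE
)
pep <- peptide_fdr(toy)
print(pep)
```

A protein-level estimate could follow the same pattern by grouping on a protein identifier instead of the peptide sequence.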

Spectral matching
Before matching, each query spectrum undergoes a simple preprocessing procedure. Isotopic peaks are removed and the peak intensities are rescaled to the range 0 to 1.
To determine peak hits when matching two spectra, a mass tolerance is used to compute the allowed mass shifts for peak matching. Then the spectral dot-product score (SDP_Score) is calculated, in its standard normalized form, as:

$$\mathrm{SDP\_Score} = \frac{\sum I_{e}\, I_{s}}{\sqrt{\sum I_{e}^{2}\,\sum I_{s}^{2}}}$$

where $I_{e}$ and $I_{s}$ denote the intensities of the endogenous peak (obtained from the NCI60 dataset) and synthetic peak (obtained from synthetic peptides), respectively.
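As an illustration of the matching step, the following sketch computes a normalized dot product between an endogenous and a synthetic spectrum. Several choices here are assumptions rather than details from the text: intensities are rescaled by dividing by the most intense peak, each endogenous peak is paired with its nearest synthetic peak within the tolerance, the denominator uses all peaks of both spectra so that unmatched peaks lower the score, and isotopic peak removal is omitted. The function name sdp_score and the default tolerance of 0.5 are hypothetical.

```r
# Hypothetical sketch: sdp_score and the default tol value are assumptions.
sdp_score <- function(mz_e, int_e, mz_s, int_s, tol = 0.5) {
  # Rescale intensities to the range 0 to 1 (division by the base peak)
  int_e <- int_e / max(int_e)
  int_s <- int_s / max(int_s)
  # For each endogenous peak, index of the closest synthetic peak
  nearest <- vapply(mz_e, function(m) which.min(abs(mz_s - m)), integer(1))
  # A pair counts as a hit only if the mass shift is within the tolerance
  matched <- abs(mz_s[nearest] - mz_e) <= tol
  # Dot product over matched pairs, normalized by the norms of both spectra
  num <- sum(int_e[matched] * int_s[nearest[matched]])
  num / sqrt(sum(int_e^2) * sum(int_s^2))
}

# Toy spectra with made-up m/z values and intensities
mz  <- c(175.1, 276.2, 389.3, 502.3)
int <- c(200, 1000, 450, 300)
same    <- sdp_score(mz, int, mz, int)        # identical spectra
shifted <- sdp_score(mz, int, mz + 10, int)   # no peak within tolerance
print(c(same = same, shifted = shifted))
```

In this sketch a synthetic peak may be paired with more than one endogenous peak; a stricter one-to-one assignment would be needed for dense spectra.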

Selection of samples where transcripts of a protein are highly expressed
A multi-omic bioinformatic analysis is proposed to highlight the samples in which the probability of detecting missing proteins is higher. For this purpose, we can use the gene expression profiles of the tissues or cell lines for which transcriptomic data are available. In the present work we used the BAM files corresponding to the cell lines available in both the NCI60 dataset and the CCLE project. We processed the BAM files as explained in the Methods section and obtained a matrix of normalized gene expression profiles. We define expressed and highly expressed genes based on the histogram of the normalized counts for all the genes in all the samples: a gene is considered expressed in a sample when its expression value is greater than the first quartile (Q1), and highly expressed when it exceeds the third quartile (Q3).
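The quartile rule can be sketched as follows. This is an illustrative example, not the pipeline used in the study: it assumes a numeric matrix of normalized expression values with genes in rows and samples in columns, and it computes Q1 and Q3 with R's default quantile method over the pooled values. The function name classify_expression and the gene and cell line labels are hypothetical.

```r
# Hypothetical sketch: classify_expression is an assumed name.
classify_expression <- function(expr) {
  # Q1 and Q3 of the pooled distribution over all genes and all samples
  q <- quantile(as.vector(expr), probs = c(0.25, 0.75), names = FALSE)
  status <- matrix("not expressed", nrow = nrow(expr), ncol = ncol(expr),
                   dimnames = dimnames(expr))
  # Expressed: above Q1; highly expressed: above Q3 (overrides "expressed")
  status[expr > q[1]] <- "expressed"
  status[expr > q[2]] <- "highly expressed"
  status
}

# Toy matrix with made-up gene and cell line labels
expr <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8), nrow = 4,
               dimnames = list(c("G1", "G2", "G3", "G4"), c("cellA", "cellB")))
status <- classify_expression(expr)
print(status)
```

For a missing protein, the columns labeled "highly expressed" in its row would point to the cell lines that are candidate samples for validation experiments.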