Data showing the compositional complexity of the mitochondrial proteome of a unicellular eukaryote (Acanthamoeba castellanii, supergroup Amoebozoa)

This article describes and directly links to 1033 Acanthamoeba castellanii mitochondrial protein sequences. Of these, 709 are supported by Mass Spectrometry (MS) data (676 nucleus-encoded and 33 mitochondrion-encoded). Two of these entries are previously unannotated mtDNA-encoded proteins, which we identify as highly divergent mitochondrial ribosomal proteins. Our analysis corrects many A. castellanii protein sequences that were incorrectly inferred previously from genomic data deposited in NCBI.


Specifications
Value of the data
We report 1033 A. castellanii mitochondrial protein sequences, 709 supported by MS/MS data (676 nucleus-encoded and 33 mitochondrion-encoded), including two previously unannotated mtDNA-encoded components, which we identify as highly divergent mitochondrial ribosomal proteins.
All reported sequences are complete, have been verified against transcriptomic data, and are interpreted in [3].
Our analysis corrects many A. castellanii protein sequences that were previously incorrectly inferred from genomic data deposited in NCBI.
Mitochondrial targeting sequences (MTS) were predicted using the MitoProt and/or TargetP programs.

Data, experimental design, materials and methods
In-gel protein digestion, in-solution protein digestion, SCX-HPLC of peptides and MS/MS were performed as described in [4]. Precursor ions with charges of 2+ and 3+ were examined. Peptide sequences were assigned using Mascot (Matrix Science) to search an EST database with the following parameters: MS and MS/MS mass tolerances were set to ±1.2 and ±0.6 Da, respectively. One missed cleavage was allowed, and carbamidomethylcysteine and oxidized methionine were set as fixed and variable modifications, respectively. Searches were based on a significance threshold of p < 0.05. MudPIT scoring was used to remove protein hits that had scores based purely on a large number of low-scoring peptide matches. The ion score cut-off was set at ≥25 and each protein hit was required to have at least one bold red (best match in database) peptide. False positive rates were calculated using the decoy option provided by Mascot and estimated as below 2%. Peptide sequences and inferred protein identities are compiled in Supplemental Table S1.
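The acceptance criteria above can be illustrated with a minimal Python sketch. It is not Mascot's implementation: the `PeptideMatch` record, its field names, and the simple decoy/target ratio used for the false positive rate are assumptions made for illustration only. The ion score cut-off of 25 and the requirement for at least one top-ranked ("bold red") peptide per protein are taken from the text.

```python
from dataclasses import dataclass

ION_SCORE_CUTOFF = 25.0  # cut-off stated in the text (ion score >= 25)


@dataclass(frozen=True)
class PeptideMatch:
    """Hypothetical record for one peptide-spectrum match."""
    protein: str
    ion_score: float
    is_top_ranked: bool  # stand-in for Mascot's "bold red" flag
    is_decoy: bool       # True if the match came from the decoy database


def passing_matches(matches):
    """Keep only matches at or above the ion score cut-off."""
    return [m for m in matches if m.ion_score >= ION_SCORE_CUTOFF]


def accepted_proteins(matches):
    """Proteins with at least one passing, top-ranked, non-decoy peptide."""
    return {
        m.protein
        for m in passing_matches(matches)
        if m.is_top_ranked and not m.is_decoy
    }


def decoy_false_positive_rate(matches):
    """Assumed estimate: passing decoy matches divided by passing target matches."""
    passing = passing_matches(matches)
    targets = sum(1 for m in passing if not m.is_decoy)
    decoys = sum(1 for m in passing if m.is_decoy)
    return decoys / targets if targets else 0.0
```

MudPIT scoring and the p < 0.05 significance threshold are handled inside Mascot itself and are deliberately left out of this sketch.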
Some of these data have previously been published [6][7][8][9]. A Mascot semi-tryptic peptide search was employed to detect putative mature protein N-termini, essentially as described [10]. Briefly, semi-tryptic peptides with ion scores surpassing the Mascot identity threshold and lacking an N-terminal Arg or Lys tryptic cleavage site were considered. If a semi-tryptic peptide lacking an N-terminal Arg or Lys was located in the N-terminal region of the inferred protein sequence (usually the first 50-60 amino acids) and not located in a region of the protein conserved in other species, it was considered to be the putative mature protein N-terminus [2].
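The positional part of this rule can be sketched in Python as follows. The function name and the 60-residue default window are illustrative assumptions (the text gives "usually the first 50-60 amino acids"), and "lacking an N-terminal Arg or Lys" is interpreted here as the residue preceding the peptide in the protein not being Arg or Lys. The ion score check and the cross-species conservation check are not modelled and would need to be applied separately.

```python
N_TERMINAL_WINDOW = 60  # assumed upper bound of the "first 50-60 amino acids"


def putative_mature_n_terminus(protein, peptide, window=N_TERMINAL_WINDOW):
    """Return the 1-based start of a candidate mature N-terminus, or None.

    A peptide qualifies when it lies within the N-terminal window of the
    inferred protein and the residue immediately before it is not Lys or
    Arg, meaning its N-terminus was not generated by tryptic cleavage.
    """
    start = protein.find(peptide)  # 0-based index, -1 if the peptide is absent
    if start == -1:
        return None
    if start == 0:
        return None  # peptide begins at residue 1, so no presequence is implied
    if start >= window:
        return None  # outside the N-terminal region
    if protein[start - 1] in "KR":
        return None  # ordinary tryptic N-terminus
    return start + 1
```

For example, a peptide beginning at residue 11 and preceded by alanine would be returned as a candidate, whereas the same peptide preceded by lysine would be rejected as a normal tryptic product.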
OrthoMCL [11] was used to identify orthologs/co-orthologs of A. castellanii proteins in the predicted cellular proteomes of Arabidopsis thaliana (35,386 sequences from TAIR 10 protein dataset), Homo sapiens (20,270 sequences from UniProt), Saccharomyces cerevisiae (6,572 sequences from Saccharomyces Genome Database), and Tetrahymena thermophila (27,054 sequences from NCBI nr database), all of which have been examined intensively at the level of the mitochondrial proteome. For A. castellanii protein sequences, we used a six-frame translation of RNA-seq data supplemented with corrected versions of each mitochondrial protein identified in our analysis. OrthoMCL was run according to the recommended parameters, with an E-value threshold of 1e-5. Data are compiled in Supplemental Table S3.
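For readers unfamiliar with six-frame translation, the following Python sketch shows the general idea. It is not the pipeline used to generate the data: the function names are hypothetical, the standard genetic code is assumed throughout, and stop codons are simply emitted as `*` rather than used to split open reading frames.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumed here).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unrecognised codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return translations keyed '+1'..'+3' (forward) and '-1'..'-3' (reverse)."""
    seq = seq.upper().replace("U", "T")  # accept RNA-style input
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames
```

Each transcript therefore yields six candidate protein strings, which can then be supplied to an orthology tool alongside the corrected mitochondrial protein sequences.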