Proteomic dataset: Profiling of cultivated Escherichia coli isolates from Crohn's disease patients and healthy individuals

One of the dysbioses often observed in Crohn's disease (CD) patients is an increased abundance of Escherichia coli (10–100 fold compared to healthy individuals) (Gevers et al., 2014). The data reported here are a large-scale proteome profile of E. coli isolates collected from CD patients and healthy individuals. 43 isolates were obtained from 30 CD patients (17 male, 12 female, median age 30) and 19 isolates from 7 healthy individuals (7 male, median age 19). Isolates were cultivated in LB medium under aerobic conditions to mid-log phase. Proteins were extracted with sodium deoxycholate (DCNa) and urea, then reduced and alkylated with tris(2-carboxyethyl)phosphine and iodoacetamide. Protein trypsinolysis was performed as described in (Matyushkina et al., 2016). Total cell proteomes were analysed by shotgun proteomics with HPLC-MS/MS on a maXis qTOF mass spectrometer. The data, including HPLC-MS/MS raw files and exported Mascot search results, were deposited in the PRIDE repository under project accession PXD010920 (https://doi.org/10.6019/PXD010920).


Keywords: Proteome; Crohn's disease; HPLC-MS/MS

Data
Escherichia coli is often observed as an abundant bacterium in the intestines of Crohn's disease (CD) patients (Gevers et al., 2014) [1], in contrast with healthy individuals. To identify proteins expressed in E. coli isolates from CD patients and healthy individuals (listed in Supplementary

Value of the data
- The dataset contains the first published wide-range proteome analysis of Escherichia coli isolates from Crohn's disease patients and healthy individuals (104 raw HPLC-MS/MS analyses searched against three different databases) and is valuable for researchers interested in bacterial proteomics.
- The data can be of value for studies of pathogenic/nonpathogenic Escherichia coli.
- The data might be useful in studies of the mechanisms of Crohn's disease pathogenesis.

Patients and samples
Escherichia coli isolates were obtained from feces, ileum biopsies and liquid ileal content of Crohn's disease (CD) patients, and from ileal content and feces of healthy individuals. Samples from CD patients were collected during diagnostic endoscopy at Central Scientific Institute of Gastroenterology (Moscow Clinical

E. coli isolation and cultivation
Isolation of E. coli was as follows: liquid aspirates were diluted approximately 10^6-fold with sterile PBS. Approximately 0.05 ml of feces was placed into 0.5 ml of sterile PBS and vortexed to homogeneity, and an aliquot was diluted approximately 10^6-fold. Biopsy samples were vortexed in 0.2 ml of sterile PBS. For all samples, 0.1 ml of the resulting liquid was spread onto LB agar plates. After overnight incubation at 37 °C, isolated colonies were identified as Escherichia coli by MALDI mass spectrometry on a Microflex LT mass spectrometer with the MALDI Biotyper software (Bruker Daltonics, Germany).
Isolates were cultivated in LB at 37 °C (200 rpm) for 14 h; this was the third passage from the initial sample. Overnight cultures were diluted to an OD540 of 0.04 and grown under the same conditions to mid-log phase (OD540 of 0.4). Bacterial cells were harvested by centrifugation (3500 g, 15 min) and the pellet was washed twice with PBS.
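The dilution of an overnight culture to the starting optical density can be worked out with the usual C1·V1 = C2·V2 relation. The sketch below assumes that optical density is proportional to cell density over the range used; the overnight OD and the culture volume in the example are hypothetical values, not taken from the protocol.

```python
def inoculum_volume_ml(od_overnight, od_target, final_volume_ml):
    """Volume of overnight culture needed so the diluted culture starts at od_target."""
    if od_overnight <= od_target:
        raise ValueError("overnight culture must be denser than the target OD")
    # C1 * V1 = C2 * V2, treating OD as proportional to cell density (assumption)
    return final_volume_ml * od_target / od_overnight


# Hypothetical example: overnight OD540 of 2.0, 50 ml fresh culture, target OD540 of 0.04
volume = inoculum_volume_ml(od_overnight=2.0, od_target=0.04, final_volume_ml=50.0)
print(f"{volume:.2f} ml of overnight culture")
```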

Tryptic digestion of E. coli proteins
Protein trypsinolysis was performed as described in (Matyushkina et al., 2016) [2] with some alterations. Cell pellets were washed with PBS. The bacterial pellet was resuspended in 10 µl of 100 mM NH4HCO3 with 0.5 mg/ml lysozyme and 1/10 volume of protease inhibitor mix. The suspension was incubated for 1 h at +4 °C. Then 10 µl of 10% sodium deoxycholate (DCNa) and 1 µl of nuclease mix (Promega) were added to the sample. The suspension was incubated for 1 h at +4 °C. Then the sample was diluted with 100 µl of 100 mM Tris-HCl pH 8.0 with 2.5 mM EDTA. Cells were lysed by ultrasonication for 1 min. Proteins were extracted with urea, which was dissolved in each sample to a concentration of 6 M, followed by incubation for 20 min at room temperature. After centrifugation for 10 min at 12 000 g, the protein concentration in the supernatant was measured by Bradford assay (Quick Start Bradford Protein Assay, BioRad) and the samples were equalized.
Reduction and alkylation were performed as follows. The reducing agent tris(2-carboxyethyl)phosphine (TCEP) was added to 10 mM and samples were incubated at 37 °C for 30 min. Then iodoacetamide (IAA) was added to 30 mM and samples were kept at room temperature in the dark for 30 min. To avoid chemical modifications and remove the unreacted IAA, samples were treated with 5 mM TCEP and incubated for 20 min at room temperature. Protein hydrolysis was performed with trypsin (20 µg per sample, Trypsin Gold, Mass Spectrometry Grade, Promega) for 16 h at room temperature. After that, samples were diluted with 6 volumes of 100 mM Tris-HCl pH 8.0, and protein hydrolysis was performed by addition of trypsin (trypsin : protein ratio of 1 : 50, Trypsin Gold, Mass Spectrometry Grade, Promega) in 0.1% SDS and incubation at 37 °C for 17 h. Trypsinolysis was then stopped by addition of 10% TFA and incubation at 37 °C for 30 min. After centrifugation for 10 min at 12 000 g, the supernatant was collected and cleaned with C18 cartridges (Discovery DSC-18 Tube, Supelco) according to the manufacturer's protocol.
The resulting peptide extracts were dried in a SpeedVac (Labconco) and dissolved in 15 µl of LC-MS/MS sample buffer containing 3% acetonitrile and 0.1% trifluoroacetic acid. The equivalent of 5 µg of protein was loaded for HPLC-MS/MS analysis.

HPLC-MS/MS analysis
The HPLC-MS/MS analysis of the tryptic peptides was carried out using an Ultimate-3000 HPLC system (Thermo Scientific) coupled to a maXis qTOF after the HDC-cell upgrade (Bruker) with a nanoelectrospray source. The chromatographic separation of the peptides was performed on a trap-elute system: trap column (Zorbax 300SB-C18, 5 mm × 0.3 mm, particle diameter 5 µm, Dionex) and analytical column (Zorbax 300SB-C18, 150 mm × 75 µm, particle diameter 3.5 µm, Agilent). The gradient parameters were as follows: 5–35% acetonitrile in aqueous 0.1% (v/v) formic acid, column flow 0.3 µl/min. The gradient duration was 120 min. The positive MS and MS/MS spectra were acquired in AutoMSMS mode (capillary voltage 1700 V, curtain gas flow 4 l/min, temperature 170 °C, spectra rate 4 Hz, 20 precursors, m/z range 200–1500, active exclusion after 2 spectra, release after 0.5 min). The lists of compounds (mgf files) were generated after lock mass calibration (m/z 445.1200) with Compass DataAnalysis (Bruker).

Protein identification and quantitative analysis
Proteins were identified by peptide search with Mascot using the following parameters: peptide mass tolerance 0.05 Da, fragment mass tolerance 0.1 Da, variable modifications Carbamidomethyl (C) and Oxidation (M), cutting enzyme trypsin, and 1 missed cleavage allowed per peptide.
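To illustrate what "trypsin with 1 missed cleavage" means for the search space, here is a minimal in-silico digestion sketch. It assumes the common rule that trypsin cleaves C-terminal to K or R except when the next residue is P; the function name and the example sequence are hypothetical, and this is not the search engine's own implementation.

```python
import re


def tryptic_peptides(sequence, missed_cleavages=1):
    """In-silico trypsin digest: cleave after K or R, except before P (assumed rule)."""
    # Zero-width split at positions preceded by K/R and not followed by P
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", sequence) if f]
    peptides = []
    for i in range(len(fragments)):
        # Join up to `missed_cleavages` additional neighbouring fragments
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


# Hypothetical sequence: the R-P bond is not cleaved, so "RPGGK" stays intact
print(tryptic_peptides("AAKRPGGKCCR"))
```

With one missed cleavage allowed, each fully cleaved fragment is reported alone and joined with its immediate successor, which is why the candidate peptide list is longer than the list of cleavage fragments.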
Peptide searches for protein identification were performed against protein sequence databases. The databases for the Mascot search were created as follows:
- Ecoli-16032016-kerat.fasta: created by translation and annotation, with PROKKA 1.7, of 14 CD E. coli isolates and 12 isolates from healthy individuals (summarized and described in Rakitina et al., 2017 [3]). Similar proteins (>80% homology over >80% of the sequence) were merged, and the one showing maximum similarity with the other group members was used as the representative. The database included 92600 sequences and 32006615 residues in total. The cut-off ion score was >28 as an indicator of identity (p-value < 0.05).
- Nissle1917_goodProt_kerat.fasta: formed on the basis of the genome of a typical symbiotic E. coli strain.
- Escherichia_coli_LF82_uid161965-1.fasta: formed on the basis of the genome of a typical CD E. coli strain.
Amino acid sequences of trypsin (Promega) and human keratins were added to all databases to avoid misinterpretation of contaminating proteins. A protein was considered identified when it was supported by at least two unique peptides with scores above the threshold. Lists of identified proteins are given in Supplementary Tables 2, 3 and 4.
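The identification criterion can be sketched as a simple filter over peptide matches. This is an illustration only: the input tuple layout, the function name and the example accessions are hypothetical, "unique" is interpreted here as distinct peptide sequences per protein, and the ion score cut-off of 28 is the value quoted for the isolate database.

```python
from collections import defaultdict

ION_SCORE_CUTOFF = 28      # ion score > 28 is quoted in the text as the identity threshold
MIN_UNIQUE_PEPTIDES = 2    # at least two unique peptides per protein


def identified_proteins(peptide_matches, contaminants=()):
    """peptide_matches: iterable of (protein_id, peptide_sequence, ion_score) tuples."""
    unique = defaultdict(set)
    for protein, peptide, score in peptide_matches:
        if score > ION_SCORE_CUTOFF:
            unique[protein].add(peptide)  # repeated sequences are counted once
    return sorted(
        protein
        for protein, peptides in unique.items()
        if len(peptides) >= MIN_UNIQUE_PEPTIDES and protein not in contaminants
    )


# Hypothetical matches; "KRT1" stands in for a keratin contaminant entry
matches = [
    ("P1", "AAK", 40), ("P1", "CCR", 35),
    ("P2", "DDK", 50), ("P2", "DDK", 45),
    ("P3", "EEK", 20), ("P3", "FFK", 60),
    ("KRT1", "GGK", 70), ("KRT1", "HHK", 80),
]
print(identified_proteins(matches, contaminants={"KRT1"}))
```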
Protein abundances were evaluated by a label-free method using the emPAI (Exponentially Modified Protein Abundance Index) determined by Mascot for each identified protein (Shinoda et al., 2010) [4]. Proteins significantly overrepresented in the CD or healthy group are listed in Supplementary Table 5. The numbers of proteins significantly overrepresented in CD or healthy isolates, identified in searches against the three databases, are given in Fig. 1.
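For orientation, the emPAI is commonly defined as 10 raised to the ratio of observed to observable peptides, minus 1, and can be normalized to a molar percentage across proteins. The sketch below follows that general definition; the search engine estimates the number of observable peptides in its own way, so the function names and example counts here are hypothetical and the values will not necessarily match the exported ones.

```python
def empai(n_observed, n_observable):
    """emPAI = 10 ** (observed peptides / observable peptides) - 1."""
    if n_observable <= 0:
        raise ValueError("n_observable must be positive")
    return 10 ** (n_observed / n_observable) - 1


def molar_percent(empai_by_protein):
    """Normalize emPAI values so that they sum to 100 across the listed proteins."""
    total = sum(empai_by_protein.values())
    return {protein: 100 * value / total for protein, value in empai_by_protein.items()}


# Hypothetical peptide counts: (observed, observable)
counts = {"protA": (10, 10), "protB": (5, 10)}
values = {protein: empai(obs, possible) for protein, (obs, possible) in counts.items()}
print(molar_percent(values))
```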

Proteins abundance comparison between CD and healthy groups of E. coli isolates
Proteins over- or under-represented in the CD and healthy groups of E. coli isolates were identified with the two-way Fisher test, applied separately to each protein.
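A per-protein comparison of this kind can be sketched with a two-sided Fisher's exact test on a 2×2 table. The table layout is an assumption: each protein is counted as detected or not detected in the isolates of each group, which the text does not spell out. The function name and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Assumed layout: a = CD isolates with the protein, b = CD isolates without,
    c = healthy isolates with the protein, d = healthy isolates without.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x detections in the first row
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more likely than the observed one
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Hypothetical counts: detected in 30 of 43 CD isolates and 5 of 19 healthy isolates
print(fisher_exact_two_sided(30, 13, 5, 14))
```

Because one test is run per protein, some form of multiple-testing control would normally be considered alongside it; the text does not state which, if any, was applied.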
Principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) were used for data analysis. Principal components were constructed as an orthogonal transformation of the analyzed data set. The principal component plot shows the directions along which the variation of the data is maximal, so the 2D plot is a projection of the distances among variables in multidimensional space. Variables in the 2D plot can group into clusters reflecting the correlation among them, as in cluster analysis. PCA was performed in R with prcomp. t-SNE is a machine learning algorithm for visualization of high-dimensional data based on nonlinear dimensionality reduction. t-SNE analysis was performed in R with Rtsne.
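The idea behind the first principal component can be shown with a small standard-library sketch. It is not the R prcomp code used for the dataset: it centers the data, builds the sample covariance matrix and finds the direction of maximum variance by power iteration, with a hypothetical function name and toy data.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, scores) of PC1 for a list of equal-length numeric rows."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix (m x m)
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    # Fixed start keeps the result deterministic; it fails only if this vector
    # happens to be orthogonal to PC1
    v = [1.0] * m
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Projection of each centered sample onto PC1
    scores = [sum(c[j] * v[j] for j in range(m)) for c in centered]
    return v, scores


# Toy data lying on the line y = 2x: PC1 points along (1, 2)
direction, scores = first_principal_component([[0, 0], [1, 2], [2, 4], [3, 6]])
print(direction, scores)
```

Further components would be obtained by removing the variance along PC1 and repeating; in practice a library routine based on singular value decomposition does this in one step.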
The plotted 2D projections are given in Figs. 2–4. Patients' sex, isolate sources and diagnoses are indicated.