Dataset reporting 4654 cow milk proteins listed according to lactation stages and milk fractions

Milk contains numerous proteins, including bioactive molecules that may be important in human nutrition. Thanks to improvements in proteomic methods, hundreds of proteins identified in milk are available through open data from different publications. We gathered these public data to produce an atlas of cow milk proteins. We aggregated data from 20 publications reporting the milk proteome and produced an atlas of 4654 unique proteins detected in milk from healthy cows. In this atlas, proteins are categorized according to four milk fractions: skimmed milk, whey, milk fat globule membranes (MFGM) and exosomes; and five lactation stages: colostrum period, early lactation, peak of lactation, mid-lactation and drying-off. These 9 protein lists were compared and annotated with Gene Ontology (GO) terms to identify the pathways they contribute to and the molecular signatures of the different milk fractions and lactation stages. This data article compiles the 4654 cow milk proteins. The atlas may be used by human nutrition researchers interested in milk protein allergy and/or digestibility in humans, and by the milk processing industry. The atlas may be useful to i) find molecular signatures of physiological adaptations of dairy cows, and ii) facilitate the isolation of proteins of interest, thanks to knowledge of their presence in milk fractions and their period of secretion during lactation.


© 2020 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Data
The three datasets supporting this article are available in the "Portail Data INRAE" online repository (https://data.inra.fr/). Each dataset contains one table file: i) The dataset "Distribution of 4654 cows milk proteins among different milk fractions" (doi.org/10.15454/SUJJSQ) contains the file "milk_fraction_4654_proteins" (.tab format, which can be opened with Excel), which reports the comparison of the four milk fractions resulting in specific and common Gene Name (GN) lists. The sheet entitled "Milk fractions" reports the list of proteins (identified by GN) that are either specific to a milk fraction or present in one, two, three or four milk fractions. ii) The dataset "Distribution of 4654 cows milk proteins among lactation stages" (doi.org/10.15454/MKM1P4) contains the file "lactation_stage_4654_proteins" (.tab format, which can be opened with Excel), which reports the comparison of the five lactation stages resulting in specific and common GN lists. The sheet entitled "Lactation stages" reports the lists of proteins (identified by GN) that are either specific to a lactation stage or present in one, two, three, four or five lactation stages. iii) The dataset "Gene ontology of proteins detected in different milk fractions and lactation stages of dairy cows" (doi.org/10.15454/1RF3R2) contains 10 files "GO_specific_fraction or lactation stage_4654_proteins" (.tab format, which can be opened with Excel), which report the GO enrichment analysis performed on the GN lists specific to a milk fraction or a lactation stage using the ProteINSIDE web service. As an example, the file entitled "GO_3_Whey" reports the GO enrichment of the list of GN exclusively found in whey.
Fig. 1 is a flowchart reporting the workflow for the construction and analysis of the milk proteome atlas. Table 1 reports the list of the 20 publications on cow milk proteome used to build the atlas. Table 2 reports the numbers of GN without duplicates, the numbers of datasets (in parentheses) and the associated references (with publication identifiers from Table 1 in superscript) depending on lactation stage and milk fraction.

Specifications Table
Subject: Animal Science and Zoology
Specific subject area: We aggregated data from 20 publications reporting milk proteome and produced an atlas of 4654 unique proteins detected in milk from healthy cows.
Type of data

Value of the Data
The atlas provides the presence of proteins within a milk fraction to facilitate extraction and quantification.
The atlas provides information on the lactation stage at which a protein of interest is secreted into milk.
The applied output is for nutrition researchers interested in milk protein allergy and/or digestibility in humans, and for the dairy industry.
The atlas may be used to identify potential biochemical properties of proteins/peptides in bovine milk, or to isolate proteins of interest.
This atlas provides information about the molecular signatures of metabolic adaptations that occur throughout lactation.
This computational approach to obtain an atlas of milk proteins using publicly available data may be an elegant alternative or complement to animal experiments.

Table 1 The list of the 20 publications on cow milk proteome used to build the atlas.

ID  Authors  Title  Year  Journal  Volume  Pages  URL

Experimental design, materials, and methods
Using a computational approach, we gathered and mined public data in the context of lactation, a complex and dynamic physiological process. The data collection was driven by free access to large lists of proteins. These lists were organized according to the physiological states of the cows and the milk fractions in which the proteins were detected.
A computational workflow (Fig. 1) was used to aggregate data from 20 publications reporting the cow milk proteome and to produce an atlas of 4654 unique proteins. The atlas was categorized into lists according to four milk fractions: skimmed milk, whey, MFGM and exosomes; and five lactation stages: colostrum period, early lactation, peak of lactation, mid-lactation and drying-off. These protein lists were compared across milk fractions and lactation stages and annotated with GO terms to identify pathways and molecular signatures of each milk fraction and lactation stage. Consequently, the present data article reports an atlas of 4654 unique proteins distributed according to i) four milk fractions, ii) five lactation stages, and iii) GO term enrichment.
The data presented here were generated as part of an accompanying publication on the identification of putative biomarkers of negative energy balance in dairy cows by using milk proteome from in silico data aggregation [1].

a) Publications database
First, we collected publications on the bovine milk proteome, regardless of the effect studied by the authors. The thesaurus used to collect the targeted publications combined three terms and their related words: "bovine" with (bovine OR cows OR cattle), "milk" with (*milk OR dairy) and "proteome" with (proteom* OR protein* OR secretom* OR exosom* OR secreted OR biomarker* OR vesicle* NOT (content OR concentration OR production)). This thesaurus was applied to the title field of the PubMed.gov (NCBI) and Web of Science Core Collection (Clarivate Analytics) search engines, covering publications until February 2018. The 87 resulting publications were curated based on the availability of sufficient information, such as access to supplementary data, precise days in milk (DIM) and the health status of the cows, in order to retrieve and annotate proteins. Twenty publications (Table 1) were selected for the extraction of protein lists. A flowchart (Fig. 1) reports the workflow for the construction and analysis of the computational milk proteome atlas.
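As an illustration, the thesaurus above can be assembled programmatically into a single title query. This is a minimal sketch: the text does not state which operator joined the three term groups, so the use of AND is an assumption, and the function name `build_title_query` is hypothetical. The NOT clause is kept inside the "proteome" group, as written above.

```python
# Term groups copied from the thesaurus described in the text.
BOVINE = ["bovine", "cows", "cattle"]
MILK = ["*milk", "dairy"]
PROTEOME = ["proteom*", "protein*", "secretom*", "exosom*",
            "secreted", "biomarker*", "vesicle*"]
EXCLUDED = ["content", "concentration", "production"]


def build_title_query(bovine, milk, proteome, excluded):
    """Join the three term groups into one boolean query string.

    Assumption: the groups are combined with AND (not stated in the text).
    """
    bovine_part = "(" + " OR ".join(bovine) + ")"
    milk_part = "(" + " OR ".join(milk) + ")"
    # The exclusion list sits inside the proteome group, as in the text.
    proteome_part = ("(" + " OR ".join(proteome)
                     + " NOT (" + " OR ".join(excluded) + "))")
    return " AND ".join([bovine_part, milk_part, proteome_part])


query = build_title_query(BOVINE, MILK, PROTEOME, EXCLUDED)
print(query)
```

The resulting string would still need to be adapted to the syntax of each search engine (for instance, how a title-only restriction or a leading wildcard is expressed), which is not covered here.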

c) ID conversion into Gene name
Protein IDs were standardized and converted into the corresponding GN, used as a species-independent unique identifier, with three tools: the Retrieve/ID Mapping tool of the UniProt database [2], the Protein Identifier Cross-Reference service [3] and/or the ProteCONVERT tool of the ProteINSIDE web interface [4].
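The effect of this conversion step can be sketched with a toy lookup table. The code below does not call UniProt, PICR or ProteINSIDE; the mapping, the placeholder accessions (`ACC0001`, ...) and the function name `to_gene_names` are hypothetical. It only illustrates how several protein IDs, including isoform-style accessions with a "-2" suffix, collapse onto a single GN.

```python
def to_gene_names(protein_ids, id_to_gn):
    """Convert protein IDs to gene names using a lookup table.

    Returns (converted, unmapped). Isoform suffixes such as "-2" are
    dropped before lookup, so isoforms collapse onto the same gene name.
    Upper-casing the gene names is an assumption of this sketch.
    """
    converted, unmapped = [], []
    for pid in protein_ids:
        key = pid.strip().split("-")[0].upper()
        gene_name = id_to_gn.get(key)
        if gene_name is None:
            unmapped.append(pid)
        else:
            converted.append(gene_name.upper())
    return converted, unmapped


# Hypothetical mapping with placeholder accessions and gene names.
toy_mapping = {"ACC0001": "GN1", "ACC0002": "GN2"}
toy_ids = ["ACC0001", "ACC0001-2", "acc0002", "ACC9999"]

converted, unmapped = to_gene_names(toy_ids, toy_mapping)
unique_gn = sorted(set(converted))
```

In this toy run, the canonical accession and its isoform both map to `GN1`, so the isoform-level information is no longer visible once the list is reduced to unique GN.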

d) Class definitions
We sorted the data according to four milk fractions and five lactation stages of dairy cows, based on the protein extraction methods and DIM stated in each paper. The milk fractions were: i) skimmed milk, aggregating proteins isolated by centrifugation below 100,000 g, with or without casein depletion by acidification; ii) whey, aggregating proteins isolated by centrifugation above 100,000 g; iii) MFGM, aggregating proteins isolated from milk cream, and iv) exosomes, aggregating proteins isolated from skimmed milk by a protocol based on a sucrose gradient [5]. The lactation stages were: i) colostrum period, aggregating proteins from colostrum collected during the first 5 days post-partum; ii) early lactation, aggregating proteins from milk collected between 6 and 21 DIM; iii) peak lactation, aggregating proteins from milk collected between 22 and 80 DIM; iv) mid-lactation, aggregating proteins from milk collected after 81 DIM, and v) drying-off, aggregating proteins from milk collected at 3 [...]
[...] publications (Table 2): among them, 14 datasets focused on skimmed milk, 13 on whey, 4 on MFGM and 4 on exosomes.
The number of datasets available in the literature decreased as the complexity of milk fractionation increased: only four datasets were available for the exosome fraction, whereas 14 were available for the skimmed milk fraction. Across lactation stages, regardless of the milk fraction, 12 datasets were dedicated to colostrum proteins, 8 to early lactation, 6 to peak lactation, 8 to mid-lactation and 2 to the drying-off period. The 35 datasets from the 20 publications referred to experiments carried out with different cow breeds and in various countries (Fig. 2).
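The DIM-based class definitions above can be written as a small lookup function. This is a sketch under stated assumptions: boundaries are treated as inclusive whole days, "after 81 DIM" is read as 81 DIM and later, and drying-off is not assigned from DIM because its definition is incomplete in the text. The function name `lactation_stage` is hypothetical.

```python
def lactation_stage(dim):
    """Map days in milk (DIM) to a lactation stage.

    Boundaries follow the class definitions in the text. Drying-off is
    deliberately not handled here: its DIM criterion is not recoverable
    from the text, so it would have to be labelled from each paper.
    """
    if dim < 0:
        raise ValueError("DIM must be non-negative")
    if dim <= 5:
        return "colostrum period"
    if dim <= 21:
        return "early lactation"
    if dim <= 80:
        return "peak lactation"
    return "mid-lactation"  # assumption: 81 DIM and later


# Dataset counts per milk fraction, as reported in the text.
DATASETS_PER_FRACTION = {"skimmed milk": 14, "whey": 13, "MFGM": 4, "exosomes": 4}
total_datasets = sum(DATASETS_PER_FRACTION.values())
```

The per-fraction counts add up to the 35 datasets mentioned in the text, which gives a quick consistency check on the classification.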

e) Dataset aggregation
The full lists of GN coming from the 35 datasets were aggregated in an atlas of 8841 GN. Redundancies were discarded in each class providing 7135 GN useful for the milk fractions comparison and 6323 GN for the lactation stages comparison. After aggregation and discarding redundancies, an atlas of 4654 unique milk proteins was produced.
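The three counts reported above correspond to three levels of de-duplication: none, within each class, and across the whole atlas. The sketch below reproduces that logic on toy data; the placeholder gene names (`GN1`, ...) and the function name `aggregate` are hypothetical, and this is not the authors' actual pipeline.

```python
def aggregate(datasets):
    """Aggregate (class_label, gene_names) datasets at three levels.

    Returns:
      raw_total       - number of GN entries before any de-duplication
      per_class_total - sum of unique GN counted within each class
      atlas           - set of GN that are unique across all classes
    """
    raw_total = sum(len(gene_names) for _, gene_names in datasets)
    per_class = {}
    for label, gene_names in datasets:
        per_class.setdefault(label, set()).update(gene_names)
    per_class_total = sum(len(gn_set) for gn_set in per_class.values())
    atlas = set().union(*per_class.values())
    return raw_total, per_class_total, atlas


# Toy datasets: two whey datasets and one skimmed milk dataset.
toy_datasets = [
    ("skimmed milk", ["GN1", "GN2"]),
    ("whey", ["GN2", "GN3"]),
    ("whey", ["GN3", "GN4"]),
]
raw_total, per_class_total, atlas = aggregate(toy_datasets)
```

In the toy example, 6 raw entries reduce to 5 once redundancies are removed within each class, and to 4 unique GN overall, mirroring the successive reductions described in the text.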

f) Comparison of classes
Venn diagrams (Draw Venn Diagram tool from VIB/UGent) were used to identify GN specific to one class or shared by several milk fractions or lactation stages. The comparison of GN lists according to milk fractions identified 95 GN common to all four milk fractions, whereas 93, 488, 15 and 3139 GN were unique to the skimmed milk, whey, MFGM and exosome fractions, respectively. Forty-four GN were identified in both skimmed milk and whey, 2 in MFGM and skimmed milk, 95 in exosomes and skimmed milk, 7 in whey and MFGM, 407 in exosomes and whey, and 65 in MFGM and exosomes. Fourteen GN were identified in MFGM, skimmed milk and whey; 142 in exosomes, skimmed milk and whey; 11 in exosomes, MFGM and skimmed milk; and 37 in exosomes, MFGM and whey. The lists of GN in milk fractions are reported in the "milk_fraction_4654_proteins.tab" file.
The comparison of GN lists according to lactation stages highlighted 105 GN present in all the lactation stages, whereas 3288, 59, 185 and 155 GN were unique to the colostrum period, early lactation, peak lactation and mid-lactation, respectively. One hundred ninety-seven GN were identified in both colostrum and early lactation milk, and 252 in both colostrum and peak lactation milk. Seventy-eight GN were identified in colostrum and in milk from early lactation, peak lactation and mid-lactation; 14 in colostrum and milk from early lactation, peak lactation and drying-off; 65 in colostrum and milk from early lactation and peak lactation; and 9 in colostrum and milk from early lactation and drying-off. The lists of GN by lactation stage are reported in the "lactation_stage_4654_proteins.tab" file. The Venn diagrams compiling the protein lists and the biological mining of the protein categorization are reported elsewhere [1].
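The milk-fraction comparison above can be cross-checked arithmetically: the 15 exclusive Venn regions of four sets should add up to the atlas size. The sketch below encodes the counts reported in the text and also shows one way to compute exclusive regions from sets. The function name `exclusive_regions` is hypothetical, and this is not the Draw Venn Diagram tool itself.

```python
from collections import Counter


def exclusive_regions(sets_by_class):
    """Count items per exclusive Venn region.

    Each item is assigned to the exact combination of classes that
    contain it, so every item is counted once.
    """
    membership = {}
    for label, items in sets_by_class.items():
        for item in items:
            membership.setdefault(item, set()).add(label)
    return Counter(frozenset(labels) for labels in membership.values())


S, W, M, E = "skimmed milk", "whey", "MFGM", "exosomes"

# Exclusive region counts for milk fractions, as reported in the text.
REPORTED_FRACTION_REGIONS = {
    frozenset({S, W, M, E}): 95,
    frozenset({S}): 93, frozenset({W}): 488,
    frozenset({M}): 15, frozenset({E}): 3139,
    frozenset({S, W}): 44, frozenset({M, S}): 2, frozenset({E, S}): 95,
    frozenset({W, M}): 7, frozenset({E, W}): 407, frozenset({M, E}): 65,
    frozenset({M, S, W}): 14, frozenset({E, S, W}): 142,
    frozenset({E, M, S}): 11, frozenset({E, M, W}): 37,
}
atlas_size = sum(REPORTED_FRACTION_REGIONS.values())

# Toy usage of exclusive_regions on three small sets.
toy_regions = exclusive_regions({"a": {1, 2, 3}, "b": {2, 3}, "c": {3}})
```

The reported regions sum to 4654, the number of unique proteins in the atlas. The same check cannot be completed for lactation stages from the text alone, because only a subset of those regions is listed.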

g) Code availability
The code used for protein designation was the GN. The last conversion from ID to GN was performed in February 2018 using the tools described in the Methods section. The PDF extractor tool was Tabula (www.tabula.technology, last update February 11, 2017). The version of the ProteINSIDE workflow used was 1.2 (last update November 17, 2016).

Data validation and quality control
Our search was based on a thesaurus designed to target the bovine milk proteome and was submitted to two search engines for scientific publications (PubMed and Web of Science). From the resulting 87 publications, we applied exclusion criteria such as the absence of access to protein IDs, species name, health status, or DIM of the lactating cows, and selected only 20 publications for the atlas construction. These 20 publications referred to the milk proteome of healthy cows characterized for breed, DIM and milk fraction. The main objective of this computational data aggregation is to obtain an overview of milk proteins independently of breed, age and country (Fig. 2), and regardless of the methodologies of protein isolation and identification. Among those methodologies, iTRAQ labelling [6–11], LCxLC-MS/MS detection [5,12] and 15 repetitions of nanoLC-MS/MS runs [13] allowed the detection of thousands of proteins. In order to verify the reliability of the atlas for mining pathways and biomarkers of the lactation process, we combined Venn diagrams, to compare lists, with GO annotations obtained using the ProteINSIDE web service [14]. ProteINSIDE was previously benchmarked, and the reliability and accuracy of its GO annotations for ruminant species have been published [14]. The lists of mined proteins were enriched for GO terms related to the lactation process and were composed of the major expected milk proteins, as reported in Ref. [1], thus validating the atlas. The protein diversity may arise from the thousands of proteins identified in exosomes, which are expected to derive from various cell types and to be found in milk [5], and thus lay the molecular basis of the lactation process.
However, recent publications demonstrate the beneficial health effects of minor milk components such as MFGM [15], which have bioactive properties [9], the potential to be markers of the technological and sensory qualities of milk [16], and a role in the dynamics of digestion of human and bovine milk proteins for the improvement of infant formula [17]. During the building and analysis of our atlas, we found that the proteins from the skimmed milk and whey fractions were mostly different.
The content and proportions of protein fractions have notable effects on the nutritional value and technological properties of milk [18]. This atlas allows the identification of fractions containing specific proteins that may be of interest for research and industry. This knowledge is useful for scientists working on the isolation of protein fractions in milk and on dairy processing. Of particular interest are the proteins from colostrum, which partly reflect the physiological state of dairy cows during the first 5 days of the post-partum period. Datasets on whey, widely studied in the literature, covered the entire lactation period, therefore allowing comparisons among lactation stages. The colostrum period provides thousands of proteins compared to hundreds in later lactation stages, which represents a limitation of our atlas. This observation encourages proteomic efforts on milk from late lactation stages.
Some limitations should be noted. The first limitation of the computational approach was the use of only part of the 87 relevant publications on cows. Sixty-seven publications were unusable, either because they concerned cows with mastitis or because they lacked information on the sample collection period, the methods of milk fractionation, the animal characteristics (phenotype, feeding, husbandry conditions …), or the protein IDs. The second limitation is the conversion of protein IDs into GN, which led to the loss of some data, such as protein isoforms. The number of proteins identified according to fractions (Table 2) is strongly imbalanced due to the diversity of the proteomic methods used. Indeed, thousands of proteins were identified for exosomes by 15 successive LC-MS/MS analyses, compared to only hundreds of proteins for the other fractions, determined by either gel-based or gel-free nanoLC-MS/MS proteomics. The atlas aggregates proteins that were identified in milk, whatever their abundance. Lastly, due to the nature of the protein identification algorithms, false positives may be present in the datasets because not all datasets are equally filtered.

Re-use potential
The atlas enhances our knowledge of the diversity of cow milk proteins. The major benefit of making this atlas available is to provide information of interest to the scientific community. The applied output is for nutrition researchers interested in milk protein allergy and/or digestibility in humans, and for industry professionals working on milk processing.
The atlas may be used to identify potential biochemical properties of proteins/peptides in bovine milk, or to isolate proteins of interest. For example, according to our atlas, carbonic anhydrase 6 (CA6), an essential factor for the development of the gastrointestinal tract of the human newborn, is present exclusively in the whey fraction of cow milk. This atlas provides information about the molecular signatures of metabolic adaptations that occur throughout lactation. As part of our research on dairy ruminants, the protein lists and the molecular signatures of the lactation stages provide information about secreted proteins and physiological adaptations of cows, which are a prerequisite for the identification of molecular biomarkers and the understanding of dairy cows' adaptations to husbandry conditions.
Finally, this computational approach using publicly available data is an elegant alternative to animal experiments for obtaining an atlas of milk proteins, without conducting new experiments on ruminants,