CSF-PR 2.0: An Interactive Literature Guide to Quantitative Cerebrospinal Fluid Mass Spectrometry Data from Neurodegenerative Disorders

The rapidly growing number of biomedical studies supported by mass spectrometry based quantitative proteomics data has made it increasingly difficult to obtain an overview of the current status of the research field. A better way of organizing the biomedical proteomics information from these studies and making it available to the research community is therefore called for. In the presented work, we have investigated scientific publications describing the analysis of the cerebrospinal fluid proteome in relation to multiple sclerosis, Parkinson's disease and Alzheimer's disease. Based on a detailed set of filtering criteria we extracted 85 data sets containing quantitative information for close to 2000 proteins. This information was made available in CSF-PR 2.0 (http://probe.uib.no/csf-pr-2.0), which includes novel approaches for filtering, visualizing and comparing quantitative proteomics information in an interactive and user-friendly environment. CSF-PR 2.0 will be an invaluable resource for anyone interested in quantitative proteomics on cerebrospinal fluid.

Given the challenges highlighted above, and the lack of suitable systems to integrate results from the various studies, it is difficult to see trends and correlations in the field of CSF biomarker proteomics. The ideal approach would be to conduct full proteome quantification for large numbers of patients across all the disease stages and categories to find the most appropriate biomarkers. This is, however, currently not feasible. An alternative approach is to gather and unify the results from relevant studies and, in this way, build a more comprehensive understanding of the effects the different diseases, including stages and subcategories, have on the CSF proteome. This was exemplified in our recent review presenting ten promising candidates for subcategories of MS (14).
Although such reviews, and shared lists of biomarker candidates, are of great value to the field, they are static and leave limited opportunities for researchers to interact with the data (26), or to inspect multiple neurological disorders simultaneously. In the current work, we have gathered and organized recent mass spectrometry based proteomics data from CSF biomarker discovery and verification studies for multiple sclerosis, Parkinson's disease, and Alzheimer's disease in the freely available and interactive CSF-PR 2.0 (http://probe.uib.no/csf-pr-2.0). This enables users to search, browse and compare quantitative mass spectrometry data related to the CSF proteome (Fig. 1), thus serving as a much-needed interactive resource for quantitative CSF data.

MATERIALS AND METHODS
Literature Search and Inclusion Criteria-To find relevant CSF proteomics literature, searches in PubMed (http://www.ncbi.nlm.nih.gov/pubmed) and Web of Science (https://webofknowledge.com) were conducted in January 2016, finding publications published since 2010 containing the key words "multiple sclerosis," "Parkinson's disease," or "Alzheimer's disease," in combination with "AND cerebrospinal fluid AND proteomics." This resulted in more than 300 scientific publications, based on 98 MS-studies, 155 AD-studies, and 48 PD-studies.
When investigating the identified literature, we found that most of the publications were based on either ELISA or other affinity based studies for one or a few proteins, variants of 2D-DIGE, or other types of gel based quantification in combination with MALDI or SELDI. The number of patient samples included in many of the studies was also low, and in some cases, all the patient samples within one disease group were pooled before the analysis.
It should be noted that although the pooling of samples can sometimes be a necessity, e.g. because of low sample amount and/or a desire for extensive fractionation, pooling will completely remove the possibility to look at individual differences. However, ensuring enough patients per pool and several pools per condition are common strategies to reduce the problem.
Because of the above mentioned observations, we decided to mainly focus our resource on liquid chromatography mass spectrometry (LC-MS) data, and to apply strict criteria for which studies to include, thus making it easier to compare the gathered data and improve the reproducibility of the data in the resource.
For experimental data to be included in our database, the publication had to contain quantitative mass spectrometry generated proteomics data from either a bottom-up shotgun experiment or a targeted experiment. The experiment had to fulfill the following criteria: (1) the CSF samples were collected from living humans; (2) the experiment included ≥ 20 patients; (3) each disease group included ≥ 5 patients; (4) if containing pooled samples, each disease group included ≥ 3 pools; and (5) the publication contained ≥ 5 quantified proteins. Finally, the experimental workflow and the proteomics results had to be presented in such a way that extracting the quantitative data for each protein (and/or peptide) was possible. When the publication also contained ELISA verification of protein abundances in CSF or serum/plasma, this information was extracted and included as well.
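The five criteria above can be read as a simple filter applied to each experiment. The following is a minimal Python sketch of that reading, not code used to build CSF-PR 2.0. The `Experiment` class and its field names are hypothetical, and criterion (2) is assumed to count the patients across all groups of the experiment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Experiment:
    """Hypothetical summary of one published experiment (illustrative fields)."""
    living_human_csf: bool          # CSF collected from living humans
    group_sizes: List[int]          # patients per disease/control group
    quantified_proteins: int        # number of quantified proteins reported
    pools_per_group: List[int] = field(default_factory=list)  # empty if not pooled


def passes_inclusion_criteria(exp: Experiment) -> bool:
    """Apply criteria (1) to (5) to a single experiment."""
    if not exp.living_human_csf:                      # (1)
        return False
    if sum(exp.group_sizes) < 20:                     # (2) assumed: total over groups
        return False
    if any(n < 5 for n in exp.group_sizes):           # (3)
        return False
    if any(p < 3 for p in exp.pools_per_group):       # (4) only checked when pooled
        return False
    return exp.quantified_proteins >= 5               # (5)
```

Because the selection criteria were applied to each experiment individually, such a function would be called once per experiment rather than once per publication.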
It should be noted that many publications included more than one experiment, and that the selection criteria were applied individually for each experiment. We have defined the term data set in CSF-PR 2.0 as a quantitative proteomics comparison between two disease groups, or between a disease group and a control group. For example, if a publication describes an experiment where three different patient groups were all compared with each other, this results in a total of three data sets. Any variation in the data generation or processing, such as using different sample matching strategies or methodological approaches, resulted in a new data set. This explains the significantly higher number of data sets (85) compared with publications (17). For all the data sets passing the filters, the experimental and quantitative information were extracted and stored in a common format for inclusion in the CSF-PR 2.0 database.
Extracting Published Quantitative Information-Because of the lack of standardization in the presentation of both experimental procedures and patient information in journals, the amount of information available for each data set is not uniform. We have aimed to present the data as close to how it was originally published as possible, e.g. not recalculating relative protein expression values (fold changes, ratios) and p values. For the graphical displays, we used the definition or value given in each article to categorize a protein's abundance as increased, decreased, or equal between two patient groups, and considered reported p value thresholds, but not expression value thresholds for significance. We also show the expression values (log2 transformed by us if not done by the authors) if provided in the publication.
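The log2 transformation and the use of p value thresholds described above can be illustrated with a short Python sketch. This is one possible reading under stated assumptions, not the authors' implementation: the function names are hypothetical, a change is assumed to be significant only when the p value lies below the threshold reported in the publication, and non-significant changes are assumed to be shown as equal. Where an article states the category directly, that definition would take precedence over any such rule.

```python
import math


def to_log2(value: float, already_log2: bool = False) -> float:
    """Return a fold change on the log2 scale.

    Values already reported as log2 by the authors are passed through unchanged.
    """
    if already_log2:
        return value
    if value <= 0:
        raise ValueError("a ratio must be positive to be log2 transformed")
    return math.log2(value)


def trend(log2_fc: float, p_value: float, p_threshold: float) -> str:
    """Categorize a protein using only the reported p value threshold.

    Assumption: no fold change cutoff is applied, mirroring the text's
    statement that expression value thresholds were not considered.
    """
    if p_value >= p_threshold or log2_fc == 0:
        return "Equal"
    return "Increased" if log2_fc > 0 else "Decreased"
```

For instance, a reported ratio of 0.25 with a p value of 0.01 and a threshold of 0.05 would be shown with a log2 value of -2 and the category Decreased.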
Unifying the various data sets in order to make comparisons across data sets can be difficult because of the many variants used for the disease group definitions. By default, the variants of healthy controls were therefore grouped together; the user can however easily restore the original group names from the publications, alter the groups or rename the categories.
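The default grouping of control variants, with the option to restore the published names, can be pictured as a lookup table. The Python sketch below is purely illustrative: the group labels in `DEFAULT_MERGES` are hypothetical placeholders (the text does not list the actual variants), and `display_name` is not a function of CSF-PR 2.0.

```python
from typing import Dict, Optional

# Hypothetical variants of healthy controls mapped to one unified label.
DEFAULT_MERGES: Dict[str, str] = {
    "Healthy controls": "Healthy",
    "Healthy donors": "Healthy",
    "Healthy volunteers": "Healthy",
}


def display_name(original: str,
                 merges: Optional[Dict[str, str]] = None,
                 use_original: bool = False) -> str:
    """Return the group name to display for a published group name.

    use_original=True restores the name used in the publication;
    a custom `merges` table models user-defined regrouping or renaming.
    """
    if use_original:
        return original
    table = DEFAULT_MERGES if merges is None else merges
    # Names without an entry in the table are shown as published.
    return table.get(original, original)
```

Keeping the published name as the stored value and applying the mapping only at display time is what makes the default grouping reversible.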
Web Server Implementation Details-CSF-PR 2.0 has been developed using the Vaadin 7.6 framework. In addition, JFreeChart, JavaScript and HTML5 are used for managing the client side interactive visualizations. The server side is implemented using Java EE and the TableExport from the Vaadin library is used to enable users to export the tables in the XLS format. All the application charts and data visualizations are developed based on the DiVA concept (44), and the software is compatible with all modern web browsers as long as the most up to date versions are used.

RESULTS
The goal of this work was to create an interactive and user-friendly database containing the latest mass spectrometry based quantitative CSF proteomics data sets published related to MS, PD, and AD biomarkers. More than 300 publications were investigated and 17 publications containing 85 data sets passed our stringent filtering criteria. From these data sets the quantitative information for 1956 proteins with 3068 corresponding peptides was extracted and made available via CSF-PR 2.0 (http://probe.uib.no/csf-pr-2.0).
A total of 38 disease category comparisons are available in CSF-PR 2.0 with the majority of the data being related to MS, and much less data for PD and AD (indeed there are no shotgun data sets present for AD and PD). Although most studies provide data for proteins that are increased, decreased, or have an equal abundance between the compared groups, some studies only report the proteins changed in either direction. An overview of the properties of the data sets included in CSF-PR 2.0 is shown in Fig. 2.
CSF-PR 2.0 overview-Fig. 1 shows an overview of how CSF-PR 2.0 can be used to interact with quantitative pro-
Interactive Quantitative Proteomics Data-In order to inspect the quantitative information the user starts by selecting the disease category of interest, upon which the available disease subgroup comparisons will be displayed as an interactive disease group comparison table in the next tab, where each cell represents the number of data sets available for the given comparison (Fig. 3). To show the underlying protein data, click the corresponding cell in the table. Upon selecting one or more cells, an overview bubble plot is displayed in the next tab (Fig. 4A). This indicates the number of proteins (as represented by the bubble size) found as Increased, Equal, or Decreased in the currently selected group comparisons. By clicking the protein group of interest, a protein table is shown in the next tab containing the corresponding quantified proteins and their abundance trends across the selected comparisons (Fig. 4B). After selecting a protein in the table the final tab is activated displaying further details about the selected protein and its peptides across the selected comparisons.
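The bubble sizes described above amount to counting proteins per trend within each selected comparison. The following Python sketch shows that counting step under simplifying assumptions: each record is taken to be the overall trend of one protein in one comparison, and the record layout and the function name `bubble_sizes` are hypothetical rather than taken from the CSF-PR 2.0 code base.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

# (disease group comparison, protein accession, trend)
Record = Tuple[str, str, str]


def bubble_sizes(records: Iterable[Record], selected: Set[str]) -> Counter:
    """Count proteins per (comparison, trend) for the selected comparisons."""
    counts: Counter = Counter()
    # set() drops exact duplicate records so a protein is counted once
    # per comparison and trend.
    for comparison, _accession, protein_trend in set(records):
        if comparison in selected:
            counts[(comparison, protein_trend)] += 1
    return counts


# Usage with placeholder accessions (not real identifiers).
example = [
    ("MS / Healthy", "ACC1", "Increased"),
    ("MS / Healthy", "ACC2", "Increased"),
    ("MS / Healthy", "ACC3", "Decreased"),
    ("AD / Aged controls", "ACC1", "Equal"),
]
sizes = bubble_sizes(example, {"MS / Healthy"})
```

Each key of the returned counter corresponds to one bubble, and its value to the bubble size.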
Example Use Case 1: Searching For a Specific Protein of Interest-The Search feature enables the user to search for one or more proteins or peptides, using the protein name(s), the accession number(s) or the peptide sequence(s). As an example, we have chosen to search for the protein chitinase-3-like protein 1, a protein studied by several groups in relation to MS (28-30, 33, 37, 45).
The quantitative search results are displayed and interacted with in the same way as described in the previous section. In the protein table the overall group regulation of the chitinase-3-like protein 1 is displayed (Fig. 5), showing that chitinase-3-like protein 1 is increased in MS, AD, and Lewy body dementia (LBD) compared with noninflammatory controls. It is also increased in relapsing-remitting MS (RRMS) compared with CIS-MS (CIS, clinically isolated syndrome) and in baseline RRMS compared with Natalizumab-treated RRMS (RRMS Nataliz.), but decreased in RRMS versus progressive MS (PMS). There is however no change between MS and inflammatory controls. This indicates that chitinase-3-like protein 1 could be a prognostic marker for MS, although not specific for MS, and that it could be used to monitor disease status after treatment (with Natalizumab), which has also been emphasized in the included studies from which the data was extracted (30,33,37).
The data set specific group regulation for chitinase-3-like protein 1 can be inspected by selecting the protein and moving to the protein details tab. This reveals that some of the protein comparison values are based on single data sets (Fig. 5). Additional protein details, such as the analytical method used, the actual fold change, and the p value (if provided), in addition to data set specific details such as the type of study and the patient group information are also available in this tab.
FIG. 2. The available data types in CSF-PR shown through the filter data option. The numbers in the charts indicate the number of data sets the given filter will apply to. The feature allows multiple selections and adapts as filters are applied.

Example Use Case 2: Revealing Potential Prognostic/Disease Status Markers-Potentially interesting proteins for specific disease groups can be revealed by sorting the protein information table based on a specific column.
As an example, we sorted the table with all the available data on Alzheimer's Disease/Aged controls and selected osteopontin as one of the proteins increased in Alzheimer's disease compared with aged controls (Fig. 6A). It becomes clear that osteopontin is found increased in AD compared with most of the available control groups. Upon further inspection, it is also increased in LBD, and increased in secondary-progressive MS (SPMS) when comparing before and after treatment with Lamotrigine. Furthermore, it appears to be regulated in MS and PD when compared with certain controls indicating that the protein would likely not serve as a good diagnostic marker for any of the three diseases. However, it could have potential as a disease status marker for MS after treatment (with Lamotrigine). Osteopontin is a regulator of inflammation and immunity and has indeed been suggested to have a role in immune mediated diseases (46), but it is also implicated in fibrosis, tumorigenesis, and cancer metastasis (47).
After a potential protein marker is located, one can inspect the provided protein coverage across the individual studies in order to locate promising peptides for targeted follow-up studies, see Fig. 6B.
Example Use Case 3: Comparing Against Your Own Data-The Compare feature enables users to compare their own quantitative protein data with the information in the resource. This feature is very useful as it puts the researcher's own findings directly in context with the relevant literature, and helps to visually highlight similarities and differences. The comparison is done by inserting the accession numbers of the query proteins found to be Increased, Equal and Decreased in user-selected disease group comparisons, i.e. patient categories. As exemplified using the included example data, user-provided data, appearing as a horizontal line across the charts, can easily be compared with the resource data, as indicated in Fig. 7.
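The logic behind such a comparison can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Compare feature itself: the function names are hypothetical, the resource is reduced to one overall trend per accession for a single disease group comparison, and the accessions are placeholders.

```python
from typing import Dict, Optional, Set, Tuple

Comparison = Dict[str, Tuple[str, Optional[str]]]


def compare_with_resource(user: Dict[str, Set[str]],
                          resource: Dict[str, str]) -> Comparison:
    """Pair each user protein with the trend stored in the resource.

    user:     trend ("Increased", "Equal", "Decreased") -> accessions
    resource: accession -> overall trend for one comparison
    Proteins absent from the resource are paired with None.
    """
    result: Comparison = {}
    for user_trend, accessions in user.items():
        for accession in accessions:
            result[accession] = (user_trend, resource.get(accession))
    return result


def agreeing(comparison: Comparison) -> Set[str]:
    """Accessions whose user trend matches the resource trend."""
    return {acc for acc, (mine, theirs) in comparison.items() if mine == theirs}


# Usage with placeholder accessions (not real identifiers).
user_input = {"Increased": {"ACC1", "ACC2"}, "Equal": set(), "Decreased": {"ACC3"}}
resource_trends = {"ACC1": "Increased", "ACC2": "Decreased"}
paired = compare_with_resource(user_input, resource_trends)
```

In this example `ACC1` agrees with the resource, `ACC2` conflicts with it, and `ACC3` has no counterpart, which are the three situations a visual comparison needs to distinguish.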

DISCUSSION
When constructing CSF-PR 2.0 we had to overcome the challenge of extracting, unifying and organizing data presented in the different publications, in order to allow for the data to be comparable. The first set of challenges was related to the various ways in which researchers report quantitative data, e.g. in Excel sheets, PDF files, plots or figures, making it a tedious task to understand and extract the relevant data. Next, the level of details regarding information about methodology, statistics, abundance changes etc., varied significantly across the papers. To fully comprehend the statistics applied in the various experiments, i.e. how the reported values for fold changes were calculated and which data they were based on, was particularly time consuming. In some cases, it was also difficult to link the values presented in the paper to the data provided in the corresponding supplementary files. In order to enable the comparison of data from different studies it is therefore vital that results are accessible and easily extractable, and that sufficient information is provided to be able to categorize the studies correctly (48,49).

FIG. 4. A, The distribution of the quantitative data currently in CSF-PR 2.0. The bubbles represent proteins that are reported with increased (red), equal (blue) or decreased (green) abundance in Patient Subgroup A (numerator) compared with Patient Subgroup B (denominator). The size of each bubble represents the number of proteins in the given category. B, Table of the proteins quantified in at least one of the user selected comparisons (comparisons not specified in this figure), with color coded symbols indicating if the protein was on average found to be increased (red triangle), equal (blue square) or decreased (green triangle) across the data sets, or not reported (gray square).

In our opinion, the existing standard formats of reporting quantitative proteomics data (50,51) ought to be much more frequently used in quantitative biomarker studies, which are lagging behind in this respect. This is crucial in order to make it feasible to organize such data into common databases, as exemplified in this work, which would make the data more easily available and thus enable more comprehensive systems biology approaches.
Another major issue to be dealt with was the multitude of disease group definitions used across different publications. Consensus guidelines are available (52,53), but there is still a large variety in the names and definitions used when describing disease and control groups in CSF biomarker studies. This represents a problem when attempting to group and compare data across studies, especially if limited information is provided regarding the individual patients in the groups. We therefore encourage researchers to, when applicable, adhere to the current consensus guidelines in terms of disease group definitions, and to also supply sufficient clinical data for all patients included in a published study.
We also consider it urgent that more peptide level information is made available from shotgun data sets, allowing for a more detailed comparison of the data. This will make it possible to address concerns related to protein inference (54) and the quantification of nonunique peptides. A peptide centric, compared with a protein centric approach, often provides important and complementary information, especially when investigating post-translational modifications.
A final challenge was how to best visualize the quantitative results from the various experiments. This has been discussed thoroughly in a recent review by Oveland et al. (55), and had to be optimized for the specific use cases of CSF-PR 2.0. Our main goal was to simplify the data presentation as much as possible, in order to make the usage intuitive and straightforward without hiding important details that could be interesting for certain users.
In our opinion CSF-PR 2.0 has brought a new dimension to the sharing and accessibility of CSF proteomics data, something that was clearly missing in the field. The resource can easily be extended with new data on CSF proteomics in the future, further increasing the strength and relevance of CSF-PR. The framework also allows for extensions to include other body fluids and tissues in future versions.

FIG. 5. Overview of the quantification values for chitinase-3-like protein 1 across the relevant disease group comparisons contained in CSF-PR. The gray dotted line indicates the overall trend. Data set specific details are displayed as individual blue, green and red symbols. Green triangles pointing downwards indicate an average decreased abundance in the numerator disease group compared with the denominator disease group, whereas red triangles pointing upwards indicate an average increased abundance. The further down or up, and the more intense the color is, the stronger is the evidence (i.e. more data sets and/or less conflicting results) in either direction. Blue squares indicate an overall equal abundance between the compared groups. The size of each data set symbol is related to the number of patients included in the data set.

CONCLUSION
We have created CSF-PR 2.0, an online database with an interactive and user-friendly interface for inspecting quantitative CSF proteomics data generated by mass spectrometry.
This is an important step toward making quantitative CSF proteomics data more accessible to the scientific community. It will spare researchers the time-consuming task of manually inspecting large numbers of publications and supplementary files to extract the desired information. CSF-PR 2.0 greatly increases the availability of the published data and makes it much easier to compare across disease stages or categories in order to find new correlations or trends in the existing data, and should thus become an invaluable tool for anyone working with biomarker discovery in neurodegenerative diseases.

Co-senior authors.
Disclaimer: The data included in CSF-PR was manually extracted from the publications and available supplemental files, and there may therefore be undetected errors in the resource. We encourage readers to report any errors to the authors.

FIG. 7. Protein table displaying seven randomly selected proteins compared against the example user provided data input. The horizontal red bars represent increased abundance in the user input data and the horizontal green bars represent decreased abundance. Each symbol (triangle or square) represents the found regulation direction in one disease group comparison (not specified in this figure).