Establish data infrastructure to compile and exchange environmental screening data on a European scale

Robust techniques based on liquid (LC) and gas chromatography (GC) coupled with high-resolution mass spectrometry (HR-MS) enable sensitive screening, identification, and (semi)quantification of thousands of substances in a single sample. Recent progress in computational sciences has enabled archiving and processing of HR-MS ‘big data’ at the routine level. As a result, community-based databases containing thousands of environmental pollutants are rapidly growing and large databases of substances with unique identifiers allowing for inter-comparison at the global scale have become available. A data-archiving infrastructure is proposed, allowing for retrospective screening of HR-MS data, which will help define the ‘chemical universe’ of organic substances and enable prioritisation of toxicants causing adverse environmental effects at the local, river basin, and national and European scale in support of the European water and chemicals management policy.


Challenge
Non-target screening (NTS) workflows are a powerful method for the large-scale analysis of environmental samples. They consist of wide-scope target, suspect, and non-target analysis. Recently, NTS has developed rapidly with the advance of HR-MS techniques, as reviewed elsewhere [1]. Smart monitoring combining cost-effective methods for wide-scope target and suspect screening with a battery of well-established high-throughput bioassays could be used routinely to reduce the risk of overlooking toxic chemicals in the environment [2,3].
Continental scale wide-scope target and non-target screening required for an appropriate monitoring of complex chemical contamination is rapidly developing in many monitoring laboratories, as recommended in [4]. This will provide an amount of information unprecedented so far in environmental monitoring. Currently, monitoring data are typically stored and evaluated in a closed and decentralised way using non-harmonised formats and without substantial data exchange between the scientists and agencies involved. These deficiencies hamper the recognition of newly emerging contaminants and mixtures, the prioritisation and identification of the newly recognised chemicals, and the efficient exploitation of these data for quality assessment and management on a European and even global scale. So far, the infrastructure for storage, long-term archiving, open exchange, processing and analysis of these data is largely lacking, although the required technology for 'big data' repositories is already available [1,5].
Any LC-HR-MS or GC-HR-MS technique needed for the detection of suspect and non-target chemicals generates large amounts of data, up to tens of GB per analysis. This brings environmental monitoring into the arena of 'big data'. Currently, only a fraction of the information from HR-MS measurements is extracted and the rest is discarded. The challenge is (i) to extract the minimum necessary information for a quick overview of presence/absence of a large number of suspects in the samples and (ii) to save all information from HR-MS (raw data) in a format harmonised at the European (and possibly global) level for retrospective screening of environmental samples for the currently known and future pollutants.

Open Access
*Correspondence: werner.brack@ufz.de
3 UFZ Helmholtz Centre for Environmental Research GmbH, Permoserstraße 15, 04318 Leipzig, Germany
Full list of author information is available at the end of the article

Dealing with tens of thousands of substances, their transformation products, technical mixtures, salts, isomers, etc. can cause considerable confusion unless it is coordinated. Neither the CAS No. nor the name is a sufficiently unique identifier for a compound of interest. At present, the US EPA CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard; >875,000 chemicals, [6]) is used as a reference for extracting quality-checked information. Still, many of the chemicals with high production volumes and their transformation products are not found in this or any other database.
The identification of compounds with experimentally obtained mass spectra is more reliable than exact mass matching against compound databases alone [7]. To make spectrum-based identification widely possible, community-based databases containing measured mass spectra need to grow considerably. In addition, the mass spectra of 'unknowns' frequently recorded in environmental samples should be stored for future identification, as done in prototype form in the European (NORMAN) MassBank (https://massbank.eu/MassBank/).
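The limitation of exact mass matching can be illustrated with a minimal sketch. It assumes positive-mode [M+H]+ ions, a 5 ppm tolerance, and a tiny hand-made suspect list; the function names and the tolerance are illustrative choices, not part of any NORMAN workflow. Theobromine and theophylline share the formula C7H8N4O2, so a single measured m/z matches both and only a fragment spectrum (or retention information) can tell them apart.

```python
# Monoisotopic masses (u) of the most abundant isotopes of C, H, N, O.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048, "O": 15.99491461956}
PROTON = 1.00727646688  # mass of H+ (u)

# Hypothetical mini suspect list: name -> elemental composition.
SUSPECTS = {
    "caffeine": {"C": 8, "H": 10, "N": 4, "O": 2},
    "carbamazepine": {"C": 15, "H": 12, "N": 2, "O": 1},
    "theobromine": {"C": 7, "H": 8, "N": 4, "O": 2},
    "theophylline": {"C": 7, "H": 8, "N": 4, "O": 2},
}


def neutral_mass(composition):
    """Monoisotopic mass of the neutral molecule."""
    return sum(MONOISOTOPIC[element] * count for element, count in composition.items())


def ppm_error(observed_mz, theoretical_mz):
    """Relative mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def match_suspects(observed_mz, suspects, tol_ppm=5.0):
    """Return (name, ppm error) for every suspect whose [M+H]+ lies within tol_ppm."""
    hits = []
    for name, composition in suspects.items():
        theoretical_mz = neutral_mass(composition) + PROTON
        error = ppm_error(observed_mz, theoretical_mz)
        if abs(error) <= tol_ppm:
            hits.append((name, error))
    return hits


if __name__ == "__main__":
    for mz in (195.0877, 181.0720):
        print(mz, [name for name, _ in match_suspects(mz, SUSPECTS)])
```

The first m/z yields one candidate, the second yields two isomers, which is why stored reference spectra are needed for a confident identification.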
Complex mixtures of chemicals should be considered together with their complex effects and ecosystem impacts. Technical developments now allow extensive chemical fingerprints from NTS, toxicity profiles, and omics responses to be recorded in laboratory test systems and wildlife, and environmental DNA to be analysed to address biodiversity. These developments are delivering enormous amounts of data. The challenge is to establish the infrastructure needed for data storage and the tools for multivariate biological and chemical analysis to facilitate the use of such data.

Recommendations
• Establish a federated European infrastructure storing raw non-target screening data converted into a common (open) format allowing for 'on demand' accessibility for retrospective screening
• Establish a central platform/database storing regularly updated information on available data sets Europe-wide and, eventually, at a global scale
• Establish a common European platform where the unique identifiers of newly discovered environmental pollutants can be shared in a harmonised format
• Apply commonly agreed workflow(s) for retrospective analysis to identify and prioritise pollutants frequently detected in environmental samples.

Requirements
Establishing the data infrastructure for compilation and exchange of screening data on a European scale requires:
• Recognising the need for screening data within the framework of European water policy, air and soil pollution, and waste management
• Providing incentives by the European Commission to scientists, monitoring agencies, and Member States to share the screening data
• Providing incentives by the scientific journals to scientists to share the raw screening data in a harmonised format as supplementary information to the publications using these data
• Securing European and national scale funding for establishment of the interoperable infrastructure
• Supporting the European MassBank for systematic storage of mass spectral information of environmentally relevant substances (https://massbank.eu/MassBank)
• Further harmonising wide-scope target and suspect screening techniques in Europe
• Further developing HR-MS data processing workflows.

SOLUTIONS/NORMAN database system
The NORMAN network (https://www.norman-network.net; a network of more than 80 reference laboratories, research centres and other organisations for monitoring of emerging environmental substances in Europe and North America; [8]) and the SOLUTIONS project (https://www.solutions-project.eu; [9]) have pushed the limits of NTS further using European case studies. It is now possible to screen more than 2000 target compounds and more than 40,000 suspect substances in environmental samples. An online database for wide-scope target and non-target screening data was developed as a part of the NORMAN Database System (https://www.norman-network.com/nds) and the SOLUTIONS Database System (https://www.norman-network.com/solutions/norman.php). The latter also contains a unique list of substances prioritised by modelling, whose presence in the environment is inferred not from actual occurrence measurements but from predictions based on their production volumes, use patterns, and how easily they can be released into the environment.

NORMAN suspect list exchange
A collaborative trial organised by the NORMAN network on a surface water sample from the Danube river basin revealed that suspect screening using specific lists of chemicals to find "known unknowns" was a very common and efficient way to expedite non-target screening [10]. As a result, the NORMAN Suspect List Exchange was founded (https://www.norman-network.com/nds/SLE/) and members were encouraged to submit their suspect lists. To date, more than 50 lists, varying widely in the number of substances they contain, have been uploaded. Over 40,000 substances are available in the correspondingly merged SusDat database (https://www.norman-network.com/nds/susdat). This database contains harmonised names, CAS Nos., SMILES, InChIKeys, "MS-ready structure forms" with chemical substances provided in the form observed by the mass spectrometer (e.g., desalted, as separate components of mixtures [11]), exact masses, retention indices, and modelling-based predicted ecotoxicity threshold values. A further >40,000 substances are in the pipeline. The curation was done within the network using open-access cheminformatics toolkits. Starting in 2017, the NORMAN Suspect List Exchange and US EPA CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) pooled resources in curating and uploading these lists to the Dashboard (https://comptox.epa.gov/dashboard/chemical_lists).
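How contributed lists might be merged under a unique structural identifier can be sketched as follows. This is an illustration only and not the SusDat curation procedure: the list names, the record layout, and the function `merge_suspect_lists` are assumptions, and the identifier strings should be verified against a curated source before any real use. The sketch shows why a structure-based key such as the InChIKey is preferred over names, since the same substance often arrives under different names.

```python
def merge_suspect_lists(lists):
    """Merge several suspect lists into one record per InChIKey.

    `lists` maps a (hypothetical) list name to entries of the form
    {"name": ..., "inchikey": ...}. Synonyms and source lists are collected.
    """
    merged = {}
    for list_name, entries in lists.items():
        for entry in entries:
            key = entry["inchikey"]
            record = merged.setdefault(
                key, {"inchikey": key, "names": set(), "sources": set()}
            )
            record["names"].add(entry["name"])
            record["sources"].add(list_name)
    return merged


# Two hypothetical contributed lists that overlap on one substance.
LISTS = {
    "lab_a_pharmaceuticals": [
        {"name": "caffeine", "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N"},
        {"name": "carbamazepine", "inchikey": "FFGPTBGBLSHEPO-UHFFFAOYSA-N"},
    ],
    "lab_b_wastewater_markers": [
        {"name": "1,3,7-trimethylxanthine", "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N"},
    ],
}

if __name__ == "__main__":
    for record in merge_suspect_lists(LISTS).values():
        print(record["inchikey"], sorted(record["names"]), sorted(record["sources"]))
```

Three submitted entries collapse into two substances, with both names and both source lists retained on the shared record.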

NORMAN digital sample freezing platform (DSFP)
A retrospective screening platform for hosting mass spectrometric data obtained by LC-HR-MS was created in 2017 (https://norman-data.net), with the ambition of becoming a European and possibly global standard for retrospective suspect screening of environmental pollutants [5; Fig. 1]. This platform enables a quick and effective overview of the potential presence of thousands of substances either known or suspected to be present in the environment (based on the SusDat database), including a wide range of contaminants of emerging concern, their transformation products and unknowns, across a large number of samples and different matrices. A tool for semi-quantitative estimation of the concentration of any detected compound, based on structural similarity, is being tested.
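The idea of retrospective screening can be sketched in a few lines, assuming that each archived sample has been reduced to a list of measured m/z values and that suspects are given as [M+H]+ values. The sample names, peak lists, tolerance, and function names below are hypothetical and do not reflect the DSFP data model. Once the peak lists are archived, adding a new suspect only requires another lookup, not a new measurement.

```python
from bisect import bisect_left


def is_present(sorted_mz, target_mz, tol_ppm=5.0):
    """True if any archived peak lies within tol_ppm of target_mz."""
    tol = target_mz * tol_ppm / 1e6
    i = bisect_left(sorted_mz, target_mz - tol)  # first peak >= lower bound
    return i < len(sorted_mz) and sorted_mz[i] <= target_mz + tol


def presence_matrix(archive, suspects, tol_ppm=5.0):
    """Presence/absence of every suspect in every archived sample."""
    return {
        sample: {
            name: is_present(sorted(peaks), mz, tol_ppm)
            for name, mz in suspects.items()
        }
        for sample, peaks in archive.items()
    }


# Hypothetical archive: sample name -> measured m/z values (illustrative data).
ARCHIVE = {
    "river_A_2017": [181.0721, 195.0877, 301.1410],
    "effluent_B_2017": [237.1023, 415.2100],
}

# Theoretical [M+H]+ values of two well-known contaminants.
SUSPECTS_MZ = {"caffeine": 195.08765, "carbamazepine": 237.10224}

if __name__ == "__main__":
    for sample, row in presence_matrix(ARCHIVE, SUSPECTS_MZ).items():
        print(sample, row)
```

A real platform would also use retention information, isotope patterns, and fragments to confirm a hit; a mass match alone only flags potential presence.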

European (NORMAN) MassBank
A database for MS (mainly high resolution) spectra of substances of environmental and metabolomic relevance was created in Europe in 2011, using a format developed previously in Japan. European (NORMAN) MassBank (https://massbank.eu/MassBank/) now contains 57,472 unique mass spectra of 14,667 substances (accessed on 10 May 2019). The exact mass, fragmentation, and measurement information on all substances are feeding into the NORMAN DSFP. In SOLUTIONS, the joint efforts of the environmental and metabolomics community on MassBank development improved and a developer consortium was founded (https://github.com/MassBank/).

Demonstration and evaluation in case studies
The databases developed within NORMAN/SOLUTIONS presented above have already been applied in several case studies related to SOLUTIONS. In the Joint Danube Survey 3 (2013; [12]), a wide-scope target and suspect screening using comprehensive substance lists was tested by several laboratories. Wide-scope target screening tools combined with bioassays were systematically used in the assessment of abatement options in the River Rhine catchment [13]. The NormaNEWS study was carried out in 2017, establishing a global emerging contaminant early warning network to rapidly assess the spatial and temporal distribution of contaminants of emerging concern in environmental samples through retrospective analysis of HR-MS data. The effectiveness of such a network was demonstrated through a pilot study, in which eight reference laboratories with available archived HR-MS data retrospectively screened data acquired from aqueous environmental samples collected in 14 countries on 3 different continents [14]. Wide-scope target (>2100 substances) and suspect screening (NORMAN SusDat; >40,000 substances) were performed in water, sediment, and biota samples in the Joint Black Sea Surveys (2016, 2017; [15]). Waste water treatment plant effluents in the Danube River Basin were analysed thoroughly in 2017 using a battery of SOLUTIONS/NORMAN bioassays together with wide-scope target and suspect screening, in cooperation with the International Commission for the Protection of the Danube River (ICPDR) [16]. The outcomes of the case studies support further development of harmonised databases for archiving 'big data' from NTS.

Fig. 1 Adopted workflow for obtaining harmonised raw screening monitoring data through the Digital Sample Freezing Platform (DSFP) interface [5]