An assessment of Quality Assurance/Quality Control Efforts in High Resolution Mass Spectrometry Non-Target Workflows for Analysis of Environmental Samples

The application of non-target analysis (NTA), a comprehensive approach to characterize unknown chemicals, including chemicals of emerging concern, has seen a steady increase recently. Given the relative novelty of this type of analysis, robust quality assurance and quality control (QA/QC) measures are imperative to ensure quality and consistency of results obtained using different workflows. Due to fundamental differences to established targeted workflows, new or expanded approaches are necessary, for example to minimize the risk of losing potential substances of interest (i.e. false negatives, Type II error). We present an overview of QA/QC techniques for NTA workflows published to date, specifically focusing on the analysis of environmental samples using liquid chromatography coupled to high resolution mass spectrometry (HRMS). From a QA/QC perspective, we discuss methods used for each step of analysis: sample preparation, chromatography, mass spectrometry, and data processing. We then finish with a series of recommendations to improve the quality assurance of NTA workflows. © 2020 The Authors. Published


Introduction
A growing threat to human health is cumulative and complex exposure to chemicals in the natural and man-made environment. In 2016 the World Health Organization reported 1.6 million deaths and 45 million disability-adjusted life-years lost due to known chemical exposures [1], and that number is increasing. A large number of new chemicals are introduced to the market annually (more than 10^5 per year since the late 20th century), which represents a drastic increase in the chemical space (Fig. 1), i.e. the totality of all chemical species (in a sample) [2]. Relatively few of these so-called 'chemicals of emerging concern' (CECs) are adequately characterized with respect to their toxicity and environmental fate, preventing accurate risk assessment [3]. This issue is compounded by our limited understanding of biotic and abiotic transformation processes, which can produce mixtures and byproducts that are potentially more toxic than their parent compounds [4–6]. Overall, the small number of regulated chemicals included in targeted environmental and/or human biomonitoring studies misrepresents the complexity of the chemical exposome [7,8]. Non-target analysis (NTA) (see Table 1 for definitions), which aims at the identification of CECs or general underlying trends (e.g. spatial or longitudinal comparison studies) using high-resolution mass spectrometry (HRMS), has emerged over the past few years as an approach to address this knowledge gap. Thus, NTA has the potential to provide the specificity and accuracy required for high confidence identification [9–13], which should, however, always be followed by a targeted confirmation/validation.
HRMS methods using either liquid (LC) or gas chromatography (GC) for the characterization of unknown compounds in complex samples, or for the comparison of samples in a non-targeted way, share five main steps, which should be refined by an appropriate experimental design depending on the purpose of the study [9,14–24]: (1) sample collection and preparation; (2) chromatographic separation; (3) data acquisition (via mass spectrometry); (4) data processing; and (5) reporting of results (Fig. 2). Because the objectives can differ significantly, each of these steps differs from targeted mass spectrometry workflows, where QA/QC protocols are often well defined (Fig. 2), and thus requires unique quality assurance/quality control (QA/QC) measures [25]. While proponents of NTA recognize the need for validated and harmonized QA/QC measures, such measures currently do not exist in a generally accepted form for all steps and addressing all topics of concern (Figs. 3–6). Existing measures focus mostly on the confidence in reporting the results of compound identification [12,26–28], i.e. the reduction of false positives (Type I errors), and on ensuring reproducibility of compound identifications, for example by demonstrating high mass accuracy or fitting isotope profiles [27]. Although this is crucial, it is also important to assess what might be omitted, i.e. false negatives (Type II errors), as a result of the approach used in each respective step, and how this can be avoided as far as possible. As of now, robust and reproducible QA/QC workflows remain a challenge, due to a) the diversity of sample matrices and composition, b) undefined unknown analyte constituents, their inherent physicochemical properties and concentrations, and c) the complexity and volume of HRMS data produced, especially in regard to the potentially vast number of false negatives. One main aim should therefore be to reduce this number as much as possible, or at least to give an informed discussion on what might not have been identified. Common QA procedures in NTA are often limited in their ability to detect such omissions, and a more fit-for-purpose QA/QC is therefore needed to understand the explored chemical space and any losses or limitations in that analysis. This is difficult to demonstrate without commonly defined QA/QC guidelines.
Here we present an overview of QA/QC procedures applied to NTA over the past decade. Our analysis includes an overview of the potential sources of uncertainty in NTA experiments and possible steps to mitigate those issues and an assessment of the effectiveness of existing QA/QC tools, and concludes with a series of recommendations to advance the field of NTA. We chose, as far as possible, to remain vendor-neutral to ensure broad applicability of the findings of this study.
To assess existing QA/QC measures in the field of NTA, we investigated peer-reviewed papers applying an NTA workflow to an environmental matrix and identified which QA/QC measures had been used. The main focus was on publications using LC in combination with Orbitrap or time-of-flight (TOF) mass spectrometers, as these are the most common. Furthermore, we focused on research concerning small molecules (<1200 Da) in environmental samples, as this is one of the main areas for NTA outside of metabolomics and proteomics. A list of the publications considered can be found in the supplementary information. It must be stated that this list is by no means a comprehensive list of all environmental non-target research, as transitions to related fields such as metabolomics and proteomics can be fluid, and these fields have not been considered herein. In the next sections we examine each of the aforementioned workflow steps in detail.

Table 1. Glossary of terms.

Candidate
The term candidate typically refers to either a potential molecular formula or chemical structure associated with a feature and/or component in a sample.

Chemical space
Chemical space comprises all the chemicals that are present in a specific sample, independent of their nature and/or source.

Component
A component is the entirety of data, including all the information measured during the analysis, associated with a unique chemical constituent in the samples. Depending on the method used (e.g. GC vs LC), a component may include information about the molecular ion, fragments, adducts, and isotopes.

Feature
A feature is a three-dimensional construct with a potential Gaussian shape that may represent a chemical constituent in a sample. A feature is the combination of chromatographic and mass spectral peaks, and is represented as a tensor of time, mass, and intensity.

Peak
A peak is a two-dimensional entity with a Gaussian-like shape that has intensity as the dependent dimension (i.e. "y" axis) and either time or mass as the independent dimension (i.e. "x" axis), giving a chromatographic or mass peak, respectively.

QA/QC
In this paper, QA/QC refers to protocols and procedures implemented to ensure that sample analysis is consistent, comparable, precise and accurate.

Suspect screening
Suspect screening describes the identification of known unknowns using a combination of spectral databases and additional information such as retention indices and physico-chemical properties.

Non-target screening/analysis
Studies in which HRMS is used for the identification of known unknowns, unknown unknowns, sample fingerprinting, and source tracking, with little to no prior knowledge regarding the chemical composition of the samples [29–36].

Pre-processing
Steps within the workflow that do not directly contribute to the hypothesis testing, e.g. noise removal, smoothing etc.

Post-processing
Steps taken by researchers after data processing by the program.

Candidate ions
Precursor ions selected for fragmentation.

Internal standard
Chemical used for comparison at various stages of the workflow. Usually isotope-labelled, because otherwise the researcher cannot confirm the source of the chemical.

Known unknowns
Compounds with well-documented structural information, such as high-resolution mass spectra, that are recorded in databases.

Unknown unknowns
Compounds which have not been reported before and are therefore missing the information available for "known unknowns".

Ion suppression/enhancement
Changed intensity of specific m/z values in the chromatogram as a result of matrix effects.

Step 1: sample preparation
Broadly speaking, the aim of sample preparation is to isolate components of interest (e.g. CECs) from a crude sample matrix, thereby reducing sample complexity, reducing or removing potential matrix interferences [37], and concentrating substances present at low concentrations. The a priori nature of NTA requires generic sample preparation methods that preserve as much of the chemical space of the sample as possible, across a wide range of physicochemical properties, while minimizing background interferences [16,22,38,39]. However, each sample preparation approach may result in loss of information about chemicals that are not amenable to the method of extraction due to solubility and/or polarity [40,41] (i.e. extraction bias as a result of the selectivity of the method) (Fig. 3). In addition, artefacts of sample preparation can occur during any step (e.g. formation of degradation and transformation products, deconjugation, formation of adducts, contamination of the sample, or introduction of constituents to the sample during sample collection, handling and preparation/extraction).
To overcome these issues, some methods employ generic wide-scope sample preparation protocols with minimal sample manipulation (e.g. direct injection of liquid samples; 'dilute-and-shoot' methods [37,42]). The advantage of this approach is that when sample preparation is kept to a minimum, sampling of the chemical space is more comprehensive. The disadvantage is that concentrations are low compared to pre-concentrated sample extracts, and very sensitive analytical detection methods may be required. Therefore, these options are possible only in cases where expected analyte concentrations are sufficiently high to be detected without a pre-concentration step (e.g. examination of wastewater influent or highly contaminated samples).
For samples (e.g. when using passive sampling) or approaches (e.g. NTA in combination with suspect screening) that require extraction and/or concentration, a range of methods can be used [43–46]: solid phase extraction, typically utilizing sorbent materials ranging from ion exchange to conventional reversed phase (e.g. octadecyl silane) and multi-purpose polymeric phases such as reversed-phase hydrophilic-lipophilic balance (HLB); liquid-liquid extraction (LLE); ultrasonic extraction; or QuEChERS approaches. The more complex the sample preparation step(s), the higher the likelihood of introducing artefacts into samples as well as losing chemicals of interest. Normally, contamination brought into the sample during preparation and analysis steps can be accounted for by using blank and control samples [47]. While these samples (for example procedural, solvent and field blanks, depending on the experimental design) are primarily used to identify sample contamination in targeted workflows and are aimed only at the chemicals of interest, they play a much more pivotal role in NTA, where they provide a crucial way of identifying the introduction of any extraneous constituents into samples post sampling. However, as multiple types of artefacts can be introduced during several sample processes (as mentioned above), the use of multiple field, procedural, solvent, (pooled) matrix and analytical blanks, as well as positive controls, is essential for accounting for and eliminating these NTA artefacts.
Fig. 2. General overview of workflow steps and their main differences to be considered in targeted vs non-target approaches. Individual steps will differ based on the experimental design and aim of the study.
The majority of QA/QC approaches applied to assess the performance of NTA sample preparation, when used/reported, are an extension of those employed for quantitative target analysis [21,48]. Analyte recovery from each matrix is calculated from extraction recovery experiments, often employing isotopically labelled internal standards [21]. This typically involves a comparison of the calculated concentrations for samples fortified with internal standard pre- and post-extraction, on the assumption that the internal standard behaves like the analytes of interest. However, recent studies have shown that for highly complex samples this assumption may not be valid, due to the limited number of active sites and/or high levels of interferences caused by complex sample matrix background, which may reduce the method sensitivity [39,49]. Nevertheless, so far this is one of the most effective solutions to account for these issues. A further limitation is that internal standards do not exist for every analyte, and the use of multiple standards can be costly.
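The pre- versus post-extraction comparison described above can be sketched as follows. This is a minimal illustration only: the function name `extraction_recovery` and the peak areas are hypothetical and not taken from any of the reviewed studies, and the sketch assumes that detector response is proportional to concentration for the spiked standard.

```python
from statistics import mean


def extraction_recovery(pre_spiked_areas, post_spiked_areas):
    """Percent recovery: mean response of aliquots spiked before extraction
    relative to the mean response of aliquots spiked after extraction."""
    if not pre_spiked_areas or not post_spiked_areas:
        raise ValueError("both replicate lists must be non-empty")
    post_mean = mean(post_spiked_areas)
    if post_mean <= 0:
        raise ValueError("post-extraction response must be positive")
    return 100.0 * mean(pre_spiked_areas) / post_mean


# Made-up peak areas for one labelled standard, three replicates each
recovery = extraction_recovery([8.0e5, 8.4e5, 7.6e5], [1.0e6, 1.0e6, 1.0e6])
print(f"recovery = {recovery:.1f} %")
```

Reporting such a value per standard and per matrix, rather than a single pooled figure, would make losses for particular regions of the chemical space visible.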
Recommendation: Fortified samples and/or (groups of) internal standards, or native standards that are of no interest for the particular study, should be considered; they should represent the widest range of physicochemical properties (e.g. log Kow values) relevant to the sample matrix investigated. Recovery experiments for the respective matrix using these standards can and should be conducted and completely reported. Additionally, several procedural blanks must be included during the analysis, and their results should be reported alongside those of the samples themselves.

Step 2: chromatographic separation
In general, the purpose of chromatographic separation is to:
I. Achieve sufficient retention of analytes across the time axis of the chromatogram to optimize mass spectrometer cycle time, thereby also providing additional means of identification [26]; and
II. Reduce ion suppression [50], i.e. by resolving matrix interference not removed by the sample preparation.
Understanding the role of chromatography in any NTA workflow is essential because it defines the region of the chemical space explored for each sample.
LC and GC are the two main chromatographic approaches used in NTA. Briefly, LC is used for the separation of a wide range of polar and semi-polar to non-polar analytes, depending on the solid stationary phase and the one or more liquid mobile phases used. GC is used for the separation of non-polar and semi-polar analytes, which must be both volatile (or at least semi-volatile) and heat resistant, and is therefore less amenable to aqueous environmental samples.
Thus, the majority of the NTA studies reviewed here use conventional one-dimensional C18 reversed-phase (RP) LC due to its demonstrated robustness and reproducibility (e.g. Ref. [51]), as can also be seen in previous collaborative studies [9]. Additionally, the partitioning processes of compounds between C18 stationary phases and the mobile phases are better established [52,53] than those for more polar stationary phases such as those used in hydrophilic interaction chromatography (HILIC) [54–56], implying that retention behaviours of well-known chemicals are better modelled and therefore easier to use in retention time prediction models (see Step 5: Identification and reporting).
Developing and validating chromatographic methods requires the optimisation of a number of different parameters: chromatography type (e.g. GC, LC), column chemistry, mobile phase composition, profile and flow rate, etc., based on the experimental design and the hypothesis to be tested. In targeted analyses, these parameters are optimized for a finite analyte list, and assessed for small variations (i.e. method robustness) during method validation. However, for NTA, changes to any one of these parameters may alter the analysable fraction of the chemical space being explored in that experiment (i.e. selectivity (Fig. 4)). For example, column selection may result in a portion of the sample chemical space not being explored, e.g. the focus of a generic C18 column will be on nonpolar to semipolar compounds, while HILIC is generally better suited for the analysis of (highly) polar substances [57,58]. Options to increase the explored chemical space by combining these, either through 2D-chromatography applications [59–61] (which bring their own challenges) or as mixed-mode chromatography [62], have been tested.
To address many of these limitations, the analyst may utilize a set of internal standards (ISs) during method development and validation. Depending on the number of standards used (ranging between zero and 2000 [37]) and their physical and chemical properties, they may not represent the entirety of the explored chemical space (Fig. 1). For example, in a recent publication two RP columns, both with pentafluorophenyl ligands as stationary phase, showed extreme differences in retention behaviour for a few distinct substances, while all others had comparable retention times [57], a phenomenon which might not have been picked up by a finite number of ISs, e.g. when making changes to a method. Also, a well-resolved chromatographic peak and a strong, characteristic MS/MS fragmentation pattern obtained using a set of parameters optimized for a specific standard compound may not necessarily ensure similar performance for unknown compounds in a complex sample. Additionally, analyte performance is likely to be matrix-specific, adding further challenges. Finally, as for the sample preparation steps, the introduction of features not related to the sample itself (e.g. impurities in the liquid phase, column bleed, contamination during the injection, or carry-over) must be monitored by using enough blanks (i.e. blank subtraction). Most current tools do not facilitate automated blank subtraction (see 5.1 Noise Removal/Data Compression). The risk of systematic errors as a result of carry-over and other batch effects (e.g. general retention time shifts, decreasing sensitivity) can, however, be easily reduced/checked by randomizing the sample run order, regularly injecting solvent blank samples spiked with (internal) standards, and making sufficient use of duplicate samples.
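One possible way to express the blank subtraction mentioned above is sketched below. It is not the procedure of any particular software package: the feature representation (m/z, retention time, intensity), the function names and all thresholds (5 ppm, 0.2 min, 10-fold) are assumptions chosen for illustration.

```python
def ppm_difference(mz_a, mz_b):
    """Absolute m/z difference expressed in parts per million of mz_b."""
    return abs(mz_a - mz_b) / mz_b * 1e6


def subtract_blank(sample_features, blank_features,
                   ppm_tol=5.0, rt_tol_min=0.2, min_fold=10.0):
    """Return the sample features that are not explained by the blank.

    Features are (m/z, retention time in min, intensity) tuples. A sample
    feature is kept if no blank feature matches it within the tolerances,
    or if it is at least `min_fold` times more intense than the most
    intense matching blank feature. All thresholds are assumed values.
    """
    kept = []
    for mz, rt, intensity in sample_features:
        blank_max = max(
            (b_int for b_mz, b_rt, b_int in blank_features
             if ppm_difference(mz, b_mz) <= ppm_tol
             and abs(rt - b_rt) <= rt_tol_min),
            default=0.0,
        )
        if intensity >= min_fold * blank_max:
            kept.append((mz, rt, intensity))
    return kept


# Made-up example: the second feature also occurs in the procedural blank
sample = [(195.0877, 3.10, 5.0e5), (279.1591, 9.80, 2.0e5), (301.1410, 7.25, 8.0e4)]
blank = [(279.1592, 9.82, 1.5e5)]
kept = subtract_blank(sample, blank)
print([mz for mz, _, _ in kept])
```

Using a fold-change criterion instead of removing every feature seen in a blank is a design choice; it avoids discarding genuine sample constituents that are also present at trace level in the blank, at the cost of one more parameter that has to be reported.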
Recommendation: As for sample preparation, ISs should be used in a way that covers the complete chemical space the specific study is investigating. Pooled samples consisting of small aliquots of all samples in the experiment, and therefore acting as an average sample, similar to pooled biological quality controls (PBQC) in metabolomics [63], can be used for the development process and help to report the suitability of the method. Both pooled samples and ISs can also be used to monitor, report and potentially correct the daily performance of the system, i.e. to control the reproducibility of the method. Moreover, multiple injections of the same sample, as well as of replicates, are an efficient way to assess instrument stability and to provide enough statistical power for later stages of NTA experiments.
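Monitoring system performance with ISs in repeated pooled-sample injections could, for instance, be summarised as a relative standard deviation (RSD) per standard. The sketch below is an assumption-laden illustration: the standard names, peak areas and the 20 % limit are hypothetical, and no acceptance criterion is prescribed by the text above.

```python
from statistics import mean, stdev


def percent_rsd(values):
    """Relative standard deviation (sample standard deviation / mean) in percent."""
    if len(values) < 2:
        raise ValueError("at least two injections are needed")
    return 100.0 * stdev(values) / mean(values)


def flag_unstable(areas_by_standard, max_rsd=20.0):
    """Names of standards whose %RSD across injections exceeds max_rsd
    (an assumed limit, to be set per study)."""
    return sorted(name for name, areas in areas_by_standard.items()
                  if percent_rsd(areas) > max_rsd)


# Made-up IS peak areas from three pooled-sample injections across a batch
qc_areas = {
    "IS-1": [100.0, 110.0, 90.0],
    "IS-2": [100.0, 160.0, 40.0],
}
print(flag_unstable(qc_areas))
```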

Step 3: data acquisition via mass spectrometry
For this section, we focus on high-resolution instruments, such as (quadrupole) orbitrap and quadrupole time-of-flight (QTOF) mass spectrometers, as they are the most commonly used for NTA [9]. Other high-resolution instruments, such as Fourier transform ion cyclotron resonance mass spectrometers (FTICR-MS), are only rarely used in currently applied "routine" NTA, as they are often used without chromatography [64] and therefore miss out on retention time as a means of identification.

Ionization and fragmentation
Independent of the instrument used, the first step in mass spectrometry is the creation of ions (Fig. 5). Ionization strategies are broadly defined as either soft (electrospray ionization, ESI; atmospheric pressure chemical ionization, APCI; atmospheric pressure photoionization, APPI) or hard (electron impact ionization, EI), depending on the amount of energy applied to the system [65]. It is important to note that if a particular ionization technique (or all of them) is not able to ionize a specific compound, this compound will not be detected, effectively decreasing the explored chemical space (Fig. 1) again.
As a 'hard' high-energy technique, EI produces many fragments, making it ideal for structural characterization, but it rarely produces a signal for the molecular ion. EI is mainly used for the analysis of volatile and semi-volatile chemicals in the gas phase via GC-MS. This method of ionisation is robust and reproducible across mass spectrometry platforms and vendors, facilitating the curation of large spectral reference libraries, such as the National Institute of Standards and Technology (NIST) Mass Spectral Library. Such spectral libraries are invaluable for identification purposes, but difficulties persist [12,66,67] (see Step 5: Identification for further discussion).
Of the soft ionization techniques, ESI is more commonly used than APCI or APPI due to its broad applicability to many different compound structures, high efficiency in ionising organic compounds containing heteroatoms and easy coupling to liquid chromatography [68]. Consequently, it is used frequently for the analysis of polar and semi-polar organic chemicals via LC-MS [65]. Under ideal conditions and independent of the type of ionisation, the relative intensity of the generated ions is directly proportional to the ionisation efficiency of the parent compound. However, ionisation efficiency in ESI is highly dependent on the mobile phase, mostly its pH [69,70]. Also, for complex samples, individual compounds are often not completely chromatographically resolved, meaning multiple species enter the source simultaneously. These analytes then compete for the available ionisation potential, giving rise to ion suppression (or enhancement) [71,72]. Ion suppression can be caused by matrix constituents (i.e. matrix suppression, or matrix effects), or a high ion population in the source. In contrast, ion suppression seems to be reduced when using APCI [50]; however, the use of APCI in NTA is still limited (e.g. Refs. [73,74]). It should be noted that even in proteomics experiments, where samples are highly complex (e.g. cell lysate), sample loading is low (in the range of nanograms), and chromatographic gradient profiles are long (ranging from 30 min to several hours), ion suppression remains a challenge [75–77]. Typically, an analyst will normalise against a range of internal standards which have a) a similar retention time and/or b) a similar chemical structure, in an attempt to account for ion suppression/enhancement effects. The signal from these internal standards is compared with and without the presence of the sample matrix; however, this is not possible for every single chemical constituent in the sample and their potentially unknown structure, due to their sheer number [37,78,79]. Moreover, there are no commonly accepted guidelines for the minimal required number and type of internal standards to be added to the sample to adequately assess the effect of ion suppression/enhancement [80].
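The comparison of internal standard signals with and without sample matrix can be expressed as a percentage matrix effect. The sketch below uses one common convention (signal in matrix relative to signal in solvent, minus one); the function name, standard names and peak areas are hypothetical.

```python
def matrix_effect_percent(area_in_matrix, area_in_solvent):
    """Matrix effect in percent: negative values indicate ion suppression,
    positive values indicate ion enhancement."""
    if area_in_solvent <= 0:
        raise ValueError("solvent response must be positive")
    return (area_in_matrix / area_in_solvent - 1.0) * 100.0


# Made-up peak areas: (spiked into sample extract, spiked into pure solvent)
standards = {
    "IS-early-eluting": (6.0e5, 1.0e6),
    "IS-late-eluting": (1.2e6, 1.0e6),
}
effects = {name: matrix_effect_percent(m, s) for name, (m, s) in standards.items()}
for name, effect in effects.items():
    label = "suppression" if effect < 0 else "enhancement"
    print(f"{name}: {effect:+.1f} % ({label})")
```

Because the effect depends on what co-elutes, a value obtained for one standard only describes the retention time region and compound class that standard represents.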
As a result of the 'soft' ionisation, an additional fragmentation step (Fig. 5) is needed, which is typically achieved by applying a collision energy (CE) within a collision cell (often the second quadrupole). The obtained fragmentation patterns are crucial for structural elucidation and compound identification by comparison with databases [12,20,81,82], but the spectra often vary, particularly in the relative intensity and number of different fragments, as a result of instrument type, collision gas and energy [67,83]. In theory, every compound has its own ideal CE for optimal fragmentation. Consequently, one QA/QC effort within the NTA literature is on standardizing experimental design with respect to collision energy for a multitude of compounds at the same time, for example by employing a collision energy ramp (e.g. 10–45 eV [29,84–86]) to collect an average spectrum, or by collecting three or more spectra at nominal collision energies (e.g. 10, 20 and 40 eV [87–89]). However, some compounds, such as per- and polyfluoroalkyl substances (PFAS), might need higher collision energies [90], making it difficult to cover the complete chemical space in one MS/MS experiment. Potentially, this could be addressed further by using the chemical structure to predict the optimal CE based on the m/z of the precursor ion [91,92]. While these measures have increased the potential for comparing the generated spectra across different instruments, more effort is needed to transfer this to universally usable spectral libraries, which ideally could be used for all types of instruments.

Resolution and mass accuracy
NTA typically employs HRMS with mass resolution, defined as the ratio of the m/z of a peak to its width, ranging from 10,000 to 300,000 (only orbitrap; QTOF normally up to 35,000). Mass accuracy is typically between 0.1 Da and 0.0001 Da and is dependent on compound mass. However, without sufficient resolving power one can still obtain mass errors when two peaks are not completely resolved. If instead of two peaks (high resolving power) only one (lower resolving power) is observed and its apex is used for mass determination, this mass will lie between the two actual exact masses, leading to an increased mass error and the possibility of missing at least one substance. Resolving power is dependent on the time that each ion spends in the mass analyser before reaching the detector [93]. On orbitrap instruments there is a direct relationship between spectral resolution and scan rate (i.e. speed). So, while mass resolution can be improved by increasing the time ions spend in the trap, it is limited by the time required to perform a scan. If the scan time is too long, analytes eluting from the chromatography in narrow peaks might not be sampled. For QTOF systems, instrument resolution is determined by the length of the flight tube and cannot be changed. Sampling speed nevertheless remains crucial to data quality [10,11].
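The effect of an unresolved pair of ions on the measured mass can be illustrated numerically. The sketch below assumes the common definition of resolving power as R = m/Δm and, as a deliberate simplification, approximates the apex of the merged peak by the intensity-weighted mean of the two exact masses; the m/z values are made up.

```python
def ppm_error(measured_mz, exact_mz):
    """Signed mass error in parts per million."""
    return (measured_mz - exact_mz) / exact_mz * 1e6


def resolving_power_needed(mz_a, mz_b):
    """Approximate resolving power R = m / delta_m needed to separate two ions."""
    if mz_a == mz_b:
        raise ValueError("the two m/z values must differ")
    return ((mz_a + mz_b) / 2.0) / abs(mz_a - mz_b)


def merged_apex(mz_a, intensity_a, mz_b, intensity_b):
    """Crude stand-in for the apex of an unresolved doublet: the
    intensity-weighted mean of the two exact masses (an assumption)."""
    return (mz_a * intensity_a + mz_b * intensity_b) / (intensity_a + intensity_b)


# Made-up isobaric pair, 10 mDa apart, with equal intensity
mz_1, mz_2 = 300.0000, 300.0100
needed = resolving_power_needed(mz_1, mz_2)
apex = merged_apex(mz_1, 1.0, mz_2, 1.0)
print(f"R needed: about {needed:,.0f}")
print(f"error vs first ion:  {ppm_error(apex, mz_1):+.1f} ppm")
print(f"error vs second ion: {ppm_error(apex, mz_2):+.1f} ppm")
```

In this toy case an instrument operating well below the required resolving power would report a single mass that is wrong for both ions by more than a typical identification tolerance of a few ppm.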
For mass accuracy, all instruments rely on stable electronics of the mass analyser and continuous readjustment of mass calibration equations. For orbitraps, this is the relationship between orbiting frequency and m/z, and for QTOF systems it is the relationship between the time-of-flight and m/z. The adjustment is performed via instrument tuning and mass calibration. Normally, tuning is performed periodically as part of routine instrument maintenance, with the objective of checking for mass shifts caused by the electronics. The goal of mass calibration, on the other hand, is to correct for mass shifts caused by the presence of the sample and mobile phase, by aligning the mass axis with the masses of known compounds [94,95]. Both tuning and mass calibration are mostly performed by infusing one or more well-known chemicals, in the absence of matrix, directly into the source of the MS.
The composition of the tuning mix, the frequency of tuning and calibration, and the methods for mass correction may vary from one vendor to another, or from one lab to the next. External calibrations are often used, i.e. calibration of the MS in between two chromatographic runs, using solutions supplied by the vendor of the instrument. Some articles report the use of in-house calibration solutions such as a caffeine reserpine solution [96] or amino acid solutions [91]. However, when executed between two chromatographic runs, external calibration can influence the equilibration time of a column and potentially the retention of compounds, posing a problem especially for sensitive columns such as those used for HILIC [97]. Alternatively, internal calibration, i.e. measuring one or several compounds additionally during the whole analysis, can be used. Results can be used for immediate recalibration of the instrument or for post-acquisition calibration (e.g. Refs. [20,98]). Internal calibration can deliver improved mass accuracies; however, it can also lead to additional ion suppression [99]. One option is the "MassLock" technique, which recalibrates the instrument at fixed time intervals, for example every 10 s, using a specific well-known mass (e.g. leucine enkephalin (Leu-Enk) [84,86,100]). Otherwise, it is also possible to continually use known contaminants, mobile phase adducts, column bleed, etc., i.e. background ions [28,101]. This could help circumvent the problem that mass correction factors are mostly calculated from a limited number of m/z values and then extrapolated over the measured mass range. Inadequate application of mass calibration may cause an increase in the mass error of the individual m/z values. These cases are extremely difficult to detect and may have negative effects on the quality of the generated spectra, increasing the difficulty not only of the general identification but also of deconvolution and other algorithms which rely on a narrow mass window.
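The extrapolation issue described above can be made concrete with a deliberately simple, single-point correction. This is a hedged sketch and not a vendor algorithm: it assumes a purely multiplicative mass shift, and the reference m/z, the measured values and the function names are all illustrative.

```python
def lock_mass_factor(measured_lock_mz, reference_lock_mz):
    """Multiplicative correction derived from a single reference ion."""
    if measured_lock_mz <= 0:
        raise ValueError("measured m/z must be positive")
    return reference_lock_mz / measured_lock_mz


def recalibrate(mz_values, factor):
    """Apply the same correction factor across the whole mass range."""
    return [mz * factor for mz in mz_values]


# Assumed reference value for a hypothetical lock mass ion, and a made-up
# measurement of that ion that reads about 5 ppm too high
REFERENCE_MZ = 556.2766
measured_lock = 556.2794

factor = lock_mass_factor(measured_lock, REFERENCE_MZ)
corrected = recalibrate([195.0887, measured_lock, 922.0144], factor)
print([round(mz, 4) for mz in corrected])
```

The correction is exact only at the reference ion; at m/z values far from it the true shift may differ, which is the reason several reference masses or background ions spread over the mass range can be preferable.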

Centroid versus profile data
When ions reach the detector of a mass spectrometer, they ideally have a Gaussian-like profile, where all the ions associated with the distribution of each mass are recorded [102]. When profile data is stored, this mass distribution profile is retained in the final dataset. Conversely, when data is 'centroided', the mass distribution profile is represented by either the mean or median, and additional information is discarded. Most modern HRMS instruments are set to generate profile data by default, with the option of centroiding afterwards [103]. Profile data has the advantage of including all available information related to the distribution of a certain mass. This implies that the actual mass resolution can be calculated for every single mass in the spectra rather than relying on the nominal mass resolution of the instrument. Additionally, the shape of these profiles, as well as the number of points associated with a mass peak (which is not available in centroided data), may be informative of sample/mass purity, which is essential information for structure elucidation (see Step 5: Feature Identification and Reporting). The trade-off for these advantages is larger and more complex data files. For comparison, the file size of a centroided dataset may be several times smaller than that of the same data in profile mode [12].
Centroiding can be performed either during the data acquisition (referred to as "on-the-fly"), or as an initial data processing step (i.e. collecting and potentially archiving profile data, then centroiding it as a secondary step). Centroiding usually consists of fitting a Gaussian-like distribution to the data using the nominal resolution of the instrument. When performed on-the-fly, this process employs simple mathematical approaches to match the speed requirements of the detector and is typically prescribed by the instrument manufacturer. A disadvantage of on-the-fly centroiding is that there is no opportunity to detect or correct any errors. Conversely, centroiding as a post-processing step may employ more sophisticated signal processing approaches [102] and offers the opportunity to validate the result at a later stage. Most of the studies reviewed here can be assumed to have used the default centroiding (provided by the respective instrument vendor software) as a pre-processing step rather than the "on-the-fly" option, since the specific method is mostly not stated. Open access tools were used for centroiding only rarely (e.g. Refs. [23,98]). These vendor-provided approaches are generally considered superior to post-data collection centroiding methods, due to convenience and the assumed advantage of access to proprietary instrument information. To our knowledge this assumption has not been tested.
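As an illustration of what is kept and what is lost on centroiding, the sketch below reduces the profile points of one mass peak to a single intensity-weighted mean m/z. This is a simple stand-in rather than the Gaussian fitting described above or any vendor algorithm, and the data points are made up.

```python
def centroid(mz_points, intensities):
    """Intensity-weighted mean m/z of the profile points of one mass peak."""
    if len(mz_points) != len(intensities):
        raise ValueError("m/z and intensity lists must have the same length")
    total = sum(intensities)
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return sum(mz * i for mz, i in zip(mz_points, intensities)) / total


# Made-up profile points of one symmetric mass peak
profile_mz = [200.000, 200.001, 200.002, 200.003, 200.004]
profile_int = [10.0, 60.0, 100.0, 60.0, 10.0]

centroid_mz = centroid(profile_mz, profile_int)
print(f"centroid m/z = {centroid_mz:.4f} from {len(profile_mz)} profile points")
```

The five points collapse into one value, so the peak width and shape, and with them any hint of an unresolved shoulder, are no longer available downstream.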

Data dependent (DDA) and data independent acquisition (DIA)
During DDA, each precursor ion selected from the survey scan (MS1 level) is fragmented to produce a specific mass spectrum (MS2 level). Candidate precursor ions are selected based on a priori defined criteria, e.g. intensity-based sampling or as part of an inclusion list. For NTA where, by definition, the target analytes are unknown, intensity-based sampling is commonly deployed. This ensures high quality of the acquired spectra. However, it must be noted that the substances of interest (e.g. because of their toxicity) may not necessarily have the most intense ions [104]. In practice, the number of candidate ions (i.e. precursor ions selected for fragmentation) ranges from three to 12, with a median of five, based on the papers investigated (see SI). The higher the number of different m/z values transmitted to fragmentation, the more information can be obtained on different compounds; however, this also increases the cycle time of the method, decreasing the number of data points across a chromatographic peak. Therefore, the speed of the mass spectrometer is crucial for maintaining data integrity.
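The selection logic described above can be sketched as follows. This is an illustrative simplification, not the acquisition logic of any particular instrument: `select_precursors`, its parameters and the example survey scan are hypothetical, and features such as dynamic exclusion are omitted. Ions on the inclusion list are taken first; the remaining slots are filled by decreasing intensity.

```python
def select_precursors(survey_scan, top_n=5, inclusion_mz=(), tol_ppm=5.0):
    """Pick up to `top_n` precursor m/z values from one MS1 survey scan.

    survey_scan  : list of (mz, intensity) tuples
    inclusion_mz : m/z values that are prioritized regardless of intensity
    """
    def on_inclusion_list(mz):
        return any(abs(mz - t) / t * 1e6 <= tol_ppm for t in inclusion_mz)

    # Inclusion-list ions sort first, then everything else by decreasing intensity.
    ranked = sorted(survey_scan, key=lambda p: (not on_inclusion_list(p[0]), -p[1]))
    return [mz for mz, _ in ranked[:top_n]]

# Hypothetical survey scan: (m/z, intensity)
scan = [(150.0, 1e4), (200.1, 5e5), (301.2, 2e5), (410.3, 8e3)]
```

With `top_n=2`, purely intensity-based sampling never fragments the low-intensity ion at m/z 410.3, which illustrates how a toxicologically relevant but weak signal can be missed unless it is placed on an inclusion list.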
DIA, on the other hand, is a comprehensive sampling approach where no a priori assumptions are made; MS/MS data is acquired for all precursor ions detected in the MS1 scan. This approach vastly increases sample coverage and reproducibility while facilitating retrospective data mining. From a hardware perspective, this is achieved by either submitting all precursors to fragmentation simultaneously [11], or dividing the mass range into a series of discrete m/z windows that are sequentially sampled for fragmentation. Commercial examples of this strategy include SWATH® by SCIEX [105], and SONAR by Waters [106,107]. The challenge of DIA is that any given MS/MS spectrum contains fragments from all the precursor ions captured by the m/z window, i.e. one MS/MS spectrum cannot be linked to a single precursor ion. To do so requires data deconvolution (discussed in 'Step 4: Data Processing').
Analogous to the need to optimize the number of candidate ions in a DDA experiment, in a DIA experiment the number and width of isolation windows should be optimized: the more numerous and the narrower the windows, the easier the data processing/interpretation; however, the cycle time increases, leading to fewer data points and a possible loss of sensitivity as a result of a reduced dwell time and therefore fewer detected ions. Overall, DDA seems to be more popular than DIA for non-target analysis: 60% and 19%, respectively (4% using both and the rest not describing a specific method), based on the papers investigated, despite the demonstrated capability of DIA for wide-scope data screening and potential for future mining via data archiving [108]. This could partly be attributed to the fact that advanced DIA methods have only been developed over the last few years, with the later introduction of software that is capable of deconvoluting complex data.
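The trade-off between the number of MS2 events (candidate ions in DDA, isolation windows in DIA) and the sampling of a chromatographic peak can be made explicit with a back-of-the-envelope calculation. The sketch below assumes one MS1 scan followed by a fixed number of equally long MS2 scans per cycle and ignores overheads such as inter-scan delays; the function names and all timing values are hypothetical.

```python
def cycle_time(ms1_scan_s, ms2_scan_s, n_ms2):
    """Duration of one acquisition cycle: one MS1 scan plus n MS2 scans (seconds)."""
    return ms1_scan_s + n_ms2 * ms2_scan_s

def points_per_peak(peak_width_s, ms1_scan_s, ms2_scan_s, n_ms2):
    """Approximate number of MS1 data points across a chromatographic peak."""
    return peak_width_s / cycle_time(ms1_scan_s, ms2_scan_s, n_ms2)

# Hypothetical method: 6 s wide peak, 0.1 s MS1 scan, 0.05 s per MS2 scan
for n in (3, 5, 12):
    print(n, round(points_per_peak(6.0, 0.1, 0.05, n), 1))
```

Under these assumed timings, going from three to 12 MS2 events per cycle roughly cuts the number of points across the peak to about a third, which is the kind of figure worth reporting alongside the acquisition parameters.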
Recommendation: Similar to the chromatographic separation, pooled samples and ISs covering the entirety of the chemical space to be explored can help to discover and report issues with ion suppression and other causes of information loss. Generally, thorough reporting of all parameters is recommended to enable others to reproduce the results. Furthermore, if possible the data should be acquired in profile mode, as the advantages significantly outweigh the disadvantages (i.e. increased file size/processing time).

Step 4: Data Processing
Data processing (Fig. 6) encompasses all procedures from data conversion to feature identification. Apart from data conversion, each data processing step aims to reduce the complexity of the acquired data. These workflows may include noise removal and/or data compression; feature detection; feature grouping or componentization; and feature prioritization, followed by feature identification. Given that detailed descriptions of each step and the associated tools have been published previously [37,80,109], here we briefly explain these steps along with an assessment of new and existing QA/QC approaches for each.

Noise removal/data compression
During this step the size of the dataset is drastically reduced, decreasing processing time and facilitating data archiving. The main objective of this step is to remove the recorded datapoints that do not belong to the sample and may potentially come from fluctuations in the instrument itself. The process can vary from simple intensity thresholding to adjustable region of interest (ROI) detection [110]. The simplest version of noise removal (i.e. threshold setting) removes all the datapoints below a user-defined intensity threshold [37,111]. More sophisticated options simultaneously model the noise and the region of interest (i.e. a segment of the data associated with analytical signal) prior to removal of any data points [109]. All these approaches rely on user-defined parameters that are typically defined on a case-by-case basis, based on the experience of the individual analyst and limited by the suite of available internal standards. These parameters may be inadequate for processing a specific dataset. For example, a recent study has shown how two different implementations of the same algorithm for ROI detection could result in substantial differences in the final output [112]. A comprehensive assessment of these algorithms, including parameter optimization, is necessary to ensure that relevant signals are not omitted due to the algorithm/parameter selection during noise removal/data compression. Currently, such an assessment is often not possible because the source code and algorithms of proprietary software are not publicly available.
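The simplest form of noise removal, together with a basic check for signal loss, can be sketched as follows. The sketch assumes data points stored as (retention time, m/z, intensity) tuples; `remove_noise`, `lost_standards` and the example values are hypothetical. The second function reports internal standards that no longer appear after thresholding, which is one inexpensive way to flag false negatives introduced by the chosen threshold.

```python
def remove_noise(points, min_intensity):
    """Keep only data points at or above a user-defined intensity threshold."""
    return [p for p in points if p[2] >= min_intensity]

def lost_standards(points, standard_mz, tol_ppm=5.0):
    """Return the internal-standard m/z values with no remaining data point."""
    lost = []
    for target in standard_mz:
        found = any(abs(mz - target) / target * 1e6 <= tol_ppm for _, mz, _ in points)
        if not found:
            lost.append(target)
    return lost

# Hypothetical data points: (retention time / min, m/z, intensity)
raw = [(1.0, 100.05, 50.0), (1.0, 237.102, 1200.0),
       (1.1, 237.102, 900.0), (1.1, 310.2, 80.0)]
filtered = remove_noise(raw, min_intensity=100.0)
```

In this example a threshold of 100 halves the dataset but also removes the only signal of the standard at m/z 310.2, so `lost_standards(filtered, [237.102, 310.2])` reports it as lost.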

Feature detection
During feature detection, information from mass and chromatographic peaks is combined to produce a three-dimensional entity, i.e. a feature, comprised of mass, retention time and intensity/area. Several options exist, both commercial (e.g. Waters, ThermoFisher, Agilent, SCIEX, Bruker) and open access (e.g. MATLAB, R, Python, and Julia) [23,113,114]. Feature detection algorithms can use both centroided and profile data, but always assume a Gaussian-like distribution [115–117]. Using that assumption, either a model is fitted to the data (e.g. Gaussian fit and/or inverted Mexican hat), or a decision tree is applied to raw or transformed data (e.g. first and second derivatives or apex detection) [24,118,119]. To process data with these algorithms, analysts use a combination of expert knowledge and mixtures of ISs to optimize the feature detection parameters, given the sheer number of features in such samples (e.g. 5000 to 10,000). The consensus within the NTA community is to minimize the rate of false negative detection during feature detection [37]. However, Hohrenk et al. performed a direct comparison of different feature detection algorithms, showing that the overlap between the peak lists detected for the same sample using different methods can be as low as ~10% [113]. The authors were not able to identify the exact sources of these discrepancies. At the same time, the authors question the potential implications of such discrepancies for the final interpretation of large-scale studies, which further suggests the need for clear QA/QC criteria for the assessment of feature detection.
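One way to quantify the agreement between two feature lists is sketched below. It is not the procedure used in [113]; it is a simple greedy one-to-one matching within assumed m/z and retention time tolerances, with the overlap expressed as matched features divided by the union of both lists. The function name, tolerances and example lists are hypothetical.

```python
def feature_overlap(features_a, features_b, mz_tol_ppm=5.0, rt_tol_min=0.1):
    """Match features (mz, rt) one-to-one and return (n_matched, overlap fraction)."""
    unused = list(features_b)
    matched = 0
    for mz_a, rt_a in features_a:
        for candidate in unused:
            mz_b, rt_b = candidate
            same_mass = abs(mz_a - mz_b) / mz_a * 1e6 <= mz_tol_ppm
            same_rt = abs(rt_a - rt_b) <= rt_tol_min
            if same_mass and same_rt:
                unused.remove(candidate)   # each feature may be matched only once
                matched += 1
                break
    union = len(features_a) + len(features_b) - matched
    return matched, (matched / union if union else 0.0)

# Hypothetical peak lists from two algorithms: (m/z, retention time / min)
list_a = [(200.1000, 5.00), (301.2000, 7.50), (410.3000, 9.10)]
list_b = [(200.1005, 5.03), (301.2100, 7.50), (150.0000, 2.00), (410.3001, 9.40)]
```

Because the result depends directly on the chosen tolerances, these should be reported together with any overlap figure.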

Componentization
Componentization groups isotopes, adducts and fragments associated with a single feature into one component. These components are used in later stages of the workflow for structural elucidation, and the quality of componentization is typically inversely proportional to the number of database queries. An effective componentization could potentially reduce the total number of features by a factor of two or more [82,120–122]. Componentization is informed by user-defined tolerances for retention time, mass difference (i.e. isotopes and adducts) and the similarity of the chromatographic profile [121]. These parameters are selected based on the software developers' and analysts' individual knowledge and experience, and on evidence collected from optimized experiments using labelled or native internal standards. However, the criteria for assessing the quality of componentization remain an open question. As such, there has not been a comprehensive assessment of the performance of the existing tools for componentization.
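A heavily simplified componentization can be sketched as below. It assumes positive-mode data and considers only two mass differences (the 13C isotope spacing and the difference between sodium and proton adducts) plus a retention time tolerance; chromatographic profile similarity and fragments are ignored. `componentize`, the tolerances and the feature list are hypothetical.

```python
# Mass differences (Da) between related ions of the same compound
MASS_DIFFS = {
    "13C isotope": 1.003355,
    "[M+Na]+ vs [M+H]+": 21.981944,
}

def componentize(features, rt_tol_min=0.05, mz_tol=0.002):
    """Group features (mz, rt, intensity) that co-elute and differ by a known mass shift."""
    feats = sorted(features, key=lambda f: f[0])
    component_of = list(range(len(feats)))      # every feature starts as its own component
    for i, (mz_i, rt_i, _) in enumerate(feats):
        for j in range(i + 1, len(feats)):
            mz_j, rt_j, _ = feats[j]
            if abs(rt_i - rt_j) > rt_tol_min:
                continue
            if any(abs((mz_j - mz_i) - d) <= mz_tol for d in MASS_DIFFS.values()):
                component_of[j] = component_of[i]   # attach heavier ion to the lighter one
    groups = {}
    for idx, comp in enumerate(component_of):
        groups.setdefault(comp, []).append(feats[idx])
    return list(groups.values())

# Hypothetical feature list: (m/z, retention time / min, intensity)
features = [(195.0877, 3.20, 1e6), (196.0910, 3.21, 1e5), (217.0696, 3.20, 2e5),
            (300.1500, 3.20, 5e4), (195.0880, 6.00, 3e4)]
components = componentize(features)
```

Here five features collapse into three components. Both tolerances are exactly the kind of user-defined parameters whose influence on the result is rarely assessed.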

Feature prioritization
Feature prioritization involves ranking features according to their perceived, hypothesis-driven relevance within the experimental design [80]. There are two common hypotheses tested during NTA workflows: (1) Intensity-based prioritization, where features with high intensity usually represent higher sample concentrations and/or greater biological or environmental relevance [24,25]; and (2) Statistical feature prioritization, where known sample differences (e.g. treated versus control) are used to select the features that are responsible for describing those differences [98,123,124]. Other possible approaches are based, for example, on toxicity (via effect-directed analysis) [125] or elemental composition [30]. The main advantage of intensity-based prioritization is that it does not require many replicates to be effective. However, this method has a low rate of discovery for structurally unknown compounds that are not present at sufficiently high concentrations in the samples; moreover, in some cases, especially for CECs, a compound can be harmful even at low concentrations [98].
Conversely, while statistical feature prioritization is concentration-agnostic, it requires a comparatively large number of replicates to provide enough statistical power to describe observed differences between experimental groups. With insufficient statistical power, random artefacts such as background noise may be selected as relevant features [126].
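The contrast between the two prioritization strategies can be illustrated with a small sketch. Welch's t statistic is used here purely as an example of a group-difference measure; the studies cited above may rely on other statistics, and no p-values or multiple-testing corrections are computed. The feature names and intensities are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for two groups of replicate intensities."""
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    diff = mean(group_a) - mean(group_b)
    if se == 0.0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def rank_by_intensity(features):
    """Rank feature names by mean intensity over all samples (highest first)."""
    return sorted(features, key=lambda n: -mean(features[n]["treated"] + features[n]["control"]))

def rank_by_statistic(features):
    """Rank feature names by the absolute group difference statistic (largest first)."""
    return sorted(features, key=lambda n: -abs(welch_t(features[n]["treated"], features[n]["control"])))

# Hypothetical replicate intensities (three replicates per group)
features = {
    "feature_A": {"treated": [100.0, 110.0, 105.0], "control": [10.0, 12.0, 11.0]},
    "feature_B": {"treated": [1000.0, 1010.0, 990.0], "control": [1005.0, 995.0, 1000.0]},
}
```

Intensity-based ranking places the abundant but unchanged feature_B first, whereas the statistical ranking places the low-intensity, treatment-responsive feature_A first. With only three replicates per group, such a statistic is itself uncertain, which reflects the statistical power concern raised above.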
Recommendation: New and/or improved (open access) algorithms should be developed for processes that currently still rely mainly on expert knowledge and manual examination, to increase transparency and reproducibility. These algorithms should then be further optimized and evaluated so that the amount of expert knowledge needed is eventually reduced as far as possible. One approach could be more interlaboratory trials, for example focusing only on specific steps of the workflow or on one specific dataset processed by all participants.

Step 5: Feature Identification and Reporting
Feature identification is the process of assigning information collected during componentization to a tentative chemical structure. The goal is unequivocal compound identification, achieved by comparing a likely candidate identification against a known reference standard. The identification process is considered a workflow in itself given the number of steps necessary to generate the final structure. Most identification workflows fall into two categories: (1) 'known unknowns', i.e. compounds with well-documented structural information, such as high-resolution mass spectra, that are recorded in databases; or (2) 'unknown unknowns', i.e. compounds with no known information on their chemical structure [127]. Here we focus on known unknowns, as most of the developments in data processing approaches employ this workflow.

Identification of known unknowns
Identification of known unknowns is usually achieved by spectral library matching and/or comparison with chemical databases [127,128]. Spectral library matching compares the generated components to a commercial or open access spectral library using different library search algorithms [129]. The library search generates a list of candidates, each with an algorithm-specific score, from which the analyst must select the most likely structure. This curation/post-processing is both manual and highly subjective, relying heavily on the expertise and experience of the analyst, alongside the evidence generated during earlier data processing steps (e.g. componentization). For example, multiple collaborative global trials have demonstrated that the experience of the analyst and their knowledge of the sample itself affect the search outcome [129,130].
Additional influential factors are the library search algorithm, the still limited amount of MS2 data in libraries and the quality of both the spectral library and the generated components (i.e. the mass accuracy and number of the generated fragments). As previously discussed, QA/QC measures specifically addressing the componentization step are very limited. QA/QC measures for spectral libraries have been discussed extensively elsewhere [130].
Library search algorithms and the results they produce vary, from simple dot products (i.e. forward and reverse matches between the spectral profiles of sample and library) to the likelihood of correct identification [127], often expressed as "scores" (depending on the method/system used) to which thresholds can be applied (e.g. between 70 and 80% for most of the publications that reported a score in percent). This information, combined with retention indices and physicochemical properties, is used by the analysts during data post-processing to select the most reasonable identification of each feature. The QA/QC of library search algorithms is typically assessed by the accuracy of identification of known components in a sample (e.g. a matrix sample fortified with known concentrations of internal standard(s)) and is expressed as a confidence level of identification (discussed in detail below). However, as previously mentioned, the finite number of known compounds used to assess library search algorithm results may not represent the entirety of the chemical space, particularly for CECs.
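The simplest of these scores, a normalized dot product (cosine similarity), can be sketched as follows. This is a plain, unweighted variant that pairs each query fragment with the closest unused library fragment within an assumed m/z tolerance; commercial and open access search algorithms typically add intensity and mass weighting or separate forward and reverse scores. The function name, tolerance and spectra are hypothetical.

```python
import math

def cosine_score(query, library, mz_tol=0.01):
    """Normalized dot product between two spectra given as lists of (mz, intensity)."""
    used = set()
    dot = 0.0
    for mz_q, int_q in query:
        best = None
        for k, (mz_l, _) in enumerate(library):
            if k in used or abs(mz_q - mz_l) > mz_tol:
                continue
            if best is None or abs(mz_q - mz_l) < abs(mz_q - library[best][0]):
                best = k                      # closest unused library fragment so far
        if best is not None:
            used.add(best)
            dot += int_q * library[best][1]
    norm_q = math.sqrt(sum(i * i for _, i in query))
    norm_l = math.sqrt(sum(i * i for _, i in library))
    return dot / (norm_q * norm_l) if norm_q and norm_l else 0.0

# Hypothetical spectra: (fragment m/z, relative intensity)
measured = [(100.0, 3.0), (200.0, 4.0)]
reference = [(100.0, 3.0)]
```

A score of 0.6 for the example pair would fall below a 70% threshold even though every library fragment is present in the measured spectrum, which shows why the scoring variant and threshold should always be reported.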

Molecular formula assignment
For features and components that cannot be identified using spectral libraries, molecular formula assignment and in-silico fragmentation can aid in subsequent chemical database searches. Molecular formula assignment determines discrete molecular formulas for an accurate mass by applying a set of chemical rules [131,132]. It can be employed to reduce the number of potential candidates given the high number of theoretically possible structures for a mass range of 50–1200 Da. There are two main strategies: (1) the use of public chemical databases (e.g. ChemSpider [133], PubChem [134], CompTox US EPA [135]) for comparison with existing chemical lists; or (2) a combination of user-defined elemental compositions and predefined rules (i.e. the seven golden rules [136,137]).
When using databases, the exact formula assignment is very sensitive to the selected mass tolerance, and the retrieved candidate formulas may be very different [128,130]. When using these tools, the analyst must have a good knowledge of the databases, the potential mass error in the data, and the sample itself to adequately assess the performance of the search. For example, prior knowledge of true positives in the samples (e.g. different pharmaceuticals in wastewater influent) may be an asset for the QA/QC of this step, as it makes it possible to determine the accuracy of the data.
For unknown unknowns, where chemical structures and/or formulas have not previously been reported, chemical database searching is unreliable. Rule-based methods, on the other hand, are not limited to known chemical entities. They are, however, very sensitive to both mass tolerance and the nominated theoretical elemental composition. Additionally, rule-based approaches require more computation power than database methods, particularly for masses >500 Da [138]. Recent developments in these approaches have incorporated the use of fragments and neutral losses in order to increase the confidence in the assigned molecular formulas [132,137,139,140].
As mentioned, for both the database and rule-based approaches, the mass accuracy of the feature is critical for correct molecular formula assignment. In most cases, a difference of <5 ppm between theoretical and observed mass has been used as a criterion for formula assignment [141]. Additionally, in both cases, there is an underlying assumption that the observed mass used for chemical formula assignment belongs to the molecular ion rather than to isotopes, adducts, and/or fragments. As such, a single QA/QC criterion relying on mass accuracy and failing to use all the available componentization information may not sufficiently assess the suitability of the formula assignment.
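The sensitivity of formula assignment to the mass tolerance can be explored with a brute-force sketch. It assumes a neutral monoisotopic mass (i.e. the adduct has already been accounted for), is restricted to C, H, N and O, and applies only one simple valence constraint (H ≤ 2C + N + 2) rather than the full seven golden rules. The function names and element limits are hypothetical.

```python
# Monoisotopic masses (Da)
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.00307401, "O": 15.99491462}

def ppm_error(observed, theoretical):
    """Relative mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6

def formula_string(c, h, n, o):
    """Write element counts as a formula, omitting zero counts and a count of one."""
    parts = []
    for symbol, count in (("C", c), ("H", h), ("N", n), ("O", o)):
        if count:
            parts.append(symbol if count == 1 else f"{symbol}{count}")
    return "".join(parts)

def candidate_formulas(neutral_mass, tol_ppm=5.0, max_n=6, max_o=8):
    """Enumerate CHNO formulas whose monoisotopic mass lies within tol_ppm."""
    hits = []
    for c in range(1, int(neutral_mass // MONO["C"]) + 1):
        for n in range(max_n + 1):
            for o in range(max_o + 1):
                heavy = c * MONO["C"] + n * MONO["N"] + o * MONO["O"]
                h = round((neutral_mass - heavy) / MONO["H"])
                if h < 0 or h > 2 * c + n + 2:    # single illustrative valence rule
                    continue
                err = ppm_error(neutral_mass, heavy + h * MONO["H"])
                if abs(err) <= tol_ppm:
                    hits.append((formula_string(c, h, n, o), round(err, 2)))
    return sorted(hits, key=lambda hit: abs(hit[1]))

# Caffeine (C8H10N4O2), neutral monoisotopic mass
strict = candidate_formulas(194.080376, tol_ppm=5.0)
loose = candidate_formulas(194.080376, tol_ppm=20.0)
```

Comparing `strict` with `loose` shows how the candidate list can only grow as the tolerance is relaxed; spiked compounds with known formulas offer a direct way to check whether the chosen tolerance retains the true positives.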

In-silico fragmentation
In-silico fragmentation uses multiple algorithms with different levels of complexity, from simple bond theory to machine learning and quantum chemistry, to predict the fragmentation pattern from a chemical structure [142]. Regardless of the algorithm used, the analyst must generate a list of potential structures by matching either the mass of the feature or its assigned molecular formula to a chemical database. Next, the theoretical and experimental spectra are compared, and the similarity is indicated by a score. These scores may incorporate additional information such as the number of available literature references and/or previous measurements to give additional weight to the selection [143,144]. At this stage the analyst must curate the ranked list of candidates based on their scores, employing expert knowledge and any additional information such as physicochemical properties or knowledge of the sample matrix to make the final selection. These post-processing steps are highly subjective, yet rarely reported in the literature. Instead, what is typically reported is the structure of the potential candidate, the predicted spectrum, the measured spectrum, and the level of confidence in the identification [26].

Communication of confidence
There are multiple scales for the communication of confidence of identified features [26]. In the environmental sciences, the "Schymanski scale", which is the most commonly used, has five levels: level 1 is the highest confidence and describes an unequivocally identified feature, and level 5 is the lowest, describing only a measured feature with its accurate mass [26]. These levels were one of the major steps towards increasing comparability and transparency of identified CECs in complex environmental samples. However, results of multiple interlaboratory collaborative trials in the past decade suggest that the collection of evidence to assign these levels of confidence remains subjective [145]. This subjectivity is clearly observable in the number of fragments (1–5) used as evidence for stating the level of confidence in a candidate structure [26]. Depending on the structure in discussion, these 1 to 5 fragments (excluding the molecular ion) could represent a ~10–100% match of the experimental spectra (e.g. 2b,3a-Dihydroxy-5b-cholan-24-oic, MassBank ACCESSION: NU000383).
Another example of subjective use of evidence for confidence reporting is the use of predicted retention time and/or ion mobility [146–153]. Predicted retention times are calculated using quantitative structure-activity relationships (QSARs), where a set of calibration standards is injected alongside the sample, and retention indices are calculated from the relative retention times [154]. This can be useful to refine a list of candidate compound identifications. However, currently the best retention prediction methods have approximately 1 min of uncertainty in the result [147,149]; depending on the run time, this may represent a significant fraction of the gradient. Additionally, this information would be unhelpful for differentiating compounds with similar elution profiles. Similar conclusions were reached by Dodds et al., 2017 regarding ion mobility, i.e. even state-of-the-art ion mobility did not provide enough resolving power to separate very similar structures in NTA because the parameters cannot be optimized [155]. If these techniques are developed further, they could become more routinely used tools.
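How a prediction uncertainty of about 1 min translates into candidate filtering can be sketched as follows. The candidate names, predicted retention times and run time are hypothetical, and the ±1 min window is taken from the uncertainty quoted above rather than from any specific prediction model.

```python
def filter_by_predicted_rt(candidates, observed_rt_min, uncertainty_min=1.0):
    """Keep candidates whose predicted retention time lies within the uncertainty window."""
    return [name for name, predicted in candidates
            if abs(predicted - observed_rt_min) <= uncertainty_min]

def window_fraction(uncertainty_min, run_time_min):
    """Fraction of the chromatographic run covered by the +/- uncertainty window."""
    return 2 * uncertainty_min / run_time_min

# Hypothetical candidate structures with predicted retention times (min)
candidates = [("candidate A", 5.2), ("candidate B", 5.9), ("candidate C", 9.4)]
kept = filter_by_predicted_rt(candidates, observed_rt_min=5.5)
```

In this example only one of three candidates is excluded, and for an assumed 15 min run the window covers roughly 13% of the run time, so closely eluting candidates cannot be distinguished on this evidence alone.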
Recommendation: There is a clear need for detailed guidelines for objective assignment of confidence levels associated with an identified feature. In the metabolomics and proteomics communities this is achieved by archiving a description of the experimental and processing steps alongside the raw data in public repositories for the community to test [145], but this approach has not yet been adopted by the environmental sciences. Furthermore, it is necessary to further develop existing and new tools, algorithms and databases to facilitate the identification of compounds. Additionally, the use of statistical tools such as Monte Carlo simulation and receiver operating characteristic (ROC) analysis [156,157] for objective assessment/optimization of the identification parameters is needed.
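As an illustration of how receiver operating characteristic analysis could support an objective choice of score threshold, the sketch below computes an ROC curve, its area under the curve, and the threshold maximizing Youden's J (true positive rate minus false positive rate). It assumes a validation set in which the correctness of each identification is known, e.g. from spiked standards, and that all scores are distinct; the scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return ROC points (FPR, TPR, threshold), lowering the threshold one score at a time."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both correct and incorrect identifications are required")
    tp = fp = 0
    points = [(0.0, 0.0, None)]
    for score, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives, score))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0, _), (x1, y1, _) in zip(points, points[1:]))

def best_threshold(points):
    """Score threshold that maximizes Youden's J = TPR - FPR."""
    return max(points[1:], key=lambda p: p[1] - p[0])[2]

# Hypothetical library match scores; label 1 = identification confirmed by a standard
scores = [0.95, 0.90, 0.85, 0.80, 0.70, 0.60, 0.40]
labels = [1, 1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
```

For this small example the area under the curve is 11/12 and the selected threshold is 0.85; with real data the validation compounds would need to cover the chemical space of interest for such a threshold to be meaningful.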

Conclusions and recommendations for future non-target analyses
Thus far we have discussed the gaps in QA/QC procedures for each stage of an NTA workflow. Although several commonly applied methods for quality assurance exist, most of them address the problem of false positives (Table 2). However, except for manual examination there are not yet many procedures in place to reduce the number of/check for potential false negatives. Internal standards might provide a starting point for this issue but cannot give a comprehensive view of all the potential chemicals in an environmental sample.
We acknowledge that many of these stages are interconnected and thus independent assessment of one without the other is challenging. However, we have identified QA/QC opportunities where small improvements to the NTA workflow have the potential to significantly improve the rigour of the contemporary application of NTA:
1. Clear guidelines for the minimum number of internal standards to be used for sample matrices with different levels of complexity. This is of utmost importance for every step of the workflow, including extraction efficiency, suitability of the chromatographic separation, ionization/matrix effects during data acquisition, and evaluation of data processing methods. ISs should cover the entirety of the investigated chemical space, and each IS should be successfully identified by the end of the workflow. If this does not occur, missed identifications can assist in troubleshooting processing steps, which can either be modified or the limitations justified in the final report. One approach could be to internationally define and harmonize groups of ISs by matrix type, general goal of the research, etc. These groups could then be used and reported for every analysis, even (or especially) if some of them are not detected (see (2) for further discussion).
2. A clear understanding of the fraction of the chemical space explored, every time an NTA workflow is applied, and transparent reporting of the limitations therein. A combination of as many as possible relatively newly available tools (e.g. advanced statistical tools, open access data repositories and retention time prediction models) and existing approaches could be used to assess the suitability of a specific NTA workflow for specific chemical classes. Such tools could drastically reduce the rates of false detection and identification. For example, retention time prediction models may help reduce the number of possible candidate identifications. To assess and improve the suitability of the actual data processing workflow, one option could be an open access, already well-defined dataset, to be used by researchers.
3. Development of automated algorithms to use multiple (different) blanks simultaneously for blank/noise subtraction. While the final choice of blanks depends on the experimental design, the use of harmonized methods/algorithms for blank removal would make reporting of the removal process easier (i.e. which blanks and algorithm have been used). Also, being able to easily include more and especially different types of blanks (field blank, matrix blank, instrument blank, etc.) could potentially reduce the number of false positive results even further.
4. The development and wide uptake of open access data processing tools for transparent and reproducible results. While large companies, such as instrument vendors, have the resources to develop efficient and user-friendly data processing software, the proprietary nature of these commercial products can inhibit the complete and smooth transfer of the NTA workflow between different laboratories. Although sometimes difficult to use, open access tools with transparent descriptions of algorithms and parameters may facilitate the broader sharing of data and workflows within different communities, while removing cost-of-use barriers to access. Dedicated effort towards improving the graphical user interface and all-round user-friendliness of open access tools may help popularize their uptake. Furthermore, developing robust tools for processes which are not yet automated would reduce the individual influences of the analyst (i.e. subjectivity).
5. Expansion of open access spectral libraries. These enable researchers to potentially obtain more transparent and comparable results, while also giving reviewers and other interested parties the possibility to better verify results, which is not always possible with vendor software, as discussed for other open access tools before (see above). However, it is critical that these libraries are curated properly (e.g. all necessary metadata associated with an entry is stated, naming conventions are used correctly, quality of the spectra, etc.).
6. The use of pooled samples for method development. A pooled sample, created by combining small aliquots of each sample in an experiment, represents an "average" sample matrix. This pooled sample can then be used to assess and optimize different parameters during sample extraction, data acquisition and data processing, and should ideally be reported alongside unknown samples. Similar approaches are already commonly used in metabolomics (pooled biological quality controls (PBQC)) and food analysis and could be easily transferred to environmental NTA [63].
7. Detailed and clear guidelines for the reporting of the identification of unknown features and the assessment of the level of confidence are needed for NTA to be widely accepted as a powerful means for comprehensive chemical characterization of complex samples. For example, a set of processing and statistical parameters could be defined that should be reported every time to demonstrate the suitability and reliability of a method, which is not yet done consistently. Additionally, expanding the current guidelines for reporting levels of confidence for identification [26] may increase their applicability, as they were originally only developed as a "generic" approach to be specified on a case-by-case basis, for example as done for multi-laboratory experiments conducted by Letzel et al. [28].
Adoption of these preliminary QA/QC guidelines could further facilitate the large-scale implementation of environmental NTA, including as routine analyses within regulatory frameworks.

Declaration of competing interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: ALH is an employee of AB SCIEX Australia.

Fig. 1. Illustration of a representative chemical space containing potential compounds of investigation based on molecular weight (MW; x-axis) and polarity (logKow; y-axis) of 699013 organic compound entries (mass < 1200 Da) from the "Distributed Structure-Searchable Toxicity" (DSSTox) Database (grey dots). For comparison, included as an example are the range of pharmaceuticals and personal care products (PPCPs) routinely targeted in our lab (green dots; n = 72) and the PPCP labelled internal standards used (blue dots; n = 27) [14].

Fig. 3. General points of uncertainty and considerations that should be addressed by quality assurance and quality control measures during sample preparation steps of a non-target analysis.

Fig. 4. Points of uncertainty that quality assurance and quality control measures should address during the application of chromatographic separation for non-target analysis.

Fig. 5. Points of uncertainty that quality assurance and quality control measures should address during mass spectrometric data acquisition for non-target analysis.

Fig. 6. Points of uncertainty and considerations that quality assurance and quality control measures should address during the data processing and identification for non-target analysis.

Table 2
Overview of existing quality assurance (QA) measures, the workflow step they are applied to, the issues which they address and their limitations.
a The lack of stated limitation does not indicate that the measure is flawless, but rather that the measure is likely to adequately address its respective issue.
B. Schulze, Y. Jeon, S. Kaserzon et al., Trends in Analytical Chemistry 133 (2020) 116063