An analysis of the limitations and uncertainties of in vivo developmental neurotoxicity testing and assessment to identify the potential for alternative approaches

Limitations of regulatory in vivo developmental neurotoxicity (DNT) testing and assessment are well known, such as the 3Rs conflict, low throughput, high costs, the high level of specific expertise needed and the lack of deeper mechanistic information. Moreover, the variability of standard in vivo DNT data and the extrapolation from experimental animals to real-life human situations are uncertain. Here, knowledge about these limitations and uncertainties is systematically summarized using a tabular OECD format. We also outline a hypothesis of how alternative, fit-for-purpose Integrated Approaches to Testing and Assessment (IATAs) for DNT could improve on current standard animal testing: relative gains in 3Rs compliance, reduced costs, higher throughput, improved basic study design, higher standardization of testing and assessment, and validation without 3Rs conflict, all increasing the availability and reliability of DNT data. This could allow a more reliable comparative toxicity assessment over a larger proportion of chemicals within our global environment. The use of early, mechanistic, sensitive indicators for potential DNT could better support human safety assessment and mixture extrapolation. Combined with kinetic modelling, these indicators could ideally provide, possibly depending on context, at least the same level of human health protection. Such new approaches could also lead to a new mechanistic understanding of chemical safety, permitting determination of a dose that is likely not to trigger defined toxicity traits or pathways, rather than a dose not causing the current apical organism endpoints. This manuscript is intended to motivate and guide the development of new alternative methods for IATAs with diverse applications and to support decision-making for their regulatory acceptance.


Alternative approaches for DNT testing: need, availability and concepts for use
human DNT effects [3], [Toxicology and Applied Pharmacology DNT special issue 1 ].
Given the complexity of the biological processes underlying developmental neurotoxicity, it is, however, unlikely that any of these alternative methods will provide a standalone solution for DNT testing. It is therefore necessary to develop Integrated Approaches to Testing and Assessment (IATA), based on the combined use of multiple sources of information. Accordingly, available in vitro assays should be incorporated into an IATA together with other sources of information, including mechanistic knowledge built into Adverse Outcome Pathways (AOPs) [4], kinetic quantitative in vitro to in vivo extrapolation models (QIVIVE) and other in silico methods such as read-across and QSARs, but also non-mammalian 3Rs models (e.g. zebrafish embryos) as well as existing animal and human data. Such IATA development should be driven by a concise problem formulation, enabling fit-for-purpose data to answer the different regulatory needs, inter alia hazard identification/characterization or risk assessment [5].
Recently, this approach has been taken on board by the Organisation for Economic Co-operation and Development (OECD), which aims to develop an OECD Guidance Document on in vitro methods for DNT testing in collaboration with multiple stakeholders, including academic scientists, industry representatives and regulators from different countries and jurisdictions [6]. Globally, several authorities/agencies and projects have committed significant investments in data generation to support this guidance development. Roughly 120 compounds (pesticides, drugs, environmental chemicals etc.) are being run through an in vitro battery in projects supported by the European Food Safety Authority, the Danish Environmental Protection Agency (EPA), the US-EPA, the US NTP/NIEHS/NIH and the H2020-funded project "EU-Tox-Risk", and on this basis IATA case studies will be developed for the purpose of the guidance.
It is proposed that a battery of in vitro test methods should preferably be based on mixed neuronal/glial cultures derived from human induced pluripotent stem cells, since they permit evaluation of chemical impacts on critical neurodevelopmental processes, mimicking different stages of human brain development [1]. Moreover, the readiness of the existing in vitro DNT assays for different regulatory purposes has recently been evaluated following thirteen (semi)-quantitative criteria established by DNT experts [5]. Sixteen in vitro assays and zebrafish behavioural tests at early development (0-5 days post fertilisation) were evaluated with respect to various regulatory uses, and the scoring results suggest that several assays are currently at high readiness levels.

Transparency in scientific uncertainties is key for responsible decision making
In all fields of applied science, transparency about uncertainties in data and knowledge is key for responsible decision-making [7]. This is especially important since assessment and decision-making are usually carried out by different institutions. It guarantees the scientific integrity of the assessment on the one hand and well-informed, independent decision-making on the other hand. The latter usually needs to integrate the scientific information on risk metrics with various additional, socially relevant aspects. This principle has been increasingly acknowledged in recent years for the risk assessment and risk management of chemicals. Guidance and tools have been developed for transparent characterization of the uncertainty of chemical risk metrics, such as ratios between human exposure and human limit values [8,9].
Similarly, the need for a transparent analysis of uncertainties of the performance metrics for alternative methods, i.e. reliability and relevance, within their validation process has recently gained recognition. Specifically, the uncertainty characterization of standard reference methods for the validation of alternative methods facilitated the acceptance of alternative methods in the field of human regulatory toxicology (see following paragraphs). The types of analysis done and their consequences may be inspiring also for the DNT field.
For instance, information on the reproducibility of Test Guidelines (TGs) in the field of eye-irritation/damage and skin sensitization sets some limits for expectable correlation between data from alternative methods and from animal reference methods [10][11][12].
Others analyzed how the experimental variability of acute rodent lethal dose (LD50) data translates to variability of Globally Harmonized System (GHS) classification [13], and it was highlighted that, from a scientific perspective, a borderline range between GHS potency categories should be established. Test results falling into this borderline range should be considered uncertain due to the limited reliability of any test result [14]. DNT data may lead to classification for reproductive toxicity, and 10 % effective dose (ED10) values may inform the use of specific concentration limits for mixture classification [15].
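The borderline-range idea for acute toxicity can be illustrated with a minimal Python sketch. The category cut-offs are the GHS acute oral toxicity limits (Category 5 is optional in GHS); the function names and the two-fold uncertainty factor are hypothetical choices for illustration only and are not taken from [13] or [14].

```python
# GHS acute oral toxicity: upper LD50 limits (mg/kg body weight) per category.
GHS_ORAL_CUTOFFS = [(1, 5.0), (2, 50.0), (3, 300.0), (4, 2000.0), (5, 5000.0)]


def ghs_category(ld50_mg_per_kg):
    """Return the GHS acute oral category, or None if above all cut-offs."""
    for category, upper_limit in GHS_ORAL_CUTOFFS:
        if ld50_mg_per_kg <= upper_limit:
            return category
    return None


def classify_with_borderline(ld50_mg_per_kg, fold_uncertainty=2.0):
    """Return (category, is_borderline).

    fold_uncertainty is a HYPOTHETICAL factor describing how far a repeat
    study might plausibly deviate from the reported LD50. A result is
    flagged as borderline if that range crosses a category limit.
    """
    low = ghs_category(ld50_mg_per_kg / fold_uncertainty)
    high = ghs_category(ld50_mg_per_kg * fold_uncertainty)
    return ghs_category(ld50_mg_per_kg), low != high


if __name__ == "__main__":
    for ld50 in (100.0, 120.0, 6000.0):
        print(ld50, classify_with_borderline(ld50))
```

Under these assumptions an LD50 of 100 mg/kg falls into Category 3 but is flagged as borderline, because half that value would already fall into Category 2, whereas 120 mg/kg is not flagged.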
Uncertainties of the rodent carcinogenicity study-based assessment were summarized, including quantitative and qualitative information. It was concluded inter alia that a fully quantitative validation is not feasible due to the high complexity and potential variability in data generation and assessment and this represents an uncertainty of the reference method as such [16]. Nevertheless, the applied assessment scheme allows a semi-quantitative and qualitative comparison of the limitations and uncertainties of current standards versus new alternative approaches. This may support a best-possible informed decision on the acceptability of the new approaches.
Information on the limitations and uncertainty of standard in vivo DNT testing has also been published, e.g. [5,17,18]. However, a comprehensive summary of all potential limitations and uncertainties, similar to the work for carcinogenicity [16], is still missing. Consequently, such a summary is provided here for the current regulatory standard in vivo DNT testing and assessment, following the same OECD template [19,20].

Building the hypothesis: limitations and uncertainties of in vivo DNT testing and assessment can be reduced with alternative approaches
In vivo DNT testing and assessment has a number of limitations and uncertainties and these are summarized here. Furthermore, we indicate which of these limitations and uncertainties may be reduced with alternative approaches including in vitro testing and in silico modelling.
For a top-level overview, the main aspects of this discussion are illustrated in Fig. 1 and summarized in Section 2. The figure and summary are intended to provide a structure that could support a targeted discussion by regulatory experts of any aspect that may be relevant for the development of DNT IATA(s) for diverse regulatory purposes. On the one hand, the results of such targeted discussion may motivate and guide the development of new alternative approaches to improve the current methodological toolbox. On the other hand, it should support regulatory decision-makers in agreeing what performance of IATAs including alternative approaches may represent an overall improvement for the various regulatory needs.
In the supplement Table S1, information about the limitations and uncertainties of the use of in vivo standard DNT TGs is presented within an OECD standard tabular format. This format was originally developed to characterize alternative methods as individual information sources to be used within IATAs [20]. The tabular summary was applied to carefully consider all potential limitations and uncertainties of the use of the in vivo DNT TGs and to develop the summarizing Fig. 1 and text for Sections 2.1-2.3 in this manuscript. Table S1 may be further amended and refined, as far as useful, within the ongoing regulatory experts' project on DNT IATA(s) development. Applying the same systematic characterization scheme for both, the use of current DNT in vivo standards and the use of alternative methods, may facilitate decision making on the acceptability of fit-for-purpose IATA(s) for various regulatory applications. An internationally reviewed/revised version of the S1 table could also be included as an annex to the OECD IATA guidance, which is work in progress. It could serve as a reference for some specific discussions and sections within this OECD guidance.

Regulatory limitations of standard in vivo DNT testing & assessment
The standard in vivo DNT testing and assessment currently serves as a screening method for the evaluation of many chemicals with a broad set of potentially relevant endpoints. However, this regulatory purpose is limited by several aspects: the regulatory limitations of standard rodent in vivo DNT testing and assessment relate to its low throughput, requiring more than one year for one substance, the high costs, i.e. above 1 million Euro [21], and the conflict with the global 3Rs goals due to the use of vertebrates (Table S1, lines 5.7.).
Moreover, a high level of specific expertise in conduct, assessment and DNT data interpretation is needed, requiring continuous, species- and method-specific (re)training. This limits the availability of experienced experts, especially since, due to the many other limitations, the frequency of in vivo DNT testing and assessment is low [5,22,23], creating a negative feedback circuit (Table S1, lines 5.7.).
Furthermore, due to the 3Rs conflict [24], practicalities and costs, the basic study design is limited in terms of feasible animal numbers. This restricts the minimal effect size that can be detected as statistically significant. Minimal detectable effect size estimates are available, but these vary considerably (and include values beyond 100 % effect size; [25][26][27][28], Table S1 line 5.4.2.3). Therefore, as is true for all in vivo assessments, standard in vivo DNT data assessment requires consideration of biological relevance in addition to statistical significance, and this increases uncertainty in the data assessment due to complexity, e.g. [29,30]. Moreover, negative historical controls need to be considered in this context, but this is hampered by the low frequency of testing, the heterogeneity of test designs and the large flexibility in data interpretation. The limited animal number may also lead to testing of high doses, findings of indirect DNT effects and missed non-monotonic dose-response relationships [25,31,32]. Further uncertainty relates to the lack of study-internal positive controls for study validity assessment [33] (Table S1, lines 5.7.).
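How animal numbers restrict the minimal detectable effect size can be illustrated with a minimal Python sketch. It assumes a simple two-group comparison with equal group sizes, independent observations and a normal approximation, so that the minimal detectable difference, expressed in percent of the control mean, is (z_{1-alpha/2} + z_{power}) * sqrt(2/n) * CV. The function name and the coefficient of variation (CV) used in the example are hypothetical and are not taken from [25][26][27][28]; real DNT studies use the litter as the statistical unit and more complex designs.

```python
from math import sqrt
from statistics import NormalDist


def minimal_detectable_effect_percent(n_per_group, cv_percent,
                                      alpha=0.05, power=0.8):
    """Approximate minimal detectable difference between two group means,
    in percent of the control mean (two-sided test, normal approximation).

    cv_percent is the coefficient of variation of the endpoint; the values
    used below are HYPOTHETICAL and serve only as an illustration.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2.0 / n_per_group) * cv_percent


if __name__ == "__main__":
    for n in (10, 20, 40):
        mde = minimal_detectable_effect_percent(n, cv_percent=40.0)
        print(f"n = {n:2d} per group: detectable effect of about {mde:.0f} %")
```

Under these assumptions the detectable effect shrinks only with the square root of the group size, so halving it requires four times as many animals.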
Last but not least, current standard in vivo DNT testing does not provide mechanistic information that would support the identification of early sensitive indicators of potential DNT effects and be suitable for mixture toxicity investigations [34,35] (Table S1, lines 5.7.).

Uncertainties in variability of in vivo DNT testing and assessment
There is uncertainty related to knowledge about the variability of standard in vivo DNT data. Replicate testing and assessment is available for about 20 chemicals [36]. Estimates of variability differ depending on the chemical, the endpoints and the approach used to generate the estimate (available estimates include similarity proportions below 0.5 out of 1 [37,38]). Of course, such estimates cannot be comprehensive for all DNT endpoints and related methods. Published, non-DNT-related estimates for the reliability of animal test data appear to be in the range of 60-70 % for a three-category GHS classification and 70-80 % for a two-category GHS classification (for skin sensitization and eye irritation/damage [10][11][12][39][40][41]) or about 57 % for carcinogenicity [42]. No internationally agreed quantitative variability estimate is available for DNT [36]. Can we, by default, expect lower variability for the very complex DNT study endpoints (Table S1, lines 5.8.)?
However, experimental variability is expected. Behavioral endpoints are reported to be sensitive to several experimental variables (e.g. for motor activity, the shape, size, movement detection system and related activity metrics and data processing). Endpoints are also sensitive to potential external influences to the experiment (e.g. noise, light, odor, handling, time of observation relative to light-dark cycle and dosing, age, test history, stress), to individual animal variability and to some subjective measurements (e.g. clinical observations, histopathology). Some guidance is available to limit some of this variability (Table S1 lines 5.3.) [25].
There are some differences between the in vivo DNT test guidelines. For example, in contrast to the OECD TG 426 [43], the OECD TG 443 [44] includes pre-mating exposure, but uses fewer pups for testing and includes cognitive learning and memory tests only if triggered based on available data. Therefore, OECD TG 443 may be more comprehensive in terms of the exposure regime, but less comprehensive in terms of potentially relevant DNT endpoints covered and overall conclusions may differ between tests from different guidelines. Moreover standard test guidelines allow a great deal of flexibility in terms of endpoints and related methods, exposure time, exposure route and top dose selection. This allows the testing laboratory or registrant to use the endpoints most appropriate for the specific chemical, the related toxicological knowledge and regulatory need. However, in regulatory practice, information to select the best study design may often be very limited and different study directors may decide for different study designs. An informed hypothesis-based targeted testing is rarely possible. Therefore, the guideline inherent flexibility limits comparisons of toxicity between studies and chemicals (Table S1, lines 2.1.).
It is a highly complex task to select a specific study design in terms of animal strain, endpoints and related methods, exposure periods and exposure routes as well as top doses (Table S1 lines 5.3.). In addition, data analysis and interpretation require consideration of various main effects, interactions, data dependencies, uncertainties and statistical methods, and the integration of statistical significance and biological relevance. Often data from more than one study need to be assessed, and these may be conflicting (Table S1 lines 5.4.). Taken together, this is a complex task and different expert groups may come to different, scientifically legitimate conclusions. Data informing on a realistic probability for such situations of ambiguity are available for non-DNT-related regulatory toxicology assessments [29,30,45] and include a chance of 40 % and more for different expert assessments (Table S1 line 4.12.11).

Uncertainties in human hazard and risk extrapolation
Some uncertainty also relates to the extrapolation from standard rodent in vivo DNT testing to human hazard and risk estimates. For about 20 chemicals, positive human and at least high-dose positive standard animal DNT data are available, indicating the potential to predict human DNT with standard animal approaches [23,46,47]. Of course, this means neither that all chemicals with a human DNT hazard can be identified with standard animal DNT testing and assessment, nor that every identified animal DNT hazard is relevant for humans (see Table S1 lines 5.9.).
Quantitative extrapolation from the experimental Benchmark Dose (BMD) or No-Observed-Adverse-Effect Level (NOAEL) to a human reference value may contain high uncertainty, especially in the absence of specific kinetic and metabolism (ADME) data (according to [46], up to a factor of 10,000; see Table S1, line 5.9.1.2). It is also noted that estimating the exposure and ADME in pups and extrapolating this to the human situation may contain very significant uncertainties [33,48,49] (Table S1).
Applying the usual pragmatic standard assessment factors (often 100) to BMDs or NOAELs results in human reference values with an unknown protection level (unknown in terms of the probability for population fractions under risk for the critical effect and the related uncertainty). However, using data-based probabilistic extrapolation models, human reference values are estimated with a 5th/95th percentile range that spans at least a factor of 100 for a protection target of ≤ 1 % population under risk. This uncertainty range becomes unknown for higher protection targets of e.g. ≤ 0.0001 % population under risk, due to limitations in the data available for modelling. It is also noted that the current database for this probabilistic model does not include DNT-specific data [9]. Furthermore, there are qualitative differences between species that are not captured by the probabilistic assessment factors (as summarized in [50], in about 23 % of data sets; see Table S1 line 5.9.1.3.).
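The difference between a fixed assessment factor and a probabilistic range can be illustrated with a minimal Python sketch. It is not the model of [9]: it simply assumes that each extrapolation step is an independent lognormal factor described by a median and a 95th percentile, and all function names and numerical values are hypothetical.

```python
from math import log10, sqrt
from statistics import NormalDist


def deterministic_reference_value(pod, assessment_factor=100.0):
    """Point of departure (BMD or NOAEL) divided by a fixed factor."""
    return pod / assessment_factor


def probabilistic_reference_range(pod, factors, coverage=0.90):
    """Lower and upper bound of the reference value for the given coverage.

    factors: list of (median, p95_over_median) tuples, one per extrapolation
    step, each treated as an independent lognormal factor. The values passed
    in the example below are HYPOTHETICAL, not those of any guidance.
    """
    normal = NormalDist()
    z95 = normal.inv_cdf(0.95)
    # On the log10 scale the medians add up and the variances add up.
    mu = sum(log10(median) for median, _ in factors)
    sigma = sqrt(sum((log10(ratio) / z95) ** 2 for _, ratio in factors))
    z = normal.inv_cdf(0.5 + coverage / 2)
    lower = pod / 10 ** (mu + z * sigma)
    upper = pod / 10 ** (mu - z * sigma)
    return lower, upper


if __name__ == "__main__":
    pod = 10.0  # hypothetical point of departure, mg/kg body weight/day
    print(deterministic_reference_value(pod))
    # Hypothetical inter-species and intra-species factors.
    print(probabilistic_reference_range(pod, [(4.0, 5.0), (3.0, 6.0)]))
```

The deterministic value is a single number with no statement about coverage, whereas the probabilistic result makes the width of the 5th/95th percentile range explicit.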
Extrapolation uncertainties may also relate to the fact that primary DNT effects sometimes cannot be distinguished from secondary, nonspecific consequences of maternal and other toxic effects (e.g. acute neurotoxicity, strongly reduced feed/water intake, or general systemic toxicity leading to a strong maternal body weight decrease, reduced litter size and weight, and slower postnatal development). In this case, the DNT effects are pragmatically considered to be of concern [25], though their relevance for the human hazard to be expected in real-world situations remains uncertain (see Table S1 lines 2.2.3).
In qualitative terms, extrapolation uncertainties relate to rat-to-human differences in brain development (morphology and functional anatomy). While the general processes and sequence in brain development are conserved between species, the major brain growth spurt is postnatal in the rat, but prenatal in humans. Thus the exposure route and the related kinetics and metabolism differ for that critical period, i.e. milk and/or feed for rats versus placenta for humans [33,49]. It is reported that the similarity of the functional nervous system is higher for reflexes but lower for learning and memory [25], and other higher cerebral tasks (like reasoning, reading, planning, organizing) and complex nervous system functions (like advanced sporting or artistic activity) cannot be tested in animals as such. Given the complexity of neurodevelopmental disorders, e.g. autism, the prevalence of which has increased dramatically in recent years [51,52], the use of animals causes significant uncertainties, since they do not develop the full spectrum of features characteristic of such diseases [53,54]. Some differences were also reported at the level of molecular signaling and cell differentiation [55][56][57][58]. Currently, it remains an unanswered question how comprehensive the current standard animal DNT testing and assessment is for relevant human neuronal functionalities (Table S1 line 5.1.1.).
Last, but not least, DNT may be the result of many causes, i.e. a combination of epigenetic and genetic background, socio-economic status, diet, life style, stress and co-exposure including environmental contaminants and drugs or maternal infection and viruses, for instance [59]. The dose at which chemicals may contribute to DNT will depend on these other real-world modulators, which currently cannot be covered by any testing (Table S1 line 4.12.1).

Building the hypothesis: how could alternative approaches reduce current regulatory limitations and uncertainties?
Recently, alternative models for DNT testing and assessment have become available. For instance, mixed neuronal/glial cultures derived from human induced pluripotent stem cells (hiPSCs) can be used, as they allow evaluation of chemical impacts on key neurodevelopmental processes by reproducing different windows of exposure during human brain development [1]. If these processes (e.g. cell proliferation, migration, synaptogenesis, neuronal network formation and function) are impaired as a result of chemical exposure, they can be assessed in a quantitative manner using in vitro assays and serve as reliable readouts for DNT effects. Recently, the European Food Safety Authority (EFSA) has published a detailed report on the evaluation of the currently available DNT in vitro test methods, including human models, concluding that a variety of in vitro methods covering early and late stages of neurodevelopment are already available and could be used to predict DNT effects [60] (for a summary of methods and concepts, see also Section 1.1).

Potentially reduced limitations and uncertainties?
Such alternative approaches could reduce several of the current regulatory limitations, owing to their relevance to human biology, better mechanistic understanding of toxicity, potential for much higher throughput, strongly reduced costs and coherence with the global 3Rs goals. This could make it possible to assess many more chemicals, possibly also nanomaterials, mixtures and environmental media, and to assess potentially less problematic alternatives, including green chemistry, early in chemical development [61]. Toxicological data may also be updated more frequently by retesting chemicals in line with scientific progress.
With such an expansion of toxicological data, overall human safety could be increased. Moreover, with alternative approaches, a higher number of replicates, broader concentration ranges and study-internal positive controls could be tested without 3Rs conflict. In addition, alternative testing and assessment approaches are usually standardized to a higher degree. Altogether, this could reduce the variability of data and the "biodiversity" of expert assessments. A parallel assessment of statistical and biological relevance a posteriori to testing may not be necessary for alternative approaches. Due to the absence of a 3Rs conflict with retesting, the improved standardization may also facilitate better validation of the reproducibility of specific study designs. Higher data reliability could also increase the sensitivity of the method and thereby favor the distinction between primary and secondary DNT effects. Because of all these advances, the comparability of data generated for similar or different chemicals could be increased, and this would translate into more reliable toxicity comparisons between chemicals, more reliable classifications and more reliable human reference values. This would be an advantage for global trade and for the global regulation of the more than 100,000 chemicals on the market.

Potentially new uncertainties?
Alternative approaches may test for early, sensitive, mechanistic indicators of toxicity. It may be that this introduces some new uncertainties. However, these potentially new uncertainties are conceptually not different from current in vivo based testing and assessment (Table 1).
In summary, effects at the molecular and cellular level may be the earliest testable biomarkers or key events that may or may not progress to an adverse outcome at the organism and population level, depending on many biological factors that cannot be tested with any in vivo or in vitro system (Fig. 2). In this context, the AOP framework has been recognized in regulatory science as an efficient and effective tool for capturing existing knowledge describing the linkage between mechanistic information at these different biological levels and adverse outcome (AO) of regulatory relevance.
From that perspective, alternative approaches could allow the most precautionary testing approaches that could also be calibrated to a desired sensitivity. Moreover, the increased data reliability and the mechanistic information gained from alternative approaches could provide several other advantages: The distinction between primary and secondary DNT effects could be better supported. Extrapolation from individual chemical testing to mixture toxicity assessment could become better informed.
Recently, the AOP concept has been applied to the assessment of developmental neurotoxicity induced by mixtures of Persistent Organic Pollutants (POPs) [35] and mixtures of metals [66]. These studies showed that common key events identified within a network of AOPs relevant to DNT can serve as reliable and robust anchors for the selection of in vitro assays for testing DNT triggered not only by single chemicals but also by mixtures. Furthermore, mechanistic knowledge built into DNT AOPs facilitates data interpretation and its possible application for regulatory purposes. Within the currently available DNT AOPs, the most frequent AO is defined as cognitive damage or learning and memory impairment in children [67].
On the one hand, uncertainties in extrapolation from animal in vivo endpoints and related BMDs or NOAELs/LOAELs to human effects and reference doses (Section 2.3) are well recognized. On the other hand, uncertainties in extrapolation from the endpoints of alternative methods to human organism-level effects are also obvious. Considering the potential for high precaution and calibration of endpoints from alternative methods, could at least a similar level of human health protection, possibly depending on context, be reached with alternative approaches?
Recently a few AOPs have been developed in which impairment of learning and memory in children has been defined as the adverse outcome (AO). These AOPs facilitate mechanistic understanding of disturbed signaling pathways and/or neurodevelopmental processes such as neurite outgrowth, neuronal/glial differentiation, synaptogenesis, and neuronal network formation and function, defined as Key Events (KEs) involved in the impairment of cognitive functions. These KEs can serve as anchors for in vitro assays, making it possible to characterize in a quantitative manner which chemicals (at which concentration and time of exposure) could be considered as potential triggers of a cascade of events leading to cognitive damage in children [35]. Such in vitro assays are, first of all, suitable for mapping a chemical-induced disturbance of signaling pathways at the cellular level and can nowadays be performed using human models (e.g. mixed neuronal/glial cultures derived from induced pluripotent stem cells), avoiding species extrapolation [1].
Current perspectives for the development of an IATA foresee a tiered strategy. After evaluating available DNT evidence, alternative approaches may be included within tier 1 and 2 assessments. Depending on the resulting data and the regulatory context, this may be followed up by targeted rodent testing [1,5], if necessary. Of course, the reliability and relevance of the alternative tier 1 and 2 testing need to be sufficiently high, i.e. at least significantly above 50 %, in order to justify their use. Considering the uncertainties summarized above for the standard in vivo approach (Fig. 1, Sections 2.1-2.3 and Table S1), the extra margin of certainty achievable with additional tier 3 in vivo testing still needs to be evaluated. However, with further progress of this specific regulatory science, the current perspectives for tiered testing and assessment strategies may evolve further.

Table 1
Thus, the tested standard in vivo endpoints are considered "just" as coarse indicators for a more general neurological system health status [62].
AOPs rely on broad mechanistic knowledge coming from many diverse data sources. Both (AOPs and basic science) are in continuous development.
Aligning the rodent DNT endpoints (for motor activity, startle response, conditioned fear, various water mazes) with functionally similar neuropsychological effects that can be tested and observed in humans is ongoing work (Table S1, line 5.1.1.9).
The early mechanistic events tested may not cover all potential mechanisms leading to DNT.
The issue of comparability of human and rodent cognitive measures in studies of pharmacology and neurotoxicology has generated a large body of literature [53,63,64,65]. How comprehensive is the mechanistic coverage of the current standard in vivo approach? (Table S1, lines 5.1.1.5., 5.1.1.9)
Therefore, "positive" data may be more reliable than "negative" data.
Given the many potentially relevant variants of standard in vivo DNT test designs (in terms of endpoint selection, testing and assessment approaches), "positive" in vivo DNT data are more reliable than "negative" in vivo DNT data.
In vitro metabolism is limited to the biotransformation capacity of the isolated test system and may be dissimilar to in vivo metabolism.
Metabolism may vary between species and with environmental influences. How much do we know about this variability?
However, also the latter may be disputed, at least for quantal data: We may consider that the dose-response relationship in the animal test is very much influenced by experimental conditions, including species/strain selection, and therefore does not contain really relevant information. Theoretically, we may be more interested in knowing the dose that causes the effect in an average animal. In fact, all experimental standardization aims to reduce variability between animals. In this case, for quantal data, the BMR50 would be more relevant. Human variability will in any case be modelled separately based on human data [9]. In summary, defining the critical effect size for in vivo endpoints needs convention.
Testing and/or modelling in vitro kinetics as well as QIVIVE modelling is necessary for translating in vitro results to human in vivo reference values. This introduces measuring and modelling uncertainty (see footnote b).
a To understand how much the selection of a BMR5 versus e.g. a BMR50 matters, we may consider: On average, how large is the difference between selecting the corresponding concentrations (i.e. BMC5 or BMC50) as PoDs? If expressed as a probability distribution over a larger set of chemicals, what is the proportion of this uncertainty distribution relative to the total uncertainty of the human reference value after QIVIVE and intra-species variability modelling?
b Furthermore, it is noted that QIVIVE modelling is work in progress at OECD level and it may take several years until its regulatory application. Guidance is currently in development by the OECD Working Party for Hazard Assessment, aiming to: 1) Summarize a scientific workflow for characterizing and validating physiologically based kinetic (PBK) models, with emphasis on the use of in vitro and in silico data for ADME parameters, and on scenarios where in vivo kinetic data are limited or unavailable for parameterizing the model. 2) Identify knowledge sources on in vitro and in silico methods that can be used to generate ADME parameters for PBK models. 3) Develop an assessment framework for evaluating PBK models for specific purposes, with emphasis on the major uncertainties underlying the model predictions. 4) Provide a template for documenting PBK models in a systematic manner. 5) Provide a checklist to support the evaluation of PBK model applicability according to context of use. In addition, case studies are in development.

(How) may regulatory toxicology evolve to make better use of alternative approaches?
In principle, discussing two questions may be helpful for the evolution of regulatory toxicology: 1) Which new approaches can be developed to serve the current regulatory toxicology system? 2) (How) may the current regulatory toxicology framework be adapted to make better use of the new alternative approaches?
While a lot of work is in progress to provide answers to the first question (see e.g. the Toxicology and Applied Pharmacology DNT special issue), discussions tackling the second question are available but still scarce [16]. Therefore, one thought-starter is put forward here (Fig. 2): Can we use alternative methods data as early, sensitive, mechanistic indicators for a potential DNT hazard, together with QIVIVE modelling, to derive a "probably acceptable human effect level", in the sense that the alternative approaches may predict likely safe exposures for specific toxicity traits (e.g. no effects on key neurodevelopmental processes including cell differentiation, migration, neurite outgrowth, synaptogenesis, network function), rather than for the current organism level endpoints? AOPs provide the mechanistic understanding of the relationships between the molecular initiating event, which triggers a cascade of key events at different biological levels (cellular, tissue, organ), and the resulting adverse outcome at organism level.
Different approaches provide different types of data, and this may justify evolving the current definition of adversity and the GHS classification criteria, or even creating a new in vitro mode of action (MoA) hazard GHS class, depending on how different the data are and on whether this is supported by substantial social benefits. QIVIVE-supported subcategorization to account for potency differences could be key for future regulation.
As discussed earlier [16], it is noted that many biological modifiers (such as (epi)genetic background, diet, life-style, socio-economic background, stress, infections, co-exposure) can influence the progression of effects from the molecular/cellular level to an adverse outcome at the organism level. Therefore, the concentrations at which effects at the molecular/cellular level progress to organism level effects can be very variable in real life. These biological modifiers are not assessed with in vivo methods either. Therefore, quantitative modelling of the complex, potentially compensatory, modes of action may not be as important as highly reproducible test results. If molecular initiating events (MIEs) and/or KEs at the cellular/tissue level are affected, this may be considered as an increased chance that AOs may develop, depending on many variable and also unknown biological modifiers; in Fig. 2, this is indicated by the green "non-adverse" to red "adverse" background shading. Furthermore, consider the white arrow in Fig. 2, which broadens from the sub-cellular to the population level, indicating that experimental variability is likely to increase with the complexity of the system. For alternative and in vitro (MIE, KE) based test systems, this may provide the added benefit of a relatively reduced uncertainty due to variability compared to the more complex in vivo systems.
Several considerations of conceptual similarity of current animal testing approaches and future in vitro and in silico approaches indicate that such an evolution of regulatory toxicology may be possible (Table 1).

Fig. 2.
Understanding chemical safety, as a dose that is likely not to trigger defined toxicity traits or pathways, rather than the current apical organism level endpoints (adapted from [16]).

Conclusion
The current in vivo DNT testing and assessment approach provides a hazard characterization with relevant regulatory limitations and uncertainties. A summary of the available knowledge on these is illustrated here and provided in a systematic OECD standard template in the supplement (Table S1). This knowledge base may be reviewed, revised, eventually updated and further developed within the broad expertise of the OECD and EFSA DNT Working Groups. It could serve to support decision-making for the acceptance of IATAs based on alternative methods, since the latter should represent an overall improvement of the current approach.
Theoretically, the in vivo DNT test guidelines as such could also be improved, in terms of their basic study design, further mechanistic and quantitative validation of endpoints and a higher standardization of testing and assessment. However, such a validation would require many animals and would be very costly in terms of time and money. Moreover, for reliable use, historical control databases would need to be built, which would take even more time. Thus, it would be difficult to keep any standard in vivo approach up to date with scientific progress. Moreover, many of the limitations and uncertainties of the standard in vivo DNT approach are of a generic nature and also apply to other animal tests, e.g. limitations of reliability and sensitivity due to constraints on the basic test design, lack of deeper mechanistic information, uncertainties about species differences in metabolism and kinetics, and limited quantitative knowledge of variability and uncertainty related to hazard and risk extrapolation (Table S1, uncertainties specified with "G" for generic or "S" for DNT specific). Last, but not least, the practical regulatory limitations in terms of testing throughput and 3Rs conflict may not be sufficiently improved considering current data needs.
Here, we outlined a hypothesis on how regulatory limitations and uncertainties of the in vivo DNT test guidelines could be reduced by new alternative approaches: gains in 3Rs compliance, reduced costs, higher throughput, improved basic study design (replicates, study-internal positive controls, concentration range), better mechanistic understanding of toxicity, higher standardization of testing and assessment, and validation without 3Rs conflict could increase the availability and reliability of DNT data for thousands of chemicals. This could allow a comparative toxicity assessment over a larger proportion of the chemicals within our global environment that are currently without DNT data. The use of early, sensitive, mechanistic indicators for potential DNT could support precautionary human safety assessment and mixture extrapolations. Combined with QIVIVE modelling, ideally this could provide, eventually in a context-dependent (fit-for-purpose) manner, at least the same level of human health protection. Such new approaches could also lead to a new understanding of chemical safety, as a dose that is likely not to trigger defined toxicity traits or pathways, rather than a dose not causing the current apical organism level endpoints.
This hypothesis could now be tested in upcoming work, which aims to specify and characterize alternative approaches and to analyze how these new approaches may reduce the regulatory limitations and uncertainties of current DNT assessments based on animal experiments and observational human data.

Disclaimer
The scientific views presented in this paper are those of the authors alone and do not necessarily reflect official views of their respective institutions.

Declaration of Competing Interest
The authors declare no conflict of interest.