Limitations and Uncertainties of Acute Fish Toxicity Assessments Can Be Reduced Using Alternative Methods

Tab. S1: Characterization of the acute fish toxicity test (OECD TG 203) according to the OECD template for individual information sources to be used within IATAs (OECD, 2016; gray shaded and non-shaded lines), supplemented with information on uncertainties related to each characterization element (red shaded lines) and how alternative methods or approaches may reduce these uncertainties (green shaded lines). Cross references are indicated in blue.

Based on the hypothesis that acute fish toxicity is often due to nonspecific modes of action, endpoints such as metabolic activity and cell and lysosomal membrane integrity measured in the RTgill-W1 cell line test (Fischer et al., 2019) could be considered as mechanistic key events, even without a fully characterized Adverse Outcome Pathway (Volz et al., 2011) 2. Such effects on rainbow trout gill cells, similar to sublethal endpoints in the FET, are likely to increase the likelihood of lethality at the population level in the environment. This hypothesis appears mechanistically plausible, because compromised or weakened fish are likely to die in the real-world environment due to predators, competitors and/or other environmental stressors 1. If needed, other specific endpoints could also be tested in vitro, such as acetylcholinesterase inhibition (Arini et al., 2017) 3.
Such methods may be combined within an Integrated Approach to Testing and Assessment (IATA), on which work is underway at OECD level (OECD WNT project 2.54). Computational approaches for data integration, e.g., Bayesian Networks (Lillicrap et al., 2020; Moe et al., 2020), could complement the IATA.
doi:10.14573/altex.2006051s
It is noted that, considering the environmental extrapolation uncertainty, the huge environmental variability and the uncertainty about this variability (see 2.1, 2.3, 4.3, 5.1, 6.1, 8.1, 9.1, 10), LC50 values from TG 203 should be considered as coarse indicators of potential population-level lethality in real environments. Thus, it remains an open question how precisely such coarse indicators should be predicted by other toxicity indicators from alternative methods.

2
Fish are considered to be one of the highest trophic levels in ecotoxicology. Fish acute toxicity data are used together with acute invertebrate (daphnids) and algae data to inform on potential aquatic toxicity (in terms of these regulatory base-set requirements).

Uncertainties:
Information is available regarding the variability of LC50 values between the 6 fish species most frequently used in TG 203 tests (see 10). However, it remains uncertain how well this variability estimate represents the sensitivity distribution for all fish species and life stages in the environment. For 65 to 90% of chemicals, pesticides, and pharmaceuticals, daphnids and algae are more sensitive to toxicants than fish (Hutchinson et al., 2003; Jeram et al., 2005; Hoekzema et al., 2006; Rawlings et al., 2019). This range estimate is based on several analyses and thus indicates biological variability without a high uncertainty component. However, it indicates how often routine acute fish toxicity testing increases the protection level (compared to testing with daphnids and algae only). The range estimate also indicates that LC50 variability based on TG 203 studies, or estimated from alternative methods, may often not influence the PNEC derivation and GHS classification.
Regarding the broader environmental protection goal, the use of TG 203 data causes numerous sources of extrapolation uncertainty relating to the complexity of natural environments. Water temperature, hardness, oxygen content, pH, sunlight and turbidity, water current, total organic carbon, oxidation, evaporation, presence of sediment, variable fish biotransformation and other aquatic degradation or transformation, and co-exposure with other stressors could impact on the toxicity of chemicals. Acute fish toxicity is estimated by maximizing the bioavailability of a single chemical for a single species, using standardized parameters that do not reflect the potential natural environmental variability. Organism health, stress from variable food webs and other biotic factors may also influence fish lethality in natural ecosystems. These environmental variables may increase or decrease the toxic potential of a compound relative to the experimental laboratory situation. Furthermore, the aquatic environment contains hundreds of thousands of species (Mora et al., 2011) and various life stages (embryonic, juvenile, adult, senescent). It is an extreme environmental simplification to cover this biodiversity by using short-term tests for a specific life stage of three species, representing the three trophic levels.
Data-based knowledge is available that single species tests are "in a majority of cases, reliable qualitative (some level of response seen) predictors of aquatic ecosystem community effects" (de-Vlaming and Norberg-King, 1999). This latter US-EPA review comprised 57 studies (74%) that support this conclusion and 16 studies (21%) where single species testing underestimated aquatic ecosystem effects, and 4 studies (5%) that were inconclusive. The review also explains that full quantitative validation of single species tests through field studies is not feasible and not meaningful given the substantial environmental variability. This also means that a single environmentally "true" value does not exist, regardless of the assessment method applied.

2.2
Informed species extrapolation is possible using computational methods, engaging for example, chemical structure, physicochemical properties, MoA classification, existing in vivo fish data, and/or other parameters (see 6.2). However, it is important to consider that models are only as good as the data used to build and train them. If poor quality, or highly variable AFT data is used in model construction, the models will suffer. Thus, models may not accurately predict experimental in vivo responses.

2.3
Uncertainties: By default, TG 203 does not provide information on the mechanism of toxicity. This limits, inter alia, extrapolation to other species and taxa (via Interspecies Correlation Estimation (ICE) models and the Ecological Threshold of Concern, EcoTTC; see 6.2). This extrapolation is more uncertain without knowledge of the underlying mechanism and its relevance for the broader ecosystem. Introducing moribundity and eventually additional mechanistic endpoints by default into TG 203 would be theoretically feasible. As mentioned before, moribundity would also partly address the animal welfare concern. However, there is currently no agreement on the link between potential clinical signs and related mechanisms leading to mortality. Therefore, OECD VMG-Eco agreed to encourage the voluntary collection of clinical signs first and to address their relation to lethality at a later stage. The use of moribundity would not improve the standardization and the throughput for testing and assessment. Moreover, it would require the development of reference databases and validation (which would necessitate continued animal testing). Considering the current validation stage and perspectives for alternative methods, it may be more efficient to invest in IATAs based on alternative methods.

2.4
Alternatives can provide some mechanistic information, see 1.2.

DESCRIPTION
3 At least 7 juvenile fish are exposed to the test chemical for 96 h under static, semi-static or flow-through conditions. Mortalities and visible abnormalities are recorded, and where possible the LC50 is determined.
One of the following species is recommended: zebrafish, fathead minnow, common carp, Japanese medaka, guppy, bluegill, rainbow trout, three-spined stickleback, sheepshead minnow, European sea bass and sea bream. If other species are used, the rationale for the selection of the species must be reported together with any adaptation of the test guideline's recommendations.
3.1 Uncertainties for the selection of a specific test design (within the range of TG 203): TG 203 stipulates that the selection of any fish species should depend on regulatory requirements and environmental exposure scenarios (cold, temperate or warm water species, freshwater or estuarine/marine fish), but does not provide further guidance for this species selection or discuss whether/when multiple species should be tested. TG 203 requires that juvenile fish are used. A range of acceptable fish lengths for the different species is provided. However, because weight scales approximately with the cube of length, small differences in length can still translate to larger differences in weight and developmental stage; see Table 6 in Belanger et al. (2013). TG 203 does not specify the use of males or females. TG 203 does not precisely specify the number of fish to be used per concentration (see 5 and 5.1). This flexibility of the TG may introduce variability into the data, with disadvantages for global regulation (see 10, 10.1 and 16).
3.2 Alternative experimental methods are generally more standardized. They are not intended to provide an (uncertain) lethality estimate for a specific species and related environment. They are intended to provide a useful basis for estimating fish toxicity, at least for the current standard species included in TG 203. For example, TG 236 (FET) limits the variability of species to zebrafish, with precisely defined developmental stages and related water parameters. The RTgill-W1 fish cell line test (ISO, 2019) clearly defines the cell line, culture and exposure conditions. This may lead to significant advantages for regulatory toxicology, see 6.2, 10.2, 16.1.
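The length-to-weight point in 3.1 can be illustrated with a short calculation. This is a minimal sketch that assumes an idealized cubic relationship (weight proportional to length cubed); real length-weight exponents differ between species, and the function name and example percentages are illustrative assumptions, not values from TG 203.

```python
def weight_ratio(length_ratio: float, exponent: float = 3.0) -> float:
    """Relative weight implied by a relative length, assuming W ~ L**exponent.

    The default exponent of 3 is an idealized assumption (isometric growth).
    """
    if length_ratio <= 0:
        raise ValueError("length_ratio must be positive")
    return length_ratio ** exponent


# Hypothetical length differences between two fish of the same species.
for pct_longer in (10, 20, 50):
    ratio = weight_ratio(1 + pct_longer / 100)
    print(f"{pct_longer}% longer -> about {100 * (ratio - 1):.0f}% heavier")
```

Under this assumption, a modest spread within an accepted length range already corresponds to a considerably larger spread in body weight.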

RESPONSE MEASURED
4 Lethality: Fish are considered dead, if there is no visible movement (e.g., gill movements) and if touching the caudal peduncle produces no reaction. Visible abnormalities are also observed and recorded, including loss of equilibrium, swimming behavior, respiratory function and pigmentation.

Uncertainties:
In regulatory practice, moribundity rather than lethality is legitimately 9 used as the endpoint in several countries, though this is not the case in all countries and TG 203 still prescribes lethality to be used. This introduces variability and uncertainty into the LC50 estimates. On the one hand, the use of moribundity may reduce the LC50 on average by a factor of 2, compared to lethality (Rufli, 2012). On the other hand, observations of moribundity are likely more subjective than observations of mortality and may introduce additional variability to the assay result. However, the variability and uncertainty from the use of moribundity or lethality can be estimated and reduced as soon as unambiguous criteria for moribundity have been agreed (Rufli, 2012). Thus, there is still uncertainty related to the use of these endpoints, but in principle the difference between lethality and moribundity could be scientifically calibrated.

Uncertainties:
Since TG 203 is a vertebrate animal test, the basic test design has to be limited in terms of animal numbers. Therefore, there is no true biological replication in TG 203. All animals stem from one cohort and there are no tank replicates. Furthermore, TG 203 does not require positive controls, which would allow documenting the laboratory's proficiency in acute fish toxicity testing, the expected intra- and inter-laboratory variability of LC50 values within and between species, and the validity of the individual study.

4.4
Alternative methods may include more replicates and include positive controls without significantly compromising 3R goals or increasing testing costs (especially if in vitro methods or computational methods are used, see 16.1).

5
The LC50 is calculated from the concentration-response relationship after 96 h. It is not mandatory to identify a concentration causing 0 or 100% mortality. The OECD TG 203 requires at least 7 fish to be used per concentration.

Uncertainties:
The precision of experimental LC50 estimates is reduced when fewer than 20 fish are used per concentration, with flat dose-response relationships or in cases where the LC50 is off-center relative to the boundaries of the tested concentration range (Carr et al., 2018) 10. A variety of models are used to estimate the LC50 which accommodate these difficult data structures, and the LC50 estimate is sensitive to model selection in these cases. Confidence and prediction intervals are also subject to uncertainty when LC50 estimates from many tests, summarized by different methods, are being compared (Carr et al., 2018).
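Why group size matters for precision can be sketched with the binomial standard error of an observed mortality fraction. This is a generic statistical illustration and not the analysis of Carr et al. (2018); the function name and the chosen group sizes are assumptions made for the example.

```python
import math


def mortality_standard_error(true_mortality: float, n_fish: int) -> float:
    """Binomial standard error of the observed mortality fraction in one
    concentration group of n_fish animals."""
    if not 0.0 <= true_mortality <= 1.0:
        raise ValueError("true_mortality must lie between 0 and 1")
    if n_fish < 1:
        raise ValueError("n_fish must be at least 1")
    return math.sqrt(true_mortality * (1.0 - true_mortality) / n_fish)


# Compare the minimum group size named in the text (7) with a group of 20,
# for a concentration that truly kills half of the fish.
for n in (7, 20):
    se = mortality_standard_error(0.5, n)
    print(f"n = {n:2d}: standard error of observed mortality = {se:.3f}")
```

The uncertainty in each observed mortality fraction propagates into the fitted concentration-response curve and therefore into the LC50 estimate.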

5.2
Alternative experimental methods may use more replicates and a larger number of test concentrations with no or less conflict with the 3Rs (e.g., fish cell lines (ISO, 2019) or TG 236, respectively).
Computational models are built using a large amount of toxicological data for many chemical structures 11.

6
TG 203 LC50 may be used to derive an aquatic Predicted No-Effect Concentration (PNEC): current regulatory guidance suggests the use of default extrapolation factors (up to 1000), depending on the availability of just acute or also chronic toxicity data from additional trophic levels (algae and/or daphnids) (ECHA, 2008).
6.1 Uncertainties: The variability of TG 203 derived LC50 values (see 10 to 10.1) may translate to variable PNEC extrapolations (in up to 35% of cases, see 2.1). Furthermore, in the absence of a standard validation of TG 203, there is uncertainty in this variability (see 9 and 10). Furthermore, the assessment factors used according to current regulatory guidance are pragmatic and simplistic, and result in PNECs for which the environmental protection level (e.g., % species under risk) and related uncertainty are unknown (though there is some qualitative science basis for the regulatory use of AFT LC50, see 2.1 last paragraph). With regard to potential fish interspecies differences see 10.
PNECs derived by expert risk assessors can vary by 3 orders of magnitude, with the largest contributor being the selection of the input data for PNEC derivation based on study quality (Hahn et al., 2014). When different expert groups come to different, scientifically legitimate conclusions, this represents ambiguity. It results from data complexity as one form of uncertainty. With the increasing number of toxicological methods and approaches and the regulatory requirement for Weight of Evidence assessments, this type of uncertainty is likely to grow. Several tools, including ICE, SSD, CTD and EcoTTC models, can be used to describe, model, and account for the variability across species. On this basis a single experimental fish LC50 (or prediction thereof) can be recognized as a very uncertain estimate for environmental aquatic toxicity. Confidence intervals for more comprehensively informed environmental toxicity estimates, such as the 5th percentile of SSDs, may span 2 orders of magnitude (Awkerman et al., 2014; Bejarano et al., 2017). Moreover, the SSD models do not include the variability of fish toxicity due to the variability of environmental factors.
6.2 Variability for LC50 or EC50 estimates (and respective PNECs) from alternative methods may be lower (see 10.2). To counterbalance the increase in complexity due to the regulatory requirement for Weight-of-Evidence assessments, alternative methods are usually intended to be used within standardized Integrated Approaches to Testing and Assessment (IATAs) and/or Integrated Testing Strategies (ITS) or Defined Approaches (DAs). The extrapolation uncertainty from the use of pragmatic assessment factors would also apply to alternative experimental approaches. However, computational models may reduce the uncertainty of the current default assessment factor applications.
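The assessment-factor approach described for PNEC derivation can be sketched as follows. This is a simplified illustration, assuming the PNEC is the lowest acute value across the three trophic levels divided by a default factor of 1000; the function name and the toxicity values are hypothetical, and actual guidance contains further conditions.

```python
def derive_pnec(acute_values_mg_per_l: dict[str, float],
                assessment_factor: float = 1000.0) -> tuple[float, str]:
    """Return (PNEC in mg/L, taxon that drives it).

    Simplified assumption: PNEC = lowest acute L(E)C50 / assessment factor.
    """
    if not acute_values_mg_per_l:
        raise ValueError("at least one acute value is required")
    driver, lowest = min(acute_values_mg_per_l.items(), key=lambda kv: kv[1])
    return lowest / assessment_factor, driver


# Hypothetical base-set data for one chemical (mg/L).
base_set = {"fish": 12.0, "daphnid": 3.5, "algae": 8.0}
pnec, driver = derive_pnec(base_set)
print(f"PNEC = {pnec:.4f} mg/L, driven by the {driver} value")
```

In this hypothetical case the fish LC50 does not drive the PNEC, which mirrors the observation in 2.1 that daphnids and algae are often more sensitive than fish.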
Interspecies Correlation Estimations (ICEs; from species with test data to species without test data) may increase the number of chemicals with reliable Species-Sensitivity-Distributions (SSDs). The latter may inform on expected interspecies differences for chemical groups and inform on useful limit value derivations (Bejarano et al., 2017). Chemical Toxicity Distributions (CTDs) may inform on chemical-group specific expected lowest No Observed Adverse Effect Concentrations (NOAECs) and may build the basis for the development of an Ecological Threshold of Concern (EcoTTC) (Belanger et al., 2015). It is noted that the data used for model development stem from laboratory experiments and can never inform on the almost endless real environmental variability, but they allow best informed use of available knowledge.
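How a species sensitivity distribution yields a 5th percentile can be sketched as follows. This is a minimal illustration assuming a log-normal SSD fitted by the sample mean and standard deviation of log10 LC50 values; the species values are hypothetical, and the cited studies may use other distributions and fitting methods.

```python
import math
from statistics import NormalDist, mean, stdev


def hc5_lognormal(lc50_values_mg_per_l: list[float]) -> float:
    """5th percentile of a log-normal SSD fitted to species LC50 values.

    Assumption: log10(LC50) across species is normally distributed.
    """
    if len(lc50_values_mg_per_l) < 2:
        raise ValueError("at least two species values are required")
    logs = [math.log10(v) for v in lc50_values_mg_per_l]
    fitted = NormalDist(mean(logs), stdev(logs))
    return 10 ** fitted.inv_cdf(0.05)


# Hypothetical LC50 values (mg/L) for five species.
species_lc50 = [0.5, 1.2, 3.0, 8.0, 20.0]
print(f"HC5 = {hc5_lognormal(species_lc50):.2f} mg/L")
```

With so few species the fitted percentile is itself very uncertain, which is consistent with the wide confidence intervals reported for SSD-based estimates.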

7
The TG 203 LC50, in combination with algae and daphnids toxicity data and eventually information for biodegradability and bio-concentration, may be used to classify chemicals and mixtures 1) according to GHS 12 and 2) to define the T-criterion within the assessment of PBT (persistent, bio-accumulating, toxic) properties (LC50 < 0.01 mg/L; ECHA, 2017b), with potentially severe legal consequences, like the requirement to be banned and substituted on the market.

Uncertainty:
There is variability in any type of experimental data. Therefore, especially for LC50 values close to classification cut off levels, the classification may change due to data variability, i.e., chance, rather than true differences in toxicity (Dimitrov et al., 2016). In the absence of agreed variability estimates for the acute fish, daphnia and algae data and the variability of the lowest LC50 or EC50 from such tests, there is uncertainty in this variability (see 9 and 10).

METABOLIC COMPETENCE
11 QSAR predictivity may be improved by local validity assessment (Benfenati et al., 2013. VEGA-QSAR: AI inside a platform for predictive toxicology. CEUR Workshop Proceedings. http://ceur-ws.org/Vol-1107/) and/or developing QSARs specific for chemical groups sharing a molecular initiating event. Thus, QSARs may also trigger investigation of experimental data, in case the latter deviate from high quality computational predictions (see, e.g., figure 1 in Thomas et al. (2019). How in silico and QSAR approaches can increase confidence in environmental hazard and risk assessment. Integr Environ Assess Manag 15, 40-50. doi:10.1002/ieam.4108).
12 According to GHS and CLP regulation (EC No 1272/2008), LC50 values higher than 1 mg/L do not lead to acute aquatic toxicity classification, whereas LC50 values ≤ 1 mg/L lead to classification in category 1. No other acute categories are defined, but further LC50 stratification into magnitudes (1-0.1 mg/L, 0.1-0.01 mg/L, etc.) allows the attribution of M-factors to the classified chemicals, and this supports better mixture classification. Classification for chronic toxicity categories is also possible based on acute fish LC50 data, depending on information for ready biodegradability (ECHA, 2017c).

8
The metabolic biotransformation in juvenile fish can activate compounds resulting in enhanced toxicity when compared to the parent compounds. Alternatively, biotransformation could reduce the internal bioavailable concentration of the toxic compound and lead to a reduced toxicity.

8.1
Uncertainty: For compounds that require activation, differences in the LC50 of > 50-fold have been identified for different species (Scholz et al., 2016). Since TG 203 does not cover embryonic life stages, potential acute toxicity specific for this life stage might not be covered (e.g., life stage specific MoAs and/or absence/presence of specific biotransformation).
There is a lack of knowledge about the variability of biotransformation between the species recommended in TG 203, as well as about the variability of biotransformation between the different fish life stages. This has not been systematically assessed. Furthermore, there is uncertainty with respect to modulations within real-world ecosystems (Schlenk et al., 2008). Therefore, a comparison between a juvenile fish and alternative test systems is difficult given the lack of information.
Thus the uncertainty about variability due to potentially variable biotransformation is reduced. This does not exclude the possibility that for some compounds a reduced toxicity may be observed due to a lack of bioactivation when using some alternative methods. So far, however, only one example (allyl alcohol) has been described as having a lower toxicity due to a lack of bioactivation when tested using TG 236 (Kluver et al., 2014; Knobel et al., 2012) or the RTgill-W1 fish cell line (Tanneberger et al., 2013) compared to TG 203. However, this may also apply when a certain single species is used for the AFT.

9
TG 203 was adopted by the OECD in 1981, and updated in 1983, 1992 and 2019. The most recent update is based on the review of the fish testing framework (OECD, 2012a) and includes: a) specifications, in terms of fixing the formerly flexible test duration and the maximum fish load for flow-through conditions (g/L), guidance for the use of solvent and solvent controls (OECD, 2019), guidance on analysis of the dilution water control and solvent control, guidance on the use of statistical methods, the fish length terminology, specifying suitable estuarine and marine species, specifying the use of juvenile fish (by species-specific ranges for fish length), a 14-day interval between treatment against diseases and test initiation, and indicating potential range finders; b) adaptations, in terms of deleting the need for a concentration range leading to a 0 to 100% lethality range and enhancing the recordings for visible abnormalities (for future use of more humane endpoints in line with OECD, 2002). Some other elements recommended for review in OECD (2012a) could not be improved so far, such as the use of moribundity as a definitive indicator of lethality and as a trigger to terminate the experiment (see 1, 1.1 and 4.1), the lack of a positive control, the lack of test-tank replicates (see 4.3), and the lack of guidance on fish species selection among the 11 species recommended (see 3.1).

9.1
Uncertainties: There is some qualitative scientific basis for the regulatory use of AFT LC50 (see 2.1, last paragraph). However, no international official validation is available according to the present requirements for alternative methods, in terms of reliability and relevance for real aquatic environments. No validation is available with respect to the effect of environmental modifiers.
With a view to analyzing the variability of TG 203 data, an existing TG 203 database was used, which was engaged earlier for re-assessing the correlation between TG 203 and TG 236 data (Sobanska et al., 2018). However, the TG 203 data in that database had not yet been filtered with the same quality criteria as the TG 236 data. A subsequent assessment of TG 203 variability used more extensive study quality filters and resulted in a reduction of a data set from originally 2936 studies on 1842 chemicals to 364 studies on 266 chemicals (Braunbeck et al., 2020). This reduction was of a similar magnitude to that seen in the TG 236 database (from 2065 to 123 chemicals, (Sobanska et al., 2018)) and indicates the same limitations for TG 203 as for TG 236 data in terms of data availability for lipophilic, volatile, reactive, inorganic and high molecular weight chemicals. Thus, in case the TG 236 database is considered too limited for reliably assessing the validity of TG 236, the same appears to hold true for TG 203.

9.2
For alternative methods, validation is obligatory for their acceptance and is also available for the OECD FET TG 236 (Busquet et al., 2014; OECD, 2011, 2012b; Belanger et al., 2013) and the fish cell line test (Tanneberger et al., 2013; Fischer et al., 2019; Natsch et al., 2018; ISO, 2019). These validation studies support reliability in terms of intra- and inter-laboratory reproducibility of test results (see 10.2) and also relevance in terms of correlation with AFT data, with slopes near 1 and intercepts close to 0. Some chemicals with a neurotoxic MoA, and one chemical for which toxicity is increased due to bioactivation via alcohol dehydrogenase, appear to be less toxic in the two alternative methods compared to AFT. They represent outliers from the strong correlation observed between data from alternative methods and AFT. Such outliers may still be within the range of AFT variability, inter alia, due to the TG inherent species variability (see 10). However, with the aim of improving the current level of environmental protection, the development of an IATA composed of several methods and information sources is currently envisaged at OECD level (WNT Project 2.54). Compared to in vivo methods, validation for reliability is relatively easy due to usually lower costs, higher throughput and the 3Rs benefit (reducing the constraints for conducting a higher number of tests). Validation of relevance for real aquatic environments and with respect to interference with environmental factors is similarly difficult for alternative methods; however, computational approaches (as one kind of alternative method) can start to address this (see 6.2).
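What "slopes near 1 and intercepts close to 0" means can be sketched with an ordinary least-squares fit on log10-transformed values. This is an illustrative sketch only; the cited validation studies may have used other regression types, and all data values and names below are hypothetical.

```python
import math


def log_log_regression(aft_lc50: list[float],
                       alternative_ec50: list[float]) -> tuple[float, float]:
    """Ordinary least-squares slope and intercept of
    log10(alternative value) against log10(AFT LC50)."""
    if len(aft_lc50) != len(alternative_ec50) or len(aft_lc50) < 2:
        raise ValueError("need two equally long series with at least 2 points")
    x = [math.log10(v) for v in aft_lc50]
    y = [math.log10(v) for v in alternative_ec50]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical paired values (mg/L) for five chemicals.
aft = [0.2, 1.5, 9.0, 40.0, 300.0]
alt = [0.3, 1.1, 12.0, 35.0, 280.0]
slope, intercept = log_log_regression(aft, alt)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f} log units")
```

A slope of 1 with an intercept of 0 would mean that the alternative method reproduces the AFT value on average; a constant tenfold lower sensitivity would instead appear as an intercept of 1 log unit.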

10
The variability of protocol variants associated with acute fish toxicity testing was investigated, and it was demonstrated that when including all test species, all life stages and all test conditions, and not filtering for any type of exposure confirmation, the range between minimum and maximum 96 h LC50 values may be as great as six logarithmic units within the 44 substances analyzed. Even when data for rainbow trout only were considered, the range between the minimum and maximum 96 h LC50 value could still reach three logarithmic units (Hrovat et al., 2009). Another data set and analysis (Belanger et al., 2013; Lammer et al., 2009) illustrates that the span of toxicity reported in the literature may be as large as 4 logarithmic units. Two older ring trials indicate a maximum range of LC50 inter-laboratory variability of one logarithmic unit, if fish interspecies variability is excluded but other variability such as fish size and exposure conditions (flow-through or static) is included. The two assessments were based on one chemical in two replicates and two fish species within 6 laboratories (Lemke, 1981) and another chemical in one or two replicates and one fish species within 13 laboratories (US EPA, 2001). Below, in Tab. S1/10.0, a representative compilation of aquatic toxicity tests was devised to provide additional perspective on expected reproducibility from both accepted standardized assays used for regulatory purposes (macrophyte, algae, daphnid, and fish) and new (fish embryo, gill cell line) assays that have undergone international validation. The types of compounds used in ring trial and assay validation programs range from highly variable wastewater effluents to inorganic compounds, industrial organic compounds, pesticides, polymers, biocides and pharmaceuticals, suggesting that the primary source of variability for the assessment of acute aquatic toxicity is the difference between the biological components used within the different test systems.
Some ring trials included determination of assay conditions (e.g., Lemke (1981), who compared flow-through and static designs for daphnids and two fish species).
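The "logarithmic units" used above to express the spread of LC50 values can be computed as the base-10 logarithm of the ratio between the highest and lowest reported value. The following sketch uses hypothetical LC50 values and an illustrative function name.

```python
import math


def log_unit_span(lc50_values_mg_per_l: list[float]) -> float:
    """Range between minimum and maximum LC50, in log10 units."""
    if not lc50_values_mg_per_l or min(lc50_values_mg_per_l) <= 0:
        raise ValueError("need at least one positive LC50 value")
    return math.log10(max(lc50_values_mg_per_l) / min(lc50_values_mg_per_l))


# Hypothetical 96 h LC50 values (mg/L) reported for one substance.
reported = [0.04, 0.9, 3.2, 41.0]
span = log_unit_span(reported)
print(f"span = {span:.1f} log units (factor {10 ** span:.0f})")
```

A span of three logarithmic units thus corresponds to a thousandfold difference between the lowest and highest LC50 for the same substance.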

Tab. S1/10.0: Overview of representative levels of intra-laboratory (I) and inter-laboratory (E) repeatability of acute toxicity tests for various taxa and test types
All results are expressed as Coefficients of Variation (%). Alternatives to acute fish toxicity are shaded in gray.

Without exception, inter-laboratory coefficients of variation exceed intra-laboratory coefficients of variation when both were available for comparison within a ring trial. From this summary, based on coefficients of variation and excluding the TG 203 inherent interspecies variability, it seems that both intra- and inter-laboratory repeatability are in a similar range for traditional ecotoxicity tests and alternative assays. However, the data sets referenced in the table above are larger for the alternative methods in terms of number of chemicals and replicates tested. Therefore, the slightly lower range estimates for variability in alternative methods compared to AFT, as derived from the references above, support the expectations for lower variability of alternative methods (see 10.2). Moreover, a more comprehensive and more recent analysis applying stringent data quality filters indicated that about 8% and 0.5% of 181 chemicals showed differences in AFT LC50 data by factors of > 10 or > 100, respectively, if interspecies variability is excluded. If the TG 203 inherent interspecies variability is included, these percentages increase to about 15% and 10% of 178 chemicals (Braunbeck et al., 2020). Other work including AFT data for the neurotoxic biocide malathion also indicates that AFT interspecies differences may be in the range of 4 orders of magnitude, depending on the chemical (Figure 5 in Fischer et al., 2019). For compounds that require activation, differences in the LC50 of > 50-fold have been identified for different species (Scholz et al., 2016). An earlier analysis using less stringent AFT data-quality filters compared to Braunbeck et al. (2020) used a database of 337 chemicals and 4 fish species. This study indicated an AFT interspecies variability between a factor of 1 and 43 for the mean, and a factor of between 8 and 95 for the 95% quantiles (Scholz et al., 2016; Sobanska et al., 2018).
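The two variability measures used in this section, the coefficient of variation and the share of chemicals whose LC50 values differ by more than a given factor, can be sketched as follows. All laboratory and chemical values are hypothetical, and the function names are illustrative assumptions.

```python
from statistics import mean, stdev


def coefficient_of_variation(values: list[float]) -> float:
    """Sample standard deviation as a percentage of the mean."""
    return 100.0 * stdev(values) / mean(values)


def share_exceeding_factor(lc50_by_chemical: dict[str, list[float]],
                           factor: float) -> float:
    """Percentage of chemicals whose max/min LC50 ratio exceeds `factor`."""
    exceeding = sum(1 for vals in lc50_by_chemical.values()
                    if max(vals) / min(vals) > factor)
    return 100.0 * exceeding / len(lc50_by_chemical)


# Hypothetical LC50 results (mg/L) from five laboratories for one chemical.
ring_trial = [2.1, 2.6, 1.8, 3.0, 2.4]
print(f"inter-laboratory CV = {coefficient_of_variation(ring_trial):.0f}%")

# Hypothetical repeated LC50 values (mg/L) for four chemicals.
database = {"A": [1.0, 1.4], "B": [0.2, 3.5], "C": [5.0, 6.1], "D": [0.01, 2.0]}
for f in (10, 100):
    print(f"differences > factor {f}: {share_exceeding_factor(database, f):.0f}% of chemicals")
```

The two measures answer different questions: the coefficient of variation summarizes typical scatter around a mean, whereas the factor-based share highlights rare but large discrepancies.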

10.2
If the OECD guidance on good in vitro method practice (GIVIMP) is followed to develop robust protocols (OECD, 2018b), alternative methods may produce less variable and less uncertain results due to their higher standardization, improved basic study design (see 3.2, 4.2 and 5.2) and the availability of standard validation data. This increases data comparability, data reproducibility and reliability of classification, PNEC derivation and (PB)T criterion identification, which is an advantage for (international) regulation (see 16.1). In fact, it is noted that, compared to the AFT ring trials with single species and one or two chemicals (US EPA, 2001; Lemke, 1981; first two columns in Tab. S1/10.2), the data sets for the validation of alternative methods are larger in terms of number of chemicals and replicates tested. Therefore, the slightly lower range estimates for variability in alternative methods compared to these first two AFT variability estimates (US EPA, 2001; Lemke, 1981) support the expectations for lower variability of alternative methods. The more comprehensive, most recent analysis clearly indicates the highest range estimates (excluding and including interspecies variability), but this is also based on many more chemicals. It is noted that TG 203 recommends the use of different fish species and this clearly further increases LC50 variability.
2 For semi-quantitative information see 2.1, last paragraph. For quantitative information see 6 and 6.1. TG 203 and alternative methods are conceptually similar in that data outputs from both types of methods need to be fed into extrapolation models that can provide information on potential effects and PNECs in the real environment. However, alternatives allow improvements (see 6.2).

14
Using TG 203 for testing chemicals with slow bioaccumulation is highly questionable from an environmental relevance point of view. The acute exposure period may be too short for building up a critical body burden and hence the toxicity could be underestimated. It may also be discussed whether the current approach of testing for acute and/or chronic fish toxicity depending on yearly production volumes, exposure scenarios and results in acute fish tests is scientifically adequate at all. Previously it was suggested that testing with chemical specific exposure periods, in order to allow testing at conditions of steady state body burdens, could be scientifically more defensible (Sprague, 1969). 9

14.1
In vitro methods and computational models are available that may inform on the potential for bioaccumulation in fish (OECD, 2018a; OECD QSAR Toolbox; US-EPA TEST; VEGA) and on potentially useful follow-up assessment strategies.

15
Low water solubility, high volatility, chemical instability, mixtures and UVCBs. The OECD Guidance Document 23 for aqueous-phase aquatic toxicity testing provides recommendations for the testing of difficult-to-test chemicals (OECD, 2019).

15.1
Computational modelling may allow extrapolation to a chemical space that would be technically impossible or difficult to test (Thomas et al., 2019). For alternative experimental methods, the same limitations with regard to testing difficult substances apply. As for the AFT, appropriate approaches have been suggested for testing difficult compounds, e.g., sealing culture dishes to avoid evaporation, frequent renewal of exposure solutions and/or saturation of the culture dishes by pre-incubation with chemicals, which is potentially relevant for highly hydrophobic substances (OECD, 2018b).

16
TG 203 is a vertebrate animal test based on lethality as an endpoint, using a minimum of 42 animals beyond the embryonic stage per test in a full concentration-response study. These two aspects are ethically and politically problematic.
NL: Netherlands National Committee for the protection of animals used for scientific purposes. Transition to non-animal research on opportunities for the phasing out of animal procedures and the stimulation of innovation without laboratory animals. Published December 12, 2016. file:///C:/Users/Gilly/Downloads/NCad+Opinion+Transition+to+non-animal+research.pdf (accessed 23.05.2020).
14 Though this testing ban has currently only manifested for human toxicity, the strong international political interest to end animal testing in the field of cosmetics has become obvious: Ban on animal testing. European Commission Internal Market, Industry, Entrepreneurship and SMEs. https://ec.europa.eu/growth/sectors/cosmetics/animal-testing_en (accessed 23.05.2020).
The European strategy for a non-toxic environment (European Commission, 2017) states that it would be desirable to test a larger proportion of the 100,000 chemicals currently on the market, which are still lacking reliable toxicity data 15 . Low assay throughput, resulting from the study duration and the large chemical and tank volumes required, is a limitation to this goal. Moreover, there is regulatory interest in newer areas, such as the toxicity of nanomaterials 16 , mixtures (European Commission, 2018) and bio-analytics in environmental aquatic media, that may require acute aquatic toxicity estimates (Norberg-King et al., 2018; Schroeder et al., 2016). The design of "green chemistry" molecules may also require additional testing (Maertens et al., 2014). LC50 data variability (see 10 and 10.1) may translate (in up to 35% of cases, see 2.1) into variable PNECs and GHS classifications for toxicologically similar chemicals, which is a disadvantage for global regulation.

16.1
Providing absolute environmentally relevant toxicity information for the extensive real-world environmental variability is extremely difficult and practically impossible to achieve with any method. From a scientific point of view, hazard identification (i.e., GHS classification and (PB)T criterion identification) and hazard characterization (i.e., concentration-response modeling and PNEC derivation) may support the comparison of the toxicity of chemicals. This will support a global reduction in environmental exposure to the most dangerous chemicals.
From this perspective, alternative methods may provide the following advantages:
o 3Rs benefits
  • Replacement of AFT in most cases by using a threshold approach-based IATA (including physico-chemical data, algae and daphnids tests, fish cell tests, computational methods and other approaches),
  • Refinement by using FET instead of AFT in most cases within the threshold approach-based IATA,
  • Reduction in use of AFT and FET by using a threshold approach-based IATA (which also supports the use of limit tests for AFT and FET),
o reducing costs (by using, for example, computational models, fish cell line test),
o increasing the testing and assessment throughput (by using, for example, computational models, fish cell line test),
o improved similarity of test designs and assessments (engaging IATAs, if necessary),
o reduced variability and uncertainty thereof (see 3.2, 4.3, 5.2, 8.2, 9.2 and 10.2).
This shall permit more reliable data generation in a more contemporary time frame for many more chemicals. Altogether, this may increase the comparability of test results between chemicals and thus global reliability of GHS classification, PNEC derivation and risk assessment. Such improved data quality may also better support the development of computational methods, which, if sufficiently valid, may be highly efficient for the ecotoxicity assessment of chemicals.
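How LC50 variability may carry over into GHS classification and PNEC derivation (see 16) can be sketched as follows. The sketch uses the GHS acute aquatic hazard cut-offs of 1, 10 and 100 mg/L and assumes a PNEC derived by dividing the LC50 by an assessment factor of 1000, as commonly applied when only acute data are available. It further assumes that the fish LC50 is the lowest acute value across trophic levels. The nominal LC50 of 2 mg/L and the 3-fold variability are hypothetical values chosen for illustration and are not taken from the analyses cited in this table.

```python
def ghs_acute_category(lc50_mg_per_l: float) -> str:
    """GHS acute aquatic hazard category for a 96 h fish LC50 in mg/L."""
    if lc50_mg_per_l <= 1.0:
        return "Acute 1"
    if lc50_mg_per_l <= 10.0:
        return "Acute 2"
    if lc50_mg_per_l <= 100.0:
        return "Acute 3"
    return "Not classified"


def pnec_from_acute(lc50_mg_per_l: float, assessment_factor: float = 1000.0) -> float:
    """PNEC in mg/L, assuming the fish LC50 is the lowest acute value."""
    return lc50_mg_per_l / assessment_factor


def classification_spread(lc50_mg_per_l: float, fold_variability: float) -> tuple:
    """Categories at the lower and upper end of a fold-variability range."""
    low = lc50_mg_per_l / fold_variability
    high = lc50_mg_per_l * fold_variability
    return ghs_acute_category(low), ghs_acute_category(high)


nominal_lc50 = 2.0  # mg/L, hypothetical
fold = 3.0          # hypothetical inter-test variability factor

low_cat, high_cat = classification_spread(nominal_lc50, fold)
print(f"Category range: {low_cat} to {high_cat}")
print(f"PNEC range: {pnec_from_acute(nominal_lc50 / fold):.5f} "
      f"to {pnec_from_acute(nominal_lc50 * fold):.5f} mg/L")
```

With these hypothetical inputs, the same chemical would fall into Acute 1 or Acute 2 depending on which test result is used, and the PNEC would differ by the square of the variability factor, i.e., 9-fold.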