Drivers of divergent assessments of bisphenol-A hazards to semen quality by various European agencies, regulators and scientists

A. Kortenkamp et al.


Introduction
Exposure to bisphenol-A (BPA), a synthetic compound used in the production of polycarbonate plastics, epoxy resins and thermal papers, is widespread, and there is concern about associated human health effects. Recently, the European Food Safety Authority (EFSA) published a revised BPA health-based guidance value (HBGV) based on immunotoxicity (EFSA 2023), 20,000 times lower than their previous temporary tolerable daily intake (TDI). However, with the publication of an alternative TDI by the German Federal Institute for Risk Assessment (Bundesinstitut für Risikobewertung, BfR), the effects of BPA on male reproductive health have also come into focus. The BfR alternative TDI is derived from effects of BPA on sperm counts (BfR 2023) and is 1000-fold higher than the EFSA HBGV.
The new HBGV of 0.2 ng/kg/d developed by the EFSA Panel on Food Contact Materials, Enzymes and Processing Aids is 20,000-fold lower than their previous temporary TDI. Based on an extensive review of relevant studies, EFSA found that BPA exposure was related to a broad spectrum of harmful effects, the most sensitive outcome (critical toxicity) being an increase in the number of a specific type of immune cell involved in inflammatory diseases and obesity (T-helper cells). The new HBGV is derived from a mouse study which demonstrated a rise in T-helper cells (Th17) at very low BPA exposures. As European populations are predicted to experience exposures 400 to 1,200 times above the new HBGV, pressure to further regulate BPA, e.g. by eliminating it from food contact materials, is likely to increase.
However, the strong downward correction of EFSA's temporary TDI has drawn criticism from the European Medicines Agency (EFSA and EMA, 2022) and the BfR (EFSA and BfR, 2022), among others. The disagreements between EFSA, EMA and BfR concern several aspects of EFSA's re-evaluation but, in essence, derive from the refusal of both EMA and BfR to accept effects on T-helper cells as the basis for establishing a new HBGV.
The BfR has now countered the EFSA HBGV by deriving an alternative, 1000-fold higher TDI of 200 ng/kg/d (BfR 2023). The BfR value is based on the deterioration of sperm counts in animal studies as the critical effect and is the result of a probabilistic approach to hazard characterization according to principles elaborated by Chiu and Slob (2015) and WHO/IPCS (2018). Compared to the standard deterministic approach, in which uncertainty and variability are dealt with by combining default extrapolation factors, probabilistic approaches have the advantage of quantifying the degree of uncertainty inherent in estimates of "safe" exposures, but little guidance exists as to when to use them for deriving HBGVs. Current European exposures are predicted to be lower than, or slightly in excess of, 200 ng/kg/d; consequently, the BfR TDI would likely not trigger any regulatory measures to protect against risks from BPA exposures.
While EFSA developed their revised HBGV, we were finalising a systematic literature review to establish a BPA reference dose (RfD) for deteriorations of semen quality (Kortenkamp et al. 2022 a). This RfD was intended to support a mixture risk assessment focused on male reproductive health which included BPA and a range of other chemicals (Kortenkamp et al. 2022 b). As a reasonable estimate of BPA exposures without effects on semen quality, this value does not take account of other toxicities and therefore cannot provide the degree of protection expected from a normative value such as a HBGV. Our BPA RfD was derived specifically from animal studies of gestational (i.e., the period from conception to birth) and neonatal (i.e., first 28 days of life) BPA exposures and effects on semen quality. Using deterministic approaches, we chose 3 ng/kg/d as the midpoint of a range of possible values derived from relevant studies. Our interest was not in defining a value that protects against the whole range of BPA toxicities, but in estimating an exposure no longer associated with declines in semen quality, in line with the goal of our mixture risk assessment.
In contrast to our evaluation of gestational BPA exposures and semen quality (Kortenkamp et al. 2022 a), EFSA (2023) came to the view that deteriorations of semen quality resulting from gestational or post-natal BPA exposures are unlikely to be of relevance to human health and are therefore of no consequence for the derivation of a new HBGV. However, they concluded that BPA exposures in adulthood can lead to poor semen quality. Yet applying the same assessment factors chosen by EFSA for the derivation of their HBGV to the key semen quality study selected by them (Wang et al., 2016) produces an RfD of 1 ng/kg/d for diminished sperm motility, below our RfD of 3 ng/kg/d.
Another area of dispute between EFSA and BfR concerns the use of assessment factors to support species-to-species extrapolations. Because of interspecies differences in toxicokinetics, doses in animal experiments must be converted into so-called human equivalent doses. BfR have argued that the factors used by EFSA for extrapolations from mouse studies are too high, leading to excessively low extrapolated human doses. This dispute is of consequence because the key study supporting EFSA's new HBGV was conducted in mice.
Because of BfR's derivation of an alternative TDI, estimates of "safe" human exposures protecting against BPA effects on semen quality now range from 1 ng/kg/d to 200 ng/kg/d, a difference of more than two orders of magnitude. This prompted us to identify the factors that drive these divergent hazard characterisation outcomes. Accordingly, in this paper we focus solely on semen quality and do not consider the dispute between EFSA, EMA and BfR concerning BPA effects on the immune system.
The following questions guide our evaluation: (1) What are the factors that drive the diverging views on the relevance of BPA exposures during gestation or adulthood for deteriorations of semen quality? (2) What motivated the choice of species-to-species extrapolation factors, specifically for mouse-to-human extrapolations? (3) What guided the choice of critical studies for the estimation of RfDs or TDIs, and therefore determined their numerical values? (4) How do the various estimates of "safe" human BPA exposures relate to the available epidemiological evidence of associations between BPA and deteriorations of semen quality?

Materials
We based our comparative assessment on the following publications: the EFSA re-evaluation of the risks to public health related to the presence of BPA in foodstuffs (EFSA 2023) and associated technical documents such as the accompanying hazard assessment protocol (EFSA 2017); the BfR statement on a HBGV for BPA of 19 April 2023 (BfR 2023); a statement detailing the differences in opinion between EFSA and BfR (EFSA and BfR 2023); and the material presented in our systematic review of BPA and declining semen quality after gestational exposures (Kortenkamp et al. 2022 a).
We conducted probabilistic hazard characterizations according to the principles elaborated by Chiu and Slob (2015) and WHO/IPCS (2018), using the APROBA PLUS tool, version 1.14, available from https://www.rivm.nl/en/aproba-plus. As recommended by EFSA, we used a body weight of 70 kg as the default value for the adult human, 0.5 kg for the rat and 0.022 kg for the mouse. Beausoleil et al. (2022) conducted an appraisal of the reproductive and developmental toxicity of bisphenol S (BPS) after developmental exposures. Since many of the relevant BPS studies also investigated BPA, we used this paper as an additional basis for comparison.
A note on terminology is required. We use the term "health-based guidance value" (HBGV) synonymously with "tolerable daily intake" (TDI) to denote regulatory values based on Points of Departure (PoD) for the first toxic effect observed as exposures are escalated, the so-called "critical toxicity". The term "reference dose" (RfD) is reserved for values derived from PoDs for toxicities that materialize at higher doses than the critical toxicity.

Evidence synthesis: The relevance of gestational exposures and exposures in adulthood for deriving a BPA reference dose
To assess whether the evidence from animal studies was sufficient to infer a causal link with effects in humans, EFSA scrutinised both the internal validity of the studies (is the work of good technical quality?) and their external validity (are the animal models and effects relevant to human health?). Applying the approaches developed in NTP OHAT (2015) for assessing internal validity, EFSA considered key evaluation elements related to controlling various forms of bias, endpoint by endpoint, in a tiered system. The aspects assessed included randomisation procedures during the allocation of animals to dose groups, comparability and consistency of experimental conditions across groups, complete data reporting, details of exposure characterization and outcome assessment, and statistical methods, including power considerations (appropriate numbers of animals). BfR adopted the EFSA confidence rating scheme in broad terms, albeit not in detail. We show that this tiered system led to evidence indicative of hazards from gestational BPA exposures being disregarded.
EFSA employed three Tiers, with Tier 1 corresponding to high and Tier 3 to low reliability; Tier 2 occupies a middle position. Placements of studies in these Tiers were translated into study confidence ratings, which formed the basis for integrating the evidence into judgements about the likelihood of an effect occurring in humans. In this step, EFSA used the "hazard identification conclusion categories" of "very likely", "likely", "as likely as not", "unlikely" and "not classifiable". Hazard characterizations (dose-response analyses) were performed only for endpoints classed as "very likely" or "likely" (EFSA 2017). To provide a sound basis for dose-response analyses, EFSA only considered studies that included a vehicle-exposed control and at least 3 BPA doses; all other studies were excluded from the analysis. Studies placed in Tier 3 were not taken forward into hazard characterization (dose-response analysis) and were excluded from further consideration.
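To make the sequence of exclusion steps explicit, the sketch below encodes them as a single filter. It is a simplification under our own assumptions: the record fields and study names are hypothetical, and EFSA's actual procedure involves expert judgement at each step rather than a mechanical rule.

```python
from dataclasses import dataclass

# Hazard identification conclusions that lead on to dose-response analysis (EFSA 2017).
CARRIED_FORWARD = {"very likely", "likely"}


@dataclass
class Study:
    name: str                  # hypothetical identifier, not a real study
    tier: int                  # 1 = high, 2 = medium, 3 = low reliability
    n_bpa_doses: int           # number of BPA dose groups besides the control
    has_vehicle_control: bool


def enters_dose_response(study: Study, endpoint_conclusion: str) -> bool:
    """Simplified screen: the endpoint must be rated 'likely' or 'very likely',
    and the study must have a vehicle control, >= 3 BPA doses and not be Tier 3."""
    return (
        endpoint_conclusion in CARRIED_FORWARD
        and study.has_vehicle_control
        and study.n_bpa_doses >= 3
        and study.tier != 3
    )


studies = [
    Study("study_A", tier=1, n_bpa_doses=3, has_vehicle_control=True),
    Study("study_B", tier=1, n_bpa_doses=2, has_vehicle_control=True),  # too few doses
    Study("study_C", tier=3, n_bpa_doses=4, has_vehicle_control=True),  # Tier 3
]
kept = [s.name for s in studies if enters_dose_response(s, "likely")]
print(kept)  # ['study_A']
```

A study that clears every other hurdle is still dropped if it is placed in Tier 3 or tests only two doses, and no study passes at all once the endpoint itself has been rated "unlikely".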
EFSA considered sperm counts, sperm morphology, sperm motility and sperm viability separately, study by study. Studies were further grouped according to the timing of BPA exposures into developmental exposures (gestational and neonatal), developmental and adult exposures, exposures during the growth phase of animals and exposures in adulthood.
EFSA judged associations between BPA exposures during development and effects on all four indicators of semen quality (i.e., count, morphology, motility and viability) as "unlikely", because neither of the studies which they rated as reliable (Tier 1) detected any of these effects (Camacho et al., 2019). According to EFSA, all studies that demonstrated effects on semen quality after developmental exposures suffered from insufficiencies which led to a downgrading to Tier 3, the lowest degree of reliability (e.g., inadequacies regarding blinding to dose group allocation or outcome measurement, or test system contamination through BPA-containing caging). Studies placed in Tier 3 were excluded from further consideration (for example Hass et al., 2016; Kalb et al., 2016; Rahman et al., 2017; Shi et al., 2018). EFSA made similar judgements in relation to pre-natal and post-natal exposures in pups until adulthood and exposures during the growth phase (young animals). Examples of studies of gestational exposures that EFSA left out because they used fewer than 3 BPA doses include Chatsantiprapa et al. (2016), Salian et al. (2009) and Yang et al. (2015). However, EFSA missed Vilela et al. (2014), a mouse study with 3 different BPA doses. Other papers, such as Shi et al. (2019) and Ullah et al. (2019), were not considered as they appeared after EFSA's evaluation period, which ended in 2018.
There is a considerable body of work on gestational BPA exposures and deteriorations of semen quality in animal studies. In our systematic review (Kortenkamp et al. 2022 a) we identified 26 relevant studies, published up to August 2021, that analysed total sperm count, sperm concentration, motility, morphology or vitality as outcome measures. We followed a tiered evaluation system similar to that of EFSA but, unlike EFSA, did not evaluate semen quality parameters separately. Furthermore, we did not restrict the analysis to studies with at least 3 different BPA doses but considered all the evidence. We added key appraisal elements explicitly relating to the use of phytoestrogen-free feed, inclusion of positive controls and control of BPA contamination (these aspects were lumped together in the exposure characterisation key element of EFSA's scheme). Because quality of reporting does not necessarily equate with quality of conduct, absence of information did not automatically lead to a downgrading to Tier 3 in our approach. Only when there was positive evidence that phytoestrogen-free feed had not been used, or that contamination had been insufficiently controlled, did we place studies in Tier 3. Before embarking on quantitative assessments, we assessed the strength of evidence for effects after gestational BPA exposures qualitatively.
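The difference in how missing information is treated can be illustrated with two hypothetical downgrading rules. Neither function reproduces the full EFSA scheme or that of Kortenkamp et al. (2022 a); the element names and the three-valued encoding (adequate, deficient, not reported) are assumptions made for illustration only.

```python
from typing import Dict, Optional

# Key appraisal elements per study: True = adequate, False = positive evidence
# of a deficiency, None = not reported. Element names are hypothetical.
Appraisal = Dict[str, Optional[bool]]


def downgrade_if_unreported(elements: Appraisal) -> bool:
    """Stricter rule: anything not shown to be adequate (False or None) -> Tier 3."""
    return any(value is not True for value in elements.values())


def downgrade_only_on_evidence(elements: Appraisal) -> bool:
    """Rule described in the text for our review: only a documented deficiency -> Tier 3."""
    return any(value is False for value in elements.values())


example = {
    "phytoestrogen_free_feed": None,      # feed composition not reported
    "bpa_contamination_controlled": True,
    "blinded_outcome_assessment": True,
}
print(downgrade_if_unreported(example), downgrade_only_on_evidence(example))  # True False
```

The same study thus lands in different tiers depending solely on whether silence in the methods section is read as a deficiency.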
For synthesising the evidence, we utilised the framework of Radke et al. (2018), which we modified according to the approach detailed in EFSA (2017). In this system, the strength of evidence is categorised as "robust" when there are sets of studies with a Tier 1 confidence rating and consistent findings of adverse effects across multiple laboratories and species. With our approach, 11 of the 26 studies obtained a high confidence rating. Of these, 10 reported effects of BPA on semen quality parameters, in multiple strains of rats and mice. We therefore evaluated the overall strength of evidence for associations of BPA exposures during gestation with poor semen quality as "robust" (Kortenkamp et al. 2022 a). This category is similar to EFSA's "likely" rating. We did not consider BPA effects on semen quality after adult exposures.
EFSA rated associations between BPA exposures in adulthood and effects on semen quality as "likely", but only for declines in sperm motility and sperm viability, not for other measures of semen quality. Declines in sperm motility and viability were observed in one mouse study (Wang et al., 2016), which EFSA considered well conducted and reliable (Tier 1). Declines in sperm counts were rated as unlikely, as these were seen neither in Wang et al. (2016) nor in one study placed by EFSA in Tier 2 (Yin et al., 2017). EFSA took the Wang et al. study forward into hazard characterization and conducted detailed dose-response analyses.
BfR used a confidence rating scheme quite similar to that employed by EFSA, with endpoint-by-endpoint evaluations. Like EFSA, they did not consider studies with fewer than 3 different BPA doses and dismissed the relevance of developmental exposures, but without any review of the literature. In agreement with EFSA, BfR regarded BPA effects on semen quality after exposures in adulthood as relevant. In contrast to EFSA, BfR viewed effects on sperm motility as unlikely but rated BPA effects on sperm counts as critical for deriving a TDI. These differences result from contrasting views of the reliability of the Wang et al. (2016) study. Noting that Wang et al. did not provide information about background BPA contamination via water bottles, caging and diet, BfR placed that study in their Tier 3 and excluded it from further consideration. This exposes an inconsistency in EFSA's confidence rating approach: missing information about BPA contamination led EFSA to downgrade some other studies to Tier 3, but the same criterion was not applied to Wang et al. (2016).
BfR based their TDI derivation on two studies of BPA effects on sperm counts after adult exposures, those by Liu et al. (2013) and Srivastava and Gupta (2018), both placed in their Tier 2. EFSA missed evaluating Liu et al. (2013) but, noting insufficient information about blinding and deficiencies in the reporting of background contaminants, downgraded Srivastava and Gupta (2018) to Tier 3. The details provided by Srivastava and Gupta (2018) allow the inference that the animal cages (polypropylene) did not leach BPA, but it is indeed unclear whether the animal feed in their study was free of phytoestrogens. Liu et al. (2013) provided evidence that the animal feed used in their study was phytoestrogen-free, but there is insufficient information about the nature of their caging and about blinding. Application of EFSA's criteria would therefore have led to a downgrading of Liu et al. (2013) to Tier 3.
The differences in these evaluations are summarized in Table 1.

Conversion factors for mouse-to-human extrapolations
Rodents usually eliminate polar substances much faster than humans. To attain comparable tissue concentrations, rodents must therefore receive correspondingly higher doses than humans. To quantify these toxicokinetic interspecies differences, the areas under the blood concentration-time curves (AUC) of a chemical are compared after administration of similar doses; faster metabolism and elimination give smaller AUC values. An animal-to-human extrapolation factor, the human equivalent dose factor (HEDF), is then derived as the ratio of the AUC of the species in question to the AUC measured in humans. Multiplying the animal point of departure by the HEDF gives a human equivalent dose (HED). The inverse of a BPA-specific HEDF can then be used as a substitute for the standard toxicokinetic assessment factor (AF) for interspecies extrapolation, which is set to 4 by default (EFSA 2012). Critical for deriving a HEDF for BPA are measurements of serum concentrations of free, unconjugated BPA.
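The relationships described above can be summarised as follows. Here AUC denotes the area under the serum concentration-time curve of unconjugated BPA after equal (or dose-normalised) oral doses, and PoD is the animal dose used as the point of departure; the symbols are our shorthand rather than notation taken from EFSA (2023) or BfR (2023).

```latex
\mathrm{HEDF} = \frac{\mathrm{AUC}_{\mathrm{animal}}}{\mathrm{AUC}_{\mathrm{human}}}, \qquad
\mathrm{AF}_{\mathrm{TK}} = \frac{1}{\mathrm{HEDF}}, \qquad
\mathrm{HED} = \mathrm{PoD}_{\mathrm{animal}} \times \mathrm{HEDF}
             = \frac{\mathrm{PoD}_{\mathrm{animal}}}{\mathrm{AF}_{\mathrm{TK}}}
```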
Here, we argue that EFSA's choice of data for deriving a HEDF, although erring on the side of caution, is not well supported by the empirical evidence, while BfR's approach is more realistic.
EFSA (2023) derived a HEDF of 0.0155 for mouse-to-human extrapolations (its inverse, 64.5, was used as the AF for kinetic interspecies extrapolation), based on an AUC of 0.244 nmol·h/L in mice and 15.7 nmol·h/L in humans. EFSA took the mouse AUC value from a study by Doerge et al. (2011); the human AUC value was obtained as the median of two studies in volunteers (Teeguarden et al., 2015; Thayer et al., 2015).
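The arithmetic of EFSA's mouse-to-human factor can be checked directly from the two AUCs quoted above. This is a minimal sketch that only restates those numbers; the variable names are our own.

```python
# Values quoted in the text for unconjugated BPA (nmol·h/L).
AUC_MOUSE = 0.244   # Doerge et al. (2011)
AUC_HUMAN = 15.7    # median of Teeguarden et al. (2015) and Thayer et al. (2015)

hedf = AUC_MOUSE / AUC_HUMAN       # human equivalent dose factor
af_unrounded = 1 / hedf            # inverse of the exact ratio
af_rounded = 1 / round(hedf, 4)    # inverse of the HEDF rounded to 0.0155

print(f"HEDF = {hedf:.4f}; AF = {af_unrounded:.1f} (unrounded), {af_rounded:.1f} (from 0.0155)")
```

The exact ratio corresponds to an AF of about 64.3; the figure of 64.5 arises from inverting the HEDF after rounding it to 0.0155. The difference is immaterial for the resulting reference doses.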
BfR argued that the […] Further disputes concern whether the free BPA AUC in mice increases linearly with the BPA dose, and therefore whether a simple adjustment to a BPA dose of 100 μg/kg is justified to enable comparisons between studies that used different BPA doses. EFSA concluded that the AUCs for unconjugated BPA in the study by Taylor et al. (2011) did not increase linearly with dose, as indicated by the reported AUCs of 38.72 ng·h/mL at 400 μg/kg and 2991 ng·h/mL at 100,000 μg/kg. The latter value is three times lower than would be expected from the lower dose of 400 μg/kg if AUC increased linearly with dose. However, BfR argues that the measurement at the last time point (24 h) for the 400 μg/kg bw dose is unreliable due to analytical problems, which leads to an overestimation of the AUC. If this time point is omitted from the AUC calculation, and the concentration-time profile of the high dose is rescaled to the low dose, both AUCs are very similar. This suggests that, contrary to the points made by EFSA, there is no evidence against a linear relationship between administered dose and unconjugated serum BPA in adult female mice throughout the entire 24 h period after oral administration, over a large dose range. This is supported by the linear correlation between four intake doses and the corresponding unconjugated serum levels measured 24 h after administration (Taylor et al., 2011). The linearity assumption is critical, as the BPA doses in Taylor et al. and Sieli et al. were 4-, 200- and 1000-fold higher than in Doerge et al. (2011).
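The arithmetic behind EFSA's objection can be reproduced from the two AUCs quoted above. This minimal sketch only restates those numbers; it does not apply BfR's correction (omitting the 24 h time point), because the underlying concentration-time data are not reproduced here.

```python
# AUCs for unconjugated BPA from Taylor et al. (2011) as quoted in the text (ng·h/mL).
auc_by_dose = {400: 38.72, 100_000: 2991.0}   # oral dose (μg/kg) -> AUC

# If AUC rises linearly with dose, AUC/dose is constant, so rescaling every AUC
# to 100 μg/kg should give similar values.
normalised = {dose: auc * 100 / dose for dose, auc in auc_by_dose.items()}

# Linear extrapolation from the 400 μg/kg group to 100,000 μg/kg.
expected_high = auc_by_dose[400] * 100_000 / 400
ratio = expected_high / auc_by_dose[100_000]

print(normalised)                          # about 9.68 vs 2.99 ng·h/mL per 100 μg/kg
print(f"expected/observed = {ratio:.2f}")
```

The roughly threefold shortfall is the discrepancy EFSA interpreted as non-linearity; BfR attributes it instead to an overestimated AUC in the 400 μg/kg group.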
In addition, EFSA considered the study by Sieli et al. (2011) to be unreliable due to irregular concentration-time profiles of unconjugated serum BPA. They argued that the data do not allow a clear separation of absorption and distribution kinetics from the elimination process. However, we find it difficult to recapitulate this point from the reported data. The key data from the various toxicokinetic studies are summarized in Table 2.
In summary, it appears that EFSA erred on the side of caution by relying exclusively on the data in one study, that by Doerge et al., and

Studies critical for the derivation of a reference dose or TDI in relation to semen quality deteriorations
We present both deterministic and probabilistic approaches to deriving RfDs for BPA effects on semen quality. The application of probabilistic approaches in hazard characterization is relatively new, and to our knowledge, no guidance exists about when to favour probabilistic over the more familiar deterministic approaches in deriving HBGVs. Probabilistic approaches have considerable advantages in terms of clear communication of the risks associated with chemical exposures (Chiu and Slob 2015). We argue that, with both deterministic and probabilistic approaches, considerably lower TDIs can be justified than the value proposed by BfR.

Deterministic approaches
In the key study of declines of semen quality in adulthood identified by EFSA (2023), C57BL/6 mice were dosed orally for 8 weeks with BPA at 10, 50 and 250 μg/kg/d (Wang et al., 2016). Decreases in sperm motility were observed at all doses; sperm viability was affected at 250 μg/kg/d. The lowest-observed-adverse-effect level (LOAEL) in this study is 10 μg/kg/d. Benchmark dose modelling conducted by EFSA produced a benchmark dose lower limit associated with a 20% effect (BMDL20) of 3.41 μg/kg/d for declines in sperm motility and 26.1 μg/kg/d for reduced sperm viability. In a deterministic approach similar to that used by EFSA for the derivation of the immunotoxicity HBGV, a mouse HEDF of 0.0155 (equivalent to an interspecies AF of 64.5) translates the BMDL20 for reduced sperm motility into a human equivalent dose (HED) of 0.053 μg/kg/d. Applying an AF of 2 for extrapolation from subchronic to chronic exposure, and further AFs for interspecies differences in toxicodynamics (2.5) and between-human differences in toxicokinetics (3.2) and toxicodynamics (3.2), produces an RfD of 0.001 μg/kg/d (Table 3).
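The deterministic chain can be checked step by step. The sketch below simply multiplies and divides the values stated in the text; the dictionary keys are our own labels.

```python
from math import prod

BMDL20_MOTILITY = 3.41   # μg/kg/d, EFSA BMD modelling of Wang et al. (2016)
HEDF_MOUSE = 0.0155      # EFSA mouse-to-human kinetic factor

# Remaining assessment factors as listed in the text.
assessment_factors = {
    "subchronic_to_chronic": 2.0,
    "interspecies_toxicodynamics": 2.5,
    "intraspecies_toxicokinetics": 3.2,
    "intraspecies_toxicodynamics": 3.2,
}

hed = BMDL20_MOTILITY * HEDF_MOUSE               # human equivalent dose
combined_af = prod(assessment_factors.values())  # 2 x 2.5 x 3.2 x 3.2
rfd = hed / combined_af

print(f"HED = {hed:.3f} μg/kg/d; combined AF = {combined_af:.1f}; RfD = {rfd:.4f} μg/kg/d")
```

Rounded, this reproduces the HED of 0.053 μg/kg/d and the RfD of 0.001 μg/kg/d; the overall reduction from the mouse BMDL20 is roughly 3,300-fold (64.5 × 51.2).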
EFSA based their BMD calculation on a relatively large benchmark response (BMR) of 20%, which might be considered too high. We therefore made our own estimations using the LOAEL from Wang et al. (2016), from which a deterministic approach produced an RfD of 0.0005 μg/kg/d, slightly lower than the deterministic value of 0.001 μg/kg/d derived from EFSA's benchmark dose modelling (Table 3).
In the second key study selected by BfR, Srivastava and Gupta (2018) treated adult Wistar albino rats orally with 50, 500 and 1000 μg/kg/d. Significantly decreased sperm counts were observed at 500 μg/kg/d; accordingly, the NOAEL is 50 μg/kg/d. By application of a deterministic approach, BfR calculated an RfD of 0.163 μg/kg/d (Table 3). Benchmark dose modelling using frequentist model averaging produced unstable estimates, and Bayesian model averaging was not possible.

Notes to Table 2: AUC values adjusted to a dose of 100 μg/kg/d; values from Sieli et al. (2011) and Taylor et al. (2011) converted to molar units.

Table 3
Derivation of BPA reference doses for semen quality by deterministic approaches.
In addition to the key studies selected by EFSA and BfR, we analysed two further studies of gestational exposures to BPA in rodents (Shi et al., 2018; Ullah et al., 2019). These data featured in our estimation of a BPA RfD for declines in semen quality for the purposes of a mixture risk assessment (Kortenkamp et al. 2022 a). In our evaluation, both studies were placed in Tier 1, but we noted the lack of information about phytoestrogen-free feeds. In a recent evaluation of BPS by Beausoleil et al. (2022), both studies were rated as high-quality key studies (Shi et al. (2018) and Ullah et al. (2019) also investigated BPS). EFSA (2023) placed the work by Shi et al. (2018) in their Tier 3 because of lacking documentation about background contamination and blinding during outcome measurement. In their evaluation of effects on testis histology, however, they gave Shi et al. (2018) a higher confidence rating (Tier 2). Ullah et al. (2019) was not evaluated by EFSA. Shi et al. (2018) dosed pregnant CD-1 mice orally with 0.5, 20 and 50 μg/kg/d BPA from gestational day 11 to birth. Sperm counts were measured when the male offspring were 60 days old and were found to be significantly decreased at the two lower doses, but not at the highest dose. Accordingly, the LOAEL is 0.5 μg/kg/d, which by application of the usual AFs gives a deterministic RfD of 0.00003 μg/kg/d (Table 3). Ullah et al. (2019) administered BPA in drinking water to pregnant Sprague-Dawley rats from conception to birth, at concentrations of 5, 25 and 50 μg/L. Significant reductions in sperm counts became apparent at 50 μg/L; accordingly, the NOAEL from this study is 25 μg/L.
Assuming a daily water consumption of 6 mL per 100 g of body weight and a body weight of 400 g, the NOAEL translates into a dosage of 1.1 μg/kg/d. From this, a deterministic RfD of 0.004 μg/kg/d can be derived (Table 3).

Probabilistic approaches
Based on the BMDL10 from Bayesian model averaging (26 μg/kg/d) derived from Liu et al. (2013), BfR conducted their own probabilistic calculations, which yielded an RfD of 2 μg/kg/d (lower confidence limit, LCL; Table 4). In these calculations, BfR changed the default distribution parameters in the APROBA tool (WHO/IPCS, 2018) as follows: instead of choosing the APROBA settings for extrapolation from rat to human, they constructed their own BPA-specific log-normal distribution, derived from all possible pairings of three rat AUC values (Doerge et al., 2011; Pottenger et al., 2000; Domoradzki et al., 2003) with two human AUC values (Thayer et al., 2015; Teeguarden et al., 2015). This generated a pool of six HEDFs, ranging from 0.11 to 1.58, from which the AF distribution for interspecies extrapolation in APROBA was derived (P05 = 0.64; P95 = 6.14).
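The construction of a BPA-specific interspecies factor from pooled AUC ratios can be sketched as follows. This is our reconstruction under stated assumptions: the fitting method (mean and sample standard deviation of log-transformed AFs) is assumed rather than taken from BfR (2023), and the AUC inputs are arbitrary placeholders because the individual rat and human AUCs are not reproduced in this section.

```python
import math
from itertools import product
from statistics import NormalDist, mean, stdev


def pooled_af_distribution(rat_aucs, human_aucs):
    """Pair every rat AUC with every human AUC, convert each HEDF to an AF (1/HEDF)
    and fit a log-normal by the mean and sample SD of log(AF).
    Returns the HEDFs and the fitted 5th and 95th percentiles of the AF."""
    hedfs = [rat / human for rat, human in product(rat_aucs, human_aucs)]
    log_afs = [math.log(1 / h) for h in hedfs]
    mu, sigma = mean(log_afs), stdev(log_afs)
    z = NormalDist().inv_cdf(0.95)
    return hedfs, math.exp(mu - z * sigma), math.exp(mu + z * sigma)


# Placeholder AUCs (arbitrary units), NOT the values from the cited studies.
hedfs, p05, p95 = pooled_af_distribution([1.0, 2.0, 4.0], [2.0, 4.0])
print(len(hedfs), min(hedfs), max(hedfs), round(p05, 2), round(p95, 2))
```

With three rat and two human AUCs the pool necessarily contains six HEDFs, so the fitted percentiles rest on very few values and depend directly on which studies enter the pool.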
To provide a counterpoint to their own calculations, BfR also applied the default WHO APROBA settings to the Bayesian model averaging BMDL10 from Liu et al. (2013) (26 μg/kg/d), i.e., without substituting their BPA-specific kinetic interspecies AF distribution. This produced a lower value of 0.14 μg/kg/d.
With the NOAEL of 50 μg/kg/d from Srivastava and Gupta (2018), the "in-house" BfR probabilistic method produced an RfD of 3.1 μg/kg/d (LCL), and application of what BfR referred to as the WHO approach gave 0.2 μg/kg/d (LCL) (Table 4).
These values are critical for the derivation of BfR's alternative TDI.
BfR took the "WHO approach" value of 0.14 μg/kg/d derived from Liu et al. (2013) and the corresponding value of 0.2 μg/kg/d based on Srivastava and Gupta (2018) which they consolidated into an overall TDI of 0.2 μg/kg/d.
We applied the WHO/IPCS (2018) default APROBA settings to the other points of departure from Liu et al. (2013) and to Wang et al. (2016). The tool performs interspecies extrapolations by allometric scaling. For the rat, this gives a median AF of 4.4 (P05: 3.61; P95: 5.37), lower than the value of 6.14 preferred by EFSA in its deterministic approach (EFSA 2023). With the BMDL20 of 3.41 μg/kg/d from Wang et al. (2016), we obtained an RfD of 0.0066 μg/kg/d (LCL, Table 4), nearly 7-fold higher than the deterministic value of 0.001 μg/kg/d (Table 3), mainly due to the lower interspecies AF.
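The core of a probabilistic calculation can be sketched by treating each extrapolation factor as an independent log-normal uncertainty distribution and reading off the 5th percentile of the resulting human dose. This is a simplified approximation in the spirit of Chiu and Slob (2015), not the APROBA PLUS implementation: the actual tool can also account for uncertainty in the point of departure and uses its own default distributions for the other factors, none of which are reproduced here. The second factor below is a hypothetical placeholder.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.95)   # about 1.645


def lognormal_from_percentiles(p05, p95):
    """Median and log-scale SD of a log-normal defined by its P05 and P95."""
    return math.sqrt(p05 * p95), math.log(p95 / p05) / (2 * Z95)


def lower_bound_human_dose(pod, factors):
    """Divide the PoD by the product of independent log-normal factors and return
    the 5th percentile of the result (log-scale variances add)."""
    log_median = sum(math.log(median) for median, _ in factors)
    sigma = math.sqrt(sum(sd ** 2 for _, sd in factors))
    return pod / math.exp(log_median + Z95 * sigma)


# Rat-to-human factor from the text (P05 3.61, P95 5.37); its median is about 4.4.
interspecies = lognormal_from_percentiles(3.61, 5.37)
# Hypothetical second factor, for illustration only (not an APROBA default).
other = lognormal_from_percentiles(2.0, 20.0)

print(round(interspecies[0], 2))
print(lower_bound_human_dose(26.0, [interspecies]))
print(lower_bound_human_dose(26.0, [interspecies, other]))
```

With a single factor the lower bound collapses to PoD/P95 (26/5.37, about 4.8, in this toy case); each additional uncertain factor both divides by its median and widens the combined spread, which pushes the lower bound further down.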
When the same default settings are applied to the NOAEL of 1.1 μg/kg/d from the gestational study by Ullah et al. (2019), a probabilistic RfD of 0.0024 μg/kg/d can be obtained.

Table 4
Derivation of BPA reference doses for semen quality by probabilistic approaches.
Taken together, deterministic approaches to estimating RfDs from studies of adult exposures give values between 0.001 and 0.163 μg/kg/d, while RfDs from gestational exposure studies are lower, between 0.00003 and 0.004 μg/kg/d (Table 3). The corresponding probabilistic RfDs range from 0.007 to 3 μg/kg/d for the adult exposure studies and amount to 0.0024 μg/kg/d for the gestational study by Ullah et al. (2019) (Table 4).
The alternative TDI proposed by BfR (0.2 μg/kg/d) comes rather close to the points of departure (PoDs) of some of these studies: the margin between the BfR TDI and the LOAEL from Shi et al. (2018) is only 2.5, and the margins for the PoDs from Ullah et al. (2019) and Wang et al. (2016) are 5.5 and 50, respectively.
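The margins quoted above are simple ratios of each point of departure to the BfR TDI, as this short check using only the values stated in the text shows.

```python
BFR_TDI = 0.2  # μg/kg/d

# Points of departure quoted in the text (μg/kg/d).
pods = {
    "Shi et al. (2018), LOAEL": 0.5,
    "Ullah et al. (2019), NOAEL": 1.1,
    "Wang et al. (2016), LOAEL": 10.0,
}

# Margin = PoD / TDI for each study.
margins = {study: pod / BFR_TDI for study, pod in pods.items()}
for study, margin in margins.items():
    print(f"{study}: {margin:g}")
```

Note that the Ullah et al. (2019) margin refers to a NOAEL, whereas the other two refer to LOAELs, i.e., doses at which effects were observed.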

Epidemiological studies
EFSA (2023) reviewed cohort studies of associations between BPA in spot urine samples and measures of male fertility, such as fertilization success or semen quality, and concluded that such associations are unlikely. BfR (2023) did not consider epidemiological evidence.
Most studies of BPA effects on semen quality are cross-sectional, which complicates assessments of whether the health outcomes can be attributed to chemical exposures, because exposure and health outcome are measured at the same time. Furthermore, many of these studies relied on spot urine samples, which reflect only recent exposures, whereas spermatogenesis takes up to 3 months. Referring to these difficulties, EFSA (2023) stated: "Although the associations were not entirely consistent in terms of directionality, findings from some individual studies could be interpreted as being adverse for male reproductive function." However, due to low confidence in BPA exposure assessments based on spot urine samples, EFSA placed all epidemiological studies in Tier 3, and consequently none were taken forward into hazard characterization.
Our systematic review and quality rating of human studies of BPA and semen quality (Kortenkamp et al. 2022) was not primarily motivated by hazard characterisation but aimed at evaluating the strength of evidence qualitatively ("Is BPA associated with declines in semen quality in humans?"). We identified 8 studies with null findings and 8 studies that reported associations of declining semen parameters with BPA exposures. Of the 8 null studies, 3 achieved an overall confidence rating of "medium", while we evaluated the others as either "low" or "uninformative". Among the 8 studies that found associations with BPA, 5 were rated "medium to high", 2 "medium" and one "low". Due to shortcomings in exposure characterisation (spot urine samples), no study achieved a confidence rating of "high".
We attributed the disparity between the 3 "medium" confidence null studies (Caporossi et al., 2020; Goldstone et al., 2015; Mendiola et al., 2010) and the "medium" and "medium to high" confidence studies that reported associations (Adoamnei et al., 2018; Ji et al., 2018; Lassen et al., 2014; Li et al., 2011; Meeker et al., 2010; Pollard et al., 2019; Radwan et al., 2018) to differences in exposure: all the null studies examined populations with quite low urinary BPA levels. We concluded that the differing outcomes cannot be viewed as conflicting evidence (defined as unexplained positive and negative results in similarly exposed human populations), but that the mixed results are explained by differing exposure levels, with low BPA exposures having precluded the detection of effects on semen quality.
In summary, we identified 7 independent studies with a "medium" or "medium to high" confidence rating that reported associations between urinary BPA levels and declining semen quality. According to the scheme developed by Radke et al. (2018), the overall strength of evidence can be evaluated as "robust". We noted the absence of epidemiological studies of gestational BPA exposures and declines in semen quality.
We attempted to compare BPA exposures apparently no longer associated with deteriorations in semen quality with the range of RfDs derived from animal studies. By focusing on studies among the general population (excluding occupationally exposed cohorts and populations from fertility clinics), we pinpointed four studies as useful for such comparisons: Adoamnei et al. (2018), Pollard et al. (2019), Ji et al. (2018) and Lassen et al. (2014). In these studies, urinary BPA levels were categorised into ranges of urinary concentrations. We identified the ranges no longer associated with statistically significant declines in semen quality and converted the corresponding urinary BPA concentrations into estimated daily intakes. This analysis showed that BPA exposures above a range of between 0.01 and 0.18 μg/kg/d are associated with declines in semen quality (Kortenkamp et al., 2022a). While the present EFSA HBGV of 0.2 ng/kg/d and the RfDs of between 0.001 and 0.0066 μg/kg/d derived from the Wang et al. (2016) study are below this range, the BfR TDI of 0.2 μg/kg/d exceeds these values and even falls into ranges where some studies (Adoamnei et al., 2018; Ji et al., 2018; Pollard et al., 2019) observed reduced semen quality (Fig. 1).
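The comparison above involves two steps: converting urinary concentrations into intakes, and placing regulatory values (which mix ng and μg units) against the 0.01-0.18 μg/kg/d range. The sketch below is a hypothetical illustration. The conversion is a generic urinary mass-balance (intake = concentration × daily urine volume / (body weight × fraction excreted in urine)), not necessarily the exact procedure of Kortenkamp et al. (2022a), and the urine volume, body weight and excreted fraction (`fue = 1.0`) are assumed values. The regulatory values are those quoted in the text.

```python
# Hypothetical illustration: generic urinary mass-balance conversion plus a
# comparison of regulatory values with the 0.01-0.18 ug/kg/d range quoted above.

NG_PER_UG = 1000.0
EPI_RANGE_UG = (0.01, 0.18)  # ug/kg/d, range from Kortenkamp et al. (2022a)


def daily_intake_ug_per_kg(c_urine_ug_per_l: float,
                           urine_volume_l_per_day: float,
                           body_weight_kg: float,
                           fue: float = 1.0) -> float:
    """Generic mass-balance estimate of daily intake (ug/kg bw/d).

    fue is the fraction of the dose excreted in urine; 1.0 is an ASSUMPTION.
    """
    return c_urine_ug_per_l * urine_volume_l_per_day / (body_weight_kg * fue)


def position(value_ug: float, lo: float = EPI_RANGE_UG[0], hi: float = EPI_RANGE_UG[1]) -> str:
    """Locate a value relative to the epidemiological intake range."""
    if value_ug < lo:
        return "below"
    if value_ug > hi:
        return "above"
    return "within"


regulatory_values_ug = {
    "EFSA HBGV": 0.2 / NG_PER_UG,  # 0.2 ng/kg/d expressed in ug/kg/d
    "Wang et al. RfD, lower": 0.001,
    "Wang et al. RfD, upper": 0.0066,
    "BfR TDI": 0.2,
}

# HYPOTHETICAL inputs: 2 ug/L urinary BPA, 1.6 L urine per day, 70 kg body weight.
example_intake = daily_intake_ug_per_kg(2.0, 1.6, 70.0)
print(f"example intake: {example_intake:.3f} ug/kg/d ({position(example_intake)} the range)")

efsa = regulatory_values_ug["EFSA HBGV"]
for name, value in regulatory_values_ug.items():
    print(f"{name}: {value:g} ug/kg/d, {position(value)} the range, "
          f"{value / efsa:.0f}x the EFSA HBGV")
```

The fold differences relative to the EFSA HBGV (5, 33 and 1000) are simply ratios of the quoted values. Because spot urine samples were used, any such conversion carries considerable uncertainty.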

Discussion
Our analysis exposes the approaches used for evaluating the likelihood of effects of BPA exposures on human health as problematic. Driven by the need for the dose-response analyses and hazard characterisations essential for deriving a HBGV or TDI, EFSA (2023) and BfR (2023) adopted a parcellated and fragmented endpoint-by-endpoint study evaluation scheme in which entire ranges of effects were lost from consideration. Particularly striking is the omission of epidemiological evidence from further analysis; these data did not feature in uncertainty considerations either. The same applies to the evidence of the effects of gestational exposures in animal studies. A true synthesis of all the evidence was not achieved with these approaches. While these omissions did not affect the revised EFSA HBGV (as it is derived from immunotoxicity), they somewhat compromise the scientific credibility of the alternative BfR TDI and expose the fact that many decision points in the entire process appear to be value judgements rather than purely scientific considerations. An important question will be how these schemes can be improved to avoid such paradoxical situations and to clearly demarcate science from value judgements.

Study evaluation schemes and gestational versus adult BPA exposures
Considering the importance of the gestational period in determining semen quality, it is striking that neither EFSA nor BfR evaluated more closely the evidence for deteriorations in semen quality after gestational BPA administration.
Gestational BPA exposures are critical because of their possible impacts on germ stem cell populations, which are established in fetal and neonatal life. Only after this period can spermatogenesis begin. Disruption of these processes can have life-long, irreversible effects.
Several independent studies from multiple laboratories support the conclusion that BPA produces qualitative and quantitative alterations of spermatogenesis by disrupting germ cell meiosis, apoptosis and other processes. These alterations materialise as decreased number and motility of spermatozoa (Salian et al., 2009; Shi et al., 2018, 2019; Ullah et al., 2019; Vilela et al., 2014; Yang et al., 2015) and as changes in testicular histology, especially of stages VII-VIII of the spermatogenic cycle of the seminiferous epithelium (Rahman et al., 2017; Salian et al., 2009; Shi et al., 2018, 2019; Ullah et al., 2019). Further effects include disruption of the epigenetic programming necessary for spermatogenesis, as evidenced by changes in the expression of DNA methyltransferase genes (Shi et al., 2019), and increased oxidative stress in testicular tissues (Ullah et al., 2019).
Physiologically, the consequences of BPA exposures during gestation manifest as a spectrum of inter-related effects. In recognition of this, it makes little sense to evaluate each component of this spectrum separately, one by one (sperm counts, sperm motility, hormonal changes, testis histology etc.), to arrive at conclusions about the relevance of effects for human hazard characterisation. The reductionist, fragmented, endpoint-oriented study evaluation system used by EFSA (2023) has obscured the overall effect pattern and has led to the omission of gestational BPA exposures from further consideration. BfR (2023) have uncritically adopted EFSA's conclusions on this matter, without further assessment of the literature, and focused solely on the consequences of BPA exposures in adulthood.
In EFSA's and BfR's study quality rating systems, many academic studies were downgraded to Tier 3. Often, these decisions were based on deficits in the reporting of concealment methods (randomisation and blinding) and shortcomings in the characterisation of background contamination. The study rating applied by EFSA was not consistent across endpoints: some studies downgraded to Tier 3 for certain endpoints were placed in Tier 2 for others, despite having the same "deficits" (Shi et al., 2018 is an example: Tier 3 for sperm counts, Tier 2 for testis histology changes).
As the differences between EFSA and BfR in the quality rating of the studies by Wang et al. (2016) and Srivastava and Gupta (2018) show, the system does not necessarily produce consistent evaluations. The differences between our study quality ratings of some gestational BPA studies (Kortenkamp et al., 2022a) and those of EFSA (2023) also show that the system is open to interpretation, complicated by the fact that the reporting of study details is often unclear and evidence for any existing bias is frequently indirect. In any case, by excluding the Wang et al. study of sperm motility from further consideration, BfR set themselves on a path of considering only studies of sperm counts that reported effects at higher doses.
The diverging interests of academic researchers and risk assessors contribute to these problems. The primary motivation of many academic studies lies in elucidating mechanisms, rather than "merely" providing data for hazard identification and dose-response analysis. The pressures of academic publishing are such that "descriptive" studies such as dose-response studies are viewed as being of limited originality and are accordingly unlikely to produce high-impact papers. As a result, issues that take centre stage in regulatory testing, such as randomisation and blinding, are neglected in academic studies. On the other hand, authors of academic studies often do not fully appreciate how much the impact of their work could be increased if only they paid more attention to issues such as exposure characterisation, inclusion of positive controls and blinding. A solution might be to implement quality criteria for submissions to toxicology journals that take account of regulatory requirements, as elaborated by Martin et al. (2019).

Study selection
Several factors explain the rather high value of BfR's alternative TDI. Apart from their refusal to accept immunotoxic effects as the basis for deriving a HBGV (not further discussed here), BfR's choice of Liu et al. (2013) and Srivastava and Gupta (2018) as the key studies for semen quality deteriorations, rather than Wang et al. (2016) as in EFSA (2023), is critical. Their selection of benchmark dose modelling with Bayesian model averaging as the basis for a regulatory value and their use of rather low inter-species and intra-species AFs in probabilistic modelling also drove up their TDI. The use of such low factors is a value judgement, poorly supported by the available evidence (Martin et al., 2013). Conversely, EFSA's adoption of Wang et al. (2016) as the key study and their choice of a rather high inter-species AF for mouse-to-human extrapolations led to correspondingly lower RfDs for semen quality deteriorations. Similarly, our preference for deriving an RfD from gestational BPA exposures (Kortenkamp et al., 2022a) has produced a value considerably lower than BfR's alternative TDI. The differences between our estimate and the values derived by EFSA (2023) for adult exposures from Wang et al. (2016) are far less pronounced.
BfR's and EFSA's conclusion that there is sufficient evidence from animal studies to assume a causal link between adult BPA exposures and deteriorations of semen quality is well founded. If this is accepted as the basis for further deliberations, the question becomes whether to choose Wang et al. (2016) or Liu et al. (2013) as the key study for deriving a regulatory value. As already discussed, both studies have some deficiencies, but overall it is arguable that they are of sufficient quality to support the derivation of regulatory values. To meet the demand for being sufficiently protective, the deciding factor must then be that Wang et al. (2016) demonstrated effects at lower doses. An additional consideration should be that dose-response modelling is better supported by the data from Wang et al., because all chosen doses were associated with effects, whereas only the highest dose in Liu et al. produced observable effects.

Dose-response analyses
This latter factor is at the heart of some contradictions that arise from the use of the Liu et al. data by BfR in their benchmark dose (BMD) modelling. According to EFSA (2022) guidance, at least three doses associated with responses significantly different from each other are needed for reliable BMD estimations. Under certain circumstances, data with only two statistically significantly different responses can also be considered adequate. Since only the largest of three doses produced an effect, the Liu et al. data conform with these requirements. However, the resulting low degree of definition and granularity in the dose-response data severely limits the choice of suitable regression models for deriving a BMD. Of the four general approaches available for estimating a BMD (i.e. application of a single dose-response model, a best-fitting model, frequentist model averaging, and Bayesian model averaging), EFSA (2022) recommends Bayesian model averaging. In general, BMDs and BMDLs derived from frequentist and Bayesian model averaging (if no informative priors are used) should produce similar values, which in turn should be comparable to the corresponding NOAELs (EFSA, 2022). We assume that BfR used uninformative priors in Bayesian model averaging, as recommended by the default setting in EFSA's web applications for statistical models (https://r4eu.efsa.europa.eu/), and we therefore expected no marked differences in the BMDs and BMDLs derived from the two model averaging approaches. However, the BMDLs which BfR obtained from the Liu et al. data differ considerably, with Bayesian model averaging producing a value 13 times larger than that from frequentist model averaging. Furthermore, the BMD estimates which BfR derived from frequentist model averaging were quite unstable, as suggested by the high ratio between the upper and lower limits of the BMDs (in excess of 50). In such cases, EFSA (2022) guidance advises the use of other points of departure for deriving a HBGV. This led BfR to base their TDI on the rather large BMDL produced by Bayesian model averaging (26 μg/kg/d). Although the ratio between the upper and lower BMD was acceptable in this case, this value is, rather unusually, larger than the corresponding NOAEL (20 μg/kg/d). These striking discrepancies were neither discussed nor explained by BfR. The other sensible option of using the NOAEL from Liu et al. as the basis for a TDI was not pursued by BfR. The Liu et al. NOAEL of 20 μg/kg/d translates into a deterministic RfD of 0.065 μg/kg/d and a probabilistic value of 0.044 μg/kg/d, approximately 2- to 3-times lower than BfR's value of 0.14 μg/kg/d (Tables 3 and 4). These differences were ignored by BfR (2023), but at the very least should have been the topic of uncertainty considerations.
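The discrepancies discussed above can be recomputed from the values reported in this section for Liu et al. (2013). The sketch below only forms ratios. The "implied divisor" (PoD divided by RfD) is an inference from the quoted numbers, not BfR's documented breakdown of factors, which is not given here.

```python
# Values quoted in the text for Liu et al. (2013), all in ug/kg/d.
NOAEL = 20.0
BMDL_FREQUENTIST = 2.0
BMDL_BAYESIAN = 26.0
RFD_DET_FROM_NOAEL = 0.065
RFD_DET_FROM_BMDL_BAYES = 0.085
RFD_PROB_FROM_NOAEL = 0.044
RFD_BFR = 0.14


def implied_divisor(pod: float, rfd: float) -> float:
    """Overall factor by which a point of departure was divided to give an RfD."""
    return pod / rfd


checks = {
    # Bayesian vs frequentist BMDL: expected to be similar with uninformative priors.
    "Bayesian / frequentist BMDL": BMDL_BAYESIAN / BMDL_FREQUENTIST,
    # A ratio above 1 means the BMDL exceeds the NOAEL, which is unusual.
    "Bayesian BMDL / NOAEL": BMDL_BAYESIAN / NOAEL,
    # Implied composite divisors behind the two deterministic RfDs.
    "divisor, NOAEL route": implied_divisor(NOAEL, RFD_DET_FROM_NOAEL),
    "divisor, Bayesian BMDL route": implied_divisor(BMDL_BAYESIAN, RFD_DET_FROM_BMDL_BAYES),
    # How much higher BfR's value is than the NOAEL-based RfDs.
    "BfR / deterministic NOAEL RfD": RFD_BFR / RFD_DET_FROM_NOAEL,
    "BfR / probabilistic NOAEL RfD": RFD_BFR / RFD_PROB_FROM_NOAEL,
}

for label, value in checks.items():
    print(f"{label}: {value:.2f}")
```

Both deterministic routes imply a divisor of roughly 306-308. This suggests, but does not prove, that the same set of factors was applied to different points of departure, so the choice of PoD alone accounts for the difference between the two deterministic RfDs.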

Interspecies assessment factors
While the dispute between EFSA and BfR about the interspecies AF for the mouse is not material for BfR's suggested alternative TDI (the relevant studies are in rats), we share some of BfR's concerns regarding the large EFSA mouse AF of 64.5. In relying exclusively on the Doerge et al. (2011) data, EFSA chose a somewhat weak basis for their selection. The mouse interspecies AF derived from allometric scaling in the WHO APROBA tool is perhaps more appropriate. However, the AF chosen by BfR for rat-to-human extrapolations in the probabilistic evaluation appears to be unrealistically low. Derived from permutations of rat and human AUCs, the median of 1.98 of the proposed log-normal distribution implies that humans are on average only about two times more sensitive to BPA than rats, far less than suggested by allometric scaling in the APROBA tool. The ratio between the 5th (P05) and the 95th (P95) percentile of the log-normal distribution expresses the degree of uncertainty in extrapolating from rats to humans. Therefore, the choice of the values for the tails of the distribution (P05 and P95) should be supported by data of sufficient quality. It is highly questionable whether only six (not independent) human equivalent dose factor (HEDF) values can provide a sound basis for such estimations. Even more puzzling is BfR's choice of AF for sensitivity differences between humans, the intra-species AF. It is normally assumed that sensitive humans on average react to 10-times lower exposures than more resilient individuals, as expressed by the median of the APROBA values (Table 4). The range chosen by BfR, with a median of 1, assumes that these differences do not exist. In selecting their values, BfR reference ECHA (2012) and EFSA (2012) guidance on the matter, but we were unable to locate these values in the references provided by BfR.
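How the median and tails of a log-normal AF distribution feed into a probabilistic RfD can be sketched as follows. This is a simplified illustration, not BfR's or EFSA's implementation: it assumes RfD = PoD / (AF_inter × AF_intra) with independent log-normal factors and omits any further factors used in the actual assessments, so only relative results are meaningful. The mouse percentiles (11.24, P05 8.14, P95 15.25), the rat median of 1.98 and the PoD of 26 μg/kg/d come from the text; `RAT_SIGMA` and `INTRA_SIGMA` are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for percentile z-scores


def sigma_from_p05_p95(p05: float, p95: float) -> float:
    """Log-scale SD of a log-normal distribution from its 5th and 95th percentiles."""
    return math.log(p95 / p05) / (2 * STD_NORMAL.inv_cdf(0.95))


def rfd_percentile(pod: float, factors: list[tuple[float, float]], q: float = 0.05) -> float:
    """q-th percentile of PoD / (AF_1 * ... * AF_n) for independent log-normal AFs.

    Each factor is (median, sigma on the natural-log scale). The log of the ratio is
    normal with mean ln(PoD) - sum(ln(median)) and variance sum(sigma**2).
    """
    mu = math.log(pod) - sum(math.log(median) for median, _ in factors)
    s = math.sqrt(sum(sigma ** 2 for _, sigma in factors))
    return math.exp(mu + STD_NORMAL.inv_cdf(q) * s)


# Mouse interspecies values quoted in the text: median 11.24, P05 8.14, P95 15.25.
mouse_sigma = sigma_from_p05_p95(8.14, 15.25)
print(f"mouse: sigma = {mouse_sigma:.3f}, geometric midpoint of P05/P95 = "
      f"{math.sqrt(8.14 * 15.25):.2f} (reported median 11.24)")

RAT_MEDIAN = 1.98   # rat-to-human median attributed to BfR
RAT_SIGMA = 0.3     # HYPOTHETICAL tail width
INTRA_SIGMA = 0.5   # HYPOTHETICAL tail width
POD = 26.0          # ug/kg/d, Bayesian BMDL10 from Liu et al. (2013)

rfd_intra_1 = rfd_percentile(POD, [(RAT_MEDIAN, RAT_SIGMA), (1.0, INTRA_SIGMA)])
rfd_intra_10 = rfd_percentile(POD, [(RAT_MEDIAN, RAT_SIGMA), (10.0, INTRA_SIGMA)])
print(f"P05 RfD with intra-species median 1 vs 10: ratio = {rfd_intra_1 / rfd_intra_10:.1f}")
```

With the tail widths held fixed, lowering the intra-species median from 10 to 1 raises the lower-percentile RfD by exactly that factor of 10, whatever the tail widths are. The choice of median therefore matters at least as much as the tails.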

Conclusions
Switching from immunotoxicity to declines in semen quality after adult exposures as the critical toxicity for deriving a BPA HBGV would produce values between 33-times higher (best estimate, derived from Wang et al. using a probabilistic approach, Table 4) and 2.5- to 5-times higher (lowest estimates, from Wang et al. with deterministic approaches and the large EFSA mouse-to-human AF, Table 3) than the present EFSA HBGV of 0.2 ng/kg/d. Consideration of gestational exposures would give RfDs 12-times higher (best estimate, derived from Ullah et al., 2019 using a probabilistic method, Table 4). As detailed before (Kortenkamp et al., 2022a), we maintain that a study of gestational BPA exposures is the most suitable basis for deriving an RfD for semen quality deteriorations, and that the Ullah et al. study is key. On the basis of that study, a value of 2-4 ng/kg/d is appropriate. The data provided by Shi et al. (2018) would lead to an even lower value.
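Because the conclusions are expressed as multiples of the EFSA HBGV, converting them back into absolute intakes makes them easier to compare with the other values in this paper. The sketch below uses only the multiples and the 0.2 ng/kg/d HBGV stated above.

```python
EFSA_HBGV_NG = 0.2  # ng/kg bw/d
NG_PER_UG = 1000.0


def from_multiple(multiple: float) -> tuple[float, float]:
    """Convert 'n-times the EFSA HBGV' into (ng/kg/d, ug/kg/d)."""
    ng = multiple * EFSA_HBGV_NG
    return ng, ng / NG_PER_UG


multiples = {
    "adult, Wang et al., probabilistic (best estimate)": 33,
    "adult, Wang et al., deterministic (lowest, lower bound)": 2.5,
    "adult, Wang et al., deterministic (lowest, upper bound)": 5,
    "gestational, Ullah et al., probabilistic (best estimate)": 12,
}

for label, k in multiples.items():
    ng, ug = from_multiple(k)
    print(f"{label}: {k:g}x -> {ng:.2g} ng/kg/d = {ug:.2g} ug/kg/d")
```

The 33-fold and 5-fold multiples correspond to 0.0066 and 0.001 μg/kg/d, the RfD range quoted earlier for Wang et al. (2016). The 12-fold gestational estimate (2.4 ng/kg/d) lies within the proposed 2-4 ng/kg/d.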
Our deconstruction of the proposed alternative BfR TDI shows that this value is the result of a procedure in which less protective choices were made consistently at every possible turn: first in selecting the key studies, then in settling for BMD modelling as a way of providing a PoD while ignoring a NOAEL as an alternative, and finally in adopting a low interspecies AF derived through a questionable procedure. The consequence is a value that comes very close to BPA doses empirically shown to produce semen quality deteriorations in animal studies of gestational and adult exposures. This TDI also exceeds exposures associated with poor semen quality in epidemiological studies.

Suggestions for improvements
To avoid losing sight of the entire effect pattern, especially when dealing with syndrome-like effects such as those relevant to male reproductive health, it will be necessary to adopt a system that takes account of all the evidence, initially without removing data from Tier 3 studies from consideration. A way forward would be to split the evaluation process into a qualitative and a quantitative part, in which the primary aim of the first part is to answer the question "What is the strength of evidence linking BPA to the effect of interest?". In this step, all studies should be considered, initially independent of matters relevant to hazard characterisation, such as the number of doses used. Only in the next, quantitative step should issues concerning dose-response analyses come to the fore. In this way, studies rated as less reliable from a hazard characterisation viewpoint are not entirely removed from the analysis. Such a system might safeguard against the paradoxical situation that has arisen particularly with the BfR approach, where RfDs or HBGVs derived from "high quality" studies come too close to the dose ranges associated with effects in animal studies, even though these studies may not be suitable for hazard characterisation. A summary of the studies not further considered for dose-response analyses will be essential as a corrective, and this must be taken into account in uncertainty analyses.
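The proposed two-step process can be pictured as a simple data flow: every study enters the qualitative step, while the quantitative step only partitions studies and never discards them. The sketch below is purely illustrative. The `Study` record, its fields and the example studies are hypothetical and do not represent any agency's rating scheme.

```python
from dataclasses import dataclass


@dataclass
class Study:
    """HYPOTHETICAL study record; field names are illustrative only."""
    name: str
    tier: int                          # reliability tier, 1 (best) to 3
    reports_effect: bool
    suitable_for_dose_response: bool   # e.g. enough dose groups, adequate reporting


def qualitative_step(studies: list[Study]) -> dict[str, int]:
    """Step 1: weigh ALL studies, Tier 3 included, for strength of evidence."""
    return {
        "with effect": sum(s.reports_effect for s in studies),
        "without effect": sum(not s.reports_effect for s in studies),
        "tier 3 with effect": sum(s.reports_effect and s.tier == 3 for s in studies),
    }


def quantitative_step(studies: list[Study]) -> tuple[list[Study], list[Study]]:
    """Step 2: pick studies for dose-response analysis; keep the rest for uncertainty."""
    selected = [s for s in studies if s.suitable_for_dose_response]
    set_aside = [s for s in studies if not s.suitable_for_dose_response]
    return selected, set_aside


studies = [
    Study("study A", tier=1, reports_effect=True, suitable_for_dose_response=True),
    Study("study B", tier=3, reports_effect=True, suitable_for_dose_response=False),
    Study("study C", tier=2, reports_effect=False, suitable_for_dose_response=True),
]
print(qualitative_step(studies))
selected, set_aside = quantitative_step(studies)
print("dose-response:", [s.name for s in selected])
print("summarised for uncertainty analysis:", [s.name for s in set_aside])
```

In this design, a Tier 3 study such as "study B" still counts towards the strength of evidence. It also ends up in the summary that feeds the uncertainty analysis, instead of disappearing from the assessment.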

Declaration of competing interest
Andreas Kortenkamp, Martin Scholze and Eleni Iacovidou declare that they have no conflicts of interest. Olwenn V Martin reports a relationship with the European Chemicals Agency that includes board membership, and with the Food Packaging Forum that includes board membership and consulting or advisory services.
The corresponding values for the mouse are 11.24 (P05: 8.14, P95: 15.25), well below the AF of 64.5 used by EFSA. Compared with the BfR values from the Liu et al. study, these settings yielded somewhat lower estimates, as follows: the NOAEL of 20 μg/kg/d gave an RfD of 0.044 μg/kg/d, and the BMDL10 of 26 μg/kg/d from Bayesian model averaging produced an RfD of 0.076 μg/kg/d (all LCL).

Fig. 1 .
Fig. 1. Comparison of BPA daily intakes no longer associated with decrements in semen quality in epidemiological studies with various regulatory values. Shown are estimated daily BPA intakes derived from urinary BPA levels in epidemiological studies, using the procedure detailed in Kortenkamp et al. (2022). Values underlaid green are intakes not associated with deteriorations in semen quality ("epidemiological NOAELs"); those underlaid red are the first quantiles associated with significant effects (Adoamnei et al., 2018: decreases in sperm concentration and total sperm count; Pollard et al., 2019: morphological changes; Ji et al., 2018: sperm concentration; Lassen et al., 2014: sperm motility). The green bars on the intake grid (right) correspond to the "epidemiological NOAELs"; the red bars correspond to the intake ranges associated with deteriorations of semen quality. The vertical lines are the EFSA HBGV (blue) and the alternative BfR TDI (black). The range of RfDs derived from Wang et al. (2016) is shown as a blue box. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.)

neither discussed nor explained by BfR. The other sensible option of using the NOAEL from Liu et al. as the basis for a TDI was not pursued by BfR. The Liu et al. NOAEL of 20 μg/kg/d translates into a deterministic RfD of 0.065 μg/kg/d and a probabilistic value of 0.044 μg/kg/d, approximately 2- to 3-times lower than BfR's value of 0.14 μg/kg/d (Tables 3 and 4).

BfR argue that the Doerge et al. study is not suitable for estimating a mouse HEDF. Doerge et al. had administered a single dose of 100 μg/kg to mice by gavage, and this dose resulted in free BPA serum levels that were too low to be measurable beyond 2 h after administration. The measured time window (0.25-1 h), so BfR's argument goes, does not cover the period of enterohepatic recirculation, which is more extensive in rodents than in humans and which drives up free BPA levels. This results in larger AUCs and correspondingly higher HEDFs. In BfR's opinion, the mouse AUC for free BPA from Doerge et al. is unrealistically low, a view supported by the 1000-fold higher AUC for total BPA (AUC = 247 nmol h L−1, extrapolated to infinity) reported by Doerge et al. Accordingly, BfR prefer to base HEDF estimations on two alternative studies, those by Taylor et al. (2011) and Sieli et al. (2011), which produced larger AUCs for mice of 42 and 4.4 nmol h L−1, respectively (after adjustment to a dose of 100 μg/kg BW). These values are 172-fold and 18-fold higher than the AUC from Doerge et al. In Taylor et al. and Sieli et al., the ratios of the AUC of total to free BPA, 124 (at 100 mg/kg) and 109 (at 20 mg/kg) respectively, are much lower than the value of 1000 reported by Doerge et al. Accordingly, BfR regarded mouse HEDFs between 0.2 and 1.56 as more appropriate (BfR 2023).
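The free-BPA AUC from Doerge et al. is not stated in this section, but it can be back-calculated from the quoted fold differences as a consistency check. The sketch below is such a check, not a value taken from Doerge et al. (2011). It assumes, as the context indicates, that the 42 and 4.4 nmol h L−1 AUCs refer to free BPA.

```python
# Values quoted in the text (nmol h/L). The Taylor et al. and Sieli et al. AUCs are
# adjusted to a dose of 100 ug/kg BW and, from the context, refer to free BPA.
AUC_TOTAL_DOERGE = 247.0
AUC_TAYLOR = 42.0
AUC_SIELI = 4.4
FOLD_TAYLOR = 172  # Taylor et al. AUC relative to Doerge et al.
FOLD_SIELI = 18    # Sieli et al. AUC relative to Doerge et al.

# Back-calculate the Doerge et al. free-BPA AUC implied by each fold difference.
implied_from_taylor = AUC_TAYLOR / FOLD_TAYLOR
implied_from_sieli = AUC_SIELI / FOLD_SIELI

# Ratio of total to implied free BPA AUC in Doerge et al.
total_to_free = AUC_TOTAL_DOERGE / implied_from_taylor

print(f"implied free AUC (Taylor route): {implied_from_taylor:.3f}")
print(f"implied free AUC (Sieli route):  {implied_from_sieli:.3f}")
print(f"total/free ratio in Doerge et al.: {total_to_free:.0f}")
```

Both routes give a free-BPA AUC of about 0.24 nmol h L−1, and the implied total-to-free ratio is about 1000. This agrees with the "1000-fold higher" total AUC cited by BfR.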

Table 1
Study ratings of gestational and adult BPA exposures and semen quality in laboratory animals.

Table 3 )
Liu et al. (2013), one of the two key studies chosen by BfR, dosed adult Wistar rats with BPA (oral route) at 2, 20 and 200 μg/kg/d for 60 days. At 200 μg/kg/d, significantly decreased sperm counts were observed; all other doses were without effects. Accordingly, the NOAEL from this study is 20 μg/kg/d, which, in a deterministic approach, translates into an RfD of 0.065 μg/kg/d.
BfR used the data from Liu et al. for benchmark dose modelling. With frequentist model averaging, they calculated a BMDL10 of 2 μg/kg/d, which increased to 26 μg/kg/d when Bayesian model averaging was used (BfR 2023). Frequentist model averaging produced unstable estimates, as suggested by the large ratio between the upper and lower benchmark dose limits (BfR 2023). According to EFSA guidance (EFSA 2022), this precludes the use of the Liu et al. data for deriving TDIs or HBGVs, and therefore BfR did not take the BMDL10 of 2 μg/kg/d forward for establishing a deterministic RfD. The Bayesian model averaging BMDL of 26 μg/kg/d translates into a deterministic RfD of 0.085 μg/kg/d (Table 3).

Table 2
Toxicokinetic parameters in selected BPA mouse studies.
by BfR in their benchmark dose (BMD) modelling. According to EFSA (2022) guidance, at least three doses associated with responses significantly different from each other are needed for reliable BMD estimations. Under certain circumstances, data with only two statistically significantly different responses can also be