The OECD program to validate the rat uterotrophic bioassay. Phase 2: coded single-dose studies.

The Organisation for Economic Co-operation and Development has completed phase 2 of an international validation program for the rodent uterotrophic bioassay. This portion of phase 2 assessed the reproducibility of the assay with a battery of positive and negative test substances. Positive agonists of the estrogen receptor included the potent reference estrogen 17α-ethinyl estradiol (EE), and the weak estrogen agonists bisphenol A, genistein, methoxychlor, nonylphenol, and o,p´-DDT. The negative test substance or nonagonist was n-dibutylphthalate. The test substances were coded, and prescribed doses of each test substance were administered in 16 laboratories. Two versions of the uterotrophic assay, the intact immature and the adult ovariectomized female rat, were tested and compared using four standardized protocols covering both sc and po administration. Assay reproducibility was compared using a) EE doses identical to those used in phase 1 and in parallel dose-response studies, b) single doses of the weak agonists identical to one of five doses from the dose-response studies, and c) a single dose of the negative test substance. The results were reproducible and in agreement both within individual laboratories and across the participating laboratories for the same test substance and protocol. The few exceptions are examined in detail. The reproducibility was achieved despite a variety of different experimental conditions (e.g., variations in animal strain, diet, housing protocol, bedding, vehicle, animal age). In conclusion, both versions of the uterotrophic bioassay and all protocols appear robust, reproducible, and transferable across laboratories and able to detect weak estrogen agonists. These results will be submitted along with other data for independent peer review to provide support for the validation of the uterotrophic bioassay.

The Organisation for Economic Co-operation and Development (OECD) has undertaken the validation of the uterotrophic bioassay. The management of the validation program and the results of other portions of the validation program have been described in other reports (Kanno et al. 2001, 2003). A central objective of the OECD validation program is to establish the reliability of standardized protocols for the uterotrophic bioassay. A demonstration of reliability is based on the transferability of a protocol among laboratories, where the protocol results are reproducible among laboratories (ICCVAM 1997; OECD 1998). Two aspects of reliability require demonstration in a validation program: a) the assay's sensitivity, or ability to respond to and detect positive substances, and b) the assay's specificity, or absence of response to negative substances (ICCVAM 1997; OECD 1998). Additionally, sensitivity and specificity should be assessed over time and should include data gathered using coded or blinded test substances (ICCVAM 1997; OECD 1998).
The studies in this paper are intended to demonstrate the reliability of the uterotrophic bioassay, including its sensitivity and specificity with coded samples. The test substances were a potent reference agonist, five weak estrogen agonists, and a negative test substance. Four protocols are included in the validation studies to address the two primary versions of the uterotrophic assay, the intact, immature, and the adult ovariectomized (OVX) female rat as well as the primary routes of administration, oral gavage and sc injection. A previous article demonstrated the reproducibility of the dose response of the reference agonist, 17α-ethinyl estradiol (EE), with both versions and all protocols (Kanno et al. 2001). An accompanying article demonstrates the reproducibility of both versions and all protocols using dose responses of the five weak agonist test substances (Kanno et al. 2003). Because all laboratories performed the EE dose response separate from these data, and almost all laboratories performed the weak agonist dose-response and coded single-dose studies at separate times, a comparison of the data provides for an assessment of bioassay reproducibility over time.

Materials and Methods
Test substances. Test substances were obtained and distributed through a centralized chemical repository at TNO, Zeist, the Netherlands. This repository is described in the accompanying paper, including a full description of the chemical identities, purities, and sources (Kanno et al. 2001), with the exception of the negative test substance, n-dibutylphthalate (DBP) (CAS no. 84-74-2, purity 99.9%) which was obtained from Sigma Aldrich (St. Louis, MO, USA). Because of the coded nature of this study, the amounts of test substance needed by each laboratory were calculated for each protocol. These amounts were preweighed into individually coded, opaque vials at the central repository prior to their shipment.
Animal supply, husbandry, and preparation. The details of how participating laboratories obtained animals, the housing and husbandry conditions, the age of the animals, compliance with the OECD guidelines on animal care (OECD 2000) and appropriate national regulations, and the animal preparation and observation prior to test substance administration have been described previously (Kanno et al. 2001, 2003).
Protocols. The details of the individual protocols have also been described previously (Kanno et al. 2001, 2003). Briefly, protocol A uses intact, immature female rats with dosing by oral gavage for 3 consecutive days. Protocol B uses intact, immature female rats with dosing by sc injection for 3 consecutive days. Protocol C uses young adult OVX rats as described above with dosing by sc injection for 3 consecutive days. Protocol D [previously called protocol C´ (Kanno et al. 2001)] also uses young adult OVX rats and extends the sc injection dosing to a total of 7 days.

Coded samples, vehicle, test substance preparation, and dosing.
For each test substance, individualized instructions, depending on the amount to be shipped, were given to each laboratory. The instructions specifically stated the volume of test vehicle to be added to the coded vials to provide a reference dose solution for each test substance. Further instructions were provided to adjust the administered test volume based on the recorded body weight (bw) of the animals to provide the prescribed experimental doses.
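The body weight adjustment described above is proportional arithmetic: the prescribed dose (mg/kg bw) times the animal's body weight, divided by the concentration of the reference dose solution. The following is a minimal sketch only; the function name, the stock concentration, and the example body weight are illustrative assumptions and are not taken from the instructions sent to the laboratories.

```python
def dosing_volume_ml(body_weight_g: float,
                     dose_mg_per_kg: float,
                     stock_mg_per_ml: float) -> float:
    """Volume of dose solution (mL) that delivers the prescribed dose.

    All three inputs are hypothetical illustration values supplied by the
    caller; the study's actual solution concentrations are not reproduced here.
    """
    if body_weight_g <= 0 or stock_mg_per_ml <= 0 or dose_mg_per_kg < 0:
        raise ValueError("body weight and concentration must be positive; dose must be >= 0")
    body_weight_kg = body_weight_g / 1000.0
    required_mg = dose_mg_per_kg * body_weight_kg  # mass of substance for this animal
    return required_mg / stock_mg_per_ml


# Illustration: a 50 g animal, 300 mg/kg bw/day, assumed 60 mg/mL solution.
volume = dosing_volume_ml(50.0, 300.0, 60.0)
print(f"administer {volume:.2f} mL")
```

Recomputing the volume from each day's recorded body weight keeps the delivered mg/kg dose constant as the animals grow.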
Participating laboratories were asked to have one set of personnel prepare the test substance dose solutions and administer the preparations and a second set perform the necropsy and record the uterine weights. This was intended to minimize the chances of working out the code for each test substance. Material safety data sheets were provided in a sealed envelope to a nominated person at each laboratory, who agreed to keep this envelope sealed except in cases of emergency. A generic material safety data sheet was prepared and supplied to cover all test substances so that the health and safety of personnel at the laboratory would not be compromised. The other details of the vehicle, test substance preparation, and animal-dosing procedures have been previously described (Kanno et al. 2001, 2003).
Necropsy, dissection, and uterine weight. As described previously, the animals were killed humanely 24 hr after the last test substance administration in the same sequence as the test substance was administered. The dissection of the uterus and the measurement of wet and blotted uterine weights to the nearest 0.1 mg were performed as described previously (Kanno et al. 2001, 2003).
Study management and quality control. The study management and quality control have been previously described. The Validation Management Group (VMG) requested that the studies be performed under OECD Good Laboratory Practice (GLP) guidelines (OECD 2002). However, full GLP compliance was not a requirement for a laboratory's participation in the validation program, and several of the laboratories did not perform their studies under GLP. Data were received, and after an initial statistical analysis was performed, all laboratories were requested to audit these raw data and respond to specific queries on outliers and questionable data. A small number of data corrections were made, and reporting errors on dilutions, samples, and identity of control groups were either corrected or clarified.
Statistics. The recording and statistical procedures, data evaluation by an analysis of covariance, logarithmic transformation of uterine data, use of the Dunnett and Hsu pairwise comparison test, studentized residual plots, and use of the ratio of the geometric means of the uterine weights (relative to the vehicle control) after adjusting for the body weight of the animal at necropsy with upper and lower 95% confidence levels have all been previously described (Kanno et al. 2003).
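The ratio of geometric means and its confidence limits can be illustrated with a deliberately simplified calculation. The sketch below log-transforms the uterine weights, takes the difference of group means, and back-transforms. It is an assumption-laden stand-in: it omits the body weight covariate adjustment and the Dunnett and Hsu comparisons used in the actual analysis, uses a plain pooled-variance two-group t interval, and the weights shown are invented for illustration.

```python
import math
from statistics import mean, variance


def geometric_mean_ratio(treated, control, t_crit):
    """Return (ratio, lower, upper): treated/control ratio of geometric means.

    Simplified sketch: no covariate adjustment, pooled-variance t interval.
    t_crit is the two-sided 95% critical value for n_t + n_c - 2 degrees of freedom.
    """
    log_t = [math.log(w) for w in treated]
    log_c = [math.log(w) for w in control]
    n_t, n_c = len(log_t), len(log_c)
    diff = mean(log_t) - mean(log_c)  # difference of means on the log scale
    pooled_var = ((n_t - 1) * variance(log_t) + (n_c - 1) * variance(log_c)) / (n_t + n_c - 2)
    se = math.sqrt(pooled_var * (1 / n_t + 1 / n_c))
    return (math.exp(diff),
            math.exp(diff - t_crit * se),
            math.exp(diff + t_crit * se))


def is_significant(lower, upper):
    """A ratio differs from the vehicle control when the interval excludes 1."""
    return lower > 1.0 or upper < 1.0


# Hypothetical blotted uterine weights (mg), six animals per group.
control_mg = [30.0, 32.0, 28.0, 31.0, 29.0, 33.0]
treated_mg = [60.0, 66.0, 58.0, 61.0, 63.0, 59.0]
T_CRIT_DF10 = 2.228  # two-sided 95% t critical value, 10 degrees of freedom

ratio, lower, upper = geometric_mean_ratio(treated_mg, control_mg, T_CRIT_DF10)
print(f"ratio {ratio:.2f} ({lower:.2f}, {upper:.2f}); significant: {is_significant(lower, upper)}")
```

Working on the log scale is what makes the back-transformed quantity a ratio of geometric means, and it is why an increase is judged significant when the lower confidence limit exceeds 1.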
To draw inferences across laboratories about the reproducibility of results at a given dose for each protocol, mixed-effects linear models were used, where the laboratories were treated as the random effects. Such an analysis takes into consideration both between-lab variability and within-lab variability, and provides an overall summary of the results. Thus, the analysis enables the computation of a mean response to a chemical across labs, and the lower and the upper 95% confidence limits under each protocol. This use of mixed-effects linear models is termed the "global analysis."
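To make the idea of combining between-lab and within-lab variability concrete, the sketch below pools per-laboratory log ratios with a DerSimonian-Laird random-effects estimate. This is a swapped-in, simpler technique chosen for illustration; it is not the mixed-effects linear model fitted to the raw data in the global analysis, and the per-laboratory values are hypothetical.

```python
import math


def pool_random_effects(log_ratios, std_errors, z=1.96):
    """DerSimonian-Laird pooling of per-laboratory log ratios.

    Returns (pooled_ratio, lower, upper, tau2), where tau2 is the estimated
    between-laboratory variance on the log scale. Illustrative stand-in only.
    """
    k = len(log_ratios)
    if k < 2 or k != len(std_errors):
        raise ValueError("need matching inputs from at least two laboratories")
    w = [1.0 / se ** 2 for se in std_errors]  # within-lab precision weights
    fixed = sum(wi * y for wi, y in zip(w, log_ratios)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, log_ratios))  # heterogeneity statistic
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]  # add between-lab variance
    pooled = sum(wi * y for wi, y in zip(w_star, log_ratios)) / sum(w_star)
    se_pooled = math.sqrt(1.0 / sum(w_star))
    return (math.exp(pooled),
            math.exp(pooled - z * se_pooled),
            math.exp(pooled + z * se_pooled),
            tau2)


# Hypothetical per-laboratory ratios and log-scale standard errors.
lab_ratios = [1.8, 2.4, 2.1, 1.6]
lab_se = [0.10, 0.12, 0.08, 0.15]
pooled, lower, upper, tau2 = pool_random_effects([math.log(r) for r in lab_ratios], lab_se)
print(f"pooled ratio {pooled:.2f} ({lower:.2f}, {upper:.2f}), tau^2 = {tau2:.4f}")
```

When laboratories disagree more than their within-lab standard errors predict, tau2 grows and the pooled interval widens, which is the qualitative behavior expected of a summary that treats laboratories as random effects.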

Design of Phase 2 Coded Single-Dose Studies
The objective of the coded single-dose studies was to produce the data to assess the reproducibility of the uterotrophic bioassay both within the same laboratory and across the multiple, participating laboratories. Further, the reproducibility was to be assessed over time and using blinded or coded samples.
Overall design rationale. Three types of test substances were included: a potent reference test substance, EE; five weak estrogen receptor agonists: genistein (GN), methoxychlor (MX), nonylphenol (NP), bisphenol A (BPA), and 1,1,1-trichloro-2,2-bis(o,p´-chlorophenyl)ethane or o,p´-DDT (DDT); and a negative test substance, DBP. A robust statistical comparison required that identical doses be selected so that the same prescribed doses for each test substance were used in every laboratory.
Two EE doses were selected from phase 1 (Kanno et al. 2001) to generate two additional sets of data to assess the reproducibility of the bioassay. The first EE dose for a given route of administration was the first minimally effective dose in the lower portion of the dose-response curve that was a statistically significant response in all laboratories in phase 1. The second EE dose was then 0.5-log higher than the first dose, and this second dose had given responses near or at the maximum uterine response in phase 1. These selected doses were used as control reference doses as part of the accompanying dose-response studies (Kanno et al. 2003) and in these studies as coded samples. This design produces three data sets of replicate doses to assess the reproducibility of the uterine response over time.
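A 0.5-log step corresponds to multiplying the dose by 10^0.5, or about 3.16. The short sketch below works this through for the two low EE doses reported for this study (1 µg/kg bw/day by oral gavage and 0.3 µg/kg bw/day by sc injection); the function name is an illustrative assumption, and reading the reported high doses of 3 and 1 µg/kg bw/day as rounded half-log values is an interpretation rather than a statement from the protocol.

```python
HALF_LOG_FACTOR = 10 ** 0.5  # one half-log step on a log10 dose scale, about 3.16


def half_log_higher(dose_ug_per_kg: float) -> float:
    """Dose that lies 0.5 log10 units above the given dose."""
    return dose_ug_per_kg * HALF_LOG_FACTOR


for route, low_dose in (("oral gavage", 1.0), ("sc injection", 0.3)):
    print(f"{route}: low {low_dose} -> half-log higher {half_log_higher(low_dose):.2f} ug/kg bw/day")
```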
The selection of the positive weak agonists and a series of five prescribed doses for each are described in the accompanying paper (Kanno et al. 2003). The participating laboratories were required in the dose-response studies to use the three intermediate doses, whereas the lowest and highest of the five doses were optional. Therefore, the third or fourth dose in the series was selected for this coded single-dose study. As a result, two sets of replicate doses would be available, one from the dose-response studies and one from the coded single-dose studies, and would include all five weak agonists in all four standardized protocols.
The negative test substance, DBP, was chosen based on two lines of evidence. First, DBP does not display binding affinity for the rat uterine estrogen receptor, i.e., there is no displacement of bound [3H]17β-estradiol at concentrations up to 1 mM in vitro (Blair et al. 2000). Second, in vivo toxicological studies, with some including gene activation profiles, indicate that DBP does not elicit responses indicative of an estrogen mode of action (Ema et al. 2000; Ema and Miyawaki 2001; Mylchreest et al. 1998, 1999; Schulz et al. 2001; Zacharewski et al. 1998). A single data set that included data for all four standardized protocols was judged adequate for the negative chemical to conserve resources and animals.
Selected doses. The two reference EE doses selected were 1 and 3 µg/kg bw/day for oral gavage and 0.3 and 1 µg/kg bw/day for sc injection. For the weak estrogen receptor agonists, the selected doses for the oral gavage studies were 600 mg BPA/kg bw/day; 300 mg GN/kg bw/day; 300 mg MX/kg bw/day; 250 mg NP/kg bw/day; and 300 mg DDT/kg bw/day. Doses for the sc injection studies were 300 mg BPA/kg bw/day; 35 mg GN/kg bw/day; 500 mg MX/kg bw/day; 80 mg NP/kg bw/day; and 100 mg DDT/kg bw/day. For the negative test substance, DBP, a limit dose was selected for each route of administration: 1,000 mg/kg bw/day for oral gavage and 500 mg/kg bw/day for sc injection.

Results
The coded single-dose studies were performed by 16 laboratories. Laboratories 6, 7, 9, 10, and 15, which participated in either phase 1 (Kanno et al. 2001) or the dose-response studies in phase 2 (Kanno et al. 2003), did not participate in the coded single-dose studies. However, their EE results from these studies were included in the comparison of the EE results generated in the coded single-dose studies. Despite the size of this international study, the actual difficulties encountered were few. For example, laboratories 17 and 19 may lack results for MX, BPA, GN, or DDT, because some of these substances were not administered after these two laboratories experienced difficulty in solubilization during dosage preparation. A few laboratories misinterpreted the EE dilution instructions, so that a few dose concentrations were either reversed or were incorrect (e.g., the high EE dose in laboratory 12). Except for laboratory 1, audits of the records were able to correct the data for the reversals. Finally, uterine wall punctures were reported in three animals in separate laboratories and groups during dissection. The possible losses of imbibed fluid did not affect any results.
Mini-Monograph | Uterotrophic bioassay validation: coded single-dose studies. Environmental Health Perspectives • VOLUME 111 | NUMBER 12 | September 2003
Mortalities, decreases in body weight or body weight gain, and clinical signs. Of 1,842 animals administered test substances in the coded single-dose studies, 42 mortalities were observed in eight laboratories. All mortalities in the coded single-dose studies were in protocol A (2 in GN studies, 3 in MX studies, 3 in DBP studies, 6 in BPA studies, 8 in DDT studies, and 19 in NP studies). As with the dose-response experiments, a dose-related pattern of modest reductions in body weights and diminished body weight gains was often observed in the immature animal studies and in the OVX studies where the dosing was extended to 7 days. Decreases in body weights at terminal sacrifice approaching or greater than 10% were observed with NP in most protocol A studies, DDT in protocol A, BPA in protocol D, MX in protocol D, and the high EE dose in some protocol D studies (data not shown), indicating that a maximum tolerated dose had been exceeded. Clinical signs were reported in conjunction with the mortalities and body weight losses, including piloerection, crouched positions, and labored breathing.
Ethinyl estradiol studies. Within each protocol, the mean increases in the body weight-adjusted blotted uterine weights of both the low and high EE doses were reproducible. The low and high EE dose results for the dose-response and coded single-dose protocols are shown in Table 1 and Table 2, respectively, and the phase 1 results have been previously reported (Kanno et al. 2001). In protocol A, the results of the three sets of EE data were reproducible. The blotted uterine weight increases were statistically significant at both EE doses, and the weights increased in a dose-related manner. There were two exceptions. Laboratory 1 did not achieve statistical significance at the lower 1 µg EE/kg/day dose in the dose-response studies, but had achieved statistical significance at this dose in phase 1. In laboratory 13, the ratio of mean uterine weight of the test substance group relative to the vehicle control group was nearly five at the lower EE dose in the coded single-dose studies. The ratio was a more modest value of 1.5 to 2 in phase 1 and the dose-response studies. In protocol B, the results of the three sets of EE data were reproducible with two exceptions. First, the ratio of the uterine weight increases in laboratories 9, 15, and 18 at the lower EE dose was 3.5 to 5 in phase 2, compared with approximately 2 in phase 1. Second, laboratory 19 failed to achieve statistical significance at either EE dose. In protocol C, the results of the three sets of EE data were reproducible with one exception. Laboratory 19 achieved statistical significance with the low EE dose, but not the high EE dose, in the coded single-dose studies. This same laboratory had shown a low responsiveness to EE in protocol C in phase 1 (Table 5 and Figure 1C in Kanno et al. 2001). In protocol D, the results of the three sets of EE data were reproducible. As noted in phase 1 (Kanno et al. 2001), the extended dosing in protocol D again typically led to a further increase in the blotted uterine weights over protocol C at both the low and high EE doses, but the increased number of EE doses also led to decreases in body weight gains.
Weak agonist studies. The results for the same BPA dose in the dose-response and coded single-dose studies are shown in Table 3. In protocol A, even at a dose of 600 mg BPA/kg/day, the relative uterine response was very weak and did not exceed a value of 2 in any laboratory. In the response distribution from this modest response, five laboratories failed to achieve statistical significance. Although all five had increased absolute uterine weights, the 95% lower confidence level did not exceed 1 as necessary for statistical significance. In three of these laboratories, animal mortalities occurred, decreasing the power. In protocol B, at a dose of 300 mg BPA/kg/day, the mean ratio values of the relative increase in uterine weight were between 1.5 and 2.8. In this response distribution, 3 of 23 experiments did not achieve statistical significance. In laboratory 12, the mean uterine weight was increased over 30%, but did not achieve significance. In laboratory 20, little or no evidence of a response was seen in either the dose-response or the coded single-dose studies. In protocol C, the ratio of the mean treated uterine weight relative to the vehicle controls was 2.3 to 3.4, and all laboratories in this response distribution were able to achieve statistical significance. This ratio value for the adult OVX animals was consistently greater than for the immature animals in protocol B. In protocol D, the mean blotted uterine weights appeared to be increased by the extended dosing period, and all six laboratories achieved statistical significance.
[Table footnotes]
-, laboratory did not perform this particular study.
a Results from studies with coded or blinded doses for each substance.
b Results from the dose-response studies reported in the accompanying paper (Kanno et al. 2003).
c Ratio of geometric means of treated blotted uterine weights to the vehicle control blotted uterine weights after adjusting for the body weights at necropsy as a covariable (lower 95% confidence limit, upper 95% confidence limit).
d This study did not achieve statistical significance.
e This laboratory used po dilution instructions to use doses of 1 and 3 µg/kg/day. Therefore, no 0.3-µg/kg/day dose was available.
f This laboratory used sc dilution instructions to use doses of 0.3 and 1 µg/kg/day. The 1-µg/kg/day dose was the actual low EE dose and is reported here.
*Level of significance, p < 0.05.
The results for the same GN dose in the dose-response and coded single-dose studies are shown in Table 4. The mean uterine responses at the selected GN doses relative to controls were 2 or greater for most laboratories. All laboratories in their respective response distributions achieved statistical significance in each protocol. In the case of GN, the immature animals in protocol B appeared to have a somewhat higher mean response than the OVX animals in protocol C and even in protocol D with the extended dosing period.
[Table footnotes]
-, laboratory did not perform this particular study.
a Ratio of geometric means of treated blotted uterine weights to the vehicle control blotted uterine weights after adjusting for the body weights at necropsy as a covariable (lower 95% confidence limit, upper 95% confidence limit).
b This laboratory used po dilution instructions to use doses of 1 and 3 µg/kg/day. The 1 µg/kg/day dose was the actual high EE dose and is reported here.
c This laboratory used sc dilution instructions for doses of 0.3 and 1 µg/kg/day. Therefore, no 3 µg/kg/day high EE dose was performed.
d This laboratory incorrectly diluted the high EE dose in all studies.
e This study did not achieve statistical significance.
*Level of significance, p < 0.05.
[Table footnotes]
-, laboratory did not perform this particular study.
a Ratio of geometric means of treated blotted uterine weights to the vehicle control blotted uterine weights after adjusting for the body weights at necropsy as a covariable (lower 95% confidence limit, upper 95% confidence limit).
b This study did not achieve statistical significance.
c In the dose-response studies at this dose, one animal died in laboratory 2, one in laboratory 7, one in laboratory 12, and one in laboratory 13.
d In the coded single-dose studies at this dose, three animals died in laboratory 12, one in laboratory 13, and two in laboratory 14.
*Level of significance, p < 0.05.
The results for the same MX dose in the dose-response and coded single-dose studies are shown in Table 5. The mean uterine responses at the selected MX doses relative to controls were 2 or greater for most laboratories, and often exceeded 3 in protocols A and B. All laboratories in their respective response distributions achieved statistical significance in each protocol. In the case of MX, the immature animals in protocol B appeared to have a somewhat higher mean response than the OVX animals in protocol C and even in protocol D with the extended dosing period.
The results for the same NP dose in the dose-response and coded single-dose studies are shown in Table 6. In protocol A, 13 of 14 studies achieved statistical significance at a dose of 250 mg NP/kg/day. This is at first surprising, given that 11 of these laboratories experienced animal mortalities that further reduced the statistical power of an already small group size of six. However, the mean relative increase in uterine weights was no lower than 1.71 in any study, and the only laboratory that did not reach statistical significance had only two surviving animals and a mean relative increase of 1.97. In the sc protocols, the mean relative increases in uterine weight at the selected dose of 80 mg NP/kg/day were more modest, and greater than 2 in only 6 of 42 studies. In protocol B, 17 of 24 studies combined from the coded single-dose and the dose-response sets achieved statistical significance. In protocol C, 8 of 12 studies achieved statistical significance. In protocol D, all NP coded samples achieved statistical significance with the extended dosing period.
The results for the same DDT dose in the dose-response and coded single-dose studies are shown in Table 7. In protocol A, all 13 studies achieved statistical significance at a dose of 300 mg DDT/kg/day, as the minimum mean relative increase in uterine weight was 2.67. In the sc protocols at a dose of 100 mg DDT/kg/day, the relative increase in uterine weights was considerably lower, with only 4 of 36 studies greater than 1.5. As a result, only 6 of 19 studies achieved statistical significance in protocol B, 5 of 11 in protocol C, and 4 of 6 studies in protocol D with the extended dosing period.
Dibutylphthalate studies. The results for the DBP studies are shown in Table 8. In protocols A and D, none of the 15 DBP-treated groups were statistically significant versus the vehicle controls. However, in protocol B, the results of 4 of 14 studies, and in protocol C, the results of 1 of 7 studies, achieved statistical significance. Of these five studies, three had significantly increased blotted uterine weights when treated with DBP, whereas the other two had significantly decreased blotted uterine weights when treated with DBP.
[Table footnotes]
-, laboratory did not perform this particular study.
a Ratio of geometric means of treated blotted uterine weights to the vehicle control blotted uterine weights after adjusting for the body weights at necropsy as a covariable (lower 95% confidence limit, upper 95% confidence limit).
b In the coded single-dose studies at this dose, one animal died in laboratory 2 and one in laboratory 14.
*Level of significance, p < 0.05.

Discussion and Conclusions
The OECD is composed of over 20 nations, and OECD protocols such as the uterotrophic bioassay are intended for use in all of the member nations. As such, this validation study was carried out in 21 laboratories in nine nations. Funding for the study came primarily from national regulatory agencies and industry associations, but several laboratories freely contributed their time and effort to the study. This large, international nature of the program, however, increased the organizational and logistical workload. For example, the protocol had to be clearly understood by speakers of a variety of languages for the procedures to be performed in a similar manner in all laboratories. Data had to be recorded in the different laboratories and provided to an independent statistician in an accurate, timely, and efficient manner. The animal husbandry supplies, vehicles, and reagents, as well as the laboratory animals themselves, also had to be widely and readily available. Finally, the central repository had to deal with international shipments with different customs regulations and laboratory safety regulations. The Validation Management Group addressed these challenges with several efforts. Both the ovariectomy and uterine dissection procedures were videotaped, and the videotape was distributed to the technical staff of all the participating laboratories. The draft protocols were distributed to all national authorities and participating laboratories for comments and inquiries about any ambiguities. A common electronic spreadsheet was constructed and distributed for comment so the data could be recorded and electronically transmitted to the independent statistician. Despite these efforts and preparations, some laboratories encountered difficulty with certain dose-preparation instructions, two errors in the spreadsheet itself were later discovered, and the breakage of some vials during shipment required their rapid replacement because of the imminent delivery of immature animals whose births were timed for protocols A and B. Given the number of laboratories and individual studies, these were minor problems that did not affect the quality or the success of the results.
[Table footnotes]
-, laboratory did not perform this particular study.
a Ratio of geometric means of treated blotted uterine weights to the vehicle control blotted uterine weights after adjusting for the body weights at necropsy as a covariable (lower 95% confidence limit, upper 95% confidence limit).
b This study did not achieve statistical significance.
c In the coded single-dose studies, all six animals died in laboratory 12, and two animals died in laboratory 14.
d With the lower confidence level number > 1.00, the result is statistically significant.
e In the dose-response studies at this dose, one animal died in laboratory 12.
*Level of significance, p < 0.05.
[Table footnotes]
-, laboratory did not perform this particular study.
a Ratio of geometric means of treated blotted uterine weights to the vehicle control blotted uterine weights after adjusting for the body weights at necropsy as a covariable (lower 95% confidence limit, upper 95% confidence limit).
b In the coded single-dose studies, one animal died in laboratory 2, four animals died in laboratory 4, two animals died in laboratory 5, one animal died in laboratory 8, two animals died in laboratory 11, four animals died in laboratory 12, one animal died in laboratory 13, and four animals died in laboratory 14.
c This study did not achieve statistical significance.
d In the dose-response studies at this dose, two animals died in laboratory 4, one animal died in laboratory 7, and three animals died in laboratory 12.
*Level of significance, p < 0.05.
It should also be recognized that the protocols allowed variations in a number of experimental conditions. These variables include the choice of rat strain, the laboratory diet, housing and husbandry practices such as the use of cage bedding, the administration vehicle, and to a modest degree, the age of both immature and OVX animals. The judgment was that rigorous and detailed standardization of all of these variables would constrain the ability to widely and easily practice the uterotrophic bioassay in many of the OECD member nations, where the intended purpose is as a rapid screening bioassay for a large number of chemicals. The laboratory specifics for most laboratories have been described previously (Table 1 in Kanno et al. 2001; Table 7 in Kanno et al. 2003) or can be found for the remaining laboratories in Table 9.
The coded nature of the study also introduced some difficulties. To avoid giving very specific information that could be used to identify the coded test substances, broad general advice was given about dose preparation. Unfortunately, estrogen receptor agonists and antagonists tend to be hydrophobic and to have limited solubility. As noted, some laboratories encountered difficulty in solubilizing the test substances, and two laboratories decided to halt administration of particular preparations rather than administer apparent suspensions. This experience also suggests another source of variation in administered doses among the participating laboratories.
As with the dose-response studies, there was a consistent association in the coded single-dose studies between the occurrence of mortalities, reduced body weight gain, and clinical signs with the weak agonists DDT and NP in protocol A, and for reduced body weight gain with the EE high dose, BPA, and MX in protocol D. The 10% and greater differences in body weights between vehicle and treated animals occurred within just 4 days (protocols A, B, and C) or 8 days (protocol D) of treatment initiation, indicating a rapid onset of systemic toxicity at those doses. Despite the apparent magnitude of these insults, the uterine response appeared to remain undiminished, confirming the underlying robustness of this biological response for estrogen-screening programs.
Overall, for each protocol, the mean relative increase in uterine weight was reproducible within and among laboratories for both the dose-response and coded single-dose studies with each test substance. The dose-response results for each protocol and test substance are in the accompanying paper (Kanno et al. 2003). It is important to distinguish whether the results for a given test substance were consistently reproduced within and across laboratories over time from whether statistical significance was consistently achieved in all or none of the laboratories. The objective here is the former, the reproducibility of the bioassay. The results here should be interpreted by taking into account the following considerations. First, several of the selected doses were in the lower regions of a substance's dose-response curve (Kanno et al. 2001, 2003). Second, the lower region of the dose-response curve implies a distribution of statistically positive and negative responses, with the ratio between positive and negative results depending on the precise dose employed in the dose response of that particular substance. That is, the rate of studies lacking statistical significance should rise as the doses move further down the dose-response curve for a substance, particularly in the case of weak agonists, where the slope of the dose response is shallow. Several doses herein were at or near maximum uterine responses, for example, the high EE po and sc doses, the GN and MX po and sc doses, and the DDT po dose, and these doses consistently achieved statistical significance. Where the selected doses were increasingly in the lower portion of the dose-response curve, although the numerical results were reproducible within and across laboratories, an increasing number of studies failed to achieve a statistically significant difference, for example, the BPA po dose, the NP sc dose for adult OVX animals, and certainly the DDT sc dose.
Table footnotes: -, laboratory did not perform this particular study. a Ratio of geometric means of treated blotted uterine weights to the vehicle control blotted uterine weights after adjusting for the body weights at necropsy as a covariable (lower 95% confidence limit, upper 95% confidence limit). b This study did not achieve statistical significance. *Level of significance p < 0.05.
Table footnotes: Table 7 in Kanno et al. (2003). b Detailed information is available from the corresponding author of this article.
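The argument that the share of non-significant studies should rise as doses move down the dose-response curve can be illustrated with a rough power calculation. The sketch below is not the analysis used in the validation program: it assumes a simple two-sided z-test on log-transformed uterine weights, and the group size, the standard deviation on the log scale, and the treated-to-control ratios are all hypothetical values chosen only for illustration.

```python
from math import log, sqrt
from statistics import NormalDist


def approx_power(ratio, sd_log, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing mean log uterine
    weight in a treated group against a vehicle control group.

    ratio       -- true treated/control ratio of geometric means (hypothetical)
    sd_log      -- within-group SD of log uterine weight (hypothetical)
    n_per_group -- animals per group (hypothetical)
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    se = sd_log * sqrt(2 / n_per_group)   # SE of the difference in log means
    shift = log(ratio) / se               # true effect in SE units
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)


# Hypothetical ratios, from the low end of a dose-response curve to near maximum
for ratio in (1.1, 1.3, 1.6, 2.5):
    power = approx_power(ratio, sd_log=0.25, n_per_group=6)
    print(f"ratio {ratio:.1f}: approximate power {power:.2f}")
```

Under these assumptions, a dose producing only a small ratio is expected to reach significance in a minority of studies even when every laboratory measures the same underlying response, which is why reproducibility of the numerical result and consistency of statistical significance are treated as separate questions.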
To assess reproducibility, the mean relative uterine weight increases were calculated in an overall global analysis (Table 10). The uterotrophic responses were consistent and reproducible between the dose-response and the coded single-dose studies without exception for every test substance and every protocol. The global analysis in Table 10 also shows subtle test substance-specific differences in the protocols that were consistent in both the dose-response and coded single-dose studies. Comparing the intact immature and adult OVX versions (protocols B and C, respectively), the adult OVX version appears to be more responsive with BPA, whereas the intact immature version appears to be more responsive with GN and MX. More than doubling the time of treatment with extended dosing (protocol D) increased the response with BPA and, marginally, with GN, MX, and NP. The global analysis includes the results of all laboratories, regardless of mortalities in protocol A or the possible issues with laboratories 19 and 20 that are discussed below. Except for the lower means in the coded single-dose, high-dose EE studies for protocols B and C, no overall impact of their inclusion was observed.
The data were analyzed for an association between uterine weights and body weights and for the variability and power of the wet and the blotted weights. Although there was no consistent correlation between uterine weight and body weight, the data suggest that body weight is more strongly correlated with uterine weight in the immature animals than in the adult OVX animals. As with phase 1 and the dose-response studies, wet uterine weights were more variable than blotted weights (Kanno et al. 2001, 2003). The blotted uterine weights in phase 2 again showed slightly less interlaboratory and intragroup variability than wet weights with imbibed fluid, suggesting that blotted uterine weight will provide slightly better power for detecting uterotrophic effects than the wet weight.
In 5 of 36 studies, the uterine weights after DBP treatment were statistically different from controls, indicating that a certain rate of false positives and negatives will occur. Three sets of results were statistically higher than the vehicle groups, a false-positive rate of about 8%, and two were statistically lower. This nearly even division into higher and lower differences supports random chance due to variability about the baseline. In further support, the margins by which the respective upper and lower 95% confidence limits achieved statistical significance were minimal (Table 8). In absolute terms, the mean relative increase in uterine weight in the three higher incidents was just under 40%, which suggests a source of variability in the uterine weight from one group to another. When the raw body weight and uterine weight data from these laboratories were examined, there were no obvious anomalies or inconsistencies, such as outliers or high standard deviations, when compared with other laboratories.
When the overall patterns of these laboratories were assessed, one (laboratory 20) had the minimum response with five of six test chemicals and was below average for the two EE doses, consistent with its statistically decreased result. A second (laboratory 14) had responses that were the maximum with two test substances and above average for the remainder and the EE doses, consistent with its statistically increased result. The patterns of the other laboratories were unremarkable. Four of the five incidents occurred with immature animals. Although body weights were randomized, there is the possibility of group-to-group variations based on a litter-related effect. The animals used would have been born on the same day, meaning that the animals were likely from a limited number of litters. In fact, some investigators have taken the precaution of also randomizing their groups by litter (Christian et al. 1998). As the litter of origin for each individual was not recorded, this possibility cannot be assessed here. It is clear, however, that borderline false positives can occur with the present protocols, and that a weight-of-the-evidence integration of the uterotrophic results with other structural, in vitro, and in vivo data may be necessary for interpretation. Similarly, false negatives may also occur, and data to qualify the performing laboratory and criteria to accept the results may be necessary.
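The DBP rates quoted in the text follow directly from the study counts. The short calculation below reproduces them and, purely as a point of comparison, adds the number of significant results expected by chance if each of the 36 studies is treated as an independent test at the nominal 0.05 level. That independence assumption is ours, not the authors'; several studies came from the same laboratories, so the expected count is only a rough reference.

```python
# Counts taken from the text: 36 DBP studies, 3 statistically higher than
# vehicle controls and 2 statistically lower.
n_studies = 36
n_higher = 3
n_lower = 2


def percent(count, total):
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


higher_rate = percent(n_higher, n_studies)            # apparent false positives
lower_rate = percent(n_lower, n_studies)              # statistically decreased
any_rate = percent(n_higher + n_lower, n_studies)     # any significant difference

# ASSUMPTION (not from the source): independent tests at nominal alpha = 0.05.
nominal_alpha = 0.05
expected_by_chance = n_studies * nominal_alpha

print(f"higher than control: {higher_rate}%")
print(f"lower than control:  {lower_rate}%")
print(f"either direction:    {any_rate}%")
print(f"expected by chance at alpha={nominal_alpha}: {expected_by_chance:.1f} studies")
```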
The results in three laboratories deserve comment. These laboratories displayed a trend toward lower responsiveness to both EE and several weak agonists when compared with other laboratories. The performance of laboratory 6, with its high vehicle control weights and limited responsiveness in some cases, has been previously noted (Kanno et al. 2003). Here, we also note the lower general response in this laboratory to the EE doses in protocol B (Tables 1 and 2). Laboratory 19 observed no statistically significant uterotrophic responses for the test substances it could formulate or for either of the two EE doses in protocol B (Tables 1, 2, 6, and 7). The pattern of responses in this laboratory in protocol C, however, was unremarkable when compared with other laboratories. A close examination of the data, including dietary analyses, has not revealed any apparent reasons for this lack of responsiveness. Laboratory 20 observed statistical significance with both EE doses, but the relative increases in weight were somewhat lower than in other laboratories at the low EE dose and among the lowest at the high EE dose (Tables 1 and 2). Although statistical significance was observed with GN and MX, the increases in the uterine weights were the lowest observed in any laboratory (Tables 4 and 5). Statistical significance was not observed in either the dose-response or the coded single-dose studies with either BPA or NP, and again, the increases in the uterine weights were the lowest observed in any laboratory (Tables 3 and 6). A review of the data and laboratory variables first indicated that the vehicle control uterine weights were > 50 mg, which was well above the 20- to 40-mg range in most other laboratories. Then, an analysis of laboratory diets for phytoestrogens found that laboratory 20's diet had the highest combined total GN and daidzein levels of > 500 µg/g diet. This leads to the suspicion that the dynamic range of the bioassay in this particular case may have been impaired by the high phytoestrogen content of the diet.
Table footnotes: Abbreviations: CSD, coded single-dose studies; DR, dose-response studies. a Ratio of geometric means of treated blotted uterine weights to the vehicle control blotted uterine weights after adjusting for the body weights at necropsy as a covariable (lower 95% confidence limit, upper 95% confidence limit).
Together with other observations in the dose-response studies, these data suggest the need to monitor the uterine weights of vehicle control animals, to specify that laboratory diets have low to modest phytoestrogen levels (< 350 µg/g diet), and to qualify laboratories with both reference and weak agonists before performing tests of unknown substances. In addition, care should be taken not to exceed the maximum tolerated dose, to reduce animal pain, suffering, and mortalities. The results in the current coded single-dose studies provide additional evidence that a strong uterine response occurs even in the presence of severe systemic toxicity. The robustness of the uterine response in turn supports its use in a screening assay.
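The monitoring recommendations above could be expressed as a simple baseline check. The sketch below is illustrative only: the 350 µg/g dietary limit is taken from the text, but using the 20- to 40-mg range seen in most laboratories as a flagging threshold is our assumption, and the class, function, and example values are hypothetical rather than measured data from any participating laboratory.

```python
from dataclasses import dataclass

# From the text: diets should have < 350 ug/g combined phytoestrogens.
MAX_PHYTOESTROGEN_UG_PER_G = 350.0
# ASSUMPTION: the 20-40 mg range reported for most laboratories is used here
# as a flagging range; the source does not define it as an acceptance criterion.
TYPICAL_CONTROL_RANGE_MG = (20.0, 40.0)


@dataclass
class LabBaseline:
    lab_id: str
    control_uterine_mg: float            # mean vehicle control uterine weight
    diet_phytoestrogen_ug_per_g: float   # combined GN + daidzein in diet


def baseline_flags(lab):
    """Return a list of baseline concerns for one laboratory."""
    flags = []
    low, high = TYPICAL_CONTROL_RANGE_MG
    if not (low <= lab.control_uterine_mg <= high):
        flags.append("vehicle control uterine weight outside typical range")
    if lab.diet_phytoestrogen_ug_per_g >= MAX_PHYTOESTROGEN_UG_PER_G:
        flags.append("dietary phytoestrogen at or above limit")
    return flags


# Hypothetical example values, not data from the study
elevated = LabBaseline("example-elevated", 55.0, 520.0)
typical = LabBaseline("example-typical", 30.0, 200.0)

for lab in (elevated, typical):
    print(lab.lab_id, baseline_flags(lab) or "no flags")
```

A check of this kind would only prompt closer review; it does not by itself establish that a laboratory's dynamic range was impaired.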
In conclusion, the uterotrophic bioassay yields reproducible results within the same laboratory and across the participating laboratories over time with a range of test substances including the EE positive reference substance, the five weak agonist substances (BPA, GN, MX, NP, and DDT), and the negative substance (DBP). The results of the dose-response and coded single-dose studies are in agreement. No substantive performance differences were found between the different versions or their protocols that would support one version being consistently superior to another. Therefore, both the intact immature and OVX versions of the uterotrophic bioassay and the protocols herein are judged to be qualitatively equivalent to one another. Low rates of false negatives and false positives were observed. The false negatives occurred with very weak agonists (BPA, DDT, and NP) in the lowest portions of their dose-response curves. The false-positive rate with DBP was just over 8%, with mean relative weight increases of 30-40%, suggesting the importance of controlling group-to-group variations in the baseline and using a weight-of-the-evidence approach in interpreting very modest responses. These and other results from the dose-response studies and the dietary analyses will be used to develop the draft OECD test guideline for the uterotrophic bioassay. These results will be submitted along with other data for independent peer review to provide support for the validation of the uterotrophic bioassay.
We acknowledge the dedicated efforts and work of the participating laboratories in generating these data, namely, the Institute of Environmental Toxicology, Japan; Mitsubishi Chemical Safety Institute, Japan; Japan Bioassay Research Centre, Japan; Sumitomo Chemical Company, Japan; Syngenta Central Toxicology Laboratory, United Kingdom; WIL Research Laboratories, United States; BASF, Germany; Bayer AG, Germany; Aventis Crop Sciences, France; Exxon Biomedical Sciences Inc., United States; Free University of Berlin, Germany; National Institute of Toxicological Research, Korea; Huntingdon Life Sciences, United Kingdom; Institute of Food Safety and Toxicology, Denmark; Research Toxicology Center, Italy; and the Institute of Biomedical Research, Italy. The acquisition of the chemical substances and the chemical repository for this assay validation was sponsored by the European Chemical Industry Council and the Japanese Chemical Industry Association. Nonylphenol was kindly donated by Schenectady International Inc. (Schenectady, NY, USA), and bisphenol A was kindly donated by Bayer AG. We thank E. Zeiger and H. Koëter for reviewing the manuscript.