Same-Chemical Comparison of Nonanimal Eye Irritation Test Methods: Bovine Corneal Opacity and Permeability, EpiOcular™, Isolated Chicken Eye, Ocular Irritection®, OptiSafe™, and Short Time Exposure

The testing and classification of chemicals to determine adverse ocular effects are routinely conducted to ensure that materials are appropriately classified, labeled, and meet regulatory and safety guidelines. We have performed a same-chemical analysis using publicly available validation study results and compared the performance between tests for the same chemicals. To normalize for chemical selection, we matched chemicals tested by pairs of tests so that each matched set compared performance for the exact same chemicals. Same-chemical accuracy comparisons demonstrate a chemical selection effect that results in a wide range of overlapping false-positive (FP) rates and accuracies for all test methods. In addition, the analysis suggests that a tiered-testing strategy with specific combinations of tests can reduce the FP rate for some combinations. However, reductions in the FP rates were typically accompanied by an increase in the false-negative rates, resulting in minimal advantage in terms of accuracy. In addition, actual improvements in the FP rate after retesting positives with a second test are not as good as the theoretical improvements because some chemicals and functional groups appear to be broadly misclassified by all test methods, which, to the extent the tests make the same-chemical misclassifications, reduces the advantage of using tiered-testing strategies.


Introduction
Ocular irritancy testing supports the appropriate labeling of chemicals and products, which protects consumers and manufacturers from exposure to toxic chemicals. In the past, the appropriate labeling of chemicals and compounds has relied on the in vivo Draize rabbit eye test, which has become the "gold standard" for chemical classification. This live animal test uses a clinical scoring system that grades the severity and duration of the ocular irritation response that generally occurs from 1 to 21 days after a chemical exposure. The resulting clinical scores and durations of effects are then applied to either the United Nations Globally Harmonized System (GHS; Choksi et al., 2020; Lebrun et al., 2019) or U.S. Environmental Protection Agency (EPA) methods of eye irritation classification (Choksi et al., 2020). Because of animal cruelty concerns, there has been a major effort over the past several decades to develop alternative strategies (non-live animal tests) that could be safely used to classify and label potentially harmful chemicals. In this paper, we briefly review the background concerning ocular irritation testing and chemical/compound labeling, the major driving forces for nonanimal testing, and the currently recognized and validated nonanimal tests; we then perform a same-chemical comparison analysis using published databases to compare accuracy for nonanimal eye irritation tests alone or in a two-test tier.

Classification of Ocular Irritation
The GHS classification system is based on the most severe effects in two-thirds of animals tested (typically three) when the score drives the classification. However, for severe, persistent injury, one animal drives the classification. Class definitions include not classified (NC, no serious eye damage averaged over the first three days), 2 (2A, reversible serious irritation effects that reverse by 21 days, or 2B, by 7 days), and 1 (extreme and/or irreversible irritation effects that do not reverse by 21 days) (Choksi et al., 2020). GHS classification may be mechanistically related to materials that cause a depth of eye injury that damages the basement membrane and corneal stroma, and therefore is a more serious injury than damage to the superficial epithelium and takes several days or longer to reverse (Lebrun et al., 2019). This serious level and duration of eye injury required for GHS classification is used by regulatory agencies to indicate when labeling for eye protection is required. Hence, the GHS classification "NC" does not indicate a material is a "nonirritant" as defined by the layperson including users of eye area cosmetics and personal care products. The EPA classification system is based on the most severe animal responses (typically one out of three animals) with class definitions of IV (no significant damage 24 hours after exposure), III (damage reversible by 7 days after exposure), II (damage reversible by 21 days after exposure), and I (corrosive, lesions do not reverse by 21 days) (Choksi et al., 2020). While EPA category IV indicates no serious damage at 24 hours or longer, significant adverse ocular effects can still occur for these materials as long as these effects resolve prior to 24 hours. 
Hence, for some testing applications, for example, consumer satisfaction of eye associated with small group size, the inability of the clinical scoring systems to reflect the complexities of the total in vivo response are a major limitation (York and Steiling, 1998). There are also factors related to the dosing of test materials, methods of exposure, and the subjectivity of observations, scoring, and laboratory procedures (OECD, 1998). The Draize eye test has been found to have high misclassification rates. Based only on within-test variability, about 12% of the chemicals classified as category 2 and at least 11% of those classified as category 1 could in fact be equally identified as NC and category 2, respectively, by the in vivo Draize eye test (Luechtefeld et al., 2016a; Barroso et al., 2017).
Key criteria for selecting reference chemicals include chemicals covering different drivers of classification based on observed tissue effects (primarily corneal opacity), relevant chemical classes, and physical states. According to Barroso et al. (2017), "Considering the chemicals in the DRD that are commercially available today (511 individual chemicals tested in 556 studies), only about 73% (375 individual chemicals tested in 402 studies) are considered good reference chemicals that can be selected for future studies." Most recently, a set of 80 reference chemicals was selected in collaboration with Cosmetics Europe from their database of 634 chemicals that was generated from past validation studies and correlated with the main chemical ocular effect driving the in vivo GHS/EU CLP classification; these 80 chemicals consist of a wide range of classes used for in vitro eye test development for the CEFIC-LRI-AIMT6-VITO CON4EI (CONsortium for in vitro Eye Irritation testing strategy) project (Adriaens et al., 2018a).

In Vitro Test Methods
Validated and widely recognized tests used to detect GHS category 1 chemicals (ocular corrosives) include Bovine Corneal Opacity and Permeability (BCOP), Isolated Chicken Eye (ICE), Ocular Irritection, OptiSafe, and Short Time Exposure (STE); however, due to the complexity of the analysis and space constraints, GHS category 1 is not analyzed further in the current review. Validated nonanimal tests to detect GHS NC analyzed here include BCOP, EpiOcular, ICE, Ocular Irritection, OptiSafe, and STE, whose methods are briefly reviewed below. Data sources for each test method are listed in Table 1.

Bovine Corneal Opacity and Permeability
The BCOP method uses cow eyes from the meat industry to measure corneal opacity with an opacitometer and permeability using a spectrophotometer or plate reader. The BCOP test has been validated by the Interagency Coordinating Committee on the Validation of Alternative Methods (ICCVAM) along with ECVAM and the Japanese Center for the Validation of Alternative Methods (JaCVAM) in 2006 and 2010, respectively, using the BCOP (OP-KIT) to identify materials that cause eye corrosion (GHS category 1) and materials not requiring classification for serious eye damage (GHS NC). The OECD test guideline was updated in 2020 to include use of a laser light-based opacitometer (LLBO), which demonstrated similar performance as the OP-KIT opacitometer (OECD 437, 2020a). The OP-KIT opacitometer is a center-weighted reading of white polychromatic light transmission, and the LLBO opacity reader measures the entire corneal surface with a monochromatic laser source. Permeability is quantified using optical density (OD) values of the amount of sodium fluorescein dye that penetrates all layers. Using the opacity measurements and permeability OD values, an in vitro irritancy score (IVIS) is generated. A prediction model differentiates these scores into GHS NC and category 1 predictions. The OECD BCOP guideline states for the BCOP (OP-KIT) method that the accuracy was 69%, the false-positive (FP) rate was 69%, and the false-negative (FN) rate was 0%. For the BCOP (LLBO) method, the accuracy was 83%, the FP rate was 45%, and the FN rate was 6% (OECD 437, 2020a). The BCOP method can test solids and liquids, but is not recommended for use with test chemicals that are classified as irritating to eyes (category 2B or 2A) due to a number of category 1 chemicals underclassified and a number of NC chemicals overclassified (OECD 437, 2020a).
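To make the scoring step concrete, the following is a minimal sketch of how an IVIS and its prediction could be computed for the OP-KIT opacitometer. The formula (mean opacity plus 15 times the mean permeability OD at 490 nm) and the cut-offs (3 and 55) are assumptions taken from how OECD 437 is commonly summarized; they are not stated in this review and should be checked against the guideline. The function names are hypothetical, and the LLBO variant is not covered.

```python
def bcop_ivis(mean_opacity, mean_od490):
    """In vitro irritancy score for the OP-KIT opacitometer.

    Assumed formula (not stated in this review):
    IVIS = mean opacity + 15 * mean permeability OD490.
    """
    return mean_opacity + 15.0 * mean_od490


def bcop_prediction(ivis, nc_cutoff=3.0, cat1_cutoff=55.0):
    """Map an IVIS to a GHS call; both cut-offs are assumptions to verify."""
    if ivis <= nc_cutoff:
        return "NC"
    if ivis > cat1_cutoff:
        return "Category 1"
    # Scores between the two cut-offs give no stand-alone prediction.
    return "No prediction"
```

Keeping the cut-offs as parameters makes it easy to substitute the values from whichever guideline version is in force.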

Reconstituted Human Corneal Epithelium
There are four validated (OECD 492, 2019a) reconstituted human corneal epithelium (RhCE) eye irritation tests (EITs) (Yang et al., 2017; Lim et al., 2019). These test methods are three-dimensional models that consist of epithelium composed of human epidermal keratinocytes (skin cells) or transformed corneal epithelial cells. The test matrix is constructed to model the corneal epithelium with stratified and noncornified cells (Doucet et al., 2006; Jung et al., 2011). The RhCE data used for this analysis include only those for EpiOcular EIT.
Ocular irritancy is predicted based on in vitro cell viability following exposure to the test chemical as determined by the MTT viability assay. Live cell metabolism turns MTT purple. The optical density of extracted purple MTT is quantified using a plate reader. A prediction model classifies substances as nonirritants based on the percent of treatment group cell viability compared to that of the control group. EpiOcular EIT distinguishes test substances not requiring classification for serious eye damage/eye irritancy (NC) from chemicals requiring classification and labeling, according to the GHS. The EpiOcular OECD guideline states that the accuracy was 80%, the FP rate was 37%, and the FN rate was 4% (OECD 492, 2019a). While a wide range of substances can be tested using EpiOcular EIT, colored substances and substances that directly change the color of MTT require additional controls (OECD 492, 2019a).
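The viability-based prediction can be sketched as follows. The percentage calculation follows the description above; the 60% cut-off is an assumption (the value commonly cited for EpiOcular EIT) that is not given in this review, and the function names are hypothetical. Blank subtraction and the additional controls for colored or MTT-reducing substances are omitted.

```python
from statistics import mean


def percent_viability(od_treated, od_control):
    """Mean treated OD as a percentage of the mean negative-control OD."""
    control = mean(od_control)
    if control <= 0:
        raise ValueError("control OD must be positive")
    return 100.0 * mean(od_treated) / control


def eit_call(viability_pct, cutoff_pct=60.0):
    """Return 'NC' above the cut-off, otherwise 'Not NC'.

    The 60% default is an assumption to verify against OECD 492.
    """
    return "NC" if viability_pct > cutoff_pct else "Not NC"
```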

Isolated Chicken Eye
The ICE test method (OECD 438, 2018) is an organotypic model that uses chicken eyes from the meat industry to classify GHS NC and category 1 chemicals. Effects on the cornea are measured using a visual clinical scoring of opacity and involve applying fluorescein to the cornea, scoring damage to the epithelium, and measuring increased thickness (swelling) and macroscopic damage to the surface of the eye. Corneal opacity, swelling, and damage are assessed individually, and then all three endpoints are combined and applied to a prediction model to make an irritancy classification. The OECD ICE test guideline states that the accuracy was 88%, the FP rate was 24%, and the FN rate was 3% (OECD 438, 2018). While the ICE method can be used to test a wide range of substances, testing of alcohols produced a high FP rate, and solids/surfactants produced a high FN rate (OECD 438, 2018).

Ocular Irritection
Ocular Irritection® (InVitro International, Irvine, CA, USA) is a macromolecular test used to identify GHS NC and category 1 chemicals. The test system is a cell-free matrix that mimics the cornea. This is used to measure the amount of coagulation and saponification following exposure to a test chemical (Eskes et al., 2014). Quantification of changes to the test matrix is converted into a maximal qualified score and a GHS classification result. The OECD Ocular Irritection test guideline indicates that the accuracy was 75%, the FP rate was 41%, and the FN rate was 9% (OECD 496, 2019b). Ocular Irritection can test solids and liquids when a 10% solution of those materials falls within the pH range between 4 and 9 (OECD 496, 2019b). The OECD application domain is limited to weight of evidence testing versus as a standalone test for GHS NC. Therefore, FN and FP rates "in this context are not critical since all test chemicals that come out negative or positive would be subsequently tested with other adequately validated in vitro test(s), using a sequential testing strategy in a weight-of-evidence approach according to the OECD GD 263" (OECD 496, 2019b).

OptiSafe
The OptiSafe™ (Lebrun Labs LLC, Anaheim, CA, USA) ("Optimized for Safety") test is a shelf-stable, test tube-based in chemico method. OptiSafe is used to determine whether an unknown substance is an ocular nonirritant or ocular corrosive using chemical models of three variables: 1) damage to the corneal stroma, 2) damage to phospholipid bilayers, and 3) the potential to induce pH extremes in a system (pH buffering system of the eye) (Choksi et al., 2020). OD measurements and pH are used to calculate a score that is applied to prediction models for GHS NC and GHS category 1 (ocular corrosive). It also includes EPA category IV and EPA category I (ocular corrosive) prediction models. A validation study coordinated by NICEATM with members of ICCVAM forming the validation management team, assessed accuracy for the GHS NC and EPA IV nonsurfactant chemical predictions. For the three-lab masked transferability phase, when results across all three laboratories were combined based on the majority classification for the GHS system, test method accuracy was 89%, the FN rate was 0%, and the FP rate was 23%. Additional chemicals were selected by the validation management team to evaluate the application domain; these were tested by the lead lab only in the "application domain phase." Based on results from both the masked transferability study and the masked application domain study conducted by the lead lab only, test method accuracy for the GHS system was 80% (when the degraded chemical iso-octylthioglycolate, CASRN 25103-09-7 is accounted for, see Supplemental Material I), the FN rate was 0%, and the FP rate was 40% (GHS). The application domain includes soluble and insoluble solids, liquids, surfactants, and highly colored substances.

Short Time Exposure (STE)
The Statens Seruminstitut Rabbit Cornea cell-based STE test method (OECD 491, 2020b) is an in vitro cytotoxicity test. Cell viability is measured after a 5-minute exposure at both 5% and 0.05% test material concentrations. While the cell line is not differentiated, this is balanced by the rapid exposure of cells to the test substance followed by an evaluation of viability using the MTT assay (ICCVAM-NICEATM, 2013). The prediction model uses cell viability at both the 5% and 0.05% concentrations to classify GHS NC and GHS category 1 chemicals. No predictions can be made for GHS category 2B and 2A chemicals. For the detection of GHS NC chemicals, the STE test method has an accuracy of 85%, an FN rate of 12%, and an FP rate of 19% (OECD 491, 2020b). Test chemicals with vapor pressures higher than 6 kPa that fail to dissolve or form a suspension in mineral oil are outside of the application domain of the test method (OECD 491, 2020b).
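The two-concentration prediction model can be illustrated with a short sketch. The 70% viability cut-off is an assumption based on how OECD 491 is commonly summarized; it is not stated in this review, and the function name is hypothetical.

```python
def ste_prediction(viability_5pct, viability_005pct, cutoff=70.0):
    """Classify from viability (%) at the 5% and 0.05% test concentrations.

    Assumed rule: above the cut-off at both concentrations gives NC,
    at or below it at both gives category 1, and mixed results give
    no prediction.
    """
    low_at_5 = viability_5pct <= cutoff
    low_at_005 = viability_005pct <= cutoff
    if not low_at_5 and not low_at_005:
        return "NC"
    if low_at_5 and low_at_005:
        return "Category 1"
    return "No prediction"
```

The "No prediction" branch corresponds to the statement above that the method cannot predict GHS category 2B and 2A chemicals.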

"State of the Art"
None of the tests described above have been found to be capable of detecting the full range of the ocular irritation response, particularly those materials that cause an early response that resolves or "reverses" over time, that is, GHS category 2B/2A and EPA categories III and II. To address this issue, various groups have advocated for a tiered-testing strategy that combines multiple test procedures in a top-down or bottom-up approach, which makes testing more expensive and difficult to interpret. The Integrated Approach on Testing and Assessment (IATA) described in OECD 263 (2019c) improves on the sequential testing previously described in OECD 405 (2020c) and provides decision-making guidelines. The IATA for eye hazard identification is divided into nine "modules" with three major sections that are further detailed in OECD 263. In the IATA, it is suggested that the bottom-up approach should be followed when existing information is insufficient and the weight-of-evidence (WoE) assessment results in a high probability that the test chemical does not require classification for eye hazards (GHS NC versus GHS category 1/category 2). On the other hand, a top-down approach is recommended when the WoE indicates that the test chemical is probably GHS category 1. Hayashi et al. (2012) used a tiered-testing approach that combined data from the STE and BCOP tests to predict the eye irritation potential of chemicals. They concluded that, for these two tests, the accuracy of the GHS prediction was only slightly improved when the tiered combination was used compared to the STE test irritation rank classification alone.
The recommended maximum of three test methods in a tiered-testing strategy (Adriaens et al., 2018a) may not adequately identify reversible ocular irritants with a reasonable level of statistical certainty, as validated, high-sensitivity methods for the detection of GHS NC have a high FP rate, and validated methods for the detection of ocular corrosives (GHS category 1) have low sensitivity. This is further confounded by the fact that published accuracies of different tests use different validation procedures, making the comparisons between test methods and selection of which tests to use imprecise. Specifically, different studies use different chemicals. As demonstrated below, the mix of chemicals used to calculate accuracy and FP rates has a major impact on these statistics.
To address this, we conducted an extensive database search to identify chemicals that were used in common between the different test methods. Comparisons were then made based on these common chemicals to more precisely compare the accuracies between tests in the hope of identifying optimal tests and testing strategies.

Database Source Selection
Search Criteria. To ensure an objective and thorough analysis, we developed a standardized search procedure to identify validation study results for the different methods to be compared. The NIH PubMed database and specific government agency websites (OECD, ICCVAM, NICEATM, ECVAM, EURL ECVAM, and JaCVAM) were used as peer-reviewed and trusted sources. To search for test method validation results on these websites, the full name of the test method and the abbreviated form were used: Bovine Corneal Opacity and Permeability (BCOP), EpiOcular Eye Irritation Test (EIT), Isolated Chicken Eye (ICE), Ocular Irritection (OI), OptiSafe (OS), and Short Time Exposure (STE). The search was performed as described below.
The NIH PubMed database (NIH, accessed July 27, 2020) was used to search for publications, the OECD website (OECD, accessed July 27, 2020) was used to search for test guidelines, the U.S. NTP website (NTP, accessed July 27, 2020) was used to search for ICCVAM and NICEATM reports, and the EU Science Hub website (EU, accessed July 27, 2020) was used to search for ECVAM and EURL ECVAM reports. The following keywords were used to find the sources for each test method: "TEST METHOD NAME," "TEST METHOD NAME ABBREVIATION," "TEST METHOD NAME" + "TEST," "TEST METHOD NAME ABBREVIATION" + "TEST," "TEST METHOD NAME" + "VALIDATION," "TEST METHOD NAME ABBREVIATION" + "VALIDATION," "TEST METHOD NAME" + "EYE," "TEST METHOD NAME ABBREVIATION" + "EYE," "TEST METHOD NAME" + "OCULAR," and "TEST METHOD NAME ABBREVIATION" + "OCULAR." The first 100 results were analyzed for each of the searches above. Each result was checked to make sure the methods used were consistent with the most current test procedure and that a data set of chemicals tested with corresponding CASRN and GHS in vivo and in vitro results for each of the chemicals was included in the publication. In the case of BCOP, the sources were required to distinguish which type of opacitometer was used. These results were used as the sources for comparisons between each test method; abbreviated results are shown in Table 1, and detailed results are shown in Supplemental Table 1.
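The keyword pattern above is regular enough to generate mechanically. The sketch below builds the ten search strings for one test method; the function name and the exact string format are illustrative assumptions, not the tooling used for the original search.

```python
SUFFIXES = ["", "TEST", "VALIDATION", "EYE", "OCULAR"]


def search_queries(name, abbreviation):
    """Build the ten keyword combinations described for one test method."""
    queries = []
    for suffix in SUFFIXES:
        for term in (name, abbreviation):
            query = f'"{term}"'
            if suffix:
                query += f' + "{suffix}"'
            queries.append(query)
    return queries
```

For example, `search_queries("Isolated Chicken Eye", "ICE")` yields the full name and abbreviation alone, followed by each combined with "TEST", "VALIDATION", "EYE", and "OCULAR".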
There were a number of sources with duplicate data, as some sources compiled the data from multiple studies. For BCOP (OP-KIT), the Test Method Evaluation Report Volume 1 (ICCVAM, 2010a) contains data from the Balls et al. (1995) and Gautheron et al. (1994) studies. The BCOP method was divided into BCOP (OP-KIT) and BCOP (LLBO) to account for the different opacitometers utilized (OECD, 2020a). For ICE, the Test Method Evaluation Report Volume 2 (ICCVAM, 2010b) contains data from the Balls et al. (1995), Prinsen and Koeter (1993), and Prinsen (1996) studies. For STE, the Test Method Summary Review Document (ICCVAM-NICEATM, 2013) contains data from the Takahashi et al. (2009 and 2010) and Sakaguchi et al. (2011) studies. All data were obtained from the published literature, with the exception of iso-octylthioglycolate (CASRN 25103-09-7) results for the OptiSafe test method. The chemical was determined to be impure, and a different result was obtained when a purer lot was used. Repeated testing of 98% pure iso-octylthioglycolate showed that the OptiSafe method accurately predicts this chemical as a GHS NC (additional details are provided in Supplemental Material I). Therefore, this chemical was not used for the OptiSafe analysis reviewed herein.
From the sources listed in Table 1, chemicals were selected for analysis if the CASRN, GHS in vivo classification, and GHS in vitro prediction were provided. Chemicals without a CASRN could not be selected for analysis because there was no way to accurately identify the same chemical tested using another test method due to variations in chemical names. Chemicals without a GHS in vivo classification were also not selected for analysis because the chemical could not be classified as a true positive (TP), FN, true negative (TN), or FP with only an in vitro classification or result. To determine if a chemical was tested in a masked study, a statement in the source indicating which chemicals were tested "blinded," "masked," or "coded" was required. In this context, masked means that the samples to be tested are aliquoted into coded vials, and the laboratory conducting the test does not have the key to code numbers or know what is in each vial. Note: "masked" can be considered the same as "coded" or "blinded." To ensure study quality was sufficient, the following criteria were applied:

1. Only studies using the most current protocol were considered.
2. All studies were published and peer-reviewed.
3. Results were for chemicals with a CASRN and in vivo data, and could theoretically be repeated if questions arose.
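The selection rules and the outcome labels used in the rest of the analysis can be sketched as below. The record layout (a dict with the hypothetical keys `casrn`, `in_vivo`, and `in_vitro`) is an assumption for illustration, as is the convention that any in vitro result other than "NC" counts as a positive call in the NC versus irritant analysis.

```python
REQUIRED_FIELDS = ("casrn", "in_vivo", "in_vitro")


def is_eligible(record):
    """Keep a record only if CASRN, in vivo class and in vitro call are present."""
    return all(record.get(field) for field in REQUIRED_FIELDS)


def outcome(in_vivo, in_vitro):
    """Label one result as TP, FN, TN or FP in an NC-versus-irritant analysis."""
    truth_positive = in_vivo != "NC"   # GHS category 1 or 2 in vivo
    call_positive = in_vitro != "NC"   # anything other than an NC prediction
    if truth_positive:
        return "TP" if call_positive else "FN"
    return "FP" if call_positive else "TN"
```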

Results and Discussion
Results for the BCOP (LLBO), BCOP (OP-KIT), EpiOcular, ICE, Ocular Irritection, OptiSafe, and STE methods were compiled into a comprehensive table. Supplemental Table 2 shows the same-chemical comparisons for GHS categories 1, 2B/2A, and NC. An NC versus irritant analysis was used to determine if the chemical was a TP, FN, TN, or FP. Results for masked, blinded, or coded studies are indicated as masked (M), and not masked, not blinded, or not coded studies are indicated as NM. A dash (−) indicates there was no result provided from the sources used or there was no agreement between results to make a final classification decision. Also shown in this table is the OECD Toolbox V4.4.1 functional group for each chemical. While readers may find it bothersome to seek out supplemental tables on journal websites, given the very large number of results analyzed (363), it is not possible to put these very large tables within the published manuscript. Nonetheless, some interested parties will find these comprehensive comparison tables useful; therefore, they are included as supplemental materials. In addition, Supplemental Table 2 allows the reader to independently verify results and calculations, and may be useful for other applications.

Same-Chemical Accuracy Comparisons Demonstrate that the Mix of Chemicals Tested Determines Perceived Accuracy
Results for chemicals tested by two or more test methods were used to identify TP, FP, TN, and FN for chemicals in common between pairs of test methods. These values were used to determine the FP rate, FN rate, accuracy, and balanced accuracy for chemicals in common between the different test methods evaluated. Therefore, accuracy results are segregated based on whether the two tests being compared assessed the same chemicals. Table 2 shows the FN rate, FP rate, overall accuracy, and balanced accuracy comparison for each pair of tests evaluated with exactly the same chemicals. By comparing the same chemicals, unknown chemical selection bias is controlled for, allowing for an evaluation of the relative accuracy of one test compared to another. Supplemental Table 3 shows the accuracy comparisons of masked (i.e., coded or "blinded") studies.
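A minimal sketch of the same-chemical comparison follows. It assumes each test's results are held in a dict mapping CASRN to an outcome label ("TP", "FP", "TN", or "FN"); that layout and the function names are illustrative. The rates use the standard definitions: FP rate = FP / (FP + TN), FN rate = FN / (FN + TP), and balanced accuracy = the mean of sensitivity and specificity.

```python
from collections import Counter


def rates(outcomes):
    """FP rate, FN rate, accuracy and balanced accuracy from outcome labels."""
    n = Counter(outcomes)
    tp, fp, tn, fn = n["TP"], n["FP"], n["TN"], n["FN"]
    nan = float("nan")
    fp_rate = fp / (fp + tn) if fp + tn else nan
    fn_rate = fn / (fn + tp) if fn + tp else nan
    total = tp + fp + tn + fn
    return {
        "n": total,
        "fp_rate": fp_rate,
        "fn_rate": fn_rate,
        "accuracy": (tp + tn) / total if total else nan,
        # Mean of specificity (1 - FP rate) and sensitivity (1 - FN rate).
        "balanced_accuracy": ((1 - fp_rate) + (1 - fn_rate)) / 2,
    }


def same_chemical_comparison(results_a, results_b):
    """Score two tests on only the CASRNs that both of them tested."""
    common = sorted(results_a.keys() & results_b.keys())
    return (rates(results_a[c] for c in common),
            rates(results_b[c] for c in common))
```

Restricting both tests to the shared CASRNs is what removes the chemical selection effect: any chemical tested by only one method drops out of both sets of statistics.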
Supplemental Table 4 compares the n for coded study comparisons with the n for all available data (resulting from both coded and not coded studies). The total number of chemicals in common, especially the number of GHS NC chemicals, is limited for coded comparisons. For many of the coded comparisons, there are fewer than 10 negatives (GHS NC) in common, and in several examples, there are just one or two negatives in common (Supplemental Table 4). The overall same-chemical comparisons provide an increased number of chemicals in common for a better comparison but may still be considered a preliminary comparison because the numbers of chemicals in common are low and likely below what would be required for a robust statistical comparison. As shown in Supplemental Table 5, the balanced accuracies are similar between the masked and overall comparisons. It is interesting to note that the accuracy for some of the comparisons was slightly higher for the coded studies.
As shown in Table 2, although the n's are low, there are few commonly missed FNs, suggesting that the different tests likely detect different mechanisms of irritation and therefore that combinations of tests likely identify most irritants. An exception is the BCOP (OP-KIT) method compared to the BCOP (LLBO). Nonetheless, as shown in Table 2, about one-third to one-half or more of the FPs are commonly missed (Table 2, commonly missed FPs), with the notable exception of the STE test. Table 3 is a compilation of the accuracy ranges for each test presented in Table 2. It sets the published accuracy for each test method alongside the range of accuracies that result from selection criteria based on the same chemicals tested by two tests (segregated based on chemicals commonly tested by the two test methods being compared in each instance). The relative FP rates are variable and different from the published FP rates, except for the STE test method, and the relative FN rates are variable and different from the published FN rates, except for the OptiSafe method. Tables 2 and 3 show a similarity in results for matched chemicals that is more pronounced than the relative accuracy of one test compared to another for the same set of chemicals. The same test can show very different accuracies depending on which chemicals are in common for a given pair comparison. This indicates that the mix of chemicals has a major impact on the perceived accuracy of the test method; in other words, a test method can be perceived as highly accurate or inaccurate, depending on the mix of chemicals evaluated. Table 2 also demonstrates that for any given pair, there is a high percentage of commonly missed FPs. This is explored in more detail below.
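The "commonly missed FPs" statistic for a pair of tests can be sketched as follows. As an illustrative assumption, each test's results are a dict mapping CASRN to an outcome label, and the share is reported relative to each test's own FPs on the shared chemicals; the review does not specify which denominator Table 2 uses.

```python
def commonly_missed_fps(results_a, results_b):
    """Find chemicals that are FP in both tests, among chemicals both tested."""
    common = results_a.keys() & results_b.keys()
    fp_a = {c for c in common if results_a[c] == "FP"}
    fp_b = {c for c in common if results_b[c] == "FP"}
    shared = fp_a & fp_b
    nan = float("nan")
    return {
        "shared": sorted(shared),
        "share_of_a": len(shared) / len(fp_a) if fp_a else nan,
        "share_of_b": len(shared) / len(fp_b) if fp_b else nan,
    }
```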

There Is a Group of Chemicals that are GHS FPs by All of the Tests Evaluated Here
To evaluate commonly missed FPs in more detail, we narrowed the number of chemicals down to those that were assessed using two or more tests; this reduced the number of chemicals from 363 to 202. Additionally, upon closer inspection of chemical results, it was noted that the dilution for one chemical (No. 92) was not reported in two studies, leading to different classifications (both irritant and nonirritant) and different results (TP and FP). Because of this disparity, this chemical was also removed from the data set, leaving 201 chemicals for further analysis.
As shown in Table 4, of the 201 chemicals tested in two or more tests, 121 were tested by BCOP (LLBO), 145 by BCOP (OP-KIT), 144 by EpiOcular, 67 by ICE, 74 by Ocular Irritection, 81 by OptiSafe, and 117 by STE. Of the chemicals tested, BCOP (LLBO), BCOP (OP-KIT), EpiOcular, and ICE had twice as many irritants as NC chemicals, while Ocular Irritection, OptiSafe, and STE had about equal numbers of irritants and nonirritants. (Supplemental Table 2 is organized by ascending CAS Registry Number and provides additional information, including functional groups and additional references.) Of the 201 test chemicals, 82 were classified as NC by the GHS classification system (Table 5A), while 119 were classified as irritants (Table 5B), with 72 classified as category 1 and 47 classified as category 2 irritants. Under the EPA system, of the 155 classifiable chemicals, 39 were classified as EPA category IV, 55 as EPA category III, 23 as EPA category II, and 38 as EPA category I. Only 7 chemicals were tested by all 7 alternative tests, while 16 chemicals were tested by 6 tests, 30 by 5 tests, 56 by 4 tests, 46 by 3 tests, and the remaining 46 by only 2 tests.
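The coverage tallies above can be reproduced from a per-chemical list of tests. The sketch below assumes a hypothetical mapping from CASRN to the names of the tests that reported a result, and then checks that the tallies reported in the text are internally consistent.

```python
from collections import Counter


def coverage_distribution(tests_by_chemical):
    """Map 'number of tests' to 'number of chemicals tested that many times'."""
    return Counter(len(set(tests)) for tests in tests_by_chemical.values())


# Tallies reported in the text: number of tests -> number of chemicals.
reported_coverage = {7: 7, 6: 16, 5: 30, 4: 56, 3: 46, 2: 46}
assert sum(reported_coverage.values()) == 201

# GHS split (NC + irritants) and EPA split (IV + III + II + I) from the text.
assert 82 + (72 + 47) == 201
assert 39 + 55 + 23 + 38 == 155
```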
Of the 82 NC chemicals, 35 were correctly classified in all tests (Table 5A). Thirty-one of these chemicals had EPA classification, and of these 31, 25 were classified as EPA category IV, with 5 being classified as EPA category III and 1 classified as EPA category II. On the other hand, 17 chemicals were misclassified by all tests, with 2 chemicals missed by 5 tests, 2 chemicals missed by 4 tests, 6 chemicals missed by 3 tests, and 7 chemicals missed by 2 tests. Of those chemicals with EPA classifications, 12 were classified as EPA category III and 4 as EPA category IV. Of the remaining NC chemicals that were FP, at least 1 test correctly classified the material; however, 13 chemicals showed correct classification in only one of the other three to five tests. EPA classification for these chemicals also showed a greater number of EPA category III chemicals (7) compared to EPA category IV (3). Of these 13 missed chemicals, no one single test correctly classified them, with Ocular Irritection correctly classifying 1 chemical, EpiOcular correctly classifying 4 chemicals, and STE correctly classifying 8 chemicals. Table 5A demonstrates there are commonly misclassified chemicals, and many of the commonly overclassified chemicals are EPA category III/GHS NC.
Based on an extensive review of the literature, this paper establishes a historical database of test results of BCOP (LLBO), BCOP (OP-KIT), EpiOcular, ICE, Ocular Irritection, OptiSafe, and STE that provides GHS classifications and functional group identifications for which comparative accuracies and empirical tiered-testing strategies can be analyzed. Importantly, using this database, we have identified a group of chemicals that provide FP results in all reported testing strategies (Lower 1/3 of Table 5A). Specifically, there is a group of chemicals that are misclassified as FP by all in vitro eye irritation tests that were identified. While the STE method alone did show a lower FP rate for some of these chemicals, albeit also showing the highest FN, there was a high percent of FPs that are missed between test methods. When some chemicals are commonly missed by all tests, inclusion of these chemicals in a validation study will have an adverse impact on perceived test method accuracy.
There are several potential explanations for why all identified alternative tests misclassify this set of chemicals. First, as a group, these chemicals had specific functional groups in common: the amine, aryl and aryl halide, ether, ketone, and alkene groups were disproportionately overclassified by every test except the STE test. However, these functional groups were not unique to the set of missed chemicals and therefore cannot be used as markers for this set. Further comparisons of functional groups and other chemical properties with misprediction rates and specific ocular effects may provide insights into the chemistry and biological processes of these chemicals. However, the n is low, and these preliminary results are best suited for generating hypotheses and as the basis of prospective studies, for example, the hypothesis that other chemicals with these functional groups will also be FPs for a given test method.
If the database were expanded in the future, a program might be developed that would allow the input of a chemical structure to yield a suggestion of which in vitro test would have the highest accuracy for the associated functional groups.
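As a rough illustration of what such a program might look like, the following minimal sketch pools per-functional-group accuracy counts and suggests the test with the highest pooled accuracy. It assumes the functional groups of the input chemical have already been identified. The test names ("Test A", "Test B"), the counts, the `min_n` threshold, and the function name `suggest_test` are all hypothetical placeholders and are not values or methods from this study.

```python
from collections import defaultdict

# Hypothetical placeholder counts, NOT results from this study:
# (test name, functional group) -> (correctly classified, total tested)
HYPOTHETICAL_COUNTS = {
    ("Test A", "amine"): (3, 10),
    ("Test A", "ketone"): (8, 10),
    ("Test B", "amine"): (7, 10),
    ("Test B", "ketone"): (5, 10),
}


def suggest_test(functional_groups, counts=HYPOTHETICAL_COUNTS, min_n=5):
    """Return the test with the highest pooled accuracy for the given
    functional groups, or None if no test has at least min_n results."""
    pooled = defaultdict(lambda: [0, 0])  # test -> [correct, total]
    for (test, group), (correct, total) in counts.items():
        if group in functional_groups:
            pooled[test][0] += correct
            pooled[test][1] += total
    # Ignore tests with too few matching chemicals to support a suggestion.
    scored = {t: c / n for t, (c, n) in pooled.items() if n >= min_n}
    if not scored:
        return None
    return max(scored, key=scored.get)


if __name__ == "__main__":
    print(suggest_test({"amine"}))
    print(suggest_test({"amine", "ketone"}))
    print(suggest_test({"ether"}))
```

Pooling counts across functional groups is only one possible design choice; because the n per functional group is low, any such suggestion would be a hypothesis for prospective testing rather than a prediction.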
Another possible explanation for the mispredictions for this set of chemicals and their functional groups may be the variability of the in vivo database. To the extent that the in vivo data for a specific material do not reflect a risk to humans, the in vivo data are of little value for risk assessment and distract from a meaningful validation of nonanimal test methods. Consistently mispredicted chemicals should be identified and flagged, and additional studies of human exposure data and of chemical and toxicological properties should then be performed to allow stakeholders to develop consensus opinions on whether it is appropriate to include these chemicals as gold standards for validation studies.
Finally, common mispredictions may be due to the inability of all the tests evaluated to detect chemical irritants that show reversibility. Greater than 80% of the consistently overpredicted NC chemicals with EPA classifications were classified as EPA category III (Table 5A). Therefore, it appears that the test methods discussed here do not differentiate between damage at 24 hours and damage at 3 days, resulting in high GHS FP rates for most tests when EPA category III/GHS NC chemicals are included in the validation. This finding suggests that, if there is a desire to reduce the number of FPs, there is still a need for further alternative test development based on the detection of reversibility. The need for alternative tests that address mechanisms of reversibility is also supported by the evaluation of the effects of these persistently missed chemicals on tiered-testing strategies.

Tiered Testing with Two Tests Provides Little or No Advantage over a Single Test
Same-chemical matched results for two tests can be used to evaluate sequential testing strategies for the same two tests. When the FP rate is high, as is the case for nonanimal eye irritation tests, a common bottom-up, tiered-testing strategy is to retest positives using a second test. This strategy theoretically reduces the FP rate (because FPs have a second chance to be classified as TNs) but may increase the FN rate (because TPs have a second chance to be misclassified as FNs). The results shown in Table 6 demonstrate that the bottom-up strategy of retesting positives typically results in an increased FN rate. If FPs are commonly misclassified, the actual improvement in specificity is not as good as the theoretical improvement. As shown in Table 6, in some cases, the actual results from just one test are more accurate than retesting positives with a second test (because the second test contributes little or no actual reduction of FPs but does increase the FN rate). Because the same chemicals are commonly overclassified, retesting positives by a second test method results in a lower-than-expected improvement of the FP rate and typically increases the FN rate (Table 6).
In summary, as demonstrated by same-chemical comparisons, the perceived accuracy of one test method compared to another is highly dependent on the selection of chemicals. In addition, the tests evaluated herein miss many of the same chemicals and functional groups in the substances analyzed; in these cases, a second test in a tier will not add statistical power to a bottom-up analysis. It is suggested that chemicals misclassified using these test methods be further studied. Specifically, studying biological endpoints associated with mispredictions would allow specific mechanisms related to mispredictions to be identified and enable the development of more predictive and biologically relevant nonanimal tests.
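The difference between the theoretical and actual outcomes of retesting positives can be illustrated with a minimal sketch. The data below are a hypothetical toy set (4 irritants and 6 NC chemicals), not results from this study, and the function names are illustrative. The theoretical calculation assumes that the accuracy of the second test is unaffected by the outcome of the first, so the tiered FP rate is the product of the two FP rates and the tiered sensitivity is the product of the two sensitivities; the actual calculation calls a chemical positive only if both tests call it positive.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    fp_rate: float   # false positives / in vivo negatives
    fn_rate: float   # false negatives / in vivo positives
    accuracy: float  # correct calls / all chemicals


def rates(truth, calls):
    """Rates for one set of calls; truth and calls are lists of bools
    (True = irritant / in vitro positive)."""
    neg = [c for t, c in zip(truth, calls) if not t]
    pos = [c for t, c in zip(truth, calls) if t]
    fp = sum(neg)
    fn = len(pos) - sum(pos)
    return Rates(fp / len(neg), fn / len(pos),
                 (len(truth) - fp - fn) / len(truth))


def tiered_actual(truth, test1, test2):
    """Retest test-1 positives with test 2: final call is positive only
    if both tests are positive for the same chemical."""
    return rates(truth, [a and b for a, b in zip(test1, test2)])


def tiered_theoretical(truth, test1, test2):
    """Same tier, but assuming test 2 errors are independent of test 1."""
    r1, r2 = rates(truth, test1), rates(truth, test2)
    n_neg = sum(1 for t in truth if not t)
    n_pos = len(truth) - n_neg
    fp_rate = r1.fp_rate * r2.fp_rate
    fn_rate = 1 - (1 - r1.fn_rate) * (1 - r2.fn_rate)
    acc = (n_neg * (1 - fp_rate) + n_pos * (1 - fn_rate)) / len(truth)
    return Rates(fp_rate, fn_rate, acc)


if __name__ == "__main__":
    # Hypothetical toy data: both tests overclassify the SAME three NC chemicals.
    truth = [True] * 4 + [False] * 6
    test1 = [True, True, True, True, True, True, True, False, False, False]
    test2 = [True, True, True, False, True, True, True, False, False, False]
    print("test 1 alone:      ", rates(truth, test1))
    print("tier (theoretical):", tiered_theoretical(truth, test1, test2))
    print("tier (actual):     ", tiered_actual(truth, test1, test2))
```

In this toy set, the two tests share all of their FPs, so the actual tier leaves the FP rate unchanged, adds an FN from the second test, and ends up less accurate than test 1 alone, whereas the independence assumption predicts an improvement.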

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.

Same-chemical comparisons demonstrate that the perceived accuracy of one test compared to another is dependent on the specific chemicals used for the validation study.
Some chemicals and functional groups appear to be broadly misclassified by all test methods.
To the extent that all tests make the same misclassifications, the advantage of using a tiered-testing strategy is reduced.

n is the number of chemicals in common between the test methods compared.

Classification of Chemicals Analyzed by Two or More Alternative Tests.

Column headings: Representative Chemicals; Alternative Tests: BCOP (LLBO), BCOP (OP-KIT), EpiOcular, ICE, OI, OS, STE.

Table 5A. Identification of Non-Classified Chemicals.

Identification of Irritant Chemicals.

Table 6. Example of sequential testing strategy with test 1 first and then test 2, using chemicals in common between both tests. Materials are first assayed using test 1. Since these results are available, the accuracy values are the actual values, which are also the same as the theoretical accuracy values for the chemical set. In this sequential strategy, only positive results will be retested using test 2 (chemicals in the data set that were identified as in vitro positives will be retested using test 2). Test 1 then test 2 (theoretical) assumes the accuracy for the second test is as described in Table 2 (accuracy comparison table); in other words, accuracy is unaffected by the outcome of the first test. Test 1 then test 2 (actual) shows the real accuracy values. Since the results for all of these materials are known for both tests, the actual outcome of the first test is followed by the sequential testing of positives by the second test (refer to Supplemental Table 2).