A Triangular Approach for the Validation of New Approach Methods for Skin Sensitization

The availability of reference data is a key requirement for the development of new approach methods (NAM), i.e., in vitro, in chemico and in silico methods and integrated approaches, like defined approaches (DA), which combine these data sources. Reference data are of even greater importance for regulatory acceptance. In contrast to most other adverse effects, human skin sensitization data on many chemicals are available, next to data from animal studies, such as the local lymph node assay (LLNA). Skin sensitization NAM data can therefore be compared to different reference datasets. Recent publications and validation at the OECD focused on human and LLNA reference data. The “2 out of 3” DA (2o3 DA) is the first DA for skin sensitization solely based on experimental data from validated tests and was recently adopted as an OECD test guideline. Here we review the predictivity of the 2o3 DA on multiple human and LLNA reference datasets. Concomitantly, we compare the predictivity of the LLNA for human data within the same datasets. Comparing predictivity of methods not only bilaterally (NAM or DA vs. animal method) but including human data in a triangle “NAM data – animal data – human data” offers a comprehensive assessment of the NAM’s and DA’s predictivity. In all these assessments, the 2o3 DA was superior to the LLNA in predicting human skin sensitization hazard. This highlights the importance of a holistic view of reference data instead of limiting validation of NAMs and DAs to data from a single

Introduction
The development of new approach methods (NAM), i.e., in vitro, in chemico and in silico methods, has become a key focus in toxicology. In order to develop hypothesis-driven, mechanistically based new tests, a limited and discrete set of reference substances with well-defined in vivo reference data is often sufficient, and for skin sensitization such a small set was proposed early on (Casati et al., 2009). Larger datasets are usually needed to train empirical models based on a large number of input variables such as models based on genomic data or in silico models (Dimitrov et al., 2005; Johansson et al., 2013). Nevertheless, when it comes to method validation and regulatory acceptance, assessment of predictivity of the test method (or approaches combining multiple methods) becomes a key aspect, and this can only be achieved with datasets of sufficient size and with high-quality in vivo data (Kolle et al., 2019). Toxicological endpoints lacking such high-quality data may face a significant delay in acceptance of a NAM. Thus, to name an example, no method has gone into full validation for acute respiratory toxicity despite decade-long research and even though multiple models, including commercial 3D lung tissue models, have been developed (Lacroix et al., 2018). One may suspect that this lack of a validated replacement could be closely linked to the intrinsic noise of the in vivo data on that systemic endpoint.
For ingredients in consumer products with topical exposure, the skin sensitization endpoint is one of the most important aspects in the safety evaluation of new and existing chemicals. The replacement of skin sensitization testing by non-animal methods has thus been a strong research focus, accelerated by the ban on animal testing for cosmetic ingredients in the European Union since 2013 (EC, 2009, 2013). This research focus led to the rapid development of multiple NAMs by both academic and industrial laboratories (Ezendam et al., 2016). At the time of writing, eight of these methods addressing the endpoint skin sensitization have been fully validated in multi-laboratory ring-trials and implemented as protocols in OECD TG 442C, 442D and 442E (OECD, 2018a,b, 2020). Relatively rapid adoption of these methods by the OECD may have been facilitated by the fact that all these methods were specifically labeled as not to be used as stand-alone methods, especially not as stand-alone methods to classify a chemical as a non-sensitizer (however, they could be used for a positive labeling). Thus, some limitations in predictivity were not of major concern in the validation and adoption of the test guidelines, while technical reproducibility and protocol standardization were of major importance. It was always assumed that the information of multiple tests would be integrated in so-called integrated approaches for final prediction of the skin sensitization endpoint (Jowsey et al., 2006; OECD, 2021a).
To standardize the interpretation of the aggregated information from multiple tests and information sources, defined approaches (DA) were proposed as a way forward. DAs are fixed workflows allowing an assessment of multiple inputs following a fixed data interpretation procedure (DIP) without expert judgment. DAs therefore should deliver an unambiguous rating and can thus be subject to mutual acceptance of data (MAD) by OECD member countries. To put this concept into practice, the OECD has developed a new guideline (No. 497), which describes two simple DAs for the skin sensitization endpoint (OECD, 2021a), and after three years of in-depth discussions by an expert group convening in multiple meetings, this guideline was finally adopted in June 2021.
In order to accept a DA or integrated approach to fully replace the endpoint of concern, in this case skin sensitization, predictivity takes center stage, and the detailed discussions within the OECD expert group mainly focused on this aspect. Hence, reference data were of major interest, and a detailed review of the reference data was undertaken by the OECD expert group (OECD, 2021b). However, predictivity of the DAs had already been assessed repeatedly prior to OECD submission, and it is thus possible to compare the published predictivity of the DA on different reference datasets.
The "2 out of 3" approach (2o3 DA) (Bauch et al., 2012; Urbisch et al., 2015) described in the new OECD guideline allows hazard identification based on two concordant, non-borderline outcomes based on the validated prediction models from OECD TG 442D, DPRA (442C) and h-CLAT (442E) (OECD, 2018a,b, 2020). It is the only DA fully based on validated experimental methods and prediction models. Next to the 2o3 DA, the recently published OECD guideline on DAs for skin sensitization (OECD, 2021a) also includes a modified version of the integrated testing strategy (ITS) (Nukada et al., 2013), which combines scores derived from the in vitro data (based on data obtained from DPRA and h-CLAT) and an in silico prediction. Here we review the predictivity of the 2o3 DA given in different literature sources and for the OECD reference database used to evaluate the DAs, and we discuss the underlying reference datasets.
By comparing predictivity in the "2o3 DA – animal data/LLNA – human data" triangle as shown in Figure 1, a detailed assessment of predictivity is possible, and predictivity of both the DA and the LLNA against human data can be compared.

Description of the reference datasets and comparison of human and LLNA reference data
The predictivity of the 2o3 DA and of the individual underlying test methods had been assessed repeatedly against different databases prior to the assessment by the OECD (Bauch et al., 2012; Kleinstreuer et al., 2018; Natsch et al., 2013; Urbisch et al., 2015). These datasets all overlap substantially, but the different authors all added further chemicals based on their evaluation target, which was set differently for different studies.
The first evaluation of the 2o3 DA was performed by Bauch et al. (2012) when the approach was first described. This was based on a set of 54 chemicals with LLNA data, for 50 of which human evidence was also available. The human data were mainly retrieved from Basketter et al. (1999) among few other sources. In the dataset of Bauch et al. (2012), the LLNA and the human outcome are largely congruent. Thus, balanced accuracy (BA) of the LLNA vs. human data is 88% (Tab. 1). This value is affected by the fact that the underlying dataset (Basketter et al., 1999) was compiled for a retrospective statistical evaluation of the optimal threshold for LLNA positivity, and it contained largely congruent LLNA and human data. Furthermore, the selection of chemicals made by Bauch et al. (2012) was partly based on the availability of KeratinoSens® data (since the same publication compared the predictivity of LuSens and KeratinoSens®, the two assays described in OECD TG 442D; OECD, 2018b), and part of the data was retrieved from the original KeratinoSens® publication (Emter et al., 2010). The dataset in Emter et al. (2010) was compiled from (i) the publication of Casati et al. (2009), (ii) the LLNA performance standards (ICCVAM, 2009), and (iii) the initial list of bona fide reference chemicals for the European Sens-it-iv project (Rovida et al., 2007). Especially the former two lists were specifically compiled to contain chemicals with congruent LLNA and human/guinea pig data. These chemicals (n = 36) were combined with additional chemicals (n = 31) with congruent LLNA and guinea pig/human data to make up the so-called "Silver List" (n = 67). The name "Silver List" was used as we considered it the best possible compilation we could make containing such congruent data from multiple in vivo sources for the validation of a method, anticipating that a "Gold List" with more in-depth data curation would be generated one day. The chemical set in Bauch et al. (2012) has an overlap of 31 chemicals with the Silver List (Emter et al., 2010).
Thus, in all these initial reference lists compiled to validate non-animal methods to address skin sensitization, strong emphasis was placed on selecting reference chemicals for which not only LLNA data but also other congruent, clear evidence of the sensitization risk was available, and a strong alignment, especially between LLNA and human data, was therefore intended. This was considered important, as the LLNA itself is an alternative (specifically a "refinement") method, which does not directly measure skin sensitization and which was itself validated based on reference data.
A larger set of chemicals (n = 145) was later presented to combine all chemicals with available data for KeratinoSens®, DPRA, and the dendritic cell activation test (the U-937 test, an early modification of the U-Sens protocol, was used) (Natsch et al., 2013). This set was developed to specifically predict the LLNA outcome based on a Bayesian network approach (Jaworska et al., 2013). In this case, human references were neither available nor required, thus this reference dataset had a clearly different focus.
Cosmetics Europe then developed a completely different reference list, the main goal being to categorize chemicals regarding their human sensitization potential (Basketter et al., 2014). This list (n = 128) categorizes chemicals into 6 classes, whereby classes 5 and 6 would be categorized as human non-sensitizers for classification purposes according to the authors. No direct comparison to the LLNA was made, except for a graphical representation. Cosmetics Europe then filled all the data gaps for the in vitro data for this list of chemicals for some guideline methods (KeratinoSens®, DPRA, h-CLAT and U-Sens) and the then-emerging method SENS-IS. This analysis was also complemented with a detailed review of all available LLNA data, often from multiple sources. In this analysis, the LLNA is still a very sensitive method when compared to human data (sensitivity of 85.2%), but it has a clearly lower specificity (50.0%) to predict the human sensitization risk (Tab. 1).
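The performance figures quoted in this article (sensitivity, specificity, balanced accuracy) follow from comparing binary calls against a reference. The Python sketch below shows one way such values could be computed for the three sides of the triangle. The class and function names and the six example calls are hypothetical and are not taken from any of the cited datasets. Balanced accuracy is assumed to be the mean of sensitivity and specificity, which is consistent with the values reported here (85.2% and 50.0% give 67.6%).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Performance:
    sensitivity: float  # fraction of reference positives called positive
    specificity: float  # fraction of reference negatives called negative

    @property
    def balanced_accuracy(self) -> float:
        # Assumption: BA is the arithmetic mean of sensitivity and specificity.
        return (self.sensitivity + self.specificity) / 2


def evaluate(predicted: list[bool], reference: list[bool]) -> Performance:
    """Compare binary calls (True = sensitizer) against reference calls."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    pairs = list(zip(predicted, reference))
    tp = sum(p and r for p, r in pairs)
    fn = sum((not p) and r for p, r in pairs)
    tn = sum((not p) and (not r) for p, r in pairs)
    fp = sum(p and (not r) for p, r in pairs)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("reference must contain positives and negatives")
    return Performance(tp / (tp + fn), tn / (tn + fp))


# Hypothetical calls for six chemicals (illustration only, not real data).
human = [True, True, True, True, False, False]
llna = [True, True, True, True, True, False]
da = [True, True, True, False, False, False]

# The three sides of the "DA - animal data - human data" triangle.
triangle = {
    "DA vs. human": evaluate(da, human),
    "LLNA vs. human": evaluate(llna, human),
    "DA vs. LLNA": evaluate(da, llna),
}

for name, perf in triangle.items():
    print(f"{name}: sens={perf.sensitivity:.2f} "
          f"spec={perf.specificity:.2f} BA={perf.balanced_accuracy:.3f}")
```

Evaluating all three pairs on the same set of chemicals, as done here, is what allows the DA and the animal test to be judged against the same human reference rather than only against each other.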
In parallel to the human data collection of Basketter et al. (2014), a large compilation of data was published by Urbisch et al. (2015). This list contains 180 chemicals with LLNA and in vitro data (KeratinoSens/LuSens, DPRA and h-CLAT¹) and a subset (n = 103) that additionally contains human data. The human data were aggregated from the RIFM database and from earlier publications by Basketter et al. (2014) and Bauch et al. (2012). In this evaluation, predictivity of the LLNA vs. human data was also analyzed. Again, good sensitivity (91%) and limited specificity (64%) of the LLNA to predict human reference data were reported (Tab. 1).
1 The 2o3 was originally described using both KeratinoSens™ and LuSens test methods interchangeably to address the key event keratinocyte activation with similar predictive capacity (Bauch et al., 2012). Equivalent predictive capacity using both tests to address the key event keratinocyte activation was later shown by Urbisch et al. (2015). The LuSens is a me-too assay, which was developed based on essential test method components as described in OECD Guidance Document No. 213 (OECD, 2015; Ramirez et al., 2014). It has been validated according to performance standards of OECD TG 442D (OECD, 2018b) as laid down in Guidance Document No. 213 (OECD, 2015; Ramirez et al., 2014). Likewise, the 2o3 was originally described using the h-CLAT, mMUSST and U-937 test methods interchangeably to address the key event dendritic cell activation with similar predictive capacity (Urbisch et al., 2015).
Thus, in summary, published reference datasets were made to contain either (i) congruent LLNA and human/other evidence (Bauch et al., 2012; Casati et al., 2009; Emter et al., 2010), (ii) LLNA-only data for a project focused on LLNA predictivity (Natsch et al., 2013), (iii) a primary focus on human data (that was complemented with LLNA and in vitro data later) (Basketter et al., 2014; Hoffmann et al., 2018), or (iv) all chemicals with available in vitro data (Urbisch et al., 2015).
The human datasets used in these studies were not assessed based on uniform evaluation criteria and involved significant expert judgment. This is because human predictive tests have never been standardized, are based on multiple protocols, and in some cases also clinical data and data on safe use were included in the weight-of-evidence (WoE) assessment (Basketter et al., 2014). Furthermore, the rationale for the expert judgment and WoE analysis was often not fully transparent on a chemical-per-chemical basis in these published datasets. Regarding the LLNA data, negative calls were largely made according to the practice during the LLNA validation when negative chemicals were hardly ever tested at concentrations > 25%. Thus, to name an example, in the Silver List (Emter et al., 2010) chemicals were rated negative if they were negative up to a maximum test concentration of 20%, in line with how the LLNA was validated.
When the OECD expert group on Defined Approaches for Skin Sensitization (DASS) assessed the performance of DAs, the group decided that less expert judgment should be involved and that both the LLNA data and the human data needed a thorough data curation based on fixed criteria. The details are described elsewhere (OECD, 2021c,d). Briefly, for human data, only results from human repeat insult patch test (HRIPT) and human maximization tests (HMT) were considered, and negative ratings were only made in case chemicals were tested up to at least 25% test concentration and if not a single positive reaction was recorded at this or higher concentrations. For LLNA data, the maximal test concentration was required to be at least 50%, and, in case of multiple test results, all tests needed to be negative in order to accept a negative call. These stringent requirements led to a much smaller database, especially for human or LLNA negative calls. This database of highly curated data may now be considered the Gold List, but, unlike the Silver List discussed above, it was not compiled to include congruent data from different in vivo sources. Thus, evaluation of the performance of the LLNA to predict the human outcome in this database again indicates a high sensitivity (94%, n = 47) but a poor specificity (22%) of the LLNA for the human sensitization risk, although the latter is based on a low number of chemicals (n = 9). The LLNA data curated by the OECD can also be compared with the human data compilation from Basketter et al. (2014), giving a higher number of comparisons (n = 96). In this case, a high sensitivity (99%) and better, but still poor, specificity (39%) is observed for the LLNA vs. human data (Tab. 1).
In summary, the curation of data over time led to a continuously reduced specificity of the LLNA vs. human data as summarized in Table 1 and Figure 2. This appears at least partly due to stringent requirements to accept negative results, which are partly based on guidance in OECD TG 429 (i.e., to test chemicals up to 100% concentration), which was actually never validated, and this has to be kept in mind when evaluating an integrated approach or DA for skin sensitization.
Figure 2 (caption, beginning missing): ... human, and c) 2o3 DA vs. LLNA reference data in different evaluations. The different evaluations (x-axis) are ordered in the order of the publication date of the differently curated datasets with the reference data (Bauch et al., 2012; Urbisch et al., 2015; Kleinstreuer et al., 2018; Basketter et al., 2014; Natsch et al., 2013; OECD DB) (please refer to text for details). Blue diamonds: sensitivity; red squares: specificity; green triangles: balanced accuracy.

Predictivity of 2o3 DA vs. human data
Table 2 summarizes the predictivity of the 2o3 DA vs. human data evaluated on the different datasets summarized above. The initial analysis of Bauch et al. (2012), which led to the definition of the 2o3 DA, found a high predictivity (94% BA), which was above the also high BA of the LLNA (88%) for the presented dataset. This good predictivity was confirmed by the analysis of Urbisch et al. (2015), with a BA of 90%. This latter analysis included the Bauch et al. (2012) subset but extended the list from 50 to 101 chemicals. Again, the 2o3 DA was more predictive than the LLNA (BA 77.5%), which in this analysis had a lower specificity. For the Cosmetics Europe database, Kleinstreuer et al. (2018) found a lower predictivity for the 2o3 DA (sensitivity 79.3%, specificity 72.5%, BA 75.9%; n = 127), but this was again higher than that of the LLNA (BA 67.6%).
Table 2 (only a row fragment and the footnotes are preserved): OECD database (all chemicals with human data) excluding borderlines 89 88 88 55 (10 inconcl.) and h-CLAT negatives with log P > 3.5 d. Footnotes: a U-937 test instead of h-CLAT; b The subset (n = 104) of the Kleinstreuer et al. (2018) dataset (n = 127) remaining after the curation process for LLNA data of the OECD expert group. c Chemicals with predictions in the statistically derived borderline range around the prediction threshold (Kolle et al., 2021) are considered inconclusive.
Two congruent, conclusive results are needed for a conclusive 2o3 prediction. d Negative h-CLAT results for chemicals with log P > 3.5 are considered inconclusive. Two concordant and conclusive negative results from 442D and DPRA are needed for a conclusive negative 2o3 prediction for these chemicals according to OECD TG 497 (OECD, 2021a).
Interestingly, based on the reduced subset of chemicals from the Cosmetics Europe database, which was retained after data review in the final OECD database (n = 104), but again comparing vs. the Cosmetics Europe human data (Basketter et al., 2014), the 2o3 DA has a higher BA (85%) than when using the full Cosmetics Europe database, and again a higher predictivity than the LLNA (BA 69%).
During development of the DA guideline, the analysis of borderline outcomes from in vitro data (Gabbert et al., 2020; Leontaridou et al., 2017) was introduced to assess certainty of the outcome as described elsewhere (Kolle et al., 2021). When excluding the 15 borderline calls identified by this analysis, the BA for the Cosmetics Europe human data retained in the OECD database rises to 89%. In a strict interpretation of the 442E guideline, negative calls for chemicals with log P > 3.5 in the h-CLAT are not accepted as negative (but rated "inconclusive"). Translating this limitation into the 2o3 DA as currently implemented also in the OECD DA guideline 497 leads to three more inconclusive chemicals but does not improve predictivity for human data (BA 88%).
Finally, and most importantly, predictivity was also assessed vs. the curated OECD human dataset (OECD, 2021a), although this set is significantly smaller and sensitizers are largely overrepresented (n = 54 sensitizers; n = 11 non-sensitizers). This dataset may be viewed as rather too small for firm conclusions to be drawn. However, it should be noted that the number of human non-sensitizers in the dataset used to evaluate the DAs was considerably higher than in the dataset used to validate the LLNA (containing 68 human sensitizers and 6 human non-sensitizers (Haneke et al., 2001; ICCVAM, 1999)). Most interestingly, the predictivity values for the 2o3 DA are almost identical to those obtained against the Cosmetics Europe human data, although the latter underwent much less curation. This is the case for all three analyses (all chemicals: BA = 83%; borderlines excluded: BA = 88%; h-CLAT negatives log P > 3.5 additionally excluded: BA = 88%). Thus, even if this dataset is relatively small (which may raise criticism concerning statistical power), the fact that it completely confirms results of the previous evaluation vs. the Cosmetics Europe / Basketter et al. (2014) human dataset should give this analysis high credibility. Thus, analysis on a larger number of less curated data gave the same outcome as analysis on a lower number of more curated data. This helped to overcome the criticism that either data are not curated or that the numbers are too low, as it is highly unlikely that both types of analysis came to congruent conclusions by chance.
Interestingly, when looking at the global picture of Table 2 and Figure 2a, the values for a given evaluation are always very similar for sensitivity and specificity vs. human data. This indicates that 2o3 DA offers a very balanced predictivity of the human sensitization hazard, and this seems to be superior to the situation summarized in Table 1 for the LLNA, with a predictivity that tends to be increasingly skewed towards sensitivity.

Predictivity of 2o3 DA vs. LLNA data
Since the LLNA has been the method of choice for the sensitization endpoint of industrial chemicals in the last two decades, and since a DA should fully replace the LLNA as a stand-alone method for hazard identification, predictivity for the LLNA has been emphasized in most studies and evaluations, and it is of prime concern to regulators currently assessing chemicals based on LLNA results. Of course, with the limitations of the LLNA for predicting human sensitization in the different datasets (Tab. 1), a perfect predictivity of the DA for the LLNA cannot and should not be expected, as then a NAM would replicate the LLNA including all its identified weaknesses.
Predictivity of the 2o3 DA vs. the LLNA is summarized in Table 3 and was high (sens. = 81%; spec. = 88%; BA = 84.5%, n = 54)
Table 3 (only a row fragment and the footnotes are preserved): ... and h-CLAT negatives with log P > 3.5 d. Footnotes: a U-937 test instead of h-CLAT; b Evaluation vs. those chemicals for which human data were available; c Chemicals with predictions in the statistically derived borderline range around the prediction threshold (Kolle et al., 2021) are considered inconclusive. Two concordant, conclusive results are needed for a conclusive 2o3 prediction. d Negative h-CLAT results for chemicals with log P > 3.5 are considered inconclusive. Two concordant and conclusive negative results from 442D and DPRA are needed for a conclusive negative 2o3 prediction for these chemicals according to OECD TG 497 (OECD, 2021a).
rent OECD standard, (iii) the LLNA providing standardized data (unlike human data), and (iv) a focus on the high sensitivity of the LLNA while overlooking its low specificity. The LLNA's very simple prediction model (a single value, the estimated concentration leading to a stimulation index of three) might be an additional reason for making it attractive as a sole reference for a validation.
However, since the LLNA does not measure skin sensitization and resulting contact allergy but only a surrogate (cell proliferation as an important step in the induction phase), there is an intrinsic risk in focusing solely on the LLNA, as potential limitations and blind spots of the LLNA may be translated to animal-free testing, such as the log P > 3.5 limitation for the h-CLAT discussed above.
As the early result that the 2o3 DA (and other DAs such as the ITS) is better able to predict the human sensitization outcome than the LLNA had been questioned, a detailed data review was undertaken to scrutinize this finding within the OECD expert group. Interestingly, using these more refined data to evaluate the LLNA, an ever-decreasing specificity of the LLNA vs. human data (Fig. 2) was found, which may be due to stringent, but not validated, criteria to accept negative results. On the other hand, for the 2o3 DA, the high balanced accuracy (based on a balanced sensitivity and specificity) vs. human data could be confirmed by all analyses (Fig. 2). This triangular evaluation included animal and human data and utilized different, partly overlapping datasets. It is obviously superior to validation solely based on LLNA data. This triangular analysis of predictivity should build trust in using the 2o3 DA in regulatory settings and encourage the analysis of other DAs following the same approach. The triangular evaluation of predictivity underlines the notion that animal data from a single test alone should not always be the gold standard when evaluating alternatives, but that a more holistic view including more, more refined, and more relevant reference data should be preferred.
The triangular approach including human data is only possible for the few endpoints where human data are available. For other endpoints, it will still be valuable to collect multiple data on the individual chemicals, ideally from two different animal tests. This was for example recently done to evaluate in vitro models for androgen antagonists (Gray et al., 2020). If such data are available, reference lists can be constructed in two ways: either (i) based on congruent calls from multiple sources, as we did with the Silver List, in which case predictivity of NAMs vs. such a consensus list provides a best estimate of true predictivity for the endpoint of interest, or (ii) based on a similar triangular approach using a NAM compared to two animal tests, which will indicate to which extent prediction uncertainty exists in the data by judging how well the two animal tests predict each other and by comparing this uncertainty to the prediction of the endpoint of interest by the NAM. This approach was used in the recent study by Gray et al. (2020), where it showed that the in vitro approach does not yet offer sufficient predictivity. In either case, carefully curating and referencing the original data is a key step towards improving such evaluations, and in this regard, the recent OECD data curation effort represents a further step forward.
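The fixed criteria for accepting negative calls in the OECD curation, as summarized in this article (human data: HRIPT or HMT only, tested up to at least 25%, no positive reaction at this or higher concentrations; LLNA: maximal test concentration of at least 50%, all tests negative), can be illustrated with a short Python sketch. This is a simplified reading of that summary and not the OECD procedure itself: the record types and function names are hypothetical, and the handling of several LLNA studies (the highest concentration across all studies must reach 50%) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LlnaStudy:  # hypothetical record type
    positive: bool
    max_conc_percent: float


@dataclass(frozen=True)
class HumanTest:  # hypothetical record type
    test_type: str  # "HRIPT" or "HMT"
    conc_percent: float
    positive_reactions: int


def accept_llna_negative(studies: list[LlnaStudy]) -> bool:
    """All studies negative and at least 50% reached (assumed: in any study)."""
    if not studies:
        return False
    all_negative = all(not s.positive for s in studies)
    top_conc = max(s.max_conc_percent for s in studies)
    return all_negative and top_conc >= 50.0


def accept_human_negative(tests: list[HumanTest]) -> bool:
    """HRIPT/HMT only; tested at >= 25% with no positive reaction there."""
    eligible = [t for t in tests if t.test_type in {"HRIPT", "HMT"}]
    high_dose = [t for t in eligible if t.conc_percent >= 25.0]
    return bool(high_dose) and all(t.positive_reactions == 0 for t in high_dose)
```

Under these rules, a chemical that was negative in the LLNA up to 25% only would not receive an accepted negative call, which illustrates why the curated database contains few negatives.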
It is important to note that the limitation in sensitivity of the h-CLAT for chemicals with a log P > 3.5 was found when evaluating predictivity against LLNA data only (Takenouchi et al., 2013). As shown in the supporting document to the DA guideline (Annex 6) (OECD, 2021e) and as will be reported in detail elsewhere, the LLNA actually has a high false discovery rate (FDR) vs. human data for chemicals in this physicochemical range. Thus, it appears that the LLNA generates an increased rate of false-positives for lipophilic chemicals rather than that the h-CLAT (and other NAMs) generates a particularly high rate of false-negatives. Thus, the limitation introduced into OECD TG 442E, and now also translated into the DA guideline, specifically optimizes predictivity for LLNA data, effectively replicating a potential blind spot of the LLNA. As shown in Table 2, this modification does not improve predictivity for human data, and therefore it is questionable whether such a limitation that duplicates mistakes of the animal test should be carried along rather than being corrected based on learnings from analyzing human data.
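The 2o3 data interpretation procedure, including the borderline and log P > 3.5 handling described in this article, can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the text of OECD TG 497: the names `Call` and `two_out_of_three` are hypothetical, borderline results are assumed to have been mapped to `INCONCLUSIVE` beforehand (the statistically derived borderline ranges are not reproduced), and a missing log P is treated as not triggering the h-CLAT restriction.

```python
from enum import Enum
from typing import Optional


class Call(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"
    INCONCLUSIVE = "inconclusive"  # e.g., a borderline result


def two_out_of_three(dpra: Call, tg442d: Call, h_clat: Call,
                     log_p: Optional[float] = None) -> Call:
    """Return the 2o3 call from three individual test calls.

    tg442d stands for a keratinocyte activation assay (OECD TG 442D).
    """
    # Negative h-CLAT results for chemicals with log P > 3.5 are
    # considered inconclusive.
    if h_clat is Call.NEGATIVE and log_p is not None and log_p > 3.5:
        h_clat = Call.INCONCLUSIVE

    calls = [dpra, tg442d, h_clat]
    # Two concordant, conclusive results are needed for a conclusive call.
    if calls.count(Call.POSITIVE) >= 2:
        return Call.POSITIVE
    if calls.count(Call.NEGATIVE) >= 2:
        return Call.NEGATIVE
    return Call.INCONCLUSIVE
```

For a lipophilic chemical (log P > 3.5), a negative overall call in this sketch therefore requires both the DPRA and the 442D assay to be negative, while a single negative among them combined with a negative h-CLAT yields an inconclusive outcome.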

Conclusions
The predictive performances of NAMs and DAs are key criteria for their regulatory acceptance and hence the replacement of animal tests. However, the predictive performance depends not only on the performance of the method but also on the quality and comprehensiveness of the reference data (Kolle et al., 2019).
While evaluating the DA for skin sensitization, the OECD conducted the probably most in-depth curation effort of reference data ever, furthering an already thorough analysis made previously . In predicting skin sensitization, we have the unique possibility to not only compare to animal (LLNA) data but also to human data. Traditionally and following the example of other areas of toxicology where human data are sparse, there is a tendency to attribute more weight to the animal data, which in this case is the LLNA. This may be due to (i) two decades of regulatory practice, (ii) the LLNA being the cur-