Genetic toxicology in silico protocol.

In silico toxicology (IST) approaches are used to rapidly assess chemical hazard, and usage of such methods is increasing in all applications but especially for regulatory submissions, such as for assessing chemicals under REACH as well as under the ICH M7 guideline for drug impurities. There are a number of obstacles to performing an IST assessment, including uncertainty in how such an assessment and associated expert review should be performed or what is fit for purpose, as well as a lack of confidence that the results will be accepted by colleagues, collaborators and regulatory authorities. To address this, a project to develop a series of IST protocols for different hazard endpoints has been initiated and this paper describes the genetic toxicity in silico (GIST) protocol. The protocol outlines a hazard assessment framework including key effects/mechanisms and their relationships to endpoints such as gene mutation and clastogenicity. IST models and data are reviewed that support the assessment of these effects/mechanisms along with defined approaches for combining the information and evaluating the confidence in the assessment. This protocol has been developed through a consortium of toxicologists, computational scientists, and regulatory scientists across several industries to support the implementation and acceptance of in silico approaches.


Introduction
The use of computational methods to assess the biological properties of chemicals is well established in many different industry sectors including the pharmaceutical, cosmetic, food, plant protection, biocides, and general chemical industries (Marchant, 2012;Hasselgren et al., 2013). Computational methods are used during different stages of product development for purposes such as optimizing potency towards a protein target, determining the reactivity of a chemical, predicting the rate of transmembrane permeability, or predicting toxicological endpoints. In the field of toxicology, computational (in silico) methods are widely used to predict toxicological effects directly relevant to human health, as well as to support hazard and risk assessment activities or to prioritize chemicals for in vitro or in vivo testing. The first regulation to formally include the use of in silico approaches to address information requirements for the purposes of hazard identification and risk assessment was REACH (Registration, Evaluation, Authorisation and Restriction of Chemicals) (REACH, 2006). This regulation applies to chemicals manufactured in or imported into the European Union where their import or use is not covered by other specified legislation. In addition, since 2014, with the implementation of the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH) M7 guideline (ICH, 2014;ICH, 2017), regulatory authorities, such as the US Food and Drug Administration (US FDA), the Japanese Pharmaceutical and Medical Devices Agency (PMDA), and the European Medicines Agency (EMA), accept in silico assessments of the mutagenic potential of drug impurities.
The ICH M7 guideline represented a milestone for the regulatory acceptance of computational methods for hazard assessment in pharmaceuticals and the implementation of the guideline has influenced the use of in silico assessments for other applications, such as the risk assessment of extractables and leachables, both for pharmaceuticals and for other industries. Other examples include the revision of the Toxic Substances Control Act (TSCA) to include predictive models and expert review as part of an overall assessment as well as the US FDA Center for Devices and Radiological Health (CDRH) issuing a guidance for industry on the use of International Standard ISO 10993-1 for biological evaluation of medical devices and indicating that in the absence of 2017), ATAD5 (Fox et al., 2012)) or panels of genes (Li et al., 2015), or differential cytotoxicity using isogenic cell lines that have been knocked out for different DNA repair enzymes (Yamamoto et al., 2011) have also been utilized. Later in the section, these types of methods are referred to as "primary DNA damage". Some of these test methods are no longer commonly used due to limitations in sensitivity (UDS) or the lack of a mechanistic underpinning (SCE), while others are generally used as screens to generate complementary information to provide a weight-of-evidence for mechanistic understanding.
Depending on the industry sector, slightly different combinations of tests may be required as outlined in published guidance documents to support regulatory data requirements, such as the ICH S2(R1) guidance for drugs (ICH, 2012), European Food Safety Authority (EFSA) guidance (EFSA, 2011) for food and feed safety assessment, the REACH guidance (ECHA, 2011) for registration of chemicals or ISO 10993-1 (CDRH, 2016) for evaluation of medical devices. In this publication, the intention is not to adhere to any specific guidance, but rather to base the assessments on a decision scheme (simple version shown in Figure 1), outlining a strategy for assessing genotoxicity based on coverage of the three major endpoints of genotoxicity as well as the generic term "primary DNA damage", mentioned earlier in this section. Implicitly, this leaves room for alternatives in terms of specific study types. The commonly used genetic toxicology studies and the respective mechanisms/effects they identify are shown in Figure 2.
The purpose of this GIST protocol is to outline the process for determining whether a chemical agent is genotoxic or not, as well as the level of confidence related to the assessment. The process allows for the potential inclusion of additional information based on the results of other test methods or other supporting information, such as a history of safe use in food (Constable et al., 2007). The process of performing a risk assessment of a chemical agent will depend on many factors, such as the exposure conditions and in what context the agent is being investigated. Defined risk assessment is considered out of scope for the GIST protocol and should be performed in a situation dependent context although the GIST protocol can be used to support this activity.

In silico methodologies
2.1 Data availability for in silico models
The general protocol paper outlined some of the in silico methodologies that can be used to generate predictions (Myatt et al., 2018). These include i) rule-based (or "expert") systems that identify the presence of a structural moiety, also referred to as a structural alert, that may indicate genotoxic potential, and ii) statistical (quantitative structure-activity relationship ((Q)SAR)) models that use a variety of molecular descriptors such as structural fragments or physicochemical properties to predict activity. Here, these two types are collectively referred to as "(Q)SAR" models. In addition, "read-across" (OECD, 2014) is a methodology that utilizes experimental or computed properties, such as physicochemical properties, together with structural similarity and experimental data for structural analogs to extrapolate from source chemical(s) to a target (query) chemical(s) (OECD, 2014).
The types of in silico tools that can be developed for a specific endpoint are, to a great extent, driven by the availability (amount and quality) of experimental data for model development, as well as the degree to which the chemicals of interest exert their toxicity via a common mechanism. In silico tools are most easily developed for endpoints with a well understood and similar mode of action for which a large number of data points are available. Bacterial mutagenicity is a relevant example of such an endpoint and consequently, in this area of genetic toxicology, the development of (Q)SAR models is the most established, largely due to the realization of an electrophilic mode of action for many genotoxic agents (Miller and Miller, 1981;Ashby and Tennant, 1988) and the availability of a large data set (>2,000 compounds). Conversely, endpoints with less data or studies where the response can be due to several different mechanistic pathways (e.g., chromosome damage) are more challenging for (Q)SAR modeling. Table 1 lists an estimate of the number of compounds in the public domain with associated genetic toxicology data, as published in the Leadscope toxicity database (Leadscope, 2018). In many cases, multiple results for the same assay and chemical will be available, sometimes with conflicting results and/or conclusions. Private or commercial organizations may have access to additional compounds with experimental data (e.g., from product development), which can be combined with publicly available data. Two factors need to be considered when using experimental data for modeling or read-across: (1) data conflicts need to be resolved (this is not always possible) and experimental protocols need to be examined to ensure that only data measured and interpreted under similar conditions are merged, and (2) chemical structures need to be accurate.
The general protocol provides more specific details regarding considerations when using data for modeling or read-across (Myatt et al., 2018). In the case of genetic toxicology, it may be relevant to look at modeling certain subsets of assay results. For example, data generated using the Escherichia coli (E. coli) WP2 uvrA pKM101 and Salmonella typhimurium TA102 (TA102) strains have sometimes been modeled separately (Stavitskaya, 2013) from data generated using the Salmonella strains TA98, TA100, TA1535, and TA1537, because the mechanistic basis of mutation induction is different for these two groups. The second group has GC base pairs at the primary reversion site and is not as sensitive in detecting certain oxidizing mutagens, crosslinking agents and hydrazines, which are instead better detected using the TA102 or E. coli strains, which have an AT base pair at the primary reversion site (OECD, 1997a). An additional consideration is the distribution of the biological response. It is not unusual for the available data sets to be skewed so that one assay result classification (e.g., negative/positive) occurs much more frequently than the other. Usually, inactive compounds are more abundant, but regardless of which class dominates, the resulting imbalance will require specific strategies to be applied during the modeling procedure to avoid predictions that are biased by the prior probability resulting from the training set distribution.
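One common strategy for handling such an imbalance is to weight each class inversely to its frequency in the training set. The following is a minimal sketch of that idea and not a method prescribed by this protocol; the function name balanced_class_weights and the example counts are illustrative assumptions.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return one weight per class, inversely proportional to its frequency.

    Uses n_samples / (n_classes * class_count), so that every class
    contributes the same total weight during model training.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical training set: 800 negative and 200 positive assay calls.
labels = ["negative"] * 800 + ["positive"] * 200
weights = balanced_class_weights(labels)
```

With these hypothetical counts, each positive example carries four times the weight of a negative example, which offsets the 4:1 skew of the training set.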

Mutagenicity:
2.2.1.1 Bacterial mutagenicity:
The majority of in silico tools developed for this endpoint have been built using data generated in the bacterial reverse mutation (Ames) assay that relies primarily on Salmonella typhimurium tester strains. Historically, this assay has been viewed as the "gold standard" of mutagenicity testing and the first SARs relating chemical structure and bacterial mutagenicity using data generated using the Ames assay were published in 1988 (Ashby and Tennant, 1988). Several tools are available for modeling this endpoint, including expert rule-based systems and statistical models. The application of two complementary models, one rule-based and one statistical-based model, is described and recommended in the ICH M7 guideline for the evaluation of potential mutagenic impurities in pharmaceuticals (ICH, 2017) and by EFSA for dietary risk assessment (EFSA, 2016).

Mammalian cell mutagenicity:
Both statistical models and rule-based systems utilizing mouse lymphoma assay (MLA) (L5178Y cells) data are available. Historically, application of these models often resulted in many false positive predictions, which was in part due to some of the experimental data from which the models were derived being liberally interpreted as evidence of mutagenicity. The criteria for interpretation of the experimental data were re-evaluated (Moore et al., 2006), resulting in more stringent criteria, which led to changes to some of the experimental conclusions. In this context, and to ensure the best possible predictive power, it is important for the compilation of training sets to take the most contemporary data evaluation criteria into account. It should be noted that the currently available in silico MLA models do not provide information differentiating mutagenicity versus clastogenicity, and either or both endpoints may be implicated in a positive response in this assay.
For assays using other mammalian mutagenicity cell lines, such as those detecting mutations at hypoxanthine-guanine phosphoribosyl transferase (HPRT) and at a transgene of xanthine-guanine phosphoribosyl transferase (XPRT), which are treated as equivalent by some industry sectors and regulatory agencies, there are currently not enough data available to generate useful models, although these assays may be referenced in expert systems. This also applies to in vivo mutagenicity studies. In addition to referencing them in expert systems, such data can be used for read-across to support a weight-of-evidence scenario, if they are available. It is expected that there will eventually be enough data available in the public domain to support model development.

Clastogenicity:
Both statistical and rule-based tools for in vitro and in vivo clastogenicity are available as commercial and free tools. Clastogenicity can result from numerous and diverse mechanisms of action (Bender et al., 1974;Snyder, 2000;Kaina, 2004;Snyder, 2010). Furthermore, cytotoxicity can confound the results of in vitro clastogenicity assessments (Kirkland et al., 2007b;Parry et al., 2010b;Galloway et al., 2011;Honda et al., 2018). Consequently, it is challenging to build highly predictive in silico models for this endpoint. In addition, the supporting datasets are mostly quite small and therefore the applicability domain of these models is often limited from a chemical space perspective. In general, the in silico models are better at identifying reactive compounds that damage DNA directly, thereby leading to clastogenicity, than they are at correctly predicting compounds involved in indirect, non-DNA-reactive effects leading to clastogenicity (e.g., off-target interactions disturbing cellular homeostasis or non-covalent intercalation between DNA base pairs). For better prediction of indirect, non-DNA-reactive effects, supplemental structural similarity searching or the use of specific models for the prediction of off-targets known to be involved in clastogenic effects can provide additional important information (Olaharski et al., 2009;Hsu et al., 2018).

In vitro chromosomal aberration:
The majority of available data in the public domain has been generated using Chinese hamster ovary (CHO) or Chinese hamster lung (CHL) cell lines. Initially, in the 1980s, it appeared that the two cell lines differed significantly in their sensitivity with respect to identifying genotoxic compounds, but subsequent in-depth comparisons demonstrated that the apparent differences were due simply to when cells were sampled after the start of exposure (e.g., Sofuni et al., 1990;Galloway et al., 2011). Currently, in a regulatory context, data generated using these two cell lines are interchangeable, as are those generated using other mammalian cell lines such as human peripheral blood lymphocytes, as long as the same protocol is followed (OECD, 2016b).

In vitro micronucleus:
Few data following a standardized protocol are available in the public domain for this endpoint, due to the relatively recent adoption of an OECD test guideline (number 487) for this assay (OECD, 2010;OECD, 2016i), and as a consequence, statistical models would be limited to a narrow applicability domain. The derivation of expert/structural alerts is therefore the most promising in silico approach until more data are published; however, the in vitro micronucleus (MN) assay is becoming more widely used than the in vitro chromosomal aberration (CA) assay and it is assumed that the body of data will grow in the near future. It is also likely that some larger organizations have proprietary models for this endpoint. For the currently available public data, the majority of positive data have not been differentiated between clastogenicity and aneugenicity with regard to mechanism of action. In individual cases, read-across may be possible if suitable chemical analogs, e.g., as defined in the Read-Across Assessment Framework (ECHA, 2008;ECHA, 2017), are available.

In vivo chromosomal aberration:
The number of available data points is small (mainly bone marrow studies performed in rats), since this assay is often reserved for mechanistic investigations rather than as a core genotoxicity assay, limiting the use of statistical models beyond specific compound classes. The derivation of expert alerts and the application of read-across, when experimental data for analogs are available, may be the most relevant methodologies for this endpoint.

In vivo micronucleus:
Most publicly available data have been generated using bone marrow and/or peripheral blood studies performed in mice or bone marrow studies performed in rats. The difference in experimental procedure is a result of the fact that micronucleated erythrocytes are removed from the blood by the spleen in rats but not mice (Dertinger et al., 2011b;Hayashi, 2016). The ability to use flow cytometry to measure the frequency of micronucleated erythrocytes has greatly increased test chemical throughput while making the data collected more robust (Hayashi, 2016). Furthermore, this approach has been used to evaluate micronuclei in immature erythrocytes in the blood of rats (MacGregor et al., 2006). Data are available in sufficient amounts to build a statistical model although it will have a limited applicability domain. Rat and mouse MN data should be analyzed separately as the different species have differences in responses (positive/negative) to some chemical agents. Different strains, sexes, or administration routes are usually not separated as there is not enough data to support this and the individual datasets would be too small. Read-across or rule-based systems may better address differences in response where such factors are thought to be important and if they can be related to certain chemical classes in a systematic manner. Historically, the increases in MN formation in vivo have not been evaluated to determine if the response is due to clastogenicity or aneugenicity. Any in silico model built using these data will therefore not be specific as to the nature of the type of chromosomal changes, but rather the endpoint of the assay, MN formation per se. In some cases, the mechanism of MN formation can be inferred by combined interpretation with other assay results or specific staining techniques (e.g., kinetochore staining) (Hennig et al., 1988) or more recently based on size distribution using flow cytometric methods (Torous et al., 1998b). 
A specific example of a combined interpretation would be a negative prediction for in vitro or in vivo CA but a positive in vitro or in vivo MN prediction, leading to an overall prediction that clastogenic effects are unlikely but that aneugenic effects are possible. It is clear that interpretation of experimental data could lead to such a conclusion and although more uncertain, one could in theory interpret the in silico results in a similar manner. It would also be possible to look at the most similar examples in the training set and in a read-across approach determine what the mechanism might be. Since the availability of data is scarce and because the underlying mechanism has rarely been determined in historical data, this is not usually feasible with the current body of data available for modeling.
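The combined interpretation described above can be sketched as a small decision function. This is an illustrative sketch only: the function name interpret_ca_mn and the wording of the returned hypotheses are assumptions, and only the CA-negative/MN-positive branch is taken directly from the text; the remaining branches are simplifying assumptions that would need expert review.

```python
def interpret_ca_mn(ca_positive, mn_positive):
    """Combine chromosomal aberration (CA) and micronucleus (MN) predictions.

    Both arguments are booleans (True = positive prediction).
    Returns a mechanistic hypothesis, not a definitive conclusion.
    """
    if not ca_positive and mn_positive:
        # Case described in the text: structural chromosome damage is not
        # predicted, so the MN signal may reflect whole-chromosome loss.
        return "clastogenicity unlikely; aneugenicity possible"
    if ca_positive:
        # Assumption: a positive CA prediction points to clastogenicity,
        # whatever the MN prediction is.
        return "clastogenicity possible"
    # Assumption: both predictions negative.
    return "no chromosomal damage predicted"


hypothesis = interpret_ca_mn(ca_positive=False, mn_positive=True)
```

As noted in the text, a hypothesis derived in this way from in silico results carries more uncertainty than one derived from experimental data.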

Aneugenicity:
Historically, data generated using the in vitro and in vivo MN assays were not routinely evaluated in such a way that the mechanism of MN formation could be determined. Consequently, it is often not possible to differentiate an aneugen from a clastogen when evaluating the majority of data published. The number of data points available for modeling where the mechanism has been unambiguously determined is small and would therefore not support statistical modeling. The limited data for this endpoint could be suitable for read-across if analogs with mechanistic information could be found, or for deriving expert alerts. Although not regularly reported, an increase in the number of mononucleated cells with micronuclei can indicate an aneugenic mode of action (Rosefort et al., 2004) and could be used to differentiate between the two modes of action. With new automated methods (Torous et al., 1998a;Dertinger et al., 2011a) for identifying and scoring micronuclei becoming more widely used, together with special methods for differentiating between chromosome fragments (clastogenicity) and whole chromosomes (aneugenicity), such as kinetochore staining or analysis of micronucleus size, it is anticipated that this situation will improve in the next few years.

Other endpoints:
Other methods relevant for experimental genotoxicity testing have not yet been generally accepted for making regulatory decisions and/or the data generated by these test methods are not available in sufficient amounts to build reliable in silico models (see Mahadevan et al., 2011;Zeiger et al., 2015;Dearfield et al., 2017).
These test methods include those that evaluate the upregulation of specific DNA damage response elements such as GADD45A (Knight et al., 2009;Hughes et al., 2012), H2AX (Mishima, 2017), ATAD5 (Fox et al., 2012), and TP53 (Clewell et al., 2014;Witt et al., 2017) using reporter genes, or multiple DNA damage response elements evaluated using targeted transcriptomic platforms (Aubrecht and Caba, 2005;Sakai et al., 2014;Li et al., 2015;Corvi et al., 2016); those that evaluate the differential responses in wild-type and isogenic DNA repair deficient DT-40 cells (Yamamoto et al., 2011;Nishihara et al., 2016) or TK6 cells (Saha et al., 2018); those that integrate multiple endpoints into an assessment of genotoxicity (Wilde et al., 2017;Bryce et al., 2018); and those that integrate DNA damage response into an overall assessment of toxicity using high throughput transcriptomic profiling to derive points of departure for risk assessment (Farmahin et al., 2017;Mav et al., 2018). Any potential tools built using any of these test methods will not be further discussed in the GIST protocol as they tend to be less extensively validated, even though they may be useful in some cases. Once such test methods are accepted by the wider community and their use is justified through validation exercises, in silico methods built from their data should be formally incorporated within this framework.

Applying in silico tools:
The practical aspects of applying in silico tools were discussed in detail in the general protocol (Myatt et al., 2018), including how to select models based on their performance, applicability domain, and model complementarity, as well as the factors to consider when running chemicals through the models, such as ensuring that chemical drawing conventions follow any requirements of the model developer. Hence, this will not be further discussed here. Criteria for the selection of suitable in silico methodologies, as well as reporting strategies, were also detailed in the general protocol (Myatt et al., 2018).

Expert review of in silico tools
The application of in silico tools for hazard identification may involve an expert review of both the models and the predictions. It is important to determine that the models were built according to accepted criteria  and using relevant training datasets. The endpoint training data used will dictate what can be predicted. For example, if only compounds tested in E. coli uvrA pKM101 and S. typhimurium TA102 are used to build a bacterial mutagenicity model, then the output is only relevant for these strains and may not be extrapolated to predict the outcome of a full OECD guideline compliant bacterial reverse mutation assay which requires at a minimum the inclusion of five bacterial strains. Ideally, to ensure that the data originates from comparable protocols, only experimental data generated using guideline-compliant conditions should be used. However, in practice, pragmatic approaches may need to be considered to ensure that the models cover a wide chemical space without any unnecessary compromise to data quality. As an illustration, experimental data from an assay involving a limited number of bacterial strains are often included in model building if the compound is shown to be mutagenic in at least one strain and as long as the other experimental conditions adhere to established guidelines. The justification for this is that the test guidelines only require one strain to be positive for the test article to be considered mutagenic. For compounds to be considered negative, it is preferable to have negative data from all recommended strains (OECD, 1997a). This is often not available and a certain degree of compromise in both the number of strains and data quality is usually accepted. For the purpose of this protocol, assessing the underlying data used in the model building is an important component of assigning a reliability score to the prediction, which will be discussed further in section 3.1.
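The asymmetry between positive and negative calls described above (one positive strain is sufficient for a positive call, whereas a negative call ideally requires negative data from all recommended strains) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name overall_ames_call, the returned labels, and the five-strain set are hypothetical simplifications, since the test guideline permits alternative strains that are not modeled here.

```python
# Assumption: a simplified set of five recommended strains. The actual
# guideline allows alternatives for some of these positions.
RECOMMENDED_STRAINS = frozenset(
    {"TA98", "TA100", "TA1535", "TA1537", "WP2 uvrA"}
)


def overall_ames_call(strain_results, required=RECOMMENDED_STRAINS):
    """Derive an overall call from per-strain results.

    strain_results maps a strain name to True (mutagenic) or
    False (not mutagenic) for the strains that were actually tested.
    """
    if any(strain_results.values()):
        # One positive strain is enough for a positive call.
        return "positive"
    if required.issubset(strain_results):
        # All recommended strains were tested and all were negative.
        return "negative"
    # Negative in the strains tested, but coverage is incomplete; flagged
    # so that an expert reviewer can decide whether to accept the data.
    return "negative (incomplete strain set)"


call = overall_ames_call({"TA98": False, "TA100": True})
```

Keeping the incomplete-coverage case as a separate label, rather than discarding such data, reflects the pragmatic compromise described in the text while still making the limitation visible when a reliability score is assigned.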

Experimental assays and studies
Tables 2 and 3 list in vitro and in vivo assays, respectively, that are frequently used to assess genotoxicity, as well as annotation of the mechanism(s) each assay may identify. For detailed descriptions of the experimental protocols, the OECD test guidelines may be consulted. Test methods that are no longer supported by the OECD are listed in the tables but are not discussed further. Data that were generated in the past using such assays may be considered appropriate for use only if no additional or higher relevance data are available. In general, it is also important when using historical data to evaluate if the relevant regulatory guidelines or data requirements have changed since the data were generated, so that they can be assigned a contemporary quality score. For example, changes to some of the in vitro protocols used in the pharmaceutical sector were made after a 2006 EURL ECVAM workshop (Kirkland et al., 2007a) on assessing false positive rates of mammalian in vitro tests. The new protocols introduced requirements for p53 competent cell lines and lowering of maximum tested concentrations, amongst other things, to reduce the number of unnecessary follow-up in vivo studies (Kirkland and Fowler, 2010;Parry et al., 2010a;Fowler et al., 2012a;Fowler et al., 2012b). A number of these recommendations were adopted under the OECD test guideline revisions performed in 2014-2016.

Expert review of experimental data
An important step in any hazard identification process is a search for existing experimental data from endpoint-relevant in vitro and in vivo assays. In this context, it is pertinent to assess the quality of any identified data as well as its relevance to any of the mechanistic assessments related to the major genotoxicity endpoints. In the general protocol publication (Myatt et al., 2018), we proposed to assess data quality using Klimisch scores (Klimisch et al., 1997) as this is a widely accepted methodology used by ECHA, for example, in the Read-Across Assessment Framework (ECHA, 2017) and can readily be generated using the ToxRTool (European Commission, 2018b). Klimisch scores rank data from 1 to 4, depending on how the experiment was conducted (and reported), taking into consideration for example, whether the experiment was compliant with Good Laboratory Practice (GLP) and whether details of the experiment are available for review. These scores provide a consistent and reproducible way to classify the reliability of the test results.
An expert review of any identified experimental dataset may be performed to assign the appropriate Klimisch score. A detailed description of how this can be performed has been published by ECHA (ECHA, 2011). For assays relating to genotoxicity that are mentioned in this GIST protocol, the experimental conditions can be examined in relation to the relevant OECD test guideline. For test methods that no longer have a current OECD test guideline, such as the in vitro SCE assay in mammalian cells, a historical version of the guideline can be used to determine whether the experimental conditions were relevant at the time of data generation. Although these data are considered of lower relevance as use of the assay was discontinued for being scientifically questioned, there are situations where no other experimental data are available and they may be used in a weight-of-evidence scenario.
In addition, historical data that were generated under conditions described in a previous version of a test guideline can be used if the data were generated and reported in such a way that they can be reevaluated in accordance with the current guideline version and best practice (e.g., as described by the International Workshop on Genotoxicity Testing (Kirkland, 1994;Kirkland, 2000;Kirkland, 2003;Kirkland et al., 2007c;Kirkland et al., 2007d;Kirkland et al., 2011;Martus et al., 2015)). This may not always be possible, and an expert review will determine what reliability can be assigned on a case-by-case basis considering the particular chemical class, as well as the experimental details.
In situations where multiple experimental results are available for a test substance, different scenarios can be envisaged. If several experiments are found for the same assay that were performed in different laboratories or under slightly different (guideline compliant) experimental conditions, the data with the best Klimisch score may be given stronger weight. In cases where there are multiple conflicting results with the same Klimisch score and it cannot be determined through the expert review if one result is more reliable than the other(s), the results can either be considered unusable for hazard identification if they are of low quality, or a conservative approach might be taken where the occurrence of a positive result takes precedence. It would be critical in this situation to scrutinize the experimental conditions in detail, taking into account factors such as compound purity, potential cytotoxicity, solvent effects, etc. Alternatively, a weight-of-evidence approach could be taken with the final call being dependent on the judgement of a subject matter expert.
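One of the scenarios above, in which the best-scored data take precedence and conflicts at the same score are resolved conservatively, can be sketched as follows. This is a simplified illustration and not a substitute for expert review: the names AssayResult and resolve_results are hypothetical, treating Klimisch scores of 3 or 4 as "low quality" is an assumption, and giving the best-scored results full precedence is a simplification of giving them "stronger weight".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssayResult:
    positive: bool  # True if the study concluded a positive result
    klimisch: int   # 1 (most reliable) to 4 (reliability not assignable)


def resolve_results(results, low_quality_threshold=3):
    """Combine several results for the same assay and test substance."""
    if not results:
        raise ValueError("at least one result is required")
    # A lower Klimisch score means higher reliability.
    best = min(r.klimisch for r in results)
    top = [r for r in results if r.klimisch == best]
    calls = {r.positive for r in top}
    if len(calls) == 1:
        # The most reliable results agree with each other.
        return "positive" if calls.pop() else "negative"
    if best >= low_quality_threshold:
        # Conflicting results of low quality: not usable for hazard
        # identification.
        return "unusable"
    # Conflicting results of equal, acceptable quality: the conservative
    # approach lets the positive result take precedence.
    return "positive"


outcome = resolve_results([AssayResult(True, 2), AssayResult(False, 1)])
```

In practice, the experimental conditions behind each conflicting result would still be scrutinized before such a rule is applied.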
There are particular elements, or fields, relating to the experiment that are essential to review and document when assessing data. This practice supports an efficient and thorough review of the final assessments and ensures that the review is conducted in a consistent way. Table 4 lists the relevant fields for any in vitro test and shows an example of an Ames assay result for fluorobenzene. Table 5 lists the required fields for an in vivo MN assay with bosutinib as an example.
In addition to understanding the quality of the data, which relates to the technical aspect of the information, the scientific relevance to the toxicological endpoint result needs to be determined. "Relevance" was defined in the general protocol (Myatt et al., 2018) and relates to the predictivity of a specific toxicological effect or mechanism (gene mutation, clastogenicity, aneugenicity) for the toxicological endpoint (genotoxicity). As an example, the bacterial mutagenicity assay is considered highly relevant with respect to genotoxicity, whereas an in vitro CA test may be considered to have lower relevance. The rationale is related to how these tests are managed in a practical setting, where a bacterial mutagenicity assay is often not followed up with in vivo testing and, in many industries, a positive result in this assay is often considered sufficient to stop the development of a candidate active substance. Other industry sectors may adopt a different level of concern and a manufacturing chemical intermediate might, for example, be subject to further testing or used under strictly controlled conditions. In contrast, a positive result in an in vitro CA test can be de-risked or confirmed by performing an in vivo CA study. From a 3Rs point of view, it is desirable to perform in vivo testing as a last resort and to incorporate genotoxicity testing into general toxicology testing that may be required for other purposes. Generally, all in vivo studies are considered to have high relevance with respect to an overall assessment of genotoxicity. Conversely, tests where the OECD test guideline has been deleted (OECD, 2016a) have been assigned a low relevance. The relevance score is to some extent more subjective than the reliability score as different organizations and industry sectors, as well as regulatory agencies working to the data requirements of differing regulations, may apply different criteria in this respect.
Even within an organization, different toxicologists may have individual preferences and experiences influencing their choice of assays. Furthermore, the chemical agent (with its physicochemical properties and structural aspects) may dictate which assays are relevant. This protocol reflects a general view of assay relevance, but it is recognized that there could be situations where an expert review may justify a different interpretation. The suitability of different assays in terms of follow-up actions, mechanisms identified by each assay and many other aspects have been reviewed and discussed in a publication by the Health and Environmental Sciences Institute (HESI) In Vitro Genetic Toxicity Testing Review Subgroup (Dearfield et al., 2011). Table 6 provides a non-exhaustive list of available sources of genetic toxicology data. There are also databases that comprise several sources. Individual databases will support different types of queries such as various identifiers (Chemical Abstracts Service registration number (CASRN), synonym or chemical name) and/or chemical structure. If possible, it is desirable to know the batch of compound that was tested, as well as the associated characterization data, as it is relevant to know the purity of the tested chemical. The presence of potential impurities is important; even small quantities of a mutagenic impurity may result in a false positive result. This type of information is not always available in public databases but can often be found in corporate databases. Structure searches should be performed with care, considering factors such as stereochemistry, tautomerism, salt form, and counter ions, for example. It might be necessary to search for both the parent compound and alternative forms when searching for a particular chemical if it is not known how the structures have been reported. 
It may be helpful to perform a substructure search, which looks for compounds with open substitution patterns, or a "family search" that will retrieve different salt forms and also analogs with different chirality. Some databases additionally provide regulatory authority classifications with respect to mutagenicity and carcinogenicity; the International Agency for Research on Cancer (IARC), for example, provides carcinogenicity classifications and ECHA provides carcinogenicity, mutagenicity or reproductive toxicity (CMR) classifications.
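When databases are queried by identifier rather than by structure, a mistyped CASRN can silently return no hits or hits for the wrong substance. As a minimal sketch, the function below checks the published CAS check-digit rule (the last digit equals the weighted sum of the preceding digits, counted from the right, modulo 10). The function name is illustrative and is not part of the protocol; the CASRN used in the example is that of fluorobenzene, the compound used as the example in Table 4.

```python
import re

# A CASRN has the form NNNNNNN-NN-N: 2 to 7 digits, 2 digits, 1 check digit.
_CASRN_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_casrn(casrn: str) -> bool:
    """Return True if the string is a well-formed CASRN with a correct check digit.

    This only validates the format and checksum; it does not confirm that
    the number is actually assigned to a substance.
    """
    match = _CASRN_PATTERN.match(casrn.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == check_digit


print(is_valid_casrn("462-06-6"))  # fluorobenzene -> True
print(is_valid_casrn("462-06-5"))  # wrong check digit -> False
```

A check of this kind only guards against transcription errors; the considerations on salt forms, stereochemistry and tautomers described above still apply to the structure search itself.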

Other data references
In addition to the above listed databases, other sources of data, such as model training sets or other compilations of experimental data, can be searched for supporting information. In some cases, substances may have already undergone a risk assessment by a regulatory committee, and this output can be useful either directly or in a modified format in the hazard identification process. For instance, for the evaluation of bacterial mutagenicity the ICH M7(R1) addendum (ICH, 2017) provides detailed information on risk assessment of a number of chemicals and is applicable in the pharmaceutical sector. The addendum discusses acceptable intakes of certain chemical residues or impurities that are mutagens and/or carcinogens and that are common in pharmaceutical manufacturing. Another source is the "European Commission, Joint Research Centre (JRC) Genotoxicity & Carcinogenicity Consolidated Database of Ames Positive Chemicals" (European Commission, 2018a) (also listed in Table 6), which contains >700 unique chemical compounds that are bacterial mutagens and have a variety of additionally reported in vitro and in vivo genotoxicity and carcinogenicity data. This database contains an "overall call" based on a set of defined criteria for the reliability and quality of the data when results from more than one source are available. For many commercial organizations, it may be difficult to find structural analogs for proprietary compounds in public databases due to differences in chemical space. In these cases it is relevant to search proprietary databases, as these often contain high quality sources of information. From the documentation and reporting point of view, as well as for any regulatory submission, this may be an issue as a thorough expert review and final assessment needs to be documented and disclosed to reviewers to enable their independent evaluation. 
Any analogs or other relevant structures should preferably be included in the final report for full transparency of the assessment.

Reliability score
The general protocol (Myatt et al., 2018) provides detailed information on how to combine in silico predictions with experimental data, where these are available. The process will not be outlined in this publication, but involves expert review of the model(s), the prediction(s) as well as a review on the quality of the experimental data. In general, it is preferable that experimental data be of Klimisch score 1 or 2, depending on the situation, to be considered of high enough quality to support decision making. It is recognized that this is not always possible. However, depending on the use case, there could be situations where expert review and data quality assessment is not feasible and a lower level of confidence is acceptable, such as screening.
To enable a standardized method of performing an assessment of experimental results and in silico results together, an extension to the Klimisch score (Table 7) has been introduced to allow scoring of in silico components alongside experimental results using a Reliability Score (RS) (Myatt et al., 2018). Experimental data of Klimisch scores 1 and 2 are essentially unchanged in their original Klimisch description but are referred to as RS1 and RS2. Furthermore, the lower quality Klimisch categories 3 and 4 have been placed in the lowest RS category of 5. This accommodates the use of in silico results of high quality in categories RS3 and RS4, illustrating their higher acceptability in certain regulatory contexts, compared to low quality experimental data or a single, lower quality in silico result (RS5). For genetic toxicology, this is of particular importance for both REACH and ICH M7 applications, for example.
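The mapping from Klimisch scores to reliability scores for experimental data, as described above, can be written down as a simple lookup. The following is a minimal sketch only; the names `ReliabilityScore` and `reliability_from_klimisch` are illustrative and do not come from the protocol, and Table 7 remains the authoritative definition.

```python
from enum import IntEnum


class ReliabilityScore(IntEnum):
    """Reliability scores; a lower number means higher reliability."""
    RS1 = 1
    RS2 = 2
    RS3 = 3  # reserved for in silico results (not assigned from a Klimisch score)
    RS4 = 4  # reserved for in silico results (not assigned from a Klimisch score)
    RS5 = 5


# Klimisch 1 and 2 carry over unchanged; Klimisch 3 and 4 both fall into RS5.
KLIMISCH_TO_RS = {
    1: ReliabilityScore.RS1,
    2: ReliabilityScore.RS2,
    3: ReliabilityScore.RS5,
    4: ReliabilityScore.RS5,
}


def reliability_from_klimisch(klimisch: int) -> ReliabilityScore:
    """Translate the Klimisch score of an experimental result into an initial RS."""
    if klimisch not in KLIMISCH_TO_RS:
        raise ValueError(f"Klimisch score must be 1-4, got {klimisch!r}")
    return KLIMISCH_TO_RS[klimisch]


print(reliability_from_klimisch(2).name)  # RS2
print(reliability_from_klimisch(3).name)  # RS5
```

The score returned here is only the initial value; as the worked examples show, expert review of the experimental data may justify a different final score.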

Toxicological effect or mechanism assessment
Toxicological effects are defined as observations derived from the experimental tests considered relevant for genetic toxicology (i.e., the in vitro and in vivo tests listed in Tables 2 and 3). An assessment will take into account all of the experimental and in silico information available for the query compound for each effect separately, in a weight-of-evidence scenario. A simple hypothetical example is shown in Figure 3.
In this case, experimental data were found for the compound and it was reported to be inactive in a limited (too few strains) bacterial reverse mutation test. After expert review, it was concluded that the assay was run under appropriate conditions, but only in strains TA98 and TA100. The result is hence not sufficient to support a full assessment of bacterial mutagenicity and a Klimisch score of 3 is assigned to the data, which results in a reliability score of RS5. Two complementary in silico models for bacterial mutagenicity (incorporating E. coli/S. typhimurium TA102 and additional Salmonella data into both models) were applied and the compound was predicted to be negative in both models. The individual models have initial reliability scores of RS5 but since they concur, the combined score would be RS4. Further expert review showed that the predictions were of good quality and there were, for example, no reactive features identified. The in silico predictions are therefore assigned a reliability score of RS3. The weight-of-evidence for this compound supports the assessment that the compound is not a bacterial mutagen and the overall reliability score for bacterial gene mutation is set to RS3. For comparison, if the experimental results had been reported as positive in one of the two strains, the Klimisch score would still have been 3 and the initial reliability score would have been RS5. However, during expert review of the experimental data, it would have been appropriate to consider the result sufficient for an assessment of bacterial mutagenicity and to change the reliability score to RS3 as one positive strain is considered enough to make a positive call for the compound.
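The scoring steps in this hypothetical example can be traced with a short sketch. It assumes, as a simplification, that a single in silico model starts at RS5, that two or more concurring models give RS4, that a supportive expert review gives RS3, and that concurring lines of evidence take the best available score. Conflicting calls are simply flagged as indeterminate here and left to expert review. All function and class names are illustrative and are not defined by the protocol.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ModelPrediction:
    model: str  # hypothetical label, e.g. "statistical" or "rule_based"
    call: str   # "negative" or "positive"


def in_silico_reliability(
    predictions: List[ModelPrediction],
    expert_review_supports: bool = False,
) -> Optional[int]:
    """Return the RS number (5, 4 or 3) for concurring in silico predictions.

    Returns None when there are no predictions or the calls disagree,
    since no score is assigned without a single agreed call.
    """
    calls = {p.call for p in predictions}
    if len(calls) != 1:
        return None
    if expert_review_supports:
        return 3
    return 4 if len(predictions) >= 2 else 5


def combine_effect_evidence(
    evidence: List[Tuple[str, int]],
) -> Tuple[str, Optional[int]]:
    """Combine (call, RS number) pairs for one toxicological effect.

    Concurring evidence takes the best (lowest) RS number; anything else
    is flagged as indeterminate, with no reliability score.
    """
    calls = {call for call, _ in evidence}
    if len(calls) != 1:
        return ("indeterminate", None)
    return (calls.pop(), min(rs for _, rs in evidence))


# Figure 3 example: limited Ames study (Klimisch 3 -> RS5) plus two negative models.
models = [ModelPrediction("statistical", "negative"),
          ModelPrediction("rule_based", "negative")]
print(in_silico_reliability(models))                               # 4
in_silico_rs = in_silico_reliability(models, expert_review_supports=True)
print(in_silico_rs)                                                # 3
print(combine_effect_evidence([("negative", 5), ("negative", in_silico_rs)]))
# ('negative', 3)
```

The sketch does not cover the comparison case at the end of the paragraph, where expert review of a positive result in a limited study raises the experimental score itself; that judgment is made by the reviewer and is not derived from the model outputs.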

Toxicological endpoint assessments
Combining the genotoxic effect assessments that relate to a specific genotoxic endpoint is required to generate an overall endpoint call. Figure 4 shows a continuation of the hypothetical example from Figure 3 and illustrates the inclusion of a mammalian gene mutation result.
To perform this summary assessment, the concept of "Confidence" was introduced. Where "Reliability" relates to the quality of the experimental data or the in silico prediction and "Relevance" relates the assay to the mechanism or toxicological effect, "Confidence" combines the two parameters in addition to assessing the completeness (or coverage) of the information. It provides a method for merging information on the technical reliability of a result with the relevance of the assay from which it was derived, for predicting the toxicological endpoint being assessed. The determined confidence for each endpoint (gene mutation, clastogenicity, aneugenicity) will eventually propagate to the confidence for the overall call as to genotoxicity. As was discussed earlier with respect to relevance, the assigned confidence is somewhat subjective. To provide a starting point for how to combine terms, a set of rules has been devised for combining results, based on a conservative approach for combining relevance and reliability for the most commonly occurring components of genotoxicity hazard identification. This rule set is available in the supplementary material of this publication and can be adapted to accommodate organizational preferences or other needs. There may be times when it is not desirable to perform a full evaluation of all the genotoxicity endpoints. For example, only bacterial mutagenicity is required for an ICH M7 assessment. Subsets of the components can be used as appropriate in a situation dependent manner. A scheme including the genotoxic effects and endpoints that are amenable to the generation of in silico tools, using data currently available in the public domain, is shown in Figure 5. It is possible that private organizations have additional types of data that could also be used to generate in silico tools.
An expert review of all the endpoint evaluations (described in section 4.4) may be performed to balance the relevance of each assay call to the overall genetic toxicology assessment. Depending on the use case, the confidence required may vary. For situations where false negatives may be acceptable and not be associated with health consequences, such as prioritization for more in-depth experimental testing, a lower level of confidence may be acceptable. However, in a human health hazard identification and risk assessment situation, a more conservative view is taken and higher confidence is required. In the general protocol (Myatt et al., 2018), we outlined the general principles around the influence that a particular level of confidence has.

Expert review of combined endpoint assessments
The expert review of genotoxic effects may include review of the in silico predictions and experimental data, as outlined earlier. The assessments might involve an expert review to weigh the individual assay results and in silico predictions, as well as any other information, such as experimental data for structural analogs or details that would influence the interpretation or translatability of a result. For example, a compound with antibacterial properties may be difficult to test in a bacterial reverse mutation assay due to the expected high cytotoxicity, and mammalian cell systems are therefore usually recommended in these cases. Along similar lines, if such a compound is predicted with in silico tools to be negative in a bacterial mutation test, even with high reliability, but predicted by an in silico model to be a mammalian mutagen, the expert review may consider that the bacterial reverse mutation result may be misleading in the context of a combined "gene mutation" review. Even though the bacterial reverse mutation result would normally be considered to be of higher relevance due to the availability of more chemically diverse and abundant data for this endpoint, the mechanistic expert review could in this case rank the mammalian in silico prediction higher. It may at this point also be important to include information from primary DNA damage experiments (or models) to determine the mechanism of action. Table 8 includes some points to consider during an endpoint assessment. The expert review will also determine the level of confidence that can be placed in the endpoint summary.

Worked examples
A number of case studies that have been contributed by co-authors are discussed in the following section. For some of these, it is not possible to disclose the chemical structures as they are proprietary compounds. However, the included examples have been selected to show various aspects of the GIST protocol, emphasizing how the various model outputs and experimental data components can be fitted into this framework without judging the validity of the generated components.
Figure 6 shows a case study of an acid chloride impurity which is being assessed for bacterial gene mutation potential for ICH M7 risk assessment. No experimental data could be found for the compound and two in silico tools, one statistical and one rule-based, were applied. The prediction from the statistical model indicates that the compound may be a bacterial mutagen due to the presence of the acid halide functionality. The rule-based model gives an "Indeterminate" prediction and also highlights the acid halide functionality. Acid halides are a structural alert class for bacterial mutagenicity that was discussed recently, and it was shown that, with the exception of dimethylcarbamic chloride, the compounds tested and available for model building were active in the bacterial reverse mutation assay due to a reaction between the DMSO solvent and the test agent. When retested in other solvents, the majority of compounds show no mutagenic activity. Despite the positive and indeterminate in silico predictions, each with a reliability of RS5, an expert review revealed that the underlying data for the statistical model supporting the prediction are with high certainty false positives and the prediction was refuted. Expert review of the supporting text for the alert supports this outcome. The overall assessment of bacterial mutagenicity concludes that the compound is predicted to be inactive (negative) and the reliability score is set to RS3.
The approach to this assessment is aligned with current ICH M7 guidance.

Drug impurity - API X (bacterial gene mutation):
There may be situations when an expert review can give an indication that the experimental results might not be correct. This is illustrated in the following example using Active Pharmaceutical Ingredient (API) X. API X was initially tested in the bacterial reverse mutation assay and found to have mutagenic activity. In contrast, as shown in Figure 7, both the statistical and the expert alert models predict API X to be inactive in the bacterial reverse mutation assay. An expert review of the information indicates that the models as well as the predictions appear robust, and the reliability score, which is initially set to RS4 due to two concurring models, is raised to RS3 after the expert review. In cases where experimental data are positive and in silico predictions are negative, the conservative approach would be to accept the positive experimental data, in which case the assessment would be positive with a reliability score of RS1, RS2, or RS5, depending on the quality of the experimental data. However, if the scientific review suggests that there is a valid reason to question the experimental result, the initial assessment for the compound could be Indeterminate, given the conflicting results from the experimental and in silico outputs, although this outcome would not be acceptable as a final conclusion from a drug regulatory standpoint. A reliability score is not assigned if the assessment is considered indeterminate. Given that the structure of API X is not predicted to be DNA reactive, it could be relevant to consider if there are other reasons for the observation of mutagenic activity related to the experimental procedures and/or the test article. One of the more frequently occurring reasons for an unexpected positive response in bacterial mutagenicity assays is the presence of a potent mutagenic impurity in the test article.
In this particular case, an aldehyde was identified as a degradation product in API X and shown to be mutagenic. Follow-up testing of purified API X found it to be non-mutagenic and the bacterial gene mutation assessment would at this point be updated from "Indeterminate" to "Negative" with a reliability score of RS2. Since the formation of the degradant could be avoided by modification of the synthetic route, it had no direct bearing on the classification of API X.
Figure 8 shows the assessment components for 3-methyl-5-isothiazolamine related to bacterial gene mutation. Experimental data were available in the public domain for this compound where it was reported to have been tested in TA98, TA100, TA1535, TA1537 and TA1538 with and without metabolic activation using induced rat liver S9 and hamster liver S9 (Cameron et al., 1985). Further examination of the data revealed that the experiments were conducted under acceptable conditions and that the tested concentration range went to higher levels than normally required by OECD TG 471, the test guideline for the bacterial reverse mutation assay, but the compound was not tested in an E. coli or S. typhimurium TA102 strain, which is required to fulfill the current OECD test guideline. The standard maximum concentration is usually set to 5 mg/plate and this study reported maximum concentrations of 7.43 mg/plate. The data were initially assigned as positive with a Klimisch score of 3, indicating that the experiment was partially compliant with guidelines. However, when assessing the individual bacterial strain concentration responses, the biological relevance of the data was further questioned as the compound was only active in TA1538 (a strain not required by the OECD test guideline) at concentrations higher than the guideline recommended 5 mg/plate and only with hamster S9 metabolic activation. With rat S9 and at concentrations up to 5 mg/plate, the compound was found to be inactive.
At this point, an expert review of the data indicated that as the compound was negative at concentrations up to the regulatory requirement of 5 mg/plate, the compound could potentially be viewed as negative with a reliability score of RS5; this score cannot be increased, considering that the compound was not tested in E. coli or S. typhimurium TA102 strains. Additionally, a discrepancy is seen between the two metabolic activation systems. In silico methods were applied to further refine the hazard identification. When reviewing these results, the statistical model output from a Salmonella model classified the compound as out of domain and the E. coli model predicted it to be negative. Review of the E. coli model results indicated that the prediction was not supported by many analogs or structural descriptors and is mainly driven by physicochemical properties. The expert alert model predicts 3-methyl-5-isothiazolamine to be positive for bacterial gene mutation. In this case, however, there are compounds in the reference set that contain the thiazolamine functionality that the alert is based on, but they are not necessarily isothiazolamines. Additionally, further review shows that the majority of the reference structures also have other alerts such as aromatic nitro groups. At this point, there is contradictory information to consider: the low reliability (RS5) experimental result indicating the compound is negative up to 5 mg/plate but active at higher concentrations, and the inconclusive in silico results. By formally following the proposed scheme, it would be acceptable to view the compound as negative, but with a reliability score of RS5, as the expert review did not reveal evidence supporting a higher score. In a conservative scenario, if this compound, for example, is an impurity that has consequences for human safety, retesting the compound in a guideline acceptable study would be preferred.
Indeed, when 3-methyl-5-isothiazolamine was retested according to the OECD guideline in a full 5-strain bacterial reverse mutation assay, with and without induced rat liver metabolic activation, the compound was found to be non-mutagenic (Ahlberg et al., 2016). At this point, the assessment could be updated with a "Negative" result with a reliability score of RS1 assigned. It should be noted that this assessment refers specifically to bacterial gene mutation and that any other available experimental data, such as MLA data, would be used to support the corresponding endpoint they relate to, which may or may not differ from the bacterial mutagenicity assessment. For a more comprehensive analysis of potential genotoxicity, such data may need to be considered and follow-up testing may need to be performed.
Figure 9 shows the assessment of a compound containing an aromatic amide functionality. Bacterial gene mutation and mammalian gene mutation effects/mechanisms were identified as relevant to the assessment of the gene mutation endpoint. Two independent and concurring in silico models were run to predict bacterial gene mutation, one expert rule-based and the second statistical-based, and both model predictions were negative. An expert review was performed on the in silico model results and the review concluded that the predictions were well supported and there was sufficient evidence to increase the reliability to RS3 from the individual models' scores of RS5. A single statistical model predicted the compound as negative for mammalian gene mutations (built using MLA training set data). An expert review was performed but the evidence concerning the prediction was not considered sufficient to raise the reliability score higher than RS5. The results from the bacterial and mammalian gene mutation endpoints were used as part of the assessment of the overall gene mutation potential.
The confidence was assigned as "Medium" as outlined in the suggested set of rules in the supplementary information. It should be noted that this prediction itself refers to the in vitro gene mutation response. In a scenario where this result would feed into a framework supporting overall genotoxic potential, it would be pertinent to consider that certain aromatic amides and sulfonamides do not show activity in the bacterial assay due to the amide bond not being metabolized by S9, but may be active in an in vivo experiment.

Plant protection product active ingredient metabolite assessment (genetic toxicology):
A herbicide metabolite was assessed using in silico methods for genotoxicity.
Experimental data generated on the active ingredient (AI) were available and the data confirmed that the AI has no genotoxic potential based on negative bacterial gene mutation, in vitro mammalian gene mutation and in vitro CA assay results, as well as a negative in vivo CA study. The metabolite was noted to have high structural similarity to the AI. Figure 10 shows the initial in silico genotoxicity assessment of the metabolite. The metabolite was predicted by two methodologies to be inactive in the bacterial reverse mutation assay. It was out of domain for the mammalian gene mutation model as well as the in vivo CA model (however, the related endpoint "in vivo MN prediction" was in domain). Two expert alert systems for in vitro CA induction were applied, one indicating that the compound has clastogenic potential due to the presence of a carboxylic acid related alert, and the other that it does not. Expert review of the in silico results was performed by looking at specific details of the alert and the surrounding SAR. Sufficient experimental data for analogs matching the alert convinced the assessor that the alert could be dismissed, and the in vitro CA endpoint was set to negative with a reliability score of RS3, after the expert review. Expert review was also performed on the gene mutation endpoints as well as the predicted in vivo MN results to confirm that these were of sufficient quality. In the case of gene mutation, the reliability score could be increased to RS3, but this was not the case for the in vivo MN assessment and it remained at RS5.
Following the suggested conservative scheme for combining toxicological effect outputs, included in the supplementary material, the gene mutation endpoint was considered as negative with a low confidence due to the lack of information on the mammalian gene mutation endpoint. Similarly, the in vitro and in vivo clastogenicity/aneugenicity endpoints were considered to have low confidence related to the negative assessments as there was limited information available. The combination of these assessments resulted in the metabolite being considered of low genotoxic potential but with a low confidence. It should be noted that this assessment did not take aneugenicity into account at all with the exception of a predicted negative in vivo micronucleus result. This is an additional reason to consider this assessment to be of low confidence.
Following the in silico assessment exercise, the metabolite was tested in experimental assays for confirmation. The compound was tested in an OECD and GLP compliant bacterial reverse mutation assay and found to be inactive. It was also tested in an OECD and GLP compliant in vitro micronucleus assay and again, no activity was detected. Figure 11 shows how these experimental data would influence the assessment if the protocol framework was applied. The increased reliability scores from the bacterial gene mutation and the in vitro micronucleus tests would result in high confidence in the individual endpoints as well as in the overall genotoxicity assessment, which now would result in a negative outcome with medium confidence. It may appear surprising that the confidence is only set at medium, despite highly reliable experimental results demonstrating no genotoxic activity. However, to distinguish from a situation where in vivo studies were also performed, the confidence cannot, in a general sense, be higher as there needs to be room to increase the weight-of-evidence by the inclusion of in vivo results or expert review. The addition of an in vivo negative outcome would have brought the confidence up to "high". However, in this particular case, an expert opinion was included in the final outcome, which raised the confidence to high. Sufficient experimental data were available for the parent AI in a full regulatory battery of in vitro and in vivo studies, showing that the AI had no genotoxic potential. The structural similarity between the metabolite and the AI was high and the available in vitro data for the metabolite showed similar responses, therefore no further concern was raised about the in vivo activity of the metabolite. Furthermore, it is recognized that there are different regulatory guidelines with respect to in vivo studies and that in some industries, an in vivo test would not be required for a high confidence assessment.
Figure 12 illustrates the in silico genotoxicity assessment of a plant protection product AI metabolite with the potential to leach into the groundwater. The AI is categorized as an IARC Class 2 carcinogen and hence may bear risk to humans, and control strategies are required. It has, however, been shown experimentally to be non-genotoxic and it is hypothesized that the carcinogenicity is mediated through an endocrine disruption mechanism. The metabolite is a polar molecule containing functional groups in a similar environment to the parent molecule. The bacterial gene mutation assessment was performed by read-across and the application of statistical models and expert alerts. The read-across exercise concluded that the metabolite is likely to be negative but the analysis was not considered robust due to the lipophilicity of the metabolite being outside the range of the analogs. Therefore, the result was set at RS5 even though read-across could technically be considered an expert reviewed method and could therefore have been set to RS3 directly with a more robust analysis. Two independent statistical models for bacterial mutagenicity were applied, both indicating that the metabolite was negative, and the reliability score was set to RS4 as there were two concurring and independent models. The rule-based method highlighted an alert (positive, RS5), but after an expert review, the alert was dismissed, as the chemical environment of the alerting moiety was dissimilar between the training set examples and the metabolite and the alert was therefore considered not relevant (the result from this model is considered negative with a reliability score of RS3). No in silico assessment was made for mammalian gene mutation using computational models but comparison (read-across) with the predicted genotoxicity profile of the parent molecule indicated that there should be no concern for mammalian mutagenicity. 
The call for the in vitro gene mutation endpoint was set to negative with medium confidence.

Plant protection product groundwater metabolite assessment (genetic toxicology):
In vitro CA was also investigated using read-across. The weight of the evidence did not give a clear indication of potential for CA induction and was considered Indeterminate. Rule-based methods predicted the metabolite to be positive in the in vivo MN test and in the in vitro CA assay. Both of these predictions were given reliability scores of RS5. Expert review of the examples related to the in vitro CA prediction questioned the relevance as they did not bear strong structural similarity to the metabolite. The alert triggered in the in vitro CA model was also generic in nature and hence not specific to the structural environment in the metabolite (or the AI, for that matter). Furthermore, the parent AI triggered the same in silico response but had been confirmed to show no clastogenic effects in vivo. For the predicted positive outcome in the in vivo MN test, expert review suggested that the predicted activity would be due to carbamate and simple substituted acrylamide compounds, formed as downstream metabolites of the metabolite, rather than to the metabolite under review. For the analogs investigated with experimental data, only the carbamates appeared to truly flag as being related to any activity. Due to the physicochemical properties of the metabolite, it was considered highly unlikely that these would form in vivo and hence, the in vivo alert was overruled. The summary assessment for the metabolite concluded that there was medium confidence that there was no gene mutation potential and low confidence for the lack of clastogenic potential. Aneugenic effects have not been covered. The overall genetic toxicology assessment was therefore set to negative with low confidence. After review of the submission, the regulatory authority also concluded that some experimental testing should be conducted to specifically ascertain the predicted lack of genotoxic potential.

Reporting
"Good in silico practice" requires a reproducible, transparent, and standardized procedure and it is important to document the entire process of performing the genetic toxicology assessment. This is comparable to Good Laboratory Practice (GLP) documentation of in vitro or in vivo studies and will enable the results to be reviewed rapidly and thoroughly by, for example, regulatory agencies. The general protocol (Myatt et al., 2018) lists relevant types of information that should be included in the report to ensure that the information is complete. Specifically, chemical structures (including analogs in case of read-across) and the models used need to be well documented.

Discussion
The GIST protocol should be applied in a context-dependent manner and in accordance with relevant guidelines. For example, if the application is for an ICH M7 assessment, then in addition to the recommendations provided in the guidance document, there are publications that provide more detailed procedures as well as case examples to illustrate best practices (Barber et al., 2015;Amberg et al., 2016). Similarly, guidelines exist for chemical registration under the REACH regulation (REACH, 2006;ECHA, 2008;ECHA, 2017) and Canada's Chemicals Management Program (Canada, 2016), alongside the EFSA guidance on the definition of residue (EFSA, 2016) and the Toxic Substances Control Act (TSCA) (TSCA, 2016).
The protocol presented in this publication represents the current state-of-the-art in in silico genetic toxicology. As new methods, both experimental and computational, are developed and as new data become available, the recommendations presented herein will need to be revised and updated. Additionally, it is important that the protocol reflects current regulatory standards, data requirements, and changes as these are revised. For example, as more in vitro micronucleus data are generated with differentiation of clastogenicity from aneugenicity mechanisms, statistical modeling may become an option for the separate mechanistic endpoints. Since aneugenicity is generally considered to be a thresholded endpoint, this would involve an important change in the current GIST protocol. Similarly, new assays will be accepted in a regulatory context. Also, there are considerable efforts underway to develop and evaluate the impact that methods such as toxicogenomics, flow cytometric biomarker assays, and other mechanistic platforms (see Section 2.2.4) can have on genotoxicity testing. As with any other protocol, it is therefore important to regularly revise the GIST protocol to include new developments and remove outdated sections as appropriate.

Conclusion
Applying a standardized format for performing and reporting in silico assessments for hazard identification will enable a transparent and consistent review of the results. This is beneficial both for organizations and individuals performing such analyses as well as review boards and regulatory agencies that consider such analyses. Along the same principles of standard practice for in vitro or in vivo data generation, the aim is to foster the use of good in silico practices to promote the use of these methodologies to their full potential.

Highlights
Outlines key effects/mechanisms and their relationship to genetic toxicity
Defines principles and procedures for combining the information
Outlines a methodology to assess the confidence in any assessment

Figure captions
The initial in silico genetic toxicology assessment for the plant protection product active ingredient metabolite. Note the change in assessment outcome for in vitro CA before and after expert review. *NA refers to "Not available" since these results were not possible to generate.
Influence of including the experimental results in genetic toxicology assessment for the plant protection AI metabolite. Differences compared to Figure 10 are indicated in red text. *NA refers to "Not available" since these results were not possible to generate.

Hasselgren et al. Page 43 Regul Toxicol Pharmacol. Author manuscript; available in PMC 2020 October 01.

Figure 12.
In silico assessment of a plant protection product metabolite.

RTECS: Registry of Toxic Effects of Chemicals (RTECS). Commercial database available through third parties (e.g., Leadscope) (Sweet et al., 1999;RTECS, 2018)
TOXNET/ChemIDPlus: Open access on-line toxicity search system from the US National Library of Medicine with access to archived versions of CCRIS and GENE-TOX (Wexler, 2001;TOXNET, 2018)
OECD QSAR Toolbox: Open access to database of genotoxicity as well as other toxicology data (OECD, 2019)
VITIC: Commercial database from Lhasa Limited, including data from published and unpublished sources (VITIC, 2018)
* Modified from Amberg et al.

Reliability of toxicity assessments based on computational models and experimental data (Myatt et al., 2018).
Table columns: Reliability Score; Klimisch Score