A Proof-of-Concept Rat Toxicity Study Highlights the Potential Utility and Challenges of Virtual Control Groups

The virtual control group (VCG) concept provides a potential opportunity to reduce animal use in drug development by replacing concurrent control groups (CCGs) in nonclinical toxicity studies. This work investigated the feasibility and reliability of using VCGs in place of CCGs. A historical control database (HCD), constructed from Genentech Inc. rat toxicity study data, was reviewed to understand trends and sources of variability in control animals over time, and to identify data curation requirements for assembling VCGs, e.g. alignment of units of measurement. Several endpoints were investigated and stratified against different study design parameters. Sex, route of administration, fasting status, and body weight at study initiation were among the parameters that were indicated as key matching criteria. With a high-level understanding of potential sources of variability, a retrospective proof-of-concept (POC) study was designed, evaluating a historical rat pilot toxicity study for test article-related changes. A masked interpretation of the study was conducted using its CCG, and two unique VCGs that were constructed from individual animal data pulled from our HCD. While the results of the microscopic pathology assessment and most endpoints were similar across the different control groups, the POC revealed the risk of using VCGs to interpret subtle test article-related changes in clinical pathology parameters. Within the context of our POC, it appears the use of a VCG is not completely equivalent to the CCG especially with clinical pathology parameters. Additional work is needed to understand the potential utility, and thus, viability of VCGs in other contexts.


Introduction
Standard for Exchange of Nonclinical Data (SEND) has enabled greater accessibility and interoperability of in vivo toxicology data, streamlining terminology in order to automate large-scale and robust data analyses. As a result, the field of toxicology has a

Tab. 1: Summary metrics of historical control data
Historical control data was aggregated and stratified on select baseline characteristics and study design parameters. As the number of matching criteria increased, the availability of individual historical control animals decreased, a potential limit to feasibly generating enough VCGs for practical implementation in routine pilot toxicity studies. A detailed list of key parameters or covariates that were considered for matching is provided in Tab. 2.

Tab. 2: Potential sources of variability
Virtual control groups (VCGs) were created from historical control animals that matched covariates of the primary study design. These covariates were investigated as potential sources of variability in toxicity studies, where the level of variability and their impact on study results may not always be known. While the proof-of-concept studies strived toward using historical controls that matched on all covariates, several covariates were incompletely matched for the entire VCG, and therefore incompletely matched to the test groups of the primary study. (*) denotes the manual verification of matching status following formation of the VCGs.

a 7-day study conducted from 2016-2019 (here an exact three-year window was used based on month and day) could be matches for a male rat of the same strain weighing 250 grams and being a test group animal in a 7-day PO toxicity study conducted in 2019. Based on these criteria, 12 male Sprague-Dawley rats were identified as matched animals given our baseline matching factors (Tab. 1). Based on the covariates defined by the primary study design, the matched pool size (n=12) restricted us to two VCGs, where each VCG contained unique individual historical control animals (Fig. 1). Since the CCG would be fully replaced by the VCG, the POC study randomization of animals was based on mean bodyweight at initiation and matched to the study's test group animals as reference, excluding the CCG. Following the assembly of the two VCGs, covariates that were unavailable in our database, such as fasting status during blood sampling for clinical pathology assessments, were manually reviewed and assessed for alignment across the historical controls. Three covariates were identified as incompletely matched across the control animals and the POC test groups (Tab. 2).
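The matching and assignment steps described above can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the record fields, the 10% bodyweight tolerance, the group size and the greedy balancing rule are all assumptions made for the example. Only the criteria themselves (sex, strain, route, study duration, a three-year window, and bodyweight at initiation) come from the text.

```python
import random
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ControlAnimal:
    """Hypothetical record for one historical control animal."""
    animal_id: str
    sex: str
    strain: str
    route: str
    study_days: int
    study_start: date
    bw_day1_g: float


def years_before(d, years):
    """Same month and day, `years` earlier (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:
        return d.replace(year=d.year - years, day=28)


def matches(a, *, sex, strain, route, study_days, ref_start, ref_bw_g,
            window_years=3, bw_tol=0.10):
    """True if the animal meets every baseline criterion.

    window_years follows the three-year window in the text;
    bw_tol (relative bodyweight tolerance) is an assumed value.
    """
    earliest = years_before(ref_start, window_years)
    return (a.sex == sex and a.strain == strain and a.route == route
            and a.study_days == study_days
            and earliest <= a.study_start <= ref_start
            and abs(a.bw_day1_g - ref_bw_g) <= bw_tol * ref_bw_g)


def assign_vcgs(pool, n_groups, group_size, target_mean_bw, seed=0):
    """Shuffle the pool, then greedily place each animal in the open group
    whose mean bodyweight would end up closest to the target.
    Each animal is used at most once, so the groups are disjoint."""
    if len(pool) < n_groups * group_size:
        raise ValueError("matched pool too small for the requested groups")
    rng = random.Random(seed)
    animals = list(pool)
    rng.shuffle(animals)
    groups = [[] for _ in range(n_groups)]
    for a in animals[: n_groups * group_size]:
        open_groups = [g for g in groups if len(g) < group_size]
        best = min(
            open_groups,
            key=lambda g: abs(
                (sum(x.bw_day1_g for x in g) + a.bw_day1_g) / (len(g) + 1)
                - target_mean_bw
            ),
        )
        best.append(a)
    return groups
```

Because every added criterion can only shrink the list returned by the filter, a sketch like this also makes the trade-off noted in Tab. 1 concrete: stricter matching leaves fewer animals from which to build disjoint groups.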

[Tab. 2 column headings: Matched covariates; Incompletely matched covariates]
Data from the primary study were extracted from our internal nonclinical data repository and data from matched individual animals were extracted from the HCD to produce three study datasets of identical format, which were finally imported into our SEND data visualization platform. In this masked experiment, the CCG was designated as VCG2; the matched animals were randomly assigned to VCG1 and VCG3 such that mean bodyweights of the VCGs were considered similar to the mean bodyweights of the test groups of the primary study (Fig. 2a). As part of the curation of individual animal data into VCG1, VCG2 and VCG3 for study number POC1, POC2 and POC3, respectively, the original animal IDs were replaced with new animal IDs to aid with a masked interpretation. The following SEND data domains were used for constructing the POC studies: Study Design with Demographics and Disposition, Body Weights, Clinical Pathology, Organ Weights (OM), Macroscopic Observations (MA), and Microscopic Observations (MI). Clinical observations were also excluded from interpretation, since 1) a review of this data indicated that findings were normal for all control groups, therefore adding little value to the interpretation (data not shown) and 2) clinical observations were not recorded at the same frequency for all historical controls, and along with any endpoints, tests or samples that were not included in the primary study, these data were excluded from aggregation in order to keep the data presentation consistent across the POC for a masked interpretation. Furthermore, data tables for the POC studies were mocked up to be indistinguishable from data tables typically used for internal toxicity studies.

Fig. 1: Schematic depiction of VCG formation
To understand the feasibility and reliability of leveraging VCGs, a POC study was conducted utilizing VCGs in a historical pilot toxicity study, replacing the CCG with VCGs, and interpreting the changes observed in test groups for test article-related effects. a) Historical control data was aggregated and stratified based on a single test site, species and strain. By applying additional parameters determined by the historical pilot toxicity study, e.g., dose route of administration and frequency, a matched pool of animals was defined. b) Animals were then randomized from this matched pool of animals into VCGs, selecting the appropriate number of subjects per species. c) VCGs and CCG were then blinded to scientists, and acted as the control group against which test article-related effects might be identified in each of the three test groups. Created with BioRender.com

Digital histopathology whole slide images (WSI) were available for each VCG animal, and for all animals in the primary study. Images for the VCGs were organized appropriately into their respective studies; however, images were not reinterpreted and review of microscopic (MI) data was based on the existing tabulated MI data. Single WSI of liver and spleen tissue from three animals in each control group were reviewed for staining quality and image quality; representative images are presented in Fig. 3.

2.3 Study data interpretation
Data assembly and data interpretation were performed by different scientists to enable a masked interpretation of the POC. Individual and summary data tables for each POC study were generated in our SEND visualization platform in the format consistent with routine study tables to maintain a masked comparison. Identifying characteristics, such as study metadata, source information, and original data collection dates, were all removed prior to interpretation. Summarized data was interpreted by GNE scientists. Clinical pathology was reviewed by a Board Certified veterinary clinical pathologist (AA), and tabulated MI findings were reviewed by two Board Certified veterinary anatomic pathologists (RS, PM). The totality of data was then interpreted for each study based on the included control group. Pathologists performing the assessment were masked to the construction of the VCGs and to the identity of the CCG to avoid biasing the interpretation. None of the pathologists had been previously involved in the selected study or the originating project, including the primary or peer review of the study. They had no prior knowledge about compound GXX or its intended pharmacological mode of action. Pathologists could access digital histopathology WSI through an internal image-data repository as needed, if tabulated MI findings could not be directly compared. Anatomic pathology assessments were independently conducted by the two anatomic pathologists, who subsequently conferred on their findings and arrived at a consensus interpretation.

3.1 Review of historical control data
Historical control data were plotted across time to visually detect potential drifts, or abrupt changes that could aid in delineating epochs in the data where we can expect animals to be more similar, and therefore more comparable, to one another. This high-level review allowed us to identify potential sources of variation in a limited capacity; however, not all sources of variation can be identified or robustly characterized. Several baseline conditions were identified as potentially meaningful stratification factors for identifying biologically relevant groupings, or matching, of animals within the historical data. Qualitative data, such as MI findings, were less amenable to statistical analysis due to the subjective nature of reporting. Tab. 2 lists the parameters identified as potential sources of variability that could be controlled in the study design and that likely need to be matched to the test subjects in order to produce a proper (biologically relevant) control group. Notably, following selection of a single test site (GNE), several additional covariates matched across individual animals, e.g., non-GLP standard, sample processing protocols, and analytical laboratory, because of test-site consistency. Recent publications have described these covariates as advisable to match on (Grevot et al., 2023; Gurjanov et al., 2023).

Fig. 3: Representative histology images of the CCG (a) from the primary historical toxicity study and the VCGs (b, c) from three historical toxicity studies conducted from 2017-2019, identified as Study 1, 2 and 3, are shown here. Although H&E slides can vary in their appearance based on laboratory procedures, e.g., tissue preparation conditions, the stain quality of the images is comparable, with only slight differences in the intensity and value of the staining. Slight differences in tissue appearance can also be attributable to the range of normal physiologic state. Magnification: 10X; Scale bar: 40 microns

POC study findings and interpretation

3.2.1 Bodyweight changes observed at all dose levels were consistent in all studies independent of control group
Bodyweight measurements were recorded from Day 1 to Day 8. Clear GXX-related bodyweight decreases were observed in high dose animals from Day 2 until the day of necropsy, and bodyweight changes observed in the test groups were comparable across the three studies. At the high dose, percent bodyweight changes (compared to baseline measurements on Day 1) corresponded well with clear intolerability to GXX, with group mean decreases of 2% and 1% on Study Day 4 and 6, respectively, when several high dose animals were found dead or euthanized under moribund conditions. Group mean percent bodyweight changes in the CCG, both VCGs, and the low and mid dose groups showed similar trends, increasing over the duration of the study, with an observed gain of 14-17% on Day 7. Fasting prior to blood collection on Day 8 resulted in expected bodyweight decreases of comparable magnitude across all groups except the high dose group, which had been removed earlier in the study (Fig. 2b).
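The percent bodyweight change relative to the Day 1 baseline can be computed as in the sketch below. The data layout (one dictionary of day-to-weight per animal) is hypothetical, and the choice to average per-animal percent changes, rather than to compare group mean weights, is an assumption; the text does not state which was used.

```python
from statistics import mean


def pct_change_from_baseline(weights_by_day, baseline_day=1):
    """Percent change on each recorded day relative to the baseline day.

    weights_by_day maps study day -> bodyweight in grams for one animal.
    """
    base = weights_by_day[baseline_day]
    if base <= 0:
        raise ValueError("baseline bodyweight must be positive")
    return {day: 100.0 * (w - base) / base for day, w in weights_by_day.items()}


def group_mean_pct_change(animals, day, baseline_day=1):
    """Mean of per-animal percent changes on `day`.

    Animals without a record on that day (for example, early decedents)
    are skipped rather than imputed.
    """
    vals = [
        pct_change_from_baseline(a, baseline_day)[day]
        for a in animals
        if day in a and baseline_day in a
    ]
    if not vals:
        raise ValueError("no animals with data on the requested day")
    return mean(vals)
```

Skipping animals that lack a measurement on a given day matters here because the group composition changes once high dose animals are removed from study, so a group mean on a later day reflects only the survivors.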

3.2.2 Clinical pathology changes observed at the lower dose groups were associated with misinterpretations
As shown in Tab. 3 and Tab. S2, mild changes in hematology endpoints were observed at the low and mid dose groups across the three studies, and all changes observed were associated with misinterpretations when interpreted within the context of the VCGs. In POC1 and POC3, minimal to mild increases in absolute eosinophil counts (~2x), absolute reticulocyte counts (~1.3x), and red cell mass parameters (~5% increase), along with decreases in MCHC, were observed. However, in POC2, when using the CCG, no WBC-related changes were observed, and the red cell mass changes trended towards decreases versus the increases observed in POC1 and POC3. While the eosinophil increases in POC1 and POC3 might not affect the overall interpretation of the data due to lack of other supportive findings, the differences in the direction of the RBC mass changes between the VCGs and the CCG constitute a misinterpretation that will affect the overall interpretation of the data. The trend towards RBC mass increases observed when using VCGs would be suggestive of hypovolemia/dehydration, while the decreases observed when using the CCG would be suggestive of trends towards anemia due to GXX-related decreased red cell production or red cell loss.

Tab. 3: Comparison of hematology endpoints
Clinical pathology trends for routine hematology endpoints were observed in GXX-treated groups (10, 30 and 100 mg/kg) and compared when using the CCG (interpreted as VCG2), VCG1 and VCG3. Arrows denote the direction of the trend, while dashes denote no significant differences when compared to the control group. Red color denotes a difference in the direction of change, while blue color denotes a difference in the magnitude of change from the observations made utilizing the CCG.

Tab. 4: Comparison of clinical chemistry endpoints
Clinical pathology trends for routine clinical chemistry endpoints were observed in GXX-treated groups (10, 30 and 100 mg/kg) and compared when using the CCG (interpreted as VCG2), VCG1 and VCG3. Arrows denote the direction of the trend, while dashes denote no significant differences when compared to the control group. Red color denotes a difference in the direction of change, while blue color denotes a difference in the magnitude of change from the observations made utilizing the CCG.
potassium (K+) and chloride (Cl) concentrations were suggestive of dehydration, and the changes in glucose and inorganic phosphate concentrations were suggestive of decreased food consumption. Moreover, the decreased glucose concentration observed in the lower dose groups was inconsistent with the increased glucose concentration observed in the high dose group in POC1. These changes were clearly absent at the lower dose groups when compared with the CCG, and there were no abnormal in-life clinical observations (data not shown) or histopathology (for GLDH) findings to support the clinical chemistry changes in POC1 and POC3, indicating misinterpretations.

Modest alignments of clinical pathology changes in the high dose group between the CCG and VCGs
In hematology, test article-related findings at the high dose group were largely consistent between the VCGs and the CCG except for some inconsistencies in the changes observed between the low dose groups and the high dose groups. For instance, as shown in Tab. 3, trends towards test article-related decreases in red cell mass parameters (HCT, Hgb, RBC) with no evidence of marrow response (decreased reticulocytes and RDW) were observed in POC2. Some of these changes were also captured in POC1 and POC3. These decreases were consistent with the changes or lack thereof in the lower dose groups for POC2, but inconsistent with the increases observed in the related parameters in the lower dose groups in POC1 and POC3. For the white blood cells, test article-related decreased absolute eosinophil counts were observed in POC2 and POC3, but were not captured in POC1. As such, the changes in the eosinophil counts in POC1 and POC3 high dose groups were inconsistent with the changes observed at the lower doses.
In clinical chemistry, test article-related findings at the high dose group were also largely consistent between the VCGs and CCG except for some misinterpretations and over/under-interpretations relative to the CCG. As shown in Tab. 4, test article-related decreased albumin concentration (ALB) was observed in POC2, with no changes in POC1 and POC3. At the lower dose groups, increases in ALB were observed in POC1 and POC3, suggestive of dehydration. While a more severe change was expected at the high dose group, no changes in ALB were observed in POC1 and POC3, clearly indicating data inconsistency and misinterpretation. For POC2, the lack of ALB changes at the lower dose groups was consistent with other clinical pathology endpoints, histopathology, and in-life clinical observations (data not shown), which indicated that there were no inflammatory changes in the lower dose groups. Other parameters associated with inconsistency between the lower dose and the high dose groups included Na, Cl (POC1), total protein, and glucose (POC1) concentrations.
Other parameters associated with misinterpretations were inorganic phosphate (IP) and creatine kinase (CK) concentrations. GXX-related increased IP was observed in POC2, a finding that was compatible with decreased glomerular filtration rate (GFR) as supported by test article-related increased urea nitrogen (BUN) and creatinine (Cr) concentrations and histopathology correlates. In POC1 and POC3, decreased IP concentrations were observed; these were inconsistent with the increased BUN and Cr that were also noted in these groups. Additionally, test article-related increased CK concentration was observed in both POC2 and POC3, suggestive of an underlying skeletal muscle injury, whereas decreases in CK were observed in all dose groups in POC1, a finding considered biologically implausible. Other endpoint differences observed between the CCG and the VCGs were associated with slight over/under-interpretations and were deemed to have minimal effect on the overall interpretation. These changes included test article-related decreases in chloride, sodium and triglycerides, along with increases in total cholesterol (POC1), glucose (POC1), ALP (POC3), and calcium (POC3).

Clinical pathology parameters have higher variability with VCGs relative to CCG
To assess whether there were any differences in variability in the clinical pathology parameters that might be responsible for the deviations observed with the VCGs relative to the CCG, we quantified the coefficients of variation (CVs) of the clinical pathology parameters in the VCGs and CCG (Tab. S4). The results showed that while the CVs were largely comparable between the CCG and the VCGs, the VCGs had more parameters with higher CVs relative to the CCG, including higher CVs for CK, triglyceride (TRIG), absolute lymphocyte counts (LYMPH), and absolute and relative eosinophil and monocyte counts in both VCGs. Additionally, total white blood cell counts and glucose concentration had higher CVs in POC1. In contrast, the relative and absolute basophil counts along with the GLDH concentration were the only parameters with higher CVs in the CCG relative to the VCGs. In total, POC1 and POC3 had 9 and 7 parameters with higher CVs, respectively, while the CCG only had 3 parameters with higher CVs. On a few occasions, the higher CVs were also consistent with the higher standard deviations observed with some parameters, including CK and glucose in POC1, relative and absolute eosinophil counts in POC1 and POC3, and GLDH in the CCG/POC2, indicating that the higher variability may have contributed to their differences.
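The CV comparison can be illustrated with a short sketch. The function names and input layout are hypothetical, and the rule for calling one CV "higher" than another (a simple ratio threshold) is an assumption, since the text does not state the criterion that was applied.

```python
from statistics import mean, stdev


def cv_percent(values):
    """Coefficient of variation: sample standard deviation / mean, in percent."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined for a zero mean")
    return 100.0 * stdev(values) / abs(m)


def parameters_with_higher_cv(group_a, group_b, ratio=1.0):
    """Parameters present in both groups whose CV in group_a exceeds
    ratio * CV in group_b. `ratio` is an assumed, adjustable threshold.

    Each group maps a parameter name to its list of individual animal values.
    """
    shared = sorted(set(group_a) & set(group_b))
    return [
        p for p in shared
        if cv_percent(group_a[p]) > ratio * cv_percent(group_b[p])
    ]
```

Because the CV is scaled by the mean, it allows parameters measured in different units to be compared on one scale, which is why it is a convenient summary when screening many clinical pathology endpoints at once.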

3.2.5 Control group did not affect interpretation of OM, MA or MI findings
Organ weight measurements for routine tissues were taken at necropsy on Day 8. No significant differences in organ weight changes were observed across the 3 POCs for the target tissues collected (Fig. S1 to Fig. S4). Elevated liver weights (absolute liver weight and relative to brain and body weight) were observed in moribund high dose group animals while other collected tissues did not present clear evidence of GXX-related changes when compared to their respective controls.
Summary and individual tables of MI findings (SEND MI domains with standardized findings, and original findings as collected by the study pathologist on the primary study available) were evaluated independently by two pathologists (RS, PM) who were masked to the identity of the CCG. Evaluation consisted of the comparison of the incidence and severity of the findings in the test versus control group as well as identifying findings considered to be incidental. Subsequently, the two pathologists conferred regarding their results, and reached a consensus interpretation. GXX-related findings were identified in the kidney, liver, heart, spleen, and lymph nodes of the high dose group only (Tab. S5). The GXX-related findings observed at the high dose were not typical of spontaneous or background findings observed in this rat strain, and as a result were easily distinguished from the background findings in controls, both virtual and concurrent. The severity level inter-group comparison was not required since there was no exacerbation of spontaneous findings in the treated groups. Non-dose-dependent findings in the kidney and liver were identified in the low and mid dose groups; these were considered incidental background findings and were interpreted as unrelated to GXX (Tab. S5).
Although microscopic re-assessment of tissue sections was not conducted in the course of this work, we did perform a limited evaluation of histopathology whole slide images (WSI) in our archive in order to evaluate whether WSI could support review of histologic tissues in the future. Fig. 3 illustrates the appearance of hematoxylin and eosin (H&E)-stained slides assembled from VCG1, VCG2 (CCG), and VCG3 from this POC. A single slide containing H&E-stained formalin-fixed paraffin-embedded liver and spleen tissue sections was chosen for each animal. A pathologist review of the representative WSI confirmed that the quality and condition of the tissue sections was comparable across control and test animals. Slight differences in the intensity of staining were observed in animals of the same VCG, as these animals originated from different studies (Fig. 3). These inconsistencies may have given clues to the true identity of the CCG but were interpreted as not severe enough to rule out the use of historical WSI for re-assessment of tissue findings.

Discussion
Toxicology testing relies on the cumulative interpretation of various endpoints, such as clinical signs, food consumption, bodyweight changes, clinical and anatomic pathology, and this interpretation arguably depends on control groups as a baseline against which to appropriately characterize toxicological effects. Our investigation focuses on understanding the reliability and risks of the VCG approach to in vivo toxicology testing, and how the industry might leverage historical control data more effectively to determine toxicological effects in live animals and potentially reduce animal use.

4.1 Some concurrent control animals would be necessary to maintain the historical control database
A major concern surrounding VCGs is whether or not a VCG can reliably produce the same results as a CCG, which is the current gold standard for controlled toxicity testing. The first step to mitigating these concerns is understanding whether historical control data remains stable over time across various toxicological endpoints, to ensure that historical control animals, when stratified to "match" the study design of a contemporary study, can reasonably compare to live control animals for the purpose of controlling experimental studies. Our prior analyses demonstrated that results can vary considerably over years (data not shown), even when study design covariates are matched where possible (Tab. 2). This may be a result of biological or genetic drift in animals or endpoint drift related to technical changes or variability (e.g., reagents, equipment, personnel, methods, etc.). While our internal analyses revealed that a 3-year window for matching the study start year is best suited to account for drifts in our data, larger windows of 5 years (Golden et al., 2023; Palazzi et al., 2024) or even 10 years (Gurjanov et al., 2023) have been proposed. In the context of comparing concurrent controls and HCD, it was further suggested to maintain a timeframe from 2 to 7 years to account for possible variations in analytical techniques (Gurjanov et al., 2023).
A full replacement of CCGs with VCGs would completely deplete the database of historical control animals suitable for matching after the specified number of years and make the detection of drift in the prevalence and nature of background findings difficult or even impossible.Concurrent control animals also play a critical role as measures of environmental control, similar to infection detection by colony sentinels (Steger-Hartmann et al., 2020).Therefore, from a practical point of view, it seems meaningful to consider hybrid control groups as a VCG alternative, replacing some but not all live control animals with historical control animals (Gurjanov et al., 2023;Golden et al., 2023); or utilizing the same CCG for multiple studies running concurrently.Both of these scenarios allow for continuous data generation and help to avoid skewing historical datasets in the long-term.However, because live control animals would still be required, reduction in animal use in toxicity studies would be less than the initially projected 25%.

4.2 Assembling VCGs requires a FAIR historical database
Successful execution of the POC relied heavily on automated data aggregation and curation, and therefore required standardization of data collection, management, and presentation. Access to a FAIR (Findable, Accessible, Interoperable, Reusable) database of nonclinical data was necessary to mine historical control data and curate three separate datasets for constructing the POC studies, allowing us to evaluate the reliability of VCGs in a time- and cost-effective manner. It allowed us to identify inconsistencies in data presentation, and importantly, afforded us the ability to interpret historical control data as whole animal data, as opposed to analyzing toxicological endpoints dissociated from the physiological context of the whole animal, as in the case of synthetic controls (Grevot et al., 2023). Disassembling individual animal data reduces the reliability of test article-related findings, as meaningful changes rely on the totality of the data. While SEND and related tools have made mass data aggregation possible, initial efforts to mine historical data for the POC were strained by differences in the way historical data was captured and presented, such as differences in units of measurement or assignment of imprecise test codes.
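The unit-alignment problem mentioned above can be sketched as a lookup-based conversion. The record layout, the choice of target units and the conversion table are assumptions for illustration and do not describe the curation pipeline used for the POC; the glucose factor follows from its molar mass (about 180.156 g/mol).

```python
# (test code, reported unit) -> (target unit, multiplicative factor)
# Illustrative entries only; a real table would cover every test code in use.
CONVERSIONS = {
    ("GLUC", "mg/dL"): ("mmol/L", 10.0 / 180.156),
    ("GLUC", "mmol/L"): ("mmol/L", 1.0),
    ("ALB", "g/dL"): ("g/L", 10.0),
    ("ALB", "g/L"): ("g/L", 1.0),
}


def harmonize(records):
    """Convert each result to the target unit for its test code.

    Each record is a dict with hypothetical keys "testcd", "value", "unit".
    Records whose (test code, unit) pair is unknown are returned separately
    for manual review instead of being silently dropped or guessed at.
    """
    converted, unresolved = [], []
    for rec in records:
        key = (rec["testcd"], rec["unit"])
        if key not in CONVERSIONS:
            unresolved.append(rec)
            continue
        unit, factor = CONVERSIONS[key]
        converted.append({**rec, "value": rec["value"] * factor, "unit": unit})
    return converted, unresolved
```

Routing unknown pairs to a review list, rather than dropping them, mirrors the point made in the text: imprecise test codes and inconsistent units are curation problems that need to be surfaced before historical animals are pooled.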

4.3 VCGs significantly interfere with the interpretation of subtle changes in clinical pathology parameters
Pathology data include both clinical pathology and anatomic pathology data. They are closely interrelated endpoints that display different aspects of the same pathophysiologic processes. Clinical pathology data largely comprise quantifications of analytes in body fluids that are regulated by homeostasis or influenced by a variety of acute systemic changes, while anatomic pathology data are generated by the macroscopic and microscopic examinations of body tissues for injuries. In toxicity studies, the relationships of these two endpoints are influenced by a variety of factors, including but not limited to the intrinsic connections between tissues and body fluids, pathophysiological processes, the influence of study design variables (such as the timing of sample collection), and different sensitivities and specificities of various endpoints for identifying pathologic processes. Understanding these factors supports an integrative approach to interpreting both clinical and anatomic pathology data, and the potential identification of noninvasive clinical pathology biomarkers (i.e., analytes in body fluids) that may be valuable for the clinical monitoring of potential safety issues, assuming the toxicity and biomarker response are anticipated to be translatable to humans (Siska et al., 2022).
In this POC study, for most endpoints evaluated (including bodyweight, organ weight, and macroscopic and microscopic pathology endpoints), the differences in the findings of each of the three studies had little to no impact on the overall interpretation of GXX toxicity, especially at the high dose, where toxicological effects were clearly GXX-related.At this toxic dose, the control group, whether concurrent or virtual, served little purpose to characterize the large dramatic effect.However, to identify more subtle changes, such as those expected at lower doses, findings in the control group carried more weight in the interpretation.
In clinical pathology, 18 of the 21 findings observed at the low and mid doses using VCGs did not align with findings using the CCG and lacked correlative changes in the in-life clinical observation (data not shown) and histopathology endpoints. In the high dose groups, there were fewer misinterpretations using VCGs, consisting of differences in the interpretation of ALB, IP and CK. Moreover, there was a pattern of misfits in some parameters that undermined the weight-of-evidence approach generally employed in clinical pathology interpretation, in which no single endpoint is interpreted in isolation. For instance, using the CCG, the trend towards decreases in RBC mass in the lower dose groups was sustained in the high dose group with higher severity, suggesting a dose-dependent effect of the test article. This pattern bolstered the clinical pathologist's confidence in interpreting these changes as a GXX-related effect on RBC production or loss. However, a trend towards an increase in red blood cells was noted in the lower dose groups in POC1 and POC3, while the opposite was observed in the high dose group. This pattern made it difficult for the expert to reach a reasonable and conclusive interpretation of the GXX-related effect on the RBCs. Similar misfits were also noted with eosinophils, albumin, sodium, chloride, potassium and phosphate when using VCGs versus the CCG. Because of these misfits in the VCGs, the clinical pathologist was able to easily identify the CCG as the actual control group even though the study was masked, which was not expected.
In a real-world situation, many of these misfits would have been disregarded due to lack of clear dose-dependency. However, given the sheer number of deviations lacking dose-dependency across so many parameters in a single study, a qualified clinical pathologist would have been concerned about the reliability of the data and would have had to investigate the preanalytical, analytical and postanalytical procedures that might have caused this level of discrepancy. Such an investigation will likely be impossible with VCGs, as in this case, where all the known/accessible covariates had already been controlled for. Moreover, with this level of deviation, it is almost impossible to accurately detect subtle, true test article-related changes that could help with safety monitoring before an injury becomes too severe, a characteristic needed for safety biomarkers. Therefore, these results reveal the inherent potential of VCGs to produce erroneous clinical pathology interpretations, despite the stringent approaches employed in this POC study to filter for VCG covariates that would closely match the legacy study selected for this exercise. While most of the misfits in this POC study have little significance for the overall outcome of the study given the parameters affected (except for the red cell mass parameters), in a real-world situation we will never know when VCGs erroneously produce a believable pattern that leads the clinical pathologist to assume that specific traditional biomarkers are more or less sensitive than they truly are. This would have a much bigger impact than was observed in this POC study, especially if the endpoints affected in future studies with VCGs are those with no histopathology correlates, or endpoints that are believed to be more sensitive than histopathology for specific tissue injuries, thereby impacting the ability to safely monitor for drug toxicities in humans.
Studies have shown that laboratory errors are largely due to preanalytical and postanalytical variables, which are difficult to control for when generating VCGs (Hooijberg et al., 2012). Such variables include the animals' age and origin/source, the season when the study was conducted, husbandry, caging, diet, blood collections (date, time, volume), sampling (date, time, method, site, order of collection if different sites), restraint (method and duration), sample collection tube type, specimen processing, assay kits (different lots), fasting (exact start/stop time, exact duration, access to water), and many other procedures that may affect clinical pathology parameters (Ameri et al., 2011; Aulbach et al., 2017; Aulbach et al., 2015; Dhabhar et al., 1994; Drevon-Gaillot et al., 2006; Everds, 2017; Gunn-Christie et al., 2012; Kale et al., 2009; Kozlosky et al., 2015; Li et al., 2022; Riley, 1992; Smith et al., 1986; Tripathi et al., 2017; Vap et al., 2012; Wolford et al., 1987). While individual covariate effects may be small, the additive effect of several covariates can lead to large variability in the VCGs, as the totality of variability is unique to each study. This may explain why CVs were higher for more parameters in the VCGs than in the CCG; the larger variability might have contributed to the deviations in eosinophil counts and in CK and glucose concentrations observed in POC1 and/or POC3. In this POC, we know that vehicle, dose volume and housing conditions were incompletely matched to our primary study, as they were not readily searchable in our HCD. Further work is necessary to understand how variability driven by these covariates may have contributed to the erroneous clinical pathology interpretations.
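The relationship between small, additive covariate effects and a higher CV can be illustrated with a minimal sketch. It assumes, purely for illustration, that the variance components are independent (so that variances add), and it uses invented glucose values; neither the function names nor the numbers come from the study data.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Return the CV in percent: sample SD divided by the mean, times 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100.0


def combined_sd(component_sds):
    """SD of a sum of independent variance components (variances add).

    Independence is an assumption made for this illustration.
    """
    return math.sqrt(sum(sd ** 2 for sd in component_sds))


# Hypothetical glucose values (mg/dL), invented for illustration only.
ccg = [100, 104, 98, 102, 101, 99]   # animals from a single study
vcg = [92, 110, 97, 118, 88, 105]    # animals pooled from several studies

cv_ccg = coefficient_of_variation(ccg)
cv_vcg = coefficient_of_variation(vcg)
print(f"CV CCG: {cv_ccg:.1f}%  CV VCG: {cv_vcg:.1f}%")

# Several individually small components (hypothetical SDs) accumulate:
# within-study biology plus study-to-study differences in diet, sampling
# time and assay lot.
print(f"Combined SD: {combined_sd([2.0, 1.5, 1.5, 1.0]):.2f}")
```

Under these assumptions, each component on its own is smaller than the within-study SD, yet their combination is noticeably larger, which is one way the pooled VCG could show a higher CV than the CCG.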
The major goals of nonclinical toxicology studies are to identify potential hazards of a compound; to understand the likely risk to a healthy volunteer/patient and identify a safe first-in-human (FIH) dose; and to propose a clinical monitoring strategy. In this small POC based on a dose-range finding study, it is unlikely that the differences noted in clinical pathology interpretation would have halted the progression of this molecule. Similar observations were noted in a recently published VCG report in which several clinical pathology parameter values were either inconsistently non-significant (i.e. the VCG did not pick up biologically and statistically significant changes observed with the CCG) or inconsistently significant (i.e. the VCG picked up new biologically and statistically significant changes that were not observed with the CCG) (Gurjanov et al., 2024). In those studies, too, the overall conclusions with regard to NOAEL and STD10 did not change. That said, given the number of clinical pathology parameters that did not align between the VCGs and the CCG in our study and the referenced report, along with the misinterpretations that ensued, VCGs may have the broader implication of causing misconceptions about the accuracy and sensitivity of standard clinical pathology biomarkers for safety monitoring in the clinic. As such, the observations from all these studies indicate that VCGs may introduce the risk of compromised analyses, which in turn may necessitate the use of more animals to repeat experimental studies when such errors are detected and end up having significant repercussions. This possibility defeats the overall purpose of VCGs, which is to reduce the use of animals in experimental studies. Moreover, because these problematic findings arose in a biologically constrained animal species (i.e. rats), they call into question the potential for successful application of this strategy in a highly biologically variable animal species such as non-human primates. For these reasons, considerably more work must be done to validate the use of VCGs to identify hazards, establish FIH doses, and provide adequate information for safety monitoring strategies in the clinic.
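The two discordance categories described above can be expressed as simple set operations. The following is a minimal sketch, assuming each interpretation is reduced to a set of parameters flagged as test article-related; the function name and the example flags are hypothetical and do not reproduce the results of this POC or of the referenced report.

```python
def compare_findings(ccg_flagged, vcg_flagged, all_parameters):
    """Classify each parameter by agreement between CCG- and VCG-based calls."""
    ccg, vcg = set(ccg_flagged), set(vcg_flagged)
    return {
        # Flagged with both control groups.
        "concordant_flagged": sorted(ccg & vcg),
        # Flagged with the CCG but missed with the VCG.
        "inconsistently_non_significant": sorted(ccg - vcg),
        # Flagged only with the VCG (new finding not seen with the CCG).
        "inconsistently_significant": sorted(vcg - ccg),
        # Flagged with neither control group.
        "concordant_unflagged": sorted(set(all_parameters) - ccg - vcg),
    }


# Hypothetical flags, for illustration only.
parameters = ["ALB", "CK", "GLUC", "IP", "RBC"]
result = compare_findings(
    ccg_flagged=["RBC", "ALB"],
    vcg_flagged=["RBC", "CK", "IP"],
    all_parameters=parameters,
)
for category, names in result.items():
    print(category, names)
```

A tabulation of this kind only counts discordant calls; it does not capture the direction or magnitude of a change, which the clinical pathologist weighs in the overall interpretation.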

4.4 Further work to establish the utility of WSI analyses in studies employing VCGs is required
In this POC, we restricted our evaluation of histopathologic findings to review of the tabulated MI data in our database, and did not perform a re-evaluation of WSI. We chose to test whether historical control MI findings could support the same interpretation from VCGs as from the CCG, despite the knowledge that slide review is an inherently subjective process susceptible to inter- and intra-pathologist variability. We chose this method as a first step to evaluate our approach to VCG use and, additionally, as a means to inform different VCG strategies, such as the augmented approach, which uses VCGs larger than the test groups (Golden et al., 2023). Such VCGs can consist of so many animals that reliance on historical MI findings would be essential, due to the impracticality of re-reviewing hundreds of histologic WSI (see discussion below). This approach was successful in this POC largely because the MI findings observed in the high dose group were distinct and easily distinguished from background findings observed in any of the control groups. Interpretation did not rely on comparing changes in incidence or severity of MI findings identified in the high dose group to the control, and proved to be robust to any variability that might have been present in the tabulated MI findings data. Indeed, the low and mid dose groups had no test article-related findings, further obviating the need for comparing MI findings in the high dose group to a control. However, given the known subjectivity of pathologists' histopathology interpretations, direct review of histopathology slides is likely to be highly useful in several contexts of VCG use, particularly when highly consistent identification of findings is required to allow assessment of subtle changes or changes that increase in severity or incidence across dose groups. For this type of application, the availability of WSI of slides for review along with concurrent WSI will be a critical enabler of VCG approaches. While we relied on tabulated MI data for toxicity assessment, we separately examined WSI from each of three animals in each control group (Fig. 3). The slides appeared to have sufficiently consistent quality to support their use in future exploration, including their re-evaluation to address the impact of WSI re-interpretation on study outcomes in the VCG context.

4.5 Opportunities for historical control data including and beyond VCGs
This work has demonstrated some risks of using VCGs in the context of a 7-day rat pilot toxicity study, and further work is necessary to define the minimum criteria for VCG implementation. That said, this work has identified other potential avenues for the application of internal historical control data that may refine toxicity study design and reduce live animal use. For example, VCGs can be assembled to any size, increasing the statistical power of the control group (Bonapersona et al., 2021; Golden et al., 2023). By design, our POC assembled VCGs that matched the size of the test groups, as we aimed to mimic baseline characteristics of the test groups as closely as possible. However, similar methods could be applied to assemble control groups of any size. Additional work is needed to investigate the implications of using larger cohorts of virtual controls, without regard for the size of the test groups. Another promising opportunity to reduce animal use is to leverage historical control data in its entirety for specific study types or scenarios, most notably dose-range finding studies. The internal efforts to aggregate and curate historical data for the purpose of this POC have broadly increased data accessibility and identified development opportunities for establishing a comprehensive FAIR (Findable, Accessible, Interoperable, Reusable) HCD to aid in study interpretation, especially for toxicity studies where a CCG may not be required (e.g. dose-range finding toxicology studies). Furthermore, the HCD is arguably the most relevant data source for establishing reference ranges for various endpoints, including clinical pathology parameters. Control animals undergo procedures comparable to those of experimental animals, whereas naive animals experience radically different stress levels and housing conditions that may skew aggregated data when reference ranges are generated from naive animals. Internal reference ranges, based on the HCD, can be further stratified by a selected route of administration, dose frequency or bodyweight at initiation, increasing their relevance to the experimental data.
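Stratifying reference ranges by study design parameters can be sketched as follows. This is a minimal illustration and not the procedure used to build our HCD: the record layout, field names and albumin values are hypothetical, and the interval is assumed to be the central 95% of values (2.5th to 97.5th percentile), which is one common convention.

```python
from collections import defaultdict
import statistics


def reference_intervals(records, endpoint, strata):
    """Return {stratum: (lower, upper, n)} using the 2.5th and 97.5th percentiles."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[s] for s in strata)
        groups[key].append(rec[endpoint])

    intervals = {}
    for key, values in groups.items():
        if len(values) < 2:
            continue  # quantiles need at least two data points
        # n=40 yields cut points every 2.5%; the first and last bound the
        # central 95% of the data.
        cuts = statistics.quantiles(values, n=40, method="inclusive")
        intervals[key] = (cuts[0], cuts[-1], len(values))
    return intervals


# Hypothetical historical control records (albumin in g/dL), illustration only.
records = [
    {"sex": "M", "route": "PO", "ALB": v} for v in (4.0, 4.1, 4.2, 4.3, 4.4)
] + [
    {"sex": "F", "route": "PO", "ALB": v} for v in (4.3, 4.4, 4.6, 4.7, 4.8)
]

intervals = reference_intervals(records, endpoint="ALB", strata=("sex", "route"))
for stratum, (lower, upper, n) in intervals.items():
    print(stratum, f"{lower:.2f}-{upper:.2f} g/dL (n={n})")
```

In practice a stratum would need far more animals than shown here for the percentiles to be stable, which mirrors the trade-off noted in Tab. 1: each additional stratification criterion reduces the number of available historical control animals.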

Conclusions
VCGs, when fully replacing the CCG, were not able to fully replicate the clinical pathology results of a 7-day Sprague-Dawley rat pilot toxicity study. Importantly, we have demonstrated the risk of missing or misinterpreting subtle toxicological effects. This will inadvertently cause misconceptions about the accuracy and sensitivity of standard clinical pathology biomarkers for safety monitoring in the clinic. Therefore, it appears that the use of a VCG to fully replace a CCG is not feasible under the conditions outlined in our POC. However, there may be specific study types for which rich historical databases and/or hybrid VCGs are a viable approach to reduce animal use without compromising study outcomes. Additional work is needed beyond this single POC to see whether our results are replicated, and to demonstrate the utility, and thus viability, of VCGs in other contexts or other types of preclinical safety studies.
Fig. 2: Bodyweight plots. Randomization from a matched pool of animals was controlled to produce two VCGs with a similar mean bodyweight to the test groups at study initiation (first day of treatment). a) Distribution of individual bodyweights at baseline for the concurrent control group (CCG), VCGs and test groups. b) Mean percent change from baseline bodyweight over time shows similar increases across CCG, VCGs and test groups. At the high dose (100 mg/kg), animals were found dead or euthanized on Day 4 or 6; at this dose level, the total number of animals decreases up to Day 6.

Fig. 3: Comparison of representative historical control images captured from whole slide imaging (WSI). Representative histology images of the CCG (a) from the primary historical toxicity study and the VCGs (b, c) from three historical toxicity studies conducted from 2017-2019, identified as Study 1, 2 and 3, are shown here. Although H&E slides can vary in their appearance based on laboratory procedures, e.g. tissue preparation conditions, the stain quality of the images is comparable, with only slight differences in the intensity and value of the staining. Slight differences in tissue appearance can also be attributable to the range of normal physiologic state. Magnification: 10X; Scale bar: 40 microns.