Associations Between Clinical Signs and Pathological Findings in Toxicity Testing

Animal testing for toxicity assessment of chemicals and pharmaceuticals must take the 3R principles into consideration. During toxicity testing in vivo, clinical signs are used to monitor animal welfare and to inform about potential toxicity. This study investigated possible associations between clinical signs, body weight change and histopathological findings observed after necropsy. It was hypothesized that clinical signs and body weight loss observed during experiments could be used as early markers of organ toxicity. This represents a potential for Refinement in terms of improved study management and a decrease in the pain and distress experienced during animal experiments. To this end, data from three sequential toxicity studies in rats were analyzed using the multivariate partial least squares (PLS) regression method. Associations with correct prediction over 80% were found between the occurrence of mild to severe clinical signs and histopathological findings in the thymus, testes, epididymides and bone marrow. Piloerection, eyes half shut and slightly decreased motor activity showed the strongest associations with the pathological findings. A 5% body weight loss was found to be a strong empirical predictor of pathological findings but could also be predicted accurately by clinical signs. Thus, we suggest using mild clinical signs and a 5% body weight loss as toxicity markers, and as a non-invasive surveillance tool to monitor research animals' welfare in toxicity testing. These clinical signs may also enable Reduced animal use due to their informative potential to support scientific decisions regarding drug candidate selection, dose setting, study design and toxicity assessment.


Introduction
Every year, approximately 10 million laboratory animals are used for scientific purposes in the European Union (EU) alone (EC, 2019). In the United States (US), the Animal Welfare Act (USDA, 1966) excludes rats and mice, but the Humane Society of the US estimated that 25 million vertebrate animals are used annually for research purposes. The majority of research animals are rats and mice, used for the study of human and animal diseases. However, in the EU, about 2 million animals are employed annually for regulatory use, required for the marketing of chemicals and pharmaceutical substances (EC, 2019). In the EU, risk assessment of chemicals follows the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH) legislation (EC Regulation No 1907/2006, see EC, 2006), which requires animal testing to be performed if no suitable alternatives are available. In the pharmaceutical industry, in vivo safety assessment studies are performed during the non-clinical phase of the pharmaceutical development process (EMA, 2010; ICH, 2009). Most candidate drugs are thus tested in vivo at doses high enough to enable the identification of adverse effects and their dose-response relationships (Hornberg et al., 2014; Sewell et al., 2014; Sparrow et al., 2011). Identification of toxicity during animal studies promotes safe exposure levels in humans, balancing the risk-benefit of the chemical exposure to support the decision-making process, which is also applicable to non-pharmaceutical chemical testing (Olson et al., 2000). There are numerous regulations and guidelines to be considered for toxicity testing using animal models, for example the Organisation for Economic Co-operation and Development's (OECD) Guidelines for the Testing of Chemicals, Section 4 and the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH) Safety Guidelines.
These studies follow strict scientific, ethical and regulatory requirements and provide valuable information regarding the safety of the candidate drugs, their potential side effects and mechanism(s) of action (EMA, 2010; EU, 2010; ICH, 2009; NC3Rs, 2009). The purpose of these regulations is to ensure high-quality data that are reproducible and comparable between studies, as well as to protect laboratory animals from unnecessary suffering, while not compromising the informative value of the study. Since 2013, in the EU, all use of animals for scientific purposes, including testing for regulatory purposes, has to be performed in alignment with EU directive 2010/63 with emphasis on the 3Rs: Replacement, Reduction and Refinement (EU, 2010). Indeed, certain animal models used for toxicity testing have been replaced with cell or computer-based methods within the area of risk assessment, for example the in vitro ARE-Nrf2 luciferase test method (OECD, 2018) and the in chemico direct peptide reactivity assay (DPRA) (OECD, 2019) for skin sensitization evaluation purposes. Moreover, work towards Reduction and Refinement includes improvement of e.g. project and study design (Ringblom et al., 2017a; Kalantari et al., 2017), animal housing (Zidar et al., 2019) and experimental procedures (such as the European Partnership for Alternative Approaches to Animal Testing [EPAA] Refinement and 3R Science prizes). Great progress has been made in the area of toxicity testing, for example in terms of the refined use of body weight loss assessment for decisions regarding the maximum tolerated dose (MTD) (Chapman et al., 2013), reduction of animal use by microsampling of low blood volumes (Jonsson et al., 2012) and by including fewer recovery animals (Sewell et al., 2014; Sparrow et al., 2011). Systematic 3R-approaches reveal a major reduction potential in the use of animals in pharmaceutical toxicity testing (Törnqvist et al., 2014).
Such systematic 3R development and implementation are rarely seen in academic research, where animal models and non-animal-based models are not regulated by guidelines. Still, efforts to also guide academic researchers and laboratory animal facilities are prioritized in the EU. One example is the European Commission's 2012 publication of a Severity Assessment Framework, with animal model descriptions and clinical and behavioral monitoring sheets, aiming at improved animal welfare and reduced suffering (EC, 2012).
Clinical observations in animal experimentation are used worldwide for assessment of the general animal condition and for setting the humane endpoint, i.e. the clinical signs that define the point at which a research animal is pre-terminally sacrificed because its suffering is unjustified when weighed against the scientific benefit (Morton, 1997; OECD, 2000). In toxicity testing in the pharmaceutical industry, clinical signs are also used for dose-setting and study design purposes (NC3Rs, 2009; Sewell et al., 2015). According to international guidelines and Good Laboratory Practices (GLP) for safe drug development, these clinical signs should be thoroughly monitored, registered and reported (OECD, 2008; WHO, 2009). Although clinical signs are registered, they are rarely used as informative endpoints of toxicity for risk assessment purposes. For example, neither the operating procedures for setting acute exposure guideline levels, nor the subsequent reference doses, mention clinical signs (NAC, 2001). The WHO states in the criteria document for risk assessment of chemicals in food that "other findings", such as clinical signs and changes in body weight, may suggest a need to establish an acute reference dose (WHO, 2009). A typical example is the Joint FAO/WHO Expert Committee on Food Additives (JECFA) reporting "clinical signs of toxicity [and], reduction in body weight" to establish a NOAEL for a study with flavouring agents (JECFA, 2016). To the best of our knowledge, clinical signs are very rarely discussed or used as a critical endpoint to establish reference doses.
In the present study, we hypothesized that mild clinical observations and body weight loss observed in animal studies can be used as early markers of toxicity, defined as pathological findings detected after necropsy. These associations do not reflect the underlying biological mechanisms, as clinical signs are not necessarily organ-specific or associated with one single organ. Still, we hypothesized that clinical signs as well as more general signs of toxicity could serve as a marker and key event in an adverse outcome pathway network, where the adverse outcome, in this case, would be organ pathology. To facilitate the future use of clinical signs as early markers of toxicity in areas other than pharmaceutical safety assessment, we created a short list of clinical signs and tested whether a few selected signs of general toxicity would be as accurate in predicting pathological findings as the full set of registered clinical signs. To this end, the multivariate data analysis method partial least squares (PLS) regression was employed to analyze pre-existing data from three sequential non-clinical in vivo toxicity studies in rats, testing an anti-cancer candidate drug.

2.1 Animal studies
Data from animal studies are presented according to the ARRIVE (Animal Research: Reporting of In Vivo Experiments) guidelines (Kilkenny et al., 2010; NC3Rs, 2010). The analyzed data were reused from previously performed studies, thus avoiding any additional animal experimentation for the present investigation. Data were collected from three in vivo safety assessment studies performed at Swetox, the Swedish Toxicology Sciences Research Center. The studies were conducted following international guidelines, namely the European Medicines Agency's Guideline on repeated dose toxicity (EMA, 2010) for pharmaceutical development quality data for safety assessment and the US Food and Drug Administration Guidance for Industry M3(R2) Nonclinical Safety Studies for the Conduct of Human Clinical Trials and Marketing Authorization for Pharmaceuticals (US FDA, 2010). All studies were approved by the Southern Stockholm Ethical Committee for Research Animals (ethical permit number S7-15) and performed according to Swedish animal welfare legislation L1 (SFS, 1988) and L150 (SJVFS, 2012 for study I; SJVFS, 2015 for studies II and III) in the Swetox facilities. The tested compound was an anti-cancer candidate drug, intended for human therapy through oral administration. Two dose-range finding studies (study I and II) were performed in order to decide acceptable oral doses for a 28-day repeated dose toxicity study (study III). Study I was a 7-day oral toxicity study, study II a 13-day oral toxicity study and study III a 28-day oral toxicity study (Table 1). These studies were performed in rats, in accordance with the regulatory guidelines (EU, 2001), to set doses for the first human trials.
ALTEX preprint published October 26, 2020 doi:10.14573/altex.2003311
The rats were ordered based on strain, weight and age. Upon arrival, all animals were thoroughly examined to ensure good condition and subsequently randomized into cages and dose groups. The rats were habituated to handling and experimental procedures during the acclimatization period, which has been shown to greatly reduce stress during the experiment. Each rat was individually handled for 1-2 minutes daily, 5 days per week, for 2-3 weeks before the first dose/study start, and during the last two weeks trained for dosing by sham doses of tap water. For study II, a shorter acclimatization period of two weeks was used. During all three studies, the animals were group-housed, separated by gender and dose group, and kept in the environment to which they had been introduced during the acclimatization period. Besides control animals, there were three dose groups in study I and III, and two dose groups in study II. Males and females were kept in different cages at all times. The BK Rat Cages used measured 30 cm width x 42 cm length x 21 cm height. All cages were enriched with wood chip bedding, nesting material, a plastic tube, a toilet paper roll carton and wooden sticks for gnawing. The dosing vehicle solution for the test compound was 2-Hydroxypropyl-β-cyclodextrin in acetate buffer (pH 4.5-4.8), a non-toxic and common vehicle solution for lipophilic drugs (Gould and Scott, 2005).
In study I, the rats (Wistar Hannover Galas, Charles River Laboratories, Denmark) were dosed for a total of 7 days (Table 1). Because the general condition of some animals declined and the predetermined humane endpoint was reached, all males and females in dose groups 3 and 4 were pre-terminally sacrificed. The remaining animals were sacrificed, as planned, 24 hours after the last dose. All animals were necropsied.
In study II, the doses were adjusted based on toxicokinetic data from study I (Table 1). The animals (RccHan:Wistar, Harlan Laboratories, Netherlands) were dosed for a total of 13 days. All animals were sacrificed for necropsy 24 hours after receiving the last dose.
In study III the doses were adjusted based on toxicokinetic data from study I and II (Table 1). The rats (RccHan:Wistar, Harlan Laboratories, Netherlands) were dosed for a total of 28 days. All animals were sacrificed for necropsy 24 hours after receiving the last dose.

[Tab. 1, surviving fragment: 10 females, 30 mg/kg]
a All doses and concentrations are expressed in terms of the tested substance. All doses were administered at dose volumes of 10 ml/kg according to the dosing scheme.
b The vehicle formulation used in group 1 was the same %w/v as the vehicle used for group 4.

2.2 Independent variables
As independent variables, we used all available registered clinical observations, the toxicokinetic parameter Cmax and body weight changes. For clinical signs and body weight variables, a binary (0 or 1) scoring system was employed for modelling purposes. Clinical signs were registered by trained animal technicians during all three animal studies (Table 2). The signs to be registered were decided before the study start, based on an in-house reference list with general and organ/physiology related clinical signs of adverse outcomes. Observations were registered daily, either immediately after the dosing procedure or in the morning of dose-free days. The technicians performing the dosing procedure were not blinded. In addition, scheduled repeated observations for a 24-hour period were performed twice during all three studies. The clinical signs were used as indicators of potential toxicity during the animal studies and to monitor animal welfare in general.
For the Cmax parameter, the individual concentration of the tested substance was used. Cmax is a toxicokinetic parameter that denotes the individual's highest measured substance concentration in a specific compartment, in this case rat serum. The sampling was performed by drawing 75 µl blood from the tail vein using capillary tubes, followed by centrifugation and separation into aliquots. For the toxicokinetic profiling, performed on the first and last day of dosing, a total of 5, 6 and 9 blood samples were drawn at different timepoints in study I, II and III, respectively. Cmax was quantified at the end of each study, using an in-house developed liquid chromatography and tandem mass spectrometry (LC-MS/MS) method. The toxicokinetic profiling and calculations were made by PKxpert AB (Stockholm, Sweden).
Bodyweight measurements were performed regularly during all studies (Table 3). For determination of bodyweight changes, the arithmetic mean of the control group was determined for each weighing. Thereafter, each individual rat's weight was compared to the average of the latest weighing of the control group, for the respective gender. If this difference was greater than 5%, i.e. if the animal did not reach ≥ 95% of the control group's average, a value of 1 was set. This value marks an abnormal body weight loss/gain compared to the control group, regardless of when it occurred. Conversely, animals with a similar body weight, i.e. ≥ 95% of the control group's average, were scored 0 (no finding). A 10% bodyweight loss has previously been suggested as a relevant threshold in MTD studies up to 7 days, for pharmaceutical development purposes (Chapman et al., 2013). In the present study we wanted to investigate whether an even less pronounced body weight loss could be an efficient tool in toxicity testing. The 5% threshold was chosen after testing a 3, 5 and 10% bodyweight loss. Thus, a 5% bodyweight change was used and tested both as an independent variable and as a dependent variable.
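The scoring rule described above can be sketched as follows. This is a minimal illustration only: it assumes that a rat is flagged when its weight falls below 95% of the same-sex control mean at the same weighing, and the function name, data layout and example weights are hypothetical rather than taken from the studies.

```python
from statistics import mean

def score_body_weight(weights, control_weights, threshold=0.05):
    """Return 1 if any weighing lies more than `threshold` below the control mean, else 0.

    weights: one rat's weights (g), one value per weighing day.
    control_weights: same-sex control group weights, one list per weighing day.
    """
    for weight, controls in zip(weights, control_weights):
        if weight < (1.0 - threshold) * mean(controls):
            return 1  # flagged regardless of when the deviation occurred
    return 0

# Hypothetical control weights for two weighing days (means 200 g and 220 g)
controls = [[200.0, 210.0, 190.0], [220.0, 230.0, 210.0]]
flagged = score_body_weight([198.0, 205.0], controls)      # 205 g is below 95% of 220 g
not_flagged = score_body_weight([198.0, 215.0], controls)  # both weights within 5%
print(flagged, not_flagged)  # 1 0
```

Changing `threshold` to 0.03 or 0.10 gives the other two cut-offs that were tested.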

Tab. 2: Description of codes used for reporting of clinical observations
Clinical observations scored in binary, i.e. either 0 (no finding) or 1 (finding). If a specific clinical sign observed was recorded for an animal on a particular day, it was categorized as 1. Otherwise, it was scored as 0.
Variable: Definition
Blood on the gavage probe: During the gavage dosing procedure, the probe displays blood on the tip directly after dosing.
Difficulty dosing: Rat resists during gavage dosing procedure.
Eye(s) half shut/shut: Rat with one or both eyes partially shut/shut, usually linked to poor condition or opacities.
Gender: Categorization in male or female animals.
Hairless patches: Rat displays hairless patches on the limbs.
Hair loss general: Rat displays a general hair loss body condition.
Hunched posture: Rat whose back is abnormally arched in a concave manner.
Loose feces: Rat displays altered feces consistency, i.e. softer than normal.
Slightly decreased motor activity: Rat with slightly decreased motor activity, compared to normal activity.
Pale: Rat with pale extremities, skin or mucosa. This observation excludes pale eyes and gums.
Pale eyes: Rat with pale eyes or eyes paler than normal.
Piloerection: Rat with erected fur. Can be observed in connection with dosage (related to substance's flavor).
Ploughing: Rat ploughs its nose in the cage bedding. Can be observed in connection with dosage (related to substance's flavor), but also in undosed animals.
Reflux: Rat with gastroesophageal reflux.
Salivation increased: Rat with an increased rate of salivation observed after administration of the dose. Usually observed in connection with dosage.
Salivation reflex: Rat with an increased rate of salivation, usually before or during dosing procedure.
Staining around the eyes: Rat with porphyrin staining around the eyes.
Staining in the nostrils: Rat with porphyrin staining in the nostrils. Often stress or illness related, can spread to other areas of fur.
Stiff body: Rat body is stiff during handling.
Struggling during handling: Rat struggles excessively during handling and/or dosing procedure.
Tiptoe gait: Rat walking on tiptoes. Usually associated with a poor health condition.
Trembling: Rat trembling.
Vocalization during handling: Rat vocalizing when handled. This symptom is usually observed in connection with oral dosing, but can be observed in unhandled animals.
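The binary coding of clinical observations can be illustrated with a short sketch. It assumes that the daily records are collapsed to one row per animal, with 1 if a sign was recorded on any study day; this collapsing step, the function name, the animal IDs and the records are hypothetical and only serve as an example.

```python
def encode_signs(observations, animals, signs):
    """Build one binary row per animal: 1 if the sign was ever recorded, else 0."""
    seen = {(animal, sign) for animal, _day, sign in observations}
    return {a: [1 if (a, s) in seen else 0 for s in signs] for a in animals}

# Hypothetical records: (animal ID, study day, clinical sign)
observations = [
    ("rat01", 2, "Piloerection"),
    ("rat01", 3, "Piloerection"),
    ("rat02", 5, "Hunched posture"),
]
signs = ["Piloerection", "Hunched posture", "Trembling"]
matrix = encode_signs(observations, ["rat01", "rat02", "rat03"], signs)
print(matrix["rat01"])  # [1, 0, 0]
```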

Tab. 3: Bodyweight measurement days, per study
Study number: Weighing days
Study I: All animals were weighed on study day -1 (before the first dose) and day 3. Additionally, group 1 and group 2 males were weighed on study day 6 and 8 (before sacrifice), and group 1 and group 2 females were weighed on study day 4, 7 and 8 (before sacrifice). Group 3 males were weighed on day 6 and 7, and group 3 females were weighed on day 4 and 5. Group 4 males and females were weighed on day 4.
Study II: All animals were weighed on study day -1 (before the first dose), day 5, 9 and 13. Group 3 animals were additionally weighed on study day 10.
Study III: All animals were weighed on study day -1 (before the first dose), day 4, 22 and 29. Additionally, all males were weighed on study day 10 and 16, and all females on study days 8, 14 and 18.

Dependent variables
All animals were sacrificed the day after the last dose for necropsy purposes, except for study I where necropsy was performed the day after the pre-terminal sacrifice for animals in dose groups 3 and 4. Organs were collected, fixed and processed to wax blocks, sectioned and stained according to standard operating procedures. The slides were given an ID and then analyzed for microscopic pathology by a qualified veterinary pathologist in a single-blinded manner. The following organs and injuries were recorded and included in the analysis, regardless of the severity of the pathological findings:
− Epididymides: cellular debris
− Liver: diverse findings merged (glycogen depletion, increased hepatocellular mitoses and single hepatocellular necrosis)
− Lymph node: lymphoid depletion
− Large intestines (caecum, colon and rectum): mucosal gland necrosis, atrophy or dilatation
− Bone marrow in the sternum: decreased cellularity
− Testes: tubular atrophy
− Thymus: lymphoid depletion
A 5% body weight loss was also tested as a dependent variable.
In the binary scoring system employed, a pathological finding of any kind and severity was scored as 1 for that particular organ and individual, and a 0 was scored if no pathological finding was observed for that specific organ and individual.

Multivariate analysis
PLS regression is a multivariate data analysis method first described in 1975 (Wold, 1975). It tests possible relationships between a set of independent and dependent variables, and is useful for predicting the outcome of the dependent variables in a new sample (Eriksson et al., 2005; Hubert and Branden, 2003). This method performs well with missing data points, in both independent and dependent variables (Eriksson et al., 2013). The data analysis and randomization were performed using Simca software (version 15, Sartorius Stedim Data Analytics AB, Umeå, Sweden). All data were mean-centered and auto-scaled prior to analysis. A default 7-fold cross-validation was used for model development in order to determine the significant number of components (latent variables).
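To make the modelling steps concrete, the following is a minimal sketch of auto-scaling, a PLS fit for a single dependent variable, and 7-fold cross-validated prediction. It is not the Simca implementation: it fixes one latent component instead of selecting the number of components by cross-validation, it does not handle missing values, and all function names and the toy data are hypothetical.

```python
import math
import random
from statistics import mean, stdev

def fit_pls1(X, y):
    """Fit a one-component PLS model on auto-scaled X and mean-centered y."""
    n, p = len(X), len(X[0])
    mu = [mean(row[j] for row in X) for j in range(p)]
    # Unit-variance scaling; constant columns get a divisor of 1 to avoid 0/0.
    sd = [stdev([row[j] for row in X]) or 1.0 for j in range(p)]
    y_mu = mean(y)
    Xs = [[(row[j] - mu[j]) / sd[j] for j in range(p)] for row in X]
    yc = [v - y_mu for v in y]
    # Weight vector: direction in X with maximal covariance with y.
    w = [sum(Xs[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w)) or 1.0
    w = [v / norm for v in w]
    # Scores t = Xw, then regression of y on the scores.
    t = [sum(Xs[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(ti * yi for ti, yi in zip(t, yc)) / (sum(ti * ti for ti in t) or 1.0)
    return {"mu": mu, "sd": sd, "y_mu": y_mu, "w": w, "q": q}

def predict_pls1(model, row):
    """Predict y for one new row using the scaling stored with the model."""
    score = sum((x - m) / s * w
                for x, m, s, w in zip(row, model["mu"], model["sd"], model["w"]))
    return model["y_mu"] + model["q"] * score

def cross_val_predict(X, y, folds=7, seed=0):
    """Predict every row from a model trained on the other folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    preds = [0.0] * len(X)
    for k in range(folds):
        test = set(idx[k::folds])
        train = [i for i in idx if i not in test]
        model = fit_pls1([X[i] for i in train], [y[i] for i in train])
        for i in test:
            preds[i] = predict_pls1(model, X[i])
    return preds

# Toy data: three binary signs for 16 animals; pathology equals the first sign.
X = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)] * 2
y = [row[0] for row in X]
model = fit_pls1(X, y)
fitted = [predict_pls1(model, row) for row in X]
cv_predictions = cross_val_predict(X, y)
```

In this toy data set the first sign carries all the information, so the fitted values reproduce the 0/1 outcome; the cross-validated predictions are the ones that would feed a cut-off search on the training set.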
In this study, two types of training and independent test sets were tested in order to investigate the predictive ability of models based on the entire set of data (all 3 studies) as well as the forecasting capacity on new data (study III predicted by models trained on data from studies I and II):
Setup 1: Data from all three studies were merged and then randomly divided into a training set (two-thirds of the data) and an independent test set (one-third of the data).
Setup 2: Data from study I and II were employed as training set and study III was used as independent test set.
Two sets of clinical signs were used in the present study: all registered clinical signs ("full list") and a limited subset of signs ("short list"). The short list was selected beforehand, based on the authors' previous experience of pharmaceutical toxicity testing in rodents and inspired by the UK's LASA guidance on dose level selection (NC3Rs, 2009). The clinical observations in the short list (piloerection, decreased motor activity, stained eyes and nostrils, hunched posture, trembling, vocalization during handling, and body weight loss) are general signs of toxicity and commonly observed in toxicity testing. These particular clinical signs are also used for severity assessment and for setting limits for humane endpoints in rodents used in other research areas. The short list was put together before any analyses were made. The aim was to create and test a short list of signs that would be easy for anyone to observe and use in toxicity studies, as well as in other research areas, to facilitate early detection of toxicity in drug development and academic research. In addition, gender and Cmax were included in the short list used in this study.
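The two setups can be sketched as two ways of partitioning the same animals. The record layout, the seed and the group sizes below are hypothetical; only the partitioning logic follows the description above.

```python
import random

def split_setup1(animals, seed=0):
    """Setup 1: pool all studies, then draw one-third at random as the test set."""
    shuffled = list(animals)
    random.Random(seed).shuffle(shuffled)
    n_test = len(shuffled) // 3
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)

def split_setup2(animals):
    """Setup 2: train on studies I and II, test on study III."""
    train = [a for a in animals if a["study"] in ("I", "II")]
    test = [a for a in animals if a["study"] == "III"]
    return train, test

# Hypothetical animals: six per study
animals = [{"id": f"{study}-{k}", "study": study}
           for study in ("I", "II", "III") for k in range(6)]
train1, test1 = split_setup1(animals)
train2, test2 = split_setup2(animals)
print(len(train1), len(test1), len(train2), len(test2))  # 12 6 12 6
```

Setup 1 lets test animals share study conditions with the training animals, whereas Setup 2 tests on a study the model has never seen, which is the harder forecasting case.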
The employed binary scoring system (0 or 1) scored a 1 for each finding in a given variable, regardless of whether it was a dependent or an independent variable. Conversely, for each individual with no findings in a given variable a 0 was scored. The arithmetic mean value was thus 0.5, becoming the standard cut-off value of the dependent variable for model performance classification. This cut-off value can change depending on the model's imbalance with respect to the classes (0 or 1) of the input data. The training set cross-validation procedure was used to determine the cut-off value for assigning the prediction as pathological (> cut-off) or no pathological finding (< cut-off). This was done by setting a cut-off value that maximized balanced accuracy given by the function:

Balanced accuracy = 0.5 x (specificity + sensitivity)

$$\text{Balanced accuracy} = \frac{1}{2}\left[\left(\frac{TN}{TN + FP}\right) + \left(\frac{TP}{TP + FN}\right)\right]$$

where TP, TN, FP and FN denote the numbers of true positive, true negative, false positive and false negative predictions, respectively.
A threshold level of ≥ 0.8 was employed for model performance classification, i.e. the ability to accurately predict ≥ 80% of the dependent variable findings in the test set. This threshold level is a commonly used value and in line with the suggestions of Ekins and colleagues (2018), who identified 0.8 as the ideal number. Conversely, a poor model performance is defined as the lack of ability to predict at least 80% of the events. The balanced accuracy is composed of two terms, the specificity and sensitivity. Specificity is defined as the rate of true negatives predicted, i.e. the proportion of events correctly predicted as negative when the true result was negative (class 0). Sensitivity, on the other hand, is the rate of true positive results predicted, i.e. the proportion of correctly predicted positive events when the true result was positive (class 1). The same threshold level of ≥ 0.8 was used for classification of the model performance regarding specificity and sensitivity.
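The performance measures and the cut-off selection can be sketched as below. The search over midpoints between observed prediction values is an assumption made for illustration, since the search procedure is not specified, and the labels and predicted values are hypothetical.

```python
def rates(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity

def balanced_accuracy(y_true, y_pred):
    sensitivity, specificity = rates(y_true, y_pred)
    return 0.5 * (specificity + sensitivity)

def classify(scores, cutoff):
    """Assign class 1 (pathological finding) to predictions above the cut-off."""
    return [1 if s > cutoff else 0 for s in scores]

def best_cutoff(y_true, scores):
    """Scan midpoints between distinct scores; keep the one maximizing balanced accuracy."""
    levels = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(levels, levels[1:])] or [0.5]
    return max(candidates, key=lambda c: balanced_accuracy(y_true, classify(scores, c)))

# Hypothetical imbalanced example: six animals without and two with a finding
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.40, 0.80]
default_ba = balanced_accuracy(y_true, classify(scores, 0.5))  # 0.75
cutoff = best_cutoff(y_true, scores)                           # about 0.35
tuned_ba = balanced_accuracy(y_true, classify(scores, cutoff)) # about 0.917
print(default_ba, round(cutoff, 3), round(tuned_ba, 3))
```

With the default cut-off of 0.5 this example would fall below the 0.8 threshold, while the tuned cut-off lifts it above, which illustrates why the cut-off is adjusted for imbalanced classes.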
The importance ranking of the different independent variables is given in terms of the Variable Importance in Projection (VIP) score. In brief, the VIP score describes the contribution of an independent variable to the outcome (dependent variable) of the derived PLS model. It is obtained by estimating the weighted sum of the squared correlations between the independent and the dependent variables. The greater the VIP score of a variable, the more information-bearing and predictive the variable is. The VIP method is implemented in the SIMCA-P computer package, and variables with high VIP scores, i.e. values ≥ 1, are classified as important in the model (Lazraq et al., 2003).
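A common formulation of the VIP score can be sketched as follows; whether SIMCA computes exactly this variant is an assumption. The inputs are the PLS weight vector of each component and the sum of squares of the dependent variable explained by that component. The function name and the numbers are hypothetical.

```python
import math

def vip_scores(weights, explained_ss):
    """Compute VIP per variable.

    weights: one weight vector per PLS component, each of length p.
    explained_ss: sum of squares of y explained by each component.
    """
    p = len(weights[0])
    total = sum(explained_ss)
    vips = []
    for j in range(p):
        acc = 0.0
        for w, ss in zip(weights, explained_ss):
            norm_sq = sum(v * v for v in w)
            acc += ss * (w[j] ** 2) / norm_sq  # share of component weight on variable j
        vips.append(math.sqrt(p * acc / total))
    return vips

# Hypothetical one-component model with three independent variables
vips = vip_scores([[0.8, 0.6, 0.0]], [10.0])
print([round(v, 3) for v in vips])  # [1.386, 1.039, 0.0]
```

In this formulation the squared VIP scores average to 1 across variables, which is why a value of 1 serves as a natural threshold for above-average importance.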

Pathology predictions based on merged data from all three studies
Using all registered clinical signs to predict pathology in any organ at necropsy, the predictive models showed a balanced accuracy ≥ 0.8 for four out of seven organs with pathological findings, when modelling using data from all three studies, i.e. Setup 1 (Table 4). The predictions were made with an accuracy between 81% and 98% when using all clinical signs, including registered body weight loss (Table 4). Similar levels of accuracy were observed when using the clinical signs in the short list (80% to 96%, Table 4). The four organs with accurate predictions of pathological findings were the thymus, bone marrow, testes and epididymides. The predictive power of the individual clinical signs (used as independent variables) to describe the pathological findings varied. Piloerection and body weight loss were rated as the most information-bearing predictors when describing bone marrow pathological findings, both when using the full list (Figure 1a) as well as the short list of clinical signs (Figure 1b). Eyes half shut was also among the top-ranked signs when using the full list of clinical signs (Figure 1a). However, this endpoint was not included in the short list. A 5% body weight loss, piloerection and eyes half shut were shown to be among the most information-bearing variables for all seven organs, predicting pathological findings in the bone marrow, thymus, testes, epididymides, lymph nodes, large intestines and liver (Figure 2a, b and c). In addition, slightly decreased motor activity showed high VIP scores for all organs, however with higher standard deviations (Figure 2d). Body weight loss was shown to be not only useful as an independent variable to predict organ pathology (Figure 2a) but also to be accurately predicted by other clinical signs, both when using the full list and the short list (Figure 3a and 3b).
When using all clinical signs in the full list, the models resulted in a borderline acceptable performance for the pathological findings in the large intestines (balanced accuracy ≈ 0.8) and poor performance for the model prediction for pathological findings in the liver and lymph node (balanced accuracy = 0.4-0.7). For the liver and lymph node organ predictions, poor model performances in the test sets were anticipated due to the low balanced accuracies observed in the training sets (Table 4). For these three organs, a similar model performance was observed when using the short list in comparison to the inclusion of all clinical signs (Table 4).

Tab. 4: Results using randomly selected training and test sets (Setup 1)
The full list indicates the use of all clinical descriptors to predict pathology, while the short list is a selected subset of clinical observations.

Predicting pathology in study III using data from study I and II
When studies I and II were used as the training set to predict study III (Setup 2), the results indicated an overall good model performance (balanced accuracy ≥ 0.8) for most of the investigated endpoints (Table 5). Using the full list of clinical signs, acceptable predictions between 83 and 100% were seen in the thymus, testes, bone marrow, and epididymides (Table 5). The prediction of a 5% body weight loss also showed a high balanced accuracy of 85 to 92%. The large intestines had acceptable model performances for the training sets, but not for the test sets (93 and 66% prediction, respectively). Low balanced accuracies were obtained from modelling attempts for the pathology registered in the liver and lymph node (between 0.6 and 0.9, Table 5). Overall, the model performances were better when using the full list as compared to models based on the short list of clinical signs. The short list resulted in lower balanced accuracies for the test set, especially in relation to thymus, testes and bone marrow (Table 5). Piloerection, body weight loss, slightly decreased motor activity and, in the full list, eyes half shut showed the highest VIP scores also when using Setup 2, e.g. when predicting pathology in the bone marrow (Figure 4a).
Comparing Setup 1 and 2, i.e. using study I, II and III with random allocation of 33% of their data for the test set in comparison to using study I and II for model training and study III as test set, higher balanced accuracy and model performance were obtained in Setup 1 (Figure 5). This difference was observed regardless of the choice of a long or short list of clinical observations (Figure 5). The same pattern was also observed for model sensitivity (Figure 6), but not for specificity (Figure 7), when comparing Setup 1 and 2 with either the short list or the full list of clinical signs.

Tab. 5: Prediction results for study III (test set) using study I and II as training set (Setup 2)
The full list indicates the use of all clinical descriptors to predict pathology, while the short list is a selected subset of clinical observations.

Importance of clinical observations in the derived models
In order to identify the most informative predictors of toxicity, the VIP score method was used. The rank order of importance of the independent variables was identified by the PLS models. The most important were piloerection, followed by eyes half shut, 5% body weight loss, Cmax, ploughing and decreased motor activity (Table 6). This rank order of the clinical signs obtained when merging all dependent variables was consistent with the pattern obtained when ranking the clinical signs for single variables/organs, as shown in sections 3.1 and 3.2 for bone marrow (Figure 1 and Figure 4). This indicates that piloerection, a 5% body weight loss, eyes half shut and slightly decreased motor activity are important predictors of pathology. Additionally, the Cmax predictive power was highly ranked by the PLS models, as expected, as higher exposure usually correlates well with organ toxicity. Ploughing, observed in connection with the dosing procedure, was also highly ranked. However, ploughing is often related to the substance's flavor, and was mostly observed in the high dose groups where the concentrations of the test substance were the highest. Therefore, this sign was regarded as related to the concentration of the test substance and its flavor, rather than connected to the pathogenesis. It is noted that ploughing and eyes half shut were excluded from the short list but considered informative using the long list of clinical signs (Table 6). Some of the clinical observations that were part of the short list, i.e. vocalization during handling and trembling, were not particularly important in either of the tested setups, which is most likely related to the very few registered occurrences. Finally, the rank order of the variables was similar regardless of the setup tested (Table 6).

4.1 Comparison of both tested setups
In the present study, even mild clinical observations were associated with pathological findings, which suggests that clinical signs can be used as early predictors for adverse outcomes observed after necropsy. Piloerection, eyes half shut and decreased motor activity showed the strongest association with the pathological findings in the thymus, testes, bone marrow and epididymides. Body weight loss also showed a high empirical association with pathological findings.
The first tested setup included all three studies (all available data) to tentatively obtain the best possible associations between clinical signs and pathology. The second setup used the two shorter dose-finding studies to test whether they could accurately predict the results of the third study, which had a longer duration and used lower doses. Accordingly, there was a greater risk of more severe effects in studies I and II, given the higher dosages, compared to study III. For both tested setups (merging all data or predicting study III), good model performance was found when associating all clinical signs with a sequential 5% body weight loss as well as with pathological findings in the thymus, testes, bone marrow and epididymides. The short list of clinical signs was shown to be useful, even though a slightly better model performance was observed when using all available clinical observations in the first setup. The results show that PLS models can describe and empirically predict associations between clinical signs and pathological findings, even with limited amounts of data, including the unfortunate case of study I, where two dose groups were pre-terminally sacrificed. Piloerection, stained nostrils and decreased motor activity have previously been identified as markers of general toxicity, often used to assess the severity of toxicity-induced distress (NC3Rs, 2009) as well as to support decisions for preterminal sacrifice due to generally poor animal condition (Morton and Griffiths, 1985). Members of Swedish Animal Ethics Committees also top-ranked these endpoints in relation to animal distress (75th percentile weights) (Ringblom et al., 2017b). Piloerection is a general symptom often associated with toxicity but can also be related to the substance's taste or to animal discomfort without dose administration. Its interpretation thus depends on the time of occurrence, and it may plausibly be disregarded if observed immediately after dose administration.
Stained nostrils are a typical sign of toxicity and decreased animal wellbeing, especially when observed recurrently. Decreased motor activity is related to poor animal condition and often linked to more severe suffering (Sewell et al., 2015). In the present study, slightly decreased motor activity was observed only in some animals in studies I and II, but it was nevertheless an important predictor despite its low frequency. Interestingly, eyes half shut, traditionally regarded as a clinical sign that indicates pain (Langford et al., 2010), was also a strong predictor of pathological findings. Eyes half shut is included in animal welfare assessment guidelines (e.g. Morton and Griffiths, 1985; EC, 2012), but not in the UK's LASA guidance on dose level selection (NC3Rs, 2009). Our results strongly support that these three descriptors (piloerection, eyes half shut and decreased motor activity), even when observed as mild symptoms or at low frequency, could be predictors of general toxicity. They thus carry a Refinement potential in terms of study management and animal welfare monitoring. To our knowledge, there are no publications that describe the use of clinical signs in research animals to score side effects caused by the test drug and their use for purposes other than as additional information in the risk assessment and for animal welfare reasons.
Body weight loss is a useful tool to determine the MTD in different species used in toxicity testing, and for rodents a 20% body weight loss has previously been regarded as sufficient to identify an appropriate MTD (NC3Rs, 2009). The use of this substantial body weight loss for deciding the MTD has, however, been challenged (Chapman et al., 2013). For rats, a body weight loss > 10% in MTD studies of up to 7 days resulted in a reduced dose in subsequent studies in the majority of the collated toxicity studies from the pharmaceutical industry (Chapman et al., 2013). A 10% body weight loss has also been shown to be a sign of evident toxicity in acute inhalation studies (Sewell et al., 2015). In the present study, we showed that a less severe weight loss of 5% in rats has a strong empirical association with pathology findings in toxicity studies of 7 to 28 days. Based on the observed predictive value of body weight loss, we suggest that a 5% body weight decrease is an important predictor to consider when investigating toxicity in future animal studies. Further studies are, however, required to support this suggested threshold for toxicity, which can be useful for decision-making on, for example, administering the next dose, or as a point of departure for reference dose setting. Body weight loss was in the present study also shown to be predicted by clinical signs. Another parameter that was accurately associated with the pathological findings was Cmax. As expected, higher Cmax concentrations, i.e. higher exposure, were well associated with findings in all modelled organs. However, Cmax does not represent a quick assessment of animal wellbeing, as it requires a resource-consuming toxicokinetic study and repeated blood sampling. Overall model performance was good (81 to 98% balanced accuracy), and accurate predictions were seen in four of the seven investigated organs: thymus, bone marrow, testes and epididymides.
published October 26, 2020 doi:10.14573/altex.2003311
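The 5% body weight loss criterion discussed above can be sketched as a simple check on a series of weighings. This is only an illustration: measuring the loss against the first recorded weight is an assumption made here (the loss could instead be defined against the previous weighing), and the function name and the weights are hypothetical.

```python
def first_weighing_with_loss(weights_g, threshold=0.05):
    """Return the index of the first weighing at which the relative loss
    versus the first recorded weight reaches the threshold, or None.
    Assumption: the baseline is the first weighing in the list."""
    baseline = weights_g[0]
    for index, weight in enumerate(weights_g):
        if (baseline - weight) / baseline >= threshold:
            return index
    return None

# Hypothetical rat weights in grams at successive weighings.
losing = [250.0, 248.0, 241.0, 236.0]   # 236 g is a 5.6% loss from 250 g
stable = [250.0, 251.0, 249.0, 252.0]

flag_losing = first_weighing_with_loss(losing)
flag_stable = first_weighing_with_loss(stable)
```

The same function can be called with `threshold=0.10` or `threshold=0.20` to compare against the 10% and 20% limits cited from earlier guidance.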
Bone marrow, testes and epididymides could be regarded as "target organs for side effects" for the tested anti-cancer drug, as they are continuously proliferative organs (Remesh, 2012). The intestines could also be considered a continuously proliferative organ, but findings there were not as well predicted by the derived models (78 to 88%). Thymus and lymph node pathology, and body weight loss, could be related to secondary toxicity due to stress caused by drug-induced specific organ toxicity. Liver toxicity is often regarded as non-specific organ toxicity. The pathological findings in the liver and lymph node showed an irregular pattern across the present studies, and there were too few observations for the specific injuries (glycogen depletion, increased hepatocellular mitoses and single hepatocellular necrosis) to be modelled individually. There were also fewer observations related to lymphoid depletion in the lymph node in the third study, due to the lower doses tested. The clinical signs in the present study seemed to be related to drug-specific side effects rather than to stress-related secondary toxicity.

The short list of clinical signs
Would a selection of clinical signs, previously established as relevant for toxicity assessment, yield results similar to modelling with all clinical signs? The short list was used to represent the most meaningful observations from a toxicological point of view. In general, the model predictions resulted in similar balanced accuracies when the short list of clinical observations was used, indicating that all clinical observations are important, but some are more important than others (Table 6). However, a lower balanced accuracy was evident when the short list was used in combination with study III as the test set (Figure 5), indicating a lack of generalizability for some of the derived models.
The deterioration in model performance was especially noticeable in the sensitivity results for the class containing the positive pathological findings, i.e. the recall of animals with positive pathological indications (Figure 6). However, for the models where the data from all three studies were used, the deterioration with respect to balanced accuracy and sensitivity was not very pronounced (Figures 5 and 6). The specificity results did not deteriorate in a similar way, as the models predicted more negative results (the majority class) at the cost of lower sensitivity, especially in the second setup, where findings in study III were predicted from studies I and II (Figure 7).
In conclusion, the model performance improved with the number of endpoints included and, most importantly, with the amount of data employed, suggesting that more evident associations would be observed if data from other studies and substance classes were available. Furthermore, this analysis showed that the short list of clinical signs yielded results similar to those of the more complete assessment of clinical observations in the scenario where data from all three studies were combined.
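The relationship between sensitivity, specificity and balanced accuracy described in this section can be made concrete with a small sketch. The labels below are invented to show how a model that favours the majority (negative) class keeps a high specificity while its sensitivity, and therefore its balanced accuracy, drops. The function names are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def performance(y_true, y_pred):
    """Return sensitivity, specificity, balanced accuracy and plain accuracy.
    Assumes both classes are present in y_true."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)          # recall of animals with findings
    specificity = tn / (tn + fp)          # recall of animals without findings
    balanced = (sensitivity + specificity) / 2
    accuracy = (tp + tn) / len(y_true)
    return sensitivity, specificity, balanced, accuracy

# Hypothetical test set: 2 of 10 animals have a pathological finding,
# and the model misses one of them while predicting "no finding" for the rest.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

sens, spec, bal_acc, acc = performance(y_true, y_pred)
```

In this example the plain accuracy is 0.9 although half of the affected animals are missed, which is why balanced accuracy (0.75 here) is the more informative summary when positive findings are rare.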

Temporality
In the present investigation, PLS is not a time-resolved modelling approach: it disregards the time point at which the clinical signs were observed and only takes into account the total number of occurrences, averaged over the number of study days (for comparability between different study durations). These averages are then tested for potential associations with the pathological findings in each organ. This absence of temporality may seem like a limitation but can actually be an advantage, being potentially useful for creating a real-time surveillance system to predict toxicity and assess animal wellbeing. During the course of an animal study, clinical observations can be registered in software with PLS modelling capabilities, which can then perform real-time predictions and alert the investigators in advance to a likely outcome in target organs for some of the animals or for the study as a whole. These pathological injuries can naturally only be confirmed after necropsy, but a predicted pathological outcome may support study management during its course, in terms of intended study outcome and decision-making for dose administration, while simultaneously increasing animal wellbeing by avoiding unnecessary suffering. As it might serve as a tool to support decision-making in case of change or interruption of the toxicity study, it represents a potential Refinement action.
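The averaging step described above, in which the number of occurrences of each sign is divided by the number of study days, can be sketched as follows. The data layout (one set of observed signs per study day) and the function name are assumptions chosen for the example and do not reflect the recording format used in the studies.

```python
def sign_frequencies(daily_observations):
    """Average occurrence of each clinical sign per study day.
    daily_observations: list with one set of observed signs per study day."""
    n_days = len(daily_observations)
    counts = {}
    for signs_today in daily_observations:
        for sign in signs_today:
            counts[sign] = counts.get(sign, 0) + 1
    return {sign: count / n_days for sign, count in counts.items()}

# Hypothetical animals: piloerection on 2 of 7 days and on 8 of 28 days.
short_study = [{"piloerection"}] * 2 + [set()] * 5
long_study = [{"piloerection"}] * 8 + [set()] * 20

freq_short = sign_frequencies(short_study)["piloerection"]
freq_long = sign_frequencies(long_study)["piloerection"]

# During a running study the same function can be applied to the days
# recorded so far, giving an up-to-date predictor vector for each animal.
freq_day_3 = sign_frequencies(short_study[:3])["piloerection"]
```

Both animals end up with the same frequency (2/7), which is what makes observations from studies of different durations comparable as model inputs.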

4.4 Future perspectives and final remarks
A wider understanding and use of clinical signs in any animal model and research area are necessary to provide important information about side effects and risks associated with the used test substance. Drugs with positive treatment effects as well as even mild side effects in a disease model might not be suitable in the clinic. By making decisions that support candidate drugs with the least toxic effects, early in drug development in the pharmaceutical industry and academia, many animals and unnecessary additional animal experiments could be potentially avoided.
During toxicity testing studies, clinical observations should be registered but are seldom used for decision-making in chemical risk assessment or pharmaceutical development (OECD, 2008; WHO, 2009). This reflects an underestimation of the information carried by clinical signs. In this study, we have demonstrated that clinical signs have predictive power regarding which organ injuries are likely to be observed after necropsy. Future research in this area may focus on repeating the analysis with larger amounts of data, including longer studies and other classes of substances, in order to corroborate or refute the presented proof-of-principle that there are associations between clinical signs and specific pathological findings. Further elucidation of these associations would improve study management and design, promoting Refinement and Reduction by enabling greater use of the information obtained during in vivo studies and by preventing animal suffering. Furthermore, these prediction models with a short list of clinical signs can also be useful during the efficacy studies performed early in the pharmaceutical development process, where they can provide valuable information for test substance selection before the regulatory toxicity testing. Based on our findings, we suggest that eyes half shut should be included in such a list as well as in dose setting guidelines for toxicity testing. In conclusion, clinical observations were clearly associated with pathological findings, which suggests that clinical signs can be used as early predictors for adverse outcomes observed after necropsy. For the investigated anti-cancer drug, piloerection, eyes half shut and decreased motor activity showed the strongest associations with the pathological findings in the thymus, testes, bone marrow and epididymides.
Additionally, a strong empirical association was observed for a 5% body weight loss, which was an accurate predictor regarding organ injuries, but could also be predicted by the clinical signs. The empirically derived PLS regression models predicted accurately over 80% of the animals' pathological findings in the mentioned organs, when building the model using all clinical observations from three animal studies. These results show that PLS modelling represents a promising analytical method and a strong candidate for a real-time toxicity and animal welfare monitoring system. We conclude that clinical observations can be used as early markers of toxicity, as well as to assess and improve welfare during pharmaceutical development, reducing animal use and unnecessary suffering. In addition, we suggest that signs registered during toxicity testing studies, as well as in other research areas, could be simplified using a short list including a 5% body weight loss, piloerection, eyes half shut and decreased motor activity. Further research is required to improve the accuracy of these predictions and to further support the proof-of-principle this analysis has presented.