Improvement of the Welfare Quality scoring model for dairy cows to fit experts’ opinion

- We refined the Welfare Quality model for dairy cows to better fit experts' opinion.
- We improved the 'Absence of prolonged thirst' criterion to avoid threshold effects.
- We improved the 'Absence of disease' criterion to limit compensation between measures.
- We performed a global sensitivity analysis of the original and the new model.
- The Welfare Quality protocol for dairy cows will be updated with the new model.


Introduction
The welfare of animals is considered to involve multiple dimensions (e.g. Fraser, 2003). Therefore, the assessment of welfare as a whole requires a multicriteria approach. Animal welfare assessment is, by nature, value-based (Veissier et al., 2011). It implies combining facts (e.g. measures) and the interpretation of these facts, and therefore combining objective and subjective information (e.g. interpretation and aggregation of measures) to obtain welfare scores (Spoolder et al., 2003). For dairy cattle, several protocols have been proposed to define welfare measures and aggregate their results into a score reflecting the overall welfare status of a herd (e.g. Bartussek, 1999; Burow et al., 2013; Welfare Quality®, 2009), as well as simplified versions of the Welfare Quality protocol (Tuyttens et al., 2021; Stomp et al., 2023). Welfare Quality proposes a comprehensive scoring model based on measures whose relevance has been checked by concurrent, construct or consensus validity (Knierim et al., 2021) and on a scoring system fine-tuned according to experts' opinion (Welfare Quality®, 2009). The experts were animal scientists (for their knowledge of animals and of the meaning of measures), social scientists (for their knowledge of how various societal groups value animal welfare), and stakeholders (for their knowledge of what can be done in practice) (Veissier et al., 2011). Since their original publication in 2009, the Welfare Quality protocols have been extensively used, including for certifying farms in Spain and Finland.
However, the original Welfare Quality scoring model has received criticism. First, the results of the model for dairy cows are too sensitive to the number and cleanliness of drinkers (Heath et al., 2014; de Graaf et al., 2017, 2018; van Eerdenburg et al., 2021). Yet this measure is resource-based and does not necessarily reflect the level of thirst. Second, the model is not sensitive enough to the prevalence of lameness or mastitis because of compensation mechanisms. For example, a farm where 50% of the cows are affected by mastitis, but where no other disease is observed, would still receive a high score (64.5) for the 'Absence of disease' criterion, because the criterion is calculated from the proportion of alarming problems (here only one problem: mastitis) out of the 8 potential disease problems. Therefore, the original scoring model does not correspond to the opinion of some dairy cattle welfare experts, nor does it encourage farmers to reduce such disorders (e.g. de Vries et al., 2013; Heath et al., 2014; van Eerdenburg et al., 2018). Doubts were also expressed about the reliability and the validity of the 'Qualitative Behaviour Assessment' (de Graaf et al., 2017). While work continues on refinement of the measures, the General Assembly of the Welfare Quality Network decided to put efforts into improving the scoring model, namely the calculation of scores for two criteria: 'Absence of prolonged thirst' and 'Absence of disease'.
Amendments have been suggested to improve the scoring of the provision of water or of the health status. For Criterion 'Absence of prolonged thirst', van Eerdenburg et al. (2018) proposed to divide the number of drinkers by their average cleanliness to produce a score. For Criterion 'Absence of disease', de Vries et al. (2013) argued for a limited compensation between the results obtained across health disorders (nasal discharge, ocular discharge, hampered respiration, diarrhoea, vulvar discharge, milk somatic cell counts, mortality, dystocia, downer cows). Indeed, in the original Welfare Quality protocol, the score for Criterion 'Absence of disease' is calculated by defining warning and alarm thresholds for each health disorder and computing a weighted sum of alarms and warnings, so that the impact of each of the nine health disorders is diluted into a whole.
Sensitivity analyses of a model are essential to identify how variations in the inputs (here the results obtained for each welfare measure) influence the outputs (here the variation in the overall assessment of a farm) (see Iooss and Lemaître, 2015, for a review). Due to lack of time during the Welfare Quality project, no sensitivity analysis was performed. Two studies (de Vries et al., 2013; de Graaf et al., 2018) looked at how the model performed in scenarios in which the results from a farm were replaced by higher values (de Vries et al., 2013) or by the best or worst possible values (de Graaf et al., 2018). Their approaches had two main methodological biases: first, the increase applied to results varied between measures (e.g. the higher the initial value, the smaller the shift to the best possible value) and second, interactions between measures and non-linear effects of measures (e.g. threshold effects) were not addressed. These biases, in turn, may lead to interpretation bias (Saltelli et al., 2006; Saltelli and Annoni, 2010). This paper presents a formal sensitivity analysis of the Welfare Quality scoring model in its original and alternative versions.
The present paper aims to improve the Welfare Quality calculations for the criteria 'Absence of prolonged thirst' and 'Absence of disease' in dairy cows, so that the results are more sensitive to input data and better fit experts' opinion. We propose new calculations for these two criteria. To check the benefit of the new calculations, we perform a global sensitivity analysis using the Morris method, which avoids the above-mentioned biases (Saltelli and Annoni, 2010), on the original Welfare Quality scoring model and on the alternative model that includes the new calculations. We then compare the results of the two models to experts' opinion.

Welfare Quality scoring model for dairy cattle
The Welfare Quality protocol for assessing the welfare of dairy cattle on-farm can be found at http://www.welfarequalitynetwork.net/. It includes 49 measures taken on animals or their environment, grouped into 11 criteria and then 4 principles before an overall assessment is produced. The scoring model that builds from scores on individual measures to an overall assessment comprises three steps, briefly described here.
Step 1: From measures to criterion scores.
Aggregation starts by combining the 49 measures (Supplementary Table S1) into 11 criterion scores expressed on a 0-100 scale, with 100 as the best score. Several aggregation methods are used depending on the measures included in a criterion. For 'Absence of prolonged thirst', the five measures (the total length of water troughs, the number of water bowls, the number of water troughs, the cleanliness of water points, and the water flow) are aggregated using a decision tree. At each node of the tree, a decision is taken based on a Yes/No answer to a specific question (e.g. Are the drinkers clean? (drinkers with fresh feed residues are not counted as dirty); Are drinkers in sufficient number?). The decision tree finally defines seven possible situations, each assigned a score.
For 'Absence of disease', the percentages of cows affected by each of ten health problems are converted into three classes: 'below warning threshold', 'above warning threshold and below alarm threshold' or 'above alarm threshold', the warning threshold being half of the alarm threshold (for example, the warning and alarm thresholds for nasal discharge are 5% and 10%). The numbers of warnings and alarms are then combined into a weighted sum (with more weight attributed to alarms), which is in turn translated into a score using a spline function.
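The three-class conversion described above can be sketched as follows. This is a minimal illustration, not the protocol's implementation: the function names are hypothetical, the handling of values exactly equal to a threshold is an assumption, and the weights of the weighted sum and the spline are not given in the text, so they are left out.

```python
def classify_prevalence(prevalence_pct, alarm_threshold_pct):
    """Classify one health problem from its herd prevalence (in %).

    The warning threshold is half of the alarm threshold, as in the text.
    Treating a value equal to a threshold as 'above' is an assumption.
    """
    warning_threshold_pct = alarm_threshold_pct / 2
    if prevalence_pct >= alarm_threshold_pct:
        return "alarm"
    if prevalence_pct >= warning_threshold_pct:
        return "warning"
    return "ok"


def count_warnings_and_alarms(prevalences_pct, alarm_thresholds_pct):
    """Count warnings and alarms over several health problems."""
    classes = [
        classify_prevalence(p, a)
        for p, a in zip(prevalences_pct, alarm_thresholds_pct)
    ]
    return classes.count("warning"), classes.count("alarm")


# Nasal discharge: warning at 5%, alarm at 10% (values from the text).
print(classify_prevalence(3.0, 10.0))   # below warning threshold
print(classify_prevalence(7.0, 10.0))   # between warning and alarm
print(classify_prevalence(12.0, 10.0))  # above alarm threshold
```

Because only the class of each problem enters the weighted sum, a prevalence just above the alarm threshold and a prevalence several times higher contribute identically, which is the dilution effect criticised in the Introduction.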
Step 2: From criterion scores to principle scores.
Criterion scores are aggregated into principle scores expressed on the same 0-100 scale as for criteria. For instance, Principle 'Good feeding' comprises 'Absence of hunger' and 'Absence of prolonged thirst', and Principle 'Good health' comprises 'Absence of injuries', 'Absence of disease' and 'Absence of pain due to management procedures'. Choquet integrals are used for this aggregation, which limits the possible compensation of poor scores by good ones while treating Criterion 'Absence of disease' as more important than Criterion 'Absence of injuries', which is in turn more important than Criterion 'Absence of pain due to management procedures'.
Step 3: From principle scores to overall welfare category.
The final aggregation is from principle scores to overall welfare category. The welfare is considered 'excellent' when the farm scores ≥ 50 for each principle and ≥ 75 on two of them. When the farm scores ≥ 15 on each principle and ≥ 50 on at least two of them, it is classified as 'enhanced'. 'Acceptable' farms score ≥ 5 for all principles and ≥ 15 for at least three principles. The remaining farms are 'not classified'. We used the INRAE 'Welfare Assessment of Farm Animals' webtool (https://www1.clermont.inrae.fr/wq/) to calculate scores and to assign farms to welfare categories as defined in the original Welfare Quality protocol. We used a modified version of the webtool to implement new calculations taking into account the proposed improvements in the scoring model.
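The category rules of Step 3 can be sketched as a short function. This is an illustration only: the function name is hypothetical, the comparisons are assumed to be inclusive (≥), and 'on two of them' is read as 'on at least two'. It is not the code of the INRAE webtool.

```python
def overall_category(principle_scores):
    """Assign an overall welfare category from the four principle scores.

    Thresholds are those quoted in the text; inclusive comparisons are an
    assumption of this sketch.
    """
    scores = list(principle_scores)

    def count_at_least(threshold):
        return sum(score >= threshold for score in scores)

    if min(scores) >= 50 and count_at_least(75) >= 2:
        return "excellent"
    if min(scores) >= 15 and count_at_least(50) >= 2:
        return "enhanced"
    if min(scores) >= 5 and count_at_least(15) >= 3:
        return "acceptable"
    return "not classified"


print(overall_category([80, 76, 55, 60]))  # excellent
print(overall_category([90, 90, 90, 10]))  # acceptable
```

The second call shows the limited compensation built into this step: three very high principle scores cannot lift a farm above 'acceptable' when the fourth principle scores 10.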

Proposed improvements of the Welfare Quality scoring model
The Welfare Quality Network (http://www.welfarequality.net/) discussed alternative ways to calculate 'Absence of prolonged thirst' and 'Absence of disease' so that the scores obtained better match experts' opinion. These alternatives are tested in the present paper.

Absence of prolonged thirst.
A linearised version of the interpretation of 'drinkers' availability', weighted by a 'cleanliness' score, is proposed to avoid threshold effects. To do so:
1. We calculate the total number of water bowls and convert it into trough length (1 bowl = 60 cm of trough). We then calculate the cumulated length of water troughs. Finally, we add the cumulated length of water troughs to the cumulated length of the bowls (previously transformed into trough length). As in the original model, if a drinker is not functioning properly or the water flow is insufficient (i.e. lower than 20 L/min for a trough or lower than 10 L/min for a bowl), its length is divided by two.
2. We then divide the total drinkers' length by the number of cows.
3. We calculate a 'drinkers' availability' score based on the length of trough per cow according to one linear equation (if cows have access to only one drinker) or to a two-piece piecewise linear equation (if cows have access to at least 2 drinkers) (Figure 1). The following equations were used:
Note that in the particular case of tied cows, when there is 1 bowl for 2 cows, each cow has access in theory to half a bowl, corresponding to 30 cm of equivalent trough length (60 cm divided by two). In the worst case (insufficient water flow and only one bowl for 2 cows), the average drinkers' length per cow is 15 cm (30 cm divided by two), which remains above the recommendation of 6 cm per cow (cf. equations and Figure 1). This results in a drinkers' availability score of 60, which is the best score when cows have access to only one drinker.
4. We calculate a 'drinker dirtiness' score as the average dirtiness of drinkers (a clean drinker is scored 1, a partially dirty one 2, and a dirty one 3).
5. The score for 'Absence of prolonged thirst' is then the 'drinkers' availability' score divided by the 'drinker dirtiness' score.
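Steps 1, 2, 4 and 5 above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the representation of drinkers as `(length_cm, works_properly)` pairs are hypothetical, and the linear equations of step 3 are not reproduced in the text, so the 'drinkers' availability' score is passed in as a given value rather than computed.

```python
BOWL_EQUIVALENT_CM = 60.0  # 1 bowl = 60 cm of trough (from the text)


def drinker_length_per_cow(drinkers, n_cows):
    """Steps 1-2: equivalent trough length (cm) available per cow.

    `drinkers` is a list of (length_cm, works_properly) pairs; a bowl is
    entered with BOWL_EQUIVALENT_CM as its length. A drinker that does not
    work properly, or has insufficient flow, counts for half its length.
    """
    total_cm = 0.0
    for length_cm, works_properly in drinkers:
        total_cm += length_cm if works_properly else length_cm / 2
    return total_cm / n_cows


def drinker_dirtiness(dirtiness_scores):
    """Step 4: mean dirtiness (1 = clean, 2 = partially dirty, 3 = dirty)."""
    return sum(dirtiness_scores) / len(dirtiness_scores)


def absence_of_thirst_score(availability_score, dirtiness_scores):
    """Step 5: availability score divided by the mean dirtiness score."""
    return availability_score / drinker_dirtiness(dirtiness_scores)


# Worst case for tied cows in the text: one bowl with insufficient flow
# shared by two cows gives 15 cm of equivalent trough per cow.
print(drinker_length_per_cow([(BOWL_EQUIVALENT_CM, False)], n_cows=2))

# With the availability score of 60 quoted in the text for a single drinker:
print(absence_of_thirst_score(60, [1]))  # clean drinker
print(absence_of_thirst_score(60, [3]))  # dirty drinker
```

Dividing by the mean dirtiness means that a fully dirty set of drinkers cuts the criterion score to one third of the availability score, while intermediate dirtiness levels give intermediate penalties rather than a jump between branches of a decision tree.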

Absence of disease.
For each health disorder, we asked nine animal welfare scientists (seven of them authors of de Graaf et al., 2017, some of them also veterinarians by training, and all of them experts in the model) to give us their expert scores (on a 0-100 scale as in Welfare Quality) for four prevalences: the alarm threshold defined in the original Welfare Quality scoring model and ¾, ½ and ¼ of this threshold (e.g. for nasal discharge we asked experts to attribute a score on the 0-100 scale to 10%, 7.5%, 5% and 2.5% of cows affected, 10% corresponding to the alarm threshold). In addition, we asked them for the lowest prevalence to which they would attribute a score of 0. We fitted an I-spline curve to model the experts' scores according to the prevalence of each health disorder, I-spline curves being used in Welfare Quality to account for nonlinearity between prevalence and experts' scores (Welfare Quality®, 2009). Spline calculations were performed with R 3.6 (R Core Team, 2019) with the help of the 'splines2' package (Wang and Yan, 2020), so as to minimise the sum of squared errors between the scores given by experts and the calculated ones. As in Welfare Quality (Welfare Quality®, 2009), splines were interpolated by a piecewise polynomial of degree 3, so that they can be easily manipulated.
We chose to keep the three lowest scores obtained (from the I-spline curves) across health disorders. We then aggregate them with a Choquet integral to produce the score for 'Absence of disease':
$$\text{Score} = S_a + (S_b - S_a)\,\mu_{bc} + (S_c - S_b)\,\mu_c$$
where $S_a$, $S_b$, $S_c$ are the three lowest scores, sorted such that $S_a \le S_b \le S_c$. We use 0.3 for $\mu_c$ and 0.155 for $\mu_{bc}$; these values correspond to averaged values used to calculate the 'Good health' principle score for dairy and beef cattle in the Welfare Quality protocol.
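The aggregation of the three lowest scores can be sketched as below. The formula used is the standard three-element Choquet integral written in terms of S_a ≤ S_b ≤ S_c, µ_bc and µ_c, which is an assumption about the exact form used by the authors; the function name is hypothetical and the capacities are the values quoted in the text.

```python
def absence_of_disease_score(disorder_scores, mu_c=0.3, mu_bc=0.155):
    """Choquet aggregation of the three lowest per-disorder scores.

    `disorder_scores` holds one 0-100 score per health disorder (at least
    three). mu_c and mu_bc default to the values quoted in the text.
    """
    s_a, s_b, s_c = sorted(disorder_scores)[:3]  # s_a <= s_b <= s_c
    return s_a + (s_b - s_a) * mu_bc + (s_c - s_b) * mu_c


# Nine disorders, six of them with a perfect score: only the three lowest
# scores (20, 50 and 80) enter the result.
scores = [20, 50, 80, 100, 100, 100, 100, 100, 100]
print(absence_of_disease_score(scores))
```

With these inputs the result stays much closer to the worst score (20) than a plain average of the nine scores would, which illustrates how the three-lowest rule and the Choquet capacities limit compensation by the healthy disorders.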
Expert opinion had been previously collected on two subsets of this dataset.We used these subsets to check the consistency of the models' results with experts' opinion:

Subset 1
From the 491 above-mentioned dairy cattle farms, data from 44 farms (25 loose-housing and 19 tied-stall farms; 20 from Denmark and 24 from Austria) were assessed within the Welfare Quality project by 4 animal scientists from the project team, who attributed an overall welfare score to each farm on a visual analogue scale of 120 mm (thus leading to a score from 0 to 120).

Subset 2
From the 491 above-mentioned dairy cattle farms, data from 60 dairy cattle loose-housing farms from The Netherlands were used. Veterinary practitioners expressed their opinion on the overall farm welfare on a 3-point scale: 1 - weak welfare, 2 - sufficient welfare or 3 - good welfare. The veterinarians came from four large veterinary practices spread out over The Netherlands. The classification was made by consensus of all the veterinarians (n > 5) who visited the dairy farms on a regular basis. See van Eerdenburg et al. (2021) for more details.

Sensitivity analysis of the original and alternative models
We used the Morris method (Morris, 1991) modified by Campolongo et al. (2007) to perform sensitivity analyses of both the original Welfare Quality scoring model and the alternative model including the modifications for 'Absence of prolonged thirst' and 'Absence of disease'. We performed the analysis for loose-housing farms and tied-stall farms separately because Welfare Quality measures differ slightly between the two systems.
The Morris method allows the identification of the important inputs of a model, including those involved in interactions. It is used when the number of model inputs (i.e. measures in our case) is too large for all combinations to be tested at a reasonable computational cost. The method is based on a 'One-factor-At-a-Time' (OAT) design of experiments. In brief, within the input space (consisting of all combinations of possible values for each input), an initial point (e.g. a farm with 20% too lean cows, 10% lame cows, 2% mortality, etc.) is randomly selected; a next point is defined by increasing or decreasing the value of only one input by an elementary shift. The difference in model output produced by the two points is calculated. This is repeated until all inputs have varied once. Thereafter the whole process is repeated starting from another initial point, until the order of influence of the inputs converges. The Morris method calculates elementary effects ($R_i$) due to each input using the equation:
$$R_i = \frac{y(x_1, \ldots, x_i + \Delta, \ldots, x_n) - y(X)}{\Delta}$$
where $y(X)$ is the output, $X = (x_1, x_2, \ldots, x_n)$ is the n-dimensional vector of inputs studied, and $\Delta$ is the elementary increment (decrease or increase) of the OAT.
For each input, we obtain two sensitivity indices calculated from all the $R_i$ obtained for the input (Saltelli et al., 2004):
- The mean of the absolute values (μ*) of $R_i$, estimating the overall influence of input i on the output. For instance, a μ*-value of 50 for an input means that an increase (= Δ) in the input of 0.2 of its distribution, i.e. an increase of 20 percentiles that accounts for the initial value and the distribution of the input values (e.g. uniform vs. observed distribution), changes the score by 10 points on average (50 × 0.2 = 10);
- The SD of $R_i$, estimating higher-order effects, i.e. nonlinear effects (e.g. threshold effects) or interactions with other inputs.
The most influential inputs are those with high values for both μ* and SD.
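The two indices can be illustrated with a deliberately simplified OAT sketch in pure Python. It is not the design of Campolongo et al. (2007) nor the R 'sensitivity' package used in this study: starting points are drawn uniformly (no Latin Hypercube Sampling), shifts are increases only, and `toy_model` is a hypothetical stand-in for the scoring model with inputs rescaled to [0, 1].

```python
import random
import statistics


def morris_indices(model, n_inputs, n_trajectories=50, delta=0.2, seed=0):
    """Return (mu_star, sd) lists, one value per input.

    Each trajectory starts from a random point and shifts every input once
    by +delta, in random order, recording the elementary effect
    (y_after - y_before) / delta of each shift.
    """
    rng = random.Random(seed)
    effects = [[] for _ in range(n_inputs)]
    for _ in range(n_trajectories):
        # Start in [0, 1 - delta] so that every shifted value stays in [0, 1].
        x = [rng.uniform(0.0, 1.0 - delta) for _ in range(n_inputs)]
        y = model(x)
        order = list(range(n_inputs))
        rng.shuffle(order)
        for i in order:
            x[i] += delta
            y_new = model(x)
            effects[i].append((y_new - y) / delta)
            y = y_new
    mu_star = [statistics.fmean([abs(e) for e in eff]) for eff in effects]
    sd = [statistics.stdev(eff) for eff in effects]
    return mu_star, sd


def toy_model(x):
    """Hypothetical model: input 0 acts linearly, inputs 1 and 2 interact."""
    return 50 * x[0] + 100 * x[1] * x[2]


mu_star, sd = morris_indices(toy_model, n_inputs=3)
for i, (m, s) in enumerate(zip(mu_star, sd)):
    print(f"input {i}: mu* = {m:.2f}, SD = {s:.2f}")
```

In this toy case the linear input has μ* close to 50 and an SD close to 0, whereas the two interacting inputs show a clearly non-zero SD, which is how the SD index flags interactions or nonlinear effects.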
As OAT is subject to randomness, the exploration of the input space was improved by using Latin Hypercube Sampling (LHS) (e.g. Van Griensven et al., 2002; Francos et al., 2003), which maximises the 'maximin' criterion (Johnson et al., 1990) and so ensures that the initial positions of points are well distributed.
We expect the 'overall' output to be only slightly sensitive to each input because there are many inputs and only four overall categories. To be able to discriminate influential inputs, we used 500 OAT (allowing 500 $R_i$ to be calculated for each input). Moreover, we checked the convergence of the method (i.e. same ranking of inputs according to their influence and changes in μ* or SD < 0.01 for all inputs). We decided on an elementary increment (Δ) corresponding to 1/5 of the distribution of each input. In a first step, we used a Uniform distribution of inputs, which is common when one wants to understand model behaviour (Monod et al., 2006). In that case the increment corresponds to 1/5 of the range (e.g. 20 if the input is expressed on a 0-100 scale). In a second step, we considered the distributions observed in the whole dataset (n = 491 farms). In that case the increment corresponds to 20 percentiles. We could thus check whether the influence of the inputs was similar when tested on the two types of distributions.
In order to avoid unit effects and to facilitate the interpretation of the sensitivity analysis, we rescaled input variables between zero and one before calculating the two sensitivity indices.
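The difference between the two definitions of the increment, and the rescaling step, can be sketched as follows. The function names are hypothetical, the rank-based percentile shift is only one possible way of moving a value by 20 percentiles in an observed distribution, and the sample values are invented for illustration (they are not taken from the 491-farm dataset).

```python
def uniform_shift(value, low, high, fraction=0.2):
    """Increase `value` by `fraction` of the range, capped at `high`."""
    return min(value + fraction * (high - low), high)


def percentile_shift(value, observed, fraction=0.2):
    """Move `value` up by `fraction` of the observed distribution.

    Works on ranks: the value is moved up by round(fraction * n)
    observations in the sorted sample (an assumption of this sketch).
    """
    data = sorted(observed)
    n = len(data)
    rank = sum(v <= value for v in data)  # observations <= value
    step = round(fraction * n)
    index = min(max(rank - 1 + step, 0), n - 1)
    return data[index]


def rescale(value, low, high):
    """Rescale `value` to the [0, 1] interval."""
    return (value - low) / (high - low)


# Invented, strongly skewed prevalences (%) for illustration only.
sample = [0, 0, 0, 0, 1, 1, 2, 3, 5, 20]
print(uniform_shift(0, min(sample), max(sample)))  # shift on the range
print(percentile_shift(0, sample))                 # shift on the ranks
print(rescale(5, min(sample), max(sample)))
```

For a skewed input such as this one, a 20-percentile shift moves the value far less than one fifth of the range, which is why the influence of an input can differ between the Uniform and the observed distributions.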
Calculations for the sensitivity analysis were performed with R (version 3.6) (R Core Team, 2019). The 'sensitivity' package (Iooss et al., 2020) was used to calculate the sensitivity indices.

Consistency of the original and alternative models with experts' opinion
We evaluated the consistency of the original and the alternative model with experts' opinion:
- In Subset 1, experts expressed their opinion on each of the 44 farms by providing an overall welfare score on a continuous scale. We used mixed linear regression to relate experts' opinion (the variable to be explained) to the welfare category ('not classified', 'acceptable', 'enhanced', or 'excellent') produced by the original or the alternative scoring model (explanatory variable), with the expert as random factor.
- In Subset 2, experts expressed their opinion on each of the 60 farms using a 3-point scale. We used an ordinal regression (also known as a cumulative link model) to relate experts' opinion to the category produced by the original or the alternative scoring model.
We applied the Vuong test for non-nested models (Vuong, 1989) to check whether the alternative model fits experts' opinion better than the original model, or vice versa.
Calculations were performed with R 3.6 (R Core Team, 2019), with the use of the package 'lme4' (Bates et al., 2015) for mixed linear regression, the 'car' package (Fox and Weisberg, 2019) for the Levene Test, the 'ordinal' package (Christensen, 2019) for calculation of ordinal model, and the package 'nonnest2' (Merkle and You, 2020) for the Vuong tests.

Results
Polynomial approximations of the I-spline curves designed to score 'Absence of disease' are given in Supplementary Table S2. The measure 'Frequency of coughing per cow per 15 min' was not kept, because it never reached the warning threshold in the database, and partners from the Welfare Quality Network agreed to remove it from the protocol because it cannot be measured accurately at animal level.
We only detail results of the sensitivity analysis for criteria and principles affected by the changes in calculations (Criteria 'Absence of prolonged thirst' and 'Absence of disease'; Principles 'Good feeding' and 'Good health'; and overall assessment), for the Uniform distribution, and for the case of loose housing.The detailed results for other criteria and principles (which are not impacted by the updated protocol), or calculated with the observed distribution or Uniform distribution in tied farms are given in Supplementary tables S3 to S13 for criteria and Supplementary tables S14 to S17 for principles.

Sensitivity of the new calculations for 'Absence of prolonged thirst'
The five inputs involved in the criterion 'Absence of prolonged thirst' were all influential (i.e. μ* and SD ≥ 0.1) at criterion level in the case of loose housing (Table 1): the number and the cumulated length of the water troughs, the number of water bowls, the cleanliness of the water points and the answer to the question 'Is the water flow enough? (True/False)'. In the original model, the mean effects (μ*) ranged from 14.7, for the cumulated length of water troughs, to 27.2, for water flow, except for the cleanliness of water points, whose mean effect was equal to 109.8. As the input 'cleanliness of water points' is a Boolean (true/false), the shift over the distribution applied in the sensitivity analysis necessarily implied a change from true to false (or vice versa). In the model with thirst improvement, the mean effects (μ*) ranged from about 9.5, for the cumulated length of water troughs, to 20.4, for the number of water bowls, except for the cleanliness of water points, whose mean effect was 109.8. From the original model to the alternative model with thirst improvement, the SD of the elementary effects was reduced by about 50% (from 40% for the number of water troughs to 66% for the number of water bowls).
In the case of tied housing, only the number and the cleanliness of the water bowls were influential (Supplementary Table S4). There is no water trough in tied-stall barns, so the 'Total length of water troughs' and the 'Number of water troughs' always equal 0, which explains their absence of influence in the sensitivity analysis.
Theoretically, the water flow would influence this criterion. However, as described previously, even if all the water bowls have an insufficient water flow, the drinkers' length remains far above the recommendation, so that the minimum score (i.e. only one drinker available per cow) was 60 in both models when water was clean, and was 32 and 20 in the original model and the alternative model, respectively, when the water was not clean.
The principle 'Good feeding' was sensitive to the five inputs of 'Absence of prolonged thirst' and to the input '% of very lean cows' (Table 2). The new model slightly increased the influence of the '% of very lean cows' and of the 'cleanliness of water points', by about 9%, while reducing the influence of the other inputs by 33% (for the number of water bowls) to 51% (for the water flow).
Between the original model and the model with thirst improvement, the SD of the elementary effects for the 5 inputs linked to thirst was reduced by about 55% (from 27% for the cleanliness to 75% for the number of water bowls), whereas the standard deviation for '% of lean cows' (linked to hunger) was slightly increased.

Sensitivity of the new calculations for 'Absence of disease'
In the original model, there were 10 influential inputs that constitute the Criterion 'Absence of disease' (Table 3). The mean effects (μ*) of the 10 inputs used in Criterion 'Absence of disease' ranged from 5.5 to 9.9. When the new calculations for 'Absence of disease' were used, the mean effects (μ*) of inputs ranged from 1.1 to 15.4, with only 9 inputs considered since the 'Frequency of coughing per cow per 15 min' has been removed. Results for tied stalls, with the Uniform distribution, were similar to those obtained for loose housing. With the observed distribution, the 'Frequency of coughing per cow per 15 min' had no influence, as no observation in the database reached the warning threshold (3 coughs/cow/15 min, while the maximum observed was 1.07) (Supplementary Table S8). The measure '% cows with increased respiratory rate' had very little influence due to its low prevalence in the database. With the observed distribution for loose housing, the mean effects (μ*) of the 8 other inputs used in Criterion 'Absence of disease' ranged from 10.20 to 18.05. When the new calculations for 'Absence of disease' were used, the mean effects (μ*) of these inputs increased by 11% to 37% (Supplementary Table S8).
The principle 'Good health' was sensitive to the 10 inputs of Criterion 'Absence of disease' and to the 14 inputs of Criteria 'Absence of injuries' and 'Absence of pain induced by management procedures' (Table 4). The influence of several diseases was increased in the alternative model compared to the original one: '% cows with increased respiratory rate', +262%; '% cows with mastitis', +54%; '% of mortality', +14%. The influence of other 'Absence of disease' inputs was reduced (-15% to -86%). With the observed distribution for loose housing, the influence of all diseases increased (from +60% to +165%, Supplementary Table S16), except for '% of cows with vulvar discharge', which decreased by 9%. The influence of the inputs from Criteria 'Absence of injuries' and 'Absence of pain due to management procedures' was also reduced (-8% to -29%). Within the criterion 'Absence of injuries', lameness was 66% more influential than skin alterations/lesions.
Within the principle 'Good health', using a Uniform distribution, lameness ('% of not lame cows'), which was the second most influential input (after the type of method used for dehorning cows) with the original model, is the third most influential with the alternative model.

Sensitivity of the overall scoring and consistency with experts' opinion
In both the original and the alternative models, the overall score showed very low levels of sensitivity (the maximum μ* for an input is 0.39) with high levels of interaction effects (SD is on average 10 times higher than μ*) (Table 5).
Within Subset 1, the consistency with experts' opinion was slightly better for the alternative model than for the original one (Z = -1.548, p_original_better = 0.939, p_new_better = 0.061, with p_original_better the probability to reject the hypothesis that the original model matches the expert opinion better than the new one, and p_new_better the probability to reject the hypothesis that the new model matches the expert opinion better than the original one).
Within Subset 2, the consistency with experts' opinion was similar between the original and the alternative model (Z = 0.371, p_original_better = 0.355, p_new_better = 0.645).
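As an arithmetic check on the values reported above, the two one-sided p-values are consistent with the lower and upper tails of a standard normal distribution evaluated at Z. The sketch below uses the Python standard library only; the function name is hypothetical, and it does not reproduce how the 'nonnest2' package computes the Z statistic itself.

```python
from statistics import NormalDist


def normal_tails(z):
    """Return (lower_tail, upper_tail) of the standard normal at z."""
    lower_tail = NormalDist().cdf(z)
    return lower_tail, 1.0 - lower_tail


for label, z in [("Subset 1", -1.548), ("Subset 2", 0.371)]:
    lower, upper = normal_tails(z)
    # The lower tail matches the reported p_new_better and the upper tail
    # matches the reported p_original_better (to three decimals).
    print(f"{label}: Z = {z}, lower tail = {lower:.3f}, upper tail = {upper:.3f}")
```

Read this way, only the Subset 1 comparison comes close to the conventional 0.05 level (0.061), which agrees with the wording 'slightly better' used for that subset.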

Discussion
We applied two modifications of the Welfare Quality scoring model for dairy cows.
We modified the calculation of the score for Criterion 'Absence of prolonged thirst' to avoid threshold effects and to take the cleanliness of drinkers into account more precisely. We modified the calculation of the 'Absence of disease' criterion by calculating an elementary score for each health disorder and combining the three lowest of them, to limit compensation between health disorders. An alternative scoring model is proposed that incorporates these two modifications, which aim to change the sensitivity of the model to measures and to better match experts' opinion.
The modified calculation of the score for 'Absence of prolonged thirst' reduced the overly large influence of the resource-based measures for which the original model has been criticised. This overly large influence in the original scoring model was partly due to the fact that this criterion was based on a decision tree, which by nature implies the definition of several thresholds and thus threshold effects. For instance, adding a drinker in a pen resulted in a large improvement of the score for 'Absence of prolonged thirst' if it allowed a farm to switch from one branch to another in the decision tree (e.g. switching from the branch 'only one drinker' to the branch 'at least 2 drinkers'). The modifications proposed for the calculation of the 'Absence of prolonged thirst' score avoid most of the threshold effects because they include continuous equations instead of thresholds. This is confirmed by the sensitivity analysis, which shows a reduction of the standard deviation of the effects of each individual measure on Criterion 'Absence of prolonged thirst' (standard deviation of the elementary effects: from an average of 70.6 in the original model to 33.2 in the alternative model). There may be a problem for tied stalls, where the minimum score is 60. We did not put our efforts into refining the calculation for cows in tied stalls because such systems are likely to disappear, at least in Europe (EFSA Panel on Animal Health and Animal Welfare et al., 2023).
The new calculations for Criterion 'Absence of disease' significantly reduce potential compensations between health disorders. Indeed, by considering the three lowest scores associated with a health disorder, we avoid the compensation by the six others. Moreover, by using a Choquet integral on the three lowest scores, we allow only partial compensation between these three scores. For instance, contrary to the original scoring model, a shift from 0% of mortality to an extreme value (e.g. 100%) results in a very low score for 'Absence of disease' with the alternative model. These expected effects were confirmed by the sensitivity analysis. In the original model, with the Uniform distribution, all inputs had a similar influence, in terms of mean levels (μ*) and interactions (SD). With the alternative model, the influence of each health disorder varies. For example, mortality, which was found by de Graaf et al. (2018) to be not influential enough in the original model, is now the third most influential input for 'Absence of disease'. Another positive effect of the new model is the reduction of the threshold effects that were due to the use of alarm and warning thresholds. With the use of spline curves, thresholds are no longer used and the score for each health disorder can vary continuously from 0 to 100. However, by reducing the compensation between health disorders, the average value of the 'Absence of disease' score is lower in the alternative model than in the original model. Because in Welfare Quality low values are more influential than high ones (Botreau et al., 2008), the influence of 'Absence of disease' on the principle 'Good health' is increased and, as a consequence, the influence of the other criteria ('Absence of injuries' and 'Absence of pain due to management procedures') and their related measures is reduced.
The overall assessment is not very sensitive to each input. Indeed, there are 49 inputs and only 4 categories for the overall assessment, and these inputs are rather independent of each other, so it is expected that a change in one input only rarely modifies the overall assessment and that only a combination of input changes can modify it.
The number of measures to be aggregated varies between criteria. The more numerous the measures aggregated into a criterion or a principle, the lower the influence of each measure. Indeed, when averaging two (respectively four) data points, the influence of each data point is ½ (respectively ¼). The aggregation method is more complex than an average, but the impact of the number of measures to be combined is similar. Alternatively, one could make groups of measures of the same size to build criteria and groups of criteria of the same size to build principles; for instance, from a set of 45 measures one could make 9 criteria of 5 measures each, then 3 principles of 3 criteria each, before aggregating the 3 resulting principles into an overall score. This would result in the same mathematical expectation for the influence of each individual measure but would certainly weaken the biological meaning of criteria. The question lies in whether we consider that welfare is composed of criteria that can be measured in different ways, or of measures with each measure representing an aspect of welfare. Welfare Quality rather identified the criteria that are meaningful for animal welfare and represent separate aspects of welfare, then grouped the criteria into functional principles (feeding, housing, health, behaviour). The fact that the number of measures varies between criteria inevitably results in a varying influence of each individual measure across criteria. Therefore, the influence of measures should be interpreted keeping in mind the number of measures in a criterion.
The alternative model is more aligned with experts' opinion than the original model (in particular in terms of sensitivity). Assessing animal welfare is a value-based exercise (Fraser, 1995). The Welfare Quality scoring model was, therefore, based on expert opinions and, by definition, cannot fit all expert opinions, because values vary between experts. For example, a farm can be considered 'Excellent' (best) by one expert and 'Not classified' (worst) by another (e.g. de Graaf et al., 2017, fig. 3, Herd 2). The model should nevertheless fit the opinion of the consulted experts. Here we were able to compare the overall assessment produced by the original and the alternative models to that attributed by experts. The alternative model better matches experts' opinion than the original model, at least when these opinions are expressed on the same scale as that of the model (four categories of welfare: Excellent, Enhanced, Acceptable, Not classified). There is definitely room for further improvement of the Welfare Quality scoring model. The way the measures are aggregated within a criterion can still be improved.
For example, we could change the way integument alterations are aggregated, as proposed in van Eerdenburg et al. (2018), or we could consider the distribution of problems across individuals (e.g. an animal affected by 2 disorders may have more welfare consequences than 2 animals each affected by a single disorder), as proposed by Sandøe et al. (2019). This would require a specific consultation of experts.

Conclusions
The alternative model proposed in this paper to improve the Welfare Quality scoring performs better than the original one. Compared to the original model, the alternative one significantly reduces the influence and 'threshold effects' of measures related to drinkers, and changes the influence of each health disorder, reducing compensation between them. The Welfare Quality Network is updating the Welfare Quality protocol for dairy cows by including the alternative scoring model proposed in this paper.

Figure captions
Figure 1.

Table 1.
Mean (μ*) and SD of the elementary effects associated with Criterion 'Absence of prolonged thirst' of the Welfare Quality scoring model for dairy cows. Results were scaled with a Uniform distribution for loose housing (detailed in Supplementary Table S1). The higher the μ* and SD, the more influence the input has.

Table 2.
Mean (μ*) and SD of the elementary effects associated with Principle 'Good feeding' of the Welfare Quality scoring model for dairy cows. Results were scaled with a Uniform distribution for loose housing (detailed in Supplementary Table S1). The higher the μ* and SD, the more influence the input has.

Table 3.
Mean (μ*) and SD of the elementary effects associated with Criterion 'Absence of disease' of the Welfare Quality scoring model for dairy cows. Results were scaled with a Uniform distribution for loose housing (detailed in Supplementary Table S1). The higher the μ* and SD, the more influence the input has. 'NA' implies here that the 'frequency of coughing' input was removed from the model.

Table 4.
Mean (μ*) and SD of the elementary effects associated with Principle 'Good health' of the Welfare Quality scoring model for dairy cows. Results were scaled with a Uniform distribution for loose housing (detailed in Supplementary Table S1). The higher the μ* and SD, the more influence the input has. 'NA' implies here that the 'frequency of coughing' input was removed from the model.