A comparison of inferential analysis methods for multilevel studies: Implications for drawing conclusions in animal welfare science
Introduction
The statistician George Box stated “all models are wrong, but some are useful” (Box and Draper, 1987), which raises the question: how do we determine which statistical model, or in other terminology, inferential analysis method, is most appropriate? In recent years, a spotlight has been directed at the transparency of animal research methodology, with low rates of methodological reporting being associated with less scientific rigour and lower reproducibility (Vogt et al., 2016, Ioannidis et al., 2009, Kilkenny et al., 2009). Articles pertaining to animal research have been criticised in the past for their design, statistical analysis and reporting (McCance, 1995, Kilkenny et al., 2009, Sargeant et al., 2010). The publication of a list of guidelines for animal research, known as the ARRIVE guidelines (Kilkenny et al., 2010), has helped to improve the quality of animal research (Gulin et al., 2015). These guidelines highlight the importance of choosing appropriate experimental assessments, sample sizes and statistical inferential analysis methods. It is important to ensure the sample size is sufficient to test the study hypothesis, while also bearing in mind the ethical and financial implications of using an unnecessarily large sample size within an experiment. There is a plethora of techniques for producing sample size estimates, and the appropriate technique will depend on the inferential analysis used for a study. Sample sizes can be difficult to calculate for more complex designs, though the importance of conducting these calculations accurately has been well communicated, particularly in the clinical trials literature (Freiman et al., 1978, Biau et al., 2008).
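One of the simplest such techniques can be sketched as an a-priori power calculation. The sketch below uses a normal approximation for a two-sided, two-sample comparison of means; the effect size, alpha and power values are purely illustrative (not drawn from any study discussed here), and scipy is assumed to be available:

```python
from scipy.stats import norm

def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Normal-approximation sample size per group for a two-sided,
    two-sample comparison of means (standardised effect size d)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value of the test
    z_beta = norm.ppf(power)            # quantile giving the desired power
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

# Animals per group needed to detect a hypothetical medium effect (d = 0.5)
# at alpha = 0.05 with 80% power:
print(round(n_per_group(0.5)))  # 63 per group (the exact t-based answer is ~64)
```

Enrolling substantially more animals than such an estimate requires raises exactly the ethical and financial costs noted above, while enrolling fewer risks an underpowered study; multilevel designs like the one analysed later need correspondingly more involved calculations that account for clustering.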
Discussion in this area naturally leads into consideration of the methodology of the statistical analysis conducted on the collected data. Many of the papers focussing on the quality of research using animals have primarily targeted experimental design, animal numbers and reporting, but have not discussed the appropriate analysis of what can often be complex datasets. Precise replication of a published study is rarely performed, and typically different studies will use different experimental designs and statistical inferential techniques to address the question. Although this can make comparisons between published studies difficult, agreement in the overall conclusions under such circumstances can be considered strong evidence for the named association, though more subtle or complex relationships may potentially be missed. A significant treatment effect identified across studies through meta-analysis is typically considered to be robust evidence for an association, and also allows the magnitude of the effect size to be more precisely estimated than in single studies considered in isolation (Borenstein et al., 2009). However, meta-analysis also has limitations, for example when few studies have been published in an area, when they differ substantially, or when the inferential analysis used is inappropriate for the design.
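The precision gain from pooling can be sketched with inverse-variance (fixed-effect) pooling, the simplest meta-analytic estimator; all effect estimates and standard errors below are invented purely for illustration:

```python
import math

def fixed_effect_meta(effects, ses):
    """Inverse-variance (fixed-effect) pooling of study-level estimates.

    Each study is weighted by 1/SE^2, so more precise studies count for
    more, and the pooled standard error is smaller than any single SE.
    """
    weights = [1 / se ** 2 for se in ses]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se

# Three hypothetical studies estimating the same treatment effect
est, se = fixed_effect_meta([0.30, 0.45, 0.25], [0.10, 0.15, 0.12])
print(est, se)  # pooled estimate ~0.31 with SE ~0.07, tighter than any single study
```

A random-effects variant would additionally estimate between-study heterogeneity, which is precisely where the limitations mentioned above bite: with few or substantially different studies, that heterogeneity is poorly estimated and the pooled result can mislead.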
Within the field of animal welfare, many published results on a particular issue are mixed or conflicting, leading to somewhat mixed messages about what the most appropriate solution for an identified welfare hazard might be. This may be partly due to publication bias (e.g. Hopewell et al., 2009, Brown et al., 2017) and the drive for novelty, rather than further support for a set of hypotheses, in published research. However, the lack of agreement between studies may be due to other factors: the differences may reflect genuine differences between the studies, arising for reasons as yet unmeasured or unaccounted for. They may be due to the use of indicators that have not been thoroughly validated in all respects for the species in question (Cronbach and Meehl, 1955). Finally, the observed lack of agreement may be due to inappropriate statistical analysis, leading to the masking of true effects or the discovery of false positives.
Even when two studies ask a very similar research question with largely similar methodology, mixed results can emerge. A typical example of this can be found in studies that investigate causes of, and consequently solutions for, aggression in pigs. For example, Beattie et al. (1996) investigated whether an enrichment object or floor space had more influence on pig behaviour. Their analysis showed that the duration of harmful behaviour was significantly higher in less enriched pens, and that measured pig aggressive behaviours had no significant association with space allowance. By comparison, Turner et al. (2000) found that smaller space allowances were associated with more skin lesions and longer-lasting aggressive events. These studies were similar in a number of respects, except that Turner et al. (2000) regularly adjusted pen sizes to maintain a consistent stocking density (weight per m2) throughout the experiment, whereas Beattie et al. (1996) maintained pen dimensions (hence stocking density would increase throughout the study). Consequently, the two studies are incomparable with conventional meta-analytic approaches. Variation in the indicators used could also potentially explain differences in model outcomes. For example, different indicators of injuries in pigs result in differences in the final conclusion, even if the studies use otherwise similar experimental designs and methods for inferential analysis. In relation to the provision of straw for pigs, different indicators of aggression have led to different conclusions; for example, Lahrmann et al. (2015) found reduced shoulder injuries for straw-housed pigs, whereas Morgan et al. (1998) found that straw-housed pigs performed more aggressive interactions, and Statham et al. (2011) and Arey and Franklin (1995) have both reported no significant effect of the provision of straw on outbreaks of aggression. Aggression can, and indeed has, been described and measured using a wide variety of indicators.
Examples of indicators for aggression are: duration of fights and number of bites (Andersen et al., 2000); prevalence of giving/receiving belly nosing, mounting, ear and tail biting, and biting the pen bars, chains or other pen details (Brunberg et al., 2011); the ratio of aggressive events to social interactions (Drickamer et al., 1999); and skin lesions on different body areas (Desire et al., 2016). Frequently, there is little or no overlap in indicators between studies, and little construct validation to demonstrate that the indicators recorded measure what they are proposed to measure (e.g. tail biting has been considered an indicator of aggression; however, this has been reconsidered in more recent years, e.g. Taylor et al., 2010).
Here we used a study investigating aggression in pigs to compare two assessments of skin injuries (believed to be indicative of aggression in pigs), an ear score and a composite body score (Conte et al., 2012), and the effects of analysing the data via four inferential methods: (i) generalised linear models; (ii) repeated measures analysis; (iii) linear mixed effect models; and (iv) linear mixed effect models for repeated measures. We compare the significant associations between the two injury assessments and the covariates detected via exploratory analysis and each of the four inferential methods. These four approaches were chosen because, to varying degrees, they could account for some of the features of the data, and their parameters could be directly interpreted.
Methods (i)–(iii) were considered sub-optimal relative to (iv), as these models were unable to account for the correlation between repeated measures and/or the random effects arising from the hierarchical structure in the data (pens within replication). We hypothesised that failing to account for these random effects and correlations would introduce spurious relationships and/or mask genuine relationships between our injury assessments and the covariates. Specifically, by ignoring the random effects, we hypothesised there would be more statistically significant associations with environmental factors, and by ignoring the repeated measurements, we hypothesised the association between injury score and the time covariate would appear more complex.
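The contrast between a model that ignores pens and one that models them can be sketched on simulated data. The example below uses Python's statsmodels purely for illustration (the original analyses were not necessarily run in this software), and all variable names and simulated values are hypothetical; it fits an ordinary linear model and a linear mixed model with a pen-level random intercept:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
# Simulate 8 pens of 18 pigs each, with a shared pen-level random effect
pens = np.repeat(np.arange(8), 18)
pen_effect = rng.normal(0, 1.0, 8)[pens]
treat = (pens % 2).astype(float)  # hypothetical treatment applied per pen
score = 2.0 + 0.5 * treat + pen_effect + rng.normal(0, 1.0, len(pens))
df = pd.DataFrame({"score": score, "treat": treat, "pen": pens})

# Simple linear model: treats all 144 pigs as independent observations
ols = smf.ols("score ~ treat", df).fit()
# Linear mixed model: pen modelled as a random intercept
lmm = smf.mixedlm("score ~ treat", df, groups=df["pen"]).fit()

# The naive model typically reports a misleadingly small standard error
# for a pen-level treatment effect, inflating the risk of false positives.
print(ols.bse["treat"], lmm.bse["treat"])
```

Because treatment varies at the pen level, the effective sample size is closer to the number of pens than the number of pigs; the mixed model attributes the shared pen variation to the random intercept and returns a larger, more honest standard error, which is the mechanism behind the spurious associations hypothesised above.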
We investigated the effects of sample size within multilevel designs by analysing the data from different replications (n = 18 pigs * 4 pens per replicate) as separate studies, and comparing the coefficient estimates from each of these analyses. A reduced sample size leads to a decrease in power, making it more difficult to identify the environmental factors associated with the injury scores. We therefore hypothesised that, with a reduced sample size, there would be fewer statistically significant associations between injury scores and environmental factors.
Animals and housing
The study was conducted at the Agri-Food and Biosciences Institute, Hillsborough, County Down, Northern Ireland. The study used commercial crossbreed PIC 337 (Large White x Landrace) pigs. Pigs received a commercial weaner diet ad libitum and water was always available, according to the standard practices on the farm.
Each pig was weighed at four weeks and again at ten weeks old. The pigs’ sex and weights at four weeks of age were used by the stockman to balance the groups to achieve a
Results
For 862 individual pigs we had a measurement for at least one of the injury assessments. For body score, there were two pigs with missing data for the first observation, seven for the second and nine for the third. For ear score, there were three pigs with missing data for the first observation, seven for the second and ten for the third.
Discussion
Comparing models where each incorporated different aspects of the study design demonstrated how important using the most appropriate inferential analysis is when producing valid results. By appropriately accounting for all sources of variation within the multilevel structure of the data (i.e. pens within replications) and considering the potential time-dependent correlation between observations, we increased the likelihood of identifying the true associations between the covariates and injury
Ethics statement
All procedures described were approved by the University of Lincoln's Ethics Committee on 8/9/2015, code COSREC62. This research was conducted at the Agri-Food and Biosciences Institute, Northern Ireland and conformed to the Association for the Study of Animal Behaviour's guidelines on the use of animals in research: http://asab.nottingham.ac.uk/ethics/guidelines.php.
Funding statement
This work was supported by the BBSRC (grants BB/K002554/1 and BB/K002554/2). MF was supported by a Department for Employment and Learning Northern Ireland studentship and Queen’s University Belfast.
Conflict of interest
None.
Acknowledgement
The authors would like to thank AFBI for the use of the experimental room and for the care of the animals.
References (45)
- Andersen et al., 2000. The effects of weight asymmetry and resource distribution on aggression in groups of unacquainted pigs. Appl. Anim. Behav. Sci.
- Arey and Franklin, 1995. Effects of straw and unfamiliarity on fighting between newly mixed growing pigs. Appl. Anim. Behav. Sci.
- Time course for the formation and disruption of social organisation in group-housed sows. Appl. Anim. Behav. Sci., 1999.
- Effects of food and time of day on aggression when grouping unfamiliar adult pigs. Appl. Anim. Behav. Sci., 1994.
- Beattie et al., 1996. An investigation of the effect of environmental enrichment and space allowance on the behaviour and production of growing pigs. Appl. Anim. Behav. Sci.
- Brunberg et al., 2011. Tail biting in fattening pigs: associations between frequency of tail biting and other abnormal behaviours. Appl. Anim. Behav. Sci.
- Desire et al., 2016. Prediction of reduction in aggressive behaviour of growing pigs using skin lesion traits as selection criteria. Animal.
- Drickamer et al., 1999. Predictors of social dominance and aggression in gilts. Appl. Anim. Behav. Sci.
- An introduction to ROC analysis. Pattern Recognit. Lett., 2006.
- Effect of straw on the behaviour of growing pigs. Appl. Anim. Behav. Sci., 1991.
- The effect of long or chopped straw on pig behaviour. Animal.
- The effects of straw bedding on the feeding and social behaviour of growing pigs fed by means of single-space feeders. Appl. Anim. Behav. Sci.
- Experience of moderate bedding affects behaviour of growing pigs. Appl. Anim. Behav. Sci.
- A longitudinal study of the effects of providing straw at different stages of life on tail-biting and other behaviour in commercially housed pigs. Appl. Anim. Behav. Sci.
- Taylor et al., 2010. Tail-biting: a new perspective. Vet. J.
- Turner et al., 2000. The effect of space allowance on performance, aggression and immune competence of growing pigs housed on straw deep-litter at different group sizes. Livest. Prod. Sci.
- Lme4: Linear Mixed-Effects Models Using Eigen and S4.
- Biau et al., 2008. Statistics in brief: the importance of sample size in the planning and interpretation of medical research. Clin. Orthop. Relat. Res.
- Statistical methods for assessing agreement between two methods of clinical measurement. Lancet.
- Borenstein et al., 2009. Introduction to Meta-Analysis.
- Box and Draper, 1987. Empirical Model-building and Response Surfaces.
- Brown et al., 2017. Publication bias in science: what is it, why is it problematic, and how can it be addressed?
Cited by (2)
- A case study on animal behavior analysis using GAMLSS. Revista Brasileira de Biometria, 2021.
- Factors influencing individual variation in farm animal cognition and how to account for these statistically. Frontiers in Veterinary Science, 2018.