Applied Animal Behaviour Science

Volume 197, December 2017, Pages 101-111

A comparison of inferential analysis methods for multilevel studies: Implications for drawing conclusions in animal welfare science

https://doi.org/10.1016/j.applanim.2017.08.002

Highlights

  • Inappropriate choices of statistical model lead to misleading associations between changes in environment and behaviour.

  • Body injury scores were higher in the enriched environment, but ear injury scores were higher in the less enriched environments.

  • Different risk factors suggest injuries to the body and ears occur as a consequence of differently motivated underlying behaviours.

Abstract

Investigations comparing the behaviour and welfare of animals in different environments have led to mixed and often conflicting results. These could arise from genuine differences in welfare, poor validity of indicators, low statistical power, publication bias, or inappropriate statistical analysis. Our aim was to investigate how four approaches to inferential analysis, applied to datasets of varying size, affect model outcomes and potential conclusions. We considered aggression in 864 growing pigs over six weeks, as measured by ear and body injury scores, and its relationships with less and more enriched environments, pigs' relative weight, and sex. Pigs were housed in groups of 18 in one of four pens, and the experiment was replicated 12 times. We applied four inferential models that either used a summary statistic approach or accounted, fully or partially, for complexities in the study design. We tested the models using both the full dataset (n = 864) and small subsets (n = 72).

The most appropriate inferential model for comparing ear and body injury scores was a mixed effects, repeated measures model. Statistical models that did not account for the correlation between repeated measures and/or the random effects from replications and pens led to spurious associations between environmental factors and indicators of aggression, which were not supported by the initial exploratory analysis. For analyses on smaller datasets (n = 72), given the effect size and the number of independent factors, there was insufficient power to detect statistically significant associations.

Based on the mixed effects, repeated measures models, higher body injury scores were associated with more enrichment (coef. est. = 0.09, p = 0.02); weight (coef. est. = 0.05, p < 0.001); pen location on the right side (coef. est. = 0.08, p = 0.03) and at the front of the experimental room (coef. est. = 0.11, p = 0.003). By comparison, lower ear injury scores were associated with more enrichment (coef. est. = −0.51, p = 0.005) and pen location at the front of the experimental room (coef. est. = −0.4, p = 0.02). These observed differences support the hypothesis that injuries to the body and ears arise from different risk factors. Although calculation of the minimum required sample size prior to conducting an experiment and selection of the inferential analysis method will contribute to the validity of the study results, conflict between the outcomes will require further investigation via different methods such as sensitivity and specificity analysis.

Introduction

The statistician George Box stated that “all models are wrong, but some are useful” (Box and Draper, 1987), which raises the question: how do we determine which statistical model, or inferential analysis method, is most appropriate? In recent years, a spotlight has been directed at the transparency of animal research methodology, with low rates of methodological reporting being associated with less scientific rigour and lower reproducibility (Vogt et al., 2016, Ioannidis et al., 2009, Kilkenny et al., 2009). Articles pertaining to animal research have been criticised in the past for their design, statistical analysis and reporting (McCance, 1995, Kilkenny et al., 2009, Sargeant et al., 2010). The publication of a set of guidelines for animal research, known as the ARRIVE guidelines (Kilkenny et al., 2010), has helped to improve the quality of animal research (Gulin et al., 2015). These guidelines highlight the importance of choosing appropriate experimental assessments, sample sizes and inferential analysis methods. It is important to ensure that the sample size is sufficient to test the study hypothesis, while also bearing in mind the ethical and financial implications of using an unnecessarily large sample size within an experiment. There is a plethora of techniques for producing sample size estimates, and the appropriate technique will depend on the inferential analysis used for a study. Sample sizes can be difficult to calculate for more complex designs, though the importance of conducting these calculations accurately has been well communicated, particularly in the clinical trials literature (Freiman et al., 1978, Biau et al., 2008).
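As an illustration of the simplest of such techniques, the standard normal-approximation formula for the per-group sample size of a two-sample comparison of means can be computed in a few lines. This is a minimal sketch, not taken from the study; the function name and the example effect size are ours:

```python
import math
from statistics import NormalDist

def n_per_group(effect_size: float, alpha: float = 0.05,
                power: float = 0.8) -> int:
    """Per-group n for a two-sample comparison of means, normal
    approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    where d is the standardised effect size (Cohen's d)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# A medium standardised effect (d = 0.5) at alpha = 0.05 and 80% power:
print(n_per_group(0.5))  # 63 per group (a t-based calculation gives ~64)
```

For the multilevel designs discussed below, such a formula gives only a lower bound: clustering of animals within pens inflates the required sample size by the design effect.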

Discussion in this area naturally leads into consideration of the methodology of the statistical analysis conducted on the collected data. Many of the papers focussing on the quality of research using animals have primarily targeted experimental design, animal numbers, and reporting, but have not discussed the appropriate analysis of what can often be complex datasets. Precise replication of a published study is rarely performed, and different studies will typically use different experimental designs and statistical inferential techniques to address the same question. Although this can make comparisons between published studies difficult, agreement in the overall conclusions under such circumstances can be considered strong evidence for the named association, though more subtle or complex relationships may be missed. A significant treatment effect identified across studies through meta-analysis is typically considered robust evidence for an association, and also allows the magnitude of the effect size to be estimated more precisely than in single studies considered in isolation (Borenstein et al., 2009). However, meta-analysis also has limitations, for example when few studies have been published in an area, when they differ substantially, or when the inferential analysis used is inappropriate for the design.

Within the field of animal welfare, many published results on a particular issue are mixed or conflicting, leading to mixed messages about what the most appropriate solution for an identified welfare hazard might be. To some extent, this may be due to publication bias (e.g. Hopewell et al., 2009, Brown et al., 2017) and the drive for novelty, rather than further support for a set of hypotheses, in published research. However, the lack of agreement between studies may be due to other factors: the differences may reflect genuine differences between the studies, arising for reasons as yet unmeasured or unaccounted for. They may be due to the use of indicators that have not been thoroughly validated in all respects for the species in question (Cronbach and Meehl, 1955). Finally, the observed lack of agreement may be due to inappropriate statistical analysis, leading to the masking of true effects or the discovery of false positives.

Even when two studies ask a very similar research question with largely similar methodology, mixed results can emerge. A typical example can be found in studies that investigate causes of, and consequently solutions for, aggression in pigs. For example, Beattie et al. (1996) investigated whether an enrichment object or floor space had more influence on pig behaviour. Their analysis showed that the duration of harmful behaviour was significantly higher in less enriched pens, while measured aggressive behaviours had no significant association with space allowance. By comparison, Turner et al. (2000) found that smaller space allowances were associated with more skin lesions and longer-lasting aggressive events. These studies were similar in a number of respects, except that Turner et al. (2000) regularly adjusted pen sizes to maintain a consistent stocking density (weight per m2) throughout the experiment, whereas Beattie et al. (1996) maintained fixed pen dimensions (hence stocking density would increase throughout the study). Consequently, the two studies are incomparable with conventional meta-analytic approaches. Variation in the indicators used could also potentially explain differences in model outcomes. For example, different indicators of injuries in pigs can lead to different final conclusions, even if the studies use otherwise similar experimental designs and methods for inferential analysis. In relation to the provision of straw for pigs, different indicators of aggression have led to different conclusions: Lahrmann et al. (2015) found reduced shoulder injuries for straw-housed pigs, whereas Morgan et al. (1998) found that straw-housed pigs performed more aggressive interactions, and Statham et al. (2011) and Arey and Franklin (1995) both reported no significant effect of the provision of straw on outbreaks of aggression. Aggression can be, and indeed has been, described and measured using a wide variety of indicators.
Examples of indicators of aggression include: duration of fights and number of bites (Andersen et al., 2000); prevalence of giving/receiving belly nosing, mounting, ear and tail biting, and biting the pen bars, chains or other pen details (Brunberg et al., 2011); the ratio of aggressive events to social interactions (Drickamer et al., 1999); and skin lesions on different body areas (Desire et al., 2016). Frequently, there is little or no overlap between studies, or construct validation to demonstrate that the indicators recorded measure what they are proposed to measure (e.g. tail biting has been considered an indicator of aggression, though this has been reconsidered in more recent years, e.g. Taylor et al., 2010).

Here we used a study investigating aggression in pigs to compare two assessments of skin injuries (believed to be indicative of aggression in pigs), an ear score and a composite body score (Conte et al., 2012), and the effects of analysing the data via four inferential methods: (i) generalised linear models; (ii) repeated measures analysis; (iii) linear mixed effects models; and (iv) linear mixed effects models for repeated measures. We compare the significant associations between the two injury assessments and the covariates as detected via exploratory analysis and each of the four inferential methods. These four approaches were chosen because, to varying degrees, they could account for features of the data while their parameters could be directly interpreted.

Methods (i)–(iii) were considered sub-optimal relative to (iv), as these models were unable to account for correlation in the repeated measures and/or the random effects arising from the hierarchical structure in the data (pens within replications). We hypothesised that failing to account for the random effects of pens within replications and for correlation between repeated measures would either introduce spurious relationships and/or mask genuine significant relationships between our injury assessments and the covariates. Specifically, by ignoring the random effects, we hypothesised there would be more statistically significant associations with environmental factors, and by ignoring the repeated measurements, we hypothesised the association between injury score and the time covariate would appear more complex.
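The first of these hypotheses can be illustrated with a small null simulation. This is a hedged sketch in stdlib Python with assumed variance components, not the authors' analysis: pigs share a pen-level random effect, there is no true treatment effect, yet a test that treats the pigs within an arm as independent rejects far more often than the nominal 5%:

```python
import random
from statistics import NormalDist, mean, stdev

def two_sample_p(a, b):
    """Two-sided p-value for a difference in means (normal approximation,
    adequate here with 36 observations per arm)."""
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_sims=2000, pens_per_arm=2, pigs_per_pen=18,
                        pen_sd=0.5, pig_sd=1.0, seed=42):
    """Simulate null experiments (no treatment effect) with a shared
    pen-level random effect; analyse pigs as if they were independent."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        arms = []
        for _arm in range(2):
            pigs = []
            for _pen in range(pens_per_arm):
                u = rng.gauss(0, pen_sd)  # effect shared by all pigs in this pen
                pigs += [u + rng.gauss(0, pig_sd) for _ in range(pigs_per_pen)]
            arms.append(pigs)
        if two_sample_p(arms[0], arms[1]) < 0.05:
            hits += 1
    return hits / n_sims

# Nominal Type I error is 0.05; ignoring the pen level inflates it severely.
print(false_positive_rate())
```

A model with a random intercept per pen (and per replication), as in method (iv), restores the nominal error rate by attributing the shared pen variation to the correct level of the hierarchy.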

We investigated the effects of sample size within multilevel designs by analysing the data from each replication (n = 18 pigs × 4 pens per replicate) as a separate study and comparing the coefficient estimates from each of these analyses. A reduced sample size decreases statistical power, making it more difficult to identify the environmental factors associated with the injury scores. We therefore hypothesised that, with a reduced sample size, there would be fewer statistically significant associations between injury scores and environmental factors.
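The expected loss of power can be sketched with the same normal approximation used for sample-size formulas. This is an illustrative fragment, not the study's calculation: the effect size of 0.3 and the even 36/36 split per replicate are our assumptions, and clustering within pens would reduce the effective sample size further:

```python
from statistics import NormalDist

def power_two_sample(effect_size: float, n_per_group: int,
                     alpha: float = 0.05) -> float:
    """Approximate power of a two-sample comparison of means for a
    standardised effect size (normal approximation; observations
    treated as independent)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    ncp = effect_size * (n_per_group / 2) ** 0.5  # noncentrality parameter
    return z.cdf(ncp - z_alpha)

# One replicate (36 pigs per arm) versus the full study (432 per arm),
# for an assumed small-to-moderate standardised effect of 0.3:
print(round(power_two_sample(0.3, 36), 2))   # 0.25
print(round(power_two_sample(0.3, 432), 2))  # 0.99
```

Under these assumptions a single replicate would detect such an effect only about a quarter of the time, consistent with the hypothesis of fewer significant associations at n = 72.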

Section snippets

Animals and housing

The study was conducted at the Agri-Food and Biosciences Institute, Hillsborough, County Down, Northern Ireland. The study used commercial crossbreed PIC 337 (Large White x Landrace) pigs. Pigs received a commercial weaner diet ad libitum and water was always available, according to the standard practices on the farm.

Each pig was weighed at four weeks of age and again at ten weeks. The pigs’ sex and weights at 4 weeks of age were used by the stockman to balance the groups to achieve a

Results

For 862 individual pigs we had a measurement for at least one of the injury assessments. For body score there were two pigs with missing data for the first observation, seven pigs with missing data for the second observation and nine pigs with missing data for the third observation. For ear score there were three pigs with missing data for the first observation, seven pigs with missing data for the second observation and ten pigs with missing data for the third observation.

Discussion

Comparing models that each incorporated different aspects of the study design demonstrated the importance of using the most appropriate inferential analysis to produce valid results. By appropriately accounting for all sources of variation within the multilevel structure of the data (i.e. pens within replications) and considering the potential time-dependent correlation between observations, we increased the likelihood of identifying the true associations between the covariates and injury

Ethics statement

All procedures described were approved by the University of Lincoln's Ethics Committee on 8/9/2015, code COSREC62. This research was conducted at the Agri-Food and Biosciences Institute, Northern Ireland and conformed to the Association for the Study of Animal Behaviour's guidelines on the use of animals in research: http://asab.nottingham.ac.uk/ethics/guidelines.php.

Funding statement

This work was supported by the BBSRC (grants BB/K002554/1 and BB/K002554/2). MF was supported by a Department for Employment and Learning Northern Ireland studentship and Queen’s University Belfast.

Conflict of interest

None.

Acknowledgement

The authors would like to thank AFBI for use of experimental room and care of the animals.

References (45)

  • H.P. Lahrmann et al., The effect of long or chopped straw on pig behaviour, Animal (2015)

  • C.A. Morgan et al., The effects of straw bedding on the feeding and social behaviour of growing pigs fed by means of single-space feeders, Appl. Anim. Behav. Sci. (1998)

  • C. Munsterhjelm et al., Experience of moderate bedding affects behaviour of growing pigs, Appl. Anim. Behav. Sci. (2009)

  • P. Statham et al., A longitudinal study of the effects of providing straw at different stages of life on tail-biting and other behaviour in commercially housed pigs, Appl. Anim. Behav. Sci. (2011)

  • N.R. Taylor et al., Tail-biting: a new perspective, Vet. J. (2010)

  • S.P. Turner et al., The effect of space allowance on performance, aggression and immune competence of growing pigs housed on straw deep-litter at different group sizes, Livest. Prod. Sci. (2000)

  • D. Bates et al., lme4: Linear Mixed-Effects Models Using Eigen and S4 (2015)

  • D.J. Biau et al., Statistics in brief: the importance of sample size in the planning and interpretation of medical research, Clin. Orthop. Relat. Res. (2008)

  • J.M. Bland et al., Statistical methods for assessing agreement between two methods of clinical measurement, Lancet (1986)

  • M. Borenstein et al., Introduction to Meta-Analysis (2009)

  • G.E. Box et al., Empirical Model-building and Response Surfaces (1987)

  • A.W. Brown et al., Publication bias in science: what is it, why is it problematic, and how can it be addressed? (2017)