Animal-Assisted Therapy for Youth: A Systematic Methodological Critique

Animal-Assisted Therapy (AAT) for youth has the potential to benefit both physical and mental health outcomes. Yet little is known about the extent to which study designs in this area align with established standards of intervention research. This critical review assesses the research methodologies of current studies of AATs for youth with physical and mental health concerns. The main aims of this review are to advance the knowledge base of empirically supported treatments and to identify next steps that researchers can take to secure the place of AATs as sound and valid interventions for youth.


Introduction
Although anecdotal support for Animal-Assisted Therapy (AAT) for youth is widespread and increasing among some practitioners, it is critical that AAT have clear empirical support to be accepted in the larger fields of medicine and psychology (Kazdin & Weisz, 1998). Empirical evidence provides clinicians with a needed understanding of expected outcomes and a basis for comparison with other treatment options. Aligning AAT research with the standards for empirically supported treatments is not a new goal in the field (Kazdin, 2010); however, to our knowledge there are no reviews of the literature on AAT for youth that systematically examine the quality of its research methodology and the empirical support for these interventions.
Empirically supported treatments can be tested using a prescribed set of principles, and these have been applied generally to AAT (Kazdin, 2010). However, current reports of these interventions typically have small, unrepresentative samples and no control group (Krueger & Serpell, 2010; Herzog, 2011). A recent review of randomized controlled AAT trials revealed only eight studies that met the criteria for inclusion, demonstrating the lack of rigorous methods used in this area. These authors were unable to draw strong conclusions regarding the efficacy of AATs due to this small sample size (Maujean, Pepping, & Kendall, 2015). The prescribed principles for empirically evaluating treatments include rigorous assessment of design, procedure, assessment, and outcome analysis (Kazdin, 2010; Kendall, Holmbeck, & Verduin, 2004).
Evaluation of study design involves assessment of the theoretical basis of the work. Specifically, well-designed studies present a rationale for the treatments given to participants and the diagnoses to be addressed, along with methodological considerations such as the inclusion of appropriate control groups to rule out alternative hypotheses, random assignment to treatment conditions, and the timing of post-treatment follow-up assessments to measure the longevity of treatment effects, if they exist. An evaluation of the study procedure focuses on the replicability of the study design. That is, the procedure considers the participant sample, including descriptive information about it; a clear description of the treatment; and attempts at documenting and ensuring fidelity of treatment delivery across time. Evaluation of the assessment procedure examines the validity, reliability, and appropriateness of the measures for the population and constructs of interest. Finally, evaluation of study outcome analysis is concerned with the appropriateness of data analysis methods for outcome data and with factors within the treatment that contribute to the outcomes (e.g., testing of mediators/moderators). This prescribed approach to evaluating therapies ensures that studies are judged on their scientific rigor. This review explores these characteristics and processes in the extant literature on AAT for youth. These aspects of study methodology are evaluated with the aim of highlighting methodological strengths and weaknesses and offering directions for researchers who would like to pursue high-quality AAT research in youth.

Study Selection
Studies were included if they investigated the extent to which AATs had measurable benefits for youth, ages 0 to 18 years. The definition of AAT used to identify studies was a therapeutic interaction that utilizes the human-animal bond in goal-directed interventions as an integral part of the treatment process (Pet Partners, 2012). Studies published prior to December 31, 2014, were located using two methods. First, electronic searches of several databases were conducted (Medline, PsycINFO, Google Scholar) using a combination of search terms: animal-assisted therapy, AAT, animal interventions, youth, children, adolescents, dog, horse, dolphin, equine-assisted therapy, and hippotherapy. Second, references cited in relevant publications were explored to identify additional relevant articles. Studies that included unstructured time with an animal or reported only anecdotal or qualitative results were excluded. Physical, physiological, psychological, and social outcome measures were included. Initial searches yielded approximately 4,500 studies, which were reviewed further to determine whether they met the review criteria. When it was unclear whether a study met the criteria, at least two of the current review's authors examined the methodology to determine whether the inclusion criteria were satisfied. Forty-five studies met these criteria and were included in the quality coding procedures described below. See Table 1 for a description of the 45 studies included in this review.

Quality Coding
Each study included in the review was coded for scientific rigor. The coding system was developed by the first, third, and fourth authors based upon the standards of quality intervention research outlined by Boutron, Moher, Altman, Schulz, and Ravaud (2008), Kendall, Holmbeck, and Verduin (2004), and Kazdin (2010). These guidelines were selected because they address behavioral intervention standards and the Consolidated Standards of Reporting Trials (CONSORT) standards for randomized controlled trials, the latter of which are considered the gold standard in intervention research. Quality ratings were made in four areas: design, procedure, assessment, and outcome analysis.
To evaluate the design of each study, raters examined (1) the theoretical basis provided for the study, (2) the use of a control or comparison condition, and (3) the use of random assignment to condition and multiple assessments. To evaluate the procedure of each study, raters examined (4) the description of the sample, (5) the description of the intervention, and (6) the assessment of treatment fidelity. The rigor of assessments was evaluated in terms of (7) multiple methods of assessment and (8) the use of reliable measures. Finally, the outcome analysis was evaluated in terms of (9) appropriate analytical approaches and (10) testing of mediators and moderators of treatment outcomes. Thus, 10 characteristics were rated in each study.
The coders were two advanced clinical psychology doctoral students. Both raters coded twenty-five percent of the studies to assess inter-rater agreement. Studies were coded on a 0-2 scale in each category. (See Table 2 for the scoring criteria for each category.) In one case, there was a significant discrepancy between coders (e.g., one rater gave a code of 0 and the other gave a code of 2). The coders discussed the specific aspect of the study in question with the third author, and a rule was agreed upon to code the characteristic. A coding discrepancy over the description of treatments was handled in the same way; see further discussion below. The average inter-rater reliability was excellent (weighted kappa = .97).
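Weighted kappa down-weights near-misses on an ordinal scale, which suits the 0-2 codes used here better than simple percent agreement. The sketch below is a minimal, stdlib-only implementation of quadratic-weighted Cohen's kappa; the rater data are illustrative, not the review's actual codes:

```python
def weighted_kappa(rater_a, rater_b, categories=(0, 1, 2)):
    """Quadratic-weighted Cohen's kappa for two raters on an ordinal scale."""
    n = len(rater_a)
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    # Observed agreement matrix as proportions
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        obs[idx[a]][idx[b]] += 1.0 / n
    # Marginal distributions for each rater
    pa = [sum(obs[i][j] for j in range(k)) for i in range(k)]
    pb = [sum(obs[i][j] for i in range(k)) for j in range(k)]
    # Quadratic disagreement weights: 0 on the diagonal, 1 at maximum distance
    w = [[((i - j) ** 2) / ((k - 1) ** 2) for j in range(k)] for i in range(k)]
    observed = sum(w[i][j] * obs[i][j] for i in range(k) for j in range(k))
    expected = sum(w[i][j] * pa[i] * pb[j] for i in range(k) for j in range(k))
    return 1.0 - observed / expected

# Illustrative 0-2 codes from two hypothetical raters (one near-miss)
kappa = weighted_kappa([0, 1, 2, 2, 1, 0, 2, 1], [0, 1, 2, 2, 1, 0, 2, 2])
```

Perfect agreement yields 1.0, and a single one-step disagreement is penalized only lightly, which is why weighted kappa is the conventional choice for ordinal coding schemes like this one.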

Results and Discussion
The coded data were analyzed by category using SPSS version 22.0 to calculate means, standard deviations, the percentage of studies coded at each quality level, and the correlations of the categories with year of publication. The results suggest that researchers are adhering to a number of optimal principles of empirically supported treatments. The overall quality average of the 45 included studies was 1.04 out of a perfect score of 2.0. The overall quality average was also moderately and positively correlated with year of publication (r = .33, p < .05), which indicates that researchers are applying more rigorous designs in tests of AATs than in the past. However, there was variability across studies, indicating that specific elements of research methodology can be incorporated in the future to further strengthen the AAT literature and its impact. As described below, researchers can improve overall methodological quality by making several choices regarding the design, procedure, assessment, analysis plan, and conceptual framework, as well as by describing their choices clearly in their manuscripts.
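The trend statistics reported throughout this review are ordinary Pearson correlations between a quality code and publication year. A minimal sketch of that computation (the year/quality values below are illustrative, not the review's data):

```python
def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Illustrative: quality codes loosely rising with publication year
years = [1998, 2002, 2006, 2010, 2014]
quality = [0.6, 0.9, 0.8, 1.2, 1.4]
r = pearson_r(years, quality)  # positive r indicates improving rigor over time
```

A positive r with p < .05, as reported for the overall quality average, is the basis for the claim that methodological rigor has improved over time.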

Design
As shown in Table 3, researchers consistently provided theoretical foundations for the use of AATs in the introductions of their papers (M = 1.76). Of the studies reviewed, 76.1% provided clear explanations of a guiding theory or conceptual framework in the introduction of the paper. Having a theoretical foundation for the research was not significantly correlated with year of publication (r = .26, p > .05). Theoretical frameworks provide a solid conceptual basis for using and developing AATs. Guiding frameworks also help educate researchers, clinicians, and policy makers with little AAT experience about the value of incorporating AAT in their work. Some of the studies reviewed in this manuscript that included conceptual models focused on attachment theory (Bachi, Terkel, & Teichman, 2014; Balluerka, Muela, Amiano, & Caldentey, 2014), social learning theory (Schuck, Emmerson, Fine, & Lakes, 2013), Rogerian principles of unconditional positive regard (Kemp, Signal, Botros, Taylor, & Prentice, 2014), or theories of musculature development (Bertoti, 1988; Giagazoglou et al., 2012; Thompson, Ketcham, & Hall, 2014). Continued integration of existing models from the social sciences and other relevant fields into AAT research will help propel the field forward and provide a stronger rationale for the utilization of AATs for specific populations.
In contrast to theoretical foundations, the inclusion of comparison groups is not as common in studies of AATs for youth and has been noted as a significant threat to construct validity in AAT research in general (Marino, 2012). The majority of studies did not include a control or comparison condition (56.5%), whereas 21.7% of studies used a comparison that solely accounted for the passage of time (i.e., waitlist or treatment as usual). Control groups allow researchers to test the extent to which a target intervention is truly efficacious, while accounting for competing hypotheses such as the passage of time or the attention youth receive in treatment. It is critical that AAT research studies include comparison groups, not only for scientific purposes but also to demonstrate the value of AATs to clinicians, consumers, and policy makers. The fact that another 21.7% of the studies utilized strong control groups to account for key components of the treatment approach (e.g., unstructured time with an animal, educational components without animal contact) is a good sign, but more studies are needed that include this methodological component. One particularly strong control condition is demonstrated by Dietz, Davis, and Pennings' (2012) use of a dismantling strategy to create two control groups that accounted for two separate key components of the treatment (i.e., therapeutic stories and therapy dogs). Some studies utilized within-subject designs (i.e., crossover or repeated-measures designs), which are a useful technique in clinical settings. However, those designs are less desirable for examining the efficacy of specific interventions that are expected to have persistent effects, like AATs.
Random assignment is another aspect of design that was generally lacking in the AAT for youth research. Randomly assigning eligible participants to groups minimizes the chance that systematic differences between treatment and control groups can account for group differences. By using random assignment, characteristics such as age, gender, and experience working with animals can be ruled out as alternative explanations of any observed effects. Without random assignment, treatment effects may be attributed to preexisting differences between groups. Only 13% of the studies reviewed used random assignment to place participants in treatment conditions. It should be noted that studies without a control group could not earn points in this category; therefore, of the 20 studies that had comparison groups, six (30%) used random assignment. The other studies used strategies for group assignment such as convenience sampling, predetermined groups, and healthy controls.
For researchers who have limited budgets or other constraints, there are alternatives to random assignment, including matching participants on key variables and/or statistically comparing groups on key variables. Although not ideal, this strategy does control for group differences on variables identified as particularly important in the intervention (e.g., gender, pre-treatment symptoms, previous experience with animals). Strength of control conditions was moderately positively correlated with year of publication (r = .48, p < .01), and the use of random assignment was also moderately positively correlated with year of publication (r = .30, p < .05). Overall, as interest in this field has grown over the past several years, research methodology for examining the efficacy of AATs has become more rigorous through stronger control conditions and the use of random assignment of participants to condition. However, more widespread use of these strong methods is needed.
Pre- and post-treatment assessments were conducted in 58.7% of the studies; however, only 28.3% of studies conducted follow-up assessments. The use of follow-up assessments has not increased as a function of time (r = .13, p > .05). It is understandable that follow-up assessments can be difficult due to practical barriers, including tracking participants and poor retention rates after treatment, as well as additional costs and resources. However, follow-up provides important information about the lasting effects of AATs beyond the end of treatment. The burden of follow-up studies can be diminished by assessing only key variables at follow-up to lessen the time strain on participants, reducing the interval of the follow-up (e.g., from one month to two weeks), and conducting phone or mail assessments. Stumpf and Breithenbach (2014) provided a good model of follow-up assessment by including a pre-test, a short-term post-test (four weeks), and a longer-term post-test (six months).
Overall, the rigor of designs was variable, depending upon which aspects of design were evaluated. Researchers consistently provided conceptual frameworks for AATs; however, considerably fewer studies included control groups, random assignment, and repeated assessments, including follow-up assessments. Nevertheless, the significant correlations between year of publication and both the inclusion of a control group and random assignment are positive indicators that more recent research studies are addressing some design flaws.

Procedure
Clear descriptions of procedures are essential for clinicians and researchers to be able to replicate studies, an important step in demonstrating that an intervention does not work solely in a single setting. Aspects of the procedure that were evaluated included descriptions of the sample, the intervention, and treatment fidelity.
Researchers consistently provided information about their study's sample (M = 1.91); 91.3% of the studies reviewed gave a clear description, including the age, gender, and clinical problems of their participants. A description of the sample allows consumers of AAT research to better interpret the study and determine the applicability of a particular AAT study to their own population of interest. For instance, Braun (2009) made good use of tables to report key descriptors of the sample (i.e., gender, age, pet at home, previous experience with AAT). Our ratings of the duration and number of sessions across these studies also demonstrate that AAT research has adequately described these features (M = 1.83). Of the studies reviewed, 87% provided a clear description of the length and number of sessions, and an additional 8.7% gave some description, often reporting the number of sessions participants received but excluding information about session length. Information about the duration and number of sessions serves as a measure of treatment dosage and allows readers to evaluate the amount of treatment required to obtain desired outcomes, as well as to determine the utility of a specific AAT in a particular setting. This can often be accomplished in a single sentence in the procedures section.
The description of the intervention is key to the replicability of a study. However, it can be challenging to determine the level of information required for clinical replicability. It was difficult for the current authors to agree on the level of detail needed in this area for it to be considered adequate, which proved to be a challenge in establishing inter-rater reliability when coding how well an intervention was described. Several published guidelines were referenced to create a code that accurately reflected the current standards of the field, including the CONSORT Transparent Reporting of Trials criteria (Boutron, Moher, Altman, Schulz, & Ravaud, 2008) and the American Psychological Association's Journal Article Reporting Standards (JARS; APA, 2008). In the final review process, studies were rated on the clear description of four key characteristics of AAT interventions [i.e., (1) tasks of the youth, (2) specific skills or processes targeted by the intervention, (3) techniques used by the therapist to teach the child, and (4) specific type of involvement with the animal] or on whether a study provided a statement about how the details of the intervention could be accessed (e.g., through a specific manual, website, or contacting the author directly). Most studies provided some level of detail regarding the intervention administered (M = 1.39), and 47.8% of studies provided either enough detail for clinical replication or access to more details. Including the details necessary for replication can be a challenge in publications due to manuscript length restrictions. AAT developers should consider copyrighting manuals and making them available free online, publishing them, or providing them at the request of interested readers. At a minimum, researchers can indicate that details of their intervention are available by contacting the corresponding author.
Finally, treatment fidelity was monitored informally (e.g., supervision during the intervention) by just 2.2% of studies. None of the studies reported monitoring treatment fidelity in a standardized fashion. Standardized fidelity monitoring could include in-vivo ratings of treatment compliance by observers or participant ratings of treatment at the end of each session. Monitoring the consistent application of AATs within a study provides evidence that each youth participant received the same quality of treatment. Without detailed descriptions of treatment procedures, it is nearly impossible to monitor fidelity; therefore, as the field continues to improve the standardization and description of AATs, treatment fidelity will be more easily monitored. Monitoring treatment fidelity has been a universal challenge in behavioral intervention research and continues to be an area that intervention researchers seek to improve (Gearing et al., 2011; Miller & Rollnick, 2014).
Taken together, AAT researchers described components of the intervention procedure to varying degrees. None of the procedural components were correlated with year of publication. The description of the sample and the dosage of treatment were clearly stated in the vast majority of studies reviewed. A description of the intervention in detail sufficient for clinical replication was included in nearly half of the studies. However, fidelity of treatment is an area that is significantly lacking in currently published AAT research. As noted above, there are a number of practical steps researchers could model after other intervention research to improve their descriptions of interventions and treatment fidelity.

Assessment and Measures
Aspects of the assessment and measures that were evaluated included multiple modalities of assessment and the reliability of measures. The vast majority of studies used standardized assessments to measure outcomes; however, there is room for improvement in the techniques employed. Of the studies reviewed, 73.9% used one method of assessment (e.g., self-report, observer ratings, or performance on tasks). Only 23.9% of studies included multiple methods of assessing outcomes (M = 1.22). The use of multiple methods has not significantly changed as a function of year of publication (r = .13, p > .05). The use of multiple methods of assessment provides a comprehensive picture of outcomes and combats the potential bias of a single method (e.g., self-report biases). For instance, Gabriels and colleagues (2012) measured parent-reported behavioral outcomes using questionnaire and interview data as well as standardized behavioral proficiency tests of motor skills, sensory processing, and praxis (the ability to organize, plan, and perform an action). This comprehensive assessment method provides convincing evidence to readers who are not familiar with AATs for youth that these AATs have benefits and should be adopted more widely.
The use of reliable measures for treatment outcomes had an average rating of 0.98 and was not significantly correlated with year of publication (r = -.02, p > .05). A total of 28.3% of studies reported strong reliability statistics for their own samples, and 41.3% of studies reported strong reliability statistics based on another sample or weak reliability for the current sample. However, 30.4% of studies did not report the reliability of the outcome measures at all. The results of a study are only as good as the measures used; therefore, it is essential to use reliable measures to assess treatment outcomes and to report the reliability estimates to confirm that the measures have this essential component of validity for the participants who were included. A particularly strong example of reporting reliability statistics is provided by Hamama and colleagues (2011), who reported Cronbach's alpha coefficients for each outcome measure at pre-test and post-test for their sample.
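Cronbach's alpha, the internal-consistency statistic recommended above, can be computed directly from the item variances and the variance of the total scores. A minimal sketch (the item scores below are illustrative, not data from any study reviewed here):

```python
def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    items: one list of scores per item, each indexed by respondent.
    """
    k = len(items)           # number of items in the scale
    n = len(items[0])        # number of respondents

    def var(xs):
        # Sample variance (n - 1 denominator)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    sum_item_vars = sum(var(item) for item in items)
    totals = [sum(item[i] for item in items) for i in range(n)]
    return (k / (k - 1)) * (1 - sum_item_vars / var(totals))

# Illustrative 3-item scale scored by 4 respondents
alpha = cronbach_alpha([[2, 4, 3, 5], [1, 5, 3, 4], [2, 4, 2, 5]])
```

Values of alpha near 1 indicate that the items move together (high internal consistency); reporting alpha for the study's own sample, as Hamama and colleagues did, shows the scale behaved reliably for the participants actually assessed.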
The selection of measures is often a key consideration in the development of treatment studies. Reporting reliability statistics for the participants of a study demonstrates the use of strong measures to the consumer and builds confidence in the results. Finding appropriate measures for use in AAT research may be difficult due to the relative nascence of the field. In cases where researchers are interested in a specific variable with no previously validated measures, it can be important to design and test measures of the key outcomes before using them to examine the effects of an AAT. This process is often time consuming, but it significantly strengthens the findings of AAT studies. Furthermore, validated measures are more likely to be useful to other AAT researchers, thus benefitting the continued progress of the field.

Outcome Analysis
Sound data analysis plans are essential to evaluations of AATs for youth. Aspects of each study's outcome analysis that were evaluated included the statistical approach, mediator analyses, and moderator analyses. Most researchers appropriately used at least one statistical strategy (M = 1.67). However, there was no significant correlation with year of publication (r = .09, p > .05), indicating that the use of appropriate analytic procedures has not increased over time. Only 13% of studies did not use statistical tests to draw conclusions about treatment outcomes. The use of statistical tests in examining AAT treatment outcomes strengthens the support for the conclusions study authors draw regarding treatment efficacy. However, mediators and moderators of treatment were tested much less frequently, in only 2.2% and 4.3% of studies respectively; neither was significantly correlated with publication year (r = .24, p > .05; r = .12, p > .05).
As the research base continues to grow, identifying mediators and moderators will allow researchers and consumers to better understand the mechanisms of treatment and who benefits most from various types of AATs. It should be noted that mediator and moderator analyses require more statistical power than analyses of group differences; therefore, studies with small sample sizes lack the ability to test for these variables. One exemplary study in terms of the examination of moderation is that of le Roux and colleagues (2014), who demonstrated that gender moderated the effects of an animal-assisted reading program, such that the intervention reduced the existing gender disparity in reading ability compared to control groups. For researchers not accustomed to statistical analysis, collaborations with professionals, including qualified graduate students, who have intervention-specific data analysis training could enhance the reporting of results through the use of statistical methods.
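At its simplest, a moderation question asks whether the treatment effect differs across subgroups. The sketch below illustrates that idea with a hypothetical subgroup comparison (the reading-gain scores and the gender split are invented for illustration, not data from le Roux et al.); in practice moderation is tested formally with a group-by-moderator interaction term in a regression model:

```python
def mean(xs):
    return sum(xs) / len(xs)

def treatment_effect(treated, control):
    """Raw mean difference between treatment and control groups."""
    return mean(treated) - mean(control)

# Hypothetical reading-gain scores, split by the candidate moderator (gender)
boys_tx, boys_ctrl = [8, 10, 9, 11], [4, 5, 3, 6]
girls_tx, girls_ctrl = [9, 10, 8, 11], [8, 9, 7, 10]

effect_boys = treatment_effect(boys_tx, boys_ctrl)
effect_girls = treatment_effect(girls_tx, girls_ctrl)
# A gap between subgroup effects (here, a larger gain for boys) is the
# descriptive signature of moderation; statistical confirmation requires an
# interaction test with adequate power.
```

This also makes the power concern concrete: each subgroup effect is estimated on only part of the sample, so detecting a difference between effects demands a larger sample than detecting the overall effect.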

Conclusions
AAT researchers and consumers have called for more rigorous scientific tests of AATs (Krueger & Serpell, 2010; Kazdin, 2010; Nimer & Lundahl, 2007; Herzog, 2011). This review of AAT for youth demonstrates that the quality of AAT research is variable and that some methodological components have improved over time (e.g., inclusion of a control group and use of random assignment to group). Researchers have much to gain from methodological improvements, including greater adoption of AATs by consumers and policy makers, funding for program evaluation and expansion to reach more youth, and confidence in the results of efficacy and effectiveness trials. Yet other methodological components have not enjoyed greater use over time (e.g., description of treatment, use of reliable measures, mediator and moderator analyses). Thus, the field can continue to improve by addressing methodological weaknesses. It is recommended that AAT researchers make efforts to include strong control groups and random assignment to build the case that AATs for youth can produce positive outcomes that are not due to characteristics of the sample or to other effects, such as the passage of time or the mere presence of an animal. It is also recommended that researchers provide more detailed treatment descriptions or make their manuals more accessible so that other research groups can attempt to replicate the findings, which would increase readers' confidence in the efficacy of the AAT outside the study. Similarly, such descriptions allow clinicians to apply scientifically sound practices in the community. Also, monitoring treatment fidelity helps researchers and clinicians alike understand why an AAT may or may not have worked as intended. Demonstrating good treatment fidelity also contributes to internal validity, or the confidence that effects were due to the treatment and not to some other aspect of the intervention (e.g., likeability of the therapist). Finally, as researchers build the foundation of support for the effectiveness of AATs, further examination of how they work (mediators) and for whom they work (moderators) becomes increasingly important from both theoretical and practical perspectives. As the field grows, future reviews should examine AATs that primarily address physical outcomes and psychosocial outcomes separately to better understand any differences between these treatment domains. AATs for youth show great promise, but the ability of the field to convey the importance of this work is hampered by methodological limitations. It is hoped that this methodological critique can continue to foster the impressive work being undertaken by researchers to demonstrate the efficacy and acceptability of these interventions to the community.

Table 1: Summary of Included Studies Examining Animal-Assisted Therapy for Youth


Note. Repeated-measures control groups indicate studies that utilized multiple baselines and/or follow-up designs within each study participant. Crossover control groups indicate within-subjects designs in which all participants receive a sequence of different treatment types, alternating over the course of the study (e.g., ABAB).

Table 2: Quality Coding of Studies

Table 3: Quality Characteristics' Mean Ratings, Distributions Across Studies, and Correlations with Year of Publication