  • Research article
  • Open access

Assessing the order of magnitude of outcomes in single-arm cohorts through systematic comparison with corresponding cohorts: An example from the AMOS study

Abstract

Background

When a therapy has been evaluated in the first clinical study, the outcome is often compared descriptively to outcomes in corresponding cohorts receiving other treatments. Such comparisons are often limited to selected studies, and often mix different outcomes and follow-up periods. Here we give an example of a systematic comparison to all cohorts with identical outcomes and follow-up periods.

Methods

The therapy to be compared (anthroposophic medicine, a complementary therapy system) had been evaluated in one single-arm cohort study: the Anthroposophic Medicine Outcomes Study (AMOS). The five largest AMOS diagnosis groups (A-cohorts: asthma, depression, low back pain, migraine, neck pain) were compared to all retrievable corresponding cohorts (C-cohorts) receiving other therapies with identical outcomes (SF-36 scales or summary measures) and identical follow-up periods (3, 6 or 12 months). Between-group differences (pre-post difference in an A-cohort minus pre-post difference in the respective C-cohort) were divided by the standard deviation (SD) of the baseline score of the A-cohort.

Results

A-cohorts (5 cohorts with 392 patients) were similar to C-cohorts (84 cohorts with 16,167 patients) regarding age, disease duration, baseline affection and follow-up rates. A-cohorts had ≥ 0.50 SD larger improvements than C-cohorts in 13.5% (70/517) of comparisons; improvements of the same order of magnitude (small or minimal differences: -0.49 to 0.49 SD) were found in 80.1% of comparisons; and C-cohorts had ≥ 0.50 SD larger improvements than A-cohorts in 6.4% of comparisons. Analyses stratified by diagnosis had similar results. Sensitivity analyses, restricting the comparisons to C-cohorts with similar study design (observational studies), setting (primary care) or interventions (drugs, physical therapies, mixed), or restricting comparisons to SF-36 scales with small baseline differences between A- and C-cohorts (-0.49 to 0.49 SD) also had similar results.

Conclusion

In this descriptive analysis, anthroposophic therapy was associated with SF-36 improvements largely of the same order of magnitude as improvements following other treatments. Although these non-concurrent comparisons cannot assess comparative effectiveness, they suggest that improvements in health status following anthroposophic therapy can be clinically meaningful. The analysis also demonstrates the value of a systematic approach when comparing a therapy cohort to corresponding therapy cohorts.


Background

In the early phase of the clinical evaluation of a therapy, when first study results are published, it can be desirable to assess outcomes of this reference therapy relative to outcomes of other treatments for the disease in question. At this stage, a systematic review of controlled studies of the reference therapy vs. other treatments will give limited information, because few such studies will be available. An alternative is to compare the reference cohort (or cohorts) to all corresponding cohorts, i. e. to single-arm cohorts and therapy arms in controlled studies, receiving other treatments. Although such 'all corresponding cohorts comparisons' cannot assess comparative effectiveness, they nevertheless yield information about the order of magnitude of treatment outcomes. For therapies which have been evaluated exclusively in single-arm studies (e. g. many drugs [1–4], surgery [5–8], other procedures [9, 10]), 'corresponding cohort comparisons' remain the only possibility.

Brief comparisons with corresponding cohorts are often presented in discussion sections of papers (e. g. [11, 12]), and are often unsystematic (limited to selected studies) and imprecise (mixing different outcomes and follow-up periods). Here we give an example of a systematic comparative review, restricted to cohorts with identical outcome measures and comparable follow-up periods. Since the cohorts compared are derived from different studies, the comparisons are necessarily explorative and the analyses descriptive: results are not pooled, but are ordered in increasing magnitude.

Methods

Reason for the review

The reference therapy (anthroposophic medicine, a physician-provided complementary therapy system including counselling, medication, art and movement exercises, and massage) had been evaluated in a large single-arm cohort study: the Anthroposophic Medicine Outcomes Study (AMOS) [13]. AMOS was conducted in 1998–2005 in Germany. Outpatients with chronic disorders were enrolled before starting anthroposophic therapy and followed up for four years [14–17].

For the five largest AMOS diagnosis groups in adult patients (asthma, depression [18], low back pain [19], migraine, and neck pain), AMOS is so far the only outpatient study of anthroposophic therapy for the respective diagnoses [20]. In all five groups, changes in health status had been evaluated with the SF-36 Health Survey, which is widely used [21], enabling comparisons to other cohorts. We conducted a systematic review, comparing these five cohorts (A-cohorts, 392 patients from 90 medical practices, enrolled up to 31 December 2005) to all retrievable patient cohorts (C-cohorts) with corresponding diagnoses, outcome measure (SF-36), and follow-up periods.

Objective

The objective of this systematic review was to assess the comparative order of magnitude of pre-post changes in health status in adult patients receiving anthroposophic therapy for one of five chronic diseases.

Eligible comparison studies

For comparison to A-cohorts, we considered prospective studies from any setting in any country with any therapeutic intervention including treatment-as-usual, and with a cohort of at least 20 evaluable patients, published in Danish, English, German, French, Italian, Norwegian, Russian, Spanish or Swedish.

Studies were eligible if at least 80% of participants of the study or of a defined subgroup had one of the following five diagnoses occurring in at least 20 adult AMOS patients: asthma, depression, low back pain, migraine, and neck pain. No requirements of diagnostic criteria were made. Low back pain cohorts with more than 25% patients with congenital spinal malformations, spinal infectious or malignant disease, ankylosing spondylitis, Behcet's Syndrome, Reiter's Syndrome, osteoporosis with vertebral fracture, spinal stenosis, spondylolysis, spondylolisthesis, fibromyalgia, traumatic vertebral fracture or previous spinal operations were excluded from the analysis (these diagnoses were excluded from the corresponding A-cohort). Studies with all persons aged ≥ 60 years were also excluded (only 12% of A-patients were aged ≥ 60 years).

Studies were required to have at least one of the following ten outcomes from the SF-36 Health Survey, four-week version: eight SF-36 scales (Physical Functioning, Role Physical, Role Emotional, Social Functioning, Mental Health, Bodily Pain, Vitality, General Health), SF-36 Physical Component Summary Measure or SF-36 Mental Component Summary Measure. The outcome was included if the arithmetic mean was presented with a number or could be estimated from a figure (a) before commencement of any study intervention, and (b) after three, six, or 12 months (± 20%).
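The follow-up criterion (b) can be illustrated with a small check. This is a minimal sketch, assuming that the ± 20% tolerance is applied relative to each nominal time point and that follow-up is expressed in months; the function name and the inclusive bounds are illustrative assumptions and are not taken from the review protocol.

```python
from typing import Optional

# Nominal follow-up periods (months) and relative tolerance named in the text.
NOMINAL_FOLLOW_UPS_MONTHS = (3, 6, 12)
TOLERANCE = 0.20


def matched_follow_up(months: float) -> Optional[int]:
    """Return the nominal follow-up (3, 6 or 12 months) whose +/- 20% window
    contains `months`, or None if the follow-up is not eligible.

    Assumption: bounds are inclusive; the source does not say how exact
    boundary values were handled.
    """
    for nominal in NOMINAL_FOLLOW_UPS_MONTHS:
        lower = nominal * (1 - TOLERANCE)
        upper = nominal * (1 + TOLERANCE)
        if lower <= months <= upper:
            return nominal
    return None
```

Under these assumptions the three windows (roughly 2.4 to 3.6, 4.8 to 7.2 and 9.6 to 14.4 months) do not overlap, so each eligible follow-up maps to exactly one nominal period.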

Literature search

We searched Ovid MEDLINE, Ovid MEDLINE In-Process & Other Non-Indexed Citations, Journals@Ovid full text, BIOSIS Previews, Cochrane Database of Systematic Reviews, American College of Physicians Journal Club, Database of Abstracts of Reviews of Effects, Cochrane Central Register of Controlled Trials, PsycLIT, the online SF-36 database [22], literature references of retrieved articles and our own literature archive. Articles published up to December 2005 were considered.

The general search strategy was: "SF-36 keyword" in title, abstract or keyword AND "disease keyword" in title. SF-36 keywords were "SF-36" OR "Short-Form" OR "Medical Outcomes" OR "Quality of Life" OR "Disability" OR "Outcome Assessment (Health Care)". Disease keywords were "Asthma" OR "Depress*" OR "Dysthymic disorder" OR "Low back pain" OR "Spinal diseas*" OR "Migraine" OR "Neck Pain" OR ["Spinal Diseas*" AND "Neck"].
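The Boolean structure of this strategy can be written out as a string builder. The sketch below is database-neutral: it reproduces only the OR/AND logic of the keywords listed above and deliberately omits the field restrictions (title, abstract, keyword), because their syntax differs between search interfaces. The function names are hypothetical.

```python
# Keyword lists copied from the search strategy described in the text.
SF36_KEYWORDS = [
    "SF-36",
    "Short-Form",
    "Medical Outcomes",
    "Quality of Life",
    "Disability",
    "Outcome Assessment (Health Care)",
]
DISEASE_KEYWORDS = [
    "Asthma",
    "Depress*",
    "Dysthymic disorder",
    "Low back pain",
    "Spinal diseas*",
    "Migraine",
    "Neck Pain",
]


def quoted_or(terms):
    """Join terms into a quoted OR-chain, e.g. '"a" OR "b"'."""
    return " OR ".join(f'"{term}"' for term in terms)


def build_query():
    """Combine the SF-36 clause and the disease clause with AND.

    Field restrictions are intentionally left out (assumption: they would be
    added in the syntax of whichever database is being searched).
    """
    sf36_clause = quoted_or(SF36_KEYWORDS)
    disease_clause = quoted_or(DISEASE_KEYWORDS) + ' OR ("Spinal Diseas*" AND "Neck")'
    return f"({sf36_clause}) AND ({disease_clause})"


if __name__ == "__main__":
    print(build_query())
```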

Articles were read and assessed for provisional inclusion by one reviewer. All provisionally included articles were re-assessed for fulfilment of eligibility criteria by a second reviewer. Disagreements about fulfilment of eligibility criteria were resolved by discussion. Data of included articles were extracted and entered into Microsoft Excel data files by one person; all entered data were subject to source document verification by another person.

Analysis

Units of analysis were individual SF-36 outcomes of the smallest statistically independent cohorts. Data from cohorts which were not statistically independent (e. g. patients randomised to drug therapy with or without information booklet, outcomes not differing between the two treatment groups) were pooled prior to analysis.

For each C-cohort, the following study characteristics were extracted and tabulated: diagnosis, publication year, evaluable SF-36 outcomes, sample size at baseline, last evaluable follow-up, follow-up rates, design, setting, country, age, gender, disease duration, and study treatment.

The statistical analysis (SPSS 14.0) was descriptive. For each evaluable SF-36 outcome of a C-cohort, the pre-post difference (mean score at last evaluable follow-up minus mean baseline score) was subtracted from the corresponding difference of the respective A-cohort, yielding a mean outcome difference ((MeanA-FU - MeanA-0) - (MeanC-FU - MeanC-0)). For each of the ten SF-36 outcomes, mean outcome differences were analysed with summary statistics of distribution of the differences. In order to aggregate all differences of all outcomes, the differences were also expressed as between-group effect sizes through division by the standard deviation (SD) of the baseline score of the A-cohort (((MeanA-FU - MeanA-0) - (MeanC-FU - MeanC-0)) / SDA-0). The baseline SD of the A-cohort was used instead of the SD of the C-cohort or a pooled SD from A- and C-cohorts because the SD was not available for many C-cohorts. To avoid redundancy when aggregating differences across SF-36 outcomes, comparisons of SF-36 Physical and Mental Component Summary Measures were not included if, for the C-cohort in question, all the eight SF-36 scales were evaluable for comparison. Effect sizes and baseline differences were classified by their absolute value as large (≥ 0.80), medium (0.50–0.79), small (0.20–0.49) and minimal (0.00–0.19) [23, 24]. An improvement of the same order of magnitude was defined as a minimal-to-small effect size (range -0.49 to 0.49). Due to the descriptive nature of this analysis, no hypothesis testing was performed.
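The effect-size calculation and classification described above can be sketched in a few lines. This is an illustration only, not the SPSS procedure used in the analysis; the function names and the example scores are hypothetical, and the category boundaries are treated as continuous cut-offs (an assumption, since the text reports them to two decimals).

```python
def between_group_effect_size(a_baseline, a_follow_up,
                              c_baseline, c_follow_up,
                              a_baseline_sd):
    """Pre-post change in the A-cohort minus pre-post change in the C-cohort,
    divided by the baseline SD of the A-cohort."""
    if a_baseline_sd <= 0:
        raise ValueError("baseline SD of the A-cohort must be positive")
    a_change = a_follow_up - a_baseline
    c_change = c_follow_up - c_baseline
    return (a_change - c_change) / a_baseline_sd


def classify_magnitude(effect_size):
    """Classify by absolute value: large, medium, small or minimal."""
    size = abs(effect_size)
    if size >= 0.80:
        return "large"
    if size >= 0.50:
        return "medium"
    if size >= 0.20:
        return "small"
    return "minimal"


def same_order_of_magnitude(effect_size):
    """True for minimal-to-small effect sizes (absolute value below 0.50)."""
    return abs(effect_size) < 0.50


if __name__ == "__main__":
    # Hypothetical scores on a 0-100 SF-36 scale, not data from the review.
    es = between_group_effect_size(a_baseline=50, a_follow_up=62,
                                   c_baseline=55, c_follow_up=60,
                                   a_baseline_sd=20)
    print(es, classify_magnitude(es), same_order_of_magnitude(es))
```

In the hypothetical example the A-cohort improves by 12 points and the C-cohort by 5 points; with a baseline SD of 20 the effect size is 7/20 = 0.35, a small difference favouring the A-cohort and therefore of the same order of magnitude.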

Analyses were performed for all comparisons and stratified by SF-36 outcomes, by diagnoses, and both. In addition, four sensitivity analyses (SA1-4) were performed in order to study effects of reducing the heterogeneity of the comparisons. In each SA, between-group effect sizes were reanalysed, restricting the number of comparisons according to study design, setting, intervention or baseline status: In SA1, study designs of C-cohorts were restricted to observational studies (non-randomised comparative studies and single-arm cohorts), i.e. excluding randomised trials, because the randomisation prerequisite might lead to a selection of patients with different characteristics, compared to observational studies such as AMOS. In SA2, settings of C-cohorts were restricted to primary care or health maintenance organizations, because most A-patients were recruited in primary care. In SA3, treatments of C-cohorts were restricted to drugs, physiotherapy or other physical therapies or mixed treatments, because these interventions were deemed to be most similar to the AMOS treatment modalities. In SA4, comparisons were restricted to SF-36 scales with small baseline differences (maximum 0.49 SD) between the respective A- and C-cohorts, because scales with large baseline differences may have differing room for improvement following therapy, for regression to the mean etc.
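The four sensitivity analyses amount to re-running the same summary on a filtered subset of comparisons. The sketch below shows that structure under stated assumptions: the `Comparison` record, its field names and the category labels (for example `"primary_care"` standing for primary care or health maintenance organizations) are hypothetical and are not taken from the study data set.

```python
from dataclasses import dataclass
from statistics import median
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Comparison:
    design: str              # "rct", "non_randomised" or "single_arm" (assumed labels)
    setting: str             # "primary_care" covers primary care and HMOs (assumed)
    treatment: str           # e.g. "drugs", "physiotherapy", "physical_other", "mixed"
    baseline_diff_sd: float  # baseline difference A minus C, in SD units
    effect_size: float       # between-group effect size


# One predicate per sensitivity analysis, following the Methods description.
SENSITIVITY_FILTERS: Dict[str, Callable[[Comparison], bool]] = {
    "SA1": lambda c: c.design in {"non_randomised", "single_arm"},
    "SA2": lambda c: c.setting == "primary_care",
    "SA3": lambda c: c.treatment in {"drugs", "physiotherapy", "physical_other", "mixed"},
    "SA4": lambda c: abs(c.baseline_diff_sd) < 0.50,
}


def summarise(comparisons: List[Comparison]) -> Dict[str, Optional[float]]:
    """Median effect size and share of minimal-to-small differences."""
    sizes = [c.effect_size for c in comparisons]
    if not sizes:
        return {"n": 0, "median": None, "share_same_order": None}
    same_order = sum(1 for s in sizes if abs(s) < 0.50)
    return {"n": len(sizes), "median": median(sizes),
            "share_same_order": same_order / len(sizes)}


def run_sensitivity(comparisons: List[Comparison], name: str):
    """Re-run the summary on the subset kept by one sensitivity filter."""
    keep = SENSITIVITY_FILTERS[name]
    return summarise([c for c in comparisons if keep(c)])
```

A combined analysis (SA1 + SA2 + SA3 + SA4) would keep only comparisons that pass all four predicates, which is why it leaves few cohorts.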

Results

Excluded publications

A total of 530 publications were excluded from this review for the following reasons: diagnosis not fulfilling eligibility criteria (n = 192 publications), no follow-up data (n = 129), no SF-36 data (n = 55), multiple publications (n = 43), SF-36 data presented without means (n = 31), cohort with all patients ≥ 60 years (n = 27), duration of follow-up differing > 20% from three, six, and 12 months, respectively (n = 20), no baseline SF-36 data (n = 12), cohort with < 20 patients (n = 8), SF-36 acute form only (n = 3), modified SF-36 (n = 3), language not fulfilling eligibility criteria (n = 2), no trial (n = 1), other (n = 4). A table of excluded publications with reasons for exclusion is provided in Additional file 1.

Description of AMOS cohorts and corresponding cohorts

All diagnoses analysed together

The five A-cohorts with a total of 392 patients were compared to 84 C-cohorts with 16,167 patients. These 84 C-cohorts were presented in 63 publications (Table 1, for details see also Additional file 2). Diagnoses are described below and in Additional file 3. Seven of the 84 C-cohorts were published in the period 1994–1996, 15 in 1997–1999, 23 in 2000–2002 and 39 in 2003–2005. Evaluable outcomes of C-cohorts were: all eight SF-36 scales (n = 40 of 84 C-cohorts), SF-36 Physical or Mental Component Summary Measures or both (n = 11), all eight SF-36 scales plus SF-36 Physical or Mental Component Summary Measures or both (n = 20), fewer than eight SF-36 scales (n = 13).

Table 1 Overview of cohorts, patients and diagnoses

Median sample size per cohort was 56 patients (interquartile range (IQR) 41–127 patients) for A-cohorts and 137 patients (IQR 65–244) for C-cohorts. The last evaluable follow-up ensued after three months in 23 of 84 C-cohorts, after six months in 32 C-cohorts and after 12 months in 29 C-cohorts. Three-month-follow-up rates were 87.5% and 83.0% in A- and C-cohorts, respectively; six-month rates were 82.1% and 79.1%; and 12-month rates were 78.8% and 72.2%.

Study designs of C-cohorts were randomised controlled trials (n = 40 of 84 C-cohorts), non-randomised comparative studies (n = 9) and single-arm cohort studies (n = 35). Study settings of A-patients were primary care practice (85.5%, 337 of 389 evaluable A-patients), referral practice (10.5%), and outpatient clinic (2.8%). Study settings of C-cohorts were primary care or health maintenance organization (n = 27 C-cohorts, 33.6% (5427/16,167) of C-patients), non-academic hospital or outpatient clinic (n = 19 C-cohorts, 25.2% of C-patients), academic hospital or outpatient clinic (n = 27, 22.0%) and other or not specified (n = 11, 18.1%).

The 84 C-cohorts came from the USA (n = 37), Germany (n = 13), United Kingdom (n = 11), Canada (n = 4), Australia (n = 3), Japan (n = 3), Italy (n = 2), Spain (n = 2), from eight other countries (each C-cohort: n = 1) and from more than one country (n = 1).

Mean age, weighted for sample size, was 44.4 years (SD 11.6) in A-cohorts and 44.4 years in C-cohorts (evaluable in 76 of 84 C-cohorts). The percentage of women was 83.2% (326/392) in A-cohorts and 59.6% (8,897/14,927) in C-cohorts (evaluable in 77 C-cohorts). Mean disease duration, weighted for sample size, was 10.5 years (SD 12.8) in A-cohorts and 12.8 years in C-cohorts (evaluable for 14 of 84 C-cohorts).
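Weighting a cohort-level mean by sample size means that each cohort contributes in proportion to its number of patients. A minimal sketch follows; the function name and the example values are hypothetical and do not come from the review.

```python
def weighted_mean(values, weights):
    """Sample-size weighted mean: sum(value * n) / sum(n)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


if __name__ == "__main__":
    # Hypothetical cohorts: mean ages 40 and 50 years, with 100 and 300 patients.
    print(weighted_mean([40, 50], [100, 300]))
```

In the hypothetical example the larger cohort pulls the result towards its own mean: (40 × 100 + 50 × 300) / 400 = 47.5 years, compared with an unweighted mean of 45 years.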

Main anthroposophic treatment modalities in A-patients were: eurythmy therapy (45.4%, 178 of 392 A-patients), art therapy (26.5%), rhythmical massage therapy (11.5%), and physician-provided anthroposophic therapy (16.6%). Study treatments in C-cohorts were: drugs (n = 32 of 84 C-cohorts, 47.3%, 7,655 of 16,167 C-patients), treatment-as-usual (n = 17 C-cohorts, 14.9% of C-patients), surgery (n = 8, 9.0%), physiotherapy (n = 4, 10.1%), other physical therapy (n = 5, 1.4%), educational intervention (n = 7, 5.9%), and mixed or other therapy (n = 11, 11.4%).

Analyses stratified by diagnoses

Data on gender, age, study design, setting, disease duration at baseline, study treatments, last follow-up and follow-up rates, stratified by diagnosis, are presented in Additional file 3. The diagnosis neck pain had only two C-cohorts; therefore, the following description refers to the remaining diagnoses – asthma, depression, low back pain and migraine – with a range of 12–30 C-cohorts per diagnosis. In all four diagnoses, the percentage of women was higher in A-cohorts than in C-cohorts; absolute percent differences ranged from 9% (asthma: 70% and 61% women in A- and C-cohorts, respectively) to 44% (low back pain: 86% and 42% women). Age was similar in A- and C-cohorts. In C-cohorts the proportion of randomised trials was higher in depression (84%, 16 of 19 C-cohorts) than in other diagnoses (range 19%–43%). In A-patients the proportion recruited in primary care was lower in asthma (48%, 27 of 56 patients) than in other diagnoses (88%–97%). Correspondingly, the proportion of C-cohorts recruited in primary care/health maintenance organization settings was lowest for asthma (8%, 1 of 12 C-cohorts) and highest in depression (74%, 14 of 19 C-cohorts). In asthma, disease duration was similar in A- and C-cohorts (median 14.5 years and 14.5 years, respectively; n = 6 evaluable C-cohorts); the other diagnoses had only 1–3 cohorts with evaluable data on disease duration. The most frequent study treatments in C-cohorts were drugs (migraine, asthma, depression: 71%, 58% and 42% of C-cohorts, respectively) and treatment-as-usual (low back pain, 30% of C-cohorts). Follow-up rates of A- and C-cohorts differed little across diagnoses.

Comparisons between AMOS cohorts and corresponding cohorts

For separate analysis of individual SF-36 scales, a total of 552 comparisons between A-cohorts and C-cohorts were possible (Tables 2, 3 and 4; Fig. 1). For aggregated analysis of all SF-36 scales, comparisons of SF-36 Physical and Mental Component Summary Measures were excluded for cohorts with all eight SF-36 scales evaluable (35 excluded comparisons), resulting in 517 comparisons.
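The redundancy rule behind the step from 552 to 517 comparisons can be expressed as a filter on the outcomes of each C-cohort. The sketch below is illustrative; the two-letter scale abbreviations and the function name are assumptions introduced here, not identifiers from the study.

```python
# Assumed abbreviations for the eight SF-36 scales and the two summary measures.
SCALES = frozenset({"PF", "RP", "RE", "SF", "MH", "BP", "VT", "GH"})
SUMMARIES = frozenset({"PCS", "MCS"})


def outcomes_for_aggregation(evaluable_outcomes):
    """Return the outcomes of one C-cohort that enter the aggregated analysis.

    If all eight scales are evaluable, the summary measures are dropped as
    redundant; otherwise every evaluable outcome is kept.
    """
    evaluable = set(evaluable_outcomes)
    if SCALES <= evaluable:
        return evaluable - SUMMARIES
    return evaluable


if __name__ == "__main__":
    print(sorted(outcomes_for_aggregation(SCALES | SUMMARIES)))  # eight scales only
    print(sorted(outcomes_for_aggregation({"PCS", "MCS"})))      # both summaries kept
```

A cohort reporting only summary measures, or fewer than eight scales plus a summary measure, keeps all of its outcomes, so the rule removes comparisons only where the scales already carry the same information.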

Figure 1 Outcome comparisons stratified by individual SF-36 scales. Differences between pre-post improvements of AMOS cohorts and improvements of corresponding cohorts for the eight SF-36 scales (0–100) and the SF-36 Physical and Mental Component Summary measures, expressed in effect sizes and ordered in increasing magnitude for each scale (altogether n = 552 comparisons). Positive differences indicate larger pre-post improvement in AMOS cohort than in corresponding cohort.

Table 2 Baseline scores and outcome comparisons, stratified by SF-36 scales.
Table 3 Baseline differences in standard deviations, stratified by SF-36 scales. Each comparison refers to one SF-36 scale at baseline: Mean score in AMOS cohort minus mean score in corresponding cohort, divided by standard deviation of score in AMOS cohort. A negative difference indicates that AMOS cohorts have worse health status than corresponding cohorts at baseline.
Table 4 Outcome comparisons in effect sizes, stratified by SF-36 scales. Each comparison refers to one SF-36 scale at the last evaluable follow-up of the corresponding cohort: Mean difference from baseline in AMOS cohort minus mean difference from baseline in corresponding cohort, divided by standard deviation of baseline score of AMOS cohort. A positive difference indicates that AMOS cohorts show larger improvements than corresponding cohorts.

Main analysis: all diagnoses and SF-36 scales analysed together (517 comparisons)

At baseline (Table 5), A-cohorts were slightly more severely affected than C-cohorts (median difference 0.22 SD, IQR -0.13 to +0.53); baseline differences between A- and C-cohorts were minimal or small (-0.49 SD to 0.49 SD) in 65.6% (339/517) of the comparisons; medium-to-large (≥ 0.50 SD) with A-cohorts more severely affected than C-cohorts in 26.5% of the comparisons; and medium-to-large with C-cohorts more severely affected in 7.9% of the comparisons.

Table 5 Baseline differences in standard deviations, stratified by diagnosis. Each comparison refers to one SF-36 scale at baseline: Mean score in AMOS cohort minus mean score in corresponding cohort, divided by standard deviation of score in AMOS cohort. A negative difference indicates that AMOS cohorts have worse health status than corresponding cohorts at baseline.

At follow-up (Table 6, Fig. 2, All diagnoses), outcome comparisons showed effect sizes (pre-post improvements of A-cohorts minus pre-post improvements of C-cohorts divided by standard deviation of baseline score of A-cohorts) with a median of 0.11 (IQR -0.11 to 0.35).

Table 6 Outcome comparisons in effect sizes, stratified by diagnosis. Each comparison refers to one SF-36 scale at the last evaluable follow-up of the corresponding cohort: Mean difference from baseline in AMOS cohort minus mean difference from baseline in corresponding cohort, divided by standard deviation of baseline score of AMOS cohort. A positive difference indicates that AMOS cohorts show larger improvements than corresponding cohorts.
Figure 2 Outcome comparisons stratified by diagnoses. Differences between pre-post improvements of AMOS cohorts and improvements of corresponding cohorts for all SF-36 scales and summary measures, expressed in effect sizes and ordered in increasing magnitude: for all diagnoses and for individual diagnoses (altogether n = 517 comparisons). Positive effect sizes indicate larger pre-post improvement in AMOS cohort than in corresponding cohort.

• Effect sizes were positive, i. e. showing larger (≥ 0.20) improvements of A-cohorts than of C-cohorts in 41.0% (212/517) of the comparisons. These positive effect sizes were large (≥ 0.80) in 3.3% of the comparisons, medium (0.50–0.79) in 10.3% and small (0.20–0.49) in 27.5% of the comparisons.

• Effect sizes showed minimal differences (-0.19 to 0.19) between A- and C-cohorts in 41.4% of the comparisons.

• Effect sizes were negative, i. e. showing larger improvements of C-cohorts than of A-cohorts in 17.6% of the comparisons. These negative effect sizes were large (≥ 0.80) in 2.5% of the comparisons, medium (0.50–0.79) in 3.9% and small (0.20–0.49) in 11.2% of the comparisons.

The proportion of comparisons showing improvements in A- and C-cohorts of the same order of magnitude (minimal-to-small effect sizes, range -0.49 to 0.49) was 80.1% (414 of 517 comparisons).
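The three headline proportions can be cross-checked with simple arithmetic. In the sketch below, the counts 70 and 414 and the total of 517 are reported in the text; the count of comparisons favouring C-cohorts is not stated explicitly and is derived here as the remainder, which is an inference rather than a reported figure.

```python
TOTAL_COMPARISONS = 517
FAVOURING_A = 70   # medium-to-large differences favouring A-cohorts (reported)
SAME_ORDER = 414   # minimal-to-small differences (reported)
# Derived as the remainder; this count is an inference, not stated in the text.
FAVOURING_C = TOTAL_COMPARISONS - FAVOURING_A - SAME_ORDER


def percentage(count, total=TOTAL_COMPARISONS):
    """Share of comparisons in percent, rounded to one decimal."""
    return round(100 * count / total, 1)


if __name__ == "__main__":
    for label, count in [("favouring A", FAVOURING_A),
                         ("same order of magnitude", SAME_ORDER),
                         ("favouring C", FAVOURING_C)]:
        print(f"{label}: {count}/{TOTAL_COMPARISONS} = {percentage(count)}%")
```

The derived remainder of 33 comparisons corresponds to 6.4%, matching the reported share of medium-to-large differences favouring C-cohorts.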

Analyses stratified by SF-36 scales (552 comparisons)

Baseline scores of individual SF-36 scales in A- and C-cohorts are presented in Table 2, baseline between-group differences in standard deviations in Table 3. At baseline, A-cohorts were more severely affected than C-cohorts for 8 SF-36 scales, with median differences ranging from 0.05 SD (Role Physical, Bodily Pain) to 0.54 (General Health), while C-cohorts were more severely affected for 2 scales, with median differences of 0.14 (Physical Functioning) and 0.40 (Physical Component Summary). The proportion of baseline comparisons with minimal-to-small differences (-0.49 SD to 0.49 SD) ranged from 44% (General Health) to 81% (Role Physical, Role Emotional).

Outcome comparisons of individual SF-36 scales are presented in Table 2 (score differences) and in Table 4 and Fig. 1 (effect sizes). Median effect sizes ranged from -0.05 (Role Emotional) to 0.27 (Physical Component Summary), while the proportion of outcome comparisons with minimal-to-small differences (-0.49 to +0.49) ranged from 67% (Bodily Pain) to 93% (Role Emotional).

Analyses stratified by diagnosis (517 comparisons)

The diagnosis neck pain had only 10 comparisons; therefore the following description refers to the remaining diagnoses – asthma, depression, low back pain and migraine – with a range of 77–202 comparisons per diagnosis.

At baseline (Table 5), A-cohorts were more severely affected than C-cohorts in all four diagnoses: the median baseline difference in standard deviations (0.22 for all cohorts) ranged from 0.02 (asthma) to 0.42 (migraine), while the proportion of baseline comparisons with small baseline differences (-0.49 to +0.49 SD; 66% for all comparisons) ranged from 57% (migraine) to 86% (asthma).

Outcome comparisons in effect sizes (Table 6) showed very little variation: the median effect size (0.11 for all comparisons) ranged from 0.05 (asthma) to 0.17 (depression), while the proportion of comparisons with minimal-to-small differences (-0.49 to +0.49; 80% for all comparisons) ranged from 77% (depression and low back pain) to 85% (migraine).

Analyses stratified by SF-36 scales and diagnoses (552 comparisons)

These analyses are presented in Additional file 4.

Sensitivity analyses

Four sensitivity analyses were performed (Table 7, see Methods for details). SA1, SA3 and SA4 had very small effects on the outcome differences: in each analysis the median effect size was reduced from 0.11 to 0.08, while the proportion of comparisons with minimal-to-small differences (-0.49 to +0.49 SD; 80% for all comparisons) ranged from 83% to 88%. In SA2, study settings of C-cohorts were restricted to primary care or health maintenance organizations, whereby the median effect size was increased from 0.11 to 0.24, while the proportion of comparisons with minimal-to-small differences was increased to 88%. The combination of SA1 + SA2 + SA3 + SA4 yielded only 5 evaluable cohorts with 16 comparisons, but results differed little from the main analysis (median effect size 0.07; minimal-to-small differences in 94% of comparisons).

Table 7 Outcome comparisons in effect sizes: Sensitivity analyses (SA)

Discussion

We have presented a systematic comparative review of SF-36 outcomes in five chronic conditions (asthma, depression, low back pain, migraine, neck pain). The review was prompted by the availability of results from the first study of a given therapy (the AMOS study of anthroposophic medicine in outpatients with various chronic diseases [13]). The objective was to assess the order of magnitude of AMOS outcomes, relative to outcomes of other therapies. For this purpose we compared AMOS diagnostic subgroups (A-cohorts) to all retrievable patient cohorts (C-cohorts) with corresponding diagnoses, outcome measures and follow-up periods. More than 500 comparisons of ten different SF-36 outcomes showed improvements largely of the same order of magnitude in corresponding A- and C-cohorts (minimal-to-small differences in 80% of the comparisons), with medium-to-large differences favouring A- and C-cohorts in 14% and 6% of the comparisons, respectively.

This systematic review has five characteristic features: 1) we compared one reference therapy to all other treatments for the respective indications; 2) each comparison was of corresponding cohorts from different studies; 3) comparisons were restricted to cohorts with identical outcome measure and comparable follow-up periods; 4) analyses were descriptive, with results ordered in increasing magnitude instead of being pooled; and 5) different outcomes were converted to a common metric, allowing for data synthesis into one variable. We are not aware of other systematic reviews combining these five features.

This type of review can be regarded as a systematic extension and upgrading of the common 'discussion-reviews' in publications presenting new therapies, where results of the first therapy study are compared descriptively to results of other treatments for the given indication. In contrast to such narrative reviews, the present review has the strengths of systematic, criteria-based literature selection and analysis.

For complementary and other complex therapy systems in widespread use regardless of whether evidence from randomised trials exists, it has been argued that the conventional drug research strategy – starting with studies of biological mechanisms and moving through Phase I, II and III clinical trials – should be replaced by a more appropriate strategy, moving from descriptive studies ('Phase 1') towards comparative studies of the whole system and its parts, and ending with studies of biological mechanisms ('Phase 5') [25]. In the context of this reversed strategy, the present review would represent an intermediate step between Phases 1–2 (studies of paradigms, utilization, perceived benefit and safety) and Phase 3 (comparative effectiveness studies).

Notably, the present review is limited to comparative order of magnitude. For our review, data from one single-arm study were available. Accordingly, each single comparison was of two cohorts derived from different studies. The only predefined criteria were comparable diagnosis and follow-up period, and identical outcome measure. A- and C-cohorts were found to be similar regarding age, disease duration, baseline affection and follow-up rates, and different regarding sample size and gender. Sensitivity analyses, restricting the number of comparisons to increase comparability of study design, settings, therapy, and baseline scores, had only small effects on the results. Notably, representative data on disease duration in C-cohorts were available for only one of the five diagnoses (asthma). Furthermore, only 15% of C-cohorts were from the same country as the A-cohorts (Germany). Other study characteristics of interest, including screening data, comorbidity and the use of adjunctive therapies, were only sparsely and heterogeneously documented in C-cohorts and hence were not evaluable. Because of the residual heterogeneity in these non-concurrent comparisons, the assessment could not be aimed at statistical precision. The analyses were purely descriptive, without attempting to pool data or to adjust for within-group differences (except simple adjustment for differences in baseline scores).

Our search strategy was limited to ten online databases, thus some eligible studies may have been missed. Sample sizes were less than 50 patients for A-cohorts with migraine and neck pain. Furthermore, only two cohorts with neck pain were available for comparison. Otherwise, a range of patient settings (primary care, clinic, academic hospital) and treatments (as-usual, drugs, physical therapies, educational intervention, surgery) were represented in the primary analyses of C-cohorts. Finally, the present review was restricted to a generic health status instrument (SF-36). Disease-specific comparisons might be more sensitive to relevant differences undetected by this review. On the other hand, anthroposophic therapy aims to improve a broad range of symptoms and functional limitations rather than only disease-specific symptoms [26], and broad instruments like the SF-36 may therefore be particularly appropriate [19].

Within the limits of non-concurrent comparisons, this review suggests that anthroposophic therapy for chronic asthma, back or neck pain, depression and migraine can be associated with improvements of SF-36 scales of largely the same order of magnitude as improvements following other treatments. This implication may sound trivial, but is not. If our analyses had shown mostly small improvements compared to other treatments, one might have concluded that further studies of SF-36 as outcome of anthroposophic therapy are not worthwhile. Had our analysis shown large differences favouring anthroposophic treatment, results would have appeared more impressive. Had our results been very heterogeneous (showing mostly large differences in both directions) it would have been necessary to compare C-cohorts with large positive and negative differences, respectively, to see if these two sets of cohorts differ systematically in other respects. The present results suggest that anthroposophic therapy can be associated with clinically meaningful improvements of health status.

The analysis also demonstrates the value of a systematic approach to corresponding cohort comparisons, which is particularly relevant for therapies evaluated exclusively or predominantly in single-arm studies. A relevant question in this respect is the appropriate range of C-cohorts to include and the associated workload (almost 600 publications were assessed for the present analysis). The starting point for the present comparative review was a reference cohort with many diagnoses; therefore we included the five largest evaluable diagnosis groups. In many other circumstances, it would be appropriate to analyse only one diagnosis. In the present review, we restricted comparisons to C-cohorts with similar follow-up period and identical outcome measure. Depending on the research question and the number of available C-cohorts, further restrictions could be applied, e.g. regarding design, setting, therapy and baseline status, as in the sensitivity analyses of the present review. Notably, in our analyses these restrictions, applied individually or simultaneously, had only small effects on the results: the maximum effect was an increase of 0.13 in the median effect size when the setting of C-cohorts was restricted to primary care/health maintenance organizations. Combined restrictions are possible, but will further reduce the number of C-cohorts. In this review, combining the sensitivity analyses SA1 + SA2 + SA3 + SA4 left only five evaluable C-cohorts covering two diagnoses.
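The effect of such restrictions on the median effect size can be illustrated with a minimal Python sketch. It is not the analysis code used for this review: the cohort records, field names and all numbers are hypothetical and serve only as an illustration. The effect-size definition (difference in pre-post change divided by the baseline SD of the A-cohort) and the 0.50 SD thresholds follow the definitions used in this review.

```python
from statistics import median

def effect_size(a_change, c_change, a_baseline_sd):
    """Between-group difference in units of the A-cohort baseline SD."""
    return (a_change - c_change) / a_baseline_sd

def classify(es, threshold=0.50):
    """Sort an effect size into the three categories used in the review."""
    if es >= threshold:
        return "A larger"
    if es <= -threshold:
        return "C larger"
    return "same order of magnitude"

# Hypothetical A-cohort values for one SF-36 scale (illustration only).
A_CHANGE, A_BASELINE_SD = 12.0, 20.0

# Hypothetical C-cohorts: pre-post change plus design and setting labels.
C_COHORTS = [
    {"change": 10.0, "design": "observational", "setting": "primary care"},
    {"change": 25.0, "design": "randomised", "setting": "clinic"},
    {"change": 1.0, "design": "observational", "setting": "primary care"},
    {"change": 8.0, "design": "randomised", "setting": "clinic"},
]

def summarise(cohorts):
    """Return the median effect size and the category of each comparison."""
    sizes = [effect_size(A_CHANGE, c["change"], A_BASELINE_SD) for c in cohorts]
    return median(sizes), [classify(es) for es in sizes]

# Main analysis: broad inclusion criteria (all C-cohorts).
main_median, main_classes = summarise(C_COHORTS)

# Sensitivity analysis: restrict C-cohorts to one setting.
restricted = [c for c in C_COHORTS if c["setting"] == "primary care"]
restricted_median, restricted_classes = summarise(restricted)

print("main:", main_median, main_classes)
print("restricted:", restricted_median, restricted_classes)
```

Running the main analysis first and applying each restriction as a filter on the same list keeps the broad and the restricted results directly comparable, at the cost of fewer C-cohorts in each restricted set.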

A problem when defining narrow inclusion criteria for any systematic review is that researchers familiar with the pool of potentially eligible studies might choose inclusion criteria that produce results favouring their research agenda (inclusion criteria bias) [27]. This problem can be prevented when broad criteria are used for the main analysis and restrictive criteria are applied secondarily, as in the present review. A general advantage of applying both broad and restricted inclusion criteria for one review is the additional information on how results change when inclusion criteria are altered.

Conversely, with some indications and outcome measures, the number of available C-cohorts may be very small. In such cases it may be necessary to widen the inclusion criteria, e.g. to include cohorts with follow-up periods and outcome measures other than those in the reference study. Another scenario is a body of C-cohorts from studies with very similar design, interventions and control groups (e.g. placebo-controlled randomised trials of one drug), which might enable researchers to incorporate data from control groups as well, and to pool data and adjust for between-group differences [28]. Corresponding cohort comparisons may also be applied to safety aspects of medications and therapies, and should take into account setting (e.g. spontaneous reporting system, retrospective survey or prospective cohort) and outcome measure (e.g. adverse events, suspected adverse reactions or medically confirmed adverse reactions) [29].

Conclusion

In this descriptive analysis, anthroposophic therapy was associated with SF-36 improvements largely of the same order of magnitude as improvements following other treatments. Although these non-concurrent comparisons cannot assess comparative effectiveness, they suggest that improvements in health status following anthroposophic therapy can be clinically meaningful. The analysis also demonstrates the value of a systematic approach when comparing a therapy cohort to corresponding therapy cohorts.

Abbreviations

A-cohorts: AMOS cohorts

AMOS: Anthroposophic Medicine Outcomes Study

C-cohorts: corresponding cohorts

IQR: interquartile range

SD: standard deviation

References

  1. Claassen J, Hirsch LJ, Emerson RG, Mayer SA: Treatment of refractory status epilepticus with pentobarbital, propofol, or midazolam: a systematic review. Epilepsia. 2002, 43: 146-153. 10.1046/j.1528-1157.2002.28501.x.

  2. Sibai BM: Diagnosis and management of gestational hypertension and preeclampsia. Obstet Gynecol. 2003, 102: 181-192. 10.1016/S0029-7844(03)00475-7.

  3. Srisurapanont M, Kittiratanapaiboon P, Jarusuraisin N: Treatment for amphetamine psychosis. Cochrane Database Syst Rev. 2001, CD003026.

  4. Critchley P, Plach N, Grantham M, Marshall D, Taniguchi A, Latimer E, Jadad AR: Efficacy of haloperidol in the treatment of nausea and vomiting in the palliative patient: a systematic review. J Pain Symptom Manage. 2001, 22: 631-634. 10.1016/S0885-3924(01)00323-2.

  5. McCahill L, Ferrell B: Palliative surgery for cancer pain. West J Med. 2002, 176: 107-110.

  6. Esiashvili N, Landry J, Matthews RH: Carcinoma of the anus: strategies in management. Oncologist. 2002, 7: 188-199. 10.1634/theoncologist.7-3-188.

  7. Huisstede BM, Miedema HS, van Opstal T, de Ronde MT, Kuiper JI, Verhaar JA, Koes BW: Interventions for treating the posterior interosseus nerve syndrome: a systematic review of observational studies. J Peripher Nerv Syst. 2006, 11: 101-110. 10.1111/j.1085-9489.2006.00074.x.

  8. Ismail-Khan R, Robinson LA, Williams CC, Garrett CR, Bepler G, Simon GR: Malignant pleural mesothelioma: a comprehensive review. Cancer Control. 2006, 13: 255-263.

  9. Gill AL, Bell CN: Hyperbaric oxygen: its uses, mechanisms of action and outcomes. QJM. 2004, 97 (7): 385-395.

  10. Whitlock EP, Garlitz BA, Harris EL, Beil TL, Smith PR: Screening for hereditary hemochromatosis: a systematic review for the U.S. Preventive Services Task Force. Ann Intern Med. 2006, 145: 209-223.

  11. Caffrey SL, Willoughby PJ, Pepe PE, Becker LB: Public use of automated external defibrillators. N Engl J Med. 2002, 347: 1242-1247. 10.1056/NEJMoa020932.

  12. Hoeper MM, Halank M, Marx C, Hoeffken G, Seyfarth HJ, Schauer J, Niedermeyer J, Winkler J: Bosentan therapy for portopulmonary hypertension. Eur Respir J. 2005, 25: 502-508. 10.1183/09031936.05.00080804.

  13. Hamre HJ, Becker-Witt C, Glockmann A, Ziegler R, Willich SN, Kiene H: Anthroposophic therapies in chronic disease: The Anthroposophic Medicine Outcomes Study (AMOS). Eur J Med Res. 2004, 9: 351-360. [http://www.ifaemm.de/Abstract/PDFs/HH04_2.pdf]

  14. Hamre HJ, Witt CM, Glockmann A, Ziegler R, Willich SN, Kiene H: Anthroposophic medical therapy in chronic disease: a four-year prospective cohort study. BMC Complement Altern Med. 2007, 7: 10. 10.1186/1472-6882-7-10.

  15. Hamre HJ, Witt CM, Glockmann A, Ziegler R, Willich SN, Kiene H: Eurythmy therapy in chronic disease: a four-year prospective cohort study. BMC Public Health. 2007, 7 (147): 61. 10.1186/1471-2458-7-61.

  16. Hamre HJ, Witt CM, Glockmann A, Ziegler R, Willich SN, Kiene H: Anthroposophic art therapy in chronic disease: a four-year prospective cohort study. Explore (NY). 2007, 3: 365-371. 10.1016/j.explore.2007.04.008.

  17. Hamre HJ, Witt CM, Glockmann A, Ziegler R, Willich SN, Kiene H: Rhythmical massage therapy in chronic disease: a 4-year prospective cohort study. J Altern Complement Med. 2007, 13: 635-642. 10.1089/acm.2006.6345.

  18. Hamre HJ, Witt CM, Glockmann A, Ziegler R, Willich SN, Kiene H: Anthroposophic therapy for chronic depression: a four-year prospective cohort study. BMC Psychiatry. 2006, 6: 57. 10.1186/1471-244X-6-57.

  19. Hamre HJ, Witt CM, Glockmann A, Wegscheider K, Ziegler R, Willich SN, Kiene H: Anthroposophic vs. conventional therapy for chronic low back pain: a prospective comparative study. Eur J Med Res. 2007, 12: 302-310. [http://ifaemm.de/Abstract/PDFs/HH07_3.pdf]

  20. Kienle GS, Kiene H, Albonico HU: Anthroposophic medicine: effectiveness, utility, costs, safety. 2006, Stuttgart, New York, Schattauer Verlag, 1-350.

  21. Ware JE: SF-36 Health Survey Update. Spine. 2000, 25: 3130-3139. 10.1097/00007632-200012150-00008.

  22. SF-36.org - A community for measuring health outcomes using SF tools. 2008, QualityMetric Incorporated, [http://www.sf-36.org]

  23. Cohen J: A power primer. Psychological Bulletin. 1992, 112: 155-159. 10.1037/0033-2909.112.1.155.

  24. McDowell I, Newell C: Measuring health. A guide to rating scales and questionnaires. 1996, New York - Oxford, Oxford University Press, 1-523. 2nd edition.

  25. Fonnebo V, Grimsgaard S, Walach H, Ritenbaugh C, Norheim AJ, MacPherson H, Lewith G, Launso L, Koithan M, Falkenberg T, Boon H, Aickin M: Researching complementary and alternative treatments - the gatekeepers are not at home. BMC Med Res Methodol. 2007, 7: 7. 10.1186/1471-2288-7-7.

  26. Ritchie J, Wilkinson J, Gantley M, Feder G, Carter Y, Formby J: A model of integrated primary care: anthroposophic medicine. 2001, London, Department of General Practice and Primary Care, St Bartholomew's and the Royal London School of Medicine, Queen Mary, University of London, 1-158. [http://www.ivaa.info/PDF/7_pactice_study.pdf]

  27. Egger M, Smith GD: Bias in location and selection of studies. BMJ. 1998, 316: 61-66.

  28. Glenny AM, Altman DG, Song F, Sakarovitch C, Deeks JJ, D'Amico R, Bradburn M, Eastwood AJ: Indirect comparisons of competing interventions. Health Technol Assess. 2005, 9: 1-iv.

  29. Hamre HJ, Witt CM, Glockmann A, Troger W, Willich SN, Kiene H: Use and safety of anthroposophic medications in chronic disease: a 2-year prospective analysis. Drug Saf. 2006, 29: 1173-1189. 10.2165/00002018-200629120-00008.

Acknowledgements

This review was funded by the Software-AG Stiftung and the Innungskrankenkasse Hamburg, with supplementary grants from the Deutsche BKK, the Betriebskrankenkasse des Bundesverkehrsministeriums, the Dr. Hauschka Stiftung, the Förderstiftung Anthroposophische Medizin, the Mahle Stiftung, and the Zukunftsstiftung Gesundheit. The sponsors had no influence on study design or planning; on collection, analysis, or interpretation of data; on the writing of the manuscript; or on the decision to submit the manuscript for publication. We thank P. Siemers for technical assistance.

Author information

Corresponding author

Correspondence to Harald J Hamre.

Additional information

Competing interests

The author(s) declare that they have no competing interests.

Authors' contributions

HJH, GSK and HK designed the review. HJH and WT wrote the analysis plan. AG analysed data. HJH performed literature search, re-assessed provisionally included articles, was principal author of the paper, had full access to all data, and is guarantor. All authors contributed to manuscript drafting and revision and approved the final manuscript.

Electronic supplementary material

Additional file 1: Excluded publications. List of excluded publications with reasons for exclusion (PDF 298 KB)

Additional file 2: Included publications. List of included publications (PDF 31 KB)

Additional file 3: Description of AMOS cohorts and corresponding cohorts, stratified by diagnosis. Descriptive data for AMOS cohorts and for corresponding cohorts on gender, age, study design, setting, disease duration at baseline, study treatment, last follow-up and follow-up rates (PDF 78 KB)

Additional file 4: Comparative analyses, stratified by diagnosis and SF-36 scales. Baseline scores of AMOS cohorts and of corresponding cohorts as well as between-group outcome differences, stratified by diagnosis and SF-36 scales (PDF 93 KB)

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Hamre, H.J., Glockmann, A., Tröger, W. et al. Assessing the order of magnitude of outcomes in single-arm cohorts through systematic comparison with corresponding cohorts: An example from the AMOS study. BMC Med Res Methodol 8, 11 (2008). https://doi.org/10.1186/1471-2288-8-11

  • DOI: https://doi.org/10.1186/1471-2288-8-11

Keywords