1. Introduction

In the last 20 years, maintenance studies of non-lithium treatments for bipolar disorder have become more frequent, leading to the first post-lithium US FDA approvals for long-term treatment of bipolar disorder: lamotrigine, olanzapine and aripiprazole as monotherapy[1–3] and quetiapine as adjunctive therapy. Clinicians see these new treatments as being on a par with lithium, yet their ‘maintenance’ study methodologies (especially those for the atypical neuroleptics) differ notably from those of the randomized clinical trials (RCTs) of lithium conducted in the 1960s and 1970s.[4,5]

This matter is clinically important because many clinicians and experts cite the above maintenance studies to claim that such agents are ‘mood stabilizers’. As we have discussed previously,[6,7] there are different opinions about the definition of a mood stabilizer, but we would like to suggest a simple one: a mood stabilizer is a drug that, in monotherapy, prevents new episodes of mania or depression. According to this definition, acute efficacy is irrelevant; prevention of recurrence is what matters. Further, adjunctive efficacy when the drug is added to other proven mood stabilizers (such as lithium) is irrelevant, since monotherapy efficacy is required for stand-alone treatment.

We conducted this review to identify the salient features of contemporary ‘maintenance’ treatment studies of bipolar disorder, especially with neuroleptic agents, so as to assess their validity compared with the older lithium literature and to determine how our experience with those studies can inform the methodological design of future maintenance studies. (We use the term ‘neuroleptic’ on purpose because it reflects the one theme common to all these agents, extrapyramidal symptoms; furthermore, the term ‘antipsychotic’ ignores their efficacy for nonpsychotic mania and nonpsychotic depression, which is directly relevant to the topic of this article.)[8]

2. Prophylaxis versus Relapse Prevention Designs

The literature on the maintenance treatment of bipolar disorder until recently consisted primarily of studies with lithium, which included both prophylaxis and relapse prevention methodologies. In the prophylaxis design, ‘all comers’ are included in the study; in other words, any patient who is euthymic, no matter how that person got well, is eligible to be randomized to drug versus placebo or control, including those with recent manic or depressive episodes. In the relapse prevention design, typically only those patients who respond acutely to the drug being studied are eligible to enter the randomized maintenance phase. These responders are then randomized either to stay on the drug or to be withdrawn from it (usually abruptly, sometimes with a taper) and switched to placebo. The ‘prophylaxis’ and ‘relapse prevention’ designs therefore do not address the same question about drug efficacy.

In the lithium studies in which the relapse prevention design was used (i.e. only responders to acute treatment with lithium were included), there was evidence of lithium withdrawal following acute treatment in the placebo group.[9,10] The problem is that, by design, those who reach the ‘maintenance’ phase and are treated with placebo are in fact persons who responded acutely to the study drug (lithium) and in whom the drug is then abruptly discontinued. Thus, if the placebo relapse rate is very high and almost exclusively limited to the first 1–2 months after study initiation, then one is observing a withdrawal effect involving a relapse back into the same acute episode that had just been treated, rather than a new episode. Further, even without immediate withdrawal, if episodes occur mainly in the first 6 months of the maintenance phase, then they are not, in our view, new episodes, but relapses into the acute index episode. Only episodes that occur 6 months or longer after the acute episode should be considered new recurrences; treatments that prevent these new episodes (of either polarity) would demonstrate maintenance efficacy as mood stabilizers.[6] In other words, the relapse prevention design methodology confounds prevention of relapse back into the index episode with prevention of a new episode.

The International Society for Bipolar Disorders (ISBD) also has suggested nomenclature for the course of illness in bipolar disorder; however, that task force was not willing to accept the definitions that we suggest, as given in figure 1, partly because many of the task force members were supportive of the ‘relapse prevention’ design that we critique in this article. It recommended only 2 months since a previous episode as sufficient time to define a new recurrence. One of the co-authors of the current article (SNG) was on the task force and expressed dissent from this view, as documented in the published task force report.[11]

Fig. 1. Proposed acute, continuation and maintenance phase definitions for bipolar disorder. Episodes after 6 months are recurrences (new episodes). Episodes prior to 6 months are relapses.[6]
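The timing rule proposed above reduces to a single comparison against a threshold measured from the index episode. The following is a minimal sketch only: the function name, the use of months as a numeric input and the example timings are illustrative assumptions, not part of any study protocol. The default threshold of 6 months follows the definition proposed in this article; the 2-month value recommended by the ISBD task force can be passed instead for comparison.

```python
def classify_episode(months_since_index: float, threshold_months: float = 6.0) -> str:
    """Label a mood episode by its timing relative to the index episode.

    threshold_months=6.0 follows the definition proposed in this article;
    passing 2.0 would mimic the ISBD task force recommendation instead.
    """
    if months_since_index < 0:
        raise ValueError("months_since_index must be non-negative")
    if months_since_index >= threshold_months:
        return "recurrence"  # treated as a new episode
    return "relapse"  # treated as a return of the index episode


# Hypothetical timings, in months after the index episode (not study data).
for months in (1.0, 4.5, 6.0, 14.0):
    print(months, classify_episode(months))
```

Under the 2-month threshold, an episode at 3 months would be labelled a recurrence, whereas under the 6-month threshold it would be labelled a relapse; this is the practical difference between the two nomenclatures.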

The first major maintenance study after the lithium era was a study of divalproex, conducted in the mid-1990s, that failed to find a difference between that agent, lithium and placebo.[12] It has been suggested that low severity of illness in this study sample may have been one of the reasons for its failure to differentiate between treatments.[4] This aspect is especially important given the increasing ethical constraints on enrolling patients in research studies, particularly long-term placebo-controlled studies in which patients may not receive any active drug for a year or more. This design factor likely inclined researchers to exclude patients with more severe illness, thereby probably inflating placebo response rates. Indeed, lithium and divalproex had rates of efficacy in this study that were similar to those in later studies of other agents (i.e. lamotrigine and atypical neuroleptics). Subsequent studies (e.g. with lamotrigine) would include the requirement that patients should have had, on average, one episode per year for the prior 3 years, thus ensuring a high 12- to 18-month recurrence rate on placebo and excluding patients who may have been stable for extended periods of time and so less likely to relapse. Indeed, an increased relapse rate was clearly seen in the relapse prevention studies that followed this new methodology.[4,13] The divalproex study was a key turning point; it was the last study to utilize the lithium-era prophylaxis design, and its failure was partly attributed to that design.

To assess the validity of relapse prevention designs, we identified studies that used this design with a MEDLINE search of the National Library of Medicine database from 1990 to 2011. We searched using the keywords ‘mood stabilizer’, ‘maintenance’, ‘treatment’, ‘relapse prevention’, ‘prophylaxis’ and ‘bipolar disorder’. We excluded lithium-only studies and small studies with a sample size below 60. We limited inclusion to monotherapy studies with a placebo control. We identified five monotherapy placebo-controlled RCTs of non-lithium treatments for bipolar disorder (one with divalproex, two with lamotrigine, one with olanzapine and one with aripiprazole) [see table I].[1–3,12] We also reviewed available poster abstracts from major psychiatric conferences (specifically, the American Psychiatric Association, the American College of Neuropsychopharmacology, the New Clinical Drug Evaluation Unit meetings and the International Society for Bipolar Disorders meetings) for the past decade to seek as yet unpublished data. We also contacted research liaisons at major pharmaceutical companies, namely Pfizer, Eli Lilly, Janssen, GlaxoSmithKline and AstraZeneca, to obtain such data, but received only limited data in response.

Table I. Summary of recent placebo-controlled maintenance studies in bipolar disorder

We made a qualitative assessment of the salient methodological features of the studies identified, focusing on sample size, type of pre-randomization treatments, duration of pre-randomization treatments, duration of follow-up, dropout rates, duration of the randomized phase of treatment, other allowed treatments, primary and secondary outcome measures, and statistical methods for analysing randomized results.

3. Problems with the Relapse Prevention Design

The most consequential impact of replacing the traditional prophylaxis design with the relapse prevention design (especially with respect to the atypical neuroleptic ‘maintenance’ studies) was that patients were randomized to continue or stop the drug to which they had just responded. This means that most relapse prevention studies have only been evaluating what has traditionally been called continuation (relapse prevention) treatment.

All contemporary studies used some kind of study-determined open-label treatment that occurred prior to the initiation of the randomized phase of the clinical trial. This pre-study treatment ranged from any medication (the divalproex study) to monotherapy with the specific experimental drugs being studied (the atypical neuroleptic studies) to a combination of the experimental drug with other medications (the two lamotrigine-lithium studies). Minimum duration of pre-study treatment varied from 2 weeks to 4 months (table I).

We identified three major limitations of the relapse prevention design as follows: (i) it is liable to a withdrawal (or discontinuation) effect; (ii) it does not take into account relapse polarity; and (iii) it is biased in comparison with active controls.

3.1 Withdrawal (Discontinuation) Relapse

Patients are more likely to relapse to an episode of the same polarity as their index episode.[15] Often the time to a new mood episode is ill-defined, and many clinicians have (wrongly) interpreted these results to mean that whatever works for the acute episode should be the basis for long-term maintenance. When symptoms recur soon after an acute episode has remitted, especially when treatment is withdrawn overnight, we believe that these new symptoms are more properly viewed as a continuation of the initial index episode rather than a new mood episode.

An obvious example of the heavy contribution of withdrawal-triggered relapse to ‘maintenance’ results is the study of olanzapine monotherapy versus placebo,[2] in which all patients who entered the study had to have responded to open-label olanzapine for an acute manic or mixed episode. In that study, patients only had to achieve remission from acute mania for 2–4 weeks before they were randomized either to stay on olanzapine or to be switched abruptly to placebo. With this design the median time to relapse in placebo-treated patients was about 1 month (table I), and about 75% of placebo-treated patients relapsed within 2 months.
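The two summary figures used in this argument, the median time to relapse and the share of relapses falling in an early window, are simple to compute from a list of relapse times. The sketch below is illustrative only: the function name and the relapse times are hypothetical and are not trial data. Note that it reports the share of observed relapses that fall within the window, which is not the same quantity as the share of all randomized patients who relapse, since patients who never relapse are not in the list.

```python
from statistics import median


def summarize_relapse_timing(relapse_months, window_months=2.0):
    """Return (median time to relapse, share of relapses inside the window)."""
    if not relapse_months:
        raise ValueError("need at least one relapse time")
    early = sum(1 for m in relapse_months if m <= window_months)
    return median(relapse_months), early / len(relapse_months)


# Hypothetical placebo-arm relapse times in months (not trial data).
placebo_relapse_months = [0.5, 0.7, 0.9, 1.0, 1.2, 1.8, 3.5, 7.0]
med, early_share = summarize_relapse_timing(placebo_relapse_months)
print(f"median = {med:.1f} months, share within 2 months = {early_share:.0%}")
```

A placebo arm whose relapses cluster this tightly in the first weeks after discontinuation is the pattern that, on the argument above, points to a withdrawal effect rather than to new episodes.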

In general, the studies reviewed do not explicitly state their definition of the acceptable duration for the maintenance phase of treatment. Overall study durations ranged from 6 to 18 months, with a median of 12 months. When study dropouts were included, actual average durations of follow-up were much shorter in general, and notably longer in the prophylaxis versus the relapse prevention design (table I). This is because only 5–21% of subjects treated with active or putative mood stabilizers in the relapse prevention studies actually remained on those agents for the full study duration (vs 24–38% in the prophylaxis design). Specifically, the active putative mood stabilizers all were closely clumped in the 2.8–3.4 months range for the median follow-up in the relapse prevention design, versus 5.1–6.6 months for the mean follow-up in the prophylaxis design. Similarly, in the placebo groups the median period of observation was much shorter in the relapse prevention design (median 0.9–2.9 months) than in the prophylaxis design (mean 5.5 months). Consequently, most of the benefit seen in the relapse prevention design occurs within the first 3 months of follow-up treatment, not the 6 or 18 months that are the maximum durations of those trials.

At a simple level, these results regarding withdrawal and maintenance duration reflect generalizability, which means how a study, if valid, can be extended to the general population of patients, given its specific study design. Such studies are only generalizable to those patients who responded to the study drug acutely.[16] Hence, the neuroleptic data would only be applicable to those who respond to those neuroleptics for acute mania.

3.2 Polarity of Relapse

To observe the natural history in patients with bipolar disorder, we examined available data on polarity of relapse in placebo-treated patients; the data were from two recent 18-month maintenance RCTs of lamotrigine (such data are not available from trials of other drugs, including the atypical neuroleptics).[17] As seen in table II, recurrences after 6 months are almost always into an episode with polarity opposite to that of the index episode. In contrast, relapses within 6 months in these two RCTs are always into the same polarity as the index episode, which suggests not a new episode but a relapse back into the same episode that was just treated, often enhanced by a withdrawal effect facilitated by the abrupt discontinuation of active drug. The same pattern of short-term relapse into the same polarity versus long-term recurrence into opposite polarity was observed in the lamotrigine and lithium treatment groups in those studies.[17] The analysis of relapse polarity lends greater significance to the duration of the maintenance phase in research trials; treatments for bipolar disorder should be tested for efficacy through the natural course of illness, wherein patients will likely experience a polarity switch.

Table II. Opposite polarity as natural history: analysis of placebo-treated patients in lamotrigine maintenance studies of bipolar disorder[17]
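The kind of tabulation summarized in table II crosses two attributes of each episode: its timing relative to the index episode (before versus after 6 months) and its polarity relative to the index episode (same versus opposite). A minimal sketch of such a cross-tabulation follows; the record layout, the function name and the example episodes are hypothetical and do not reproduce the table II data.

```python
from collections import Counter


def tabulate_polarity(episodes, threshold_months=6.0):
    """Count episodes by timing (before/after threshold) and polarity vs index.

    Each episode is a tuple: (index_polarity, new_polarity, months_since_index).
    """
    counts = Counter()
    for index_polarity, new_polarity, months in episodes:
        timing = "recurrence" if months >= threshold_months else "relapse"
        polarity = "same" if new_polarity == index_polarity else "opposite"
        counts[(timing, polarity)] += 1
    return counts


# Hypothetical placebo-arm episodes (illustrative only, not the table II data).
episodes = [
    ("mania", "mania", 1.0),
    ("mania", "mania", 2.5),
    ("depression", "depression", 4.0),
    ("mania", "depression", 9.0),
    ("depression", "mania", 12.0),
]
for key, n in sorted(tabulate_polarity(episodes).items()):
    print(key, n)
```

In these made-up records every early episode shares the index polarity and every late episode has the opposite polarity, which mirrors the pattern described in the text; real data would be expected to populate the off-diagonal cells as well.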

We have obtained as much data as we can, both published and unpublished,[18–20] on neuroleptics in the maintenance treatment of bipolar disorder. For all the available studied agents (olanzapine, quetiapine, ziprasidone, aripiprazole, risperidone), no neuroleptic has yet been shown to be effective in preventing depression in monotherapy relapse prevention designs 6 months or longer after an index episode of mania. The studies available include a recent very large maintenance study of quetiapine monotherapy (n = 1172) in which patients were included with various index episodes (manic, depressive, mixed), and efficacy was reported for prevention of both mania and depression.[21] However, data are not reported regarding prevention of recurrences into episodes whose polarity is opposite to that of the index episode. In contrast, one of the two maintenance studies of lamotrigine involved an index episode of mania and yet lamotrigine was effective in preventing the opposite pole of depression.[13] (The other study began with an index episode of depression, and lamotrigine was not effective in preventing the opposite pole of mania.[14])

Both the olanzapine and the lamotrigine data were reanalysed so as to exclude relapses in the first 2–6 months of follow-up in the randomized phase, with reported benefit for both agents even after excluding those patients who might have experienced an acute discontinuation withdrawal relapse.[2,17] The investigators have interpreted these post hoc analyses as further validating the relapse prevention design. However, the subjects analysed are no longer all those who were randomly allocated to treatment. All such completer analyses produce non-randomized datasets in which confounding bias is reintroduced, and thus the results are not interpretable at face value. The statistical literature indicates that intent-to-treat analysis is less biased, and that post hoc censoring analysis is prone to false-positive results, due to an inflated effect size and loss of randomization.[22–24]

3.3 Inappropriateness of Comparison with Controls

The recent maintenance relapse prevention studies of lamotrigine[13,14] used a lithium control for assay sensitivity; that is, if an established maintenance agent (lithium) does not separate from placebo, the study can be considered to have failed. As it happened, lamotrigine was more effective than placebo in the prevention of depressive episodes, while lithium was not; lithium was more effective than placebo for prevention of mania, while lamotrigine was not.[1] However, the relapse prevention design hampers such a straightforward interpretation, as the sample was partially enriched for lamotrigine response but not for lithium response. (Partial enrichment reflects the design whereby patients had to tolerate, but not necessarily respond to, lamotrigine, because, except for the last week prior to randomization, they were still on other agents as chosen by their clinician.) Thus, this was not an equal comparison of lithium with lamotrigine in an unselected sample. Additionally, just as in the placebo group, lamotrigine withdrawal may exacerbate the onset of a mood episode in the lithium group, and thus any comparisons in which the benefits of lamotrigine over lithium are claimed are not necessarily valid.

4. Lithium: Comparing Prophylaxis with Relapse Prevention Designs

In contrast, lithium has been studied and shown to be effective in both relapse prevention designs and prophylaxis designs, although its benefits too are overstated in the relapse prevention design. Thus, in a systematic review,[25] the effect size of lithium on recurrence (new episode) prevention when all designs are combined yields an odds ratio (OR) of 4.0 (95% CI 1.8, 8.7; n = 8 studies, 1022 patients). When limited to the prophylaxis design, with only two studies and a small sample (n = 40), the effect size in the older literature decreased slightly and became somewhat imprecise (OR = 3.2, 95% CI 0.6, 15.5). When using the new lamotrigine studies[13,14] as examples of the prophylaxis design for lithium (since subjects were not preselected to be lithium responders), a smaller effect size is found, though with greater precision due to larger samples (OR = 1.9, 95% CI 1.2, 2.8; n = 2 studies, 495 patients). When limited to relapse prevention designs, the effect size became huge (OR = 22.0, 95% CI 1.0, 68.7; n = 3 studies, 161 patients).
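To make the link between sample size and the width of these confidence intervals concrete, the sketch below computes an odds ratio and a Wald 95% CI from a single 2×2 table. It is an illustration under stated assumptions, not the method of the cited systematic review (which pooled several studies): the function name, the orientation of the table (OR above 1 meaning higher odds of recurrence on placebo) and all counts are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: placebo patients with a recurrence    b: placebo patients without
    c: drug patients with a recurrence       d: drug patients without
    OR > 1 means higher odds of recurrence on placebo, i.e. the drug protects.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this formula")
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts (not from any trial): same proportions, 10x the sample.
for label, cells in (("small", (15, 5, 8, 12)), ("large", (150, 50, 80, 120))):
    or_, lo, hi = odds_ratio_ci(*cells)
    print(f"{label}: OR = {or_:.1f}, 95% CI {lo:.1f}, {hi:.1f}")
```

Both hypothetical tables give the same point estimate, but the interval for the smaller table is much wider, which is the sense in which a small-sample estimate is imprecise even when its odds ratio looks large.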

One might conclude, therefore, that if we require the most rigorous design, i.e. prophylaxis, to establish maintenance efficacy, only lithium has been proven effective.

5. Clinical and Research Implications

We identified two different maintenance study designs, prophylaxis and relapse prevention, with the following three important limitations to the relapse prevention design: withdrawal relapse, invalid active control comparisons and inattention to relapse polarity. We found that contemporary ‘maintenance’ studies of bipolar disorder do not, in fact, observe long-term outcomes for the majority of patients studied, but rather they average about 3 months in duration.

We believe three important conclusions follow from this analysis of the peer-reviewed literature:

  1. In spite of marketing claims, misleading FDA indications and widespread belief among clinicians, atypical neuroleptics have not yet been shown, in monotherapy, to prevent new episodes of bipolar disorder, meaning episodes of opposite polarity (depressive) to the index episode (manic). In contrast, lamotrigine has shown such efficacy in one study,[13] while lithium has the most evidence of such efficacy.

  2. Prophylaxis designs provide a more valid alternative to the current standard monotherapy relapse prevention designs. Such designs would not pre-select patients to be acute responders to the experimental drug.

  3. Researchers need to achieve consensus on what we mean by maintenance efficacy, which we believe should entail benefit 6 months or longer after the index mood episode.

There is a need for a conceptual shift to prophylaxis design studies, with greater awareness of the natural history of bipolar disorder. In the natural history of untreated bipolar disorder, the frequency and duration of episodes show considerable inter-individual variation. In the great majority of patients (those with a non-rapid cycling illness), a new episode is experienced, on average, every year or two, with an untreated major depressive episode lasting about 6 months and an untreated manic episode lasting about 3 months (longer for both in the most severe cases).[6] In patients with rapid-cycling illness, the duration of such mood episodes is by definition not longer than 3 months.[6] Since prophylaxis in bipolar disorder involves preventing both depression and mania, not just one or the other, then the longer period of illness associated with depression would appear to be the correct node around which to organize our definition of a maintenance phase; 6 months or longer would be the criterion for non-rapid cycling bipolar disorder (and 3 months or longer for rapid-cycling illness).

The recent BALANCE trial[26] is a step in the right direction; in that study, an enriched design was not used, and patients were included after they had recovered from any episode with any treatment, and only needed to tolerate lithium plus valproate (not just one or the other) without side effects for 4–8 weeks before randomization to maintenance treatment. They were then followed for a mean of 21.4 months, which is much longer than the neuroleptic studies, and up to 2 years.

The study criteria mentioned above apply not only to studies of mood stabilizers, but also to studies of antidepressants. A recent systematic review[27] using the concepts described in this article examined maintenance RCTs of antidepressants for major depressive disorder, most of which were relapse prevention designs. It found that only 5 of 23 studies provided data on relapse rates before and after 6 months of maintenance follow-up. In four of those five studies, antidepressant efficacy was limited to the withdrawal relapse period (<6 months of maintenance follow-up), with no benefit after 6 months in prevention of recurrence of new depressive episodes. Further, in bipolar type II depression, a recent study[28] claimed that fluoxetine was more effective than lithium in maintenance treatment; although the study received much attention and prominent publication, it utilized a relapse prevention design in which only responders to open-label acute fluoxetine were allowed into the double-blind maintenance phase. As described throughout this article, this design guarantees the result researchers claim to show; it is not a fair assessment of maintenance drug efficacy versus lithium, and thus it is scientifically incorrect to claim that fluoxetine is more effective than lithium based on that kind of relapse prevention study.

6. Conclusions

Current maintenance studies in bipolar disorder have major limitations due to the relapse prevention design, the most important of which is withdrawal relapse into the same episode with which patients had entered the study. These studies, particularly those involving the atypical neuroleptics, do not demonstrate prevention of new mood episodes (i.e. of the opposite polarity) beyond 6 months after the index episode.

Some clinicians and researchers appear to be overestimating contemporary ‘maintenance’ data, and drawing potentially false conclusions, such as the widespread view that atypical neuroleptics are mood stabilizers. Future research would benefit, we believe, from being clear about the following two matters: (i) long-term maintenance should mean the prevention of new episodes, not just relapse back into the most recent one, and can best be demonstrated using standard prophylaxis designs, not relapse prevention designs; and (ii) the duration of observation needed to demonstrate maintenance efficacy (i.e. prevention of new episodes) is 6 months or longer after the last mood episode.