Background

Organizations that provide services to persons with severe and persistent mental illness are under increasing pressure to demonstrate that those services achieve recognizable and measurable outcomes with respect to client functioning. The concept of “recovery” can be operationalized either from the perspective of the consumer of mental health services or from that of the agency providing those services, and current efforts to measure client outcomes have used both approaches (Kidd et al. 2004; Test et al. 2005).

When considered from the perspective of the consumer, recovery is assessed from the internal experience of individuals, who use terms such as “becoming empowered,” “taking charge of their own lives,” or “becoming responsible for themselves.” Other aspects of recovery include the mitigation of psychiatric symptoms and improvement in overall functioning, as well as identifying and taking on meaningful roles in life (McGlynn 1996; Uehara et al. 2003).

Several scales have been created for use with mental health consumers in community settings. These include the Psycho-Social Well-Being Scale (PSWS) (O’Hare et al. 2003), the Camberwell Assessment of Need and Behavior and Symptom Identification Scale (Trauer and Tobias 2004), and the Satisfaction With Life Scale (Test et al. 2005), which were developed to assess consumers with respect to social and other functioning rather than focusing solely on psychiatric symptoms.

When recovery is considered from the perspective of the agency, the focus has been on several aspects: the need to evaluate the level of functioning of the client, as well as the types of services used that contributed to improved functional outcomes. The Global Assessment of Functioning (GAF) is one instrument originally developed for this purpose (Endicott et al. 1976; Greenberg and Rosenheck 2005; Hall 1995; Moos et al. 2002). The Cornell Service Index is a measure of health services usage among clients of outpatient mental health services (Sirey et al. 2005), while the Clinical Strategies Implementation Scale is frequently used to assess whether evidence-based practices are used in mental health services for persons with schizophrenia (Falloon et al. 2005; Resnick 2005).

The Level of Care Utilization System (LOCUS) (Sowers et al. 1999) was developed to provide a way to assess the service needs of adults and to quantify services based on the amount and scope of resources available to clients at each level of service. The LOCUS assesses clients on six dimensions: (1) Risk of Harm; (2) Functional Status; (3) Medical, Addictive and Psychiatric Co-Morbidity; (4) Recovery Environment; (5) Treatment and Recovery History; and (6) Engagement. Five of the dimensions are scored on a five-point scale, with a score of one denoting, for example, minimal risk of harm (Dimension 1, Risk of Harm) or minimal impairment (Dimension 2, Functional Status); higher scores indicate greater risk or greater impairment. One dimension (Dimension 4, Recovery Environment) has two defined subscales: the first denotes the Level of Stress in the Recovery Environment and the second denotes the Level of Support in the Recovery Environment. As with the other dimensions, lower scores indicate a low-stress environment or a highly supportive environment, and higher scores represent the negative ends of the continuum.

In 1997, the California Association of Social Rehabilitation Agencies (CASRA) convened a group of 50 administrators, clinicians, and consumers for the purpose of creating a system to classify consumers according to their needs. The CASRA project concluded that consumers could be assigned to clusters based on their level of risk, their level of coping skills and supports, and their level of engagement with the mental health system. The movement of consumers from one cluster to another could itself be viewed as an outcome, could reasonably be seen as a description of “the process of recovery,” and could be counted as such by service providers.

In November 2004, voters in the state of California passed the Mental Health Services Act (MHSA), which provided steady funding for mental health services through an additional tax. Recovery was seen as the basis for services delivered under this Act.

This focus on recovery has significant implications not only for the types of mental health services offered and the manner in which they are delivered, but also for the way in which the effectiveness (outcomes) of mental health programs and systems is evaluated. McGlynn (1996) described five major domains of outcome measurement for mental health programs: clinical status (how a disorder is defined, particularly in terms of the presence and severity of symptoms); functional status (the ability of an individual to perform age-appropriate activities); quality of life (the impact of decrements in functioning on an individual’s perception of his or her quality of life); adverse events (negative outcomes, such as hospitalization, mortality, or incarceration, that result from system problems and could be avoided with appropriate care); and satisfaction with care (the consumer’s perception of the quality of the care that she or he received). The concept of overall recovery from a disabling mental illness as a domain of outcome measurement is now of major importance for the evaluation of mental health services (McGlynn 1996; Sowers 2005).

The groups or clusters created by the CASRA workgroup provided the framework for the development of the Milestones of Recovery Scale (MORS) (Pilon et al. 2006). The MORS consists of three underlying dimensions: the consumer’s (a) level of risk, (b) level of engagement with the mental health system, and (c) level of skills and supports. The consumer’s level of risk comprises three primary factors: (a) the consumer’s likelihood of causing physical harm to self or others, (b) the consumer’s level of participation in risky or unsafe behaviors, and (c) the consumer’s level of co-occurring disorders. The consumer’s level of engagement is the degree of connection between the consumer and the mental health service system. Finally, the consumer’s level of skills and supports should be viewed as the combination of the consumer’s abilities and support network(s) and the degree to which the consumer needs staff support to meet his or her needs. It should include an assessment of the consumer’s skills in independent living (e.g., grooming, hygiene), cognitive impairments, whether or not the consumer is engaged in meaningful roles in life (e.g., school, work), and whether the consumer has a support network of family and friends. The eight levels of the MORS can be found in Table 1.

Table 1 The Milestones of Recovery Scale (MORS) and how it is used

This paper reports on the psychometric properties of the MORS.

Methods

Inter-Rater Reliability Study: Long Beach, California

The inter-rater reliability study took place at The Village Integrated Service Agency during the month of October 2005. All active clients were rated by two to five raters: each client was rated by a psychiatrist or a case manager and by either a neighborhood leader or one other staff person who knew the client well. A total of 49 raters rated 431 clients. The intra-class correlation coefficient was calculated using PROC MIXED (Littell et al. 2006) in SAS version 9.1.3; PROC MIXED provides the within- and between-rater variance components required for the calculation of the intra-class correlation even when the number of raters per client differs. A coefficient of .70 was chosen as the criterion for acceptable inter-rater reliability (Nunnally and Bernstein 1994).
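
Although the study used SAS, the same variance-components approach can be sketched in other software. The following is a minimal, illustrative Python sketch (using statsmodels in place of PROC MIXED, and synthetic data in place of the study ratings): a random-intercept model is fit with client as the grouping factor, and the intra-class correlation is formed as the between-client variance divided by the total variance. The data values, column names, and parameter settings here are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Synthetic long-format ratings standing in for the study data: each client
# has a "true" milestone plus rater-specific noise, and the number of raters
# per client varies from two to five, as in the study design.
rows = []
for client in range(200):
    true_level = rng.integers(1, 9)          # MORS milestones run 1-8
    for rater in range(rng.integers(2, 6)):  # 2-5 raters per client
        rating = int(np.clip(np.rint(true_level + rng.normal(0, 0.7)), 1, 8))
        rows.append({"client_id": client, "rater_id": rater, "mors": rating})
df = pd.DataFrame(rows)

# Random-intercept model with client as the grouping factor; the mixed model
# accommodates the unequal number of raters per client.
result = smf.mixedlm("mors ~ 1", data=df, groups=df["client_id"]).fit(reml=True)

between_client_var = float(result.cov_re.iloc[0, 0])  # variance between clients
residual_var = result.scale                           # within-client (rater) variance

icc = between_client_var / (between_client_var + residual_var)
print(f"ICC = {icc:.2f}")
```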

Inter-Rater Reliability Study: Vinfen Corporation, Boston, Massachusetts

Vinfen Corporation is the largest non-profit provider of housing services to people with psychiatric disabilities in New England. Vinfen’s housing program comprises 80 programs within the Psychiatric Rehabilitation Division (PRD). These include homeless outreach, residential services, supported housing, and the Program of Assertive Community Treatment (PACT), with specialized programs for transitional aged youth, people living with HIV and AIDS, and people with co-occurring mental health and substance abuse disabilities.

In November 2005, one of the co-authors of this paper provided 2 days of consultation to Vinfen Corporation on the use of the MORS. This included 3 h of training for managers and staff of four pilot programs. In 2–3 h increments, each program assigned initial ratings to all persons served. A range of program types across the PRD was included in the pilot (N = 105). Vinfen has a total of 240 possible slots for clients across these programs: a PACT team with 72 persons served; an outreach team serving 114 persons; a transitional aged youth program serving 19 persons; and supportive housing serving 35 persons. Each program rated all participants by consensus at the same time each month. Three programs rated all participants in one sitting; one program rated 25% of its participants per week over the course of the month.

After the staff had been trained, the pilot study itself took place in April 2006. Each client was assigned a primary rater (usually a case manager) and a secondary rater, and each rater was blind to the other’s rating. A total of 105 clients were rated by two individuals, and both ratings for each client took place on the same day.

Test–Retest Reliability Study

The test–retest reliability study was conducted at two points in time during the month of September 2005 at The Village in Long Beach, California. Three hundred and eighty-one clients were rated at both points in time (431 at time 1 and 381 at time 2). The time interval between ratings ranged from 10 to 20 days.

Validity Study

Scores on an existing measure, the Level of Care Utilization System (LOCUS) (Sowers et al. 1999, 2003), were obtained for all clients for whom a MORS score was obtained over a 6-month period, January through June 2005; both the LOCUS and the MORS were completed for all clients in each of these 6 months. Spearman correlation coefficients were computed, and a coefficient of .49 was used as the criterion for acceptable validity (Nunnally and Bernstein 1994).
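
As an illustration of this validity analysis, the sketch below (in Python, with synthetic scores standing in for the actual MORS and LOCUS data) computes a Spearman correlation for paired scores and an approximate 95% confidence interval via the Fisher z transformation; the confidence-interval method is a common approximation and is not necessarily the one used in the original analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-ins for one month of paired scores: a shared latent factor
# drives both measures so that they are correlated but not identical. The
# direction and strength of the relationship here are illustrative only.
latent = rng.normal(size=300)
mors = np.clip(np.rint(4.5 + 1.5 * latent + rng.normal(0, 1.0, 300)), 1, 8)
locus_subscale = np.clip(np.rint(3.0 + 0.8 * latent + rng.normal(0, 0.8, 300)), 1, 5)

rho, p_value = stats.spearmanr(mors, locus_subscale)

# Approximate 95% confidence interval for rho via the Fisher z transformation.
z = np.arctanh(rho)
se = 1.0 / np.sqrt(len(mors) - 3)
lower, upper = np.tanh([z - 1.96 * se, z + 1.96 * se])

print(f"rho = {rho:.2f} (95% CI = {lower:.2f}, {upper:.2f}), p = {p_value:.3g}")
print("meets the .49 criterion" if abs(rho) >= 0.49 else "below the .49 criterion")
```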

Results

The demographics of the clients on whom the inter-rater reliability data were obtained can be found in Table 2. The inter-rater reliability achieved using clients and staff of The Village Integrated Service Agency was r = .85 (95% CI = .81, .89). The inter-rater reliability using clients and staff of Vinfen Corporation was r = .86 (95% CI = .80, .90). The test–retest reliability using clients and staff of The Village was r = .85 (95% CI = .81, .87).

Table 2 Demographics

Clients from each of the Vinfen programs participated in the inter-rater reliability study: 57 from the PACT program, 16 Transitional Aged Youth (TAY), 16 from the outreach program, and 17 from the supported housing program. There were some differences between Vinfen clients who participated in the inter-rater reliability study and those who did not. While there were no significant differences between rated and unrated clients with respect to gender, a significant difference was found for race/ethnicity (χ²(3) = 9.71, P = .02), with greater numbers of White clients and fewer Black and Asian clients participating in the inter-rater reliability study. Clients who were rated were younger (M = 45.13, SD = 13.58) than clients who were not rated (M = 48.99, SD = 10.78; t(174) = 2.06, P = .04). Clients who were rated also had higher GAF scores (M = 53.63, SD = 7.71) than clients who were not rated (M = 48.07, SD = 8.32; t(174) = 3.25, P = .0017).
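
For readers who wish to run this kind of rated-versus-unrated comparison themselves, a minimal sketch follows; the counts and group scores below are hypothetical stand-ins, and the chi-square test of independence and independent-samples t test are the standard SciPy implementations rather than the software used in the study.

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table of participation by race/ethnicity
# (rows: rated, unrated; columns: four race/ethnicity groups).
counts = np.array([[60, 25, 12, 8],
                   [30, 28, 10, 3]])
chi2, p, dof, expected = stats.chi2_contingency(counts)
print(f"chi-square({dof}) = {chi2:.2f}, p = {p:.3f}")

# Hypothetical ages drawn near the reported group means and SDs; the real
# analysis used the observed ages. An independent-samples t test compares
# the group means (GAF scores would be handled the same way).
rng = np.random.default_rng(2)
age_rated = rng.normal(45.1, 13.6, 105)
age_unrated = rng.normal(49.0, 10.8, 71)
t_stat, p = stats.ttest_ind(age_rated, age_unrated)
df = len(age_rated) + len(age_unrated) - 2
print(f"t({df}) = {t_stat:.2f}, p = {p:.3f}")
```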

The validity coefficients and 95% confidence intervals using the LOCUS can be found in Table 3. The acceptable validity criterion of r = .49 was met for all LOCUS subscales except the Level of Support subscale.

Table 3 MORS overall rating and LOCUS subscale validity coefficients and 95% confidence intervals

Discussion

This paper reported on the reliability and validity of a measure for assessing client outcomes among persons with severe and persistent mental illness. Inter-rater and test–retest reliabilities were good overall. Two independent sites were used to obtain inter-rater reliability, providing a strong test of the instrument’s ability to be used consistently among several raters, in different settings, and with different populations, and the reliabilities obtained at the two sites were remarkably similar.

The test–retest reliability of the MORS was also good and reached acceptable levels.

One measure of validity, the LOCUS, was used in this study. The LOCUS was subjected to rigorous reliability and validity studies when it was developed (Sowers et al. 1999, 2003), and it provides strong support for the convergent construct validity of the MORS. One subscale of the LOCUS, the Level of Support, did not meet the stated criterion of .49. The Medical, Addictive, and Psychiatric Co-Morbidity (MAP) subscale met the criterion of .49; however, the lower bound of its 95% confidence interval fell below .49. We do not view this as a problem with the MORS, in that any instrument will discriminate more strongly in some areas than in others.

In the studies of inter-rater reliability, test–retest reliability, and validity reported here, the MORS was subjected to methodologically rigorous evaluation. Compared with the reliability and validity techniques used for other instruments, which have relied on the scoring of vignettes or clinician case studies (Sowers et al. 1999), we believe our methodology is strong.

It is important to note that the MORS is designed to be used as an administrative tool, not a clinical one. Its purpose is to describe the general parameters of recovery, not to prescribe the individual process of recovery. No classification system can do justice to the uniqueness of the individuals it attempts to classify. We propose that the MORS be used for purposes of program accountability and the establishment of program benchmarks; it should guide the collaboration between staff and consumers in the development and refinement of treatment plans, as well as provide a framework within which staff can consider and act on the current status of program participants.

There are several limitations to this study that must be noted. First, the time period between the two inter-rater reliability ratings differed at the two sites: both ratings at the Vinfen site in Massachusetts took place on the same day, while the inter-rater reliability ratings at the Village in Long Beach took place a week apart. We cannot assess what impact this difference in timing may have had on the ratings. Second, a limited number of variables were available for both samples. No LOCUS measures were obtained on the Vinfen clients, and the Global Assessment of Functioning was not obtained on the Village clients; because different scales were available for the two samples, our ability to compare them is reduced. Finally, clients from the Village were similar to clients at Vinfen who received services under the Program of Assertive Community Treatment (PACT) model; however, Vinfen also served clients in two additional treatment modalities, the Transitional Aged Youth (TAY) and outreach programs. These TAY and outreach clients may differ from the other Vinfen clients and the Village clients in important ways that we have not measured.

Some future directions in the adaptation and use of the MORS include consideration of the following questions. First, are different services more or less effective at different milestones of recovery? What is the “typical” path of a person in recovery, and can such a path be described using the MORS? Can we hold service providers accountable for moving people through the milestones? And finally, should we set expectations for service providers to move certain percentages of their consumers to higher milestones over a set amount of time?