INTRODUCTION

Patients with osteoporosis can sustain fractures following falls or other minimal trauma, often leading to loss of function and independence, with annual direct medical costs estimated at $17 billion.1 Increasingly, attention has focused on identifying physicians who deliver high-quality care, especially for common medical conditions like osteoporosis, where evidence-based care can improve outcomes and/or reduce costs.2,3 Through appropriate risk assessment, diagnosis, and treatment,4,5 the risk of osteoporosis-related fractures can be reduced, alleviating consequent dependency and economic burden.

Meaningful clinical performance assessment includes clinically important evidence-based measures, feasible data collection, and a psychometrically robust assessment.2,6 Recent research on composites drawn from evidence-based measures has advanced the science of quality assessment.7–11 Composites are comprehensive measure sets that provide a more reliable assessment than individual measures alone.8,9 Using a composite to make a judgment about an individual physician’s performance in practice requires a defensible absolute standard of performance on the composite. An absolute standard that reflects the level of proficiency required for practice in the profession is a widely accepted requirement for certification testing of physicians’ knowledge.12 We have previously shown that reasonable absolute standards can be set for composites in diabetes and preventive cardiology care.10,11 Multiple individual osteoporosis measures are currently available in the National Quality Measures Clearinghouse;13 we believe this is the first study to address the development and evaluation of a composite for osteoporosis care.

In this study, we addressed measuring osteoporosis care quality using evidence-based guidelines.14,15 We applied a patented methodology16 to assess physicians’ clinical performance. We determined performance standards for both competent care, defined as the standard of care patients should expect to receive from all certified physicians, and excellent care, defined as the quality of care patients can expect to receive from a certified physician who has the knowledge and skills needed to deliver recommended care at a higher achievable standard. We then conducted analyses to understand the stability and meaningfulness of these standards in measuring osteoporosis care.

METHODS

Instrument and Sample

We used data from the American Board of Internal Medicine (ABIM) Osteoporosis Practice Improvement Module® (PIM), augmented by data on physician and practice characteristics drawn from ABIM’s registration and practice characteristics survey. The PIM is a web-based self-evaluation tool that uses medical chart reviews to help physicians improve clinical care. Physicians who make management decisions about patients with or at high risk of osteoporosis can choose to complete the Osteoporosis PIM as part of ABIM’s Maintenance of Certification (MOC) program.17

To complete the PIM, individual physicians abstracted 25 patient charts using a retrospective or prospective sequential sample, or a systematic random sample. Eligible patients were those aged 18 years and older with a diagnosis of osteoporosis, osteopenia, or a prior low-impact fracture; women aged 65 years and older; and men aged 70 years and older. Patients must have received care from the physician’s practice for at least 12 months (with at least one visit within the past 12 months). Physicians also completed a web-based adaptation of the Physician Practice Connections® Readiness Survey (PPC-RS) that assesses the physician’s practice infrastructure. The PPC-RS is a precursor to the Physician Practice Connections®–Patient-Centered Medical Home assessment survey developed by the National Committee for Quality Assurance (NCQA).18 It asks physicians to report on whether the following are present in their practice: structured quality measurement and improvement; patient data tracking systems; processes for care management; proactive management of important conditions; patient-centered self-care support and education; and technology infrastructure. The PPC-RS is scored on a scale of 0 to 100 points.

We obtained data from a retrospective cohort of 463 physicians who completed the osteoporosis PIM between November 2011 and January 2013.

Performance Measures

We used eight evidence-based process measures (Table 1; see online Appendix 1 for specifications) derived from evidence-based guidelines on the prevention, diagnosis, and treatment of osteoporosis.4,5 Eighty-two physicians were excluded because they had fewer than five eligible patients for one or more measures, yielding an analytic sample of 381 physicians. Performance on each measure was defined as the percentage of a physician’s patient panel receiving the service.

Table 1. Computation of the Competent and Excellent Care Standard Based on Measures from the ABIM Osteoporosis PIM

Composite and Standard-Setting Methodology

The methodology has been previously described9–11 and patented.16 For each measure, we established a threshold for delivering both competent and excellent care using an expert panel and an adaptation of the Angoff standard-setting method.19 A panel of 12 physicians was selected through a call for nominations to represent essential perspectives of clinical practice around osteoporosis care (online Appendix 2). All were certified in internal medicine and/or in rheumatology, endocrinology, or geriatric medicine. A majority spent at least 50 % of their time in clinical practice. The panel reached a shared understanding of the characteristics of a hypothetical physician whose care for patients with or at risk for osteoporosis would be just at the threshold for competent care (i.e., a “marginally competent” physician). This physician would have a basic understanding of osteoporosis care, but would not always deliver appropriate care to all patients.

Next, thresholds were estimated for how the hypothetical “marginally competent” physician would perform on each measure (online Appendix 3). For example, to determine the threshold for bone density testing, each panelist independently answered the question “What percentage of eligible patients seen by a ‘marginally competent’ physician would have had bone density testing?” Summary statistics describing the characteristics of patients seen by the 381 physicians were presented to the panel. After panelists shared their initial estimates with the other panel members, actual results for each measure were presented as a “reality check.” Panelists discussed differences and were given a chance to change their estimates. Final estimates were averaged across panelists to determine each measure’s threshold for competent care.

The Dunn-Rankin method was used to determine the point value for each measure used in the composite score (online Appendix 3).20 Panelists independently rated each measure’s importance to delivering quality osteoporosis care using an 11-point Likert scale (0 = Not at all important to 10 = Very important). Panelists’ ratings of importance were then used to derive point values for each measure. The points across all eight measures were scaled to sum to 100.
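
To make the scaling step concrete, the following minimal sketch (in Python, with hypothetical ratings and generic measure names; the study's actual Dunn-Rankin computation is described in online Appendix 3) shows how averaged importance ratings could be rescaled so that the eight point values sum to 100.

# Hypothetical mean importance ratings (0-10 scale) for eight generic measures;
# these values are illustrative placeholders, not the panel's actual ratings.
mean_ratings = {
    "measure_1": 9.2, "measure_2": 8.5, "measure_3": 7.8, "measure_4": 7.0,
    "measure_5": 6.5, "measure_6": 8.9, "measure_7": 7.4, "measure_8": 6.1,
}

total = sum(mean_ratings.values())
# Rescale the ratings so that the point values across all eight measures sum to 100.
point_values = {m: 100 * r / total for m, r in mean_ratings.items()}
assert abs(sum(point_values.values()) - 100) < 1e-9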

To determine the competent care standard (online Appendix 3), the threshold for each measure was multiplied by its point value (Table 1). For example, the threshold of 44.6 % for calcium intake assessment and counseling was multiplied by its point value (i.e., importance weight) of 13 (i.e., 0.446 × 13 = 5.80). The products for all measures were summed to yield the “standard” for competent care. Likewise, to determine the excellent care standard, the panel repeated the steps above (Table 1). This time, the panel considered a hypothetical “marginally excellent” physician whose care would be just at the threshold for excellent care.
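
In our notation (not the authors’), with t_i denoting the panel’s mean threshold for measure i expressed as a proportion and w_i its point value, each standard is

\[ \text{Standard} = \sum_{i=1}^{8} t_i \, w_i , \]

so that calcium intake assessment and counseling, for example, contributes 0.446 × 13 ≈ 5.80 points to the competent care standard.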

Computing a Physician’s Performance Score

A physician’s performance on each measure was multiplied by its point value, but no points were awarded unless the threshold for competent care was met. Points earned by a physician for all measures were summed to yield a composite score that ranged between 0 and 100 points (online Appendix 3).
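
A minimal sketch of this scoring rule in Python, assuming that performance and thresholds are expressed as proportions and that “meeting” a threshold means performing at or above it (our reading of the rule, not the authors’ code):

def composite_score(performance, thresholds, point_values):
    """Composite score for one physician.

    performance:  measure -> proportion of eligible patients receiving the service
    thresholds:   measure -> competent-care threshold (as a proportion)
    point_values: measure -> points assigned to the measure (sum to 100)
    """
    score = 0.0
    for measure, points in point_values.items():
        perf = performance[measure]
        if perf >= thresholds[measure]:  # no points unless the competent-care threshold is met
            score += points * perf       # performance weighted by the measure's point value
    return score                         # ranges from 0 to 100 when point values sum to 100

Under these assumptions, for example, a physician performing at 0.80 on a 13-point measure with a threshold of 0.446 would earn 0.80 × 13 = 10.4 points for that measure, and 0 points if performance fell below 0.446.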

Estimating Reliability and Classification Accuracy

We used the bootstrap sampling method (1,000 samples per physician) to estimate the reliability of the composite21 and the classification accuracy of the two standards.22,23 We estimated the standard error of measurement from the bootstrap samples and then derived the reliabilities using the classical true score model. Classification accuracy describes how well the decisions made (meeting or not meeting a standard) minimize false classifications. For both standards, each physician was first classified as meeting or not meeting the standard based on the observed sample. Second, a composite was calculated for each of the physician’s bootstrap samples and a classification was made. If the classification decisions were the same for the bootstrap and observed samples, the decision was deemed accurate. The proportion of accurate classifications over all bootstrap samples was calculated for each physician. These proportions were then averaged across physicians to compute the classification accuracy index, which can range from 0 to 1.
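
The sketch below (a simplified illustration under stated assumptions, not the authors’ procedure, which follows references 21–23) shows the bootstrap logic for a single physician. It assumes patient-level 0/1 indicator data for each measure and reuses the composite_score function sketched above.

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_composites(patient_data, thresholds, point_values, n_boot=1000):
    """patient_data: measure -> numpy array of 0/1 indicators over the physician's
    eligible patients. Returns n_boot bootstrap composite scores."""
    composites = np.empty(n_boot)
    for b in range(n_boot):
        # Resample patients with replacement and recompute each measure's performance.
        perf = {m: rng.choice(x, size=len(x), replace=True).mean()
                for m, x in patient_data.items()}
        composites[b] = composite_score(perf, thresholds, point_values)
    return composites

def classification_accuracy(observed_score, boot_scores, standard):
    """Proportion of bootstrap composites classified the same way
    (meeting vs. not meeting the standard) as the observed composite."""
    return np.mean((boot_scores >= standard) == (observed_score >= standard))

# The standard error of measurement for a physician is the standard deviation of
# boot_scores; under the classical true score model, reliability can then be estimated
# as 1 - (mean squared SEM across physicians) / (variance of observed composites).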

Statistical Analyses

Chi-square tests and t-tests were used to compare our study sample with physicians who completed other available ABIM PIMs. We computed the coefficient of variation and the intraclass correlation coefficient to assess inter-rater agreement in panelists’ final estimates of each measure’s threshold. To examine the validity of the assessment, we used multivariate regression to determine whether physician characteristics (e.g., specialty), practice characteristics (e.g., practice type), and practice infrastructure level (PPC-RS score) were meaningfully associated with the composite. We used the same set of variables in logistic regression models to examine whether they predicted meeting the competent and excellent care standards. We conducted a sensitivity analysis to evaluate the effect of patient risk group (e.g., having osteoporosis versus being at risk due to age) on the composite. Statistical analyses were conducted using SAS Version 9.3. All data were Health Insurance Portability and Accountability Act (HIPAA) compliant, and data were reported only in aggregate. Permission to use data for research purposes was granted by physicians upon enrollment in MOC.24 The ABIM Privacy Policy can be found at www.abim.org/privacy.aspx. We were blinded to physicians’ identities and viewed and analyzed the data only in aggregate. Essex Institutional Review Board, Inc. approved this study.
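
As an illustration only (the study’s analyses were conducted in SAS 9.3), physician-level regression models of this kind could be specified as follows in Python with statsmodels; the file and variable names are hypothetical placeholders.

import pandas as pd
import statsmodels.formula.api as smf

# One row per physician; column names are hypothetical placeholders for the
# characteristics described above.
df = pd.read_csv("physician_level_data.csv")

# Linear model associating characteristics with the composite score.
ols_fit = smf.ols(
    "composite ~ C(specialty) + C(practice_type) + ppc_rs"
    " + mean_patient_age + pct_female_patients", data=df).fit()

# Logistic models for meeting each standard (binary outcomes).
logit_competent = smf.logit(
    "met_competent ~ C(specialty) + C(practice_type) + ppc_rs"
    " + mean_patient_age + pct_female_patients", data=df).fit()
logit_excellent = smf.logit(
    "met_excellent ~ C(specialty) + C(practice_type) + ppc_rs"
    " + mean_patient_age + pct_female_patients", data=df).fit()

print(ols_fit.summary())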

RESULTS

Most of the 381 physicians used retrospective sequential (54.3 %) or systematic random (33.6 %) samples. Table 2 compares physician and practice characteristics of our study sample with the initial 463 osteoporosis PIM completers and with 11,813 physicians who completed any ABIM PIM during the same time period. No statistically significant differences were observed between our study sample and the initial sample. Compared with all physicians completing any PIM, our study sample was similar in age and internal medicine certification examination scores, but differed in other characteristics. Our sample had a larger proportion of general internists (55.6 %) and subspecialists whose training includes emphasis on osteoporosis: rheumatologists (19.2 %) and endocrinologists (10.0 %).

Table 2. Physician and Practice Characteristics of the Study Sample and All Physicians Who Completed Any One of the ABIM PIMs

The mean number of charts abstracted was 25.36 (SD = 2.0; range: 24-50). The mean patient age was 72.21 years (SD = 10.48); 89.4 % were female; 39.8 % had osteoporosis, 28.5 % had osteopenia, and 3.1 % had a prior low-impact fracture; 23.6 % were women 65 years or older and 5.0 % were men 70 years or older without these diagnoses. Table 1 shows physicians’ mean performance on each measure. The reliability coefficients of individual measures ranged from 0.69 to 0.97. The estimated reliability of the composite was 0.92, indicating that 92 % of the variance in measured performance reflects true differences among physicians rather than random error. The mean composite score was 67.54 out of a possible 100 points (SD = 18.22).

Table 1 presents the thresholds and number of points assigned to each measure and the standards for both competent and excellent care. The absolute percentage point differences between the panel’s overall initial and final threshold estimates were small for most measures (2.3 % to 6.7 % for competent care; 2.1 % to 4.6 % for excellent care), except for the falls risk management competent care threshold (12.8 %). The variability in panelists’ final threshold ratings was small (coefficients of variation: 0.06 to 0.28 for competent care; 0.02 to 0.13 for excellent care). The inter-rater agreement coefficients (intraclass correlations of 0.81 for competent care and 0.68 for excellent care) indicated substantial and moderate agreement, respectively, in panelists’ final judgments. Panelists showed consistency in their ratings of the importance of individual measures; variability in points assigned was small, ranging from 0.4 to 2.1 points on average.

The standard for competent care was 46.87 points (Table 1). Figure 1 displays a histogram of physician composite scores. Three hundred and twenty-one physicians (84.3 %) met that standard. Performance on every measure differed significantly between physicians who met the standard and those who did not. A higher proportion of rheumatologists (94.5 %) and endocrinologists (92.1 %) met the standard than general internists (81.1 %) and geriatricians (76.7 %). The classification accuracy index was high (0.95), meaning that, with repeated sampling, the same classification result (meeting the standard or not) would occur 95 % of the time.

Figure 1. Distribution of composite scores with standards for performance (N = 381 physicians).

The standard for excellent care was 83.58 points (Table 1) and was met by 85 physicians (22.3 %). A higher proportion of rheumatologists (38.4 %) and endocrinologists (34.2 %) met the excellent care standard than general internists (16.0 %) and geriatricians (26.7 %). The classification accuracy index was again high (0.95).

Table 3 presents the associations between physician characteristics, patient characteristics, practice infrastructure (PPC-RS) scores, and the composite. Twenty percent of the variance in the composite was explained by the regression model. A few variables accounted for most of the explained variance, as indicated by their squared partial correlation coefficients: PPC-RS score (0.058), certification in rheumatology (0.059) or endocrinology (0.026), and proportion of female patients (0.028). Controlling for other characteristics, rheumatologists and endocrinologists were estimated to have higher composite scores (by 11.21 and 9.65 points, respectively) than general internists; a physician with a PPC-RS score of 73.3 (75th percentile) would have a predicted composite score 3.05 points higher than one with a score of 59 (50th percentile). The average patient age in a physician’s sample was negatively related to the physician’s composite score; the proportion of female patients was positively related.

Table 3. Results of Multivariate Regression Analysis Associating Physician and Patient Characteristics with Composite Scores (N = 364)*

Table 4 presents the logistic regression results for the association between physician and practice characteristics and meeting the two care standards. Controlling for other characteristics, rheumatologists were more likely than general internists to meet both the competent (adjusted odds ratio [AOR], 7.41; 95 % CI, 1.67 to 32.84) and excellent (AOR, 3.56; CI, 1.79 to 7.09) care standards. Endocrinologists were more likely than general internists to meet the excellent care standard (AOR, 3.24; CI, 1.34 to 7.85) and also tended to be more likely to meet the competent care standard, although this result was not statistically significant (AOR, 3.46; CI, 0.70 to 17.2). Each one-point increase in PPC-RS score (out of a possible 100 points) was associated with a 2 % and 3 % increase in the odds of meeting the competent (AOR, 1.02; CI, 1.01 to 1.04) and excellent (AOR, 1.03; CI, 1.01 to 1.04) care standards, respectively. Each one-year increase in sampled patients’ average age was associated with a 10 % reduction in the odds of meeting the competent care standard (AOR, 0.90; CI, 0.83 to 0.98), and each 1 % increase in the proportion of sampled female patients was associated with a 4 % increase in those odds (AOR, 1.04; CI, 1.02 to 1.06).

Table 4. Odds of Meeting the Competent Care and Excellent Care Standard From a Single Multivariable Logistic Regression Model Including All Listed Characteristics (N = 364)*

In the sensitivity analysis, we compared results based only on patients with osteoporosis with the original results, which also included patients with other risk factors. Two hundred and seventy-two physicians had at least five eligible patients for each measure. Composite scores increased by an average of 4.1 points. Classification changed for 43 physicians (15.9 %): 21 newly met the competent care standard, 18 newly met the excellent care standard, and two physicians fell below each standard.

DISCUSSION

We evaluated individual physicians’ osteoporosis care by creating a robust composite using clinical data and a rigorous methodology for determining absolute standards for both competent and excellent care. Composite scores were meaningfully associated with physician and practice characteristics, supporting the validity of the methodology.25 Subspecialists in rheumatology and endocrinology demonstrated higher quality care than general internists on our composite; this is expected, given that their training and practices are more likely to focus on care for osteoporosis patients than those of generalists, who care for patients with a broad spectrum of healthcare needs. The negative association between patient age and composite scores might be due to the competing needs of elderly patients with multiple chronic conditions. The association between patient gender and the composite indicated that osteoporosis care for men lagged behind care for women; although osteoporosis and osteoporosis-related fractures occur more often in women, the mortality rate associated with fractures is higher in men.26 Restricting the sample to patients with diagnosed osteoporosis, rather than including all patients at risk, modestly increased physicians’ composite scores and changed some classifications.

Both the competent and excellent care standards classified physicians accurately, as indicated by their high classification accuracy values. Most physicians (84 %) met the competent care standard, and about 22 % met the excellent care standard. Some characteristics were more strongly associated with meeting the competent care standard than with meeting the excellent care standard. Notably, the age and gender of the patients sampled were statistically significant predictors of meeting only the competent care standard. The competent care standard serves to identify physicians whose care for osteoporosis patients was below what patients should be able to expect; individual measures highlight specific areas for improvement. With more research on the composite’s generalizability and linkage with outcomes, performance reports based on this approach could potentially inform patients’ and purchasers’ health care choices, reward physicians providing superior care, and guide improvement in care.

We used the results from this study to help physicians in the ABIM MOC program identify areas for improvement in osteoporosis care. Prior cognitive testing with physicians who completed other ABIM PIMs indicated that performance relative to their peers as well as to an absolute standard provided valuable feedback that they were not getting from any other source.11 We created feedback reports for the ABIM Osteoporosis PIM that include histograms of the composite and individual measures showing the physician’s performance compared to the two absolute performance standards and relative to their peers. Physicians see only their own individual performance and the distribution of the performance of peers. For physicians with fewer than five eligible patients for some measures, the composite score cannot be calculated; information on performance on individual measures is provided to guide improvement activities.

This study has several limitations. First, physicians voluntarily selected the osteoporosis PIM, so these results may not generalize to all physicians providing osteoporosis care. Second, there was no audit process to ensure that instructions for sampling and data extraction were followed. However, prior work confirmed the accuracy of physician-reported data in a quality improvement framework,27 and physicians had no reason to provide false results, as there were no consequences for low or high performance. Third, we based the relative importance of each measure on ratings by 12 experts; this weighting may be better informed by further research on outcomes of care. The range of experts’ experience and perspectives, however, provided a balance that promoted fairness. Finally, there was no control for differences in patient panel characteristics; for process measures, formal risk adjustment is not generally necessary.

CONCLUSION

We have established an approach for assessing a physician’s quality of osteoporosis care that is grounded in evidence-based guidelines and empirical data. Comparing performance on individual measures and on the composite against absolute standards of care may provide a more meaningful way to judge whether physicians delivered evidence-based care than a relative standard that depends only on how other physicians perform. Our hope is that this type of feedback will help physicians identify areas where they can improve osteoporosis care and thereby decrease morbidity, dependency, and healthcare costs.