Intraobserver Reliability on Classifying Bursitis on Shoulder Ultrasound

Purpose: Bursitis is a common musculoskeletal cause of shoulder pain and treatment varies; thus, correctly diagnosing and grading bursitis is paramount in deciding management. Our aim was to assess reliability in grading shoulder bursitis on ultrasonography among fellowship-trained musculoskeletal radiologists at our institution. Methods: Retrospective study of patients diagnosed with bursitis on ultrasonography. Single sonographic images of the subacromial-subdeltoid bursa were collected for each patient and randomized to form a test-bank of varying degrees of bursitis. Three months after the test was administered, the cases were randomized and readministered. The radiologists graded each case as: within normal limits, mild, moderate, or severe. Intraobserver reliability was measured using Cohen's kappa coefficient. A linear regression model was used to assess the correlation between years of experience and kappa. Results: Ten radiologists reviewed 70 cases of bursitis. Kappa values ranged from .53 to .91, indicating 'moderate' to 'almost perfect' agreement amongst radiologists. A moderate positive correlation (r = .69) between increasing years of experience and improving reliability exists. Conclusion: Fellowship-trained musculoskeletal radiologists were able to grade shoulder bursitis with moderate to almost perfect agreement, with a positive correlation between improved reliability and increasing experience. This may help clinicians choose the correct treatment more confidently in their patients with shoulder pain.


Introduction
Shoulder pain is the third most common musculoskeletal-related reason for seeking medical attention in the United States.1 While the underlying cause of shoulder pain can be highly variable, a correlation with bursitis has been found: Draghi et al2 found that in people presenting with shoulder pain, regardless of the cause, there was a common association between pain and the presence of bursitis.
Bursae are synovial-lined structures that function to minimize friction between structures moving against each other.2 The bursa is considered a potential space, seen on ultrasonography (US) as hypoechoic tissue between hyperechoic peribursal fat.2,3 Bursitis refers to swelling or inflammation of the bursa. The term bursitis is often a misnomer, however, as not all cases arise primarily from an inflammatory process; some instead reflect non-inflammatory swelling of the bursa.4 In cases of bursitis on US, the bursa appears fluid-filled and is lined with a hyperechoic wall.5 The normal shoulder joint comprises multiple bursae, including the subacromial-subdeltoid (SA-SD) bursa. The SA-SD bursa is composed of two bursae that lie under the deltoid muscle and acromioclavicular joint and overlie the rotator cuff and bicipital groove.6,7 In people presenting with shoulder pain, there is often an association with SA-SD bursitis, regardless of the aetiology.2 Many pathologies may cause SA-SD bursitis, including repetitive stress or overuse, rotator cuff injury, trauma, rheumatoid arthritis, infection and pigmented villonodular synovitis.2,5 The treatment for bursitis is usually conservative, including activity modification, physiotherapy, nonsteroidal anti-inflammatory drugs and corticosteroid injections; surgical resection of the bursa is reserved for treatment-resistant cases. Thus, correctly diagnosing bursitis is important in choosing the correct management for patients with shoulder pain.
Although the findings of bursitis are relatively straightforward, there are no guidelines or classification systems that allow for standardized grading of bursitis. This leaves assessment subjective, which can produce both intraobserver and interobserver variability. For example, Naredo et al8 found that interobserver variability exists between experts in musculoskeletal (MSK) US (including a combination of radiologists and rheumatologists), with 84% agreement for diagnosing bursitis on shoulder US. This is most likely attributed to differences in opinion about what constitutes bursitis, mainly whether the presence of inflammation is necessary for diagnosis. While controlling for these differing opinions in defining bursitis, our aim was to determine whether intraobserver variability exists among fellowship-trained MSK radiologists at our institution. This could provide healthcare providers with information to choose the correct treatment more confidently for their patients presenting with shoulder pain.

Methods
We conducted a retrospective study of patients who were diagnosed with bursitis on shoulder US at our institution between January 1, 2019 and December 31, 2020. Research ethics board approval was obtained. Included patients were between 18 and 69 years of age. Patients with incomplete imaging and full-thickness rotator cuff tendon tears were excluded. A total of 70 patients were analyzed, including a small subset of control cases. We collected single sonographic images, with standard window presets, of the SA-SD bursa from our institution's Picture Archiving and Communication System for each patient. Images were acquired by MSK-trained sonographers. These single images were randomized to form a 'test-bank' of varying degrees of shoulder bursitis.
The test-bank was administered to all participating fellowship-trained MSK radiologists (N = 10) within Hamilton in the form of an electronic document (Microsoft PowerPoint presentation). The participants were asked to grade each case as: within normal limits, mild, moderate, or severe. Given that no gold standard exists for grading bursitis, the present study did not seek to provide objective measurements for determining each grade; the participants were instead asked to grade based on their prior training and experience. The bursa was measured at its widest thickness between the peribursal fat and the superficial margin of the supraspinatus muscle, in a plane parallel to the transducer beam.9 Following the first administration, the test-bank was randomly reordered and readministered 3 months later. The participants were then asked to grade each case again, without knowing the grading previously assigned to each case. The participants were also asked how many years they had been practicing MSK radiology.
Data were collected and analyzed in Microsoft Excel. Cohen's kappa coefficient was calculated to determine intraobserver reliability across the four categorical grades (within normal limits, mild, moderate and severe). A linear regression model was used to assess the correlation between the kappa coefficient, as the dependent variable, and the radiologist's years of experience, as the independent variable. The corresponding P-value and Pearson correlation coefficient (r) were calculated; statistical significance was declared when P < .05. The Pearson correlation coefficient ranges between −1.0 and +1.0, with the sign of the correlation coefficient representing the direction of the relationship. The strength of the correlation was defined based on the absolute value of r as perfect (r = 1.0), strong (.8 ≤ r < 1.0), moderate (.5 ≤ r < .8), weak (.2 ≤ r < .5), very weak (.0 < r < .2) or no association (r = .0).10 Data analysis was performed using SPSS version 28.0 (SPSS Inc., Chicago, IL).
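For reference, Cohen's kappa follows the standard definition (a worked restatement; the formula is not given explicitly in the original text):

\[
\kappa = \frac{p_o - p_e}{1 - p_e}
\]

where \(p_o\) is the observed proportion of cases graded identically on the two test administrations and \(p_e\) is the proportion of agreement expected by chance, estimated from the marginal grade frequencies of each reading. For example, a hypothetical reader with \(p_o = .90\) and \(p_e = .40\) would have \(\kappa = (.90 - .40)/(1 - .40) \approx .83\).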

Results
A total of 10 MSK-trained radiologists volunteered to assess single sonographic images of the SA-SD bursa in 70 different patients. The variables measured between the two iterations of the test, for each participant, are demonstrated in Table 1. This includes the number of disagreements and the number of disagreements spanning >1 category (e.g., a grade of bursitis designated as 'mild' on one test and 'severe' on the other, or 'within normal limits' on one test and either 'moderate' or 'severe' on the other). Cohen's kappa coefficient, which measures the level of agreement between the tests, as well as the kappa interpretation, are also presented in Table 1. The following standard has been proposed to interpret the strength of agreement for the kappa coefficient11: <.01 = poor, .01-.20 = slight, .21-.40 = fair, .41-.60 = moderate, .61-.80 = substantial and .81-1.00 = almost perfect.
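As a concrete illustration of the computation and the interpretive standard, the sketch below computes an unweighted Cohen's kappa for one reader's two test administrations and maps it onto the Landis and Koch descriptors. This is our own illustration in Python, not the study's Excel/SPSS workflow, and the example gradings are hypothetical.

```python
# A minimal sketch, assuming hypothetical gradings: unweighted Cohen's kappa
# for one reader's two test administrations, mapped onto the Landis and Koch
# strength-of-agreement descriptors quoted above.
from sklearn.metrics import cohen_kappa_score

GRADES = ["within normal limits", "mild", "moderate", "severe"]

# Hypothetical gradings of the same six cases on the two administrations.
test1 = ["mild", "moderate", "severe", "within normal limits", "mild", "moderate"]
test2 = ["mild", "moderate", "moderate", "within normal limits", "mild", "moderate"]

def landis_koch(kappa: float) -> str:
    """Map a kappa value to the Landis and Koch descriptor."""
    for upper, label in [(0.01, "poor"), (0.21, "slight"), (0.41, "fair"),
                         (0.61, "moderate"), (0.81, "substantial")]:
        if kappa < upper:
            return label
    return "almost perfect"

kappa = cohen_kappa_score(test1, test2, labels=GRADES)
print(f"kappa = {kappa:.2f} ({landis_koch(kappa)})")
```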
The kappa coefficient with standard error for each participant is demonstrated in Figure 1. To demonstrate the trend of agreement with level of experience, the x-axis is organized in increasing years of experience. There was a moderate positive correlation between years of experience and improved reliability (r = .69, P = .026). Figure 2 displays the distribution of the differing grades of bursitis severity assigned by each radiologist. Although interobserver variability was not measured in this study, this figure further illustrates the known existence of interobserver variability between the different radiologists. Representative cases of patients with each grade of bursitis that were unanimously agreed upon amongst all radiologists are demonstrated in Figure 3.

[Table 1 note: Intraobserver variability was measured using the kappa coefficient. 'Number of disagreements >1 category' refers to disagreements between two noncontiguous categories (between within normal limits and moderate, between within normal limits and severe, and between mild and severe). CI, confidence interval; *significant difference (P < .001).]
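A minimal sketch of the experience-versus-reliability regression reported above, assuming a simple least-squares fit; the per-radiologist pairs below are illustrative placeholders, not the study's data:

```python
# A minimal sketch of the reported regression: kappa (dependent variable)
# against years of MSK experience (independent variable). The ten pairs
# below are illustrative placeholders, not the study's actual data.
from scipy import stats

years_experience = [1, 3, 5, 8, 10, 12, 14, 16, 22, 30]   # hypothetical
kappa_values = [0.53, 0.58, 0.55, 0.62, 0.60, 0.70,
                0.65, 0.73, 0.72, 0.91]                    # hypothetical

result = stats.linregress(years_experience, kappa_values)
# result.rvalue is the Pearson correlation coefficient (r); result.pvalue
# tests the null hypothesis of no linear association (significant if < .05).
print(f"r = {result.rvalue:.2f}, P = {result.pvalue:.3f}, "
      f"slope = {result.slope:.4f} kappa per year of experience")
```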

Discussion
The present study retrospectively examined patients at our institution diagnosed with shoulder bursitis on US. Fellowship-trained MSK radiologists were asked to grade each case of bursitis using a single US image of the SA-SD bursa. Three months after each case was graded, the cases were randomized and re-presented to the radiologists for reassessment. This allowed for the assessment of intraobserver variability. This study demonstrated that intraobserver variability exists amongst the radiologists, with a moderate positive correlation between increasing experience and improved reliability.

At the time of publication, the intraobserver reliability of grading shoulder bursitis on US had not previously been measured. The present study reports relatively good agreement in grading shoulder bursitis on US, regardless of years of experience. The kappa values ranged from .53 to .91 (Table 1). Although no researchers have previously measured intraobserver reliability in grading shoulder bursitis on US, many have reported similar intraobserver variability for different MSK-related pathologies, for different joints, and amongst different sonographic experts (both radiologists and non-radiologists).12-15

Cohen's kappa is a useful measure of intraobserver reliability. Values range from −1 to +1, where 0 represents the amount of agreement that can be expected from chance and 1 represents perfect agreement between two tests.16 Table 1 displays the individual kappa values for each participant and Figure 1 highlights the trend of increasing kappa with increasing experience. There was a moderate positive correlation between years of experience and improved reliability. The most experienced radiologist had the highest kappa of .91, representing disagreement in only 4 of 70 cases (6%). The kappa values of the remaining participants were between .53 and .73, which is similar to previous studies assessing intraobserver variability for different sonographic findings in various joints.12-15

To interpret the strength of agreement for given kappa values, we can separate different kappa values into descriptive categories. Landis and Koch11 proposed the following standard: <.01 = poor, .01-.20 = slight, .21-.40 = fair, .41-.60 = moderate, .61-.80 = substantial and .81-1.00 = almost perfect. Similar standards have been proposed,17 albeit with slightly different descriptors. However, the numerical cut-off at each tier is relatively arbitrary, and considering this when interpreting the results is paramount. The radiologists in the present study ranged from moderate to almost perfect agreement (Table 1), with a moderate positive correlation between increasing experience and improved reliability. In the first group of radiologists, with experience ranging from 1 to 5 years (N = 3), there was moderate agreement between tests; whereas, in the most experienced group (16-30 years; N = 4), there was one moderate, two substantial and one almost perfect agreement. Although the difference between moderate and almost perfect appears categorically significant, Figure 1 better illustrates how similar the radiologists are in terms of numerical kappa values. The magnitude of kappa is also influenced by additional factors, including the number of categories and whether weighting is applied to kappa.18 The greater the number of categories, the greater the potential for disagreement between tests. In this case, there were four categories (within normal limits, mild, moderate and severe).
Thus, in a clinical setting, a disagreement between within normal limits and severe should be more significant than a disagreement between mild and moderate, for example; an unweighted kappa treats both disagreements equally, whereas a weighted kappa would penalize the more distant disagreement more heavily. However, in the present study, only 6 of the 10 radiologists had a disagreement spanning two noncontiguous categories (between within normal limits and moderate, between within normal limits and severe, or between mild and severe), and each of those six radiologists made such a disagreement only once (representing 1.4% of cases).
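To make the weighting point concrete, the following sketch contrasts unweighted and linearly weighted kappa using scikit-learn's weights parameter; under linear weights, a within-normal-limits/severe disagreement costs three times as much as a mild/moderate one. The example gradings are hypothetical, and weighted kappa was not part of the study's analysis.

```python
# A sketch contrasting unweighted and linearly weighted Cohen's kappa on
# ordinal bursitis grades. With weights="linear", the penalty grows with the
# distance between categories: a within-normal-limits/severe disagreement
# costs three times a mild/moderate one. Example gradings are hypothetical.
from sklearn.metrics import cohen_kappa_score

GRADES = ["within normal limits", "mild", "moderate", "severe"]  # ordinal order

test1 = ["within normal limits", "mild", "moderate", "severe", "mild"]
test2 = ["severe", "mild", "moderate", "severe", "moderate"]

unweighted = cohen_kappa_score(test1, test2, labels=GRADES)
weighted = cohen_kappa_score(test1, test2, labels=GRADES, weights="linear")
print(f"unweighted kappa = {unweighted:.2f}, linearly weighted = {weighted:.2f}")
```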
Although the present study did not measure interobserver variability directly, Figure 2 illustrates the varying allotment of the different grades of shoulder bursitis between the different radiologists. Example cases of each grading severity that were unanimously agreed upon amongst all participants are provided in Figure 3. It is proposed that interobserver variability would exist if it were measured in this cohort, as many researchers have shown interobserver variability in diagnosing and grading different MSK pathologies on US imaging.8,12-15,19-23 This is most likely attributed to differences in opinion amongst clinicians as to what constitutes bursitis,8 as there is no gold-standard definition of bursitis on US. A clinician's definition and grading of shoulder bursitis is likely influenced by a combination of their prior training and clinical experience. The present study did not seek to establish a consensus definition of bursitis or a grading criterion for shoulder bursitis on US. Rather, the current researchers' chief interest is how clinicians would utilize this information to adjust their clinical practice. Most importantly, what impact does the radiologic impression of bursitis have in comparison to the clinical exam, and how much emphasis is placed on the radiologist's grading of bursitis? Furthermore, do the answers to these questions differ between clinicians (orthopaedic surgeons, rheumatologists, physiatrists, physiotherapists, etc.)?
The present study measured bursal distension on US as the distance between the peribursal fat and the superficial margin of the supraspinatus muscle, along a plane parallel to the transducer beam. We acknowledge that interobserver variability may exist when measuring the SA-SD bursa; future work should assess this prospectively.
Our study has several limitations. It is a single-centre study based on the retrospective assessment of saved US images of the SA-SD bursae, without Doppler assessment. Additional clinical context, including patient characteristics, presenting symptomatology and comorbidities, was not available to the interpreting radiologists. Ultrasonography is an inherently operator-dependent imaging modality; operator dependence was not a primary focus of our study and was minimized by utilizing cases performed by MSK-trained sonographers at a single centre.

Conclusion
This study demonstrates good intraobserver reliability in grading shoulder bursitis on US for all MSK-trained radiologists. Furthermore, there was a moderate positive correlation of improving reliability with increasing years of experience. Thus, understanding the inherent intraobserver variability of shoulder US may help clinicians more confidently choose the correct treatment for their patients presenting with shoulder pain.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.