Power Weighted Versions of Bennett, Alpert, and Goldstein's S

A weighted version of Bennett, Alpert, and Goldstein's $S$, denoted by $S_r$, is studied. It is shown that the special cases of $S_r$ are often ordered in the same way. It is also shown that many special cases of $S_r$ tend to produce values close to unity, especially when the number of categories of the rating scale is large. It is argued that the application of $S_r$ as an agreement coefficient is not without difficulties.


Introduction
In behavioral and biomedical science it is frequently required to measure the intensity of a behavior or a disease. Examples are the degree of arousal of a speech-anxious participant while giving a presentation, the severity of lesions from scans, or the severity of sedation during opioid administration for pain management. The intensity of these phenomena is usually classified by a single observer using a rating scale with ordered categories, for example, mild, moderate, or severe. To ensure that the observer fully understands what he or she is asked to interpret, the categories must be clearly defined. To measure the reliability of the rating scale researchers typically ask two observers to independently rate the same set of subjects. Analysis of the agreement between the observers can then be used to assess the reliability of the scale. High agreement between the ratings of the observers usually indicates consensus in the diagnosis and interchangeability of the classifications of the observers.
For assessing agreement on an ordinal scale various statistical methodologies have been developed. For example, the loglinear models presented in Tanner and Young [1] and Agresti [2,3] can be used for analyzing the patterns of agreement and potential sources of disagreement. Applications of these models can be found in Becker [4] and Graham and Jackson [5]. However, it turns out that researchers are usually only interested in a coefficient that (roughly) summarizes the agreement in a single number. The most commonly used coefficient for summarizing agreement on an ordinal scale is weighted kappa, proposed in Cohen [6] (see also [5,7]). Cohen [8] proposed coefficient kappa as an index of agreement for rating scales with nominal (unordered) categories [9]. The coefficient corrects for agreement due to chance. Weighted kappa extends Cohen's original kappa to rating scales with ordered categories. In the latter case there is usually more disagreement between the observers on adjacent categories than on categories that are further apart. With weighted kappa it is possible to describe the closeness between categories using weights. Both kappa and weighted kappa are standard tools for assessing agreement and have been used in thousands of applications [10,11]. The most commonly used version of weighted kappa is quadratic kappa [5,7].
Various authors have identified difficulties with the interpretation of kappa for nominal categories [7,12-17]. Cohen's kappa is a function of the marginal totals, the base rates of the categories, which indicate how often the categories were used by the observers [18-20]. Cohen's kappa tends to produce much lower values for skewed marginal distributions. Furthermore, kappas from samples with different base rates are not comparable [13,16]. de Mast and van Wieringen [16] and de Mast [17] studied kappa and kappa-type coefficients in the context of a latent class model. These authors argued that the problematic behavior of kappa is explained by the fact that it is a coefficient of predictive association, instead of a pure coefficient of agreement. Other authors have identified difficulties with the interpretation of quadratic kappa for ordered categories as well. Quadratic kappa behaves as a measure of association, instead of an agreement coefficient [5]. The value of quadratic kappa also tends to increase as the number of categories increases [21]. Furthermore, quadratic kappa cannot discriminate between tables with very different levels of exact agreement [22].
A commonly proposed alternative for Cohen's kappa for nominal categories is coefficient $S$, originally proposed in Bennett et al. [23] (see also [24-26]). Since coefficient $S$ is a linear transformation of the raw agreement and not a function of the marginal totals, it does not exhibit the interpretation difficulties of the kappa coefficients [10,27]. Furthermore, under the latent class model discussed in de Mast and van Wieringen [16] and de Mast [17], coefficient $S$ is the only agreement coefficient that can be given some justification. Coefficient $S$ is equivalent to a coefficient in Janson and Vegelius [28], coefficient RE in Janes [29], and a kappa coefficient in Brennan and Prediger [12]. In the case of two categories coefficient $S$ is equivalent to coefficients discussed in, among others, Holley and Guilford [30], Maxwell [31], and Krippendorff [32].
Recently, Gwet [33] proposed a weighted version of coefficient $S$ for rating scales with ordinal categories. In this paper this coefficient will be denoted by $S_w$. The generalization proposed in [33] is analogous to the generalization of kappa [8] to weighted kappa [6]. The weighting schemes that can be used with $S_w$ are identical to the weighting schemes of weighted kappa. The most commonly used weighting schemes for weighted kappa are the linear weights [34-36] and the quadratic weights [22,37,38]. In this paper we study how $S_w$ behaves as an agreement coefficient for rating scales with ordinal categories. More precisely, we study a special case of $S_w$ which will be denoted by $S_r$. Special cases of $S_r$ are coefficient $S$ and the coefficients that are obtained if we use the linear and quadratic weighting schemes. We present several properties of $S_r$ that indicate that the application of $S_r$ as an agreement coefficient is not without problems.
The paper is organized as follows. In Section 2 we introduce notation and define coefficients $S_w$ and $S_r$. In Section 3 it is shown that there is a simple ordering of the special cases of $S_r$ if a certain mild condition holds. Since this requirement is often met in real life, the special cases of $S_r$ are usually ordered in the same way. In Section 4 we present properties of $S_r$ for tridiagonal agreement tables. It is shown that many special cases of $S_r$ tend to produce values close to unity, especially when the number of categories of the rating scale is large. Section 5 contains a discussion.

Weighted Coefficients
In this section we introduce notation and define the coefficients $S_w$ and $S_r$. Gwet [33, page 56] defines $S_w$ in terms of similarity scaling. However, for notational convenience, we will define $S_w$ in terms of dissimilarity scaling here. If the weights are dissimilarities, pairs of categories that are further apart are usually assigned higher weights.
Suppose two fixed observers independently rate the same set of subjects using the same set of $n \geq 2$ ordered categories that are defined in advance. For a population of subjects, let $p_{ij}$ denote the proportion classified in category $i$ by the first observer and in category $j$ by the second observer. As an example, consider the following data on the diagnosis of carcinoma from Holmquist et al. [41]. Seven pathologists, labeled A to G, classified each of 118 slides in terms of carcinoma in situ of the uterine cervix, based on the most involved lesion, using five ordered categories: (1) negative, (2) atypical squamous hyperplasia, (3) carcinoma in situ, (4) squamous carcinoma with early stromal invasion, and (5) invasive carcinoma. The data can also be found in Landis and Koch [42]. Table 1 is the cross classification of the ratings of pathologists A and D.
Let $w_{ij}$ for $i, j \in \{1, 2, \ldots, n\}$ be nonnegative real numbers with $w_{ii} = 0$. The numbers $w_{ij}$ are used as weights, one for each cell of the table $\{p_{ij}\}$. If we formulate Gwet's approach in terms of dissimilarity scaling, then Gwet [33] presented the coefficient
\[ S_w = 1 - \frac{n^2 \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} p_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}}. \quad (1) \]
Coefficient $S_w$ is well defined if we require that at least one $w_{ij}$ is nonzero. With the weights $\{w_{ij}\}$ fixed, the maximum likelihood estimate of (1) under a multinomial sampling model is given by
\[ \hat{S}_w = 1 - \frac{n^2 \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \hat{p}_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}}, \quad (2) \]
where $\hat{p}_{ij}$ denotes the observed relative frequency of cell $(i, j)$. In this paper we are interested in a particular weighting scheme. Let $r \geq 0$ be a nonnegative real number and consider the weight function
\[ w_{ij} = |i - j|^r. \quad (3) \]
Using weight function (3) in (1) we obtain the weighted coefficient
\[ S_r = 1 - \frac{n^2 \sum_{i=1}^{n} \sum_{j=1}^{n} |i - j|^r p_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{n} |i - j|^r}. \quad (4) \]
Various well-known weighting schemes are special cases of weighting scheme (3). For $r = 0$ we have the identity weights
\[ w_{ij} = \begin{cases} 0 & \text{if } i = j, \\ 1 & \text{if } i \neq j. \end{cases} \quad (5) \]
For $n = 4$ categories weighting scheme (5) is given by the matrix
\[ \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}. \quad (6) \]
If we use $r = 0$ in (4) we obtain
\[ S_0 = \frac{n \sum_{i=1}^{n} p_{ii} - 1}{n - 1}. \quad (7) \]
Coefficient (7) is Bennett et al.'s [23] $S$, an agreement coefficient proposed for rating scales with nominal categories [12,28,29]. Coefficient $S_0$ is thus a special case of coefficient (4). The value of $S_0$ is 1 if there is perfect agreement between the observers and 0 when $\sum_{i=1}^{n} p_{ii} = 1/n$. For Table 1 we have the estimate $\hat{S}_0 = 0.364$.
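The estimate of (4) is straightforward to compute from an observed table of proportions. The following sketch is a minimal implementation; the 4 × 4 table `P` is hypothetical and the helper name `s_r` is ours, not from the paper.

```python
# Sketch of the estimator in (4): S_r = 1 - n^2 * sum(w_ij * p_ij) / sum(w_ij),
# with power weights w_ij = |i - j|^r and w_ii = 0.

def s_r(P, r):
    n = len(P)
    def w(i, j):
        # Diagonal weights are 0 by definition; this also handles r = 0,
        # where |i - j|**0 would otherwise be 1 on the diagonal.
        return 0.0 if i == j else float(abs(i - j)) ** r
    num = sum(w(i, j) * P[i][j] for i in range(n) for j in range(n))
    den = sum(w(i, j) for i in range(n) for j in range(n))
    return 1.0 - n * n * num / den

# Hypothetical table of proportions (rows: observer 1, columns: observer 2).
# The table is tridiagonal: all disagreement is on adjacent categories.
P = [
    [0.30, 0.05, 0.00, 0.00],
    [0.05, 0.20, 0.05, 0.00],
    [0.00, 0.05, 0.15, 0.02],
    [0.00, 0.00, 0.03, 0.10],
]

# For r = 0 the coefficient reduces to Bennett et al.'s S in (7).
P_o = sum(P[i][i] for i in range(4))
assert abs(s_r(P, 0) - (4 * P_o - 1) / 3) < 1e-12
# For this tridiagonal table the estimates increase with r.
assert s_r(P, 0) < s_r(P, 1) < s_r(P, 2)
```

Note that the diagonal weight must be set to zero explicitly: for $r = 0$ a naive `abs(i - j) ** r` would give the diagonal cells weight 1.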
Finally, for $n = 2$ categories coefficient $S_r$ becomes
\[ S_r = 1 - 2(p_{12} + p_{21}) = 2(p_{11} + p_{22}) - 1, \quad (8) \]
which does not depend on $r$. Since all special cases of $S_r$ coincide for $n = 2$ categories, there are no examples of 2 × 2 tables in this paper.

Conditional Inequalities
If we apply coefficients $S_0$, $S_{0.5}$, $S_1$, and $S_2$ to the same rating data we consistently find the triple inequality $\hat{S}_0 < \hat{S}_{0.5} < \hat{S}_1 < \hat{S}_2$. For example, consider the data entries in Table 2. Table 2 presents various statistics of 20 agreement tables from the literature. The first column of Table 2 specifies the source of the agreement table and the second column shows whether the table has size 3 × 3, 4 × 4, or 5 × 5. Columns 3 to 6 of Table 2 contain the values of the estimates $\hat{S}_0$, $\hat{S}_{0.5}$, $\hat{S}_1$, and $\hat{S}_2$. For all entries except the first we have the triple inequality $\hat{S}_0 < \hat{S}_{0.5} < \hat{S}_1 < \hat{S}_2$.
As a second example we consider the data on diagnosis of carcinoma from Holmquist et al. [41] introduced in Section 2. Table 3 presents various statistics of the 21 pairwise agreement tables for the seven pathologists. Columns 2 to 5 of Table 3 contain the values of the estimates $\hat{S}_0$, $\hat{S}_{0.5}$, $\hat{S}_1$, and $\hat{S}_2$. For all 21 tables we have the triple inequality $\hat{S}_0 < \hat{S}_{0.5} < \hat{S}_1 < \hat{S}_2$. The quantities in the last four columns of Tables 2 and 3 are defined and discussed below.
Tables 2 and 3 illustrate that the ordering $\hat{S}_0 < \hat{S}_{0.5} < \hat{S}_1 < \hat{S}_2$ is often found with real life data. This suggests that $S_r$ is usually increasing in $r$. The triple inequality does not hold in general, but it holds if a certain condition is valid. This sufficient condition is defined below. Recall that $\{p_{ij}\}$ is the agreement table with proportions. Define the quantities
\[ D_c = \frac{1}{2(n - c)} \sum_{|i - j| = c} p_{ij}, \qquad c \in \{1, 2, \ldots, n - 1\}. \quad (12) \]
For fixed $c$, the quantity $D_c$ in (12) is the sum of all elements of $\{p_{ij}\}$ that are $c$ steps removed from the main diagonal, divided by $2(n - c)$. Since there are precisely $2(n - c)$ elements that are $c$ steps removed from the main diagonal, $D_c$ is the average disagreement of the elements that are $c$ steps removed from the main diagonal. Since the elements of $\{p_{ij}\}$ that are $c$ steps removed from the main diagonal correspond to pairs of categories that are $c$ steps apart, $D_1$ is the average disagreement between the observers on adjacent categories, $D_2$ is the average disagreement on all categories that are two steps apart, and so on.
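The average disagreements $D_c$ in (12) are easy to compute from a table of proportions. The sketch below uses a hypothetical 4 × 4 table; the helper name `d` is ours.

```python
# Average disagreement D_c of the 2*(n - c) cells that are c steps removed
# from the main diagonal, as in (12).  The table P is hypothetical.

def d(P, c):
    n = len(P)
    total = sum(P[i][j] for i in range(n) for j in range(n) if abs(i - j) == c)
    return total / (2 * (n - c))

P = [
    [0.30, 0.06, 0.01, 0.00],
    [0.04, 0.20, 0.04, 0.01],
    [0.01, 0.04, 0.15, 0.02],
    [0.00, 0.01, 0.02, 0.09],
]

ds = [d(P, c) for c in range(1, 4)]
# Disagreement on adjacent categories exceeds disagreement on categories
# that are further apart, so the empirical ordering condition holds here.
assert ds[0] > ds[1] > ds[2]
```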
With ordered categories it is natural to assume that
\[ D_1 \geq D_2 \geq \cdots \geq D_{n-1}. \quad (13) \]
Condition (13) states that the average disagreement between the observers on categories that are closer in the ordering is at least as high as the average disagreement on categories that are further apart in the ordering. To check whether inequality (13) is reasonable for real life data we may check if the inequality
\[ \hat{D}_1 > \hat{D}_2 > \cdots > \hat{D}_{n-1} \quad (15) \]
holds. It turns out that condition (15) holds for many real life agreement tables with ordered categories. This is to be expected if the rating scale has been thoughtfully constructed, since in this case one expects that the disagreement between the observers on categories that are closer in the ordering is higher than on categories that are further apart in the ordering. For example, consider the data in Table 1. We have $\hat{D}_1 = 0.050$, $\hat{D}_2 = 0.014$, and $\hat{D}_3 = 0.006$. If the $\hat{D}_c$ are all equal, then all special cases of $\hat{S}_r$ coincide. It should be noted that the data in Cohen [8] are artificial. Condition (15) also holds for most entries of Table 3. The three exceptions are the entries corresponding to the pairs (A,C), (B,C), and (C,E). Theorem 2 below shows that $S_r$ is increasing in $r$ if condition (13) holds. Thus, if (13) holds there is a simple relationship between the special cases of coefficient (4). In particular, if (13) holds we have the triple inequality
\[ S_0 \leq S_{0.5} \leq S_1 \leq S_2. \]
Lemma 1 is used in the proof of Theorem 2.

Lemma 1. Let $a_i$ for $1 \leq i \leq m$ be nonnegative real numbers and let $b_i$ and $c_i$ for $1 \leq i \leq m$ be positive real numbers. If
\[ \frac{a_1}{b_1} \geq \frac{a_2}{b_2} \geq \cdots \geq \frac{a_m}{b_m} \quad (19) \]
and
\[ c_1 < c_2 < \cdots < c_m, \quad (20) \]
then
\[ \frac{\sum_{i=1}^{m} a_i c_i}{\sum_{i=1}^{m} b_i c_i} \leq \frac{\sum_{i=1}^{m} a_i}{\sum_{i=1}^{m} b_i}. \quad (21) \]
Furthermore, inequality (21) is strict if two of the ratios $a_i / b_i$ are distinct.
Proof. We start with the first part of the assertion. From (20) it follows that $c_j > c_i$ for $i < j$. Since $c_j - c_i > 0$ for $i < j$, it follows from (19) that $(a_i b_j - a_j b_i)(c_j - c_i) \geq 0$, or equivalently,
\[ a_i b_j c_j + a_j b_i c_i \geq a_i b_j c_i + a_j b_i c_j. \quad (22) \]
Summing (22) over all $i$ and $j$ with $1 \leq i < j \leq m$ we obtain
\[ \sum_{i<j} \left( a_i b_j c_j + a_j b_i c_i \right) \geq \sum_{i<j} \left( a_i b_j c_i + a_j b_i c_j \right). \quad (23) \]
Adding $\sum_{i=1}^{m} a_i b_i c_i$ to both sides of (23) we obtain
\[ \left( \sum_{i=1}^{m} a_i \right) \left( \sum_{i=1}^{m} b_i c_i \right) \geq \left( \sum_{i=1}^{m} a_i c_i \right) \left( \sum_{i=1}^{m} b_i \right). \quad (24) \]
Since $\sum_{i=1}^{m} b_i$ and $\sum_{i=1}^{m} b_i c_i$ are positive, inequality (24) is equivalent to (21). Finally, note that if two of the ratios $a_i / b_i$ are distinct, then (22) and hence (24) are strict.

Theorem 2. Let $S_r$ be given by (4). If condition (13) holds, then $S_r$ is increasing in $r$. The result follows from applying Lemma 1 with $m = n - 1$, $a_c = 2(n - c) c^r D_c$, $b_c = 2(n - c) c^r$, and $c_c = c^{s - r}$ for $0 \leq r < s$.
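The inequality in Lemma 1 can be spot-checked numerically. The sketch below draws random sequences satisfying the hypotheses ($a_i / b_i$ nonincreasing, $c_i$ increasing) and verifies that the $c$-weighted ratio of sums never exceeds the unweighted one; all variable names are ours.

```python
# Numerical spot check of a Chebyshev-sum-type inequality: if a_i/b_i is
# nonincreasing and c_i is increasing, then
#   sum(a_i*c_i)/sum(b_i*c_i) <= sum(a_i)/sum(b_i).
import random

random.seed(0)
for _ in range(1000):
    m = random.randint(2, 6)
    b = [random.uniform(0.1, 5.0) for _ in range(m)]
    # Nonincreasing ratios a_i/b_i, realized by multiplying b_i by sorted factors.
    ratios = sorted((random.uniform(0.0, 3.0) for _ in range(m)), reverse=True)
    a = [rat * bi for rat, bi in zip(ratios, b)]
    c = sorted(random.uniform(0.1, 5.0) for _ in range(m))  # increasing c_i
    lhs = sum(x * y for x, y in zip(a, c)) / sum(x * y for x, y in zip(b, c))
    rhs = sum(a) / sum(b)
    assert lhs <= rhs + 1e-12
```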

Tridiagonal Agreement Tables
In practice it frequently occurs that an agreement table with ordered categories is (approximately) tridiagonal. A tridiagonal table is a square matrix that has nonzero elements only on the main diagonal, the first diagonal below it, and the first diagonal above it. If the agreement table is tridiagonal there is only disagreement between the observers on adjacent categories. In this section we present results that hold if the agreement table is tridiagonal. In this case we have $D_2 = D_3 = \cdots = D_{n-1} = 0$ and it follows that condition (13) holds. It may be that the results below also hold for tables that are not tridiagonal but for which condition (13) is valid. Note that Theorem 2 is always valid for tridiagonal agreement tables.
For 6 of the 20 entries in Table 2 the agreement table is tridiagonal. Many of the other tables in Tables 2 and 3 are approximately tridiagonal: only a few disagreements are not on the diagonals directly below and above the main diagonal. Tables 2 and 3 show that $\hat{S}_r$ is usually increasing in $r$. This implies that Bennett et al.'s $S$ is usually a lower bound of the other special cases of $S_r$. Furthermore, it suggests that $S_r$ goes to unity as $r$ increases, regardless of the data at hand. Theorem 3 formalizes this observation for tridiagonal agreement tables.

Theorem 3. If $n \geq 3$ is fixed and $\{p_{ij}\}$ is tridiagonal, then $S_r \to 1$ as $r \to \infty$.
Proof. If $\{p_{ij}\}$ is tridiagonal, only the cells with $|i - j| = 1$, which have weight 1, contain disagreement, and (4) becomes
\[ S_r = 1 - \frac{n^2 \sum_{|i - j| = 1} p_{ij}}{\sum_{i=1}^{n} \sum_{j=1}^{n} |i - j|^r}. \quad (29) \]
Since the elements of $\{p_{ij}\}$ sum to unity, we have the inequality
\[ 1 - S_r \leq \frac{n^2}{\sum_{i=1}^{n} \sum_{j=1}^{n} |i - j|^r}. \quad (30) \]
The right-hand side of (30) does not depend on the data. Since the denominator $\sum_{i=1}^{n} \sum_{j=1}^{n} |i - j|^r$ is increasing in $r$, we can, for fixed $n$, make the right-hand side of (30) arbitrarily small. Hence, $S_r \to 1$ as $r \to \infty$.

Since $r$ in (4) is a nonnegative real number, there are uncountably many special cases of $S_r$. Theorems 2 and 3 together with Tables 2 and 3 show that all these special cases usually lie between $S_0$ and 1. Tables 2 and 3 also show that the positive differences $\hat{S}_1 - \hat{S}_0$ and $\hat{S}_2 - \hat{S}_1$ are quite substantial. This suggests that most elements of the sequence $(S_0, S_1, S_2, \ldots)$ will lie close to 1 and that the consecutive differences $S_1 - S_0$, $S_2 - S_1$, $S_3 - S_2, \ldots$ become smaller and smaller. In this section we present a particular result for the positive differences $S_1 - S_0$ and $S_2 - S_1$. Theorem 5 below shows that $S_2 - S_1$ never exceeds $S_1 - S_0$. We first derive explicit formulas of $\sum_{i=1}^{n} \sum_{j=1}^{n} |i - j|^r$ for $r = 0, 1, 2$ in Lemma 4.

Lemma 4. One has
\[ \sum_{i=1}^{n} \sum_{j=1}^{n} |i - j|^0 = n^2 - n, \quad (34) \]
\[ \sum_{i=1}^{n} \sum_{j=1}^{n} |i - j| = \frac{n(n^2 - 1)}{3}, \quad (35) \]
\[ \sum_{i=1}^{n} \sum_{j=1}^{n} |i - j|^2 = \frac{n^2(n^2 - 1)}{6}. \quad (36) \]
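Theorem 3 is easy to illustrate numerically: for a fixed tridiagonal table, the estimate approaches 1 as the power $r$ grows. The 5 × 5 table `P` below is hypothetical and the helper name `s_r` is ours.

```python
# Illustration of Theorem 3: for a fixed tridiagonal table, S_r -> 1 as r grows,
# because the denominator sum of |i - j|^r grows while the numerator is fixed.

def s_r(P, r):
    n = len(P)
    def w(i, j):
        return 0.0 if i == j else float(abs(i - j)) ** r
    num = sum(w(i, j) * P[i][j] for i in range(n) for j in range(n))
    den = sum(w(i, j) for i in range(n) for j in range(n))
    return 1.0 - n * n * num / den

# Hypothetical tridiagonal 5 x 5 table: disagreement only on adjacent categories.
P = [
    [0.16, 0.03, 0.00, 0.00, 0.00],
    [0.03, 0.15, 0.03, 0.00, 0.00],
    [0.00, 0.03, 0.14, 0.03, 0.00],
    [0.00, 0.00, 0.03, 0.15, 0.02],
    [0.00, 0.00, 0.00, 0.02, 0.18],
]

vals = [s_r(P, r) for r in (0, 1, 2, 4, 8)]
assert all(x < y for x, y in zip(vals, vals[1:]))  # increasing in r (Theorem 2)
assert vals[-1] > 0.99                             # already close to 1 at r = 8
```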

Dependence on the Number of Categories
A criticism against the use of Bennett et al.'s $S$ is that the coefficient tends to produce higher values for agreement tables with more categories [26]. More precisely, if the raw agreement $\sum_{i=1}^{n} p_{ii}$ is held constant, then (7) yields $S_0 \to \sum_{i=1}^{n} p_{ii}$ as $n \to \infty$. Thus, if the rating scale has many categories, we have $S_0 \approx \sum_{i=1}^{n} p_{ii}$, and $S_0$ is not a chance-corrected coefficient.
While $S_0$ has the raw agreement $\sum_{i=1}^{n} p_{ii}$ as an upper bound, it appears that the values of many other special cases of $S_r$ tend to go to unity as the number of categories of the rating scale increases. For example, suppose that $\{p_{ij}\}$ is tridiagonal and that $\sum_{i=1}^{n} p_{ii} = 2/3$. Using (29) and (35) the formula of $S_1$ is given by
\[ S_1 = 1 - \frac{n^2 (1/3)}{n(n^2 - 1)/3} = 1 - \frac{n}{n^2 - 1}. \]
We have $S_1 \to 1$ as $n \to \infty$. Since $S_1$ is usually a lower bound for all special cases of $S_r$ with $r > 1$ (Theorem 2), it follows that all coefficients $S_r$ with $r \geq 1$ go to unity as the number of categories increases. Dependence on the number of categories is considered an undesirable property of $S_r$.
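The two limits above can be checked numerically. The sketch below fixes the raw agreement at $2/3$, evaluates $S_0$ from (7) and the closed form $S_1 = 1 - n/(n^2 - 1)$, and cross-checks the latter against a direct computation on a uniform tridiagonal table; the helper names are ours.

```python
# S_0 tends to the raw agreement P_o, while S_1 (tridiagonal, P_o = 2/3)
# tends to 1 as the number of categories n grows.

def s0(p_o, n):
    return (n * p_o - 1) / (n - 1)      # equation (7)

def s1_tridiagonal(n):
    return 1 - n / (n ** 2 - 1)         # closed form from (29) and (35)

def s_r(P, r):
    n = len(P)
    def w(i, j):
        return 0.0 if i == j else float(abs(i - j)) ** r
    num = sum(w(i, j) * P[i][j] for i in range(n) for j in range(n))
    den = sum(w(i, j) for i in range(n) for j in range(n))
    return 1.0 - n * n * num / den

p_o = 2 / 3
assert abs(s0(p_o, 100) - p_o) < 0.01               # S_0 approaches P_o
assert s1_tridiagonal(100) > 0.98                   # S_1 approaches 1
assert s1_tridiagonal(3) < s1_tridiagonal(10) < s1_tridiagonal(100)

# Cross-check the closed form: uniform tridiagonal table with diagonal
# mass 2/3 and adjacent mass 1/3 spread over the 2*(n - 1) adjacent cells.
n = 5
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    P[i][i] = (2 / 3) / n
for i in range(n - 1):
    P[i][i + 1] = P[i + 1][i] = (1 / 3) / (2 * (n - 1))
assert abs(s_r(P, 1) - s1_tridiagonal(n)) < 1e-12
```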

Discussion
Bennett et al.'s [23] $S$ is an agreement coefficient for rating scales with nominal categories that has been discovered and rediscovered by many authors [12,28,29]. Recently, a weighted version of $S$ was proposed by Gwet [33] for rating scales with ordinal categories. In this paper we presented various properties of a special case of this weighted version, denoted by $S_r$, where $r$ is a nonnegative real number. Bennett et al.'s [23] $S$ corresponds to $S_0$, while $S_1$ and $S_2$ are the versions of $S_r$ that are obtained by using, respectively, the linear and quadratic weighting schemes.
It was first studied how the different versions of $S_r$ are related. Theorem 2 shows that $S_r$ is increasing in $r$ if the average disagreement between the observers on adjacent categories is greater than the average disagreement on categories that are two steps apart, and the latter is greater than the average disagreement on categories that are three steps apart, and so on. Hence, in this case, there is a simple relationship between the values of the special cases of $S_r$. It turns out that Theorem 2 is quite a strong result. First of all, the result involves all special cases of $S_r$, and there are uncountably many versions of $S_r$. Secondly, the sufficient condition holds for many data tables reported in this paper (see Tables 2 and 3). Since $S_r$ is usually increasing in $r$, its special cases are essentially measuring the same thing.
For the application of Cohen's kappa and weighted kappa, various authors have presented target values for evaluating the values of the kappa coefficients [44-47]. There is general consensus in the literature that uncritical application of such magnitude guidelines leads to practically questionable decisions. Warrens [48] argued that, since quadratic kappa produces values that are substantially higher than the values produced by Cohen's kappa, the same guidelines cannot be used for both coefficients. A similar argument applies here. Tables 2 and 3 show that coefficients $S_0$, $S_1$, and $S_2$ produce quite different values. Thus, despite the fact that the coefficients measure the same thing, they do this to a different extent. If one accepts the use of magnitude guidelines, different criteria need to be developed for the different special cases of $S_r$.
Finally, a number of results were presented that illustrate that many special cases of $S_r$ tend to produce values close to unity, regardless of the data at hand. This is especially the case when $r$ is high ($r \geq 2$) and the number of categories of the rating scale is large (5 or more). These results were only proved for agreement tables that are tridiagonal, but the high estimates in Tables 2 and 3 suggest that the results hold under more general conditions. The dependence of $S_r$ on the number of categories implies that different criteria need to be formulated depending on the number of categories. Developing different criteria for different coefficients and different numbers of categories seems an impossible task. Hence, coefficient $S_r$ is not useful as a general agreement coefficient. It is advised to limit the application of weighted versions of Bennett et al.'s $S$ to one or two coefficients, for example, $S_0$ for rating scales with nominal categories and $S_1$ for rating scales with ordinal categories.

Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.