Asymmetrical reliability of the Alda score favours a dichotomous representation of lithium responsiveness

The Alda score is commonly used to quantify lithium responsiveness in bipolar disorder. Most often, this score is dichotomized into “responder” and “non-responder” categories. This practice is often criticized as inappropriate, since continuous variables are thought to invariably be “more informative” than their dichotomizations. We therefore investigated the degree of informativeness across raw and dichotomized versions of the Alda score, using data from a published study of the scale’s inter-rater reliability (n = 59 raters of 12 standardized vignettes each). After learning a generative model for the relationship between observed and ground truth scores (the latter defined by a consensus rating of the 12 vignettes), we show that the dichotomized scale is more robust to inter-rater disagreement than the raw 0-10 scale. Further theoretical analysis shows that when a measure’s reliability is stronger at one extreme of the continuum (a scenario which has received little-to-no statistical attention, but which likely occurs for Alda scores ≥ 7), dichotomization of a continuous variable may be more informative concerning its ground truth value, particularly in the presence of noise. Our study suggests that research employing the Alda score of lithium responsiveness should continue using the dichotomous definition, particularly when data are sampled across multiple raters.

The Alda score is a validated index of lithium responsiveness commonly used in bipolar disorder (BD) research [1]. This scale has two components. The first is the "A" subscale, which provides an ordinal score (from 0 to 10, inclusive) of the overall "response" in a therapeutic trial of lithium. The second component is the "B" subscale, which attempts to qualify the degree to which any improvement was causally related to lithium. The total Alda score is computed from these two subscale scores, and takes integer values between 0 and 10. Many studies that employ the Alda score as a target variable dichotomize it, such that individuals with scores ≥ 7 are classified as "responders," and those with scores < 7 as "non-responders."

A common criticism of this practice is that continuous variables should not be discretized, on the grounds of "information loss." Indeed, discretizing continuous variables is widely viewed as an inappropriate practice [2-12]. However, the practice remains common across many areas of research, including our group's work on lithium responsiveness in BD [13]. The primary justification for using the dichotomized Alda score as the lithium responsiveness definition has been based on the inter-rater reliability study by Manchia et al. [1], who showed that a cut-off of 7 had strong inter-rater agreement (weighted kappa 0.66). Furthermore, using mixture modeling, they also found that the empirical distribution of Alda scores supports the discretized definition. Therefore, there exist competing arguments regarding the appropriateness of dichotomizing lithium response. Resolving this dispute is critical, since the operational definition of lithium responsiveness is a concept upon which a large body of research will depend.

Although the Manchia et al. [1] analysis provides some justification for using a dichotomous lithium response definition, it does not entirely dispel the argument of discretization-induced information loss. However, there is some intuitive reason to believe that discretization is, at least pragmatically, the best approach to defining lithium response using the Alda score. First, the Alda score remains inherently subjective to some degree and is not based on precise biological measurements; an individual whose "true" Alda score is 6, for example, could have observed scores that vary widely across raters. Second, it is possible that responders may be more reliably identified than non-responders. For example, unambiguously "excellent" lithium response is a phenomenon that undoubtedly exists in naturalistic settings [14,15], and for which the space of possible Alda scores is substantially smaller than for non-responders; that is, an Alda score of 8 can be obtained in far fewer ways than an Alda score of 5. As such, we hypothesize that agreement on the Alda score is higher at the upper end of the score range, and that this asymmetric agreement is a scenario in which dichotomization of the score is more informative than the raw measure. To evaluate this, we present both an empirical re-analysis of the ConLiGen study by Manchia et al. [1] and analyses of simulated data with varying levels of asymmetrical inter-rater reliability.

A detailed description of the data and collection procedures is found in Manchia et al. [1]. Samples included in our analysis are detailed in Table 1, including the number of raters included across sites, and the average ratings obtained at each of those sites across the 12 assessment vignettes. As a gold standard, we used ratings that were assigned to each case vignette using a consensus process at the Halifax site (scores are noted in the first row of Table 1). The lithium responsiveness inter-rater reliability data are available in S1 File (total Alda score) and S2 File (Alda A-score).

Empirical Analysis of the Alda Score

In this analysis, we seek to evaluate whether discretization of the Alda score under the existing inter-rater reliability values preserves more mutual information (MI) between the observed and ground truth labels than does the raw scale representation. To accomplish this, we first develop a probabilistic formulation of raters' score assignments based on a Multinomial-Dirichlet model, which we describe here.

Let $n_i^{(k)} \in \mathbb{N}_+$ denote the number of raters who assigned an Alda score $i \in A$, with $A = \{0, 1, \ldots, 10\}$, to an individual whose gold standard Alda score is $k \in A$. The vector of rating counts for the gold standard score $k$ is modelled with a multinomial distribution whose category probabilities have a Dirichlet prior, where $\alpha$ is a pseudocount denoting the prior expectation of the number of ratings received for each score $i \in A$. In the present analysis, we assume that $\alpha$ is equal across all scores in $A$, and thus we denote it simply as a scalar $\alpha$. The resulting estimate can be viewed as the conditional distribution over scores $A$ for any given rater when the gold standard is $k$. In cases where no assessment vignette had a gold standard rating of $k$, we assumed that […].

The dichotomized Alda scores are defined as $T = \{\delta[i \geq \tau] : \forall i \in A\}$, where $\tau$ is the dichotomization threshold (set at $\tau = 7$ for the Alda score), and where $\delta[\cdot]$ is an indicator function that evaluates to 1 if the argument is true, and 0 otherwise. Given threshold $\tau$ (responders $\geq \tau$ and non-responders $< \tau$), the dichotomous counts are represented as $\mathbf{c}^{(k)} = \left(\sum_{i < \tau} n_i^{(k)}, \sum_{i \geq \tau} n_i^{(k)}\right)$, with $\mathbf{c}^{(k)} \sim \mathrm{Multinomial}(\phi_k)$ and $\phi_k \sim \mathrm{Dir}(\phi \mid \xi)$, where $\xi$ is a pseudocount for the number of dichotomized ratings assigned to each of non-responders and responders. We can thus estimate the conditional distribution over observed dichotomized response ratings in the same manner as for the raw scores.

Mutual Information of Raw and Dichotomized Alda Score Representations

Let $x^o \in A$ denote a given observed raw Alda score assigned to a case with ground truth score of $x^* \in A$. Given uniform priors on the true classes, the joint distribution is $p(x^o, x^* = k) = \frac{1}{11}\, p(x^o \mid x^* = k)$ for $k = 0, 1, \ldots, 10$. For the binarized classes, we have a prior of $p(y^* = 1) = 4/11$, and the joint distribution is thus $p(y^o, y^*) = p(y^o \mid y^*)\, p(y^*)$.

The MI for these distributions can be computed as functions of the prior pseudocounts $\alpha$ and $\xi$, denoted $I_\alpha[x^o \| x^*]$ and $I_\xi[y^o \| y^*]$ for the raw and dichotomized Alda scores, respectively. We can express the MI of the raw and dichotomized Alda score distributions both in terms of $\alpha$, such that both distributions have an equivalent total "concentration": $2\xi = 11\alpha$ when $\xi = 11\alpha/2$. This is equivalent to saying that our prior assumption about the uncertainty of the raw and dichotomized distributions assumes the same number of a priori ratings.
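As a rough illustration of this comparison, the following Python sketch smooths a table of rating counts with the pseudocounts α and ξ and computes the MI of the raw and dichotomized representations. The toy counts, the function names, and the pooling of gold standard scores into two true classes are assumptions made for illustration only; they are neither the study data nor necessarily the authors' exact estimator.

```python
import math

SCORES = range(11)  # raw Alda scores 0..10
TAU = 7             # responder threshold


def mutual_information(joint):
    """MI (in nats) of a joint distribution stored as joint[true][observed]."""
    row_marg = [sum(row) for row in joint]
    col_marg = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for r, row in enumerate(joint):
        for c, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (row_marg[r] * col_marg[c]))
    return mi


def raw_mi(counts, alpha):
    """counts[k][i] = raters giving score i when the gold standard is k."""
    joint = []
    for k in SCORES:
        total = sum(counts[k]) + 11 * alpha
        # Smoothed conditional p(x_o = i | x* = k), times the uniform prior 1/11.
        joint.append([(counts[k][i] + alpha) / total / 11 for i in SCORES])
    return mutual_information(joint)


def dichotomized_mi(counts, xi, tau=TAU):
    """Assumption: gold standard scores are pooled into two true classes."""
    prior = {0: tau / 11, 1: (11 - tau) / 11}  # p(y* = 1) = 4/11 when tau = 7
    joint = []
    for y_true in (0, 1):
        ks = [k for k in SCORES if (k >= tau) == bool(y_true)]
        below = sum(counts[k][i] for k in ks for i in SCORES if i < tau)
        above = sum(counts[k][i] for k in ks for i in SCORES if i >= tau)
        total = below + above + 2 * xi
        joint.append([prior[y_true] * (below + xi) / total,
                      prior[y_true] * (above + xi) / total])
    return mutual_information(joint)


def toy_counts(n_raters=59):
    """Hypothetical counts: scattered ratings below tau, exact agreement above."""
    counts = [[0] * 11 for _ in SCORES]
    for k in SCORES:
        if k >= TAU:
            counts[k][k] = n_raters
        else:
            for i in range(TAU):
                counts[k][i] = n_raters // TAU   # even spread over scores 0..6
            counts[k][k] += n_raters % TAU       # leftover raters pick k itself
    return counts


counts = toy_counts()
for alpha in (0.1, 1.0, 10.0):
    # xi = 11 * alpha / 2 gives both priors the same total concentration.
    print(alpha, raw_mi(counts, alpha), dichotomized_mi(counts, 11 * alpha / 2))
```

Sweeping α upward in the loop mimics increasing a priori observation noise, with ξ tied to α so that both priors carry the same number of a priori ratings.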

Our primary hypothesis, that the dichotomized Alda score is more informative with greater observation uncertainty, is evaluated by determining whether $I_\xi[y^o \| y^*]$ exceeds $I_\alpha[x^o \| x^*]$ as we increase the a priori observation noise ($\alpha$ and $\xi$).

The previous experiment regarding dichotomization of the raw Alda score did not fully capture the effect of dichotomization of a continuous variable, since the raw Alda score is still discrete (albeit with a larger domain of support). Thus, we sought to investigate whether dichotomization of a truly continuous, though asymmetrically reliable, variable would show a similar pattern of preserving MI and statistical power under higher levels of observation noise and agreement asymmetry.

Figure 1. Demonstration of the synthetic agreement data across differences in the parameter ranges and presence of asymmetry. The x-axes all represent the ground truth value of the variable, and the y-axes represent the "observed" values. Data are depicted based on different values of a uniform noise parameter (0 ≤ ω ≤ 1) that governs what proportion of the data is merely uniform noise over the interval [0, 10], and a disagreement parameter (σ ≥ 0), which governs the variance around the diagonal line. Panel A (upper three rows, shown in blue) depicts the synthetic data in which there were asymmetrical levels of agreement across the score domain. Panel B (lower three rows, shown in red) depicts synthetic data in which there was symmetrical agreement over the score domain.

The simplest synthetic dataset generated was merely a sample of regularly spaced points across the [0, 10] interval in both the x and y directions. This dataset was merely used to conduct a "sanity check" that our methods for computing MI correctly identified a value of 0. This was necessary since data with uniform random noise over the same interval will only yield MI of 0 in the limit of large sample sizes.

The main synthetic dataset accepted "ground truth" values x ∈ [0, 10] and yielded "observed" values y ∈ [0, 10] based on a formula for the $i$-th sample in which 0 ≤ ω ≤ 1 is a parameter governing the degree to which observed values are coupled to the ground truth through $f(x_i)$ (data come entirely from $f(x_i)$ when ω = 0, and are entirely uniform random noise when ω = 1). The function $f(x_i)$ governing the agreement between ground truth and observed values is essentially a 1:1 correspondence between x and y, to which we add noise around the diagonal based on a uniform random variate U(−σ, σ) with half-width σ.

We simulated two forms of diagonal spread. The first is constant across all values x ∈ [0, 10], which we call the symmetrical case, and which is represented by a parameter β = 0. The other is an asymmetrical case (represented as β = 1), in which the agreement between x and y is not constant across the [0, 10] range. The definition of $f(x_i)$ includes a function $R_{(l,u)}(\cdot)$ that ensures that all points remain within the [l, u] interval in both axes. In the asymmetrical case, $R_{(l,u)}(\cdot)$ reflects points at the [0, 10] bounds. In the symmetrical case, the data are all simply rescaled to lie in the [0, 10] interval.
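One possible reading of this generator is sketched below in Python. Several choices here are assumptions rather than the authors' definitions: ground truth values are drawn uniformly, each sample is replaced by uniform noise with probability ω, and in the asymmetrical case the diagonal spread shrinks linearly toward the upper end of the range. The names `synth`, `reflect`, and `rescale` are hypothetical.

```python
import random

LOW, HIGH = 0.0, 10.0


def reflect(v, low=LOW, high=HIGH):
    """Fold a value back into [low, high] by reflecting it at the bounds."""
    span = high - low
    v = (v - low) % (2 * span)
    return low + (v if v <= span else 2 * span - v)


def rescale(values, low=LOW, high=HIGH):
    """Linearly rescale values to [low, high], clamping float round-off."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [(low + high) / 2 for _ in values]
    scale = (high - low) / (hi - lo)
    return [min(high, max(low, low + (v - lo) * scale)) for v in values]


def synth(n=750, omega=0.3, sigma=2.0, asymmetric=True, seed=0):
    """Return (xs, ys): ground truth values and noisy 'observed' values."""
    rng = random.Random(seed)
    xs = [rng.uniform(LOW, HIGH) for _ in range(n)]
    if asymmetric:
        # Assumed form: spread is sigma at x = 0 and vanishes at x = 10.
        ys = [reflect(x + rng.uniform(-sigma, sigma) * (1 - x / HIGH))
              for x in xs]
    else:
        # Constant spread along the diagonal, then rescaled into [0, 10].
        ys = rescale([x + rng.uniform(-sigma, sigma) for x in xs])
    # With probability omega, replace the observation by pure uniform noise.
    ys = [rng.uniform(LOW, HIGH) if rng.random() < omega else y for y in ys]
    return xs, ys


xs, ys = synth(omega=0.3, sigma=2.0, asymmetric=True)
print(len(xs), min(ys), max(ys))
```

Fixing the seed makes each synthetic dataset reproducible, which is convenient when many datasets are generated per parameter setting.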

Demonstrations of the simulated synthetic data are shown in Figure 1. Every synthetically generated dataset included 750 samples, and for notational simplicity, we denote the $k$-th synthetic dataset (given parameters β, ω, σ) as $D^{(k)}_{\beta,\omega,\sigma}$.

Continuous MI was computed by fitting a density estimator (with […] for bandwidth selection) on the simulated dataset, and then approximating the MI integral using Markov chain Monte-Carlo sampling. Conversely, discrete MI was computed by first creating a 2-dimensional histogram by binning data based on a dichotomization threshold τ. Data that lie below the dichotomization threshold are denoted 1, and those that lie above the threshold are represented as 0. The dichotomized MI is then computed from this joint distribution. Note that continuous MI will remain constant across τ.
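The dichotomized MI calculation can be sketched in a few lines of Python. This is a minimal illustration under the assumption that the same threshold τ is applied to both the ground truth and observed values; the function name `thresholded_mi` is hypothetical. The final lines reproduce the "sanity check" on a regular grid of points, for which x and y are independent.

```python
import math


def thresholded_mi(xs, ys, tau):
    """MI (in nats) between x and y after dichotomizing both at tau."""
    n = len(xs)
    table = [[0, 0], [0, 0]]
    for x, y in zip(xs, ys):
        # Values below the threshold are coded 1, the rest 0.
        table[int(x < tau)][int(y < tau)] += 1
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p = table[a][b] / n
            if p > 0:
                px = (table[a][0] + table[a][1]) / n
                py = (table[0][b] + table[1][b]) / n
                mi += p * math.log(p / (px * py))
    return mi


# Sanity check: a regular grid over [0, 10] x [0, 10] carries no information.
grid = [i * 0.5 for i in range(21)]
grid_x = [a for a in grid for _ in grid]
grid_y = [b for _ in grid for b in grid]
print(thresholded_mi(grid_x, grid_y, tau=7.0))
```

Because the 2 × 2 table depends on τ, this quantity must be recomputed at every threshold, whereas the continuous MI is computed once per dataset.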
In the power calculation for the Pearson correlation coefficient, Φ(·) and Φ⁻¹(·) are the cumulative distribution and quantile functions for a standard normal distribution, and ζ(·) is Fisher's Z-transformation.

Under a dichotomization of $D^{(k)}_{\beta,\omega,\sigma}$ with threshold τ, association between the ground truth and observed data can be evaluated using a (two-tailed) Fisher's exact test, whose alternative hypothesis is that the odds ratio (η) of the dichotomized data does not equal 1. The null hypothesis has a Fisher's noncentral hypergeometric distribution parameterized by $N^{(k)}$, the total number of observations in sample k, and by the numbers of ground truth and observed data, respectively, that fall below the dichotomization threshold τ. Under the alternative hypothesis, this distribution has an odds ratio parameter estimated from the data. The statistical power of Fisher's exact test is computed under this setup and a two-tailed significance threshold of α.

The central aspect of this analysis is comparison of the dichotomized and continuous MI across values of the dichotomization threshold τ, global noise ω, asymmetry parameter β, and diagonal spread σ. Under all cases, we expect that increases in the global noise parameter ω will reduce the MI. We also expect that with symmetrical reliability (i.e. β = 0), the dichotomized MI will be lower than the continuous MI across all thresholds. However, as the degree of asymmetry in the reliability increases, we expect the dichotomized MI to exceed the continuous MI (i.e. as σ increases when β = 1). Finally, as a sanity check, we expect that both continuous and dichotomized MI will be approximately 0 when applied to a grid of points regularly spaced over the [0, 10] interval.
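The two power calculations can be approximated as in the Python sketch below. It uses the standard large-sample power formula for a correlation test based on Fisher's Z-transformation, and a direct enumeration of Fisher's noncentral hypergeometric distribution for the exact test. These are textbook constructions offered as an assumption about the general approach, not a reproduction of the authors' formulas; the function names are hypothetical.

```python
import math
from statistics import NormalDist


def pearson_power(r, n, alpha=0.05):
    """Approximate two-tailed power for a true correlation r with n samples."""
    nd = NormalDist()
    z = math.atanh(r) * math.sqrt(n - 3)   # Fisher's Z, scaled by its std error
    crit = nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(z - crit) + nd.cdf(-z - crit)


def _log_comb(n, k):
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def noncentral_hypergeom_pmf(total, n_x, n_y, odds_ratio):
    """PMF of the top-left cell of a 2x2 table with fixed margins n_x, n_y."""
    lo, hi = max(0, n_x + n_y - total), min(n_x, n_y)
    logw = [_log_comb(n_x, a) + _log_comb(total - n_x, n_y - a)
            + a * math.log(odds_ratio) for a in range(lo, hi + 1)]
    peak = max(logw)                        # subtract the max for stability
    w = [math.exp(v - peak) for v in logw]
    norm = sum(w)
    return [v / norm for v in w]


def fisher_power(total, n_x, n_y, odds_ratio, alpha=0.05):
    """Power of the two-tailed Fisher's exact test at a given odds ratio."""
    null = noncentral_hypergeom_pmf(total, n_x, n_y, 1.0)
    alt = noncentral_hypergeom_pmf(total, n_x, n_y, odds_ratio)
    power = 0.0
    for p_obs, p_alt in zip(null, alt):
        # Two-tailed p-value: total null mass of tables no more likely than this one.
        p_value = sum(p for p in null if p <= p_obs * (1 + 1e-7))
        if p_value <= alpha:
            power += p_alt
    return power


print(pearson_power(0.1, 750), fisher_power(750, 225, 225, 1.5))
```

In this sketch, `n_x` and `n_y` play the role of the counts of ground truth and observed values that fall below τ, and `odds_ratio` would be estimated from the dichotomized data.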

Statistical power of the Pearson correlation coefficient and Fisher's exact test was computed across symmetrical (β = 0) and asymmetrical (β = 1) conditions of the synthetic dataset described above. Owing to the greater computational efficiency of these calculations (compared to the MI), the diagonal spread parameter was varied more densely (σ ∈ {1, 2, ..., 20}). The power of Fisher's exact test was evaluated at two dichotomization thresholds: a median split at τ = 5 and a "tail split" at τ = 3. We evaluated three global noise settings (ω ∈ {0.3, 0.5, 0.7}). At each experimental setting, we computed the aforementioned power levels for 100 independent synthetic datasets. Results are presented using the mean and 95% confidence intervals of the power estimates over the 100 runs under each condition. We expect that Fisher's exact test under a "tail split" dichotomization (not a median split) will yield greater statistical power in the presence of asymmetrical reliability, greater diagonal spread, and higher global noise. However, under the symmetrically reliable condition, we expect that the statistical power will be greater for the continuous test of association.

Histograms of ratings for each value of the ground truth Alda score available in the first wave dataset from Manchia et al. [1]. Each histogram represents the distribution of ratings ($n_r = 59$) for a single one of twelve assessment vignettes. The gold standard ("ground truth") Alda score, obtained by the Halifax consensus sample, is depicted as the title for each histogram. Plots in blue are those for vignettes with gold standard Alda scores less than 7, which would be classified as "non-responders" under the dichotomized setting. Vignettes with gold standard Alda scores ≥ 7 are shown in red, and represent the dichotomized group of lithium responders.

[…] properties with respect to agreement. X-axes represent the dichotomization thresholds at which we recalculate the dichotomized MI. Mutual information is depicted on the y-axes. Plot titles indicate the different diagonal spread (σ) parameters used to generate the synthetic datasets. Solid lines (for dichotomized MI) are surrounded by ribbons depicting the 95% confidence intervals over 10 runs at each combination of parameters (τ, ω, β, σ).

[…] Fisher's exact test (a measure of association between dichotomized variables; red lines) for synthetic data with symmetrical (upper row) and asymmetrical (lower row) properties with respect to agreement. Columns correspond to the level of uniform "overall" noise (ω) added to the data, representing prior uncertainty. X-axes represent the diagonal spread (σ), and the y-axes represent the test's statistical power for the given sample size and estimated effect sizes. Data subjected to Fisher's exact test were dichotomized at either a threshold of 5 (the "Median Split," denoted by '+' markers in red) or 3 (the "Tail Split," denoted by the dot markers in red). For all series, dark lines denote means and the ribbons are 95% confidence intervals over 100 runs.
[…] (τ = 5) or continuous representation; this relationship was present even at high levels of diagonal spread and overall uniform noise.

The present study makes two important contributions. First, using a sample of 59 ratings obtained using standardized vignettes compared to a consensus-defined gold standard [1], we showed that the dichotomized Alda score has a higher MI between the observed and gold standard ratings than does the raw scale (which ranges from 0-10). Those data suggested that the Alda score's reliability is asymmetrical, with greater inter-rater agreement at the upper extreme. Second, using synthetic experiments, we showed that asymmetrical inter-rater reliability in a score's range is the likely cause of this relationship. Our results do not argue that lithium response is itself a categorical natural phenomenon. Rather, using the dichotomous definition as a target variable in supervised learning problems likely confers greater robustness to noise in the observed ratings.

Some have argued that the existence of categorical structure in one's data [9], or evidence of improved reliability under a dichotomized structure [16], are potentially justifiable rationales for dichotomization of continuous variables. These claims are generally stated only briefly, and with less quantitative support than the more numerous mathematical treatments of the problems with dichotomization [9,10,16,17]. However, these more rigorous quantitative analyses typically involve assumptions of symmetrical or Gaussian distributions of the underlying variables in the context of generalized linear modeling (although Irwin & McClelland [10] demonstrated that median splits of asymmetric and bimodal beta distributions are also deleterious). These analyses have led to vigorous generalized denunciation of variable dichotomization across several disciplines, but our current work offers important counterexamples to this narrative [10,11].

The Alda score is more broadly used as a target variable in both predictive and associative analyses, and not as a predictor variable, which is an important departure from most analyses against dichotomization. Since there is no valid and reliable biomarker of lithium response, these cases must rely on the Alda score-based definition of lithium response as a "ground truth" target variable. In the case of predicting lithium response, where these ground truth labels are collected from multiple raters across different international sites, variation in lithium response scoring patterns across centres might further accentuate the extant between-site heterogeneity.

To this end, inter-individual differences in subjective rating scales may be more informative about the raters than the subjects, and one may wish to use dichotomization to discard this nuisance variance [8,9,16]. Doing so means that one turns regression supervised by a dubious target into classification with a more reliable (although coarser) target. Appropriately balancing these considerations may require more thought than adopting a blanket prohibition on dichotomization or some other form of preprocessing.

An important criticism of continuous variable dichotomization is that it may impede comparability of results across studies, both in terms of diminishing power and inflating heterogeneity [17]. However, this is more likely a problem when dichotomization thresholds are established on a study-by-study basis, without considering generalizability from the outset. These arguments do not necessarily apply to the Alda score, since the threshold of 7 has been established across a large consortium with support from both reliability and discrete mixture analysis [1], and is the effective standard split point for this scale [18].

Our study thus provides a unique point of support for the dichotomized Alda score insofar as we show that the retention of MI and frequentist statistical power is likely due to asymmetrical reliability across the range of scores. Our analyses show that there is a range of Alda scores (those identifying good lithium responders; scores ≥ 7) for which scores correspond more tightly to a consensus-defined gold standard in a large-scale international consortium. In particular, we showed that this dichotomization will be more robust to increases in the prior uncertainty (i.e. the overall level of background "noise" in the relationship between true and observed scores). This feature is important since the sample of raters included in the Alda score's calibration study [1] was relatively small and consisted of individuals involved in ConLiGen centres. It is reasonable to suspect that assessment of Alda score reliability in broader research and clinical settings would add further disagreement-based noise to the inter-rater reliability data. At present, use of the dichotomized scale could confer some robustness to that uncertainty.

More generally, our study showed that if reliability of a measure is particularly high at one tail of its range, then a "tail split" dichotomization can outperform even the continuous representation of the variable. This presents an important counterexample to previous authors, such as Cohen [5], Irwin & McClelland [10], and MacCallum et al. [9], who argued that "tail splits" are still worse than median splits. While our study reaffirms these claims in the case of measures whose reliability is constant over the domain (see Figure 4B and the upper row of Figure 5), our analysis of the asymmetrically reliable scenario yields opposite conclusions, favouring a "tail split" dichotomization over both median splits and continuous representations. Tail split dichotomization was particularly robust when data were affected by both asymmetrical reliability and high degrees of uniform noise over the variable's range. Together, these results suggest that dichotomization/categorization of a continuous measurement may be justifiable when its relationship to the underlying ground truth variable is noisy everywhere except at an extreme.

Our study has several limitations. First, our sample size for the re-analysis of the Alda score reliability was relatively small, and sourced from highly specialized raters involved in lithium-specific research. However, one may consider this sample as representative of the "best case scenario" for the Alda score's reliability. It is likely that further expansion of the subject population would introduce more noise into the relationship between ground truth and observed Alda scores. It is likely that most of this additional disagreement would be observed for lower Alda scores, since (A) there are simply more potential item combinations that can yield an Alda score of 5 than an Alda score of 9, for example, and (B) unambiguously excellent lithium response is a phenomenon so distinct that some question whether lithium responsive BD may constitute a unique diagnostic entity [19,20]. Thus, we believe that our sample size for the reliability analysis is likely sufficient to yield the present study's conclusions.

Second, our theoretical analyses were simulation based, and thus cannot offer the degree of generalizability obtained through rigorous mathematical proof. Nonetheless, our study offers sufficient evidence, in the form of a counterexample, to show that there exist scenarios in which dichotomization is statistically superior to preserving a variable's continuous representation. Furthermore, we used well controlled experiments to isolate asymmetrical reliability as the cause of dichotomization's superiority across simulated conditions.

In conclusion, we have shown that a dichotomous representation of the Alda score for lithium responsiveness is more robust to noise arising from inter-rater disagreement. The dichotomous Alda score is therefore likely a better representation of lithium responsiveness for multi-site studies in which lithium response is a target or dependent variable. Through both re-analysis of the Alda score's real-world inter-rater reliability data and careful theoretical simulations, we were able to show that asymmetrical reliability across the score's domain was the likely cause for superiority of the dichotomous definition. Our study is important not only for future research on lithium response, but also for other studies using subjective and potentially unreliable measures as dependent variables. Practically speaking, our results suggest that it might be better to classify something we can all agree upon than to regress something upon which we cannot.