Latent-Variable Modelling of Ordinal Outcomes in Language Data Analysis

ABSTRACT In empirical work, ordinal variables are typically analysed using means based on numeric scores assigned to categories. While this strategy has met with justified criticism in the methodological literature, it also generates simple and informative data summaries, a standard often not met by statistically more adequate procedures. Motivated by a survey of how ordered variables are dealt with in language research, we draw attention to an un(der)used latent-variable approach to ordinal data modelling, which constitutes an alternative perspective on the most widely used form of ordered regression, the cumulative model. Since the latent-variable approach does not feature in any of the studies in our survey, we believe it is worthwhile to promote its benefits. To this end, we draw on questionnaire-based preference ratings by speakers of Maltese English, who indicated on a 5-point scale which of two synonymous expressions (e.g. package-parcel) they (tend to) use. We demonstrate that a latent-variable formulation of the cumulative model affords nuanced and interpretable data summaries that can be visualized effectively, while at the same time avoiding limitations inherent in mean response models (e.g. distortions induced by floor and ceiling effects). The online supplementary materials include a tutorial for its implementation in R.


Introduction
Ordinal variables consist of a set of ordered categories, where the ordering reflects a larger or smaller amount of what is being measured. Their midway position between continuous and nominal variables has given rise to a variety of analysis strategies in empirical work. Most commonly, ordered outcomes are translated into numeric scores and then treated in the same way as continuous variables. We will follow Agresti (2010, p. 137) and use the term mean response model (MRM) to refer to this strategy.
Supplemental data for this article can be accessed online at https://doi.org/10.1080/09296174.2024.2329448

Selection of Studies and Overview of Database
Our survey covers an 11-year span (2012-2022) and includes 3,859 articles published in 14 linguistic journals. These represent a broad range of research areas and methodologies (for an overview, see https://osf.io/jnv27). We searched for all studies that include the words 'ordinal', 'Likert', or 'semantic differential', which returned 423 documents. Since our attention is restricted to the treatment of ordinal outcome (i.e. dependent or response) variables, we manually excluded work that features ordinal predictors only, or where ordinal variables play an ancillary role (e.g. for sample description or stimulus validation). We also excluded studies that rely on standardized (multi-item) tests to measure latent traits such as anxiety or motivation. This left us with 149 research articles.
The vast majority of ordinal responses in our database were collected using rating and judgement tasks (n = 136; 91%), where informants indicate some type of assessment or agreement by choosing from an ordered set of categories. Acceptability judgements were the most frequently employed scheme (n = 32). Recurrent forms also included accentedness and naturalness ratings, and semantic differential scales. For simplicity, we will refer to these kinds of response scales, collectively, as rating scales. Among the minor types of ordinal variables we encountered are existing categories (e.g. CEFR levels, phones/phonemes, or position on a grammaticalization cline) or classifications based on defined criteria (e.g. test or performance ratings).

Structure of Ordinal Scales
Let us take a closer look at the layout of response scales used for rating tasks (n = 136 articles). Table 1 summarizes the distribution of (i) number of response categories, ranging from 3 to 11; and (ii) the way in which descriptors are added to the scale, e.g. whether labels are given only at the endpoints, or for each category. The numbers in boldface provide the overall distribution of these attributes. Considering the number of response categories, the most popular choices are 5 (39%) or 7 categories (35%); fewer than 5 and more than 7 options are rarely used. We also note that the vast majority of studies (79% in total) use an uneven number of tick boxes, thus providing a middle category that does not force ratings towards either end of the scale. As for the way in which descriptive information is added, we observe that 58% of all studies provide labels only at the endpoints. Another frequently used format is to give descriptions for each category (31%); around half of these fully labelled sequences are Likert-type formats, where respondents indicate (degree of) agreement to a statement (Likert, 1932).

Analysis Strategies Used in the Literature
Based on the analysis strategy used, we divided studies into six groups. An analysis is coded as employing a mean response model if it translates the ordered categories into numeric scores and then summarizes these using averages (and standard deviations), ANOVA or ordinary (mixed-effects) least-squares regression. The label non-parametric test was assigned to rank-based inferential procedures (e.g. Wilcoxon signed-rank test). Studies in the class ordered regression model employ a form of categorical regression respecting the order of response levels. Further, ordered random forests are listed as a separate category since they rely on a different modelling paradigm. Analyses assigned to the category binary regression dichotomize the graded scale, either by collapsing adjacent levels to form a binary outcome, or by applying a series of disconnected regressions to pairs (or groupings) of response levels. Finally, the label description is used for analyses that exclusively rely on descriptive statistics and use measures other than the mean (e.g. category counts/percentages or medians). We found 9 studies in our survey that rely on means (MRM) in addition to another strategy. These studies were assigned to both groups and appear twice in the following summaries, which are therefore based on 158 (instead of 149) data points.
Figure 1 shows the usage rate of strategies across the 11-year period covered by our survey. Each crossbar denotes a study, and the tallies are the total number of research articles making use of the respective approach. The great majority of analyses (n = 122; 77%) assign numeric scores to the ordered categories and then rely on MRMs for data summary and analysis. As for the choice of numeric scores, we found that, except for a single study, all analyses used equal distances between categories. The only other analysis strategy that is used at a notable rate is ordered regression (n = 23; 14%). All models reported in these papers are cumulative (mixed) models, a common variant of ordered regression that will be discussed in more detail in Section 6.
While ordered regression models featured in only 14% of the studies, this rate appears to be relatively high compared to other behavioural sciences (see, e.g. Liddell & Kruschke, 2018, p. 329). It is therefore of interest to also pay heed to the methodological literature directed at language scholars, which has succeeded in expanding our data-analytic repertoire.

The Methodological Literature on Ordinal Language Data Analysis
Ordered regression models are described in several of the statistical textbooks written for a broader linguistic audience. Thus, both Baayen (2008, pp. 208-214) and Gries (2021, pp. 353-361) illustrate how to run such models in R and how to summarize associations between predictor(s) and outcome using graphs of predicted category probabilities. Barreda and Silbert (2023, pp. 407-426) also devote considerable space to this type of model; critically, they introduce ordinal regression using the latent-variable formulation that takes centre stage in the present paper. Further, a number of methodological articles have dealt with various aspects of ordinal data modelling: Janda and Endresen (2017) compare different approaches, including (mixed-effects) ordered regression modelling, ANOVA, and machine learning tools, i.e. regression and (unordered) classification trees with follow-up random-forest analyses. In their assessment, machine learning procedures evaluate most favourably in terms of model adequacy, informativity, and user-friendliness. Another paper by Baayen and Divjak (2017) illustrates how generalized additive mixed models can be applied to ordered outcomes. These capture non-linear associations between continuous predictor(s) and ordered response on an underlying latent-variable scale (see Section 6.2). We note, in passing, that all methodological treatments mentioned so far concentrate on the same form of ordinal regression: the cumulative model (see Section 6).
An area of active methodological research is experimental syntax, which often deals with ordinal data derived from acceptability judgements (see Schütze & Sprouse, 2014). This line of inquiry has directed attention to aspects such as the comparison of different measurement scales (e.g. binary choices, ordinal judgements, and magnitude estimation) and the agreement between formal and informal methods of data collection (e.g. Sprouse et al., 2013). Despite the ubiquity of ordinal scales in this field, the application of ordered regression has so far received little attention. Some recent work has explored the use of a latent-variable approach based on signal detection theory, both for binary judgements (Bader & Häussler, 2010) and ordered scales (Dillon & Wagers, 2021). For ordered responses, however, this analysis strategy can only yield difference measures on the underlying scale (Dillon & Wagers, 2021, p. 89). Though informative for certain analysis tasks, this limits the wider applicability of the procedure.
The current study concentrates on a more versatile approach to ordinal data analysis: the latent-variable perspective on the cumulative model. Before we deal with this analysis strategy, however, let us recapitulate not only the limitations, but also the strengths of MRMs.

Limitations of Mean Response Models
It has been pointed out in the statistical literature that the practice of converting an ordered response into numeric scores for analysis is problematic in certain settings. In this section, we summarize the main arguments that have been advanced against MRMs for ordered outcomes (see Long, 1997, pp. 35-40, 116-119; Agresti, 2010, pp. 5-8, 137-140).
The first major point of criticism is directed at the mapping between ordered categories and numeric values. While researchers are free in their choice of scores, equal-spaced integers appear to be a near-universal default (see Section 2.3). However, if verbal descriptions are given for categories (31% of the rating scales in our survey), perceived increments along the ordered scale will depend on how informants interpret these labels. Arguably, this concern is less of an issue with rating scales where only the endpoints are labelled (58%) and tick boxes are equally spaced, or where only numbers are used as category labels (6%).
A second major concern is floor and ceiling effects, which may arise with any form of ordinal (rating) scale. Thus, if responses, either in general or for certain subgroups, tend to be near the endpoints of the scale, the distribution of numeric scores will be skewed and measures of dispersion will be systematically smaller. Likewise, differences between conditions are compressed near the limits, which distorts data interpretation. This is especially true for interaction patterns, which may be scale-dependent, i.e. result from floor and ceiling effects (e.g. Loftus, 1978; Dillon & Wagers, 2021, pp. 65-68).
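The compression of differences near the scale limits is easy to demonstrate by simulation. The analyses in this paper are run in R, but the point is language-neutral; the following Python sketch (with hypothetical latent group means and an arbitrary rounding-and-clipping rule, none of which come from our data) shows how a one-point latent difference shrinks once responses are forced onto a bounded five-point score:

```python
import random
import statistics

def observed_score(latent):
    # Force a continuous latent preference onto the bounded scores -2, ..., +2
    return max(-2, min(2, round(latent)))

random.seed(1)

# Hypothetical latent preferences for two groups; group B sits near the ceiling
group_a = [random.gauss(1.5, 1.0) for _ in range(10_000)]
group_b = [random.gauss(2.5, 1.0) for _ in range(10_000)]

latent_diff = statistics.mean(group_b) - statistics.mean(group_a)
observed_diff = (statistics.mean(observed_score(x) for x in group_b)
                 - statistics.mean(observed_score(x) for x in group_a))

# The observed difference is systematically smaller than the latent difference
print(round(latent_diff, 2), round(observed_diff, 2))
```

In this made-up setting, a substantial share of the latent one-point difference disappears; the distortion grows the closer the groups sit to an endpoint of the scale.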
A number of other problems have been noted. For instance, discrete scores make no allowance for the fact that responses are actually compatible with a range of values underlying the ordered scale. The discomfort caused by this form of measurement error grows if the number of response categories is small, which could be argued to apply to about half of the rating scales in our survey (with five or fewer levels). Further, MRMs may yield predictions (or uncertainty bounds) beyond the limits of the scale and they do not generate predicted category probabilities, which makes it difficult to check the fit of a model against the observed data.

Some limitations can be partly addressed at the design stage of a study. To mitigate floor and ceiling effects, for instance, the endpoints should mark extreme categories (e.g. 'fully acceptable' instead of 'acceptable'), and the number of categories can be increased (which also reduces measurement error). As for the choice of scores, sensitivity analyses can be used to check whether linguistic conclusions change with reasonable alterations of the numeric scores. To reinforce an equal-distance interpretation, Janda and Endresen (2017, pp. 226-227) suggest adding numbers to tick boxes and paying attention to the visual composition on the page/screen.
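A sensitivity analysis of the kind just mentioned can be as simple as recomputing the quantity of interest under alternative scorings. A minimal Python sketch, with made-up category counts for two cohorts and one plausible unequal-interval alternative to the default scores (all values hypothetical):

```python
# Hypothetical counts of the five response categories, from 'always AmE' to 'always BrE'
counts_old = [10, 15, 20, 30, 25]    # older cohort
counts_young = [25, 30, 20, 15, 10]  # younger cohort

def group_mean(counts, scores):
    # Weighted mean of the numeric scores, given category counts
    return sum(c * s for c, s in zip(counts, scores)) / sum(counts)

equal = [-2, -1, 0, 1, 2]
unequal = [-3, -1, 0, 1, 3]  # 'always' treated as further from 'prefer'

for scores in (equal, unequal):
    diff = group_mean(counts_old, scores) - group_mean(counts_young, scores)
    print(scores, round(diff, 2))
```

If the sign (and rough magnitude) of the difference is stable across reasonable scorings, the linguistic conclusion does not hinge on the equal-distance assumption.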
While these remedies may succeed in addressing some of the concerns we have reviewed in this section, they cannot guard against all potential problems associated with MRMs. Before we turn to ordered regression models as a more adequate analysis strategy, we introduce the linguistic context of our illustrative data.

Linguistic Background and Data
The data we rely on in this paper are questionnaire-based preference ratings for pairs of synonymous expressions (e.g. package-parcel). In Section 4.1 we describe the linguistic background and structure of our data. Section 4.2 outlines the research questions driving the analyses presented in the remainder of this article.

Background and Data
Our data are drawn from a larger research project on lexical and grammatical variation in settings where English is spoken as a native, second, or foreign language. Here, we focus on the variety spoken and written in Malta and its relation to the major reference varieties of standard British and American English (BrE and AmE, for short). We concentrate on 68 pairs of lexical expressions such as truck-lorry or realization-realisation, which are known to differ in usage between BrE and AmE (see Algeo, 2006). We will refer to these as lexical pairs, and to the individual expressions as variants. Though clearly a simplification, we will use BrE (or AmE, as the case may be) when we refer to 'more British', 'exclusively British' or 'traditionally British' items.
To elicit usage preferences, our questionnaire asks informants to indicate whether they always use one of the two variants, prefer one over the other, have no preference, or do not use either expression (see Appendix 1 for illustration; Krug & Sell, 2013 for methodological details). We are therefore dealing with a symmetric 5-point ordinal scale with a meaningful midpoint and verbal descriptions for each response category. Data were collected between 2008 and 2018, and in this paper we concentrate on a subset of 500 speakers that is roughly balanced on year of birth (the terms 'speaker' and 'informant' are used interchangeably in this paper). The distribution of this variable, which spans roughly 60 years, is shown in Figure 2, where each bar indicates the number of informants for a specific birth year. While the peak around the year 1990 results from the original data collection regime, it turns out that the age distribution for the Maltese population shows a similar amplitude for cohorts born around 1990. 2 Item pairs with more than 15% missing responses were excluded, leaving 63 pairs for analysis. The data 3 used in the present study are available from TROLLing (Krug et al., 2023). Appendix 2 shows, for each item, the distribution of the five response categories by year of birth. The colour scheme adopted is used consistently in this article: Brighter shades denote the BrE variant, darker shades the AmE variant.

Research Questions
The focus of our analysis is two-fold: First, we would like to assess, for each item pair, how much Maltese English tends towards the BrE or the AmE variant. When localizing a pair on the cline between the standard varieties, we will refer to its 'Britishness', to acknowledge the historically more influential variety. Our second focus is on apparent-time trends for each item pair. This means that we are interested in the extent to which differences between age groups (as indicated by year of birth) may point to diachronic (i.e. real-time) changes in Maltese English. We would like to summarize apparent-time patterns using trend lines, which show us how usage preferences vary across cohorts. Both research questions can be addressed in a straightforward manner with an MRM. This approach, which we have used in previous work (e.g. Krug et al., 2016, 2020), will be illustrated in the next section.

Analysis Using a Mean Response Model
We first need to convert response categories to numeric scores. To facilitate interpretation, we assign the value 0 to 'No preference', the middle category. In order for our numeric scale to reflect Britishness, we use positive scores to represent a preference for BrE and negative scores for AmE. Since the wording of the response categories (see Appendix 1) makes it difficult to reason about distances between these, we opt for equal-sized steps, i.e. scores of ±1 and ±2. The numeric scale is therefore bounded at −2 and +2, and zero assumes the interpretation of a draw between the standard varieties.
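The scoring scheme can be written down directly. The paper's own analyses use R; the following Python fragment (with paraphrased category labels and made-up responses) simply illustrates the mapping and the resulting mean Britishness score:

```python
from statistics import mean

# Symmetric scores for the 5-point scale; labels paraphrased from Appendix 1
scores = {
    "always AmE": -2,
    "prefer AmE": -1,
    "no preference": 0,
    "prefer BrE": 1,
    "always BrE": 2,
}

# Hypothetical responses to one item pair
responses = ["always BrE", "prefer BrE", "no preference", "always BrE", "prefer AmE"]

britishness = mean(scores[r] for r in responses)
print(britishness)  # positive values lean towards the BrE variant
```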
For the analyses presented in this section, we use standard (mixed-effects) linear regression. To account for the data structure, all models include by-informant random intercepts, which allow for correlated responses within individuals. This provides leeway for variation in Britishness among speakers that is not captured by the predictors in the model. The fixed effects depend on the analysis task: To obtain overall Britishness scores, we include item pair as a predictor (model 1). Apparent-time trends, on the other hand, are captured with item pair, year of birth, and their interaction (model 2). 4 We used R (R Core Team, 2023) and relied on the packages 'lme4' (Bates et al., 2015), 'emmeans' (Lenth, 2023), and 'lattice' (Sarkar, 2008). Details about the analysis can be found in the OSF project associated with this article (https://osf.io/jnv27).
Figure 3 shows the distribution of the 63 Britishness scores, which are estimated means 5 based on model 1. We observe that for the majority of lexical pairs, the speakers in our sample prefer the British variant - only in five cases do we note an appreciable preference for the AmE expression. There is considerable variation among item pairs in terms of the strength of preference. The distribution hits the upper bound of the scale, with some item pairs near ceiling, indicating almost categorically British usage.
MRMs also allow us to summarize apparent-time trends in the data. Figure 4 shows how preferences for our illustrative item pair, package-parcel, gravitate across cohorts: While informants born around 1950 tend to prefer the British variant, this pattern changes across age groups, with speakers born in the 1990s leaning towards the AmE form. The line in the graph shows a straight-line summary of the estimated trend. 6 The ability to summarize trends using simple profiles allows us to take a bird's eye perspective and bring into view all item pairs in our data. To this end, Figure 5 collects apparent-time patterns (i.e. fitted regression lines) from model 2 in a single display. It appears that most pairs show a downward trend - similar to parcel-package, though not as pronounced. Note that the ceiling effect, which was evident in Figure 3 above, also surfaces in this graph - most trend lines hover at the British end of the scale.
To simplify our assessment of the patterns shown in Figure 5, we consider, for each item pair, the difference in estimated mean Britishness across the birth-year range. Due to the type of model we are using (i.e. regression with an identity link function), these differences are proportional to the simple (i.e. item-specific) slopes in our model. In other words, Figure 6 shows the slope coefficients 7 and their statistical uncertainty. Negative differences then denote a decrease in Britishness, with the magnitude reflecting the steepness of the trend lines. Figure 6, which graphs these contrasts in rank order, shows that most are negative. Our illustrative item pair (package-parcel) appears at the far left, with an estimated apparent-time decrease of 2 points (see Figure 4). Item IDs (running from 1 to 63) have been added to the display, and reference to Appendix 2 allows us to look up pairs of expressions. To assess the statistical dependability of these apparent-time differences, we have added 95% confidence intervals.
The analyses in this section demonstrate how MRMs allow us to form simple and informative data summaries: The means and trend lines we have presented provide direct answers to our research questions. In terms of model interpretation, then, MRMs serve as a benchmark against which alternative analysis strategies may be compared. At the same time, however, the distributional assumptions of MRMs make no allowance for the discreteness and boundedness of numeric scores. For the data shown in Figure 4, for instance, the model states that, at each x-location (i.e. birth year), the scatter of data points around the regression line can be approximated by a bell-shaped (Gaussian) distribution. It is evident that the actual spread of informant ratings around the trend line flagrantly violates this assumption, which casts doubt on the descriptive and inferential quantities derived from such models. We now turn to a more appropriate class of regression models, whose distributional assumptions respect the categorical nature of the outcome variable.

Ordered Regression Models
Three basic types of ordinal regression are usually distinguished, based on the way in which (changes in) category probabilities are modelled: the cumulative, the adjacent-category, and the continuation-ratio model (see Bürkner & Vuorre, 2019 for an overview; Fullerton & Xu, 2017 for a book-length treatment). We will concentrate on the most commonly used version, the cumulative model (McCullagh, 1980; McKelvey & Zavoina, 1975). Similar to other categorical regression models, cumulative models do not describe response probabilities on the data scale (probabilities), but on an unconstrained model scale, which has no upper or lower limit. Two common choices are logit and probit scores.
It turns out that the cumulative model can be understood (or derived) in two distinct ways: as a representation of (differences in) model-scaled cumulative response probabilities (Section 6.1), and as a description of a latent variable underlying the ordinal scale (Section 6.2). Our focus in this paper is on the continuous response formulation, and Sections 6.3 and 6.4 illustrate how analogues of the means, patterns and differences presented in Section 5 can be constructed on the latent-variable scale. All ordinal models reported in this paper were run in R using the package 'ordinal' (Christensen, 2022).

Cumulative Models for Ordered Outcomes
To understand how a cumulative model represents category probabilities on the model scale, we turn to a graphical illustration, which appears in Figure 7 (for a more thorough treatment, please refer to Agresti, 2010, pp. 44-84; Long, 1997, pp. 114-145). We consider a simplified analysis setting that compares preferences for a hypothetical item pair across two discrete birth years: 1950 and 2000. Category probabilities for these two points in (apparent) time are shown using stacked bars. We observe that usage preferences for informants born in 1950 are 'more British' (i.e. brighter). For informants born in 2000, there is a draw between the variants.
To represent this change in response probabilities, the cumulative model uses cumulative probabilities, which we obtain through appropriate combination of adjacent (groups of) categories. They are marked next to the stacked bars using dots. For instance, the cumulative probabilities for the hypothetical 1950 data are .43, .62 (i.e. .43 + .19), .83 (.43 + .19 + .21), and .93 (.43 + .19 + .21 + .10).
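This arithmetic is easy to verify. The computation is language-neutral (the paper's own analyses are in R); here it is in Python, using the hypothetical 1950 category probabilities from Figure 7:

```python
from itertools import accumulate

# Category probabilities for the hypothetical 1950 data in Figure 7
p_1950 = [0.43, 0.19, 0.21, 0.10, 0.07]

# Running sums; the final value is always 1 and carries no information, so it is dropped
cum_1950 = list(accumulate(p_1950))[:-1]
print([round(c, 2) for c in cum_1950])
```

With five response categories, this leaves four cumulative probabilities, one per threshold between adjacent categories.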
These probabilities are on the data scale and need to be transformed to the model scale, for which we choose logits (i.e. log odds) as a link function; the grey lines in Figure 7 illustrate the non-linear mapping. The regression model describes differences between conditions of interest (i.e. 1950 vs. 2000) using differences in cumulative logits. For this hypothetical example, this difference is −1, which means that logits decrease by 1 point. Note how all four cumulative logits (are forced to) decrease by the same amount. This is referred to as the parallel regression (or proportional odds) assumption, and the model illustrated in Figure 7 is therefore more accurately referred to as a parallel cumulative model (with a logit link). To complete our reading of Figure 7: given a set of observed data (the stacked bar charts in Figure 7), the parallel cumulative model with a logit link captures the differences between observed distributions by (i) finding a set of cumulative logits (one less than the number of ordinal categories) and (ii) a constant difference (or slope, in regression parlance) that best represents observed differences. The slope coefficient (here: −1) then signals, on the log odds scale, the change in cumulative probabilities associated with a 50-year step in birth year. Note that in an actual analysis, the fitted cumulative logits (dots) will only approximate the observed cumulative probabilities (bar charts). This is one particular way in which the parallel cumulative model can be understood. We now turn to an alternative conceptualization, which revolves around the notion of an underlying latent response variable.
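The parallel shift on the logit scale can likewise be reproduced. The Python sketch below applies the link function to the hypothetical 1950 cumulative probabilities, adds the slope of −1, and transforms back; by construction, all four logits move by exactly the same amount:

```python
from math import exp, log

def logit(p):
    # Link function: probability -> log odds
    return log(p / (1 - p))

def inv_logit(x):
    # Inverse link: log odds -> probability
    return 1 / (1 + exp(-x))

cum_1950 = [0.43, 0.62, 0.83, 0.93]
slope = -1.0  # the hypothetical 50-year effect on the cumulative logits

# The parallel cumulative model shifts every cumulative logit by the same slope
cum_2000 = [inv_logit(logit(p) + slope) for p in cum_1950]
print([round(p, 2) for p in cum_2000])
```

Because the shift is applied on the unconstrained logit scale, the back-transformed values are guaranteed to remain valid, increasing cumulative probabilities.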

Latent-Variable Representation of Cumulative Models
The latent data formulation of cumulative models stipulates a continuous variable underlying the sequence of ordered responses (see Long, 1997, pp. 116-122; Agresti, 2010, pp. 53-55). We will stick to the label Britishness to designate this latent variable. It should be noted that the range of values taken on by this latent variable depends on the data and how the model is specified. Since the model cannot estimate the intercept and all four thresholds simultaneously, a constraint must be imposed on the parameters in order for the model to be identified (see Long, 1997, pp. 122-123). This means that the model must be parameterized in such a way that these five quantities are defined using only four parameters. Two commonly used options (or identifying assumptions) are to set the intercept to 0, or to set one of the thresholds to 0. 8 Since the main focus of the present paper is on model interpretation, we would like to scale the latent variable in such a way that positive scores reflect an inclination towards the British variant, with zero assuming the same interpretation as in the MRM. The R package we are using to estimate cumulative models offers this option. 9 Before we go further, we should note that we use the label 'latent-variable model' as a convenient shorthand. In fact, this 'model' is identical to the cumulative model - the difference lies in the method of interpretation, i.e. the way in which regression coefficients are used to understand the model.
To illustrate how a latent-variable model (LVM) negotiates between the ordered response categories and the continuous response formulation, we again consider a hypothetical item pair. Let us assume we have actively defined the latent scale to be centred at the point denoting no preference between BrE and AmE, and obtain, as an estimate of the propensity towards either standard variety, an average of +0.3. This average leans towards BrE, and Figure 8(a) shows that it marks the centre of a bell-shaped distribution. This curve represents the variation in propensities underlying the responses we have collected using our questionnaire. This variability is the sum of different sources of variation. Possibly the greatest share is variation among respondents, as speakers may differ in their (strength of) preference - either in general, or for this lexical pair specifically. Other sources of variation that contribute to the spread of scores include error, perhaps due to fatigue or a misunderstanding of the meaning of the expressions.
The way in which this underlying latent scale maps onto the ordered response categories is illustrated in Figure 8(b). The vertical lines mark four thresholds, which are labelled using the Greek letter τ (tau). These divide the continuum into five intervals, one for each category. For instance, if the latent Britishness score underlying a specific response in our questionnaire (and recall that this score may include an error, or noise, component) is +1.0, the response category 'I always use the British term' will be selected.
The location of the thresholds is determined empirically, by the regression analysis. Their location and spacing depends on the number of response categories and on how they are labelled, both in absolute terms and relative to each other. The estimation of threshold parameters is usually flexible, which means that they are allowed to adapt to the distribution of the data, without any a-priori constraints (e.g. equidistance). For our 5-point scale, we would like to implement two constraints: First, since the middle category is neutral, we centre thresholds at 0. 10 The sign of the latent variable is then meaningful, since positive averages signal a preference for BrE. Further, with our response categories being identically phrased at both ends, we constrain thresholds to be symmetric around 0. 11 In this way, their spacing aligns with the firm expectation that the same verbalization should mark the same increments along the underlying scale (see Johnson, 2003). Irrespective of how thresholds are estimated, 12 they divide the area under the density curve into five parts, which are represented in Figure 8(b) with different fill colours. These areas under the curve represent the model-based estimates of the category probabilities. They are shown to the right of the graph using a more familiar graph type, a grouped bar chart.
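The mapping from latent scale to category probabilities amounts to evaluating the normal CDF at the thresholds. The following Python sketch (the paper's analyses use R) illustrates this with the latent mean of +0.3 from Figure 8 and a set of hypothetical threshold values, centred and symmetric around 0 as in our specification:

```python
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal CDF (probit link)

def category_probs(mu, thresholds):
    # Areas under a normal density (mean mu, SD 1) cut at the thresholds
    cum = [Phi(t - mu) for t in thresholds]
    return [hi - lo for hi, lo in zip(cum + [1.0], [0.0] + cum)]

tau = (-1.5, -0.5, 0.5, 1.5)  # hypothetical thresholds, symmetric around 0

probs = category_probs(0.3, tau)
print([round(p, 2) for p in probs])
```

Shifting the latent mean upwards moves probability mass into the BrE-end categories, which is exactly the behaviour of the grouped bar charts in Figures 8 and 9.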
Depending on the specific distribution that is used to represent the underlying variability, we distinguish (among others) between the cumulative model with a probit link (normal distribution) and a logit link (logistic distribution). The choice between these two is largely a matter of convenience 13 (Long, 1997, p. 120); we will use the probit link function.
We note once more that the two formulations we have reviewed for the parallel cumulative model are mathematically equivalent and therefore yield identical results, i.e. the same statistical inferences and estimates of category probabilities. The choice between the two perspectives may therefore be based on interpretative preferences. Next, we demonstrate how the latent-variable approach allows us to construct model summaries that meet the standards set by MRMs.

Comparison of Items on the Latent-Variable Scale
Figure 9 illustrates how the responses for three item pairs in our data are modelled, with the x-axis hosting the latent variable, Britishness. The normal densities have different means: For the item pair (a) trash-rubbish it is +1.2; for (b) center-centre it is +0.5; and for (c) package-parcel it is +0.2. Four thresholds divide the horizontal scale into five regions, which represent our response categories. The grouped bar charts at the right show the estimated response probabilities. Note how, for package-parcel, the preference pattern is roughly balanced: The shares of the outer categories ('I always use this expression') are 29% for package (AmE) and 40% for parcel (BrE). For center-centre, the latent-variable mean is higher, and we can observe how the proportional share of grey shades changes accordingly. The proportion of speakers 'always' using BrE centre is 60%; for BrE rubbish, it is at 79%. This illustrates the effect of the latent-variable mean on the category probabilities: As it shifts along the scale, we observe a change in response probabilities.
The model for (c) package-parcel illustrates a feature of LVMs that may seem counterintuitive at first: Even though the latent mean falls into the interval that maps onto the response 'I have no preference', this need not be the most likely category. Rather, it is the spacing and location of the thresholds relative to the bell-shaped density that determines the probability of the individual response categories.
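This can be verified with a toy computation. In the sketch below (Python, with made-up threshold values, not the estimates from our model), the latent mean of +0.2 lies inside the middle interval, yet a neighbouring category receives a higher probability, simply because its interval captures more of the density.

```python
from scipy.stats import norm

def category_probs(mean, thresholds):
    """Category probabilities under a standard-normal latent variable."""
    cuts = [float("-inf")] + list(thresholds) + [float("inf")]
    return [norm.cdf(hi - mean) - norm.cdf(lo - mean)
            for lo, hi in zip(cuts[:-1], cuts[1:])]

# Hypothetical symmetric thresholds; the middle interval (-0.3, 0.3) is narrow
probs = category_probs(mean=0.2, thresholds=[-1.2, -0.3, 0.3, 1.2])
print(round(probs[2], 3), round(probs[3], 3))
# -> 0.231 0.302: the mean falls in the 'no preference' interval,
#    but the adjacent (wider) category is more probable
```

The narrow middle interval is what drives the effect: the modal response category depends on interval widths relative to the density, not on which interval contains the mean.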
Let us also consider how this information is represented in a regression table. For the kind of analysis underlying Figure 9, there are three types of parameters: (i) thresholds, (ii) coefficients describing the item means, and (iii) a random-intercept SD, which expresses between-speaker variability in the overall level of Britishness. Further, Table 2 shows two sets of thresholds: an uncentred and a centred one (with mean 0). The latter parameters are the ones underlying Figure 9. The difference between these two sets of thresholds is a shift in scale: In the centred set, the average over the uncentred parameters (−0.61) is subtracted out.
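That the two sets of thresholds describe the same model can be checked directly: subtracting the threshold average from both the thresholds and the latent means is a pure shift of the scale and leaves all category probabilities unchanged. A sketch with made-up uncentred thresholds (only their average, −0.61, is taken from Table 2):

```python
from scipy.stats import norm

def category_probs(mean, thresholds):
    cuts = [float("-inf")] + list(thresholds) + [float("inf")]
    return [norm.cdf(hi - mean) - norm.cdf(lo - mean)
            for lo, hi in zip(cuts[:-1], cuts[1:])]

# Hypothetical uncentred thresholds whose average is -0.61
uncentred = [-2.11, -1.11, -0.11, 0.89]
shift = sum(uncentred) / len(uncentred)      # -0.61
centred = [t - shift for t in uncentred]     # average is now 0

# Shifting thresholds and latent mean by the same amount
# leaves the category probabilities untouched.
p1 = category_probs(0.5, uncentred)
p2 = category_probs(0.5 - shift, centred)
print(max(abs(a - b) for a, b in zip(p1, p2)))  # numerically zero
```

This is why the choice between the two parameterizations is purely interpretative: only the origin of the latent scale moves.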

Apparent-Time Trends on the Latent-Variable Scale
An LVM can also be used to describe trends in the data. We are then interested in how the estimated latent-variable mean for a specific item pair varies with year of birth. A graphical representation of such a model for the item pair package-parcel appears in Figure 10. Britishness is now shown on the vertical axis, which is again divided into five intervals by the thresholds. The normal distributions in the graph represent six equally spaced snapshots, starting with the year 1950 (left) and extending to the year 2000 (right). The tilted stroke connects the estimated means at these x-values, and (in accordance with the model assumptions) it forms a straight line. Estimated means are given above the graph; they range from +0.9 (1950) to −0.6 (2000). The grouped bar charts at the top show the predicted category probabilities for different birth years: The share of responses that indicate 'always' using parcel decreases from 66% (1950) to 16% (2000). Conversely, the proportion of 'always'-responses for package increases from 10% to 56%.
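The straight-line structure of this model is easy to reproduce. Taking the two endpoint means reported above (+0.9 for 1950, −0.6 for 2000) and the predictor scaling described in Note 7 (birth year rescaled to run from −1 to +1), the latent mean at any birth year follows by linear interpolation, and category probabilities then follow from the thresholds as before. The thresholds in this Python sketch are hypothetical placeholders, not our fitted values.

```python
from scipy.stats import norm

def latent_mean(year, m1950=0.9, m2000=-0.6):
    """Linear trend on the latent scale; birth year rescaled to [-1, +1]."""
    x = (year - 1975) / 25            # 1950 -> -1, 2000 -> +1
    intercept = (m1950 + m2000) / 2   # value at x = 0 (birth year 1975)
    slope = (m2000 - m1950) / 2       # change per 1-unit (25-year) step
    return intercept + slope * x

def category_probs(mean, thresholds):
    cuts = [float("-inf")] + list(thresholds) + [float("inf")]
    return [norm.cdf(hi - mean) - norm.cdf(lo - mean)
            for lo, hi in zip(cuts[:-1], cuts[1:])]

# Probability of the topmost category ('always' uses the BrE variant)
# shrinks as the latent mean moves down the scale over apparent time.
cuts = [-1.5, -0.5, 0.5, 1.5]  # hypothetical thresholds
top_1950 = category_probs(latent_mean(1950), cuts)[-1]
top_2000 = category_probs(latent_mean(2000), cuts)[-1]
print(round(top_1950, 3), round(top_2000, 3))
```

The same interpolation yields the six snapshot means shown in Figure 10; because the model is linear on the latent scale, any intermediate birth year is covered as well.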
Let us also consider a regression table for the analysis shown in Figure 10. Table 3 lists two types of parameters: thresholds and a slope coefficient describing the trend for package-parcel. Again, both uncentred and centred thresholds are shown.
The model-based change in category probabilities can also be shown on a continuous scale using an area chart. This graphical arrangement appears in Figure 11(a), where we see the continuous decrease in Britishness in apparent time. For comparison, the descriptive density plot in panel (b) offers a smoothed summary of category probabilities across birth years (see Appendix 2). We see that the fit of the model is not perfect: most notably, the share of 'no preference' responses appears to be underestimated somewhat.
We have seen how differences between items and apparent-time patterns can be modelled and summarized on the latent-variable scale. In the next section, we apply these techniques to the full set of data.

Analysis Using a Latent-Variable Model
We now use a cumulative mixed model with a probit link to address our research questions. We will first return to the gradience in Britishness among item pairs, and then consider apparent-time trends in our data. For comparison, we will present results side-by-side with those obtained from the earlier MRMs. In terms of structure (fixed and random components), the ordered regression models we report here parallel those in Section 5; this means that we used the same model syntax (models 1 and 2).

Figure 12(b) shows the latent-variable means for the 63 items. The thresholds, which guide the interpretation of these scores, also appear in the graph. Similar to the MRM-based distribution in panel (a), we note that the majority of items sits firmly in the British half of the continuum. Since the latent scale is unconstrained, however, the pile of dots shows no ceiling effect and instead forms a more symmetric distribution.
Figure 13(b) displays the set of 63 straight-line patterns based on the LVM, along with the four thresholds. Similar to the MRM, these are proportional to the item-specific slopes in the model. A comparison with panel (a) reveals two differences: First, the trend lines are spread out more evenly along the scale, which improves resolution among item pairs with high levels of Britishness. Further, the removal of the ceiling effect sensitizes the model to differences in trend at the upper end of the scale. While in panel (a) there is little variability in slopes among items near ceiling, no such scale constraints are evident in panel (b).
Let us also look at a condensed summary of the 50-year apparent-time trends. Figure 14 shows contrasts of estimated means, i.e. differences in Britishness between 2000 and 1950. Again, these are proportional to the simple slopes returned by our model. Recall that positive values indicate an increase in Britishness, and negative scores reflect a decrease. A comparison with Figure 6 shows a somewhat greater number of confidence intervals that do not cover zero. We will take a closer look at these differences in the next section.

Comparison of Results
We now present more detailed, item-level comparisons of the results from the two models, dealing in turn with the overall level of Britishness and apparent-time trends. Figure 15 shows estimated means, or Britishness scores, from the two models. Each item pair is represented by a line, which connects the estimated MRM (top) and LVM (bottom) mean. The two scales are aligned at zero, which has the same interpretation in both models. Apart from this alignment, we have scaled the horizontal extension of each set of scores based on their standard deviation. This makes them roughly comparable in terms of visual spread. The axes, however, show the actual scores (i.e. not the standardized¹⁴ scores).
What emerges from Figure 15 is that the Britishness scores from the two models are in very good agreement (Pearson correlation: +.99). The rank order of item pairs is almost identical, which is evident from the fact that only a few profiles cross. Figure 15 also illustrates how the latent scale increases the resolution among item pairs with high Britishness scores; it is more successful in revealing gradience near the extremes of the scale.
Figure 16 compares the estimates for apparent-time differences, i.e. slopes, or contrasts of estimated means, derived from the two models. The setup is the same as in Figure 15: The scales are aligned at zero (indicating no change in apparent time) and the horizontal spread of estimates is standardized, with the axis labels giving the actual differences. With a Pearson correlation of +.90, the overall agreement between the two models is still good. In comparison to Figure 15, however, we note more crossings of profiles. As we draw lines from the top (MRM) to the bottom (LVM), a recurrent pattern is for negative contrasts (signalling an apparent-time decrease in Britishness) to show a leftward deflection. In other words, for a number of items the LVM yields steeper trend lines away from the BrE variant.
Let us take another look to understand this pattern. To this end, Figure 17 shows, on the left-hand side, the differences between the standardized apparent-time contrasts of estimated means (or slopes) derived from the two models. A dot below the grey horizontal line indicates that, for this specific item pair, switching from the MRM to the LVM brings about a clockwise rotation of the (standardized) trend line. Since the majority of these (second) differences are negative, this is the typical rotation we observe.
At the right-hand side, Figure 17 graphs these differences in apparent-time trend against the estimated latent-variable mean in Britishness (using item IDs from Appendix 2 as plotting symbols). It provides an exploded view of the dot diagram and allows us to see, for each item, how the observed between-model difference in apparent-time trend relates to the overall level of (latent-scale) Britishness. We note that a clockwise rotation of trend lines tends to go hand in hand with a greater average level of Britishness. In other words, items with a high level of Britishness show a steeper decrease in Britishness in the LVM. This results from the removal of the ceiling effect: While MRM trend lines are artificially flattened near the end of the scale (see Figure 13(a)), the LVM removes this constraint and allows our straight-line summaries to stretch out, providing a better picture of the behaviour of firmly British item pairs.

Let us finally compare the statistical uncertainties in the trend coefficients. Figure 18 shows the contrasts of estimated means produced by the two models, along with 95% confidence intervals. Estimates from the LVM (see Figure 14) are shown in grey, and the item pairs are ordered based on these. The vertical position of the IDs reflects the estimated latent-scale means of Britishness; most therefore appear in the upper half of the display. The squares at the top of the graph flag items whose 95% interval does not cross zero, i.e. lexical pairs that might tentatively be considered as showing a statistically detectable trend; again, grey squares refer to the LVM.
For the MRM (black squares), 26 confidence intervals exclude zero; for the LVM, 32 item pairs show a detectable trend in apparent time. From a confirmatory perspective, then, the MRM appears to be less receptive to patterns in our data. It is therefore only for 87% of the item pairs (55 out of 63) that we arrive at the same binary statistical conclusion (if such were needed). For the remaining 8 pairs (13%), only one of the models suggests a statistically reliable trend. If we compare these mismatches to the latent-scale Britishness of items (vertical position of ID), we note two things: First, the LVM is statistically more sensitive near the endpoints of the scale (i.e. solitary grey squares go hand in hand with high latent-scale means, i.e. item IDs appearing near the top of the graph). Second, the solitary black square for item pair 10 (potato chips vs. potato crisps) is linked to a latent-scale mean near zero.¹⁵ With regard to the directionality of apparent-time patterns, it is worth noting that there are no item pairs for which the two analyses yield different conclusions.
While the overall level of agreement between the two analysis strategies is acceptable for our data, this should not be taken to imply that the choice of model is largely inconsequential. In fact, it has been shown elsewhere that there are data settings where LVMs and MRMs can diverge much more drastically (e.g. Liddell & Kruschke, 2018; Rohrer & Arslan, 2021). We should also note that, in cases of disagreement, we would certainly give preference to the LVM, as its distributional assumptions are less doubtful.

Summary and Conclusion
With the statistical analysis of ordinal variables being a recurrent theme in the methodological literature, we conducted a survey of research articles published in various linguistic journals to see how they are handled in language data analysis. We observed that ordered responses are most commonly treated in the same way as continuous variables. While limitations of this approach have been pointed out in the literature, it is perhaps equally important to recognize its strengths: Apart from being easy to implement, mean response models (MRMs) provide informative and accessible data summaries. Our survey also showed that ordered regression models are making inroads into the published literature, with cumulative models (a specific form of ordinal regression) being applied in an increasing share of analyses. The way in which such models are summarized and interpreted in these studies suggests that not many linguists may be familiar with their alternative, latent-variable formulation. Critically, this interpretative angle allows us to summarize data patterns analogously to MRMs.
The aim of this paper was to draw linguists' attention to this conceptualization of cumulative models, and to illustrate its application to language data. We used data from a research project on lexical variation in Maltese English to juxtapose analyses based on MRMs and latent-variable models (LVMs). With regard to clarity of interpretation, the data summaries provided by the two model types were on a par. At the same time, however, LVMs allowed us to avoid the limitations inherent in MRMs. Thus, they sidestep the difficult (and necessarily arbitrary) choice of numeric scores for the response categories, they cushion certain forms of measurement error, and they allow us to generate predicted response probabilities, which make it possible to compare model estimates to the observed data. LVMs also succeeded in removing from our model summaries the distortions induced by floor and ceiling effects. For MRMs we observed how variation among simple averages and regression lines was compressed near the endpoints of the scale, with trend lines being artificially flattened. While for overall Britishness scores this only led to some deformity in the relative spacing along the scale, the amount of change in apparent time suggested by our analysis varied systematically between the two models. The scale-dependence of interaction patterns therefore affected a quantity of immediate linguistic interest. We hope to have illustrated how the latent-variable conceptualization of cumulative models aids interpretability and allows us to communicate ordered regression models in a manner that is both informative and accessible. Judging from the results of our survey, which indicate that the vast majority of studies use either a mean response model or a cumulative ordered regression model, we believe that an LVM approach to ordered response variables may be of wider applicability in language data analysis.

Notes
1. All images in this paper have been published under the Creative Commons Attribution 4.0 licence (CC BY 4.0, http://creativecommons.org/licenses/by/4.0) in the accompanying OSF project (https://osf.io/jnv27).
2. https://www.populationpyramid.net/malta/2019/
3. No ethics approval was obtained for this study, since, due to its design and participant characteristics, this was (and is, as of 2023) not required by European or Maltese national regulations or policies. For a careful weighing of research-ethical considerations, please refer to the data protection impact assessment in the TROLLing post (Krug et al. 2023).
4. In the notation used here, the term in brackets denotes the random intercepts for the variable informant.
5. These are the fitted values obtained through appropriate combination of the regression coefficients (i.e. the fixed intercept and slopes).
6. This is the regression line fitted to the (unjittered) data.
7. For the regression analysis reported in this paper, the predictor year of birth was centred and rescaled to range from −1 (1950) to +1 (2000). Figure 5 therefore shows slope coefficients (and their CIs) multiplied by 2, to obtain the difference associated with a 2-unit (i.e. 50-year) rather than 1-unit (25-year) change in the predictor.
8. The parameterization of the model determines the centre of the latent scale and is therefore one of the features affecting the values taken on by the latent variable. A model feature that influences the variability of the latent variable is the link function chosen (e.g. probit or logit): For the logit link, the latent-variable variance is π²/3, i.e. about 3.3 times larger than for the probit link. Finally, the data will affect the value of the latent variable (only) if the model is identified by setting the intercept to 0. The centre of the latent scale will then depend on the scaling of the predictor variables (i.e. which condition is referenced when setting all predictors to 0) and the association between predictors and outcome (i.e. the expected distribution of the outcome variable for the condition denoted by the intercept).
9. As we will see shortly, this involves defining the thresholds (i.e. parameterizing the model) in such a way that the mean of the threshold parameters is zero.
10. As explained in more detail in the tutorial that is included in the supplementary materials (https://osf.io/jnv27), in our analyses this constraint is actually enacted at the post-processing stage. Thus, the parameterization we used when fitting the model with the R package 'ordinal' does not define thresholds to be centred. It is the package we use to construct model predictions (emmeans; Lenth, 2023) that (by default) returns predictions (or estimates) on a latent scale that is centred at the average over thresholds.
11. In the R package 'ordinal', this can be achieved by setting the argument 'threshold' to 'symmetric'. This is illustrated in the short R tutorial that accompanies the present paper (https://osf.io/78ezm).
12. While the custom thresholds we are using in our analysis make sense in light of our ordinal response scale and our interpretative preferences, the basic point the current paper is trying to make does not hinge on how thresholds are parameterized. Thus, if they are unstructured (i.e. estimated flexibly, with no a-priori constraints on their relative spacing or midpoint), the same methods of interpretation can still be applied. As a technical aside, however, we note that a benefit of constraining thresholds in this way is that it reduces the number of parameters that must be estimated, thus yielding a more parsimonious model structure.
13. We thank a perceptive reviewer for the following hint: Since the logistic distribution (i.e. the probability distribution underlying the logit link) is well approximated by a Student-t distribution with about 8 degrees of freedom (see Agresti, 2010, p. 330), the logit link is also more robust to aberrant data points.
14. These scores are in fact only partly standardized, since they are not centred about their mean.
15. This is due to the fact that the MRM relies on a combined (i.e. pooled) estimate of residual variation around the conditional averages. Since the variation of scores is smaller near the endpoints of the scale (see Section 3), the residual standard deviation is downwardly biased for conditional means near the scale midpoint. The opposite is true, of course, for estimates near the bounds of the scale.

Figure 2 .
Figure 2. Distribution of year of birth in our sample of speakers.

Figure 3 .
Figure 3. Gradience in Britishness: estimated means for the 63 item pairs, based on an MRM (model 1).

Figure 4. Figure 5.
Figure 4. Data representation for item pair package-parcel using an MRM; the trend line shows estimated means by year of birth (model 2).

Figure 6 .
Figure 6. Apparent-time trends based on an MRM: contrasts of estimated means for birth years 2000 and 1950 with 95% confidence intervals (model 2).

Figure 7
Reading Figure 7 from left to right, the new set of (lower) cumulative logits then maps back to the data scale, i.e. the category probabilities shown at the far right. This means that, given a set of observed data (the stacked bar charts in Figure 7), the parallel cumulative model with a logit link captures the differences between observed distributions by finding (i) a set of cumulative logits (one less than the number of ordinal categories) and (ii) a constant difference (or slope, in regression parlance) that best represents the observed differences. The slope coefficient (here: −1) then signals, on the log odds scale, the change in cumulative probabilities associated with a 50-year step in birth year. Note that in an actual analysis, the fitted cumulative logits (dots) will only approximate the observed cumulative probabilities (bar charts). This is one particular way in which the parallel cumulative model can be understood. We now turn to an alternative conceptualization, which revolves around the notion of an underlying latent response variable.

Figure 7 .
Figure 7. Illustration of the parallel cumulative model using hypothetical data: cumulative category probabilities (data scale) and cumulative logits (model scale) modelling differences in response patterns.

Figure 8 .
Figure 8. Illustration of the latent-variable formulation of the cumulative model.

Figure 9 .
Figure 9. LVM for three illustrative item pairs in our data set.

Figure 10 .
Figure 10. LVM for the apparent-time trend in the item pair parcel-package.

Figure 11 .
Figure 11. Visualization of the apparent-time trend in parcel-package: (a) model-based category probabilities; (b) a flexible summary using a conditional density plot.

Figure 12. Figure 13.

Figure 14 .
Figure 14. Statistical uncertainty estimates for apparent-time trends based on an LVM (model 2). Error bars denote 95% confidence intervals.

Figure 15 .
Figure 15. Comparison of estimated means (Britishness scores) of all 63 item pairs based on an MRM (top) and an LVM (bottom) (model 1).

Figure 16 .
Figure 16. Comparison of contrasts of estimated means (our estimates of apparent-time trends) for all 63 items, based on an MRM (top) and an LVM (bottom) (model 2).

Figure 17 .
Figure 17. Discrepancy between standardized contrasts of estimated means (standardized apparent-time trends) generated by the two models, and its relation to the overall level on the latent scale, i.e. the estimated latent-scale mean (Britishness score).

Table 1 .
Structure of ordinal scales used in rating tasks (n = 136 articles).

Table 2 .
Table of coefficients for the analysis underlying Figure 9.

Table 3 .
Table of coefficients for the analysis underlying Figure 10.