Testing linearity and relevance of ordinal predictors

: In a linear model relevance of a categorical predictor with or- dered levels is typically tested by use of the standard F -test (known from statistical textbooks). Such a test can also be applied for testing whether the regression function is linear in the ordinal predictor’s class labels. In this paper we propose an alternative (restricted) likelihood ratio test for these hypotheses which is especially suited for ordinal predictors and is based on the mixed model formulation of penalized dummy coeﬃcients. We show in simulation studies that the new test is more powerful than the standard F -test in many situations. The advantage of the new test is especially striking when the number of ordered levels is moderate or large. Using the relationship to mixed eﬀect models and robust existent ﬁtting software obtaining the test and its null distribution is very fast; a fast R implementation is provided.


Introduction
We consider an ordinal covariate x with levels k = 0, . . . , K. Such ordinal predictors are, for example, often found in the social sciences. Frequently, however, they are treated as continuous and the class label is directly used as explanatory variable in a (linear) regression model for predicting a response variable y. Thomas (1997), for example, used linear regression to describe the relationship between a manager's values and his or her success, with values being coded using an ordinal scale ranging from 1 (lowest priority) to 18 (highest priority). Labowitz (1970) used ordinally scaled prestige ratings to predict suicide rates for different occupations. Also Kim (1975Kim ( , 1978 argues that Pearson's correlation can be used to quantify the association between two variables if one (or even both) of these is ordinal. But not only in the social sciences, ordinal covariates play an important role. In particular when rating schemes are used, variables are often ordinally scaled; and schemes like that are found in many fields, such as medicine. For example, cell characteristics may be graded on ordinal scales with 1 being the closest to normal tissue and 10 the most anaplastic (Wolberg and Mangasarian, 1990), or see WHO (2001), Cieza et al. (2004) and Section 7 for an example from rehabilitation medicine.
In standard linear models the conditional expectation E(y|x) of y given an ordinal predictor x, is assumed to have a simple form where ǫ is a zero-mean random variable. Typically, such models make the additional assumption that ǫ is normal and has variance σ 2 . So for given data (y i , x i ), i = 1, . . . , n, we have with independent identically distributed (iid) ǫ i ∼ N (0, σ 2 ). Such iid normal errors ǫ i are assumed throughout this paper. Even though model (1) is often used in practice, the variable x is ordinal and may alternatively be included using dummy-coding as where the dummy variables m k are defined as That means, a one-dimensional variable x coding the factor level by a single number is transformed into a multivariate variable (m 0 , . . . , m K ) T . For example, if x has levels 0, . . . , 5, x = 3 is transformed into (0, 0, 0, 1, 0, 0) T . Each dummy variable m k has its own (dummy) regression coefficient β k . For identifiability, we specify reference category k = 0, such that β 0 = 0.
As restrictions (3) (and (4)) are linear and the parameter space of θ = (α, β 1 , . . . . . . , β K ) T from model (2) is R K+1 , it follows that model (1) can be tested against model (2) using a standard F -test. If one is interested in the global test whether the predictor x is necessary, the null-hypothesis β k = 0 for all k = 1, . . . , K can also be tested using a standard F -test. These F -tests are shortly summarized in the next section. Section 3 shows how penalized estimates for dummy-coded ordinal predictors can be obtained, and in Section 4 these estimates are derived in the context of linear mixed models. Using the mixed model formulation, we propose an alternative (restricted) likelihood ratio test for checking linearity -i.e., distinguish between model (1) and (2) -and checking relevance of an ordinal predictor (Section 5). Simulation studies in Section 6 show that the proposed test distinctly outperforms the standard F -test in many situations. In Section 7, real data examples are given.

Standard F-tests
Assuming normal errors ǫ, the F -test can be used to test linear null-hypotheses of the form H 0 : Cθ = d, where θ = (α, β 1 , . . . , β K ) T denotes the coefficient vector from the dummy model (2) and C is a r × (K + 1) matrix of rank r. The matrix C and vector d define the linear hypotheses that are to be tested. For testing, the likelihood ratio statistic can be used, with L(θ) denoting the data likelihood evaluated at the parameter vector θ. H A denotes the alternative hypothesis Cθ = d. Assuming H 0 is true, the monotone transformation follows an F -distribution with (r, n − K − 1) degrees of freedom (see, e.g., Rao et al. (2008)). SSE denotes the sum of squared errors under the full model, and SSE 0 is the sum of squared errors if restrictions Cθ = d hold.

Penalized estimates for ordinal predictors
When estimating the regression parameters in model (2) is maximized, with l(θ) denoting the usual, non-penalized, likelihood. By adding the penalty J(θ), estimates of the regression parameters can be stabilized. The choice of the penalty J(θ) is crucial. Because the categories of the predictor x are ordered, it may be assumed that the response y (or its conditional expectation) changes smoothly between two adjacent levels k and k − 1 of x. This led Gertheiss and Tutz (2009) to propose the quadratic difference penalty where β = (β 1 , . . . , β K ) T , penalty matrix Ω 1 = D T 1 D 1 , and An implementation of this penalty for ordinal predictors is provided in the R package ordPens (Gertheiss, 2011), and also given in the Appendix. The strength of the penalization in (5) is controlled by the penalty parameter γ.
With γ = 0, pure ML estimates are obtained. With γ → ∞, all estimatesβ k equal β 0 = 0, which means that in the model response y is not influenced by predictor x. Ifβ k = 0 ∀k is desired for some γ < ∞, a group lasso (Yuan and Lin, 2006) penalty can be used (see Gertheiss et al. (2011) for details). Under the assumption that the true coefficient vector β is non-linear but smooth, it has been shown in Gertheiss and Tutz (2009) that parameter estimation and prediction can be substantially improved using the proposed penalized likelihood compared to unpenalized likelihood, simple ridge estimation, or linear regression on the category labels.
Similarly, if deviations from a linear function -that is, from model (1) -are to be penalized, the penalty has the form with penalty matrix Ω 2 = D T 2 D 2 , and In principle, penalties of higher order are also possible, but we found these two penalties to be especially useful for ordinal predictors because: (1) they incorporate the ordinal scale level of x; and (2) penalize derivations from a model without predictor x, and a linear model based on the class labels, respectively. The penalty parameter γ can be chosen based on the data. Popular approaches are (leave-one-out/K-fold/generalized) cross-validation or information criteria, such as AIC or BIC. In the next section we propose to use the duality between penalized likelihood and linear mixed models, which allows the estimation of the penalty parameter, γ, using ML or REML; this also provides a natural framework for testing hypotheses of constancy β 1 = · · · = β K = β 0 = 0 and linearity.

Linear mixed models
In a linear mixed model with one variance component it is assumed that where y = (y 1 , . . . , y n ) is the vector of response values, ξ is the p-dimensional vector of fixed effects, u is the q-dimensional vector of random effects, ǫ = (ǫ 1 , . . . , ǫ n ) T is the error vector, and X and Z are adequate design matrices (see, for example, Ruppert et al. (2003), or Crainiceanu and Ruppert (2004)). For the vectors of random variables u and ǫ, we assume where I n denotes the n × n identity matrix (iid errors are assumed). Under the assumption of normal u and ǫ, we have and maximizing the likelihood of (y, u) over the unknown ξ and u, leads to the criterion (cf. Ruppert et al., 2003) σ Minimization of these loss functions leads to estimates with M = (X|Z) and
Then Z contains modified dummy variables z 1 , . . . , z K with In this case, G −1 is just the K × K identity matrix I K , since with (12) we have u = D 1 β (see (7)).
Deriving the penalized estimator with second-order difference penalty (8) using a linear mixed model with identity matrix G −1 is only slightly more involved. The linear part of the dummy-coded design matrix has to be separated from the non-linear part. This can be done by writing X = (1|χ) with χ = (x 1 , . . . , x n ) T (remember: x i are observations of x ∈ {0, . . . , K}) and ξ = (α, δ). Now, the random effects u 1 , . . . , u K−1 have to specify deviations of model (2) from linearity. To achieve this, we write Parameterizations (β 1 , . . . , β K ) T and (δ, Thus, the design matrix Z in model (10) needs to contain variables z 1 , . . . , z K−1 with In Appendix A, it is explained more explicitly how design matrix Z and variables z k are constructed.

Estimation of variance/penalty parameters
The advantage of using the mixed model formulation of the penalized estimates is that the penalty parameter γ equals σ 2 /τ 2 , and likelihood methods for estimating σ 2 and τ 2 can be used for estimating γ. A direct consequence of this is that testing the two hypotheses we are interested in can be reduced to testing certain zero variance components. For estimating σ 2 and τ 2 , ML or restricted maximum likelihood (REML) estimation are typically used. ML estimation is based on the the marginal model with covariance matrix Σ = τ 2 ZGZ T + σ 2 I n . The log-likelihood of y under this model is Maximization over the unknown parameters in ξ and Σ leads to ML estimates. For any fixed Σ, l(ξ, Σ) is maximized over ξ byξ = (X T Σ −1 X) −1 X T Σ −1 y, which equalsξ from (11), cf. Ruppert et al. (2003). If G is just the identity matrix, as it is for our penalized estimates, matrix Σ becomes τ 2 ZZ T + σ 2 I n , where only τ 2 and σ 2 are unknown. Crainiceanu and Ruppert (2004) use another parametrization: With ϑ = τ 2 /σ 2 and V ϑ = I n + ϑZZ T , we have Σ = σ 2 V ϑ . So the log-likelihood becomes A second parameter estimation criterion for our model is the restricted loglikelihood (cf. Harville, 1977;Crainiceanu et al., 2005) l where p denotes the number of columns of the fixed effects design matrix X. In case of a first-difference penalty on adjacent dummy coefficients we have p = 1; if derivations from linearity are penalized, p = 2.
The test problems introduced above are nonstandard because under the null hypothesis the parameter of interest (τ 2 , resp. ϑ) is on the boundary of the parameter space. Nevertheless, the (log) likelihood ratio test statistic LRT n may be used. Following Crainiceanu and Ruppert (2004) and Crainiceanu et al. (2005), we define Because in our case the fixed effects contained in ξ are the same under the null and the alternative hypotheses, we can also use the restricted likelihood ratio statistic (cf. Crainiceanu et al., 2005) RLRT n = 2 sup Crainiceanu and Ruppert (2004) derived the finite sample and asymptotic distributions of LRT n and RLRT n when testing for a zero random effects variance in linear mixed models with one variance component. Because of the mixed model representation of penalized dummy variables, these results can also be used for testing relevance and linearity of ordinal predictors.
With g denoting the number of random effects (that means, g = K or g = K −1, depending on the penalty employed), let µ s,n and ν s,n be the g eigenvalues of matrices Z T (I n −X(X T X) −1 X T )Z and Z T Z, respectively. Then, as a special case of Theorem 1 in Crainiceanu and Ruppert (2004), where = D denotes equality in distribution, and and w s are independent N (0, 1), s = 1, . . . , n − p; see also Crainiceanu et al. (2003). For the restricted likelihood ratio statistic (see Crainiceanu and Ruppert, 2004), we obtain Crainiceanu and Ruppert (2004) also give an algorithm for simulating the distributions of LRT n and RLRT n , which is implemented in the R package RLRsim (Scheipl, 2010). Since REML estimates of the variance parameters are preferable to ML (cf. Ruppert et al., 2003) and restricted likelihood based tests have been seen to perform better than those using the unrestricted likelihood (Morrell, 1998), we will focus on RLRT n .

Simulation studies
To investigate the performance of the proposed restricted likelihood ratio test, and to compare it to the standard F -test, we run several simulation studies. We will first consider the null hypothesis of constancy/relevance, i.e., H 0 : β 1 = · · · = β K = β 0 = 0. Then we will check linearity of dummy coefficients, i.e., H 0 : All computations are done using the statistical program R (R Development Core Team, 2010). For calculating the proposed restricted likelihood ratio statistic, we use the R package amer (Scheipl, 2011), since user defined basis functions (e.g., z k as defined in (13), resp. (15)) can easily be added. To perform the test, we use RLRsim (Scheipl, 2010). A more detailed description of the corresponding R-code is given in Appendix B.

Testing for relevance
We consider three scenarios with three different regression functions that are defined on the the categories 0, . . . , K of predictor x, see the left panels in Figures 1, 2 and 3. The response y is generated according to model (2) with standard normal error. We assume a sample size of n = 100. The effect strength is controlled by the parameter a ∈ [0, 1], where a = 0 means that y is not influenced by x, i.e., the null hypothesis H 0 : β 1 = · · · = β K = β 0 = 0 is true (green lines). For each regression function considered, the red line corresponds to a = 1. The other lines in the left panels of Figures 1, 2 and 3 correspond to a = 0.1, 0.2, . . . , 0.9. We consider two nonlinear functions (Figure 1 and 2) and a linear one (Figure 3), and investigate the power of the proposed restricted likelihood ratio test (RLRT) as a function of a for different values of K (middle/right panels of Figures 1, 2 and 3). Note, since k = 0, . . . , K, e.g., K = 4 means that predictor x has 5 levels. The power of the test is computed as the proportion of rejections of H 0 over 10,000 (independent) repetitions of data generation and testing. The significance level is α = 0.05 (dotted lines in Figure 1, 2 and 3). We compare the proposed RLRT (red lines) to some alternative tests. The standard F -test is carried out by the R function anova() from the base distribution (R Development Core Team, 2010). We compare the intercept model (fitted by lm(y~1)) and the model with dummy-coded predictor x (fitted by lm(y~factor(x))). In addition, we fit a linear model with the group labels of x as predictors (lm(y~x)), and compare it to the intercept model. The results of these F -tests are indicated as F anova and F lm , respectively.
Alternatively to linear modeling of x, analysts may also fit a smooth regression model using the gam() function from the mgcv package (Wood, 2006;Wood, 2011), and check significance of the smooth term (however, this would incorrectly treat x as continuous). For comparison, we also add results for such a test to Figures 1, 2 and 3, when a P-spline basis (Eilers and Marx, 1996) is used and derivations from a constant line over categories of x are penalized. The number of basis functions is chosen as the default value in mgcv, which is 9 in our case and seems sufficient since the true regression functions (see Figures 1, 2 and 3) are quite smooth and can hence be well approximated by a rather low-dimensional B-spline basis. The test used is the approximative F -test proposed by Wood (2006), which is part of the default gam-output. So the test is labeled as F gam .
From Figures 1, 2 and 3 it follows that in these cases RLRT is among the most powerful tests. It clearly outperforms F anova , particularly when the number of levels of x is high. F lm is only superior to RLRT if the true underlying regression function is linear (3). F gam does not seem to be as powerful as RLRT for ordinal predictors (particularly in Figures 2 and 3). Furthermore, F gam may not be a reliable α-level test, as, for example, shown by Scheipl et al. (2008). Though in Figures 1, 2 and 3 it seems that F gam has correct size, this is not the case. In our simulations for a = 0, the proportion of rejections of H 0 was mostly higher than 0.05 when F gam was applied. More precisely, if K ∈ {9, 14, 19}, the (observed) size of F gam was between 0.06 and 0.07 (which can be seen if the respective plots in Figure 1, 2 and 3 are zoomed in). F anova , F lm and RLRT, by contrast, have correct size. Left: The second set of considered regression functions (for K = 9) when testing null hypothesis H 0 : β 1 = · · · = β K = β 0 = 0, which is true for a = 0 (green lines). The red lines correspond to a = 1, black lines to a = 0.1, 0.2, . . . , 0.9. Right: Proportions of rejections of H 0 as a function of a when RLRT, Fanova, F lm and Fgam are applied.

Testing for linearity
To investigate if the proposed restricted likelihood ratio test for ordinal predictors is also suitable for testing linearity, we consider three types of nonlinear functions, see the left panels in Figures 4, 5 and 6. Now the extent of nonlinearity is controlled by parameter a ∈ [0, 1], where a = 0 means that the conditional expectation of y given predictor x is a linear function of the x-levels 0, . . . , K. That means, (assuming dummy model (2)) the null hypothesis H 0 : β k+1 − 2β k + β k−1 = 0, k = 1, . . . , K − 1, is true -indicated by green lines in the left panels of Figures 4, 5 and 6. Red lines correspond to a = 1 again, and the other lines to a = 0.1, 0.2, . . . , 0.9.
We compare the proposed RLRT, the standard F -test F anova (as described in Section 2) and the approximative F -test F gam from mgcv (Wood, 2006;Wood, 2011) when x is treated as a continuous covariate. For F gam , we use the procedure  suggested by Scheipl et al. (2008): We fit a model in which the smooth term represents only the deviations from linearity, which is done by fitting a model with an explicit linear term in x and an additional P-spline term in x with first-order difference penalization. We also tried the more general approximative F -test for comparing two nested GAMs as proposed by Wood (2006). But in our case this test was far away from being an α-level test (sometimes with rejection rates of around 20%, or even higher, for a = 0).
For data generation, we use dummy model (2), again with standard normal error and n = 100. As before, we set α = 0.05 and give rejection rates (computed over 10,000 simulated data sets) for the different tests as a function of a (right panels of Figures 4, 5 and 6). For a true function as given in Figures 4 and 6, RLRT distinctly outperforms F gam and F anova , with differences between RLRT and F anova becoming larger with increasing K. For a small number of levels K, F anova and F gam perform almost equally. If K is actually small and the true  regression function tends towards a smoothed step function ( Figure 5, left), F anova and F gam are superior to RLRT, as seen in Figure 5 (top). But also in this case, such superiority is not found anymore when the number of levels increases.

Rent standard data
We consider data from the Munich rent standard 2003. All larger German cities publish so-called rent standards for having guidelines available to tenants, landlords, renting advisory boards and experts. Rent standards are used, in particular, to determine the local comparative rent, cf. Gertheiss and Tutz (2010). Some of the data the Munich rent standard is based on can be found in the data archive of the Department of Statistics at the University of Munich. 1 We use the data to investigate the relationship between the rent per square meter (in Euro) and the number of rooms of the corresponding apartment, but we only consider new buildings, with 'new' meaning (according to the official rent standard definitions) that the building has been constructed after 1977. In Figure 7 (left) boxplots of the observed rents per square meter are shown for apartments with 1, 2, . . . , 6 rooms. In seems that the rent (per square meter) decreases for apartments with more rooms. When looking at the mean rents in each group -that is, fitting a dummy model by pure ML -it seems that rents decrease at first, but then increase for apartments with 5 and 6 rooms (green • in Figure 7, right). If a first-order difference penalty is imposed on adjacent dummy coefficients, rents for 4, 5 and 6 rooms seem to be (almost) equal (red * in Figure 7, right). When using a P-spline, fitted rents are very similar (blue ⋄ in Figure 7, right).
To check whether there are significant differences between the rents per square meter in apartments with a different number of rooms, we may use classical ANOVA, that is, the standard F -test. Though rents seem to be different in Figure 7, the test indicates just borderline significance on the 0.1-level (p-value 0.0995). Results for the approximative F -test from the gam-output are similar (pvalue 0.097). The proposed RLRT for ordinal predictors, however, indicates that rents are differing significantly on the 0.05-level (p-value 0.023). Nonlinearity that has been observed in Figure 7 is not indicated as significant by any of the considered tests. A possible explanation is that in the groups which are "responsible" for nonlinearity, namely 5 and 6, there are only a few observations Boxplots of the observed rents per square meter (in Euro) against the number of rooms of the considered apartments (left), and the corresponding fitted mean rents (right). For fitting, we used unpenalized ML estimation of dummy coefficients (green •), penalized estimation with first-difference penalty on adjacent dummy coefficients with the penalty parameter estimated by REML (red * ), and a penalized spline (using the gam-function from mgcv, blue ⋄).
available (5 and 4, respectively), as seen from Figure 7 (left), where widths of the boxes are proportional to the square-roots of the number of observations in each group.

ICF data
As another example, we consider patients with chronic widespread pain (CWP) and investigate the bivariate association between a score of physical health and the ICF category 'individual attitudes of immediate family members'. The ICF -the International Classification of Functioning, Disability and Health -provides a unified and standard language and framework for the description of functioning and disability (see WHO, 2001). The ICF consists of about 1400 so-called ICF categories, each of which refers to a health or a health-related domain. Category e410 'individual attitudes of immediate family members' is a so-called environmental factor. For environmental factors a differentiation is made between barriers and facilitators resulting in the coding scheme −4 (complete barrier), . . . , −1 (mild barrier), 0 (no barrier/facilitator), +1 (mild facilitator), . . . , +4 (complete facilitator). That means, e410 indicates whether, and to what extend, attitudes of family member are rather a barrier or a facilitator with respect to the patients' daily life. We investigate the relationship between e410 and a physical health component summary measure which is based on the SF-36 questionnaire (Ware and Sherbourne, 1992;McHorney et al., 1993). The data analyzed have been collected in a multicenter, international study (sample size n = 420), and are part of the R package ordPens (Gertheiss, 2011), see Gertheiss et al. (2011) and Cieza et al. (2004) for details. Figure 8 shows boxplots of the observed physical health component scores for the different levels of the 'individual attitudes of immediate family members'. For testing whether there are differences between the levels, we use F anova , F gam and the proposed RLRT. All three tests indicate highly significant differences with p-values being given in Table 1. Figure 9 (left) shows the corresponding fitted mean health scores for each level of ICF category e410, where penalizing methods impose a penalty on deviations from constancy. The right panel of Figure 9 shows the analogous plot if deviations from linearity are penalized. The relationship seems highly nonlinear. And indeed, the corresponding tests indicate highly significant deviations from linearity, see Table 1 (right). It is also seen that the p-values of RLRT for both testing constancy and linearity are much smaller than those of F anova and F gam . Though all tests imply that the null-hypotheses of constancy and linearity can be rejected, the smaller p-values may become relevant if more than one ICF category is considered and p-values have to be adjusted.
Concerning parameter estimation, results of different ordinal penalties seem quite unalike (compare red * in the left and right panel of Figure 9), whereas spline fits (blue ⋄) with different penalties are very similar. However, differences between the two types of ordinal penalties are mainly caused by differences with respect to category 4 and, more importantly, −4. But the number of observations in these extreme categories is very low. In category 4 we have just 10 observations, and only one in −4 (see Figure 8), which makes estimates of respective dummy coefficients very unreliable. This uncertainty at the boundary is also reflected by pointwise confidence intervals as provided by amer or mgcv for splines Table 1 Results of different tests in terms of p-values when testing relevance and linearity of the ordinal covariate 'individual attitudes of immediate family members'. The physical health component score shown in Figure 8 is considered as response Testing Relevance Testing Linearity p-values p-values (not shown). The test proposed here, by contrast, is not for hypotheses with respect to single coefficients, but for global hypotheses of constancy and linearity.

Summary and discussion
We proposed a new test for ordinal predictors in the classical linear model. When testing relevance or linearity of an ordinal covariate this can be interpreted as testing for a zero variance component in a linear mixed model. So the proposed test is based on the restricted likelihood ratio test for zero variance components in linear mixed models and uses the exact test distribution derived by Crainiceanu and Ruppert (2004).
We showed that in many situations the new test outperforms the standard F -test which is usually applied for categorical covariates, in particular -but not only -when the number of levels of the independent variable is high. The proposed test was mostly also superior to (approximative) testing in additive models that may be used if the ordinal predictor is treated as continuous and modeled smoothly using splines. In addition, the latter test may not be a reliable α-level test. So the presented RLRT for ordinal predictors can be seen as a very powerful alternative to existing F -tests when the levels of a categorical covariate are ordered. The advantages of the new test over existing F -tests, which may lead to an increase in power, can be summarized as follows. While the standard F -test for categorical data only uses the predictor's nominal scale level, the proposed test also takes the covariate's ordinal nature into account -thus using more information. Differences between the new test and the approximate test based on splines (and treating the ordinal predictor as continuous) are, on the one hand, not such big. Both tests use basis expansions and a difference penalty on basis coefficients. On the other hand, however, there are two important differences: The proposed test is an exact test, whereas the approximate test is not. Furthermore, with dummy-coding instead of, e.g., B-splines, the new test uses basis functions that are somewhat natural for discrete data, as values between predictor levels cannot be observed anyway. When only relevance of the ordinal predictor is to be tested, predictor levels may also be directly included in a linear model, assuming a linear relationship between the predictor level and the response (if there is some dependence). If the relationship is in fact (almost) linear, the corresponding F -test (or the equivalent t-test) is a very powerful test. If the true dependence is, however, clearly nonlinear (as non-monotone, for example), this F -test is often outperformed by the proposed RLR test.
In some situations it is less relevant of course to test whether there is a linear relationship between class labels and the response; for example, when a movie rating scheme is considered with labels "very bad", "bad", "okay", "good", "very good". In cases like that, however, it makes no sense either to assume such linear relationship when testing for relevance, whereas dummy-coding of the predictor seems to be adequate. Therefore the proposed RLRT is highly attractive when in such situations relevance of the ordinal covariate is to be tested.
Though in our simulations we only considered one ordinal predictor, the test is directly applicable to multiple regression problems if all additional covariates can be treated as fixed effects that are not be tested simultaneously (see Crainiceanu et al., 2005). In the case of more than one ordinal predictor that may be treated as a random effect, approximations of the RLRT distribution proposed by Greven et al. (2008) may be used.
For performing the test, existing software can be used and only a little additional implementation is necessary, as given in the Appendix. Instead of R package amer, mgcv could alternatively be used to fit the penalized regression model. Only the ordinal predictor has to be coded in the right way and the adequate penalty matrices need to be given. So ordinal predictors can be combined with several other types of predictors, including even functional covariates, for example. In addition, extensions to longitudinal or clustered data are possible by including additional random effects.

Acknowledgements
We thank Ciprian Crainiceanu, Sonja Greven, Fabian Scheipl, Gerhard Tutz and two anonymous referees for a number of very helpful comments and suggestions.
Appendix A: Construction of design matrix Z We consider the linear mixed model (10) and the representation of penalized dummy variables in the linear mixed models framework (see Subsection 4.2).

> fm1 <-amer(y~cd(x), data = dyx, basisGenerators = c("cd"))
The proposed test can then be done using exactRLRT() from the package RLRsim, where for models with only one variance component only the model under the alternative need to be specified (see the help files of exactRLRT for details):