The VGAM Package for Categorical Data Analysis

Classical categorical regression models such as the multinomial logit and proportional odds models are shown to be readily handled by the vector generalized linear and additive model (VGLM/VGAM) framework. Additionally, there are natural extensions, such as reduced-rank VGLMs for dimension reduction, and allowing covariates that have values specific to each linear/additive predictor, e.g., for consumer choice modeling. This article describes some of the framework behind the VGAM R package, its usage, and implementation details.


Introduction
The subject of categorical data analysis (CDA) is concerned with analyses where the response is categorical, regardless of whether the explanatory variables are continuous or categorical. Such data are very common. Over the years several CDA regression models for polytomous responses have become popular, e.g., those in Table 1. Not surprisingly, the models are interrelated: their foundation is the multinomial distribution, and consequently they share similar and overlapping properties which modellers should know and exploit. Unfortunately, software has been slow to reflect their commonality, and this makes analyses unnecessarily difficult for the practitioner on several fronts, e.g., using different functions/procedures to fit different models, which does not aid the understanding of their connections.
Table 1 (excerpt):

  Quantity                   Notation  VGAM family function
  P(Y = j + 1) / P(Y = j)    ζ_j       acat()
  P(Y = j) / P(Y = j + 1)    ζ_j^R     acat(reverse = TRUE)

Covariates x have been omitted for clarity. The LHS quantities are η_j or η_{j-1} for j = 1, ..., M (not reversed) and j = 2, ..., M + 1 (if reversed), respectively. All models are estimated by minimizing the deviance. All except multinomial() are suited to ordinal Y.

This historical misfortune can be seen by considering R functions for CDA. From the Comprehensive R Archive Network (CRAN, http://CRAN.R-project.org/) there is polr() (in MASS; Venables and Ripley 2002) for the proportional odds model and multinom() (in nnet; Venables and Ripley 2002) for the multinomial logit model. However, both of these can be
considered 'one-off' modeling functions rather than providing a unified offering for CDA. The function lrm() (in rms; Harrell, Jr. 2009) has greater functionality: it can fit the proportional odds model (and the forward continuation ratio model upon preprocessing). Neither polr() nor lrm() appears able to fit the nonproportional odds model. There are non-CRAN packages too, such as the modeling function nordr() (in gnlm; Lindsey 2007), which can fit the proportional odds, continuation ratio and adjacent categories models; however it calls nlm() and the user must supply starting values. In general these R (R Development Core Team 2009) modeling functions are not modular, often require preprocessing, and sometimes are not self-starting. The implementations are piecemeal in nature.
Consequently, if the practitioner wishes to fit the models of Table 1, there is a need to master several modeling functions from several packages, each with its own syntax. This is a hindrance to efficient CDA.
SAS (SAS Institute Inc. 2003) does not fare much better than R. Indeed, it could be considered as having an excess of options which bewilders the non-expert user; there is little coherent overriding structure. Its proc logistic handles the multinomial logit and proportional odds models, as well as exact logistic regression (see Stokes et al. 2000, which is for Version 8 of SAS). The fact that the proportional odds model may be fitted by proc logistic, proc genmod and proc probit arguably leads to possible confusion rather than the making of connections, e.g., genmod is primarily for GLMs and the proportional odds model is not a GLM in the classical Nelder and Wedderburn (1972) sense. Also, proc phreg fits the multinomial logit model, and proc catmod with its WLS implementation adds to further potential confusion.
This article attempts to show how these deficiencies can be addressed by considering the vector generalized linear and additive model (VGLM/VGAM) framework, as implemented by the author's VGAM package for R. The main purpose of this paper is to demonstrate how the framework is very well suited to many 'classical' regression models for categorical responses, and to describe the implementation and usage of VGAM for such models. To this end the outline of this article is as follows. Section 2 summarizes the basic VGLM/VGAM framework. Section 3 centers on functions for CDA in VGAM. Given an adequate framework, some natural extensions of Section 2 are described in Section 4. Users of VGAM can benefit from Section 5, which shows how the software reflects their common theory. Some examples are given in Section 6. Section 7 contains selected topics in statistical computing that are more relevant to programmers interested in the underlying code. Section 8 discusses several utilities and extensions needed for advanced CDA modeling, and the article concludes with a discussion. This document was run using VGAM 0.7-10 (Yee 2010a) under R 2.10.0.

VGLM/VGAM overview
This section summarizes the VGLM/VGAM framework with a particular emphasis toward categorical models, since the classes encapsulate many multivariate response models in, e.g., survival analysis, extreme value analysis, quantile and expectile regression, time series, bioassay data, nonlinear least-squares models, and scores of standard and nonstandard univariate and continuous distributions. The framework is partially summarized by Table 2. More general details about VGLMs and VGAMs can be found in Yee and Hastie (2003) and Yee and Wild (1996) respectively. An informal and practical article connecting the general framework with the software is Yee (2008).

VGLMs
Suppose the observed response y is a q-dimensional vector. VGLMs are defined as a model for which the conditional distribution of Y given explanatory x is of the form

  f(y | x; B) = h(y, η_1, ..., η_M, φ)    (1)

for some known function h(·), where B = (β_1 β_2 ··· β_M) is a p × M matrix of unknown regression coefficients, and the jth linear predictor is

  η_j = η_j(x) = β_j^T x = Σ_{k=1}^p β_(j)k x_k,   j = 1, ..., M.    (2)

Here x = (x_1, ..., x_p)^T with x_1 = 1 if there is an intercept. Note that (2) means that all the parameters may be potentially modelled as functions of x. It can be seen that VGLMs are like GLMs but allow for multiple linear predictors, and they encompass models outside the small confines of the exponential family. In (1) the quantity φ is an optional scaling parameter which is included for backward compatibility with common adjustments for overdispersion, e.g., with respect to GLMs.
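To make (1)-(2) concrete, here is a minimal base-R sketch (with made-up coefficients B and covariates x, not taken from any data set in this article) of how the M = q - 1 linear predictors of the multinomial logit model map to the q category probabilities, using the last level as baseline:

```r
# Hypothetical B (p x M) and x; q = 3 levels, so M = q - 1 = 2.
p <- 3; M <- 2
B <- matrix(c(0.5, -1.0,
              0.2,  0.3,
             -0.4,  0.1), nrow = p, ncol = M, byrow = TRUE)
x <- c(1, 2.0, -1.5)                 # x_1 = 1 is the intercept
eta <- drop(t(B) %*% x)              # eta_j = beta_j' x, j = 1, ..., M
prob <- c(exp(eta), 1) / (1 + sum(exp(eta)))  # P(Y = 1), ..., P(Y = q)
sum(prob)                            # the probabilities sum to 1
```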
In general there is no relationship between q and M : it depends specifically on the model or distribution to be fitted. However, for the 'classical' categorical regression models of Table 1 we have M = q − 1 since q is the number of levels the multi-category response Y has.
The η_j of VGLMs may be applied directly to parameters of a distribution rather than just to a mean as for GLMs. A simple example is a univariate distribution with a location parameter ξ and a scale parameter σ > 0, where we may take η_1 = ξ and η_2 = log σ. In general, η_j = g_j(θ_j) for some parameter link function g_j and parameter θ_j. For example, the adjacent categories models in Table 1 are ratios of two probabilities, therefore a log link of ζ_j^R or ζ_j is the default. In VGAM there are currently over a dozen links to choose from, any of which can be assigned to any parameter, ensuring maximum flexibility. Table 5 lists some of them.

Table 2 (excerpt):

  η                          Model         Modeling function  Reference
  B^T x                      VGLM          vglm()             Yee and Hastie (2003)
  Σ_k H_k f_k^*(x_k)         VGAM          vgam()             Yee and Wild (1996)
  B_1^T x_1 + A ν            RR-VGLM       rrvglm()           Yee and Hastie (2003)
  (see reference)            Goodman's RC  grc()              Goodman (1981); see Yee and Hastie (2003)
VGLMs are estimated using iteratively reweighted least squares (IRLS), which is particularly suitable for categorical models (Green 1984). All models in this article have a log-likelihood

  ℓ = Σ_{i=1}^n w_i ℓ_i,

where the w_i are known positive prior weights. Let x_i denote the explanatory vector for the ith observation, for i = 1, ..., n. Then one can write η_i = η(x_i) = B^T x_i. In IRLS, an adjusted dependent vector z_i = η_i + W_i^{-1} d_i is regressed upon the explanatory variables at each iteration, where d_i = w_i ∂ℓ_i/∂η_i and the working weight matrices W_i are taken as the expected information matrices, giving rise to the Fisher scoring algorithm. Let X = (x_1, ..., x_n)^T be the usual n × p (LM) model matrix obtained from the formula argument of vglm(). Given z_i, W_i and X at the current IRLS iteration, a weighted multivariate regression is performed. To do this, a vector linear model (VLM) model matrix X_VLM is formed from X and the H_k (see Section 2.2). This has nM rows, and if there are no constraints, Mp columns. Then (z_1^T, ..., z_n^T)^T is regressed upon X_VLM with variance-covariance matrix diag(W_1^{-1}, ..., W_n^{-1}). This system of linear equations is converted to one large WLS fit by premultiplication by the Cholesky decompositions of the W_i.
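The IRLS scheme can be illustrated in the scalar case M = 1. The following base-R sketch (a hypothetical simulated logistic regression, not VGAM code) iterates the adjusted dependent variable z_i and working weights W_i exactly as described above, and should agree with glm()'s own IRLS fit:

```r
set.seed(1)
n <- 200
X <- cbind(1, runif(n))                      # LM model matrix
y <- rbinom(n, 1, plogis(X %*% c(-1, 2)))    # simulated binary response
beta <- c(0, 0)
for (iter in 1:25) {                         # Fisher scoring / IRLS
  eta <- drop(X %*% beta)
  mu  <- plogis(eta)
  w   <- mu * (1 - mu)                       # working (expected) weights W_i
  z   <- eta + (y - mu) / w                  # adjusted dependent variable z_i
  beta <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
}
# compare with glm(): the difference should be tiny
max(abs(beta - coef(glm(y ~ X[, 2], family = binomial))))
```

For logistic regression the observed and expected information coincide, so this is simultaneously Newton-Raphson and Fisher scoring.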
Fisher scoring usually has good numerical stability because the W i are positive-definite over a larger region of parameter space than Newton-Raphson. For the categorical models in this article the expected information matrices are simpler than the observed information matrices, and are easily derived, therefore all the families in Table 1 implement Fisher scoring.

VGAMs and constraint matrices
VGAMs provide additive-model extensions to VGLMs; that is, (2) is generalized to

  η_j(x) = β_(j)1 + Σ_{k=2}^p f_(j)k(x_k),   j = 1, ..., M,

a sum of smooth functions of the individual covariates, just as with ordinary GAMs (Hastie and Tibshirani 1990). The component functions, collected as f_k = (f_(1)k(x_k), ..., f_(M)k(x_k))^T, are centered for uniqueness and are estimated simultaneously using vector smoothers. VGAMs are thus a visual, data-driven method well suited to exploring data, and they retain the simplicity of interpretation that GAMs possess.
An important concept, especially for CDA, is the idea of 'constraints-on-the-functions'. In practice we often wish to constrain the effect of a covariate to be the same for some of the η_j and to have no effect for others. We shall see below that this constraints idea is important for several categorical models because of a popular parallelism assumption. As a specific example, for M = 2 we may wish to take

  η_1 = β_(1)1 + f_(1)2(x_2) + f_(1)3(x_3),
  η_2 = β_(2)1 + f_(1)2(x_2),

so that f_(1)2 ≡ f_(2)2 and f_(2)3 ≡ 0. In terms of (6) below, this corresponds to H_1 = I_2, H_2 = (1, 1)^T and H_3 = (1, 0)^T. For VGAMs, we can represent these models using

  η(x) = β_(1)* + Σ_{k=2}^p H_k f_k^*(x_k),    (6)

where H_1, H_2, ..., H_p are known full-column-rank constraint matrices, f_k^* is a vector containing a possibly reduced set of component functions and β_(1)* is a vector of unknown intercepts. With no constraints at all, H_1 = H_2 = ··· = H_p = I_M and β_(1)* = β_(1). Like the f_k, the f_k^* are centered for uniqueness. For VGLMs, the f_k are linear so that

  B^T = (H_1 β_(1)* | H_2 β_(2)* | ··· | H_p β_(p)*)    (7)

for some vectors β_(1)*, ..., β_(p)*. The X_VLM matrix is constructed from X and the H_k using Kronecker product operations. For example, with trivial constraints, X_VLM = X ⊗ I_M. More generally,

  X_VLM = ((X e_1) ⊗ H_1 | (X e_2) ⊗ H_2 | ··· | (X e_p) ⊗ H_p)    (8)

(e_k is a vector of zeros except for a one in the kth position), so that X_VLM is (nM) × p* where p* = Σ_{k=1}^p ncol(H_k) is the total number of columns of all the constraint matrices. Note that X_VLM and X can be obtained by model.matrix(vglmObject, type = "vlm") and model.matrix(vglmObject, type = "lm") respectively. Equation (7) focuses on the rows of B whereas (4) is on the columns.
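A small base-R sketch of this Kronecker construction (hypothetical X and H_k; this mimics, but does not call, VGAM's internal code). With trivial constraints the result reduces to X ⊗ I_M, and a parallelism constraint H_k = 1_M contributes a single column for that covariate:

```r
# Build X_VLM by column-binding (X e_k) %x% H_k over k = 1, ..., p.
build_X_vlm <- function(X, Hlist) {
  do.call(cbind, lapply(seq_along(Hlist), function(k)
    kronecker(X[, k, drop = FALSE], Hlist[[k]])))
}
M <- 2
X <- cbind(1, c(0.5, 1.2, -0.3))              # n = 3, p = 2
H.trivial  <- list(diag(M), diag(M))          # H_1 = H_2 = I_M
H.parallel <- list(diag(M), matrix(1, M, 1))  # parallelism for x_2: H_2 = 1_M
X.vlm1 <- build_X_vlm(X, H.trivial)           # 6 x 4, equals X %x% I_M
X.vlm2 <- build_X_vlm(X, H.parallel)          # 6 x 3: p* = 2 + 1
```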
VGAMs are estimated by applying a modified vector backfitting algorithm (cf. Buja et al. 1989) to the z i .

Vector splines and penalized likelihood
If (6) is estimated using a vector spline (a natural extension of the cubic smoothing spline to vector responses) then it can be shown that the resulting solution maximizes a penalized likelihood; some details are sketched in Yee and Stephenson (2007). In fact, knot selection for vector splines follows the same idea as O-splines (see Wand and Ormerod 2008) in order to lower the computational cost.

VGAM family functions
This section summarizes and comments on the VGAM family functions of Table 1 for a categorical response variable taking values Y = 1, 2, ..., M + 1. In its most basic invocation, the usage entails a trivial change compared to glm(): use vglm() instead and assign the family argument a VGAM family function. The use of a VGAM family function to fit a specific model is far simpler than having a different modeling function for each model. Options specific to that model appear as arguments of that VGAM family function.
While writing cratio() it was found that various authors defined the quantity "continuation ratio" differently, therefore it became necessary to define a "stopping ratio". Table 1 defines these quantities for VGAM.
The multinomial logit model is usually described by choosing the first or last level of the factor to be baseline. VGAM chooses the last level (Table 1) by default, however that can be changed to any other level by use of the refLevel argument.
If the proportional odds assumption is inadequate then one strategy is to try a different link function (see Section 5.2 for a selection). Another is to add extra terms, such as interaction terms, to the linear predictor (available in the S language; Chambers and Hastie 1993). Another is to fit the so-called partial proportional odds model (Peterson and Harrell 1990), which VGAM can fit via constraint matrices.
In the terminology of Agresti (2002), cumulative() fits the class of cumulative link models, e.g., cumulative(link = probit) is a cumulative probit model. For cumulative() it was difficult to decide whether parallel = TRUE or parallel = FALSE should be the default. In fact, the latter is (for now at least). Users need to set cumulative(parallel = TRUE) explicitly to fit a proportional odds model; hopefully this will alert them to the fact that they are making the proportional odds assumption and prompt them to check its validity (e.g., through a deviance or likelihood ratio test; Peterson 1990). However, the default means numerical problems can occur with far greater likelihood. Thus there is tension between the two options. As a compromise there is now a VGAM family function called propodds(), so that propodds(reverse = TRUE) is equivalent to cumulative(parallel = TRUE, reverse = reverse, link = "logit").
By the way, note that arguments such as parallel can handle a slightly more complex syntax. A call such as parallel = TRUE ~ x2 + x5 - 1 means the parallelism assumption is applied to x2 and x5 only. This might be equivalent to something like parallel = FALSE ~ x3 + x4, i.e., to the remaining explanatory variables.

Other models
Given the VGLM/VGAM framework of Section 2 it is found that natural extensions are readily proposed in several directions. This section describes some such extensions.

Reduced-rank VGLMs
Consider a multinomial logit model where p and M are both large. A (not-too-convincing) example might be the data frame vowel.test in the package ElemStatLearn (see Hastie et al. 1994). The vowel recognition data set involves q = 11 symbols produced from 8 speakers with 6 replications of each. The training data comprises 10 input features (not including the intercept) based on digitized utterances. A multinomial logit model fitted to these data would have B comprising p × (q − 1) = 110 regression coefficients for n = 8 × 6 × 11 = 528 observations. The ratio of n to the number of parameters is small, and it would be good to introduce some parsimony into the model.
A simple and elegant solution is to represent B by its reduced-rank approximation. To do this, partition x into (x_1, x_2) and B = (B_1 B_2) so that the reduced-rank regression is applied to x_2. In general, B is a dense matrix of full rank, i.e., rank min(M, p), and since there are M × p regression coefficients to estimate this is 'too' large for some models and/or data sets. If we approximate B_2 by

  B_2 = C A^T    (9)

(a reduced-rank regression) and keep the rank R low, then the number of regression coefficients can be cut down dramatically. Here, C and A are p_2 × R and M × R respectively, and usually they are 'thin'. If R = 2 then the results may be biplotted (biplot() in VGAM).
More generally, the class of reduced-rank VGLMs (RR-VGLMs) is simply a VGLM where B 2 is expressed as a product of two thin estimated matrices (Table 2). Indeed, Yee and Hastie (2003) show that RR-VGLMs are VGLMs with constraint matrices that are unknown and estimated. Computationally, this is done using an alternating method: in (9) estimate A given the current estimate of C, and then estimate C given the current estimate of A. This alternating algorithm is repeated until convergence within each IRLS iteration.
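A base-R sketch of the rank-R factorization, using the SVD with a random matrix standing in for B_2 of the same size as in the vowel example (p_2 = M = 10), illustrates the drop in parameter count:

```r
set.seed(2)
p2 <- 10; M <- 10; R <- 2
B2 <- matrix(rnorm(p2 * M), p2, M)   # hypothetical dense coefficient block
s  <- svd(B2)
C  <- s$u[, 1:R] %*% diag(s$d[1:R])  # p_2 x R
A  <- s$v[, 1:R]                     # M x R
B2.approx <- C %*% t(A)              # best rank-R approximation to B2
# parameter counts: full versus reduced rank (after normalization)
c(full = p2 * M, reduced = R * (p2 + M - R))
```

The alternating algorithm in rrvglm() estimates C and A within the likelihood framework rather than via a least-squares SVD, but the dimension-reduction idea is the same.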
Incidentally, special cases of RR-VGLMs have appeared in the literature. For example, an RR-multinomial logit model is known as the stereotype model (Anderson 1984). Another is Goodman (1981)'s RC model (see Section 4.2), which is a reduced-rank multivariate Poisson model. Note that the parallelism assumption of the proportional odds model (McCullagh and Nelder 1989) can be thought of as a type of reduced-rank regression where the constraint matrices are thin (1_M, actually) and known.
The modeling function rrvglm() should work with any VGAM family function compatible with vglm(). Of course, its applicability should be restricted to models where a reduced-rank regression of B 2 makes sense.

Goodman's R × C association model
Let Y = [(y_ij)] be an n × M matrix of counts. Section 4.2 of Yee and Hastie (2003) shows that Goodman's RC(R) association model (Goodman 1981) fits within the VGLM framework by setting up the appropriate indicator variables, structural zeros and constraint matrices. Goodman's model fits a reduced-rank type model to Y by firstly assuming that Y_ij has a Poisson distribution, and that

  log μ_ij = μ + α_i + γ_j + Σ_{r=1}^R c_ir a_jr,    (10)

where μ_ij = E(Y_ij) is the mean of the i-j cell, and the rank R satisfies R < min(n, M).
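The rank-R structure of (10) can be checked numerically in base R (all quantities below are made up for illustration; this is not grc() code). The row-column interaction part of log μ_ij, after removing the main effects, is exactly a rank-R matrix:

```r
set.seed(3)
n.rows <- 5; n.cols <- 4; R <- 1
mu0   <- 2
alpha <- c(0, rnorm(n.rows - 1))             # row effects, first level baseline
gamma <- c(0, rnorm(n.cols - 1))             # column effects
cc <- matrix(rnorm(n.rows * R), n.rows, R)   # 'row scores' c_ir
a  <- matrix(rnorm(n.cols * R), n.cols, R)   # 'column scores' a_jr
log.mu <- mu0 + outer(alpha, gamma, "+") + cc %*% t(a)
rc.inter <- log.mu - mu0 - outer(alpha, gamma, "+")  # interaction part
qr(rc.inter)$rank                            # equals R
```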
The modeling function grc() should work on any two-way table Y of counts generated by (10) provided the number of 0's is not too large. Its usage is quite simple, e.g., grc(Ymatrix, Rank = 2) fits a rank-2 model to a matrix of counts. By default a Rank = 1 model is fitted.

Bradley-Terry models
Consider an experiment that consists of n_ij judges who compare pairs of items T_i, i = 1, ..., M + 1. They express their preferences between T_i and T_j. Let N = Σ_{i<j} n_ij be the total number of pairwise comparisons, and assume independence for ratings of the same pair by different judges and for ratings of different pairs by the same judge. Let π_i be the worth of item T_i, where "T_i > T_j" means i is preferred over j. Suppose that π_i > 0. Let Y_ij be the number of times that T_i is preferred over T_j in the n_ij comparisons of the pair. Then Y_ij ∼ Bin(n_ij, p_{i/ij}), where p_{i/ij} = π_i / (π_i + π_j) is the probability that T_i is preferred over T_j. This is a Bradley-Terry model (without ties), and the VGAM family function is brat().
Maximum likelihood estimation of the parameters π_1, ..., π_{M+1} involves maximizing

  Π_{i<j} choose(n_ij, y_ij) p_{i/ij}^{y_ij} (1 − p_{i/ij})^{n_ij − y_ij}.

By default, π_{M+1} ≡ 1 is used for identifiability; however, this can be changed very easily. Note that one can define linear predictors η_ij of the form

  η_ij = logit p_{i/ij} = log(π_i / π_j).

As well as having many applications in the field of preferences, the Bradley-Terry model has many uses in modeling 'contests' between teams i and j, where only one of the teams can win in each contest (ties are not allowed under the classical model). The packaging function Brat() can be used to convert a square matrix into one that has more columns, to serve as input to vglm(). Consider, for example, journal citation data where a citation of article B by article A is a win for article B and a loss for article A. On a specific data set,

R> journal <- c("Biometrika", "Comm.Statist", "JASA", "JRSS-B")
R> squaremat <- matrix(c(NA, 33, 320, 284, 730, NA, 813, 276,
+                        498, 68, NA, 325, 221, 17, 142, NA), 4, 4)
R> dimnames(squaremat) <- list(winner = journal, loser = journal)

then Brat(squaremat) returns a 1 × 12 matrix.
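As a cross-check, the Bradley-Terry worths for the journal data can also be obtained in base R by maximizing the likelihood above directly with optim() (a stand-in for brat(), fixing π_4 ≡ 1 for identifiability):

```r
journal <- c("Biometrika", "Comm.Statist", "JASA", "JRSS-B")
Y <- matrix(c(NA, 33, 320, 284, 730, NA, 813, 276,
              498, 68, NA, 325, 221, 17, 142, NA), 4, 4,
            dimnames = list(winner = journal, loser = journal))
negloglik <- function(logpw) {        # worths parameterized on the log scale
  pw <- c(exp(logpw), 1)              # pi_4 fixed at 1
  -sum(sapply(1:4, function(i)
    sum(Y[i, -i] * log(pw[i] / (pw[i] + pw[-i])))))
}
fit <- optim(c(0, 0, 0), negloglik, method = "BFGS")
worths <- c(exp(fit$par), 1)
names(worths) <- journal
round(worths, 3)
```

The binomial coefficients in the likelihood do not involve the π_i and so can be dropped from the objective function.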

Bradley-Terry model with ties
The VGAM family function bratt() implements a Bradley-Terry model with ties (no preference), e.g., where both T_i and T_j are equally good or bad. Here we assume

  P(T_i > T_j) = π_i / (π_i + π_j + π_0),   P(T_i = T_j) = π_0 / (π_i + π_j + π_0),

with π_0 > 0 as an extra parameter. It has η = (log π_1, ..., log π_{M−1}, log π_0)^T by default, where there are M competitors and π_M ≡ 1. Like brat(), one can choose a different reference group and reference value.
Other R packages for the Bradley-Terry model include BradleyTerry (with and without ties; Firth 2005, 2008) and prefmod (Hatzinger 2009).

Genetic models
There are quite a number of population genetic models based on the multinomial distribution, e.g., Weir (1996), Lange (2002).

Table 4: Probability table for the ABO blood group system. Note that p and q are the parameters and r = 1 − p − q.
For example the ABO blood group system has two independent parameters p and q, say.
Here, the blood groups A, B and O form six possible combinations (genotypes): AA, AO, BB, BO, AB, OO (see Table 4). A and B are dominant over blood type O. Let p, q and r be the probabilities for A, B and O respectively (so that p + q + r = 1) for a given population. With observed phenotype counts n_A, n_B, n_AB and n_O, the log-likelihood function is

  ℓ(p, q) = n_A log(p^2 + 2pr) + n_B log(q^2 + 2qr) + n_AB log(2pq) + 2 n_O log(r),

where r = 1 − p − q, p ∈ (0, 1), q ∈ (0, 1), p + q < 1. We let η = (g(p), g(r))^T where g is the link function. Any g from Table 5 appropriate for a parameter θ ∈ (0, 1) will do.
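A base-R sketch of the same likelihood, maximized directly with optim() over a parameterization that enforces p, q, r ∈ (0, 1) (the phenotype counts below are made up for illustration; this is not the VGAM family function):

```r
counts <- c(A = 725, B = 258, AB = 72, O = 1073)  # hypothetical phenotype counts
negloglik <- function(theta) {
  p <- plogis(theta[1])                # p in (0, 1)
  q <- (1 - p) * plogis(theta[2])      # q in (0, 1 - p), so r > 0
  r <- 1 - p - q
  -sum(counts * log(c(p^2 + 2*p*r, q^2 + 2*q*r, 2*p*q, r^2)))
}
fit <- optim(c(0, 0), negloglik, method = "BFGS")
p.hat <- plogis(fit$par[1])
q.hat <- (1 - p.hat) * plogis(fit$par[2])
c(p = p.hat, q = q.hat, r = 1 - p.hat - q.hat)
```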
A toy example takes p = p_A and q = p_B. The function Coef(), which applies only to intercept-only models, applies the inverse link function g_j^{-1} to η_j to give θ_j.

Three main distributions
Agresti (2002) discusses three main distributions for categorical variables: binomial, multinomial, and Poisson (Thompson 2009). All are well represented in the VGAM package, accompanied by variant forms. For example, there is a VGAM family function named mbinomial() which implements a matched binomial (suitable for matched case-control studies); there are families for Poisson ordination (useful in ecology for multi-species environmental data); negative binomial families; positive, zero-altered and zero-inflated variants; and the bivariate odds ratio model (binom2.or(); see Section 6.5.6 of McCullagh and Nelder 1989). The latter has an exchangeable argument to allow for an exchangeable error structure: for data (Y_1, Y_2, x), logit P(Y_j = 1 | x) = η_j for j = 1, 2, and log ψ = η_3 where ψ is the odds ratio, so that exchangeability means η_1 = η_2. Here, binom2.or() has zero = 3 by default, meaning ψ is modelled as intercept-only (in general, zero may be assigned an integer vector such that the value j means η_j = β_(j)1, i.e., the jth linear/additive predictor is intercept-only). See the online help for all of these models.
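For the bivariate odds ratio model, the joint cell probabilities are determined by the two marginal probabilities and ψ. One algebraic route is the Plackett-style construction sketched below in base R (an illustration of the model's parameterization, not code from binom2.or()):

```r
# Given marginals p1 = P(Y1 = 1), p2 = P(Y2 = 1) and odds ratio psi,
# solve psi = p11 p00 / (p10 p01) for the joint probabilities.
joint.from.or <- function(p1, p2, psi) {
  p11 <- if (abs(psi - 1) < 1e-12) p1 * p2 else {
    b <- 1 + (p1 + p2) * (psi - 1)
    (b - sqrt(b^2 - 4 * psi * (psi - 1) * p1 * p2)) / (2 * (psi - 1))
  }
  c(p11 = p11, p10 = p1 - p11, p01 = p2 - p11, p00 = 1 - p1 - p2 + p11)
}
pr <- joint.from.or(p1 = 0.4, p2 = 0.3, psi = 3)
# the implied odds ratio recovers psi
unname(pr["p11"] * pr["p00"] / (pr["p10"] * pr["p01"]))
```

Note that ψ = 1 corresponds to independence, which is handled as a separate branch to avoid division by zero.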

Some user-oriented topics
Making the most of VGAM requires an understanding of the general VGLM/VGAM framework described in Section 2. In this section we connect elements of that framework with the software. Before doing so it is noted that a fitted VGAM categorical model has access to the usual generic functions, e.g., coef(), fitted(), summary() and vcov(). The methods function for the extractor function coef() has an argument matrix which, when set TRUE, returns B (see Equation 1) as a p × M matrix; this is particularly useful for confirming that a fit has made a parallelism assumption.

Common arguments
The structure of the unified framework given in Section 2 appears clearly through the pool of common arguments shared by the VGAM family functions in Table 1. In particular, reverse and parallel are prominent with CDA. These are merely convenient shortcuts for the argument constraints, which accepts a named list of constraint matrices H k . For example, setting cumulative(parallel = TRUE) would constrain the coefficients β (j)k in (2) to be equal for all j = 1, . . . , M , each separately for k = 2, . . . , p. That is, H k = 1 M . The argument reverse determines the 'direction' of the parameter or quantity.
Another argument not so much used with CDA is zero; this accepts a vector specifying which η j is to be modelled as an intercept-only; assigning a NULL means none.

Link functions
Almost all VGAM family functions (one notable exception is multinomial()) allow, in theory, for any link function to be assigned to each η j . This provides maximum capability. If so then there is an extra argument to pass in any known parameter associated with the link function. For example, link = "logoff", earg = list(offset = 1) signifies a log link with a unit offset: η j = log(θ j + 1) for some parameter θ j (> −1). The name earg stands for "extra argument". Table 5 lists some links relevant to categorical data. While the default gives a reasonable first choice, users are encouraged to try different links. For example, fitting a binary regression model (binomialff()) to the coal miners data set coalminers with respect to the response wheeze gives a nonsignificant regression coefficient for β (1)3 with probit analysis but not with a logit link when η = β (1)1 + β (1)2 age + β (1)3 age 2 . Developers and serious users are encouraged to write and use new link functions compatible with VGAM.
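A base-R sketch of a few such links and their inverses, including a log link with a unit offset in the spirit of link = "logoff" with earg = list(offset = 1) (the function names below are local stand-ins, not VGAM's own link functions):

```r
logit   <- function(theta) log(theta / (1 - theta))
probit  <- qnorm
cloglog <- function(theta) log(-log(1 - theta))
logoff  <- function(theta, offset = 1) log(theta + offset)  # unit offset
theta <- 0.37
etas <- c(logit = logit(theta), probit = probit(theta),
          cloglog = cloglog(theta), logoff = logoff(theta))
# each inverse link recovers theta
invs <- c(plogis(etas["logit"]), pnorm(etas["probit"]),
          1 - exp(-exp(etas["cloglog"])), exp(etas["logoff"]) - 1)
round(unname(invs), 10)
```

Any such g mapping the parameter space one-to-one onto the real line can serve as a link for a θ_j ∈ (0, 1); the logoff example shows how a known constant in the link is passed separately from the parameter being estimated.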

Examples
This section illustrates CDA modeling on three data sets in order to give a flavour of what is available in the package.

2008 World Fly Fishing Championships
The World Fly Fishing Championships (WFFC) is a prestigious catch-and-release competition held annually. In 2008 it was held in New Zealand during the month of March. The data were released and appear in VGAM as the data frames wffc, wffc.nc, wffc.indiv and wffc.teams. Details about the competition are found in the online help, as well as in Yee (2010b).
Briefly, we will model the abundance of fish caught during each three-hour session amongst the 90 or so competitors (from about 19 countries) who fished all their sessions. There were five sectors (locations) labelled I-V for the Whanganui River, Lake Otamangakau, Lake Rotoaira, Waihou River and Waimakariri River, respectively. The sessions were sequentially labelled 1-6 where odd and even numbers denote morning and afternoon respectively. There were three consecutive days of fishing during which each sector experienced a rest session.
Yee (2010b) fitted Poisson and negative binomial regressions to the numbers caught at each competitor-session combination. The negative binomial regression had an intercept-only for its index parameter k and Var(Y) = μ(1 + μ/k). Both models had the log-linear relationship

  log μ = β_(1)1 + α_s + β_a + γ_d + δ_c,    (14)

where μ = E(Y) is the mean number caught, β_(1)1 is the intercept, the α_s are the sector effects for s = 1, ..., 5 sectors, the δ_c are the "competitor effects" for c = 1, ..., 91 competitors (8 competitors who did not fish all 5 sessions were excluded), the β_a are the morning (a = 1) and afternoon (a = 2) effects, and the γ_d are the day effects for d = 1, 2, 3. Recall for factors that the first level is baseline, e.g., α_1 = β_1 = 0, etc. Not used here are effects b = 1, ..., 19 for the beat/boat that was fished/used (e.g., fixed locations on the river). We will fit a proportional odds model with essentially the RHS of (14) as the linear predictor.
Here is a peek at the data frame used. Each row of wffc.nc is the number of captures by each sector-session-beat combination. We first process the data a little: create the regressor variables and restrict the analysis to anglers who fished all their sessions. Here, "nc" stands for numbers caught, and "f" stands for factor.
R> fit.pom <- vglm(ordnum ~ fsector + mornaft + fday + finame,
+                  family = cumulative(parallel = TRUE, reverse = TRUE),
+                  data = fnc)

Here, we set reverse = TRUE so that the coefficients have the same direction as a logistic regression: if a regression coefficient is positive then an increasing value of an explanatory variable is associated with an increasing value of the response. One could have used family = propodds instead.
Before interpreting some output let's check that the input was alright. The checking indicates no problems with the input. Now let's look at some output. Note that the Whanganui River, Mornings and Day 1 are the baseline levels of the factors. Also, the variable mornaft is 0 for morning and 1 for afternoons. Likewise, the factor fday has values 1, 2 and 3.
R> head(coef(fit.pom, matrix = TRUE), 10)

Not surprisingly, these results agree with the Poisson and negative binomial regressions (reported in Yee 2010b). The most glaring qualitative results are as follows. We use the rough rule of thumb that if the absolute value of the t statistic is greater than 2 then it is 'statistically significant'.
The two lakes were clearly less productive than the rivers. However, neither of the other two rivers were significantly different from the Whanganui River.
There is a noticeable day effect: the second day is not significantly different from the opening day, but the third day is. The decreasing values of the fitted coefficients show there is an increasing catch reduction (fish depletion, if it were catch-and-keep) as the competition progressed. Replacing fday by a variable day and entering it linearly gave a t statistic of −4.0: there is a significant decrease in catch over time.
Mornings were more productive than afternoons. The p value for this would be close to 5%. This result is in line with the day effect: fishing often results in a 'hammering' effect over time on fish populations, especially in small streams. Since the morning and afternoon sessions were fixed at 9.00am-12.00pm and 2.30-5.30pm daily, there were only 2.5 hours for the fish to recover until the next angler arrived. Calling constraints(fit.pom) shows some of the constraint matrices: H_1 = I_2 and H_2 = H_5 = H_6 = 1_2 (see Equations 6-7).

Marital status data
We fit a nonparametric multinomial logit model to data collected from a self-administered questionnaire administered in a large New Zealand workforce observational study conducted during 1992-3. The data were augmented by a second study consisting of retirees. For homogeneity, this analysis is restricted to a subset of 6053 European males with no missing values. The ages ranged between 16 and 88 years. The data can be considered a reasonable representation of the white male New Zealand population in the early 1990s, and are detailed in MacMahon et al. (1995) and Yee and Wild (1996). We are interested in exploring how Y = marital status varies as a function of x 2 = age. The nominal response Y has four levels; in sorted order, they are divorced or separated, married or partnered, single and widower. We will write these levels as Y = 1, 2, 3, 4, respectively, and will choose the married/partnered (second level) as the reference group because the other levels emanate directly from it.
It may be seen that the ±2 standard error bands about the Widowed group are particularly wide at young ages because of a paucity of data, and likewise at old ages amongst the Singles. The fitted f_(s)2(x_2) appear as one would expect. The log relative risk of being single relative to being married/partnered drops sharply from ages 16 to 40. The fitted function for the Widowed group increases with age and looks reasonably linear. The first derivatives of the fitted functions can be plotted:

R> plot(fit.ms, deriv = 1, lcol = mycol, scale = 0.3)

This results in Figure 2. Once again the y-axis scales are commensurate.

Stereotype model
We reproduce some of the analyses of Anderson (1984) regarding the progress of 101 patients with back pain, using the data frame backPain from gnm (Turner and Firth 2007, 2009). The three prognostic variables are length of previous attack (x_1 = 1, 2), pain change (x_2 = 1, 2, 3) and lordosis (x_3 = 1, 2). Like him, we treat these as numerical and standardize and negate them, giving the variables sx1, sx2 and sx3. A peek at the data is given by

R> head(backPain, 4)

A rank-1 stereotype model was fitted (cf. (9) and Table 2) with

R> bp.rrmlm1 <- rrvglm(pain ~ sx1 + sx2 + sx3, multinomial, backPain)

and its estimates agree with his Table 6. Here, what is known as "corner constraints" is used (the (1, 1) element of A ≡ 1), and by default only the intercepts are not subject to any reduced-rank regression. The maximized log-likelihood from logLik(bp.rrmlm1) is −151.55. The standard errors of each parameter can be obtained by summary(bp.rrmlm1). The negative elements of C imply the latent variable ν decreases in value with increasing sx1, sx2 and sx3. The elements of A tend to decrease, suggesting patients get worse as ν increases, i.e., get better as sx1, sx2 and sx3 increase.
A rank-2 model fitted with a different normalization

R> bp.rrmlm2 <- rrvglm(pain ~ sx1 + sx2 + sx3, multinomial, backPain,
+                      Rank = 2, Corner = FALSE, Uncor = TRUE)

produces uncorrelated ν_i = C^T x_2i. In fact var(lv(bp.rrmlm2)) equals I_2 so that the latent variables are also scaled to have unit variance. The fit was biplotted (rows of C plotted as arrows; rows of A plotted as labels) using

R> biplot(bp.rrmlm2, Acol = "blue", Ccol = "darkgreen", scores = TRUE,
+         xlim = c(-4.5, 2.2), ylim = c(-2.2, 2.2), chull = TRUE,
+         clty = 2, ccol = "blue")

to give Figure 5. It is interpreted via inner products due to (9). The different normalization means that the interpretation of ν_1 and ν_2 has changed, e.g., increasing sx1, sx2 and sx3 results in increasing ν_1, and patients improve more. Many of the latent variable points ν_i are coincident due to the discrete nature of the x_i. The rows of A are centered on the blue labels (rather cluttered unfortunately) and do not seem to vary much as a function of ν_2. In fact this is confirmed by Anderson (1984), who showed a rank-1 model is to be preferred.
This example demonstrates the ability to obtain a low-dimensional view of higher dimensional data. The package's website has additional documentation, including more detailed examples of Goodman's RC and stereotype models.

Some implementation details
This section describes some implementation details of VGAM which will be more of interest to the developer than to the casual user.

Common code
It is good programming practice to write reusable code where possible. All the VGAM family functions in Table 1 process the response in the same way because the same segment of code is executed. This offers a degree of uniformity in terms of how input is handled, and also aids software maintenance (Altman and Jackman 2010 enumerate good programming techniques and references). Likewise, the default initial values are computed in the same manner, based on sample proportions of each level of Y.

Matrix-band format of wz
The working weight matrices W_i may become large for categorical regression models. In general, we have to evaluate W_i for i = 1, . . . , n, and naively this could be held in an array of dimension c(M, M, n). However, since the W_i are symmetric and positive-definite, it suffices to store only the upper or lower half of each matrix.
The variable wz in vglm.fit() stores the working weight matrices W_i in a special format called the matrix-band format. This format comprises an n × M* matrix where

M* = hbw (2M − hbw + 1) / 2

is the number of columns. Here, hbw refers to the half-bandwidth of the matrix, an integer between 1 and M inclusive. A diagonal matrix has unit half-bandwidth, a tridiagonal matrix has half-bandwidth 2, etc.
Suppose M = 4. Then wz will have up to M* = 10 columns enumerating the unique elements of W_i as follows, where the (j, k) entry gives the column of wz in which (W_i)_jk is stored:

        [  1   5   8  10 ]
  W_i = [  5   2   6   9 ]
        [  8   6   3   7 ]
        [ 10   9   7   4 ]

That is, the order is firstly the diagonal, then the band above that, followed by the second band above the diagonal, etc. Why is such a format adopted? For this example, if W_i is diagonal then only the first 4 columns of wz are needed. If W_i is tridiagonal then only the first 7 columns are needed. If W_i is banded then wz need not have all M(M + 1)/2 columns; only M* columns suffice, and the remaining elements of W_i are implicitly zero. As well as reducing the size of wz itself in most cases, the matrix-band format often makes the computation of wz very simple and efficient. Furthermore, the Cholesky decomposition of a banded matrix is itself banded. A final reason is that sometimes we want to input W_i into VGAM: if wz were an M × M × n array then vglm(..., weights = wz) would result in an error, whereas an n × M* matrix wz works.
To facilitate the use of the matrix-band format, a few auxiliary functions have been written. In particular, there is iam(), which gives the indices for an array-to-matrix mapping (the actual code is slightly more complicated). In general, VGAM family functions can be remarkably compact, e.g., acat(), cratio() and multinomial() are each less than 120 lines of code.
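As an illustration of this indexing, the column position of element (j, k) under the matrix-band format can be computed in a few lines of base R. This is only a sketch mimicking what iam() provides, not VGAM's actual implementation, and the name band_index() is hypothetical:

```r
# Hypothetical helper: the column of wz that stores element (j, k) of a
# symmetric M x M working weight matrix under the matrix-band format:
# the diagonal first, then successive bands above the diagonal.
band_index <- function(j, k, M) {
  d <- abs(k - j)                              # band number (0 = diagonal)
  # columns already used by bands 0, ..., d-1 (band b holds M - b entries)
  offset <- if (d == 0) 0 else sum(M - seq_len(d) + 1)
  offset + min(j, k)
}

# Reproduce the M = 4 example: the diagonal occupies columns 1-4,
# the first superdiagonal columns 5-7, and so on up to column 10.
M <- 4
idx <- outer(1:M, 1:M, Vectorize(function(j, k) band_index(j, k, M)))
```

A diagonal W_i then touches only columns 1 to M, and a tridiagonal W_i only columns 1 to 2M − 1, matching the counts given above.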

Extensions and utilities
This section describes some useful utilities/extensions of the above.

Marginal effects
Models such as the multinomial logit and cumulative link models model the posterior probability p_j = P(Y = j|x) directly. In some applications, knowing the derivative of p_j with respect to some of the x_k is useful; in fact, often just knowing the sign is important. The function margeff() computes the derivatives and returns them as a p × (M + 1) × n array. For the multinomial logit model it is easy to show that

∂p_j/∂x = p_j (β_j − Σ_{s=1}^{M+1} p_s β_s),

with β_{M+1} = 0 for the baseline group, while for cumulative(reverse = FALSE) we have p_j = γ_j − γ_{j−1} = h(η_j) − h(η_{j−1}), where h = g^{−1} is the inverse of the link function, so that ∂p_j/∂x = h′(η_j) β_j − h′(η_{j−1}) β_{j−1}. The function margeff() returns an array with these derivatives and should handle any value of reverse and parallel.
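For the multinomial logit case, the formula above is easy to verify numerically in base R (the coefficient values below are made up for illustration; this is not the margeff() implementation):

```r
# Multinomial logit with M + 1 = 3 levels and p = 2 regressors (incl. intercept).
# Columns of B are beta_1, beta_2, beta_3, with the baseline beta_3 = 0.
B <- cbind(c(0.5, -1.0), c(-0.3, 0.8), c(0, 0))   # made-up coefficients
x <- c(1, 0.5)                                    # x_1 = 1 is the intercept

eta <- drop(t(B) %*% x)                  # linear predictors; eta[3] = 0
p <- exp(eta) / sum(exp(eta))            # fitted probabilities

# d p_j / d x = p_j * (beta_j - sum_s p_s beta_s):
dpdx <- B %*% diag(p) - (B %*% p) %*% t(p)   # p x (M + 1) matrix

rowSums(dpdx)   # each row sums to zero since the p_j sum to one
```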

The xij argument
There are many models, including those for categorical data, where the value of an explanatory variable x_k differs depending on the linear/additive predictor η_j. Here is a well-known example from consumer choice modeling. Suppose an econometrician is interested in people's choice of transport for travelling to work and that there are four choices: Y = 1 for "bus", Y = 2 "train", Y = 3 "car" and Y = 4 "walking". Assume that people only choose one means to go to work. Suppose there are three covariates: X_2 = cost, X_3 = journey time, and X_4 = distance. Of the covariates only X_4 (and the intercept X_1) is the same for all transport choices; the cost and journey time differ according to the means chosen. Suppose a random sample of n people is collected from some population, and that each person has access to all these transport modes. For such data, a natural regression model would be a multinomial logit model with M = 3: for j = 1, . . . , M, we have

η_j = log(P(Y = j)/P(Y = M + 1)) = β*_(j)1 + β*_2 x_i2j + β*_3 x_i3j + β*_4 x_i4,   (19)

where, for the ith person, x_i2j is the cost for the jth transport means, and x_i3j is the journey time of the jth transport means. The distance to get to work is x_i4; it has the same value regardless of the transport means.
Equation (19) implies H_1 = I_3 and H_2 = H_3 = H_4 = 1_3. Note also that if the last response category is used as the baseline or reference group (the default of multinomial()) then x_ik,M+1 can be subtracted from x_ikj for j = 1, . . . , M; this is the natural way x_ik,M+1 enters into the model.
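These constraint matrices are simple to write out in base R; the coefficient values below are made up, purely to show how each H_k maps a (possibly shorter) β*_k to the M = 3 linear predictors:

```r
M <- 3
H1 <- diag(M)                       # intercepts: one coefficient per eta_j
H2 <- H3 <- H4 <- matrix(1, M, 1)   # cost, time, distance: common slope

b1 <- c(0.2, -0.1, 0.4)             # made-up intercept estimates
b2 <- -0.8; b3 <- -0.5; b4 <- 0.1   # made-up common slopes

# Row j of Beta holds the coefficients of eta_j
Beta <- cbind(H1 %*% b1, H2 %*% b2, H3 %*% b3, H4 %*% b4)
```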
Recall from (2) that we had

η_j(x_i) = β_j^T x_i = Σ_{k=1}^p β_(j)k x_ik.   (20)

Importantly, this can be generalized to

η_j(x_ij) = β_j^T x_ij = Σ_{k=1}^p β_(j)k x_ikj,   (21)

or, writing this another way (as a mixture or hybrid),

η_j(x*_i, x*_ij) = β*_j^T x*_i + β**_j^T x*_ij.   (22)

Often β**_j = β**, say. In (22) the variables in x*_i are common to all η_j, and the variables in x*_ij have different values for differing η_j. This allows for covariate values that are specific to each η_j, a facility which is very important in many applications.
The use of the xij argument with the VGAM family function multinomial() has very important applications in economics. In that field the term "multinomial logit model" includes a variety of models such as the "generalized logit model" where (20) holds, the "conditional logit model" where (21) holds, and the "mixed logit model," which is a combination of the two, where (22) holds. The generalized logit model focusses on the individual as the unit of analysis, and uses individual characteristics as explanatory variables, e.g., age of the person in the transport example. The conditional logit model uses explanatory variables that take different values for each alternative, and the impact of a unit change in x_k is assumed to be constant across alternatives, e.g., journey time in the choice of transport mode. Unfortunately, there is confusion in the literature over the terminology of these models. Some authors call multinomial() with (20) the "generalized logit model". Others call the mixed logit model the "multinomial logit model" and view the generalized logit and conditional logit models as special cases. In VGAM terminology there is no need to give different names to all these slightly differing special cases. They are all still called multinomial logit models, although it may be added that some have covariate-specific linear/additive predictors. The important thing is that the framework accommodates x_ij, so one tries to avoid making life unnecessarily complicated. Moreover, xij can in theory apply to any VGLM, not just the multinomial logit model. Imai et al. (2008) present another perspective on the x_ij problem with illustrations from Zelig (Imai et al. 2009).
Using the xij argument VGAM handles variables whose values depend on η_j, as in (22). The xij argument is assigned an S formula or a list of S formulas. Each formula, which must have M different terms, forms a matrix that premultiplies a constraint matrix. In detail, (20) can be written in vector form as

η(x_i) = B^T x_i = Σ_{k=1}^p H_k β*_k x_ik,   (23)

where β*_k = (β*_(1)k, . . . , β*_(r_k)k)^T is to be estimated. This may be written

η(x_i) = Σ_{k=1}^p diag(x_ik, . . . , x_ik) H_k β*_k.   (24)

To handle (21)-(22) we can generalize (24) to

η_i = Σ_{k=1}^p diag(x_ik1, . . . , x_ikM) H_k β*_k  (= Σ_{k=1}^p X*_(ik) H_k β*_k, say).   (25)

Each component of the list xij is a formula having M terms (ignoring the intercept) which specifies the successive diagonal elements of the matrix X*_(ik). Thus each row of the constraint matrix may be multiplied by a different vector of values. The constraint matrices themselves are not affected by the xij argument.
How can one fit such models in VGAM? Let us fit (19). Suppose the journey cost and time variables have had the cost and time of walking subtracted from them. Then, with ".trn" denoting train, a vglm() call supplying the per-predictor variables through the xij argument should do the job. Here, the argument form2 is assigned a second S formula which is used in some special circumstances or by certain types of VGAM family functions. The model has H_1 = I_3 and H_2 = H_3 = H_4 = 1_3 because the lack of parallelism applies only to the intercepts. However, unless Cost is the same as Cost.bus and Time is the same as Time.bus, this model should not be plotted with plotvgam(); see the author's homepage for further documentation.
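A sketch of what such a call might look like follows. The data frame travel, its column names, and the exact argument values are assumptions for illustration; consult the package documentation for a working example:

```r
library(VGAM)

# Hypothetical data frame 'travel': one row per person, with indicator
# columns bus, trn, car, walk for the chosen mode; per-mode covariates
# Cost.bus, Cost.trn, Cost.car, Time.bus, Time.trn, Time.car; and the
# person-specific Distance. Cost and Time act as placeholders whose
# eta_j-specific values are supplied through xij.
fit <- vglm(cbind(bus, trn, car, walk) ~ Cost + Time + Distance,
            multinomial(parallel = TRUE ~ Cost + Time + Distance - 1),
            xij = list(Cost ~ Cost.bus + Cost.trn + Cost.car,
                       Time ~ Time.bus + Time.trn + Time.car),
            form2 = ~ Cost.bus + Cost.trn + Cost.car +
                      Time.bus + Time.trn + Time.car +
                      Cost + Time + Distance,
            data = travel)
```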

A more complicated example
The above example is straightforward because the variables were entered linearly. However, things become more tricky if data-dependent functions are used in any xij terms, e.g., bs(), ns() or poly(). In particular, regression splines such as bs() and ns() can be used to estimate a general smooth function f (x ij ), which is very useful for exploratory data analysis.
Suppose we wish to fit the variable Cost with a smoother. This is possible with regression splines and a trick. First, note that naively applying ns() to each variable separately within the xij formula will not work, because the basis functions for ns(Cost.bus), ns(Cost.trn) and ns(Cost.car) are not identical: the knots differ. Consequently, they represent different functions despite having common regression coefficients.
Fortunately, it is possible to force the ns() terms to have identical basis functions by a trick: combine the vectors temporarily. To do this, one can define a wrapper function that computes a natural cubic B-spline evaluated at x but uses its other arguments as well to form an overall vector from which the (common) knots are obtained.
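A minimal version of such a helper, using the splines package (the name NS and the default df = 3 are assumptions; the version documented for VGAM may differ in detail):

```r
library(splines)

# Evaluate a natural cubic spline basis at x, choosing the knots from the
# combined vector c(x, ...); repeated calls such as NS(Cost.bus, Cost.trn,
# Cost.car), NS(Cost.trn, Cost.bus, Cost.car), ... then share identical
# basis functions because the knots come from the same pooled vector.
NS <- function(x, ..., df = 3) {
  combined <- c(x, unlist(list(...)))
  ns(combined, df = df)[seq_along(x), , drop = FALSE]
}
```

Inside an xij formula one would then use NS(Cost.bus, Cost.trn, Cost.car) and its permutations, so that each term is evaluated with the same knots.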

Discussion
This article has sought to convey how VGLMs/VGAMs are well suited to fitting regression models for categorical data. Their primary strength is a simple and unified framework which, when reflected in software, makes practical CDA more understandable and efficient. Furthermore, there are natural extensions, such as a reduced-rank variant and covariate-specific η_j. The VGAM package thus potentially offers a wide selection of models and utilities.
There is much future work to do. Some useful additions to the package include: 1. Bias-reduction (Firth 1993) is a method for removing the O(n^-1) bias from a maximum likelihood estimate. For a substantial class of models including GLMs, it can be formulated in terms of a minor adjustment of the score vector within an IRLS algorithm (Kosmidis and Firth 2009). One by-product, for logistic regression, is that while the maximum likelihood estimate (MLE) can be infinite, the adjustment leads to estimates that are always finite. At present the R package brglm (Kosmidis 2008) implements bias-reduction for a number of models. Bias-reduction might be implemented by adding an argument bred = FALSE, say, to some existing VGAM family functions.
2. Nested logit models were developed to overcome a fundamental shortcoming of the multinomial logit model, viz. the independence of irrelevant alternatives (IIA) assumption. Roughly, the multinomial logit model assumes that the ratio of the choice probabilities of two alternatives does not depend on the presence or absence of other alternatives in the model. This presents problems that are often illustrated by the famed red bus-blue bus problem.
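The IIA property is easy to see numerically in base R (the η values below are made up):

```r
# Under a multinomial logit model, P(Y = j) / P(Y = k) = exp(eta_j - eta_k),
# no matter which other alternatives are on offer.
eta <- c(bus = 1.0, train = 0.5, car = 2.0)      # made-up linear predictors
p3 <- exp(eta) / sum(exp(eta))                   # all three alternatives
p2 <- exp(eta[-3]) / sum(exp(eta[-3]))           # "car" removed

p3[["bus"]] / p3[["train"]]   # ratio with car present
p2[["bus"]] / p2[["train"]]   # identical ratio with car absent
```

Hence introducing a "blue bus" identical to an existing "red bus" would, under this model, draw probability proportionally from all modes rather than mainly from the red bus, which is usually unrealistic.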
3. The generalized estimating equations (GEE) methodology is largely amenable to IRLS and should be added to the package in the future.
4. For logistic regression, SAS's proc logistic gives a warning if the data are completely separated or quasi-completely separated. The effect of such data is that some regression coefficients tend to ±∞. With such data, all R implementations known to the author give warnings that are vague, if any at all, and this is rather unacceptable (Allison 2004). The safeBinaryRegression package (Konis 2009) overloads glm() so that a check for the existence of the MLE is made before a binary response GLM is fitted.
In closing, the VGAM package is continually being developed; therefore some future changes in implementation details and usage may occur. These may include non-backward-compatible changes (see the NEWS file). Further documentation and updates are available at the author's homepage, whose URL is given in the DESCRIPTION file.