The Rasch Sampler

The Rasch sampler is an efficient algorithm to sample binary matrices with given marginal sums. It is a Markov chain Monte Carlo (MCMC) algorithm. The program can handle matrices of up to 1024 rows and 64 columns. A special option allows to sample square matrices with given marginals and fixed main diagonal, a problem prominent in social network analysis. In all cases the stationary distribution is uniform. The user has control on the serial dependency.


Introduction
Parameter estimates in the Rasch model only depend on the marginal totals of the data matrix that is used for the estimation. From this it follows that, if the model is valid, all binary matrices with the same marginals as the observed one are equally likely. For any statistic of the data matrix, one can approximate the null distribution, i.e., the distribution if the Rasch model is valid, by taking a random sample from the collection of equally likely data matrices and constructing the observed distribution of the statistic (Besag and Clifford 1989). One can then simply determine the exceedance probability of the statistic in the observed sample (its p-value), and thus construct a non-parametric test of the Rasch model. The accuracy of the observed (i.e., simulated) distribution will depend on the sample size, and this in turn will only depend on the time one can or wants to spend to draw the random sample.
2. The choice of the next state is based on importance sampling, not on random sampling from the neighborhood.
3. The stationary distribution is uniform by an application of the Metropolis-Hastings algorithm to the basic transition matrix.
These three issues will be discussed in three separate sections. In a separate section the algorithm will be extended to the case of square matrices with fixed diagonals. Proofs of propositions will be omitted since these are published elsewhere (Verhelst 2007).

Tetrad and binomial neighborhoods
Let Σ rc be the sample space, i.e., the set of all binary matrices with row totals given by the n-dimensional vector r, and column totals given by the k-dimensional vector c. To exclude the possibility that Σ rc = ∅, it will be assumed that there exists a matrix with these marginals. Usually this will be an observed matrix, that will be denoted as A 0 .

The case of n × 2 matrices
The concepts of tetrad and binomial neighborhoods will be introduced by considering the special case of matrices with two columns. Since in such matrices each row consists of one out of four possible patterns, such a matrix, A say, can be summarized like in Table 1. Table 1: Summary of a n × 2 matrix.
A tetrad is defined as a 2×2 submatrix of A (by selecting 2 rows) that has one of the following forms: 1 0 0 1 or 0 1 1 0 (1) A tetrad transformation consists of changing either form of (1) into the other for a single tetrad. By such a transformation the row and column totals of the matrix do not change.
The tetrad neighborhood of matrix A is the set of all matrices that can be obtained by a tetrad transformation on A. This set will be denoted by A T (A). Notice that A itself does not belong to its tetrad neighborhood. From Table 1, it is clear that By a finite sequence of tetrad transformations all matrices belonging to Σ rc can be reached.
In the matrix represented by Table 1, there are m = a + b rows with a row total of one. A binomial operation consists of assigning a one to the first position and a zero to the second position for a of these m rows. The b other rows have the complementary pattern. By such an operation on A, the marginals do not change. A binomial transform of A is a matrix that results from a binomial operation on A and that is not equal to A. The binomial neighborhood of A is defined as the set of binomial transforms of A, and denoted by A B (A). From Table 1 it will be clear that and that i.e., a tetrad transform is a binomial transform.
2.2. The general case of n × k matrices.
For each of the k 2 column pairs of a matrix A ∈ Σ rc , a submatrix consisting of these two columns can be formed, and a table like Table 1 can be constructed. Applying a binomial transformation on this submatrix, and putting the transformed columns back in the original matrix will be called a binomial transformation of the original matrix. The notation is adapted as follows. Column pairs are denoted by the ordered pair (i, j), (i < j ≤ k). The number of rows in the submatrix having a row total of one is denoted as m ij (A), the number of these rows having a one in column i is denoted as a ij (A) and b ij (A) = m ij (A) − a ij (A). The explicit reference to the matrix A will be dropped if it is clear from the context which matrix is referred to. The set of matrices that can be constructed by applying a binomial transformation on the column pair (i, j) will be called the B ij neighborhood of A and will be denoted as A (i,j) B (A).
A similar notation can be applied for tetrad neighborhoods: the T ij neighborhood of A is the set of matrices that can be formed by a tetrad transformation on the column pair (i, j) of A, and is denoted as A (i,j) T (A). Next, B-and T -neighborhoods are defined.
An easy to prove but important result is given in Verhelst (2007).

Proposition 1 For any two column pairs
A similar result holds for tetrad neighborhoods: It may be the case that for a column pair (i, j) it holds that a ij = 0 or b ij = 0, in which case there are no tetrads for this pair.
Definition 3. The column pair (i, j) is a Guttman pair if a ij × b ij = 0. If a ij × b ij > 0, the pair will be called regular.
is a regular pair.
In the testing phase of the program, two artificial data matrices with 300 rows and 30 columns were created. In Table 2, the sizes of their tetrad and binomial neighborhoods are displayed, together with an estimate of the sizes of the two sample spaces. For details, see Verhelst (2007); for the estimate of the size of the sample space, see also Chen, Diaconis, Holmes, and Liu (2005).

The basic transition matrix of the Markov chain
The transition matrix of a Markov chain, P = (p st ), contains the transition probabilities, where p st is the probability that the next state is s given that the current state is state t. In the classical approach, p st is defined as representing simple random sampling from the tetrad neighborhood. It is well know that this Markov chain is irreducible, and therefore has a stationary distribution (Rao et al. 1996). Using binomial neighborhoods, a similar sampling scheme can be followed, by sampling randomly from the binomial neighborhood of the current matrix. It turns out, however, that this sampling scheme is not very efficient. The variability of #A B (A t ) over column pairs can be considerable, with the consequence that a binomial transformation is applied to the same column pair for very long sequences of transitions. To avoid this, a transition matrix Q = (q st ) for the Markov chain is defined as where and Notice that d t in (8) is not well defined only in case k 2 (A t ) = 0, but then, from Proposition 2, #Σ rc = 1, and the problem is solved. This trivial case will be left out of consideration. Notice also that (9) is well defined and unambiguous because of Proposition 1: if A s ∈ A B (A t ), then there exists exactly one pair of columns, (i, j) say, such that A s ∈ A (i,j) The stationary distribution of the Markov chain defined by (7) is given in the Proposition 3. The proof is given in Verhelst (2007).
The Markov chain can be implemented, using the following algorithm.
Algorithm 1. Importance sampling from the binomial neighborhood of A 1. Select randomly a pair of columns from the k 2 (A) regular column pairs of A.
2. Apply a random binomial operation to the selected pair. If the resulting matrix equals A, repeat step 2; otherwise the resulting matrix is the new state.
Notice that all matrices from A B (A) have a positive probability of being sampled, but the probabilities are not equal, because the probability that column pair

The Metropolis-Hastings algorithm
Extensive experimentation with Algorithm 1 showed that the variability in the k 2 -measures is often very small but not zero, so that the stationary distribution of Q is not uniform. The Metropolis-Hastings algorithm is an elegant procedure to make the stationary distribution uniform. In general, the algorithm transforms a defined transition matrix, Q say, into a transition matrix Q * with a prespecified vector π as stationary vector. The algorithm is defined as (see, e.g., Tanner (1994) for a concise description) where The diagonal elements are given by If we want the uniform stationary distribution, then π s /π t = 1. Taking Q as defined by (7), because w st = w ts (Verhelst 2007), we obtain that whence we can easily deduce our final result: Algorithm 2. Importance sampling and Metropolis-Hastings 1. Select randomly a pair of columns from the k 2 (A) regular column pairs of A.
2. Apply a random binomial operation to the selected pair, yielding A * .
, then the new state remains A with probability 1−k 2 (A)/k 2 (A * ), otherwise the new state is A * .

Square matrices with fixed diagonals
In social network theory, often a square incidence matrix, representing a binary asymmetric relation has to be analyzed. Having a tool to sample from the set of all binary matrices with given marginals and with the main diagonal values kept at their (arbitrary) value is valuable for social network research (Wasserman 1977). In this section we discuss an adaptation of Algorithm 2 to fit these cases.
The adaptation is relatively simple, thanks to a nice theorem of Rao et al. (1996). They start from a simple observation. In a 3 × 3 subtable of the matrix, formed by taking out three rows and the corresponding columns, the following two patterns, called alternating hexagons, can occur:  where * indicates a fixed value. Notice that in neither of the two tables there is a tetrad, and yet, the marginals of the table do not change when an alternating hexagon is replaced by its complement, i.e., by a hexagon transformation. But this implies that in general not all matrices can be constructed by a finite sequence of tetrad transformations.
In their Theorem 2, Rao et al. (1996) prove that in case of a square binary matrix with fixed marginals and a fixed main diagonal, all matrices can be constructed from any other matrix by a finite sequence of tetrad transformations and hexagon operations.
Now it is easy to extend Algorithm 2 to apply also to this case. We do this by a number of easy to check features of the neighborhood: 1. The definition of the binomial neighborhood has to be adapted a little bit in a straightforward manner, in the sense that m ij is the number of rows in column pair (i, j) that are of the form (0, 1) or (1, 0). Patterns containing a fixed value are not counted but are treated in the same way as the patterns (0, 0) or (1, 1).
2. If we apply a single hexagon operation to matrix A, exactly three columns of the matrix are changed, so that the resulting matrix cannot belong to the tetrad or binomial neighborhood of A.
3. So in a natural way we can define the hexagon neighborhood of a matrix A as the set of matrices that can be constructed with a single hexagon transformation. This neighborhood will denoted as A H (A).
4. Since a hexagon operation changes three columns of the matrix, and a tetrad or a binomial transformation change two columns, it is clear that so that the combined tetrad-hexagon or binomial-hexagon neighborhoods are partitioned much in the same way as the tetrad or binomial neighborhoods.
5. Next we define for all A ∈ Σ rc , the k 3 -measure as Without giving all the details, we can describe an algorithm which is very closely related to Algorithm 2.
Algorithm 3. Importance sampling from the combined binomial-hexagon neighborhood of A 1. If k 2 (A) = k 3 (A), apply Algorithm 2. 6. An implementation in R: The RaschSampler package 6.1. General structure The workhorse of the RaschSampler is a FORTRAN 95 subroutine which implements the MCMC algorithm described in this paper as an R (R Development Core Team 2007) package. It is called from the R (wrapper) function rsampler. Apart from several parameters controlling the algorithm (to be explained in Section 6.2) it expects a binary matrix (usually the reponses from n subjects to k items, or a k × k adjacency matrix) as input and returns the generated matrices in encoded form. The output is stored in a list together with all control parameters to allow further operations (calculating statistics, saving the results of the sampler, replicating the sampling process, etc.). To make use of these sampled matrices the user has to define an appropriate R function that operates on each of the generated matrices, e.g., calculating a statistic (for convenience the RaschSampler package contains an example user function phi which calculates the range of all inter-column correlations, denoted as R ϕ henceforth). The second main function of the package is rstats, a (wrapper) function which decodes the sampled matrices and passes them to the user defined function in turn. The output of rstats is a list with the same number of elements as the number of matrices passed to it. Each element contains the output from the user defined function, e.g., the statistic(s) for a (sampled) matrix.
All other functions in the package are utility functions, for defining the control parameters, printing the status of the output lists, and extracting some of the generated matrices.

User control
The use of MCMC methodology has two serious drawbacks. The first one is that the stationary distribution is an asymptotic distribution, and one can never be sure to sample from this distribution after a finite number of steps through the sample space. The second drawback has to do with serial dependency: even when the stationary distribution has been reached, a sequence of draws from the sample space will show autocorrelations, meaning that the sampled matrices are not independent of each other, even if their distribution is uniform.
To face these two problems, two practical tools are provided in the program, by means of the two tuning parameters burn_in and step. These are explained in turn.
If Q is a transition matrix of a Markov chain with a uniform stationary distribution, then Q , = 1, 2, . . .also has the same stationary vector. In practice, this means that steps governed by Q will be carried out before the resulting matrix is considered as a sampled matrix. As increases, the serial dependency will decrease. In the program the value of is at the discretion of the user by the tuning parameter step.
To allow the process to approximate its stationary distribution, it is customary to use a number of idle steps through the sample space, i.e. sampling matrices that are not used to compute the statistic(s) of interest. This number is under the control of the user by the tuning parameter burn_in. Notice, however, that the number of matrices generated before the real sampling starts equals the product of the two parameters burn_in and step.
After the burn in period, a number of matrices are sampled, using Q . This number is the sample size and is governed by the tuning parameter n_eff.
If the program is run, the total number of matrices generated equals step * (burn_in + n_eff) However, on output the resulting list contains only n_eff + 1 (encoded) matrices, i.e., n_eff sampled plus the original input matrix (in position 1).
The other user controls concern the seed of the random number generator (the default value of zero generates a random seed based on the system time, the used seed value is returned from the sampler) and the parameter tfixed that enables the sampling of square matrices with a fixed diagonal (e.g., adjacency matrices).

Some examples
For the first example we use a (ficitious) data matrix xmpl with n = 300 rows and k = 30 columns provided in the RaschSampler package.

R> data("xmpl")
The default control parameters can be obtained by calling the control function without any arguments.
To calculate statistics for these matrices we use rstats with the sampling result and the (predefined) user function phi as arguments and store the results in stat1: R> stat1 <-rstats(result1, phi) The output is R> unlist(stat1) As mentioned in Section 6.1 the output from rstats returns the values (statistics) from the user defined function specified as the second argument in the call of rstats(). For this example, we can see that the range of all item intercorrelations for the (fictitious) data matrix, i.e. R ϕ = 0.3517041, is of similar magnitude as the corresponding values for the five sampled matrices (stored in positions 2 -6).
A utility function rsextrobj has been defined to allow for extracting certain parts from the sampling result. For example, if we want to perform the statistics function only to the first two generated matrices we might use R> resultparts <-rsextrobj(result1, start = 2, end = 3) R> summary(resultparts) Status of extracted object resultparts : n = 300 k = 30 burn_in = 100 n_eff = 2 step = 16 seed = 123 tfixed = FALSE n_tot = 2 outvec contains 600 elements R> stat1parts <-rstats(resultparts, phi) R> unlist(stat1parts) 1 2 0.3661533 0.3874665 The summary shows that resultparts is an extracted object and that the number of contained matrices is n_tot = 2. Since n_eff, the number of sampled matrices is also 2 we know that the original matrix is no longer contained in the list resultparts.
Original sampling results as well as extracted parts can be saved and reloaded for later usage, e.g., by R> save(stat1parts, file = "some.RSobj") R> again <-load("some.RSobj") The second example examines the quantile for the statistic R ϕ (as implemented in the example user function phi) for a larger data matrix. The data matrix xmplbig is provided in in the RaschSampler package and has n = 1024 rows and k = 64 comlumns.

R> data("xmplbig")
The sampling process is specified to generate n_eff = 1000 matrices and the starting value for the random number generator is fixed to seed = 987654321. The other control parameters are set to their default values.