spa: Semi-Supervised Semi-Parametric Graph-Based Estimation in R

In this paper, we present an R package that combines feature-based (X) data and graph-based (G) data for prediction of the response Y. In this particular case, Y is observed for a subset of the observations (labeled) and missing for the remainder (unlabeled). We examine an approach for fitting Ŷ = Xβ̂ + f̂(G), where β̂ is a coefficient vector and f̂ is a function over the vertices of the graph. The procedure is semi-supervised in nature (trained on the labeled and unlabeled sets), requiring iterative algorithms for fitting this estimate. The package provides several key functions for fitting and evaluating an estimator of this type. The package is illustrated on a text analysis data set, where the observations are text documents (papers), the response is the category of paper (either applied or theoretical statistics), the X information is the name of the journal in which the paper resides, and the graph is a co-citation network, with each vertex an observation and each edge the number of times that the two papers cite a common paper. An application involving classification of protein location using a protein interaction graph and an application involving classification on a manifold with part of the feature data converted to a graph are also presented.


Introduction
In this work, we present a package geared towards providing a semi-supervised framework for processing semi-parametric models. In this setting, the data are partitioned into two sets, one is a feature set X and the other is either an additional feature set Z or an observed graph G. Here X is an n × p data matrix, Z is an n × q data matrix, and the observed graph G can be written as G = (V, E), where the nodes in V correspond to the observations and E is the set of similarity edges between observations. We have a labeled response for a subset of the observations, denoted by Y_L of length m ≤ n. The data matrix X can be partitioned accordingly into a labeled part X_L and an unlabeled part X_U; settings of this kind arise in areas such as text analysis, genomics and drug discovery. In addition to observed graphs, this work is also applicable as an extension of standard semi-parametric modeling, where an additional feature set Z is observed separately from X. In the classical semi-parametric setting one would commonly fit φ(X_L, Z_L) = X_L β + f(Z_L), where f is a smooth function. For the spa package in this case we provide several R functions to convert Z to a proximity graph, Z → G[Z], and then subsequently fit φ(X, G[Z]) = Xβ + f(G[Z]). The advantage is that the information in X_U and Z_U influences training of φ, that is, the estimates are semi-supervised.
The spa package is written for the R programming environment (R Development Core Team 2011) and available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=spa. This package provides the implementation of the sequential predictions algorithm. The package design consists of three general phases: (i) graph construction, (ii) constructing the spa object and (iii) post fitting functionality. The graph construction segment provides a series of functions to process observed graphs or to construct graphs from part of the feature data. Upon completing this stage an n × n adjacency matrix is obtained with diagonal elements representing nodes (observations in the data) and off-diagonal elements representing edge associations between nodes. For the second phase a SPA object is created from the inputs using semi-supervised smoothers (discussed in Section 4). One technical challenge common with smoothers is tuning parameter estimation. The SPA features sophisticated algorithms for data-driven estimation of the smoother parameter based on a semi-supervised generalized cross-validation measure.
Step (ii) completes with a SPA object that encapsulates φ.
The post fitting step consists of generic function utilities for manipulating and accessing a SPA object, such as coef and fitted, functions to plot and print the object, and functions for performing prediction/updating with the SPA object. The prediction/updating functionality for the package was carefully designed to distinguish between transductive and inductive prediction. Transductive prediction requires re-training or updating φ to use local proximity information with the intent of improving performance, or to incorporate a new set of observations that became available after training. This is most useful when the unlabeled data provide intrinsic structural information that can improve the general analysis with the classifier. For example, prediction with an observed graph, G, is inherently transductive since unlabeled data influence the topology of the graph (Culp and Michailidis 2008a); or in cases when the Z information provides a manifold in the data that is pertinent for classification of Y_L (Chapelle, Schölkopf, and Zien 2006). An inductive learner is designed to predict a new observation using the classifier without retraining. In this case the new observation does not provide any structural information and we are interested in classifying this case according to the established rule. For example, constructing a contour plot of a classification rule is based on predicting a grid determined by the range of the data. The grid has no inherent meaning and therefore transductive prediction is inappropriate.
The SPA classifier is both transductive and inductive, i.e., transductive prediction is used to obtain the border which best represents the structure in the data, and then inductive prediction is used to classify new observations with respect to the border. The spa distinguishes between transductive and inductive prediction using the update and predict generics, respectively.
The author is aware of two software packages available for semi-supervised graph-based learning. The first is the SGT-light software, which performs an SVM-like approach on a graph to classify a response (Joachims 2003). The other is the Manifold Regularization software, which can be adapted to fit a function to an observed graph (Belkin et al. 2004). Neither approach to date performs internal parameter estimation, semi-parametric estimation, or is integrated in the R language. Indeed, the spa provides an R package that fits a semi-parametric graph-based estimator, that performs internal tuning parameter estimation, and that accommodates both transductive and inductive prediction.
In Section 2, we provide a simple example to gain insight for the graph-based classifiers implemented in this package. In Section 3, we provide analysis using spa in classification of the Cora text mining data set. In Section 4 we provide the methodology underpinning this approach, which motivates several key features of this approach including parameter estimation and transductive prediction. In Section 5, we provide the detailed layout of the spa package, discussing several functions offered for this package in graph construction (Section 5.1), object generation (Section 5.2), and post fitting (Section 5.3). In Section 6 we present two data examples, one involving classification on a manifold (Section 6.1) and the other involving protein interaction (Section 6.2). We close with some concluding remarks in Section 7.

Local classification on a graph
At the core of the proposed package are two intuitive algorithms designed to process a local similarity graph: (i) the fitting the fits (FTF) algorithm and (ii) the sequential predictions algorithm. To gain deeper insight into these approaches we first present the algorithms in an intuitive special case involving local similarity on a two-dimensional planar graph. In Section 4, we rigorously describe the algorithmic details, assumptions and generalizations of these algorithms as implemented in the package.
In graph-based classification, the observations are vertices on a graph and the edges between observations describe local similarity. For example, in Figure 1 we present a simple lattice, where each observation is defined by the intersection of two grid lines. In the figure there are two labeled vertices, one is black and the other is white, while there are 34 unlabeled vertices. For classification the goal is to estimate the probability of each vertex belonging to either class, i.e., obtain probability class estimates (PCEs) for each vertex.
For fitting the fits, we initialize all unlabeled cases with PCEs of 0.5 (gray in the figure). Then we classify every observation as the weighted local average using this initialization for unlabeled cases and the true labels for labeled cases. The new probabilities for the unlabeled cases provide estimates for the unlabeled cases and are likely more precise than the initializations. Therefore, we retrain the approach using the updated probabilities. This process is repeated until the unlabeled PCEs used in classification are the same as the unlabeled PCEs resulting from classification. This process is shown visually for the lattice example in Figure 1. The FTF algorithm is quite similar to semi-supervised harmonic approaches including random walks and label propagation (Chapelle et al. 2006, Chapter 11) but differs in that probability class estimates are also obtained for the labeled cases. The use of probability class estimates in the iterative phase of FTF for the unlabeled responses is referred to as soft labels, since they are not hard binary responses. The latter is referred to as hard labels (Culp and Michailidis 2008a).
In the case of fitting the fits all unlabeled observations are treated equally in iterative training. The sequential predictions algorithm provides a way to correct for the use of unlabeled probabilities in training by penalizing vertices farther away from the labeled cases. The process is to first train observations one hop away from labeled cases, then penalize the PCEs resulting from this round before treating them as labels for the observations two hops away, continuing until all vertices are classified.
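As a concrete illustration, the soft-label iteration just described can be sketched in a few lines of R; the 4-vertex weight matrix and the labels below are hypothetical, invented purely for illustration and not part of the package:

```r
## Fitting-the-fits (FTF) sketch on a toy similarity graph.
## W is a hypothetical symmetric 0/1 weight matrix; vertices 1-2 are labeled.
W <- matrix(c(0, 1, 1, 0,
              1, 0, 0, 1,
              1, 0, 0, 1,
              0, 1, 1, 0), 4, 4, byrow = TRUE)
S <- W / rowSums(W)           # local averaging smoother S = D^-1 W
L <- 1:2; U <- 3:4
yL <- c(1, 0)                 # observed labels for the two labeled vertices
pce <- rep(0.5, 4)            # initialize unlabeled PCEs at 0.5
for (k in 1:100) {
  pce[L] <- yL                # reset labeled cases to the truth each pass
  pce <- as.vector(S %*% pce) # weighted local average over neighbors
}
round(pce[U], 3)              # converges to (0.667, 0.333)
```

Vertex 3, adjacent to the positively labeled vertex, settles at 2/3 while vertex 4 settles at 1/3; note that the labeled cases are also refit on the final pass, mirroring the behavior described above.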

Motivating text analysis application
In this section, we provide a prototype analysis with the spa package to illustrate the key functionality of this package on the Cora text data set (McCallum, Nigam, Rennie, and Seymore 2000). For these data, the observations are n = 827 text documents published predominantly in engineering, computer science and statistics journals. The attributes were generated electronically on the web and then processed with in-house scripts. The X information is an 827 × 9 indicator matrix (with orthogonal columns) where X_ik = I{paper i is in journal k}, for k = 1, ..., 9, corresponding to the 8 journals listed in Table 1 plus a category for all other journals. The G information is a weighted graph represented by a co-citation network between the text documents. Specifically, the adjacency matrix is constructed with A_ij = I{i and j cite the same document}. The response is the binary indication that the specific document is classified to be about artificial intelligence. To begin analysis, we provide the necessary preprocessing:

R> data("coraAI")
R> y <- coraAI$class
R> x <- coraAI$journals
R> g <- coraAI$cite
R> keep <- which(as.vector(apply(g, 1, sum) > 1))
R> setdiff(1:length(y), keep)

Observations 12, 19, ..., 819 were removed since they were isolated vertices. A stratified sample is used to ensure that every journal is represented in the labeled and unlabeled partitions.
The spa is used to fit an estimate on the observed graph directly without the journal indication matrix (i.e., Ŷ = f̂(G)):

R> A1 <- as.matrix(g)
R> gc <- spa(y1, graph = A1, control = spa.control(dissimilar = FALSE))
R> gc
Call:
spa(y1, graph = A1, control = spa.control(dissimilar = FALSE))
Sequential Predictions Algorithm (SPA) with training= 4 %

The output provides several descriptive aspects of the model fit. First notice that the fit used 4% of the data as labeled during training. The fit measures provide standard fit description with root mean squared error, parameter estimation type (refer to Section 4) and transductive degrees of freedom. The transductive parameters were set to default, which means that the entire unlabeled data set was fit as one unit and no regularization was employed (i.e., FTF as a special case of SPA was employed). The unlabeled data measures provide a rough assessment of the potential usefulness of unlabeled data during training. The dissimilar flag must be set to FALSE in spa.control to indicate that the co-citation network is a similarity matrix. The natural concern is the performance of the SPA on U (the unlabeled set). When we calculate the misclassification error rate on U, the unlabeled error for this fit is extremely poor. An error this large typically indicates that the procedure is not using the unlabeled data appropriately during training. To investigate this, we provide a useful diagnostic by considering the A_UL partition of A. This partition provides the common links between labeled-unlabeled data, and if an unlabeled observation has no such links (i.e., A_UL,i × 1_L = 0) then the observation cannot be utilized directly for training. To assess the problematic unlabeled observations simply execute:

R> sum(apply(A1[U, L], 1, sum) == 0) / (n - m) * 100
[1] 39.92443

This tells us that nearly 40% of the cases have no direct associations with Y_L, which significantly deteriorates performance.
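The same diagnostic can be illustrated on a small hypothetical adjacency matrix (the vertices and links below are invented for illustration, not taken from the Cora data):

```r
## Percentage of unlabeled vertices with no direct labeled link.
A <- matrix(0, 6, 6)
A[1, 3] <- A[3, 1] <- 1      # labeled vertices: 1-2; unlabeled: 3-6
A[2, 4] <- A[4, 2] <- 1
A[5, 6] <- A[6, 5] <- 1      # vertices 5 and 6 only link to each other
L <- 1:2; U <- 3:6
pct <- sum(rowSums(A[U, L]) == 0) / length(U) * 100
pct                          # 50: half the unlabeled set is problematic
```

Vertices 5 and 6 have no path into the labeled set at all, so no amount of iteration can propagate label information to them directly.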
Next, we incorporate transductive prediction to account for unlabeled response estimates used in training by penalizing observations farther away from labeled cases. The idea is to sort the A_UL × 1_L information and break the data into regions enumerated from highest (most associated) to lowest (least associated). In other words, the vector A_UL × 1_L provides a sorting method that groups objects to be classified at subsequent rounds, similar to the shortest path used in the lattice example in Section 2. The observations are then sequentially estimated, where at each round new unlabeled node estimates are treated as true values (refer to Section 4.3 for the general SPA or Section 2 for a simple example). To perform transductive prediction with the spa package use the update generic with the dat argument:

R> pred <- update(gc, ynew = y1, gnew = A1,
+   dat = list(k = length(U), l = Inf))

The performance improves significantly. To update the spa object so that it corresponds to the above output execute:

R> gc <- update(gc, ynew = y1, gnew = A1,
+   dat = list(k = length(U), l = Inf), trans.update = TRUE)
R> gc
Sequential Predictions Algorithm (SPA) with training= 4 %

We see that the transductive parameters are updated with the new fit. The training unlabeled measure of 0.105 does not change when using transductive prediction.
Next, we investigate the estimate Ŷ = Xβ̂ + f̂(G), where X is the journal indication matrix and G is the co-citation network. To execute this with spa:

R> gjc <- spa(y1, x, g, control = spa.control(diss = FALSE))
R> gjc

The misclassification error rate on U for this fit is:

[1] 0.395466

The coefficient estimate is provided in the output and can be extracted using the coef generic. The output also shows that the estimate suffers from the same performance problems with respect to the unlabeled data as the previous fit. As before, this problem can be fixed by using transductive prediction:

[1] 0.197733

As we can see, the performance of the approach improved significantly after we updated the model to include transductive prediction. The strong performance on the unlabeled data suggests that using both the X information and the G information contributed to predicting the class label of a document. In addition, a coefficient vector is obtained that may provide interpretable value for assessing a specific journal's contribution to this classification problem.

Methodology
In Section 3, we provided an analysis illustrating several features utilized by the spa package. The output offers utilities useful for assessing and improving the fit, including transductive generalized cross-validation and degrees of freedom, the influence of observations in U on Ŷ_L, an unlabeled contribution Σ_i I{A_UL,i × 1_L = 0}, and a transductive prediction algorithm for processing data. Next we present the details of the algorithmic fitting methodology, which will also provide some justification for these measures of fit.
The proposed approach essentially extends the concept of a linear smoother into the semi-supervised setting using the fitting the fits algorithm (the sequential predictions algorithm is a generalization, for which the package is named, and is presented in Section 4.3) (Culp and Michailidis 2008b). It is also quite similar to self-training approaches that use the expectation-maximization (EM) algorithm (Chapelle et al. 2006; Culp and Michailidis 2008b). The first step is to initialize the unlabeled vector Ŷ_U^(0) = 0. Then FTF iteratively fits

  Ŷ^(k+1) = S Y(Ŷ_U^(k)),  where Y(Ŷ_U) = (Y_L, Ŷ_U),   (1)

where S is a linear smoother for response Ŷ and is independent of k. Notice that the labeled response estimate is reset to Y_L before fitting, which intuitively allows the procedure to train with the observed response (known response) and use the previous estimated response fit where the responses are unobserved (unknown). The algorithm is repeatedly executed until global convergence, defined by Ŷ^(k+1) ≈ Ŷ^(k). The unlabeled component satisfies Ŷ_U^(k+1) = S_UU Ŷ_U^(k) + S_UL Y_L, and therefore the geometric matrix series is guaranteed to converge whenever ρ(S_UU) < 1, where ρ is the spectral radius. The closed form convergent solution can be obtained as the limit of the geometric matrix series:

  Ŷ_U = (I − S_UU)^{-1} S_UL Y_L.

The spa package primarily fits stochastic based smoothers including kernel smoothers, graph smoothers and a semi-parametric extension involving kernel smoothing. However, before we study kernel smoothers in the semi-supervised setting, we first look at linear regression to gain important insight into the procedure, since several key aspects of the spa package are motivated from linear regression. Upon presenting FTF with linear regression (Section 4.1), we then focus on kernel smoothing and graph smoothing (Section 4.2), the full sequential predictions algorithm (Section 4.3), and semi-parametric smoothing (Section 4.4) in the semi-supervised setting.
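A short simulation (with an arbitrary Gaussian-type weight matrix, used here purely for illustration) confirms that the iterative FTF limit for the unlabeled cases matches the geometric-series closed form:

```r
set.seed(1)
n <- 10; L <- 1:4; U <- 5:10
d <- as.matrix(dist(matrix(rnorm(2 * n), n)))  # toy pairwise distances
W <- exp(-d^2); diag(W) <- 0                   # dense positive weights
S <- W / rowSums(W)                            # right stochastic smoother
yL <- rnorm(length(L))
yU <- rep(0, length(U))                        # FTF initialization
for (k in 1:500)
  yU <- as.vector(S[U, U] %*% yU + S[U, L] %*% yL)
yU.closed <- as.vector(solve(diag(length(U)) - S[U, U], S[U, L] %*% yL))
max(abs(yU - yU.closed))                       # numerically zero
```

Because the weights here are strictly positive, every row of S_UL sums to a positive number, so the spectral radius condition holds and the iteration converges.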

Example: Fitting the fits with linear regression
Next we study linear regression in the semi-supervised setting to provide insight and motivate key results, including transductive degrees of freedom and transductive generalized cross-validation. To implement semi-supervised linear regression, one must apply the hat matrix as the smoother in (1). Note the supervised hat matrix is H_LL = X_L (X_L' X_L)^{-1} X_L'. Semi-supervised linear regression with FTF is quite simple to fit in R, using assignment to .GlobalEnv nested in the sapply command:

R> set.seed(100)
R> x <- matrix(rnorm(40), 20)
R> z <- cbind(x, 1)
R> y <- z %*% rnorm(3) + rnorm(20, , 0.5)
R> H <- z %*% solve(t(z) %*% z, t(z))
R> L <- 1:5
R> U <- 6:20
R> yh <- rep(0, 20)
R> ftfls <- function(i) {
+   .GlobalEnv$yh[L] <- y[L]
+   .GlobalEnv$yh <- H %*% .GlobalEnv$yh
+ }
R> ftfres <- sapply(1:200, ftfls)

Upon convergence an estimate for the response is obtained in closed form with Ŷ. The natural question to ask is, "What are the properties of the convergent solution for linear regression with FTF?" The answer is clarified by comparing our result to the ordinary supervised least squares solution:

R> dat <- data.frame(y = y, x)
R> supls <- predict(lm(y ~ ., data = dat, subset = L), newdata = dat)
R> apply((ftfres - supls)^2, 2, sum)[c(1:2, 199:200)]
[1] 2.831134e+01 1.668253e+01 ... 8.545281e-15 7.265162e-15

By iteration 200 the difference between the two is negligible and therefore we see that semi-supervised linear regression using FTF reduces to the supervised least squares solution. This equivalency is proven for the generalized linear model in Culp and Michailidis (2008b). A similar result can be shown when treating the semi-supervised linear regression problem as a missing data problem (Little and Rubin 2002, p. 29). The equivalency between semi-supervised linear regression using FTF and supervised ordinary least squares results in two new expressions for the hat matrix. The equivalent hat matrices are given by:

  X_L (X_L' X_L)^{-1} X_L',   (2)
  H_LL + H_LU (I − H_UU)^{-1} H_UL,   (3)
  H_LL + H_LU H̃_UL,   (4)

where H̃_UL = X_U (X_L' X_L)^{-1} X_L' (the supervised unlabeled estimate). Upon convergence each formulation has the same error and the same trace.
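The equivalence between the supervised labeled hat matrix and the transductive block form can be checked numerically; the design matrix below is simulated only for this check:

```r
set.seed(2)
X <- cbind(1, matrix(rnorm(40), 20))                     # n = 20, p = 3
L <- 1:8; U <- 9:20
H <- X %*% solve(crossprod(X), t(X))                     # full-data hat matrix
Hsup <- X[L, ] %*% solve(crossprod(X[L, ]), t(X[L, ]))   # supervised hat
Htrans <- H[L, L] + H[L, U] %*%
  solve(diag(length(U)) - H[U, U], H[U, L])              # transductive form
max(abs(Hsup - Htrans))                                  # numerically zero
```

The identity is an instance of the Woodbury formula applied to (X'X − X_U'X_U)^{-1} = (X_L'X_L)^{-1}.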
The degrees of freedom of the residuals of the semi-supervised estimator are defined as df = m − tr(S), where S is taken as the labeled smoother in (3), since this form yields the smoother for the labeled estimator resulting from FTF, i.e., Ŷ_L = S Y_L. This equivalency between these projection matrices is useful in computing the generalized cross-validation (GCV) error rate of a smoother, i.e., the GCV for smoother S is defined as:

  GCV(S) = ||Y_L − S Y_L||² / (m (1 − tr(S)/m)²).

To do this, we define GCV with reference to (2) as labeled GCV ("lGCV"), GCV with reference to (3) as transductive GCV ("tGCV"), and GCV with reference to (4) as approximate transductive GCV ("aGCV"). All of these measures provide an identical value in linear regression but will differ in kernel smoothing (refer to Section 4.2). The "tGCV" criterion would be the best choice but is computationally demanding due to the need for (I − H_UU)^{-1}. We recommend using the approximate transductive GCV criterion since it is less demanding computationally and is still influenced by observations in U. The choice of GCV is set by spa.control with "aGCV" as the default. Remark: Linear regression is not implemented in the spa package since the lm command computes it equivalently.
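The GCV computation itself is only a few lines; this sketch (simulated data, with the labeled hat matrix standing in for the smoother) illustrates the mechanics:

```r
set.seed(3)
X <- cbind(1, rnorm(12))
y <- as.vector(X %*% c(1, 2) + rnorm(12, sd = 0.3))
S <- X %*% solve(crossprod(X), t(X))     # labeled smoother (hat matrix)
gcv <- function(S, y) {
  m <- length(y)
  yhat <- as.vector(S %*% y)
  sum((y - yhat)^2) / (m * (1 - sum(diag(S)) / m)^2)
}
gcv(S, y)
```

Swapping in the transductive or approximate smoother for S yields the "tGCV" and "aGCV" variants of the same quantity.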

Semi-supervised stochastic kernel smoothing
Next we study kernel smoothers, where a kernel function is applied directly to distances in the data to form a weight matrix W with entries W_ij = K_λ(d_ij) for a kernel K with bandwidth λ. In general λ is a tuning parameter where λ ≈ 0 tends to yield solutions with low bias and high variance while λ → ∞ tends to yield solutions with high bias and low variance. The matrix W admits the partition for i, j ∈ L ∪ U:

  W = ( W_LL  W_LU
        W_UL  W_UU ).

The FTF algorithm allows for the observations in W_UU, W_UL, and W_LU to influence training of the procedure, by iteratively fitting (i) the labeled estimate Ŷ_L and (ii) the unlabeled estimate Ŷ_U. Tracing the iteration, we get that the unlabeled estimator is

  Ŷ_U^(k) = Σ_{j=0}^{k−1} S_UU^j S_UL Y_L,

and take the limit as k → ∞ to get the closed form solution, Ŷ = Ŷ^(∞):

  Ŷ_U = (I − S_UU)^{-1} S_UL Y_L,   Ŷ_L = S_LL Y_L + S_LU Ŷ_U.   (5)

The condition for producing a stable estimator is discussed in Section 4.3.
Unlike in linear regression, Ŷ_L is not equivalent to the supervised estimator. However, the supervised estimator is influential on the result, as we show next. First, for kernel regression we have that S = D^{-1} W. Define D_L, D_U as the respective labeled/unlabeled diagonal submatrices of the diagonal row-sum matrix D. In addition, denote D_LL = diag(W_LL 1_L), that is, the row sum matrix of W_LL, and similarly denote D_LU = diag(W_LU 1_U), D_UL = diag(W_UL 1_L), and D_UU = diag(W_UU 1_U). From this, D_L = D_LL + D_LU and D_U = D_UU + D_UL. Next, we denote the supervised kernel smoother as T_LL = D_LL^{-1} W_LL and the supervised prediction smoother as T_UL = D_UL^{-1} W_UL. The supervised estimator is denoted by Ȳ_L = T_LL Y_L. To connect supervised estimation to Ŷ_L we have that:

  Ŷ_L = Q Ȳ_L + (I − Q) S_LUL Y_L,   (6)

where Q = D_L^{-1} D_LL ≤ I and S_LUL = D_LU^{-1} W_LU (I − S_UU)^{-1} S_UL. From this decomposition, we notice that if Q ≈ I then the supervised labeled estimate is close to the semi-supervised labeled estimate. On the other hand, shrinking this matrix provides more influence to S_LUL. It is easy to show that S_LUL is right stochastic, and therefore (6) is an average of two right stochastic matrices, which is also right stochastic. The value Q̄ ≡ tr(Q)/|L| provides a quick determination of the effect of the unlabeled data on the labeled estimate in training and is provided in the output of the spa object (0.105 for the Cora data). Values near one indicate that (6) is close to Ȳ_L = T_LL Y_L (supervised) while smaller values indicate that the estimate Ŷ_L is strongly influenced by the unlabeled data through S_LUL.
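The Q̄ diagnostic is straightforward to compute directly from the weight matrix; a sketch with simulated Gaussian-type weights (all values illustrative):

```r
set.seed(4)
n <- 12; L <- 1:4; U <- 5:12
d <- as.matrix(dist(matrix(rnorm(2 * n), n)))
W <- exp(-d^2); diag(W) <- 0
DL  <- rowSums(W[L, ])         # row sums of (W_LL | W_LU): diagonal of D_L
DLL <- rowSums(W[L, L])        # row sums of W_LL: diagonal of D_LL
Qbar <- mean(DLL / DL)         # tr(Q)/|L| with Q = D_L^-1 D_LL
Qbar                           # near 1 => labeled fit is nearly supervised
```

Each ratio DLL/DL is the fraction of a labeled vertex's total weight that stays within the labeled set, so Q̄ summarizes how much the unlabeled data can pull on the labeled fit.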
For the kernel smoothing problem, we are also interested in approximating the matrix S_LUL without requiring the computation of the inverse (for λ estimation). Using the linear regression result, we employ

  S̃_LUL = V^{-1} W_LU T_UL,

where V is the appropriate diagonal matrix to make the approximation right stochastic. This adjustment is extremely useful for fitting GCV as it is used by type = "aGCV" and is the default for spa.control.
This result in the case of kernel regression is not equivalent to GCV using the transductive smoother or GCV using the labeled smoother (refer to Section 4.1).
Graph-based smoothing is a natural generalization of kernel smoothing. The adjacency matrix of the graph is given by A and the smoother is obtained as S = D^{-1} A. The FTF is then applicable in the same manner as with the kernel smoothing approach discussed previously (i.e., A and W play similar roles). One key difference between graph-based smoothing and kernel smoothing is that the graph problem is inherently semi-supervised (often referred to as transductive in this special case), and therefore a supervised alternative does not exist. Care is taken with the design of the spa to allow for transductive prediction when necessary.

The sequential predictions algorithm
The FTF algorithm was previously employed with smoothers. The full sequential predictions algorithm is an extension of FTF that performs transductive prediction with the intent of improving performance (refer to the Cora example). Transductive prediction addresses the issue of updating a classification or regression technique to incorporate new observations (labeled or unlabeled) unavailable originally during training, to reassess the current unlabeled observations used previously during training, or both.
To motivate the need for the SPA algorithm, we first consider the circumstances under which (I − S_UU)^{-1} is expected to be unstable. To provide insight, recall that the smoother S = D^{-1} W is right stochastic, i.e.,

  S_UL 1_L + S_UU 1_U = 1_U.

In such a circumstance, a stable estimator is achieved whenever S_UU 1_U < 1_U (i.e., the sum of each row in S_UU is less than one). From the above equation we observe that the requirement is satisfied whenever S_UL 1_L > 0_U (i.e., every row in S_UL sums to a positive number). The instability occurs whenever S_UL 1_L ≈ 0_U or, sufficiently, W_UL × 1_L ≈ 0_U. This condition can easily be violated in practice. For example, a compact kernel could force weak connections between labeled and unlabeled data to vanish. It is also common to employ a non-compact kernel and then apply a k-nn function to that kernel (known as a k-nn graph), which results in the analogous effect. In the case of an observed graph, it is not difficult to obtain a vertex that is only connected to other unlabeled vertices (refer to the Cora text data). This problem becomes worse as |L| decreases for each of these examples, since there are fewer labeled cases to connect with the unlabeled cases.
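The instability is easy to reproduce: in the hypothetical graph below, two unlabeled vertices link only to each other, so the corresponding rows of S_UL are zero and the spectral radius of S_UU reaches one, making (I − S_UU) singular:

```r
W <- matrix(1, 6, 6); diag(W) <- 0
L <- 1:2; U <- 3:6
W[5:6, 1:4] <- 0; W[1:4, 5:6] <- 0   # vertices 5-6 detached from L entirely
S <- W / rowSums(W)
rowSums(S[U, L])                      # last two rows sum to 0
max(Mod(eigen(S[U, U])$values))       # spectral radius is exactly 1
```

In this configuration the geometric series behind (I − S_UU)^{-1} does not converge, which is precisely the situation the sequential predictions algorithm is designed to handle.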
To correct for this we extend the FTF algorithm into the sequential predictions algorithm (Culp and Michailidis 2008a). The main idea is to sort W_UL × 1_L and partition the unlabeled data U = {U^(ℓ)}_{ℓ=1}^k into k disjoint sets such that the set with the most labeled connectivity is employed first with FTF. In other words, apply FTF with labeled cases L^(0) = L and unlabeled cases in U^(1). Then treat the unlabeled estimates as true and at iteration ℓ train FTF with labeled cases L^(ℓ) = L^(ℓ−1) ∪ U^(ℓ) and unlabeled set U^(ℓ+1). This process repeats with a total of k calls to FTF. In Culp and Michailidis (2008a), we provide a penalty function based on constrained regularization to correct for the use of unlabeled predictions at subsequent rounds. The main idea is to increasingly push the estimates towards a global point estimate (the mean of Y_L) as the algorithm progresses. This entire approach is implemented in the update generic, requiring parameter dat = list(k = 0, l = Inf) by default with subparameters k (number of regions) and l (inverse regularization penalty, i.e., l = 0 is infinite regularization while l = Inf is no regularization). The approach is shown to work quite well for the Cora text data and was also shown to be competitive with several state-of-the-art graph-based semi-supervised approaches. An extension to hard labels based on exponential loss is also discussed in Culp and Michailidis (2008a) and is implemented in spa (use type = "hard" for the spa call and reg = "hlasso" for update).
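The sorting-and-partitioning step at the heart of the SPA can be sketched directly (random symmetric 0/1 graph, three regions; all values here are illustrative):

```r
set.seed(5)
n <- 20; L <- 1:5; U <- 6:20
B <- matrix(rbinom(n * n, 1, 0.4), n)
W <- B * t(B); diag(W) <- 0              # symmetric 0/1 adjacency
conn <- rowSums(W[U, L, drop = FALSE])   # labeled connectivity W_UL 1_L
k <- 3
ord <- U[order(conn, decreasing = TRUE)] # most connected vertices first
regions <- split(ord, rep(1:k, each = length(U) / k))
sapply(regions, length)                  # three regions of five vertices
```

The first region (highest labeled connectivity) would be fit by FTF first; its estimates are then treated as labels when fitting the second region, and so on.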

An optimization framework for the semi-parametric model
The fitting the fits algorithm provides a simple and effective way to fit data with an unlabeled partition and/or graph-based terms. The connection to optimization is now established for fitting semi-parametric models. In order to do this, we first establish the link between the FTF algorithm and optimization. Then we generalize this approach to the more general semi-parametric problem.
For this, denote φ(y, f) = (y − f)² as the squared error loss function, and consider the following optimization problem:

  min_f Σ_{i∈L} φ(Y_i, f_i) + f' P f,

where P is a positive semi-definite penalty matrix. In this construct the loss is incurred from the labeled estimator f_L to the observed response Y_L, while the full function is accounted for with the penalty term. The penalty matrix P is taken as the combinatorial Laplacian P = ∆ ≡ D − W for observed graph and constructed graph terms.
Next we demonstrate that the FTF algorithm with kernel regression does indeed solve the following optimization problem:

  min_f (Y(f_U) − f)' W (Y(f_U) − f) + f' ∆ f,

where Y(f_U) = (Y_L, f_U) is treated as fixed at each iteration. From this, the derivative solves −W(Y(f_U) − f) + ∆f = 0. Rearranging, we get that

  f̂ = (W + ∆)^{-1} W Y(f̂_U) = S Y(f̂_U),

where S = (W + ∆)^{-1} W = D^{-1} W. Solving the fixed point f̂_U = S_UU f̂_U + S_UL Y_L and plugging the solution back in yields the identical solution to (5) for kernel regression.
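Since W + ∆ = W + (D − W) = D, the stationarity condition immediately yields S = D^{-1} W; a quick numeric confirmation with toy weights:

```r
set.seed(6)
d <- as.matrix(dist(matrix(rnorm(16), 8)))
W <- exp(-d); diag(W) <- 0
D <- diag(rowSums(W))
Delta <- D - W                                   # combinatorial Laplacian
max(abs(solve(W + Delta, W) - W / rowSums(W)))   # numerically zero
```

This identity is why the penalized optimization view and the local-averaging view of FTF coincide for kernel regression.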
Using this approach, we then motivate the semi-parametric model η = Xβ + f(G) as the solution to:

  min_{β,f} (Y(η_U) − Xβ − f)' W (Y(η_U) − Xβ − f) + f' ∆ f.

As we proceeded in kernel regression, the term Y(η_U) = (Y_L, η_U) is treated as fixed at each iteration and is independent of η_U in the differentiation. Taking the gradient with respect to f we get that

  f̂ = (W + ∆)^{-1} W (Y(η_U) − Xβ) = S (Y(η_U) − Xβ).

Taking the gradient with respect to β and substituting for f̂ we get that:

  β̂ = (X' W (I − S) X)^{-1} X' W (I − S) Y(η_U).

From this, we get that η̂ = Xβ̂ + f̂ = Xβ̂ + S(Y(η_U) − Xβ̂) = H(S) Y(η_U) for some smoother H(S). From this, the FTF for the semi-parametric estimator reduces to: initialize η̂_U^(0) = 0 and iterate η̂^(k+1) = H(S) Y(η̂_U^(k)) until convergence.

The spa package
The spa package provides semi-supervised functions in R to fit and evaluate an estimate of the form Ŷ = f̂(G) or Ŷ = Xβ̂ + f̂(G), where β̂ is the coefficient vector, Ŷ is the estimated response, and f̂ is a function defined over the vertices of G. The process is broken into three key phases: (i) processing the graph, (ii) object generation and (iii) post fitting (refer to Figure 2). The first phase provides a series of functions for processing an observed graph (refer to the Cora example above) or a graph constructed from the Z data. The second phase is generating the spa object with the tuning parameter estimation functionality. The third phase provides post fitting functions featuring transductive prediction with the update generic, inductive prediction with the predict generic, and a series of other standard generic functions.

Figure 3: Scatterplot for two variables X_1 and X_2 with the class label as light gray, dark gray, or missing (small gray circle). (left) A supervised kernel smoother is gray, and the SPA fit is black. (right) Transductive prediction with the spa; the original fit for the two moon data set is black, while the retrained spa is gray and dashed.

Two moon simulated data
To illustrate the key functions necessary for fitting a model with the spa package we consider a synthetic two moon data set (refer to Figure 3). The unlabeled data are generated based on the intuitive cluster assumption: observations in the same cluster should share the same label. In this example, the SPA breaks the unlabeled data into two moon-like clusters, a separation that cannot be found by the supervised kernel smoother.
To load the data in R execute:

R> library("spa")
R> set.seed(100)
R> dat <- spa.sim(type = "moon")

The three phases for spa presented in Figure 2 are each discussed in detail for these data.

Processing graph
The graph processing stage highlighted in Figure 2 provides the means necessary for preparing the Z information or an observed graph for processing with the spa function. The functions for manipulating these data consist of uDist, knnGraph, epsGraph, and floyd.
In the case of an observed graph, this step accounts for several diverse types of similarity graphs. For example, if the graph were sparse (e.g., a planar graph) then one may wish to employ the shortest path distance metric on the graph using the uDist function. The purpose of performing this manipulation with a sparse graph is to relate distant nodes with a distance proximity measure through neighboring links. As a consequence, the unlabeled observations that were not originally directly linked to labeled cases may now have associations through neighboring unlabeled observations that were connected with labeled cases. On the other hand, when the graph is dense, as with the Cora example, this type of manipulation is not necessary and in our experience can hinder performance. If the uDist function is employed with an observed graph, then the dissimilar flag in spa.control must remain TRUE; otherwise it must be set to FALSE. In the case of converting Z information to a graph, the first step is to specify an appropriate distance metric on Z; for the simulated data, we would employ Euclidean distance to obtain the distance matrix Dij. For Z matrices with large q, restricting to local connectivity with a k-nn or ε-neighborhood graph can significantly improve performance of this technique (Tenenbaum, Silva, and Langford 2000). However, the k-nn object is not a proper distance metric (the triangle inequality does not hold for a k-nn graph), and we often wish to convert it into a proper metric with Floyd's algorithm. To do this with spa execute:

R> DNN <- knnGraph(Dij, dist = TRUE, k = 5)
R> DFL <- floyd(DNN)

The flag dist = TRUE indicates that the object is distance weighted.
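The effect of floyd can be sketched from scratch with the classic Floyd-Warshall recursion; the function floydWarshall below is a stand-in written here for illustration, not the package function:

```r
floydWarshall <- function(D) {        # D: distances, Inf where no edge
  n <- nrow(D)
  for (k in 1:n)
    for (i in 1:n)
      D[i, ] <- pmin(D[i, ], D[i, k] + D[k, ])
  D
}
D <- matrix(Inf, 4, 4); diag(D) <- 0  # toy 4-vertex k-nn style graph
D[1, 2] <- D[2, 1] <- 1
D[2, 3] <- D[3, 2] <- 1
D[3, 4] <- D[4, 3] <- 2
floydWarshall(D)[1, 4]                # shortest path 1-2-3-4 has length 4
```

After the recursion every pair of vertices in the same connected component has a finite distance satisfying the triangle inequality, which is what makes the result a proper metric.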
Upon obtaining the appropriate graph representation for the spa function, we focus next on generating a spa object. Notice that the other spa input parameters, Y and X, require only minor processing: for Y, set Y_U = NA, and make sure X is either a data.frame or matrix.

Generating a spa object
The spa object encapsulates the model fit by the procedure. This step accounts for parameter estimation, problematic unlabeled data, and object generation. The parameter estimation procedure is designed to estimate parameters efficiently using one of four generalized cross-validation measures with either a bisection algorithm or a grid search. The problematic unlabeled cases (those with (W_UL 1_L)_i = 0, i.e., no links to the labeled set) are carefully identified and handled with the global argument. The final object is then generated, reporting several useful measures that describe the current fit. The functions employed here include spa.control, spa, and internal GCV/fitting functions.
For parameter estimation, one must first consider the spa.control function. Here, the specified GCV is optimized over a set of potential λ candidates, or a value is provided directly using the ltval option. To determine the set of potential candidates we provide a fast bisection algorithm and a slower, more thorough, grid search. For the bisection algorithm the key parameters are lqmax = 0.2, lqmin = 0.05, ldepth = 10, and ltmin = 0.05. The lqmax and lqmin parameters set a search range based on the [lqmin, lqmax] · 100 quantiles of the density of the finite distances. In our experience, this provides a fairly good spread of potential λ candidates. In some cases, the quantile determined by lqmin is negative due to a large density at zero, and one can set the parameter ltmin to correct for this. The bisection algorithm recursively divides the interval in half and finds the subinterval that contains the minimum GCV. This process is repeated ldepth times. The approach is fast and usually works well. However, the algorithm can get stuck and arrive at an unsuitable λ estimate. To correct for this, a uniform search over a grid of size lgrid between the lqmin and lqmax quantiles can be performed to estimate λ.
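The bisection idea can be sketched generically. The gcv function below is a toy stand-in for the package's internal GCV computation, and the interval endpoints play the role of the lqmin/lqmax quantiles:

```r
# Toy sketch of the bisection search: repeatedly halve the current
# lambda interval, keeping the half whose quarter-point criterion is lower.
bisect_min <- function(gcv, lo, hi, depth = 10) {
  for (d in 1:depth) {
    mid   <- (lo + hi) / 2
    left  <- (lo + mid) / 2
    right <- (mid + hi) / 2
    if (gcv(left) < gcv(right)) hi <- mid else lo <- mid
  }
  (lo + hi) / 2
}
gcv <- function(lambda) (lambda - 0.05)^2 + 1  # toy convex criterion
lam <- bisect_min(gcv, lo = 0.001, hi = 0.2)
```

After ldepth = 10 halvings the interval is roughly 1/1000 of its original width, which is why the bisection search is fast; the grid search trades this speed for robustness on multimodal GCV surfaces.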
Object generation requires specification of the unlabeled data. This can be done in two ways: a labeled set of indices can be provided using the L argument, or one can set Y_U = NA. The algorithm fits a supervised kernel smoother if no unlabeled data are specified. To generate the spa object with the simulated moon example, we execute the following commands in R:

R> L <- which(!is.na(dat$y))
R> U <- which(is.na(dat$y))
R> gsup <- spa(dat$y[L], graph = Dij[L, L])
R> gsemi <- spa(dat$y, graph = Dij)

First we consider the supervised kernel smoother fit on only the labeled data. For this fit, the parameter λ was estimated as 0.04. Notice that if no unlabeled data are specified, then parameter estimation defaults to labeled GCV; in addition, the unlabeled data measures have no meaning here. The supervised kernel smoother result is shown as the gray curve in both panels of Figure 3. Notice that this estimate completely misses the moon-like structure in the left panel.
Next, we consider the semi-supervised spa fit on the moon data. The approximate GCV is used to estimate λ as 0.05. The first unlabeled data measure is (1/|L|) tr(D_LL^{-1} D_L) = 0.848. The semi-supervised result corresponds to the black curve in the left panel of Figure 3. Here the use of the unlabeled data produces a classification border that separates the moon-like data clusters.
To view the GCV path for parameter estimation, consider the parm component of the fitted spa object. We see that the approach considered several values of λ before settling on λ = 0.05.

Post fitting
The spa package provides several enhancements for processing a spa object post fitting.
Most are standard generic methods associated with models in general. The main focus is on transductive and inductive prediction. The generics available at this stage are: update, predict, plot, print, dim, coef, fitted, and residuals.
One could classify these observations according to the fitted class probability estimates.
Transductive prediction (updating) is more complex. In this case the new observations carry inherent meaning, and the current classifier should be modified to incorporate them. The object is changed post fitting, so this type of prediction is really an update. The purpose of this update could be to rerun the general SPA on the observations used for training with different settings, or to run the general SPA with new observations (labeled or unlabeled). The key parameters other than the (y, x, G) data for retraining are dat and trans.update. The dat parameter is a list with two components, k = 0 and l = Inf (defaults). The parameter k controls the number of regions into which U is broken, while l controls the regularization penalizing estimates by region (regions farther from the original L receive heavier regularization). To update with a new unlabeled data set generated from the same underlying data mechanism, execute:

R> dat <- rbind(dat, spa.sim(100, 0))
R> gsemi <- update(gsemi, ynew = dat$y, , as.matrix(daisy(dat[, -1])),
+    trans.update = TRUE)

The R contour plot for the new updated border is given in Figure 3 (right).

Concluding remarks
In this paper, an R package spa that implements a semi-supervised framework for fitting semi-parametric models was presented. Its key features are parameter estimation in semi-supervised learning, the distinction between transductive and inductive prediction, and a variety of functions for processing a graph.

A. Large graph considerations
The spa package is designed to process a graph of approximately 5000 nodes or fewer. For graphs constructed from Z data, and for many observed graphs, this is a reasonable restriction. The software can handle larger graphs in batch mode, but is unlikely to work on significantly larger graphs. In this appendix, we provide considerations for processing a graph too large for the current software. We assume that the graph is large, unweighted, and irreducible, and that the labeled set (response set) is small.
For an example of a large graph, suppose that the lattice in Figure 1 grew to be 1000 × 1000. To process a graph of this size, one could simplify the SPA algorithm and perform a generalization of the result in Figure 1. The technical details are somewhat involved, but the idea operates similarly to the example in the figure. To start, partition the graph into vertex subsets, where V_0 are the labeled cases, V_1 are the vertices one hop away from the labeled cases, ..., and V_k are the vertices k hops away. For SPA at iteration r, V_r − V_{r−1} are the vertices to be labeled, V_{r−1} are the vertices labeled in previous rounds, and V_0 contains the observed labeled cases. Denote y = Y_{V_{r−1}} as the vector with the true labels for observations in V_0 and the final estimates from previous rounds for the other cases. From this, one can apply FTF at iteration r, which reduces to: initialize Ŷ_i = 0 for all i ∈ V_r − V_{r−1}, then iteratively fit

    Ŷ_i = (1/D_i) Σ_j W_ij Ŷ_j,    (8)

where D_i = Σ_j W_ij normalizes each case, and W_ij reduces to the indicator that observation j is in the neighborhood of i, since the graph is unweighted. Once this converges, we penalize Ŷ towards the prior with an increasing penalty. For example, let π_1 be the prior for class 1; then we set

    Ŷ_i ← (1 − f_γ(r)) Ŷ_i + f_γ(r) π_1.    (9)

The estimates for class 0 are obtained in a similar manner. These estimates are treated as truth for the next round. The last step is to adjust the labeled estimates as Ŷ_L = Q T_LL Y_L + (I − Q) S_LU Ŷ_U (refer to Section 4.2). The most computationally demanding step is the construction of the vertex subsets, and special care should be taken depending on the graph (Zhan and Noon 1998).
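A minimal from-scratch sketch of this layered scheme is given below. This is an illustration, not package code: the hop layers are found by breadth-first search, each new layer is fit by iterated local averaging in the spirit of the FTF step, and the prior-shrinkage step is omitted for brevity:

```r
# Sketch: label vertices layer by layer outward from the labeled set,
# iterating local neighbor averages within each new layer.
# W: unweighted adjacency matrix; labeled: indices with known y in {0,1}.
layered_fit <- function(W, labeled, y, iters = 20) {
  n <- nrow(W)
  hops <- rep(Inf, n); hops[labeled] <- 0
  frontier <- labeled
  h <- 0
  while (length(frontier) > 0) {   # BFS hop distances from labeled set
    h <- h + 1
    nxt <- which(colSums(W[frontier, , drop = FALSE]) > 0 &
                 is.infinite(hops))
    hops[nxt] <- h
    frontier <- nxt
  }
  yhat <- rep(0, n); yhat[labeled] <- y
  for (r in seq_len(max(hops[is.finite(hops)]))) {
    V_new <- which(hops == r)
    for (it in 1:iters)            # iterate local averages on the new layer
      for (i in V_new) {
        nb <- which(W[i, ] > 0 & hops <= r)  # already-estimated neighbors
        yhat[i] <- mean(yhat[nb])
      }
  }
  yhat
}
# Path graph 1-2-3-4-5 with endpoints labeled 0 and 1
W <- matrix(0, 5, 5)
for (i in 1:4) { W[i, i + 1] <- 1; W[i + 1, i] <- 1 }
yhat <- layered_fit(W, labeled = c(1, 5), y = c(0, 1))
```

Only the rows of W touching the current layer are needed at each round, which is what makes the scheme feasible when W cannot be held in memory as a whole.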
In the above algorithm, (8) uses the Nadaraya-Watson estimator form of kernel regression to generate PCEs, as opposed to the spa package, which uses the matrix S = D^{-1} W. At the convergent solution the two are equivalent. For a small graph the matrix approach is fast, but it requires the entire graph in memory. For a large graph, computation using (8) is more practical, since we only need the observations local to the vertex being estimated (i.e., W_ij is zero for most vertices in the expression).
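The equivalence of the two forms is easy to verify on a small weighted graph; this is a generic illustration rather than package code:

```r
# One smoothing step two ways: the full matrix S = D^-1 W versus the
# per-vertex Nadaraya-Watson sum that only touches local neighbors.
set.seed(1)
W <- matrix(runif(25), 5, 5)
W <- W + t(W); diag(W) <- 0          # symmetric weights, no self-loops
y <- rnorm(5)
S <- diag(1 / rowSums(W)) %*% W      # matrix form: needs all of W in memory
yhat_matrix <- as.vector(S %*% y)
yhat_local  <- sapply(1:5, function(i) sum(W[i, ] * y) / sum(W[i, ]))
```

For a sparse W, the per-vertex form visits only the nonzero entries of row i, so its cost scales with the number of edges rather than n^2.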
For (9), we penalize the estimates towards the class prior to account for observations farther away from the labeled cases. The penalizing function f_γ(r) = (r/(1+r))^γ is a cumulative distribution function. In the spa package we estimate this cumulative distribution function from the data using the parameters k and l in the dat argument, as described in Culp and Michailidis (2008a). Estimating this function directly for a large graph is computationally demanding, but using f_γ(r) in (9) is simpler and likely sufficient. For γ, values too small will
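Reading the penalizing function as f_γ(r) = (r/(1+r))^γ, its CDF-like behavior, increasing in r and bounded below 1, and the shrinkage toward the prior can be checked numerically. The helper names below are ours, not the package's:

```r
# Penalizing function: increases with hop distance r and saturates below 1,
# so rounds farther from the labeled cases lean more on the class prior.
f_gamma <- function(r, gamma) (r / (1 + r))^gamma
# Shrink a class-1 estimate yhat toward the prior pi1
shrink <- function(yhat, r, gamma, pi1) {
  (1 - f_gamma(r, gamma)) * yhat + f_gamma(r, gamma) * pi1
}
p <- f_gamma(0:20, gamma = 2)
```

At r = 0 no shrinkage is applied, while for large r the estimate is pulled almost entirely to π_1, matching the intuition that estimates many hops from any labeled case carry little evidence of their own.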