A statnet Tutorial

The statnet suite of R packages provides a wide range of functionality for the statistical analysis of social networks, including the implementation of exponential-family random graph (ERG) models. In this paper we illustrate some of this functionality through a tutorial analysis of a friendship network of 1,461 adolescents.


Introduction
statnet comprises a suite of R (R Development Core Team 2007) packages for the analysis of social networks and other relational data. Previous papers in this issue describe in detail the functionality of most of these packages and the statistical theory underlying them. In this paper we illustrate some of the functionality of three core packages within the statnet suite (network, sna, and ergm) through a tutorial-based analysis of a friendship network with 1,461 vertices.
Readers of a more methodical nature may wish to examine the rest of the papers in this issue first. For those who do wish to tackle the tutorial early on, and who do not have much familiarity with statistical network models, we suggest that you begin by reading at least the introduction to this issue (Handcock, Hunter, Butts, Goodreau, and Morris 2008). In the tutorial, we assume a basic familiarity with R (including how to obtain further help or documentation for any of the commands we use) and with a variety of concepts and terminology from social network analysis. For newcomers, some of the latter may be understandable from context with the aid of a social network reference text. The statnet packages are distributed under GPL-3 with attribution requirements for the source code and documentation. To obtain license information for all packages, simply type license.statnet(), or go to http://statnetproject.org/attribution/. Please cite the statnet packages when you use them for research that is published or otherwise publicly distributed. Citation information is provided on our Web site at http://statnetproject.org/citation.shtml, and can be obtained by typing citation("statnet"), or for any of the constituent packages by typing citation("<name.of.package>").

Network exploration
In this section, we explore methods for querying our data. We do not cover methods for importing, converting or manipulating network data; for information on these functionalities, please see Butts (2008a).
To begin, make sure that the necessary packages are loaded into your current session (with library("statnet"), or with both library("ergm") and library("sna")), and then load the dataset we will be analyzing, found in the ergm package:

R> data("faux.magnolia.high")

A shorter name will save us much typing effort:

R> fmh <- faux.magnolia.high

fmh is an object of class network; typing the object name at the prompt displays all of the information contained in a network object.
As this output is extensive (and includes the entire network edge list), it is often more useful to look at an abridged version of this information, which may be obtained by typing summary(fmh). Among other things, we see that the network has 1,461 vertices and 974 edges.
Before we begin modeling these data statistically, it is good to have a general handle on their nature; perhaps the easiest way to do so for network data is by visualizing them. Since this is a large, sparse network, it is probably best to leave the isolates out of the plot (using the displayisolates = FALSE argument). Note that the plotting command might take upwards of a minute or two, depending on the speed of your computer. There appears to be one large component and a smattering of many very small components (Figure 1), not to mention all of the components of size 1 that we removed from the visualization. A count of the component size distribution can be obtained with the component.dist function in the sna package. There are four vertex attributes associated with the vertices in this network: "vertex.names", "Race", "Sex", and "Grade". Grade is likely to be a strong determinant of social relations, so let us visualize the network with vertices colored by their grade:

R> plot(fmh, displayisolates = FALSE, vertex.col = "Grade", vertex.cex = 0.7)

Clearly there is a strong tendency for friendships to form within grade (Figure 2), although at the same time the grades themselves do not appear to be cohesive units; rather, each grade seems to consist of a number of subgroups which link together. One might choose to explore other visualization options by typing help(plot.network).
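The component-size count mentioned above can be obtained with the component.dist function in sna; a minimal sketch (its csize element holds the size of each component):

```r
# component.dist() computes the components of the network; its csize
# element gives the size of each component.
fmh.components <- component.dist(fmh)

# Tabulate the number of components of each size.
table(fmh.components$csize)
```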
Network modelers are frequently interested in the degree distribution:

R> fmh.degreedist <- table(degree(fmh, cmode = "indegree"))
R> fmh.degreedist

  0   1   2   3   4   5   6   7   8 
524 403 271 128  85  30  13   5   2 

The cmode argument tells the degree command that since these are undirected data, we do not want to count both the upper and lower triangles of the sociomatrix, but rather only one. We could have chosen "outdegree" instead with identical results; leaving the argument off would yield numbers in the upper row double those seen above. We leave it to the interested reader to split this out by sex, re-express as proportions, and plot. Instead, we turn to another network feature that is commonly of interest, the count of triangles:

R> summary(fmh~triangle)
triangle 
     169 

Instinct tells us that 169 triangles is many more than would be expected by chance in a network of 1,461 actors and 974 edges. We will revisit this in a more rigorous fashion below. Note that one can obtain more than one statistic at a time using this function; simply separate them with the plus symbol (standard formula syntax within R):

R> summary(fmh~edges + triangle)
   edges triangle 
     974      169 

When we visualized the network we saw a preponderance of within-grade edges. Let us display the mixing matrix by grade (i.e., the count of relationships cross-tabulated by the grades of the two actors involved):

R> mixingmatrix(fmh, "Grade")

Note: Marginal totals can be misleading for undirected mixing matrices.

Clearly most of the contacts are concentrated on the diagonal; you may wish to examine the mixing matrices for Race and Sex as well. To see a vector of attribute values for all of the actors for one of these attributes, or a table of the same, use the get.vertex.attribute command, or its shortcut, %v%:

R> gr <- fmh %v% "Grade"
R> table(gr)

For more information on importing, exporting and manipulating network objects, see Butts (2008a) or the documentation for the network package. For more information on performing classical network analysis, see Butts (2008b) or the documentation for the sna package.
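The split-by-sex exercise suggested earlier (re-expressing the degree distribution as proportions within each sex) might be sketched as follows; this is our suggestion, not code from the tutorial:

```r
# Degree of each vertex, counting each undirected edge once,
# and the vertex-level sex attribute.
deg <- degree(fmh, cmode = "indegree")
sex <- fmh %v% "Sex"

# Degree distribution for each sex, re-expressed as row proportions.
deg.by.sex <- prop.table(table(sex, deg), margin = 1)

# Plot the two distributions side by side.
barplot(deg.by.sex, beside = TRUE, legend.text = rownames(deg.by.sex),
        xlab = "Degree", ylab = "Proportion")
```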
Meanwhile, saving your work (for example, with save.image()) is always wise.

Fitting an ERG model
The ergm package allows the user to fit exponential-family random graph (ERG) models to network data sets. These models, also known as p* (p-star) models, are described in great detail in earlier papers in this issue (Hunter et al. 2008b) and elsewhere in the literature (Wasserman and Pattison 1996; Snijders, Pattison, Robins, and Handcock 2006; Robins and Morris 2007). Very briefly, the model class formulates the probability of observing a set of network edges (and non-edges) as

P(Y = y | X) = exp{θ · g(y, X)} / κ(θ),

where Y is the (random) set of relations (edges and non-edges) in a network, y is a particular given set of relations, X is a matrix of attributes for the vertices in that network, g(y, X) is a vector of network statistics, θ is the vector of coefficients, and κ(θ) is a normalizing constant. Equivalently, the model states that the log-odds that any given edge will exist, given the current state of the rest of the network, is

logit[P(Y_ij = 1 | Y_ij^c, X)] = θ · δ[g(y, X)]_ij,

where Y_ij is an actor pair in Y (ordered if Y is directed and unordered if not), Y_ij^c denotes the rest of the network other than the pair Y_ij, and δ[g(y, X)]_ij is the change in g(y, X) when the value of y_ij is toggled from 0 to 1.
Let us begin with the simplest model of interest, a single-parameter model that posits an equal probability for all edges in the network. This model is known in different branches of network science as the Bernoulli model or the Erdős-Rényi model, and it is a natural null model from which to proceed. In the ERG modeling framework, this corresponds to a model whose g(y, X) vector of statistics contains only a single element, the number of edges in the network. The coefficient estimate tells us that if this network had been generated by the posited model, then the log-odds of a tie should equal −6.998 × δ(g(y, X))_ij, which would equal −6.998 for all edges, since the addition of any edge to the network changes g(y, X), the total number of edges in the network, by 1.
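The fitting command that produces this estimate does not appear above; a minimal sketch, using the edges term and the object name model1 that the tutorial uses later:

```r
# Fit the Bernoulli (edges-only) ERG model; this model exhibits
# dyadic independence, so no MCMC estimation is required.
model1 <- ergm(fmh ~ edges)

# Abridged fit: coefficient estimate, standard error, deviance, AIC/BIC.
summary(model1)
```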
The probability that corresponds to this log-odds is exp(−6.998)/(1 + exp(−6.998)) = 0.000913. The summary(fmh) command reveals that the density of the network, that is, the fraction of possible edges that are realized, is indeed 0.000913. This model is sufficiently trivial that the coefficient could have been obtained analytically; this will not remain the case for long.
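The conversion from log-odds to probability can be checked in base R with plogis, the inverse-logit function:

```r
# Inverse logit of the fitted coefficient: exp(x) / (1 + exp(x)).
plogis(-6.998)

# The network's density for comparison: 974 observed edges out of
# choose(1461, 2) possible undirected ties.
974 / choose(1461, 2)
```

Both values agree with the density reported by summary(fmh), up to rounding.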
In addition, the model fit not only tells us the point estimate for the log-odds of a tie, but the standard error of those log-odds as well, thus providing additional information about our level of certainty in the point estimate.
Other components of the output include information about the amount of deviance explained by the model, two standard measures of model fit based on the likelihood (AIC and BIC), and the MCMC standard error (i.e. the additional uncertainty in the coefficient estimates induced by the MCMC estimation procedure). Since this model exhibits dyadic independence, MCMC estimation is not necessary, and the MCMC standard error does not apply. For more information on all of these concepts, see the preceding papers in this issue, especially Hunter et al. (2008b).
There is much more to an ergm object than that which is shown in the summary:

R> names(model1)
 [1] "coef"               "sample"             "iterations"         "mle.lik"
 [5] "MCMCtheta"          "loglikelihoodratio" "gradient"           "hessian"
 [9] "covar"              "samplesize"         "failure"            "mc.se"
[13] "glm"                "glm.null"           "null.deviance"      "aic"
[17] "theta1"             "offset"             "drop"               "network"
[21] "newnetwork"         "formula"            "constraints"        "prop.weights"

The user may find out more about each of these components by examining the "Value" section under help(ergm). Those which one is typically most interested in extracting for analysis and model comparison include the coefficients and the likelihood. Since we have all lived through adolescence, we know full well that the network of high school friendships is not simply random. How can we explore alternative hypotheses about the structure that exists and the underlying processes that generated it? This is precisely the strength of the ERG modeling framework.
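Extracting those two components might look like the following sketch (the component names are taken from the names(model1) output above):

```r
# Maximum likelihood coefficient estimate(s) for the fitted model.
model1$coef

# Approximate maximized log-likelihood, useful for model comparison
# (e.g., likelihood ratio tests).
model1$mle.lik
```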
A good model with which to begin is one that proposes a tendency for assortative mixingthat is, a greater probability of individuals forming edges with others of the same race, sex, or grade as themselves. This model is a convenient one with which to extend our investigations since it exhibits dyadic independence. As explained in Hunter et al. (2008b), such models are generally easier to fit than others, and the fitting process itself is not usually affected if the model is inappropriate for the data. In this case, it is also a very reasonable hypothesis to propose.
Let us test for a tendency towards assortative mixing that is uniform within each attribute class, i.e., the same tendency for within-race edges (for example) regardless of which race one is talking about. This model can be fit using the nodematch term, with the default argument of diff = FALSE:

R> model2 <- ergm(fmh~edges + nodematch("Grade") + nodematch("Race") +
+     nodematch("Sex"))
R> summary(model2)

We see that all four terms are significant, and we also see a dramatic increase in model likelihood relative to model 1. One could conduct a formal likelihood ratio test between models 1 and 2 if one wished. Note that in practice, if one is including nodematch terms in a model, one would typically also include nodefactor terms for the same attributes (since the latter capture "main effects" of attributes whereas the former can be thought of as capturing "interaction" or "second-order effects" of attributes).
One can interpret the coefficients of this model in terms of the log-odds of different types of ties: the log-odds of a tie that is completely heterogeneous (the two members differ from each other in race, sex, and grade) is −10.01; the log-odds of a tie that is homogeneous by race only is −8.82 (= −10.01 + 1.20, with rounding error); one that is homogeneous in all three attributes is −4.70 (= −10.01 + 3.23 + 1.20 + 0.88), etc.
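These sums can be computed directly from the fitted coefficients; a sketch, assuming the coefficient names that ergm typically assigns to the terms in the model 2 formula:

```r
cf <- model2$coef

# Log-odds of a completely heterogeneous tie: edges term only.
cf["edges"]

# Log-odds of a tie homogeneous by race only.
cf["edges"] + cf["nodematch.Race"]

# Log-odds of a tie homogeneous in grade, race, and sex.
cf["edges"] + cf["nodematch.Grade"] + cf["nodematch.Race"] + cf["nodematch.Sex"]
```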
Let us delve a little deeper to see what this model does and does not capture about the structure in our original data. To do this, we use the model fit that we have just generated to simulate new networks at random, and consider how these are similar to or different from our data. Because this and many of the following commands rely on pseudo-random number generation, we use the seed argument throughout the article to seed the random number generator. This makes the output exactly reproducible, though we have noticed that for certain commands, the output sometimes differs on certain platforms (we used R version 2.6.2 for Windows to obtain this output). Simulating networks from an ergm object is done via the simulate command:

R> sim2 <- simulate(model2, burnin = 1e+6, verbose = TRUE, seed = 9)
Starting 1 MCMC iteration of 1e+06 steps.
Here we have changed the burn-in (the number of steps in the simulation chain before the simulated network is drawn) from the default of 1000. When simulating a single network this allows the chain a chance to move away from the starting network so that the output is approximately independent of initial conditions. Using verbose = TRUE allows the user to see what percentage of the million proposals were accepted, and thus to check how far the simulation was capable of moving. For the above example, approximately 33%, or 330,000 proposals, were accepted, allowing plenty of opportunity for a network of only 974 edges to move away from its initial conditions. (The interested reader might wish to try the same simulate command, but accept the default burn-in of 1000, and see how the resulting network compares to the data).
The default for the simulate command is to simulate a single network, and thus the object returned from the above command is of class network. You may examine the mixing matrices for the attributes in this network and in the data to see how they compare:

R> mixingmatrix(sim2, "Race")

Figure 3: Degree distribution, data vs. model 2.
How are the matrices above similar to, and different from, one another?
Let us now compare our model fit to our data on another feature of great interest to many network modelers: the degree distribution.
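A comparison along the lines of Figure 3 might be sketched as follows (our suggestion, using the degree and table commands shown earlier, not the authors' exact code):

```r
# Degree distributions of the observed network and the model 2
# simulation, counting each undirected edge once.
fmh.deg  <- table(degree(fmh,  cmode = "indegree"))
sim2.deg <- table(degree(sim2, cmode = "indegree"))

# Side-by-side inspection of the two distributions.
fmh.deg
sim2.deg
```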
How do these two networks compare with regard to triangles?
R> c(fmh = summary(fmh~triangle), sim2 = summary(sim2~triangle))

 fmh.triangle sim2.triangle 
          169             2 

Clearly, this model does not capture the process that generated the observed level of triangles. Perhaps, then, one might wish to fit a simple triangle model, with only edge and triangle terms. Or, one might wish to add a triangle term to model 2, considering both triangles and assortative mixing together, since it is fairly clear that assortative mixing is at work. Or, one might wish to explore the classic homogeneous Markov model (edges, triangle, and k-star terms). Unfortunately, all of these models are degenerate for this network, in the sense used by Handcock (2003a,b). That is, they are such poor descriptions of the underlying process (the observed network is so unlikely even under the maximum likelihood coefficients) that the model cannot even be properly estimated.

Identifying model degeneracy
To observe how exactly degeneracy plays out in practice, and how to diagnose it with your own model estimation, let us work through one example. Let us first fit the triangle model with all of the defaults for ergm (and a value of seed to make the output reproducible):

R> model3 <- ergm(fmh~edges + triangle, seed = 99)
Iteration 1 of at most 3: the log-likelihood improved by 1.373
Iteration 2 of at most 3: the log-likelihood improved by 0.9386
Iteration 3 of at most 3: the log-likelihood improved by 2.837
...
This model is a dyadic dependence model, so the fitting algorithm draws on MCMC and is thus stochastic.
Let us see what the initial and final coefficients look like:

R> model3
(coefficient output abridged)

From this output, things might seem relatively good: the model yielded coefficient estimates that were slightly different from the starting values, but not wildly so. Moreover, the improvements in the log-likelihood values are reasonably low numbers. (A number approaching 20.0, the point at which the algorithm cuts off further estimation, usually indicates problems with convergence.) It is good if the successive changes in log-likelihood become small (e.g., less than 1) and are generally decreasing; here the third improvement is larger than the second, so we should look more closely. Let us do so by examining the diagnostics for the MCMC model fitting process:

R> pdf("model3diagnostics.pdf")
R> mcmc.diagnostics(model3)
R> dev.off()

This requires the coda package (Plummer, Best, Cowles, and Vines 2007) to be installed from CRAN. The mcmc.diagnostics function creates a PDF file in your working directory that should resemble Figure 4 (along with some text output, which we omit here). The plots tell the user what is happening to the model statistics during the last iteration of the MCMC estimation procedure, in this case iteration 3, since the default for maxit in ergm is 3. The left-hand plots represent the chain as one time series for each model statistic, while the right-hand side summarizes this chain in a histogram. Both are normalized to the observed data, represented by 0. In a converged model, these statistics will vary stochastically around the mean, but will not trend steadily away from it, as they do here. This tells us that our initial estimates for the coefficient values generate networks with many more triangles than actually observed, which will make maximum likelihood estimation impossible. This is good evidence that one has a degenerate model.
How else might one detect this? Some of the feedback given in verbose mode also helps (this output, which we omit here, is produced during the third of three iterations):

R> model3 <- ergm(fmh~edges + triangle, verbose = TRUE, seed = 8)

This tells us that in this particular simulation the mean for the edge statistic was off by an average of 468.7, and the triangle statistic by 784.1. Even the smallest simulated edge and triangle statistics were 78 and 144 larger than the observed statistics, respectively. Given the high instability in this model fit, repetitions of this command may give results that look considerably different. These are all further evidence of degeneracy. Another means for diagnosis that uses the same principle is simply to simulate from the final coefficient estimates and compare the simulated network to the observed for the statistics in the model.
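That last approach, simulating from the final coefficient estimates and comparing statistics, might look like the following sketch (the seed is arbitrary; the simulate usage follows the earlier example):

```r
# Draw one network from the suspect model 3 fit and compare the
# model statistics to their observed values.
sim3 <- simulate(model3, burnin = 1e+6, seed = 8)
rbind(observed  = summary(fmh  ~ edges + triangle),
      simulated = summary(sim3 ~ edges + triangle))
```

For a degenerate model, the simulated row will typically be wildly larger than the observed one.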
On the flip side, one must be careful not to assume model degeneracy too quickly, either. The problems with convergence just described can in some cases arise not because the model is a degenerate one, but simply because the Markov chain has not been allowed to run long enough to cover the sample space sufficiently. This is especially true if the starting values for the estimation are very far from the true MLE. Before one settles on a diagnosis of degeneracy, it is best to explore some of the following options. For this particular model, none of these will help greatly, but it is still an informative exercise.
One straightforward approach is to increase the MCMC sample size, or the interval, or both:

R> model3.take2 <- ergm(fmh~edges + triangle, MCMCsamplesize = 1e+5,
+     interval = 1000, verbose = TRUE, seed = 88)

The output should contain the following comment:

The network has more than 50000 edges, and the model is likely to be degenerate.
which is a very clear sign that model convergence has failed.
Another approach is to decrease the step length, increase the maximum number of iterations through the simulation-estimation cycle, or both:

R> model3.take3 <- ergm(fmh~edges + triangle, maxit = 25, seed = 888,
+     control = control.ergm(steplength = 0.25), verbose = TRUE)

In this case, the model eventually returns a similar warning as for model3.take2, starting at iteration 4. Note that in the first three iterations, the algorithm seems to stabilize at a set of coefficient estimates around −7.3 and 4.5. However, this stability is again a ruse: examining the MCMC diagnostics and/or simulating from these intermediate coefficient values will reveal that this model has indeed failed to converge.
A final option is to consider a different proposal type for the MCMC, which may introduce some constraints on the estimation process. The default MCMC proposal in ergm has no constraints, but does weight the selection of ties (y_ij = 1) and non-ties (y_ij = 0) so as to ensure that each class is selected with overall probability 0.5 regardless of network density. In statnet, this proposal is called TNT (i.e., "tie/no-tie"). However, one might choose a proposal type that constrains the sample space, e.g., one that fixes the total number of edges or the nodal degrees. These can be selected by adding the argument constraints = ~edges or constraints = ~degrees to the ergm command, respectively. Doing so should be based on a theoretical belief that this constraint appropriately captures the stochastic process behind the network; if not, it may simply mask degeneracy rather than overcome it. Other options for constraints or weights for the MCMC proposals are described in Morris et al. (2008) or can be found in the help for the ergm command.

Non-degenerate dyadic dependence models
Let us explore clustering using one of the more recently proposed statistics that is closely related to the triangle, but has proven more robust to degeneracy in practice. This is the GWESP (geometrically weighted edgewise shared partner) statistic; for more information on it, and on the equivalent "alternating k-triangle" statistic, see Snijders et al. (2006), Robins, Snijders, Wang, Handcock, and Pattison (2007b), or Hunter (2007). Since we are fairly confident that the assortative mixing processes are real, we shall add this new term to model 2. The GWESP statistic has one scaling parameter (α) as well as the usual coefficient.
Although ergm contains functionality to fit the MLE of both parameters simultaneously, this generally requires much longer computation time to converge. For this tutorial we will keep things simpler and explore a number of models with fixed α.
When the scaling parameter α equals 0, the statistic is equivalent to the number of edges that are in at least one triangle; as α approaches infinity, the statistic approaches three times the number of triangles. The lower the value of α, the less likely the model is to be degenerate, so we may wish to start close to zero and move up, looking to see how well we do in matching the count of triangles:

R> model4.take1 <- ergm(fmh~edges + nodematch("Grade") +
+     nodematch("Race") + nodematch("Sex") + gwesp(0, fixed = TRUE),
+     MCMCsamplesize = 1e+5, maxit = 15, verbose = TRUE,
+     control = control.ergm(steplength = 0.25), seed = 123)

Note that these and subsequent model estimation runs may take upwards of 15-20 minutes each. With verbose set to TRUE, you should receive occasional feedback from R about the progress of the algorithm, although at other times the program may appear unresponsive, depending on your operating system.
How well did the fitting algorithm do? How do the diagnostic plots look? If you simulate a single network from this model, how many edges and triangles does it contain?
Interpreting the coefficients for the GWESP model is much more complex than for the previous dyadic independence models. As an example, imagine a pair of actors who differ from each other on all attributes. If they have no friends in common, then the log-odds of them becoming friends is −9.83. If they have any positive number of friends in common, and each of them is in at least one other triangle with each of those friends, then the log-odds of them becoming friends rises to −8.03 (= −9.83 + 1.80). If α had a positive value (unlike this case, where it equals 0), then the log-odds would increase monotonically (but sub-linearly) with the number of friends they had in common. If some of the two actors' common friends were not already in other triangles with each of them, then the log-odds rises even more: in the case of α = 0, by an additional 1.80 for each edge that is not in any triangle but that enters one when the two actors become friends. Now try using the GWESP term with some different values for the α parameter, such as 0.1 and 0.2:

R> model4.take2 <- ergm(fmh~edges + nodematch("Grade") +
+     nodematch("Race") + nodematch("Sex") + gwesp(0.1, fixed = TRUE),
+     MCMCsamplesize = 1e+5, maxit = 15, verbose = TRUE,
+     control = control.ergm(steplength = 0.25), seed = 123)
R> model4.take3 <- ergm(fmh~edges + nodematch("Grade") +
+     nodematch("Race") + nodematch("Sex") + gwesp(0.2, fixed = TRUE),
+     MCMCsamplesize = 1e+5, maxit = 15, verbose = TRUE,
+     control = control.ergm(steplength = 0.25), seed = 123)

Notice that the only difference among the last three models is the value of the α argument to the gwesp term. Try repeating the above command for progressively higher values of α, until the model log-likelihood ceases to improve or you get bored.
You can compare the approximate maximized values of the log-likelihood function for each of the last three models via:

R> c(model4.take1$mle.lik, model4.take2$mle.lik, model4.take3$mle.lik)
[1] -5604.181 -5549.739 -5502.275

Or, you may wish to simulate one or more networks from each and compare their triangle counts to the original.
You should also examine the estimated coefficients for the three different models by typing summary(model4.take1), etc. Note that no matter what value is used for α within this range, all of the terms are significant and the coefficients do not change all that much. We select the version with an α of 0.2 to continue; you should feel free to pick whichever you wish. Relative to model 2, the log-odds of the various tie types have increased (i.e., become less negative): a homogeneous tie that closes a triangle is now more likely than one that does not close a triangle, while each is more likely than its heterogeneous counterparts.

Goodness of fit
Comparing a single outcome from the simulation to the original is of limited value. The logical next step, now that we have a model that may be starting to do a reasonable job of capturing our original network, is to simulate many networks and compare the full distribution of our statistic(s) of interest to that from the observed data. Let us continue with model 4, looking to see whether it does a better job of capturing the triangle count than model 2 did earlier.
Notice that the number of steps accepted remains right at 33% or so throughout the chain of total length 10^7, another good sign that our model is not degenerate. Since we asked to simulate more than one network, sim4 is of a different object class than sim2 was.
R> hist(model4.tridist)
R> fmh.tri <- summary(fmh~triangle)
R> arrows(fmh.tri, 20, fmh.tri, 5, col = "red", lwd = 3)

You should see something like Figure 5. Next, you can check how many simulations were at least as extreme in the upper tail as our observed value, marked by the red arrow in Figure 5, using the command:

R> sum(fmh.tri <= model4.tridist)
[1] 1

In our simulation, the answer was 1 out of 100; our observed data evidently lie outside the central 95% of the simulated distribution, although model 4 is still a major improvement over model 2 with regard to triangles.
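The simulation that produced sim4 and the triangle-count vector model4.tridist is not shown above; one plausible sketch, assuming the multi-network simulate output stores its draws in a networks element (argument values here are illustrative, not the authors' exact call):

```r
# Simulate 100 networks from the chosen model, with a large interval
# between draws so that successive networks are nearly independent.
sim4 <- simulate(model4.take3, nsim = 100, burnin = 1e+5,
                 interval = 1e+5, seed = 99)

# Count the triangles in each simulated network.
model4.tridist <- sapply(sim4$networks, function(net) summary(net ~ triangle))
```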
To make this process easier on the user, we have developed code to conduct all of the above stages in the goodness-of-fit evaluation automatically for some common network distributions; these include the degree distribution, the distribution of edgewise shared partners (the number of common neighbors of the two endpoints of each edge), and the geodesic distance distribution. Note that without including information on degree directly, this model captures the degree distribution for this network quite well.
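This automated evaluation is performed by the gof function in ergm; a sketch of one such call (the specific arguments here are illustrative):

```r
# Goodness-of-fit: simulate repeatedly from the fitted model and compare
# the simulated degree, edgewise shared partner, and geodesic distance
# distributions to the observed ones.
gof4.esp.dist <- gof(model4.take3 ~ degree + espartners + distance,
                     interval = 1e+5, seed = 99)
plot(gof4.esp.dist)
```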
Do be sure to use a sufficiently large interval in the gof command; otherwise one might obtain a plot suggesting that the model fits well, when in fact the chain simply did not have enough time to move away from the original network. Here again we could see that the chain accepted about 33% of proposals, with 100,000 proposals between each run, meaning that consecutive samples differed by around 33,000 toggles.
Note that since the simulated networks themselves are so large, they are not saved in the gof object by default; only those network statistics included in the formula for the gof call are. In practice it is often a good idea to include all three distributions (degree, edgewise shared partner, and geodesic distance) in one's formula even when one's primary interest is in a subset, if there is any chance one might be interested in the others later; this will avoid the need to regenerate the simulated networks.
We plot the edgewise shared partner and geodesic distributions on the log-odds scale (Figure 7), as this makes the latter in particular easier to visualize:

R> get(getOption("device"))(width = 8, height = 4)
R> par(mfrow = c(1, 2))
R> plot(gof4.esp.dist, plotlogodds = TRUE)

Figure 7: Model 4 goodness of fit for edgewise shared partner and geodesic distance.
Note that the first line of this code generates a graphics window of specified size regardless of operating system. The left-hand graph plots the edgewise shared partner distribution. Since the model includes the GWESP statistic, which is a parametric form of the edgewise shared partner distribution, one would hope that we are able to capture this distribution at least approximately. We seem to have done so, although our model overestimates the number of edges with one shared partner and underestimates the number with two or more; it still does not capture the full amount of clustering.
On the other hand, we have included nothing about distance in our model, and distance is a very global property of the network. How well do the observed and simulated distributions match? What is the nature of the disparities? Can you think of ways to expand upon your model that might lead to a better global fit? We leave it to the reader to continue exploring different models to match the observed data as closely as possible. For a full list of model terms available for this process, see Morris et al. (2008). Once you are done, see the Appendix to find out the actual model that was used to generate the network.

Next steps
If this tutorial has left you interested in learning about the remaining functionality of the statnet suite of packages, we encourage you to read the remaining papers in this issue, and the documentation available within each package. Also be aware that statnet is under continual development; users interested in keeping abreast of new features are encouraged to check the statnet homepage http://statnetproject.org, where they are also encouraged to sign up for the statnet mailing list.
Acknowledgments

This work was supported by grants from the National Institutes of Health (R01-HD041877, R01-DA012831). DRH received additional funding from Le Studium, an agency of the Centre National de la Recherche Scientifique of France, and from NIH grant R01-GM083603-01.