Nonparametric inference for continuous-time event counting and link-based dynamic network models

A flexible approach for modeling both dynamic event counting and dynamic link-based networks, based on counting processes, is proposed, and estimation in these models is studied. We consider nonparametric likelihood-based estimation of parameter functions via kernel smoothing. The asymptotic behavior of these estimators is rigorously analyzed, allowing the number of nodes to tend to infinity. The finite sample performance of the estimators is illustrated through an empirical analysis of bike share data.


Introduction
In this paper we present a modeling approach that can be applied both to dynamic interactions in networks and to dynamic link deletion and addition in networks. The case of dynamic interactions treats a network as a collection of actors who can cause instantaneous interactions. Both directed and undirected interactions are considered. In the model we assume that the time at which an interaction happens, and the pair of actors involved in this interaction, are random, and we model the joint distribution of these quantities as depending on parameters that may change over time. We call this model a dynamic network interaction model. The Enron e-mail data set provides a typical example of a data set that can be modeled in this way. Here one person sending an e-mail to another person is interpreted as the interaction. While such interactions can be thought of as edges between two nodes, and while the nodes themselves persist over time, each such edge exists only for an infinitesimal time.
In contrast to interaction events, we also consider models for connections between the actors of a network that persist over a longer time period. In this case, a connection comprises four quantities: sender and receiver (a pair of actors), and the starting and ending time of the connection. Again we allow the parameters of their distributions to depend on time. Here we speak of dynamic networks. Examples of such models are social networks with links indicating an ongoing friendship between two actors. The notions of dynamic network interactions and of dynamic networks are different, but they are closely related. A dynamic network defines two network interaction models, one given by the starting times and one given by the ending times of connections. Conversely, a network interaction model of e-mails defines a dynamic network by aggregation: the network shows a connection between two actors as long as they have exchanged e-mails over a certain time period in the past.
In this paper we will make use of counting process models from survival analysis that we adapt for our network models. We will develop asymptotic theory for a kernel-based estimator of local parameters in the models. The estimator is based on a localized likelihood criterion.

Literature Review and Related Work
Random networks/graphs have been considered in various scientific branches for a long time, in particular in the social sciences (cf. the textbook by Wasserman and Faust, 1994). The importance of the analysis of random networks within the statistical and machine learning literature is more recent, but the corresponding literature is significant by now (see, e.g., Kolaczyk, 2009, or Goldenberg et al., 2010). The reason for this increase in significance is not least the development of modern technologies that leads to an ever increasing number of complex data sets encoding relational structures. Examples of real network data can be found at the Koblenz Network Collection (KONECT), the European network data collection SocioPatterns, the MIT-based collection Reality Commons, the data sets made available by the Max Planck Institute for Software Systems (MPI-SWS), or the Stanford Large Network Dataset Collection (SNAP). In this paper, we will use the Capital Bikeshare Performance Data (see http://www.capitalbikeshare.com/system-data), which is a data set collected on the Washington, DC bikeshare system.
Modeling and analyzing dynamic random networks is challenging, as networks can have a multitude of different topological properties. Such properties are measured by various quantities, including the flow through the network, the degree distribution, centrality, the existence of hubs, sparsity, etc. Time-varying or dynamic random networks appear quite naturally, even though the dynamic aspect significantly adds to the complexity of modeling and analyzing the networks. Early work on networks involving temporal structures can already be found in Katz and Proctor (1959), who consider a discrete-time Markov process for friendships (links). Other relevant literature using discrete-time settings includes work on dynamic exponential random graph models (Xu, 2015), dynamic nodal state models (Kolar and Xing, 2009, Kolar et al., 2010), dynamic latent feature models (Foulds et al., 2011), dynamic multi-group membership models (Kim and Leskovec, 2013), dynamic latent space models (Durante and Dunson, 2014, Sewell and Chen, 2015), and dynamic Gaussian graphical models (Zhou et al., 2008, Kolar and Xing, 2012). Continuous-time models have also been discussed in the literature. They include link-based continuous-time Markov processes (Wasserman, 1980, Leenders, 1995, Lee and Priebe, 2011), actor-based continuous-time Markov processes (Snijders, 1996, 2001, Snijders et al., 2010), and models based on counting processes as in Perry and Wolfe (2013), who consider the modeling of network interaction data, and Butts (2008), who applies such a model to radio communication data. Link prediction, a problem related to the analysis of dynamic networks, has received considerable attention in the computer science literature (see, e.g., Liben-Nowell and Kleinberg, 2007, or Backstrom and Leskovec, 2011).

Our Work
We begin by making the above informal descriptions precise. Let V_n := {1, ..., n} denote the set of actors (also called agents or nodes), and let L_n ⊆ {(i, j) : i < j, i, j ∈ V_n} denote the set of all possible links among them. (For directed networks, L_n is the set of all ordered pairs.) Furthermore, let G_n := (V_n, L_n) be the corresponding graph. For each pair of actors (i, j) ∈ L_n we denote by N_{n,ij} : [0, T] → ℕ the process counting the number of interactions between these two actors, i.e., N_{n,ij}(t) = #{interaction events between i and j before or at time t}.
We assume that, for (i, j) ∈ L_n, the processes N_{n,ij} are one-dimensional counting processes with respect to an increasing, right continuous, complete filtration F_t, t ∈ [0, T], i.e., the filtration obeys the usual conditions ("conditions habituelles"; see Andersen et al., 1993, pp. 60). The σ-field F_t contains all information available up to the time point t. For simplicity we formulate all results for undirected interactions only, i.e., we assume that N_{n,ij} = N_{n,ji} for all pairs (i, j) ∈ L_n. All results can be formulated for the directed case as well (see also the discussion in Section 2.3 after Assumption (A1)).
In our approach, we model the intensities of the counting processes N n,ij at time t only for a subset of edges {(i, j) ∈ L n : C n,ij (t) = 1}. The functions C n,ij (t) ∈ {0, 1} are indicator functions assumed to be predictable with respect to the filtration F t , and they determine the active part of the network. Our aim is to model the active part. The definition of the active part depends on the application. For instance, modeling the edges between actors i and j might depend on whether i and j have had a low or a high interaction intensity in the past. We will come back to this point later.
For the set {(i, j) ∈ L_n : C_{n,ij}(t) = 1}, the intensities of the counting processes N_{n,ij} are modeled by λ_{n,ij}(θ, t) := Φ(θ(t); (C_{n,ij}(s), X_{n,ij}(s))_{i,j=1,...,n} : s ≤ t), where X_{n,ij} : [0, T] → R^q are F_t-predictable covariates, Φ is a link function, and the parameter function θ : [0, T] → R^q is the target of our estimation method. The approach presented above is quite flexible and general. In order to be more specific, and for modeling reasons explained in Section 2.1, we will in the following assume that Φ has the Cox-type form

Φ(θ(t); (C_{n,ij}(s), X_{n,ij}(s))_{i,j=1,...,n} : s ≤ t) = C_{n,ij}(t) exp(θ^T(t) X_{n,ij}(t)).   (1.1)

Butts (2008) used a similar modeling framework in an empirical analysis of radio communication data from responders to the World Trade Center disaster. Another related model can be found in Leenders (1995). Perry and Wolfe (2013) studied a model similar to (1.1) with constant parameters. In their specification the intensity equals λ(t) exp(θ^T X_{n,ij}(t)), where λ(t) is an unknown baseline hazard. They developed asymptotic theory for maximum partial likelihood estimators of θ in an asymptotic framework where the time horizon T converges to infinity. Our work was motivated by their research, but it differs in several respects. First of all, we allow the parameters to change over time. Our estimates of the parameter functions can be used for statistical inference on time changes in the effects of covariates. Furthermore, by choosing the first component X^(1)_{n,ij} ≡ 1, our model includes the time-varying baseline intensity e^{θ^(1)(t)}. Thus, in contrast to Perry and Wolfe (2013), we propose a fit of the full specification of the intensity function, including all parameters and the baseline intensity. Furthermore, our aim is to model large networks, whereas Perry and Wolfe (2013) considered relatively small networks over a long time period T. Thus, in our asymptotics we let the number of actors converge to infinity instead of T. We will argue below (at the end of Section 2.1 and after Assumption (A6) in Section 2.3) that appropriate choices of the censoring factor C_{n,ij}(t) allow for modeling large networks with node degrees that are relatively small compared to the size of the network.
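As a minimal numerical illustration of the Cox-type form (1.1), the intensity of a single pair at a single time point can be evaluated as follows; the function name and array layout are our own choices, not part of the paper.

```python
import numpy as np

def cox_intensity(theta, X, C):
    """Cox-type intensity lambda_{n,ij}(theta, t) = C_{n,ij}(t) * exp(theta^T X_{n,ij}(t)).

    theta : (q,) parameter vector theta(t)
    X     : (q,) covariate vector X_{n,ij}(t)
    C     : 0/1 selector C_{n,ij}(t) marking the active part of the network
    """
    return C * np.exp(theta @ X)

# Inactive pairs (C = 0) get intensity 0; active pairs get exp(theta^T X).
```

A positive component theta[k] raises the intensity multiplicatively in X[k], which is exactly the interpretation of the parameters discussed in Section 2.1.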
Despite the strong interest in dynamic models for networks, rigorous statistical analyses of the corresponding estimators (asymptotic distribution theory) are relatively sparse, in particular in the case of time-varying parameters, as considered here. The temporal models in the literature are usually Markovian in nature. In contrast, our continuous-time model based on counting processes allows for non-Markovian structures (i.e., dependence on the infinite past). This increases flexibility in the modeling of the temporal dynamics. Our model also allows the network size to change over time without the networks degenerating in the limit. Moreover, we present a rigorous analysis of the distributional asymptotic properties of the corresponding maximum likelihood estimators. To the best of our knowledge, no such analysis can be found in the literature, even for the simpler models indicated above.
In Section 2, we discuss our model (Section 2.1), define our likelihood-based estimators, and present our main result on the pointwise asymptotic normality of these estimators (Section 2.2). In Section 3, we demonstrate the flexibility of our approach by presenting an analysis of the Capital Bikeshare data. The proof of our main result is deferred to Section 4. The appendix contains additional simulations in which we compare network characteristics, such as degree distributions, cluster coefficients, and diameters, of the observed network with networks drawn from the fitted model. Moreover, we discuss data-adaptive bandwidth choices in the appendix.
2 Link-based dynamic models

2.1 Link-based dynamic models with constant parameters

We will first discuss the model described in Section 1.2 with general link function Φ and constant parameter function θ ≡ θ_0. Andersen et al. (1993) show that the log-likelihood for the parameter θ takes the form

ℓ_n(θ) = Σ_{(i,j) ∈ L_n} [ ∫_0^T log λ_{n,ij}(θ, t) dN_{n,ij}(t) − ∫_0^T λ_{n,ij}(θ, t) dt ],   (2.1)

where ∆N_{n,ij}(t) := N_{n,ij}(t) − N_{n,ij}(t−) (with N_{n,ij}(t−) := lim_{δ→0, δ>0} N_{n,ij}(t − δ)) is the jump height (either 0 or 1) of N_{n,ij} at t, so that the first integral is a sum of log λ_{n,ij}(θ, t) over the jump times of N_{n,ij}. Hence, we obtain the maximum likelihood estimator as θ̂ := argmax_{θ ∈ Θ} ℓ_n(θ), where Θ denotes the range in which the true parameter is located. The choice of Φ as in (1.1) allows for an easy interpretation of the parameters: The intensity has the form λ_{n,ij}(θ, t) = C_{n,ij}(t) exp(Σ_{k=1}^q θ_k X^(k)_{n,ij}(t)), where X^(k)_{n,ij}(t) denotes the k-th component of X_{n,ij}(t). Hence, θ_k quantifies the impact of X^(k)_{n,ij}(t) on the intensity, given that the remaining covariate vector stays the same.
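The log-likelihood above can be sketched on a discrete time grid; the array layout, the grid discretization, and the function name below are our own illustration, not the paper's specification.

```python
import numpy as np

def log_likelihood(theta, events, X, C, dt):
    """Discretized log-likelihood for the constant-parameter Cox-type model,

        l(theta) = sum_{i<j} [ int log(lambda) dN - int lambda dt ],

    with lambda_{ij}(t) = C_{ij}(t) * exp(theta^T X_{ij}(t)).

    events : (P, T) 0/1 array, events[p, t] = jump of pair p in grid cell t
    X      : (P, T, q) covariates; C : (P, T) selectors; dt : grid width
    """
    lin = np.einsum('k,ptk->pt', theta, X)      # theta^T X_{ij}(t)
    lam = C * np.exp(lin)                       # intensities lambda_{ij}(t)
    # log-intensity only matters where an event occurred (dN = 1)
    log_lam = np.where(events == 1, np.log(np.maximum(lam, 1e-300)), 0.0)
    return (events * log_lam).sum() - (lam * dt).sum()
```

For q = 1 and X ≡ 1, C ≡ 1 this reduces to the homogeneous Poisson likelihood, whose maximizer is θ̂ = log(number of events / total exposure), which gives a quick sanity check of the implementation.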
The presence of the function C_{n,ij}(t) enhances the modeling flexibility significantly. By choosing C_{n,ij}, the researcher who applies the model is able to fit the model only to a sub-network. This becomes necessary when it is natural to assume that certain pairs of actors behave fundamentally differently from others. For instance, consider a social media network, and contrast pairs exchanging messages regularly, having a big impact on each other's activity in the network, with pairs living on different continents who never had any interaction or indirect relation via related friendship sub-networks. It is intuitive that these two pairs cannot be modeled accurately by the same model. It would instead be advantageous to ignore interactions between pairs of people who have been totally separated in the recent past when we want to estimate the parameters for the behavior of people living nearby. On the other hand, the interaction intensity is of course dynamic, and thus different pairs might be included over time. This is achieved by the presence of the selector variables C_{n,ij}(t). Also note that C_{n,ij}(t) = 0 for t ∈ [a, b] does not necessarily mean that there will be no interactions between i and j in [a, b]; it rather means that interactions which happen between i and j in [a, b] are not fitted by our model. Two things should be noted about these selectors: Firstly, choosing (C_{n,ij})_{i,j} too conservatively is not a problem in the sense that we still estimate the 'correct' parameters. For instance, suppose that C_{n,ij}(t) and θ are the correct quantities to be used in the model (1.1). Assume now that C*_{n,ij} is a predictable selector that is more conservative than C_{n,ij}, i.e., C_{n,ij}(t) = 0 implies C*_{n,ij}(t) = 0. Then, the observations are given by t ↦ N*_{n,ij}(t) := ∫_0^t C*_{n,ij}(s) dN_{n,ij}(s) for t ∈ [0, T]. Clearly, N*_{n,ij} is a counting process comprising those jumps of N_{n,ij} at which C*_{n,ij} equals 1.
By assumption, C*_{n,ij}(t) C_{n,ij}(t) = C*_{n,ij}(t), and hence the intensity of N*_{n,ij} is given by C*_{n,ij}(t) exp(θ^T(t) X_{n,ij}(t)). Thus, the processes N*_{n,ij} can also be used to estimate θ. On the other hand, using fewer data of course leads to a loss of information, and this might affect the rates of convergence of the parameters (cf. Theorem 2.1). We do not attempt here to determine the best C_{n,ij} in a data-driven way. Instead, in real data applications, and motivated by this discussion, we attempt to choose C_{n,ij} in a way that is not too liberal. This is illustrated in Section 3, where we set C_{n,ij}(t) equal to zero if there was no event between i and j over a certain period (t − δ, t), for some δ > 0, so that our model is only fitted to "active" pairs. For pairs with low activity one may look for a different model. Thus, a proper choice of C_{n,ij}(t) allows one to split the analysis into different regimes.
Secondly, consider the social media example again. We will assume that, with positive probability, links can form at any time 0 < t < T, i.e., P(C_{n,ij}(t) = 1) > 0. However, it is intuitive that, even when more people connect to the platform, one actor will not acquire an infinite number of friends; rather, the number of friends is bounded by some constant κ ∈ ℕ. Then the fraction of the C_{n,ij} that equal 1, relative to all n(n−1)/2 available C_{n,ij} at time t, should be of order 1/n. Hence, in this case it is natural to have P(C_{n,ij}(t) = 1) → 0 as n → ∞. This way, sparsity in the observations can be captured in the model. For the asymptotic result (Theorem 2.1) to hold, the network cannot be too sparse (essentially, an increase in the number of actors must lead to an increase in the number of selected pairs). This will be made precise in the assumptions given in Section 2.3.

Estimation in time-varying coefficient models
In time series applications, it has turned out that flexible and powerful fits can be achieved by letting the parameters depend on time, and this is what we consider here as well. We will use the above model with θ in (1.1) depending on t; in other words, θ = θ(t) is now a parameter function.
An estimator of this parameter function at a given point t_0 can be obtained by maximizing, in θ, the following local likelihood function, which is obtained by localizing the likelihood (2.1) for a constant parameter at time t_0 by means of a kernel K:

ℓ_n(θ, t_0) := Σ_{(i,j) ∈ L_n} ∫_0^T K_h(t − t_0) [ log λ_{n,ij}(θ, t) dN_{n,ij}(t) − λ_{n,ij}(θ, t) dt ],

where K_h(u) := K(u/h)/h, K is a kernel function (positive and integrating to one), and h = h_n is the bandwidth. The corresponding local MLE is defined as

θ̂_n(t_0) := argmax_{θ ∈ Θ} ℓ_n(θ, t_0),

with Θ being the allowed range of the parameter function θ. Recall that we use the following Cox-type form of the intensity:

λ_{n,ij}(θ, t) = C_{n,ij}(t) exp(θ^T X_{n,ij}(t)).   (2.4)

Using the hazard functions from (2.4), the local log-likelihood can be written as

ℓ_n(θ, t_0) = Σ_{(i,j) ∈ L_n} ∫_0^T K_h(t − t_0) [ θ^T X_{n,ij}(t) dN_{n,ij}(t) − C_{n,ij}(t) exp(θ^T X_{n,ij}(t)) dt ].   (2.5)

The maximum likelihood estimator θ̂_n(t_0) studied in this paper is defined as the maximizer of (2.5) over θ. Denote by L_n(t_0) the set of active edges, i.e., the set of all pairs (i, j) such that C_{n,ij}(t_0) = 1, and by |L_n(t_0)| the size of this set. Our main theoretical result, given below, says that, for a given t_0 ∈ (0, T), the maximum likelihood estimator θ̂_n(t_0) exists, is consistent, and is asymptotically normal.
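A minimal sketch of the kernel-localized estimator: a crude grid search over candidate values of a one-dimensional θ, maximizing the kernel-weighted likelihood on a discrete time grid. The Epanechnikov kernel, the grid search, and all names are our own illustration; a real implementation would use a proper optimizer.

```python
import numpy as np

def epanechnikov(u):
    """Kernel supported on [-1, 1], positive, integrating to one."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)

def local_mle(t0, h, times, events, X, C, dt, theta_grid):
    """Local MLE theta_hat(t0): maximize the kernel-localized likelihood

        l(theta, t0) = sum_ij int K_h(t - t0) [ log(lambda) dN - lambda dt ]

    over a grid of candidate thetas.
    times : (T,) grid midpoints; events : (P, T); X : (P, T, q); C : (P, T).
    """
    w = epanechnikov((times - t0) / h) / h          # K_h(t - t0)
    best, best_ll = None, -np.inf
    for theta in theta_grid:
        lam = C * np.exp(np.einsum('k,ptk->pt', np.atleast_1d(theta), X))
        log_lam = np.where(events == 1, np.log(np.maximum(lam, 1e-300)), 0.0)
        ll = (w * (events * log_lam - lam * dt)).sum()
        if ll > best_ll:
            best, best_ll = theta, ll
    return best
```

With X ≡ 1 and C ≡ 1, events arriving at unit rate should give θ̂(t_0) ≈ 0 (a baseline intensity of e^0 = 1), which provides a simple check of the localization.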
To formulate our main result, the following technical assumptions are needed.

Assumptions
Our assumptions do not specify the dynamics of the covariates X_{n,ij}(t) and of the censoring variables C_{n,ij}(t). Instead, we assume that the stochastic behavior of these variables stabilizes as n → ∞. Assumption (A1) is specific to our setting and states our general understanding of the dynamics, while Assumptions (A2), (A3), and (A5) are standard. Assumption (A4) guarantees that the covariates are well behaved; a similar condition can be found in Perry and Wolfe (2013). Finally, (A6) and (A7) specifically describe the dependence situation in our context. They make mathematically precise the idea that, while the network grows, the actors get further and further apart and hence influence each other less and less. In the following, we first state each assumption and then discuss its meaning and the intuition behind it.
(A1) For every n and for any t ∈ [t_0 − h, t_0 + h], the joint distribution of (C_{n,ij}(t), X_{n,ij}(t)) is identical for all pairs (i, j). Furthermore, for any s, t ∈ [t_0 − h, t_0 + h], the conditional distribution of the covariate X_{n,ij}(t), given that C_{n,ij}(s) = 1, has a density f_{s,t}(y) with respect to a measure µ on R^q, and this conditional distribution does not depend on (i, j) and n. We use the shorthand notation f_s for f_{s,s}. Finally, it holds that n → ∞, h → 0, and, with l_n := (n(n−1)/2) P(C_{n,12}(t_0) = 1) → ∞, we have l_n h → ∞ and l_n h^5 = O(1).

The quantity l_n = (n(n−1)/2) · P(C_{n,12}(t_0) = 1) is the effective sample size at time t_0, because n(n−1)/2 is the number of possible links between vertices, of which, on average, we observe the fraction P(C_{n,12}(t_0) = 1). (For directed networks, one simply has to replace n(n−1)/2 by n(n − 1) in the definition of l_n.) With this in mind, the assumptions on the bandwidth are standard. The most restrictive assumption in (A1) is that the conditional distribution of X_{n,ij}(t), given C_{n,ij}(s) = 1, does not depend on (i, j). Observe that this holds if the array (C_{n,ij}, X_{n,ij})_{i,j} is jointly exchangeable in (i, j) for any fixed n. The additional assumption that the conditional distribution of X_{n,ij}(t), given C_{n,ij}(s), does not change with n is not very restrictive, because it is natural to assume that the distribution depends only on the local structure of the network. For instance, if we assume that a fixed vertex i has only a bounded number of close interaction partners j while the network grows, then it is natural to assume that the local structure given by the interacting partners does not undergo major changes as n → ∞ (we discuss after Assumption (A6) in more detail how this scenario is incorporated in the C_{n,ij}). We make this additional assumption mainly to avoid stating lengthy technical conditions allowing us to interchange the order of differentiation and integration at several places in the proof.
We add some standard assumptions on the kernel.
(A2) The kernel K is positive, supported on [−1, 1], and satisfies ∫_{−1}^{1} K(u) du = 1 and ∫_{−1}^{1} u K(u) du = 0.

The next condition imposes smoothness assumptions on the parameter curve θ_0 and the density f_s(y).
(A3) The parameter space Θ is compact and convex. Let τ := sup_{θ ∈ Θ} ‖θ‖ < ∞. The parameter function θ_0(t) takes values in Θ and, in a neighborhood of t_0, is twice continuously differentiable. The value θ_0(t_0) lies in the interior of Θ.

The following assumption addresses the asymptotic behavior of the distributions of the processes C n,ij (t). In particular, for t in a neighborhood of t 0 , the assumptions address asymptotic stability of the marginal distributions of these processes, and also a certain kind of asymptotic independence of C n,ij and C n,kl for |{i, j} ∩ {k, l}| = 0.
(A6) For w(u) = K(u) and w(u) = K²(u)/∫K²(v) dv it holds, as n → ∞, that the conditions below are satisfied. For A_{n,ij,kl} we assume that

and, for edges with |{i, j} ∩ {k, l}| ≤ 1,

P(C_{n,ij}(t_0) = 1, C_{n,kl}(t_0) = 1) / P(C_{n,12}(t_0) = 1).

Note first that, due to the localization of our likelihood function, all time dependence is only local around the target time t_0. Condition (2.10) appears reasonable in our asymptotic regime, where the size of the network grows: Consider, for instance, a dynamic social media network, and assume that we consider data from a certain geographic region. One might assume that over night the number of active pairs, i.e., the pairs with C_{n,ij} = 1, is lower than during the day, and we expect that there will be a gradual decrease between, e.g., 8pm and 11pm. This time window does not get narrower as n increases, and hence a slow change of the distribution seems to be a reasonable assumption. Assumption (2.12) holds, for example, in the following model: Assume that, in the previous example, communications between pairs end at δ_0 := 8pm plus a certain random time δ_{n,ij}, i.e., C_{n,ij}(t) = 1(t ≤ δ_0 + δ_{n,ij}). In this case, the ratio of probabilities in (2.12) becomes

P(C_{n,ij}(t_0) = 0, C_{n,ij}(s) = 1) / P(C_{n,12}(t_0) = 1).
Since we are using a localizing kernel, the length of the interval [s − δ 0 , t 0 − δ 0 ) is of the order h, and if δ n,ij has a density, then (2.12) holds.
If we assume that relabeling the vertices does not change the joint distribution of the whole process (i.e., if we assume exchangeability), then the joint distribution of two pairs (i, j) and (k, l) depends only on |{i, j} ∩ {k, l}|. Hence, it is very natural to distinguish the three regimes |{i, j} ∩ {k, l}| ∈ {0, 1, 2}. This pattern will appear again in Assumption (A7) below. Let us for the moment consider C_{n,ij} that are constant over time. Then, in (2.11), the case |{i, j} ∩ {k, l}| = 2 is immediate. We discuss the remaining cases for the uniform configuration model. In the uniform configuration model all vertices have (approximately) the same pre-defined degree κ, and we assume that the C_{n,ij} are created as follows: Equip each vertex i = 1, ..., n with κ edge stubs, and create edges by randomly pairing the stubs. After that, discard multiple edges and self-loops. If two vertices i and j are connected after this process, set C_{n,ij} = 1. We use the same heuristics as, e.g., in Newman (2010), Chapter 13.2, to compute the probability of edges. Fix i and j; then, for any fixed edge stub of i, there are κn − 1 stubs left to pair with, κ of which belong to vertex j. Since i has κ edge stubs, the probability of connecting to j is approximately κ²/(κn − 1). Approximating κn − 1 by κn as n gets large, we obtain the following probabilities:

P(C_{n,12} = 1) ≈ κ/n,
P(C_{n,12} = 1, C_{n,23} = 1) = P(C_{n,12} = 1 | C_{n,23} = 1) · P(C_{n,23} = 1) ≈ ((κ − 1)/n) · (κ/n) = κ(κ − 1)/n²,
P(C_{n,12} = 1, C_{n,34} = 1) = P(C_{n,12} = 1 | C_{n,34} = 1) · P(C_{n,34} = 1) ≈ (κ/n) · (κ/n) = κ²/n².

We see now that also in the cases |{i, j} ∩ {k, l}| ≤ 1 the assumptions (2.11) and (2.13) hold.
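The stub-pairing heuristic above can be checked by a small Monte Carlo simulation of the uniform configuration model. The sketch below follows the stated construction (random pairing of stubs, then discarding self-loops and multi-edges); function names and parameter values are our own.

```python
import random

def configuration_graph(n, kappa, rng):
    """Uniform configuration model: give each vertex kappa edge stubs,
    pair the stubs uniformly at random, then discard self-loops and
    multiple edges; returns the set of remaining undirected edges."""
    stubs = [v for v in range(n) for _ in range(kappa)]
    rng.shuffle(stubs)
    edges = set()
    for a, b in zip(stubs[::2], stubs[1::2]):
        if a != b:                         # drop self-loops
            edges.add((min(a, b), max(a, b)))  # set() drops multi-edges
    return edges

def edge_probability(n, kappa, reps, seed=0):
    """Monte Carlo estimate of P(C_{n,12} = 1), i.e. the probability that
    two fixed vertices are connected; the heuristic predicts about kappa/n."""
    rng = random.Random(seed)
    hits = sum((0, 1) in configuration_graph(n, kappa, rng) for _ in range(reps))
    return hits / reps
```

For n = 200 and κ = 3, the heuristic predicts P(C_{n,12} = 1) ≈ 3/200 = 0.015, and the simulated frequency should be close to that value.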
The next assumption involves θ_{0,n}, defined as the maximizer of a limiting criterion function involving the function g defined in (A7). We show later that θ_{0,n} is uniquely defined and that θ_{0,n} is close to θ_0(t_0) (see Lemma 4.2 and Proposition 4.2, respectively). Note that, by Assumption (A1), f_2 and g do not depend on (i, j) and n.
(A7) We assume that f_{n,1} depends on (i, j) and (k, l) only through |{i, j} ∩ {k, l}|. Moreover, we assume that, for all sequences θ_n → θ_0(t_0) and u, v ∈ [−1, 1], f_{n,1}(θ_n, t_0 + uh, t_0 + vh, (i, j), (k, l)) converges to a value that depends only on |{i, j} ∩ {k, l}|. We denote this limit by f_1(θ_0(t_0), |{i, j} ∩ {k, l}|). For r_{n,ij}(s), we assume that the conditions below hold with ρ_{n,ijkl}(u, v) := r_{n,ij}(t_0 + uh) r_{n,kl}(t_0 + vh) for |{i, j} ∩ {k, l}| = 0. Assumption (A7) specifies in which sense the covariates are asymptotically uncorrelated. To motivate these assumptions, build a graph G with vertices 1, ..., n in which (i, j) is an edge if C_{n,ij}(t_0) = 1. Denote by d_G the distance function between edges on G (adjacent edges have distance 0). By the same heuristic as explained after Assumption (A6), this graph is very large (asymptotics in the number of vertices) and sparse (n vertices and of order n edges), because every vertex is incident to at most κ edges. In this scenario, the number of pairs of edges e_1 and e_2 with d_G(e_1, e_2) = d is of order (κ − 1)^d · n, and there are of order n² pairs of edges in total. Let now A_{i,j} be arbitrary centered random variables indexed by the edges of G. We assume that A_{i,j} is influenced equally by all A_{k,l} with (k, l) adjacent to (i, j). Under this assumption, the covariance of A_{i,j} and A_{k,l} for non-adjacent edges (i, j) and (k, l) converges to zero after multiplication with l_n h ≈ nh (in this case). Because, in (2.18) and (2.19), we consider only expectations conditional on C_{n,ij}(t) = 1, we can think of A_{n,ij} as being the random variables τ_{n,ij} (a centered version thereof) or r_{n,ij}, and the expectations in the above heuristic are conditional expectations, conditional on the respective conditions in (2.18) and (2.19). This serves as motivation for these two assumptions.
Moreover, unconditionally, τ n,ij and τ n,kl (and r n,ij and r n,kl ) do not need to be uncorrelated.
Recalling that l_n = (n(n−1)/2) P(C_{n,ij}(t_0) = 1) (in the case of undirected networks) is the effective sample size, i.e., the expected number of pairs relevant for the estimation of θ_0(t_0), we see that Theorem 2.1 is a classical asymptotic normality result, up to the additional bias term B_n, which we discuss next. This term is of order O(1) by (2.12) and (2.8), and hence B_n = O_P(1). In general, one cannot argue that its expectation converges to 0; thus, in general, we will have an additional bias term of order h². Suppose now that one can show B_n − E(B_n) = o(1) by using additional assumptions that bound the second moment of this term. Then E(B_n) can vanish only if the membership in the active set changes for at most a negligible minority of edges. In particular, in the extreme case of C_{n,ij} being constant, we have γ_{n,ij} ≡ 0 and B_n = 0. Hence, the bias term B_n is induced by a change in the sparsity of the active set.

Direct network modeling
We consider the following general model for the link-based dynamics of a random network, using a multivariate continuous-time counting process approach that allows for an arbitrary dependence structure between the links. We apply the model for dynamic interactions twice: once for the formation of new links and once for the deletion of existing links (this separation can also be found in Krivitsky and Handcock (2014)). As before, let V_n = {1, ..., n} be the set of vertices and L_n the set of edges. Note that here we consider undirected networks; directed networks can be handled similarly. For a given link (i, j), we let Y_{n,ij}(t) ∈ {0, 1} indicate whether the link (i, j) is present at time t; the collection (Y_{n,ij}(t))_{(i,j) ∈ L_n} describes the random network or, equivalently, the (upper half of the) adjacency matrix at time t. To describe the dynamics of the links over time we introduce two processes, N^+_{n,ij}(t) and N^−_{n,ij}(t), counting how often a link (i, j) was added or deleted, respectively, until time t. With these definitions, we can write, for (i, j) ∈ L_n,

Y_{n,ij}(t) = Y_{n,ij}(0) + N^+_{n,ij}(t) − N^−_{n,ij}(t).

For v ∈ {+, −}, the intensities of the counting processes N^v_{n,ij}(t) are defined as

λ^+_{n,ij}(θ^+, t) = (1 − Y_{n,ij}(t−)) γ^+(θ^+(t); X^+_{n,ij}(t)),    λ^−_{n,ij}(θ^−, t) = Y_{n,ij}(t−) γ^−(θ^−(t); X^−_{n,ij}(t)),

for some functions γ^+ and γ^−, where θ^+ and θ^− are two different parameters, determining the addition and the deletion processes, respectively. The vectors X^v_{n,ij}(t), v ∈ {+, −}, denote covariates that are assumed to be F_t-predictable. Note that this definition of the intensities ensures that, as it should be, a link can only be added if none was present immediately before, and similarly for the removal of a link.
These definitions of the intensities fit into the framework of Section 2.2, with the indicators 1 − Y_{n,ij}(t−) and Y_{n,ij}(t−) playing the role of the selectors C_{n,ij}. Again, as in Section 2.2, we allow the parameter to be a function of time.
To sum up: the processes N^+_{n,ij} are modeled with intensity λ^+_{n,ij}(θ^+_0, t), and the processes N^−_{n,ij} are modeled with intensity λ^−_{n,ij}(θ^−_0, t). Our model allows the covariates X^v_{n,ij} and the true parameter functions θ^v_0 to be different for v = + and v = −. For estimating the parameters, we consider observations of the same type only, i.e., we compute two maximum likelihood estimators: the estimator of θ^+_0(t) based on the processes N^+_{n,ij}, and the estimator of θ^−_0(t) based on the processes N^−_{n,ij}. Both estimators can be treated as coming from an interaction-based model, and hence the theory from Section 2.2 can be applied.
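The gated addition/deletion mechanism can be illustrated by a small discrete-time simulation for a single link, assuming constant rates for γ^+ and γ^−; step size, rates, and function names are illustrative choices of ours.

```python
import random

def simulate_link(T, dt, lam_plus, lam_minus, y0=0, seed=0):
    """Discrete-time sketch of the addition/deletion dynamics of one link:
    an addition fires at rate (1 - Y(t-)) * lam_plus and a deletion at rate
    Y(t-) * lam_minus, so Y(t) = Y(0) + N+(t) - N-(t) always stays in {0, 1}."""
    rng = random.Random(seed)
    y, n_plus, n_minus = y0, 0, 0
    for _ in range(int(T / dt)):
        if y == 0 and rng.random() < lam_plus * dt:
            n_plus += 1     # link is formed
            y = 1
        elif y == 1 and rng.random() < lam_minus * dt:
            n_minus += 1    # link is deleted
            y = 0
    return y, n_plus, n_minus
```

The invariant Y(t) = Y(0) + N^+(t) − N^−(t) ∈ {0, 1} holds by construction, mirroring the identity in the model above.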

Application to Bike Data
We intend to illustrate the finite sample performance of the estimation procedure described above by considering the Capital Bikeshare (CB) Performance Data, publicly available at http://www.capitalbikeshare.com/system-data. This data set describes the usage of the CB system in Washington, D.C. from January 2012 to March 2016. Using this data, we construct a network as follows. Each bike station becomes a node in our network, and an edge between two stations i and j is formed depending on whether a bike was rented at station i and returned at station j (or vice versa) on the same day. Thus, in our analysis, rentals spanning several days have been ignored. We also ignore the direction of travel, because we aggregate over days and assume that directed effects cancel out (most riders go one way in the morning and the other way in the evening).
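The aggregation into same-day, undirected tour counts can be sketched as follows; the record format (day, start station, end station) is our own illustration, not the CB data schema.

```python
from collections import Counter

def daily_tour_counts(rentals):
    """Aggregate rental records into undirected per-day tour counts.

    rentals : iterable of (day, start_station, end_station) records; rentals
    spanning several days are assumed to be filtered out beforehand.
    Returns a Counter mapping (day, i, j) with i <= j to the number of tours.
    """
    counts = Counter()
    for day, i, j in rentals:
        if i != j:  # round trips to the same station form no edge
            counts[(day, min(i, j), max(i, j))] += 1
    return counts
```

Ordering each station pair as (min, max) is what makes the counts direction-free, matching the aggregation described above.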
It should be noted that, while we believe that this example serves as a serious and interesting illustration of our proposed method, it is not meant to be a full-fledged analysis of bike sharing performance. In particular, for computational and coding simplicity, we ignored that bike stations might be full or empty, which prohibits certain bike rides. However, the authors' personal bike sharing experience is that entirely empty or full bike stations are not encountered too often, so the hope is that the bias induced by ignoring this effect is negligible. Figure 1 shows some summary statistics of the data. Figure 1a shows the number of available bike stations, which increases strongly over time. Figure 1b shows the number of bike tours on Fridays. An obvious periodicity is visible: the cycling activity is much lower in winter than in summer.
We aim at modeling the bike sharing activities on Fridays, shown in the right panel of Figure 1. We chose Fridays because we believe that different days of the week behave differently, so we concentrate on a single day. We use the event (interaction) counting approach introduced in Section 1.2, where an event means that a bike is rented at station i and returned at station j, or vice versa. We will also refer to such an event as a tour between i and j. In order to keep the computational complexity low (fitting the model takes several minutes on a standard laptop), we assume that the covariates change only at midnight and stay constant over the day. Furthermore, we estimate the time-varying parameter function θ at only one time point per day, namely 12 noon. The next paragraph contains more details.
Since we do not consider any asymptotics here, we omit the index n. Time t is measured in hours of consecutive Fridays. So, if k is the current week and r is the time on Friday (in 24h format), then t := (k − 1) · 24 + r. Thus, with r_t := t mod 24, the quantity k_t := (t − r_t)/24 + 1 gives the week that the time point t falls into. The processes N_{i,j}(t), counting the number of tours between i and j on Fridays, are modeled as counting processes with intensities of the form λ_{i,j}(t) = C_{i,j}(k_t) α(t) exp(θ(t)^T X_{i,j}(k_t)). The covariate vector X_{i,j}(k_t) and the censoring indicator C_{i,j}(k_t) will be defined later. Note that they both depend only on k_t, i.e. on the current week, and not on the actual time on the Friday under consideration. The function α is 24-periodic and integrates to one over a period, i.e., α(t) = α(t + 24) and ∫_t^{t+24} α(s) ds = 1. The function α is introduced to model the reasonable assumption that the activity varies during the day. Suppose now that our target is the estimation of the parameter vector θ(t_0) with t_0 = (k_{t_0} − 1) · 24 + r_0 and, say, r_0 = 12. We choose a piecewise constant kernel K with K((24k + x)/h) = K(24k/h) for all k ∈ ℕ and 0 ≤ x < 24. Substituting these choices of intensity and kernel into the log-likelihood (2.2), we see that our maximum likelihood estimator maximizes a local log-likelihood in which the number of tours between i and j on the Friday in week k enters with the kernel weights K_κ(k) = K(k/κ), where κ = h/24. In our empirical analysis, we chose K_κ(k) as triangle weights with support {−κ, ..., κ} and considered only integer choices of the bandwidth κ. The bandwidth choice is discussed at the end of this section.
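To make the kernel-weighted likelihood maximization concrete, here is a minimal sketch with a single scalar covariate and Newton's method; it is a toy illustration under assumed names, not the authors' implementation, and the real model uses a covariate vector.

```python
import math

def triangle_weights(kappa):
    """Triangle weights K_kappa(k) on the support {-kappa, ..., kappa};
    the exact normalization at the endpoints is an assumption."""
    return {k: 1.0 - abs(k) / (kappa + 1) for k in range(-kappa, kappa + 1)}

def local_mle(data, weights, theta0=0.0, steps=50):
    """Maximize the kernel-weighted Poisson-type log-likelihood
        sum_k w_k * (delta_k * theta * x_k - c_k * exp(theta * x_k))
    for one scalar parameter via Newton's method.  `data` maps
    k -> (delta_k, x_k, c_k): tour count, covariate, censoring indicator."""
    theta = theta0
    for _ in range(steps):
        grad = sum(w * (data[k][0] * data[k][1]
                        - data[k][2] * data[k][1] * math.exp(theta * data[k][1]))
                   for k, w in weights.items())
        hess = -sum(w * data[k][2] * data[k][1] ** 2 * math.exp(theta * data[k][1])
                    for k, w in weights.items())
        theta -= grad / hess           # Newton step on the concave objective
    return theta
```

With constant covariate x = 1 and no censoring, the maximizer reduces to the log of the weighted mean tour count, which gives a quick sanity check.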
We now explain the choice of our covariate vector X_{i,j}. Denote by Δ_{i,j}(k, d) the number of tours between i and j on day d in week k, where d = 4 means Monday and d = 7 refers to Thursday (for us the week starts on Friday, i.e. Friday is d = 1). For r ∈ (0, 1), we encode the activity between i and j in week k as A_{i,j,k} = (1 − r) Σ_{d=4}^{7} r^{7−d} Δ_{i,j}(k, d) (mind the limits of the summation: Fridays are not included). In our analysis, we chose r = 0.8. This choice is somewhat arbitrary; a full study of the data would include investigating the sensitivity of the parameter estimates to the choice of r, as well as a data driven choice of r, but we do not attempt this here. We construct a network G(k), for every week k, by connecting i and j if and only if there was at least one tour on the Friday of that week. We denote by I_{i,j,k} the number of common neighbors of i and j in the graph G(k). We let d_{i,k} be the degree of node i in G(k), T_{i,j,k} the number of tours between i and j on the Friday in the k-th week, and T_{i,j,k,k−1} = (T_{i,j,k} + T_{i,j,k−1})/2 the average number of tours on the two Fridays in weeks k and k − 1. Finally, we collect everything in the covariate vector X_{i,j}. The censoring indicator function C_{i,j} is defined to be equal to zero if there was no tour between stations i and j in the last four weeks. In summary, we estimate a total of six parameter curves, corresponding to the effects of six covariates in our model:
• θ_1(t): baseline activity
• θ_2(t): activity between stations on previous weekdays
• θ_3(t): number of common neighbors of the two stations
• θ_4(t): popularity of the stations, measured by degrees
• θ_5(t): activity between stations on the two previous Fridays
• θ_6(t): inactivity between stations on the two previous Fridays
The resulting estimated parameter curves are shown in Figures 2 and 3. All calculations have been executed on the BwForCluster (cf. Acknowledgement). In all six parameter curves in Figures 2 and 3, we observe a clearly visible seasonality.
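The covariate construction above can be sketched as follows. This is an illustrative fragment under assumed data structures (per-pair tour counts, weekly adjacency dicts), not the code behind the reported results.

```python
R = 0.8  # geometric down-weighting; r = 0.8 as in the text

def weekday_activity(delta, k):
    """A_{i,j,k} = (1-r) * sum_{d=4}^{7} r^(7-d) * Delta(k, d):
    geometrically weighted Monday-Thursday activity for one fixed
    pair (i, j); `delta` maps (week, day) -> tour count."""
    return (1 - R) * sum(R ** (7 - d) * delta.get((k, d), 0) for d in range(4, 8))

def common_neighbors(adj, i, j):
    """I_{i,j,k}: common neighbors of i and j in the weekly graph G(k),
    given as an adjacency dict node -> set of neighbors."""
    return len(adj.get(i, set()) & adj.get(j, set()))

def censored(tours_last_4_weeks):
    """C_{i,j} = 0 if there was no tour in the last four weeks."""
    return 0 if sum(tours_last_4_weeks) == 0 else 1
```

Note that `weekday_activity` gives the most weight to Thursday (r^0 = 1) and the least to Monday (r^3), matching the idea that recent activity is most informative.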
Looking at Figure 2b, we see that the importance of the activity during the week (Monday to Thursday) is higher in the winter months than in the summer. A plausible interpretation might be that opportunistic cyclists are less active in winter because of the colder weather, so only those who ride the same tour every day, regardless of the weather, keep cycling. This makes the activity during the week a better predictor. Moreover, the weather in winter is more persistent; e.g., when there is snow, it is likely to remain for a while. Figure 2c shows that the number of common neighbors always has a significant positive effect on the hazard. This reflects the empirical finding that observed networks cluster more than totally random networks (see e.g. Jackson, 2008).
The influence of the popularity of the involved bike stations (measured by their degrees) is investigated in Figure 3a. Interestingly, it always has a significant negative impact. The size of the impact is larger in the summer months, which again supports the hypothesis that in summer the behavior of the network as a whole appears more random than in winter. Still, the negative impact is somewhat unexpected. This finding can be interpreted as the observed network having no hubs. Another reason for this effect might be that stations can only host a fixed number of bikes: if a station i is empty, no new connections involving i can be formed, and a similar saturation effect occurs if many bikes arrive at station i. Moreover, it is plausible that effects caused by the degrees are already captured by the covariates shown in Figures 2b and 3b, which show the effect of the bike rides on the days immediately preceding the current Friday and the effect of the average number of bike tours on the last two Fridays, respectively. In Figure 3b, we observe a similar, even more pronounced, behavior as in Figure 2b: in summer the predictive power of the tours on the last two Fridays is significantly lower than in winter, underpinning the theory that destinations in summer tend to be based on more spontaneous decisions. Finally, in Figure 3c, we observe that having no bike tours between a given pair of stations on the last two Fridays always has a significant negative impact on the hazard: again a very plausible finding.
We are currently working on testing whether the parameter functions depend on time, i.e., on testing for constancy of the parameter functions; for a complete data analysis, such a test would be an important addition.

Modeling other network characteristics. In stochastic network analysis, a central strand of research is concerned with the question of whether characteristics observed in real networks can be adequately mimicked by stochastic network models. Important characteristics are the degree distribution, the clustering coefficient and the diameter (these and other characteristics can be found in Jackson (2008), Chapter 2.2; we also define them in the appendix). As in Zafarani et al. (2014), Chapter 4, we compare these three characteristics of the observed networks with those of typical networks produced by our model. In order to see how well our fitted model is able to capture these characteristics, we simulated 3840 networks corresponding to three randomly chosen days, by using the network model with the fitted parameters of the corresponding day. We then compared the three simulated characteristics on these three days to the ones observed in the networks (this way of assessing the goodness of fit is also used in Hunter et al. (2008)). Here, we present the results for the degree distribution on 7th December 2012. The other results are reported in the appendix.
In our analysis, we consider fitting sub-networks defined by the popularity of their edges: for given values 0 ≤ l_1 < l_2 ≤ ∞, the network is constructed by placing an edge between a pair of nodes (i, j) if the number of tours between i and j falls between l_1 and l_2. Different ranges of l_1 and l_2 are considered. The idea is to cover everything from the network of rarely frequented tours (l_1 = 1 and l_2 = 3) up to the network of highly frequented tours (l_1 = 10 and l_2 = ∞). Figure 4 shows the simulated degree distributions for six different choices of l_1 and l_2. The dotted lines indicate the 10% and 90% quantiles of the simulated graphs, and the solid line shows the true degree distribution. We see that, in all six cases, the approximation is reasonably accurate, in particular if one takes into account that we did not specifically aim at reproducing the degree distributions. The plots show that the largest degree of the simulated networks and that of the observed network are not too far apart, and the overall shape of the degree distribution is captured well. It should also be noted that we used only six covariates, whereas in other related empirical work much higher dimensional models have been used, see e.g. the discussion in Perry and Wolfe (2013).
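The thresholding into popularity regimes is straightforward to sketch; a minimal version, with data structures assumed for illustration:

```python
import math

def popularity_subnetwork(tour_counts, l1, l2):
    """Keep an edge {i, j} iff its tour count lies in [l1, l2];
    `tour_counts` maps frozenset({i, j}) -> number of tours on the day.
    l2 = math.inf gives the 'highly frequented' regime from the text."""
    return {e for e, n in tour_counts.items() if l1 <= n <= l2}

def degree_distribution(edges):
    """Map node -> degree in the thresholded network."""
    deg = {}
    for e in edges:
        for v in e:
            deg[v] = deg.get(v, 0) + 1
    return deg
```

Running this for the six (l_1, l_2) ranges on both simulated and observed networks yields the degree distributions compared in Figure 4.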
Brief remark on the choice of bandwidth via one-sided cross-validation. To choose the bandwidth, we calculate a local linear estimate with the one-sided kernel K_{+,κ}(k) = K_κ(k) 1(k < 0). For each value of κ, the fitted value of the conditional expectation of X_{i,j}(k_{t_0}), given the past, is compared with the observed value of X_{i,j}(k_{t_0}). This is done for all non-censored edges. The results for different bandwidths are shown in Figure 5. We see that the prediction error of the model decreases until we reach the bandwidth κ = 23. One-sided cross-validation now makes use of the fact that the ratio of the asymptotically optimal bandwidths of two kernel estimators with different kernels K and L is a constant ρ that depends only on the two kernels. For the triangular kernel and its one-sided version, we get ρ ≈ 1.82. The one-sided CV bandwidth is obtained by dividing 23 by ρ, which yields a bandwidth of roughly twelve (here, too, we only consider integer bandwidths). More details on the one-sided cross-validation approach are presented in the appendix.
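The selection step can be summarized in a small helper. The value ρ ≈ 1.82 is taken from the text; the error dictionary is a hypothetical stand-in for the prediction errors plotted in Figure 5.

```python
import math

RHO = 1.82  # bandwidth ratio for the triangular kernel vs. its one-sided version

def select_bandwidth(prediction_errors, rho=RHO):
    """One-sided CV bandwidth selection as described above.

    `prediction_errors` maps each integer candidate kappa to its
    one-sided prediction error; we pick the minimizer and rescale by
    rho, rounding down since only integer bandwidths are used."""
    kappa_onesided = min(prediction_errors, key=prediction_errors.get)
    return math.floor(kappa_onesided / rho)
```

If the one-sided error is minimized at κ = 23, this returns 23/1.82 ≈ 12.6 rounded down, i.e. the bandwidth of roughly twelve used in the analysis.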

Proof of Theorem 2.1
In the proof, we do not distinguish explicitly between directed and undirected networks: in the undirected case, we always assume i < j; moreover, we will need l_n = O(n² P(C_{n,12}(t_0) = 1)), which is true in both cases. The processes N_{n,ij} are counting processes with intensity given by λ_{n,ij}(θ_0(t), t). We can decompose these counting processes as N_{n,ij}(t) = ∫_0^t λ_{n,ij}(θ_0(s), s) ds + M_{n,ij}(t) (4.1), where M_{n,ij} is a local, square integrable martingale. We use this decomposition of the counting processes in order to decompose the likelihood and its derivatives. Let P_n(θ) be defined as in (4.2); note that we do not make the dependence of P_n(θ) on t_0 explicit in the notation. Recall that θ_{0,n} is defined as the maximizer of θ ↦ (1/h) ∫ K((s − t_0)/h) g(θ, s) ds, where g is defined in (A7). Note that the function g does not depend on n, see Assumption (A1). Lemma 4.2 shows that θ_{0,n} is uniquely defined. The value θ_{0,n} is the deterministic counterpart of the random quantity θ̂_n(t_0) that is defined as a solution of P_n(θ̂_n(t_0)) = 0. The existence of the latter is considered in Proposition 4.3.

Lemma 4.1. For every s in a neighborhood of t_0 and every θ ∈ Θ, we have g(θ, s) ≤ g(θ_0(s), s). Equality holds if and only if θ_0(s)^T y = θ^T y. In particular, θ_0(s) is the unique maximizer of θ ↦ g(θ, s).
Proof. Note that, for arbitrary y ∈ ℝ, the differentiable function x ↦ x e^y − e^x has the unique maximizer x = y. This implies the first statement of the lemma, and it also implies the second statement.
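The elementary fact used in this proof is easy to verify numerically; a quick sketch:

```python
import math

def phi(x, y):
    # the function x -> x * e^y - e^x from the proof of Lemma 4.1;
    # its derivative e^y - e^x vanishes exactly at x = y
    return x * math.exp(y) - math.exp(x)
```

For any fixed y and any perturbation eps != 0, one finds phi(y, y) > phi(y + eps, y), confirming that x = y is the unique maximizer.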
In all lemmas and propositions of this section, we assume that Assumptions (A1)-(A7) hold.
Fact 4.1. For j ∈ {0, 1, 2} and k ∈ {0, 1, 2, 3} with j + k ≤ 3, the partial derivatives of order j of the function g(θ, s) with respect to s, and of order k with respect to θ, exist for s in a neighborhood of t_0 and θ ∈ Θ. The partial derivatives can be calculated by interchanging the order of integration and differentiation in (2.17). For θ ∈ Θ and s in a neighborhood of t_0, all these partial derivatives of g(θ, s) are absolutely bounded. For the calculation of the first two derivatives of g with respect to θ, differentiation and application of the expectation operator can be interchanged in (2.16). The matrix Σ is invertible.
Proof. The statement of this fact follows immediately from (2.6) of Assumption (A4). Note that the functions θ_0, θ_0′ and θ_0″ are absolutely bounded in a neighborhood of t_0; this holds because they are continuous in a neighborhood of t_0, see (A3).
Proof of Lemma 4.2. From Fact 4.1, we know that ∂_t g(θ, t) is absolutely bounded for t in a neighborhood of t_0 and θ ∈ Θ. This implies that (1/h) ∫ K((s − t_0)/h) g(θ, s) ds converges to g(θ, t_0), uniformly in θ ∈ Θ. Because ∂²_θ g(θ, t_0) is negative definite, this implies the statement of the lemma.
Lemma 4.3. The matrix Σ_n converges to Σ. Moreover, the sequence v_n is bounded, and it holds that v_n → v, as n → ∞.
Proof. Using Lemma 4.2 and Fact 4.1, we conclude that the integrand is bounded (note that u ∈ [−1, 1] and α ∈ [0, 1]). The first statement of the lemma follows by an application of Lebesgue's Dominated Convergence Theorem and the fact that ∂²_θ g is bounded, being a continuous function on a compact set. The second statement of the lemma follows similarly.
Proposition 4.2. For t_0 ∈ (0, T), we have θ_{0,n} − θ_0(t_0) = O(h²).

Proof of Proposition 4.2. Since θ_0(s) maximizes θ ↦ g(θ, s) (cf. Lemma 4.1), we have ∂_θ g(θ_0(s), s) = 0. Furthermore, by definition of θ_{0,n}, we have ∫ K((s − t_0)/h) ∂_θ g(θ_{0,n}, s) ds = 0. The matrix Σ_n converges to the invertible matrix Σ by Lemma 4.3. The first integral is of order h²; this follows by a Taylor expansion in the time parameter. By Lemma 4.3, v_n is bounded. Thus, together with (4.6), we have established the desired expansion, and the statement of the proposition now follows from v_n → v.

Lemma 4.4. For any k, l ∈ {1, ..., q}, it holds that P_n(θ_{0,n}) →_P 0 and P_n′(θ_{0,n}) →_P −Σ, and the bound (4.9) holds, where P_n^{(r)} denotes the r-th component of P_n, and the supremum runs over k, l, r ∈ {1, ..., q} and θ ∈ Θ.
For the proof of (4.9), we calculate a bound for the expectation of the absolute value of the third derivative of P_n, with s = t_0 + uh, where the integrand involves the term C_{n,ij}(s) X_{n,ij}(s) (e^{θ_0(s)^T X_{n,ij}(s)} − e^{θ_{0,n}^T X_{n,ij}(s)}) − ∂_θ g(θ_{0,n}, s).

Lemma 4.5. With B_n from Theorem 2.1, the term considered in (4.19) is equal to h² · B_n + o_P(1/√(l_n h)) + o_P(h²).

Proof. In this proof, we use the shorthand notation K_{h,t_0}(s) = (1/h) K((s − t_0)/h). We begin with proving (4.19). For vectors a, b ∈ ℝ^q, denote by [a, b] the connecting line between a and b. Note first that, by a Taylor expansion with a random (depending on X_{n,ij}(s)) intermediate value θ*(s) ∈ [θ_0(s), θ_{0,n}],

e^{θ_0(s)^T X_{n,ij}(s)} − e^{θ_{0,n}^T X_{n,ij}(s)} = X_{n,ij}(s)^T e^{θ*(s)^T X_{n,ij}(s)} · (θ_0(s) − θ_{0,n}). (4.20)

Hence, we obtain (4.21), which we decompose into two terms by splitting the censoring indicators. For the second part we obtain, by using that the parameter space is bounded by τ and convex (A3), Fubini's theorem in the second line, and a rewriting as a conditional expectation in the last line, a bound in terms of

(1 − C_{n,12}(t_0)) C_{n,12}(s) / P(C_{n,12}(t_0) = 1) · ||X_{n,12}(s)||² e^{τ ||X_{n,12}(s)||} (4.22)

and

E[ ||X_{n,12}(s)||² e^{τ ||X_{n,12}(s)||} | C_{n,12}(s) = 1, C_{n,12}(t_0) = 0 ]. (4.23)

The last equality holds because, by assumption (2.12), the first factor is O(h), the second factor is uniformly bounded by (2.8), and θ_{0,n} − θ_0(t_0) = O(h²) by Proposition 4.2.
We now discuss the second term of the split of (4.21). Recall the definitions of γ_{n,ij}(s) and τ_{n,ij}(θ, s) from Theorem 2.1 and (2.15), respectively. Applying the above, and using that θ_0(s) − θ_0(t_0) = θ_0′(t*)(s − t_0) for an appropriate point t* ∈ [t_0, s], we obtain (4.24). Hence, it remains to use the expansion

e^{θ*(s)^T X_{n,ij}(s)} − e^{θ_0(s)^T X_{n,ij}(s)} = X_{n,ij}(s)^T e^{θ**(s)^T X_{n,ij}(s)} (θ*(s) − θ_0(s)).

The arguments are now again similar to the ones leading to (4.24); we only have to use the third-power part of (2.8) and the fact that sup_{s∈U_h} ||θ*(s) − θ_0(s)|| ≤ sup_{s∈U_h} ||θ_0(s) − θ_{0,n}||, which converges to zero by continuity of θ_0 and Proposition 4.2. This concludes the proof of (4.19).

Proposition 4.3.
With probability tending to one, the equation P_n(θ) = 0 (cf. equation (4.2)) has a solution θ̂_n(t_0), which satisfies θ̂_n(t_0) − θ_{0,n} = O_P(h² + (l_n h)^{−1/2}). To prove this proposition, we will make use of the following theorem, see Deimling (1985):

Theorem 4.4 (Newton-Kantorovich Theorem). Let R(x) = 0 be a system of equations, where R : D_0 ⊆ ℝ^q → ℝ^q is a differentiable function defined on D_0, with first derivative R′. Assume that there is an x_0 such that all of the following expressions exist: B := ||R′(x_0)^{−1}||, η := ||R′(x_0)^{−1} R(x_0)||, a Lipschitz constant K of R′, and r := BKη. If r ≤ 1/2, then R(x) = 0 has a solution x* with ||x* − x_0|| ≤ 2η; moreover, ||x* − x_1|| ≤ 2rη, where x_1 = x_0 − R′(x_0)^{−1} R(x_0) is the first Newton iterate.

Proof of Proposition 4.3. We show that P_n(θ) has a root by using Theorem 4.4. Lemma 4.4 gives that P_n(θ_{0,n}) →_P 0 and P_n′(θ_{0,n}) →_P −Σ. Since Σ is invertible, the sequence of random variables B_n := ||P_n′(θ_{0,n})^{−1}|| is well-defined (for large n) and of order O_P(1). Thus we also have η_n := ||P_n′(θ_{0,n})^{−1} P_n(θ_{0,n})|| = o_P(1). For the Lipschitz continuity of P_n′, we bound the partial derivatives of P_n′ by Lemma 4.4; hence we conclude that every realization of P_n′ is Lipschitz continuous with (random) Lipschitz constant K_n = O_P(1). Combining everything, we get that r_n := B_n K_n η_n = o_P(1). Thus, with probability tending to one, we have r_n ≤ 1/2, and hence the Newton-Kantorovich Theorem tells us that, with probability tending to one, the equation P_n(θ) = 0 has a solution θ̂_n(t_0) with the property that ||θ̂_n(t_0) − θ_{0,n}|| ≤ 2η_n = o_P(1).
To prove the asserted rate, we have to investigate η_n further. We note first that, since P_n′(θ_{0,n})^{−1} is stochastically bounded, the rate of η_n is determined by the rate at which P_n(θ_{0,n}) converges to zero. To find this rate, we observe that every summand of P_n(θ_{0,n}) has expectation zero conditionally on C_{n,ij}(s) = 1. So, in P_n(θ_{0,n}), we can subtract C_{n,ij}(t_0) ∫_0^T K((s − t_0)/h) ∂_θ g(θ_{0,n}, s) ds from every summand without changing anything, which leaves a term driven by the differences e^{θ_0(s)^T X_{n,ij}(s)} − e^{θ_{0,n}^T X_{n,ij}(s)}.
By Lemma 4.5, this term is equal to h² · B_n + o_P(1/√(l_n h)) + o_P(h²), which concludes the proof of Proposition 4.3.
Lemma 4.6. For k, l ∈ {1, ..., q}, the convergence (4.27) holds, where the integrand involves X_{n,ij}(s) C_{n,ij}(s) exp(θ_0(s)^T X_{n,ij}(s)). Moreover, (4.29) holds.

Proof. The proof of (4.27) follows by using similar arguments as in the proof of Lemma 4.4, with θ_{0,n} replaced by θ_0(s), and with K replaced by K².
By using exactly the same arguments as in the proof of Lemma 4.5, we obtain |e^{θ_{0,n}^T X_{n,ij}(s)} − e^{θ̂_n(t_0)^T X_{n,ij}(s)}| ≤ ||X_{n,ij}(s)|| e^{τ ||X_{n,ij}(s)||} · ||θ_{0,n} − θ̂_n(t_0)||.

This gives
The expectation of the first factor is bounded because of assumptions (2.10) and (2.9). Furthermore, the second factor is of order o_P(1) by Proposition 4.3. Thus, the product is of order o_P(1). This shows (4.30) and concludes the proof of (4.29).
Proposition 4.5. With probability tending to one, ∂_θ ℓ_T(θ, t_0) = 0 has a solution θ̃_n(t_0), and the asymptotic distribution stated in (4.36) holds.

Proof of Proposition 4.5. The proof is based on modifications of arguments used in the asymptotic analysis of parametric counting process models, see e.g. the proof of Theorem VI.1.1 on p. 422 in Andersen et al. (1993). Define U^l(θ) := h ∂_{θ_l} ℓ_T(θ, t_0), l = 1, ..., q, and let U^l_t(θ) be defined as U^l(θ), but with t as the upper limit of the integral in (2.5) (i.e., U^l(θ) = U^l_T(θ)). Furthermore, we write U(θ) = (U^1(θ), ..., U^q(θ)), and the vector U_t(θ) is defined analogously. In the first step of the proof, we show the convergence (4.31). For the local, square integrable martingale M_{n,ij} defined in (4.1), it holds that M_{n,ij} and M_{n,i′j′} are orthogonal for (i, j) ≠ (i′, j′), i.e., the predictable covariation process satisfies ⟨M_{n,ij}, M_{n,i′j′}⟩_t = 0. The predictable variation process of M_{n,ij} is given in (4.32). By the definition of θ̂_n(t_0) (see the statement of Proposition 4.3), and writing K_{h,t_0}(s) := (1/h) K((s − t_0)/h), we obtain the decomposition (4.33) of the score into integrals with respect to dN_{n,ij}(s) and dM_{n,ij}(s).
So θ̂_n(t_0) was chosen such that the non-martingale part of ∂_θ ℓ(θ̂_n(t_0), t_0) vanishes. Now we want to apply Rebolledo's martingale central limit theorem, see e.g. Theorem II.5.1 in Andersen et al. (1993). This theorem implies (4.31), provided that the Lindeberg condition (4.28) and the convergence (4.34) hold. To verify (4.34), first note that (4.32) and (4.27) imply finiteness of the relevant predictable variation integral with probability tending to one. Note that Lemma 4.6 is formulated with t = T, but the integral is finite also for t < T, simply because the integrand is non-negative. From now on, we assume that the above integral is finite. The martingale part of the score is a local square integrable martingale, see e.g. Theorem II.3.1 on p. 71 in Andersen et al. (1993). Since the martingales M_{n,ij} are orthogonal, and by using Lemma 4.6, the predictable covariation, with integrand involving C_{n,ij}(s) exp(θ_0(s)^T X_{n,ij}(s)), converges as required. This shows (4.34), and concludes the proof of (4.31).
Results (4.29) and (4.31) imply that η_n = o_P(1). Next, notice that P_n′ has a Lipschitz constant K_n that is bounded by the maximum of the partial derivatives of P_n′. According to (4.9), this maximum is bounded, and we obtain K_n = O_P(1). Hence, r_n = B_n K_n η_n = o_P(1). Now, Theorem 4.4 implies that, with probability converging to one, there is θ̃_n(t_0) such that U(θ̃_n(t_0)) = 0 and ||θ̃_n(t_0) − θ̂_n(t_0)|| ≤ 2η_n →_P 0.
To obtain the asymptotic distribution of θ̃_n(t_0), we note that, by (4.31) and (4.29), √(l_n h) · Z_n = O_P(1), and as a consequence we get √(l_n h) · η_n = O_P(1). Using the second statement of the Newton-Kantorovich Theorem 4.4, we obtain ||√(l_n h) · (θ̃_n(t_0) − θ̂_n(t_0)) − √(l_n h) · Z_n|| ≤ √(l_n h) · 2 r_n η_n = o_P(1).
Thus √(l_n h) · (θ̃_n(t_0) − θ̂_n(t_0)) and √(l_n h) · Z_n have the same limit distribution. Because of (4.36), this implies the statement of the proposition.
Proof of Theorem 2.1. Combining Propositions 4.3 and 4.5 and applying Slutsky's Lemma, we obtain, by the assumptions on the bandwidth h in (A1), the asserted asymptotic distribution. Together with Proposition 4.2, this gives (2.20).
5 Acknowledgement

6 Appendix

6.1 Simulations of degree distributions, cluster coefficients and diameters
Here we report additional simulations of degree distributions, cluster coefficients and diameters. In Section 3, we presented results for the degree distribution of networks based on the Washington D.C. bikeshare activity on 7th December 2012. In this section, we consider the days 18th April 2014 and 10th July 2015, and also compare diameters and clustering coefficients of the simulated and observed networks. As above, using the corresponding estimated parameter value for each of these days, we compute 3840 predictions and compare them with the observed values. The diameter of a network is the longest among the shortest paths between any two vertices in the network. Typically, in observed networks, the diameter is much smaller than the number of vertices (cf. Jackson, 2008). The clustering coefficient is the number of complete triangles (triples of vertices which are completely connected) divided by the number of incomplete triangles (triples of vertices with at least two edges). Note that every complete triangle is also incomplete in this sense, hence the clustering coefficient lies between zero and one. The clustering coefficient can be understood as the empirical probability that two vertices are connected, given that there is a third vertex to which both are connected. It has been reported (cf. Jackson, 2008) that in observed networks this number is usually significantly higher than in an Erdős-Rényi network, where the presences of edges are i.i.d. random variables.
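The two characteristics defined above are simple to compute for small graphs. A self-contained sketch following the definitions in the text (brute-force over triples, so only suitable for illustration):

```python
from collections import deque
from itertools import combinations

def diameter(adj):
    """Longest shortest path between any two vertices (BFS from each
    node); assumes the graph, given as node -> set of neighbors, is
    connected."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

def clustering_coefficient(adj):
    """Complete triangles divided by triples with at least two edges,
    following the definition in the text."""
    complete = incomplete = 0
    for triple in combinations(adj, 3):
        edges = sum(1 for a, b in combinations(triple, 2) if b in adj[a])
        if edges >= 2:
            incomplete += 1
            if edges == 3:
                complete += 1
    return complete / incomplete if incomplete else 0.0
```

On a path graph the clustering coefficient is zero, while a triangle with a pendant vertex has one complete triangle among three connected triples.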
Our question here is: does a network simulated from our model look like the observed network? In other words, could one believe that the observed network is a realization of our model? To answer this, we consider the three network characteristics mentioned above, and empirically and visually compare the simulated results to the observed data. The heuristic justification underlying this approach is that, considered jointly, these three characteristics are able to discriminate between a range of different types of networks (see also Jackson, 2008, and Zafarani et al., 2014). We start by presenting results for the diameter and the clustering coefficient on 7th December 2012. As described in Section 3, where the degree distribution was discussed, we divide the edges between bike stations into six regimes by considering the tour frequencies between the stations on that day. Figure 6 shows the histograms of the simulated diameters in the different regimes. We see that, in Figure 6e (as before in Figure 4e), the simulation and the reality appear to coincide nicely; in other words, for a moderate number of tours our model seems to fit well. It is interesting to note that our model performs differently in the different regimes, suggesting that edges with different activity levels have to be modeled differently. Finally, in Figure 7, we see the histograms of the simulated clustering coefficients. The true value in the corresponding regime is shown in the title of each plot. Overall, the performance appears reasonable. In particular, in Figure 7d the histogram is nicely centered around the true value. Interestingly, the performance in the fifth regime (l_1 = 5 and l_2 = 12), shown in Figure 7e, is not as good as in the others. One explanation might be that different covariates are needed here.
In Figure 14a, we see one simulated graph compared to the true graph. The color of an edge indicates how many tours happened on it relative to the other edges: the lowest 25% of the edges are colored green, the next 25% yellow, then orange, and the highest 25% of edges are colored red. Because the activity takes integer values, it is not exactly 25% of the edges that are green, and so on. The size of a vertex is proportional to its degree. We see that the model is able to find the important (i.e. high degree) vertices. For the edges, we see that some red edges are in the wrong places, but generally the vertices with high profile edges are recognized. The remaining graphs in Figure 14 show the same comparison for the two other dates under consideration, and we see that the results are similar. Figures 8 to 13 show the results of the corresponding simulations for the other two dates.
Overall, the results are similar. It should be pointed out that, even though the model is not able to reproduce every feature perfectly accurately, the simulated networks are still very close to the true observations. This becomes more striking if we remind ourselves that only six parameters were used.

Bandwidth choice
Under our assumption that the covariates stay constant over the day, it makes sense to consider only integer bandwidth lengths (for us, one day has length one). In order to choose the bandwidth, we apply a one-sided cross-validation approach (cf. Hart and Yi, 1998, and Mammen et al., 2011), which was briefly motivated in Section 3 and which we now describe in detail.
Let K and L be kernels fulfilling the assumptions in the paper, and denote by θ̂_K(t_0) and θ̂_L(t_0) the maximum likelihood estimators using K and L, respectively. Then, by Theorem 2.1, the asymptotic bias and variance of θ̂_K(t_0) can be written as C_1 h² B(K) and C_2 V(K)/(l_n h), respectively, where B(K) and V(K) are constants depending only on the kernel, and where the constants C_1 and C_2 depend on the true parameter curve θ_0 and the time t_0, but not on the kernel. The corresponding expressions for θ̂_L(t_0) are obtained by replacing every K with an L. The decomposition of the asymptotic mean squared error into squared bias plus variance yields the asymptotically optimal bandwidths h_K and h_L, minimizing the asymptotic mean squared error.
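If one writes the asymptotic bias as C_1 h² B(K) and the variance as C_2 V(K)/(l_n h), with kernel-dependent constants B and V (this generic parametrization is an assumption made for illustration), then minimizing C_1² B(K)² h⁴ + C_2 V(K)/(l_n h) over h gives h_K proportional to (V(K)/B(K)²)^{1/5}, so the constants C_1, C_2 and l_n cancel in the ratio of optimal bandwidths:

```python
def optimal_bandwidth_ratio(B_K, V_K, B_L, V_L):
    """rho = h_L / h_K for AMSE-optimal bandwidths, where
    h ~ (C2 * V / (4 * C1**2 * B**2 * l_n))**(1/5); the problem-specific
    constants cancel, so rho depends on the kernels only.  B_* and V_*
    are placeholder bias and variance constants of the two kernels."""
    return ((V_L * B_K ** 2) / (V_K * B_L ** 2)) ** 0.2
```

This is exactly the mechanism exploited by one-sided cross-validation: the CV-optimal bandwidth for the one-sided kernel L is rescaled by the known ratio to obtain the bandwidth for the kernel K actually used.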