Connectivity Inference from Neural Recording Data: Challenges, Mathematical Bases and Research Directions

This article presents a review of computational methods for connectivity inference from neural activity data derived from multi-electrode recordings or fluorescence imaging. We first identify biophysical and technical challenges in connectivity inference along the data processing pipeline. We then review connectivity inference methods based on two major mathematical foundations, namely, descriptive model-free approaches and generative model-based approaches. We investigate representative studies in both categories and clarify which challenges have been addressed by which method. We further identify critical open issues and possible research directions.


Introduction
Understanding the operational principles of neural circuits is a major goal of recent international brain science programs, such as the BRAIN Initiative in the U.S. (Insel et al. 2013;Martin and Chun 2016), the Human Brain Project in the E.U. (Markram 2012;Amunts et al. 2016), and the Brain/MINDS program in Japan (Okano and Mitra 2015;Okano et al. 2016). A common emphasis in these programs is utilization of high-throughput, systematic data acquisition and advanced computational technologies. The aim of this paper is to present a systematic review of computational methods for inferring neural connectivity from high-dimensional neural activity recording data, such as multiple electrode arrays and calcium fluorescence imaging.
Why do we need to infer neural connectivity? High-dimensional neural recording data tell us a lot about information representation in the brain through correlation or decoding analyses with relevant sensory, motor, or cognitive signals. However, in order to understand the operational principles of the brain, it is important to identify the circuit mechanisms that encode and transform information, such as extraction of sensory features and production of motor action patterns (Churchland and Sejnowski 1992). Knowing the wiring diagram of neuronal circuits is critical in explaining how such representations can be produced, predicting how the network would behave in a novel situation, and extracting the brain's algorithms for technical applications (Sporns et al. 2005).
The network of the brain can be analyzed at various spatial scales (Gerstner et al. 2014). At the macroscopic level, there are more than a hundred of anatomical brain areas and connection structure across those areas give us an understanding of the overall processing architecture of the brain. At the mesoscopic level, the connections of neurons within each local area, as well as their projections to other areas, are characterized in order to understand the computational mechanisms of neural circuits. At the microscopic level, the location and features of synapses on the dendritic arbors of each neuron are analyzed to understand the operational mechanisms of single neurons.
This review focuses on the mesoscopic level, inferring the connections between neurons in local circuits, utilizing multi-electrode recordings or fluorescence imaging data. Connectivity inference at the macroscopic level, investigated using MRI data, and the microscopic level using electron microscopy data, are beyond the scope of this review, but some of the methods, especially those of model-free approaches, may also be applicable to connectivity inference at macroscopic level. This paper presents an overview of a variety of challenges in network inference and different mathematical approaches to address those challenges. We first review the data processing pipeline of connectivity analysis and identify both biophysical and computational difficulties. From mathematical view point, we classify connectivity inference methods broadly into descriptive, model-free approaches and generative, model-based approaches and explain representative methods in each category. We then examine which methods offer solutions for specific challenges and identify open issues and important 2 research directions.
There have been several recent reviews on specific mathematical frameworks in network connectivity inference, such as the Bayesian approaches (Chen 2013) and the maximum entropy method (Yeh et al. 2010;Roudi et al. 2015). Some reviews focus on macroscopic connectivity analysis using MRI (Friston 2011;Lang et al. 2012;Sporns 2012;. The reports from the First Neural Connectomics Challenge (http://connectomics.chalearn.org) reviews the top-ranked methods for connectivity inference from calcium imaging data (Orlandi et al. 2014a;Guyon et al. 2014).
It is important to distinguish several types of connectivity that have been addressed previously (Aertsen et al. 1989;Friston 2011;Valdes-Sosa et al. 2011). Functional connectivity (FC) is defined as statistical dependence among measurements of neuronal activity. It can be computed using correlation or other model-free methods (see section 4). Effective connectivity (EC) characterizes the direct influence exercised between neuron pairs after discounting any indirect effects. EC is usually computed by optimizing the parameters of a model that is assumed to have generated the observed data (see section 5). Anatomical connectivity signifies existence of actual synapses, either excitatory or inhibitory. Even if there is an anatomical connection between neurons, the connection may not be detected when, for example, the source neuron is inactive or the recipient neuron is irresponsive due to strong inhibition by other neurons. Detection of functional or effective connectivity does not warrant the existence of anatomical connectivity. An example would be two neurons that receive common inputs from a third neuron.
Neural connectivity can be described at difference levels of detail: existence, direction, sign, magnitude, and temporal dynamics. Which level of description is most reliable and useful depends on constraints on the instrumentation and the amount of data available. One aim of this paper is to clarify how those constraints affect the choice of inference and the validation methods for a given application.
Although it is beyond the scope of this review, graph theoretical analysis of the inferred network can play a key role in understanding the interplay between the brain structure and its function (Bullmore and Sporns 2009). Such graph theoretic characterization includes clustering and connectivity degree distribution (Shimono and Beggs 2014;Bonifazi et al. 2009;Yu et al. 2008). These abstract metrics facilitate comparison of the structure of diverse neural populations. For example, Hu et al. (2016) proposed a method to relate the network statistics of connectivity of linear point processes or Hawkes models (see section 5.1.6) to its function.

Data Processing Pipeline
This section presents a generic data processing pipeline, starting with data acquisition and continuing with data pre-processing, network inference, post-processing, and validation of results. Figure 1 shows the overall view of the experimental setup. Figure 1: Generic data processing pipeline, starting with data acquisition and continuing with data pre-processing, network inference, post-processing, and validation of results.

Data Acquisition
Multiple electrodes and fluorescence imaging are two major methods for making high-dimensional measurements of single neuronal activities.

Multiple electrode recording
For in vivo multiple neural recordings, the most commonly used device is the so-called "tetrode," a bundle of four wire electrodes. By implanting tens of tetrodes and applying a spike sorting method, it is possible to record hundreds of neurons simultaneously (Jog et al. 2002;Buzsaki 2004). Linear or matrix arrays of electrodes, often using semiconductor technologies, are also used for recording hundreds of neurons near the cortical surface (Fujishiro et al. 2014). For brain tissue slices and cultures, electrode grids patterned on a plate enable recording of hundreds of neural activities (Fujisawa et al. 2008;Shimono and Beggs 2014). In such in vitro experiments, it is also possible to use intracellular glass electrodes to record sub-threshold membrane potentials of selected neurons in a population (Brette and Destexhe 2012).

Optical imaging
Optical imaging allows activity data gathering from hundreds to thousands of neurons simultaneously, using voltage sensitive dyes (VSDs) or genetically encoded calcium indicators (GECIs). Chemical VSDs offer high temporal resolution, but lack the cell type-specificity that GECIs offer. However, genetically encoded voltage indicators (GEVIs) are in active development (Yang and St-Pierre 2016). Recently GECIs have become increasingly popular because they can be expressed under the control of cell-type specific promoters (Looger and Griesbeck 2011). The weaknesses of GECIs has been their slow temporal response and low sensitivity, but recent GECIs have begun to achieve time constants on the order of ten milliseconds, making detection of single spikes feasiible (Pnevmatikakis et al. 2016).
Fast CCD cameras and confocal or two-photon laser scanning microscopes are commonly used with GECIs. CCD cameras enable simultaneous imaging of all neurons in the focal plane, allowing recording of as many as a thousand frames per second, but measurements are limited to neurons near the surface of the tissue. Two-photon microscopes use infrared light, which is less subject to refraction and excites fluorescent molecules only at the focal point, allowing recording of neurons several hundreds of microns beneath the surface (Lütcke and Helmchen 2011;Dombeck et al. 2009). Most recently, head-mount miniature microscopes using gradient index (GRIN) rod lenses have allowed access to deep neural structures, such as the hippocampus of awake behaving animals (Ziv et al. 2013).

Pre-processing
Pre-processing steps depend on the recording methods and the specific input requirements of the network inference method.
In multiple electrode recording, each electrode can receive signals from multiple neurons and also the signal from the same neuron can be detected by multiple electrodes. Spike sorting algorithms are used to identify the spikes from each neuron by applying principal component analysis, independent component analysis, and any biophysical knowledge of spike shapes and intervals (Shimono and Beggs 2014;Mahmud et al. 2015).
The task in optically imaged data pre-processing is to transform an image sequence into a multi-dimensional time-series of neural activities. Preprocessing steps for optical imaging are: i) image segmentation to identify regions corresponding to each neuron, ii) extraction of the fluorescence trace for each neuron and iii) spike inference (Pnevmatikakis et al. 2016). These pre-processing operations have to deal with light scattering, motion artifacts, and slow fluorescence changes with respect to the underlying membrane potential.

Connectivity inference
Connectivity inference methods can be largely classified into two classes. Descriptive model-free methods are based on descriptive statistics without assuming any particular process that generated the data. On the other hand, generative model-based methods assume a certain mathematical model that generates the data and infer the parameters and structure of the model. We will explain those methods in detail in the subsequent sections.
Most connectivity inference methods require time-series of spikes from each neuron in the population. In some studies, spike inference and connectivity inference are performed using an integrated optimization algorithm directly from time series of fluorescence (Mishchenko et al. 2011). Some other studies use fluorescence signals directly for model-free analysis of connectivity without a explicit spike inference mechanism (Veeriah et al. 2015).

Post-processing
After the connectivity matrix is obtained by applying any of the connectivity inference methods, it is often useful to perform post-processing to achieve a biologically realistic result. For example, inference methods that consider only simultaneous activities yield only symmetric connectivity matrices. A heuristic method to determine the direction is to use the average temporal order of activations of two neurons (Sutera et al. 2014). Another issue is that the inferred connectivity can be a mix of direct causal effects between neuron pairs and indirect effects through other neurons. One way to address this problem is to use matrix deconvolution (Feizi et al. 2013;Barzel and Barabási 2013;Magrans and Nowe 2014), as described in section 6.1.2.
Furthermore, employment of a network inference method depends on the choice of parameters. It is possible to improve both robustness and accuracy of the connectivity matrix by combining several matrices computed using different parameters and/or different inference methods (Sutera et al. 2014;Magrans and Nowe 2014).

Validation
In order to validate the connectivity inference method itself, the use of synthetic data from a network model with a known connection matrix is the first step. For estimating connectivity of real biological neural networks, validation is more challenging, as anatomical or physiological verification of all pair-wise synaptic connectivity is extremely laborious.

Synthetic data
The strength of synthetic data for validation is that we have all information about simulated connectivity and other biophysical parameters of the model. Therefore, we can use standard error metrics to measure the similarity between the inferred connectivity matrix and the one used for data generation.
When the final objective is to infer connection weights, one difficulty with many methods is that they deliver estimated connectivity values with different scales. The relative mean squared error whereŴ ij is the estimate of W ij , can alleviate the scaling problem (Fletcher and Rangan 2014). If the objective is to infer the graphical structure of the network, what matters is the binary existence/nonexistence of connections. The Area Under the Curve of the Receiver-Operator Characteristic (AUROC) is a popular performance metric used in such a case (Guyon et al. 2014;Garofalo et al. 2009;Stetter et al. 2012). The ROC Curve describes the relationship between the False Positive (FP) Rate ( F P F P +T N ) and the True Positive (TP) Rate ( T P T P +F N ) at different thresholds. The perfect classifier has an AUROC of 1, while a random classifier has an AUROC of 0.5. However, this metric may overestimate the performance in highly biased data-sets (Schrynemackers et al. 2013). The Area under the Precision Recall Curve (AUPRC) was proposed as an alternative measure to improve validation accuracy of sparsely connected neural networks (Orlandi et al. 2014a;Sutera et al. 2014). Specifically, it is the area below the curve that describes the relationship between the Precision ( T P T P +F P ) versus the Recall ( T P T P +F N ) at different thresholds. In order to find the best trade-off between T P s and F P s, Garofalo et al. (2009) defined the Positive Precision Curve (PPC), which describes the relationship between the True False Sum (T F S = T P + F P ) and the True False Ratio (T F R = T P −F P T F S+T N ). The peak of this curve can be used to extract a binary connectivity map from a weight matrix inferred by a network inference method. To assess the required recording duration and bin size to achieve a desired reconstruction performance, Ito et al. (2011) proposed the use of the curves of the TPR at a fixed FPR as a function of different recording durations and bin sizes.

Real data
In real neural recording, although the ground truth of the connectivy is not generally available, we can assess the quality of inferred connections using statistical significance testing and cross validation. Significance testing is used to accept or reject the null hypothesis (H 0 ) that a connection between two neurons does not exist (Lizier 2014). To approximate H 0 , we can run our network inference method on many surrogate time-series created by perturbing the training time-series such that it destroys the connectivity information (Fujisawa et al. 2008;Lizier et al. 2011;Shimono and Beggs 2014;Oba et al. 2016). Then, the test itself consists of computing the probability that the inferred connectivity value is generated uniquely by chance. However, sometimes an accept/reject test is not enough, as for instance, when we wish to infer a weighted connectivity matrix or parameters like the synaptic delay or a time constant from a biophysical model. Model structures and/or parameter sets in a certain class can be compared with each other using relative quality measures like the likelihood ratio and other criteria that penalize larger complexity models like the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) (Aho et al. 2014).
An alternative strategy is cross validation, in which the inferred model is tested against a separate test data set that was not used for model inference. A standard method in probabilistic model-based methods is to compute a normalized likelihood of the model for a test data set (Gerwinn et al. 2010). In addition, comparison of any statistical features, such as average firing rates and spike-train statistics of the sample data produced from the inferred model and the real data is helpful in validating the inferred model (Pillow et al. 2005). When a certain graph-theoretic feature of the network is known, for example, by previous anatomical studies, comparison of such features, such as the node degree distribution, can be a helpful way of validation (Bullmore and Sporns 2009).

Challenges
This section summarizes two types of challenges in connectivity inference: biophysical and technical. We first describe complexities arising from biophysical properties of neurons and synapses, and then from technical difficulties due to constraints in instrumentation and computation.

Biophysical Challenges
Apparent connectivity: As mentioned in the Introduction, functional connectivity (FC) and effective connectivity (EC) may not be due to direct synaptic connectivity. A typical example is a common input that activates two or more neurons simultaneously. In such a case, functional connectivity can be inferred between the recipient neurons even if there is no direct connection between them. Other cases are a common input with different time delays or an indirect connection through a third neuron, which can activate two neurons in sequence. In this case, effective connectivity can be inferred even if there is no direct synaptic connection between them.
Directionality: Methods based on simple correlation cannot detect the direction of the causal or temporal relationship. Even when the temporal order of activities is considered, if the time resolution of the measurement or analysis is coarse, ordered activation can appear as simultaneous activation, making it difficult to determine directionality.
Cellular diversity: Many neurons show intrinsic refractoriness such that spike frequency gradually drops even with a constant level of input. Some neurons also show a burst property such that once excited above a threshold, they keep spiking even without excitatory inputs. In such cases, it is not straightforward to discriminate whether a change in the activity of a neuron is due to some network dynamic or the neuron's intrinsic properties (Gerstner et al. 2014: chap. 1). Such refractoriness or burstiness can be categorized as inhibitory or excitatory self-connection, but whether it actually results from self-connecting synapses (autapse) (Bekkers 2009) must be carefully interpreted. Even in a local circuit, there are many different types of neurons with a large diversity of biophysical parameters (Kandel et al. 2000a). Thus a simple assumption of uniform cellular properties may not be valid.
Synaptic diversity: Diversity also applies to connectivity because underlying synapses can have complex and diverse characteristics (e.g. excitatory/inhibitory, facilitatory/depressive, delays, etc.) (Kandel et al. 2000b). Understanding how a given effective network emerges may require the inference of many additional parameters beyond a simple weight matrix.
Non-stationarity: Synaptic weights are subject to both short-term (from tens of milliseconds to few minutes) and long-term changes (from minutes to hours) (Gerstner et al. 2014: chap. 19). Physiological states of neurons can also drift over time, especially under experimental manipulations, as in slice preparations, with electrodes inserted, or with exposure to light. In vitro cultured neural networks often manifest switching between states of low and high firing rates. When using fluorescence imaging, the high firing state can cause erraneous connectivity inference because the low temporal resolution makes it difficult to discriminate between direct and indirect effects (Stetter et al. 2012).

Technical Challenges
Noise: Every instrument is subject to noise. Electrodes can pick up both biophysical noise (e.g. thermal noise) and anthropogenic noise (e.g. electromagnetic interference from power lines) (Van Drongelen 2006). Optical imaging is susceptible to light scattering artifacts, and motion artifacts in awake subjects with in vivo imaging. Even though motion correction programs can track shifts of neurons within an image, movement of cells out of the focal plane is difficult to compensate.
Time/space resolution: Multi-electrode recording can monitor activities of a few hundred neurons with sampling rates on the order of 10 kHz (Shimono and Beggs 2014). Fluorescence imaging can work with a trade-off of space and time, ranging from few hundred neurons recorded at 100 Hz to a hundred thousand neurons, or 80 % of a zebra fish brain, at a rate of 0.8 Hz (Ahrens et al. 2013). The poor temporal resolution of fluorescence imaging has the undesirable feature of mixing indirect causal effects with real direct connectivity effects.
Hidden neurons/External inputs: Even though recent advances in twophoton imaging have enabled us to capture thousands of neurons at one time (Ahrens et al. 2012), it is still difficult to simultaneously record the activity of every single neuron in a circuit. Neglection of hidden neurons can lead to spurious detections of connectivity between neurons connected via the hidden neurons. Even though most experiments take into account sensory stimuli or motor actions in the analysis of neural activities, the brain shows varieties of intrinsic dynamics. Unknown external inputs, especially those common to multiple neurons, can also result in a detection of spurious connections.
Prior knowledge: Prior knowledge about the anatomy and physiology of the neural network under investigation can be leveraged to enhance connectivity inference. How to incorporate prior knowledge about the composition of neuron types in a population (Shimono and Beggs 2014;Sohn et al. 2011) or connection probabilities between different type of neurons (White et al. 1986) to constrain the solution space is an important issue in connectivity inference. Additional data, such as the local field potential (Destexhe and Bedard 2013), can also facilitate understanding of non-stationary behaviors at the microcircuit level.
Scalability: As the number of neurons measured simultaneously grows from hundreds to thousands, the number of potential connections grow from tens of thousands to millions. Therefore, inference methods need to be designed to maximize computational efficiency and to minimize the cost of parallelized implementations (Gerstner et al. 2014: chap. 10).

Model-Free Methods
We first review model-free methods, which do not assume any mechanism to generate the observed data. These methods tend to be simpler than model-based methods, but they are not able to generate activity data for model validation or prediction. We will review model-free methods in two categories. The first is based on descriptive statistics and the second on information theoretic concepts. Table 1 summarizes the major model-free methods and examples that use those methods.
We denote the neural activity data set as is the activity of neuron i at the t th time point. x i (t) may be continuous, for instance, when the values are raw data obtained from multiple electrodes or fluorescence imaging, or it may be binary when data are transformed into a spike train by a spike-sorting algorithm.

Descriptive Statistics
This class of methods utilize statistical measures to capture the degree of connection between neurons from a sample of activities from a neural population.  (2014); Veeriah et al. (2015): simulation data that assumes calcium imaging • Assumes the ground truth connectivity labels of a training data set that has similar properties as the network that we want to reconstruct

Correlation
Correlation indicates the strength of the linear relationship between two random variables that represent two neurons. The most commonly used measure of correlation between activities x i and x j of two neurons i and j is the Pearson correlation coefficient defined as: is the mean activity. When we use correlation ρ ij to perform network inference of neuronal circuits, we are measuring the rate of co-occurring spikes of neurons i and j (Cohen and Kohn 2011) and this rate is interpreted as the functional connectivity strength. While Pearson correlation is computationally least costly, it has a number of drawbacks. It is not able to indicate the causal direction and it is not able to distinguish direct connections from indirect ones. It is also not suited to deal with external inputs. Despite such limitations, a winning solution of the First Neural Connectomics Challenge used correlation as a key component in a more complex method (see section 6.1.2).

Cross correlation
Cross correlation (CC) indicates the strength of the delayed linear relationship between two neurons i and j. Knox (1981) defines CC as "the probability of observing a spike in one train x j at time t + τ , given that there was a spike in a second train x i at time t". Ito et al. (2011) remark that, despite the extensive literature, there is not a standard definition and they discuss the two most popular ones: where the parameter τ defines the delay of neuron j with respect to neuron i, and µ and σ are the sample average and standard deviation, respectively. The second CC definition uses the total number of spikes instead of the averages and standard deviations: where n i and n j are the total number of spikes from neuron i and j respectively. Both cross-correlation definitions tend to be equivalent when µ i and µ j approaches 0. CC is a causal indicator which is able to indicate the direction of the connection. The inference performance of the connection direction is dependent on the instrumentation sampling rate and the choice of the time delay τ . Despite the added capability of being able to detect the direction of connectivity, CC has the same limitations as correlation in dealing with indirect connections and external inputs.
We should also note that the optimal parameter τ to detect a connection can be different for each connection. A measure to address this issue is the Coincidence Index (CI), which combines several CCs computed at different τ s: where r specifies the interval of cross-correlation delays, called the coincidence interval. A large CI indicates a larger reproducibility of correlated spike timing (Tateno and Jimbo 1999;Chiappalone et al. 2007;Shimono and Beggs 2014). Garofalo et al. (2009) evaluated the network inference performance of several model-free methods using data from simulated neuronal networks with only excitatory synapses and including both excitatory and inhibitory connections. According to both ROC and PPC criteria, CC demonstrated a performance just below transfer entropy (TE, see section 4.2.3) in the fully excitatory setting, and below TE and joint entropy (JE, see section 4.2.2) when inhibitory connections were also included. Ito et al. (2011) evaluated the inference performance of the two CC variants discussed in this section and several transfer entropy variants (see sections 4.2.4, 4.2.5 and 4.2.6) using data from simulated neuronal networks with both excitatory and inhibitory connections. According to the ROC criterion, both CC variants showed a performance inferior to all variants of transfer entropy.

Partial correlation
Let R = [R ij ] be an inverse of a covariance matrix Σ of which the (i, j) th component is Σ ij . The definition of partial correlation (PC) between activities of neurons i and j is: 14 The most salient difference between PC and other methods described in this section is that PC takes into account the activities of all neurons in the population to compute a connectivity indicator between each neuron pair. An important property of PC is that, assuming that x(t) = (x 1 (t), . . . , x P (t)) is normally distributed, then P C ij is 0 if and only if neurons i and j are independent given all the rest. Therefore, PC can be used to distinguish a direct effect from an indirect one. Despite this multivariate feature, PC shares the same limitations as correlation and CC in dealing with external inputs, with the additional computational complexity required to invert the covariance matrix. The first prize solution of the First Neural Connectomics Challenge, discussed in section 6.1.1, is a good example of how to use PC for network inference (Sutera et al. 2014).

Information Theoretic Methods
Information theory is a mathematical discipline initiated by Shannon to characterize the limits of information management, including transmission, compression and processing (Shannon 1948). This section presents the application of several information theory measures to the inference of neural microcircuits.

Mutual information
Mutual information (MI) is a measure of the statistical dependence between stochastic variables. MI of activities x i and x j of two neurons is mathematically defined as: This indicator is symmetric; therefore it is unable to identify the direction of the connection. It cannot discriminate direct effects from indirect effects. One way to overcame the directionality limitation is to introduce a delay as in CC, and to consider multiple delays as in CI (Garofalo et al. 2009). In a comparative study by Garofalo et al. (2009) using data from simulated neuronal networks, MI delivered the performance below TE, JE and CC, according to both ROC and PPC criteria.

Join entropy
Joint entropy (JE) is a bi-variate causal measure between the activity of two neurons, testing whether neuron i is a cause of the activity in neuron j.
For each reference spike in x i , the closest future spike on x j is identified and the cross-inter-spike-interval is computed as cISI = t x i − t x j , where tx i and t x j are the time of successive spikes of neuron i and j, respectively. Then JE is defined as the entropy of the cISI distribution: where k indexes the level of cISI. If neurons i and j are strongly connected, then JE is expected to be close to 0 as the cISI distribution becomes sharp. In a comparison by Garofalo et al. (2009), JE showed the worst performance below TE, CC and MI in a fully excitatory setting, but the second best performance below TE when inhibitory connections were also included .

Transfer entropy
Transfer entropy (TE) is a causal indicator defined between the activity of two neurons i and j. The definition of TE was originally formulated as (Schreiber 2000): TE is the Kullback-Leibler divergence between the distributions of the neural activity x j conditioned by its previous activities alone versus conditioned also by previous activities x i . TE can be interpreted as the reduction of uncertainty in future values of x j by knowing the past values of x i , given past values of x j . TE can also be defined as the gain in mutual information of the future activity of neuron j knowing the past activities of both neurons i and j campared to i alone: From this second definition, we can see that TE is a positive measure that takes a low value when the past activity of neuron i does not convey information about the future activity of neuron j.
TE is equivalent to Granger causality for Gaussian variables (Barnett et al. 2009). A nice feature of TE is that it can detect non-linear relationships (Stetter et al. 2012). Although TE does not identify whether the connection is excitatory or inhibitory, the polarity of connection can be tested separately, or often known by the cell types recorded or simulated (Stetter et al. 2012;Orlandi et al. 2014b). The performance of TE in discriminating the causal direction depends whether the sampling rate is faster than the network dynamicse (Fujisawa et al. 2008;Shimono and Beggs 2014)).
In practice, the parameters k and l in Eq. (9) are set to 1 (e.g., Garofalo et al. (2009)), so that we only have to consider a small number of patterns (i.e. 2 3 = 8 for binary data). While increasing k and l can allow detection of delayed effects of a connection, it requires larger amount of data to have reliable estimates of the probability distribution to compute the entropies.
Before the First Neural Connectomics Challenge, TE was arguably the most successful model-free method (Garofalo et al. 2009;Stetter et al. 2012). This success encouraged a number of variants as follows (Ito et al. 2011;Stetter et al. 2012;Orlandi et al. 2014b).

Delayed transfer entropy (DTE)
A limitation of TE using k = l = 1 is that it assumes a constant one-time bin delay between the action potential in neuron i and the post-synaptic action potential of neuron j, which is not a realistic assumption. A suboptimal way to deal with this problem is to create longer time bins at the cost of losing detailed temporal information. Instead, delayed TE (DTE) was proposed to measure TE at a user-defined time delay d (Ito et al. 2011). Compared to the original definition, instead of x i (t − 1), ..., x i (t − l) in Eq. (9), they suggest using a delayed signal x i (t − d), ..., x i (t − d − l + 1).

High order transfer entropy (HOTE)
Ito et al. (2011) considered how to increase k and l to take into account longer spike train history, while avoiding negative effects of the increase in the possible patterns (2 k+l+1 ). The additional parameters d in DTE and k and l in HOTE gives multiple measures for each neuron pair. They evaluated the reconstruction performance by the peak value and the coincidence index (equation (5)) of DTE and HOTE, using a simulated network with both excitatory and inhibitory connections. HOTE CI and DTE CI had better performance than HOTE pk and DTE pk , which in turn were better than TE with k = l = 1.

Generalized transfer entropy (GTE)
As described in section 3, periods of synchronized bursting convey very low connectivity information due to the simultaneous spike of a large percentage of neurons. This phenomenon is especially critical with calcium imaging recordings because of their limited temporal resolution. Generalized TE (Stetter et al. 2012) was proposed to alleviate this problem by two modifications to equation (9): i) to compensate for the slow sampling rate in fluorescence imaging (≥ 10ms), GTE uses presynaptic activity from the same time bin x i (t) instead of the previous one x i (t − 1); ii) to discard the poor connectivity information conveyed by synchronized bursting periods, GTE restricts the computation of TE to those time bins with an overall network activity below a given user-defined threshold.
Just like TE, GTE is not able to differentiate between excitatory and inhibitory connections. The application of GTE to simulated neural networks with excitatory connections shows an improved performance with respect to TE, and modified versions of CC and MI implementing the same generalization (Stetter et al. 2012). Orlandi et al. (2014b) showed that GTE is able to reconstruct inhibitory connections; however, its reconstructions performance is below its reconstruction performance of excitatory connections.

Information gain
Information gain (IG) is a less known causal indicator between the activity of two neurons i and j. Its definition is: where H is the entropy, x i is the activity vector of the source neuron and x j is the activity vector of the target neuron. In the case of calcium imaging recordings, it may be convenient to employ the same strategy as with GTE by using the sample from the same time bin x i (t) instead of the previous one x i (t − 1). This connectivity indicator was used as one of multiple indicators combined by one of the top solutions of the First Neural Connectomics Challenge (Czarnecki and Jozefowicz 2014).

Supervised Learning Approach
Motivated by recent successes of deep neural networks in challenging patter classification problems, a new approache has been proposed to apply convolutional neural networks (CNNs) for prediction of the existence of synaptic connectivity (Romaszko 2014;Veeriah et al. 2015). In this approach, the time series data of spike trains or fluorescent signals of a pair or more of neurons are given as the input and a CNN is trained to classify if there is a connection between the neurons.
A natural limitation of this approach is that training requires sufficient amount of ground truth data about the existence of connectivity. Although this approach has been successful with simulated data (Romaszko 2014;Veeriah et al. 2015), it is hard to apply them directly to real data, for which experimental identification of synaptic connectivity is highly costly.

Model-Based Methods
Now we review model-based methods, in which the connectivity is estimated by explicitly modeling the data generation process. The basic paradigm of the model-based method is: 1) to assume a generative model that generates neural data, and 2) to determine the model parameters to fit observed data. Table 2 summarizes the major model-based methods and examples that used those methods.
In the following, x(t) = (x 1 (t), . . . , x P (t)) represents a set of P signals observed at the t th time point in a discrete time domain, where x i (t) denotes the i th neuron's activity at that time. x i (t) may be continuous when the measurement is raw data collected from calcium imaging and multiple-electrode recording, or binary when the data is transformed into a spike train by a spike-sorting algorithm. We denote a generative model p θ (x), where θ is a parameter vector, including the connection weights.

Generative Models 5.1.1. Autoregressive model
A basic example of a generative model is the autoregressive (AR) model mathematically expressed as where K is the degree of the model, A ij (k) is a parameter called an AR coefficient, A i0 is a bias term, and N (·|µ, σ 2 ) denotes the Gaussian distribution with mean µ and standard deviation σ (For simplicity, we assume that σ is known in this example). By integrating two equations in Eq. (12), x i (t) can be regarded as a sample according to the following conditional distribution: where θ ≡ {A i0 |i = 1, . . . , P } ∪ {A ij (k)|i, j = 1, . . . , P ; k = 1, . . . , K} is a set of parameters in this case. After determining the parameter θ, if A ij (k) for all k is exactly or very close to zero, it implies that there is no direct interaction from the j th neuron the i th neuron. In contrast, if A ij (k) for some k deviates enough from zero, the j th neuron x j can directly affect the i th neuron x i with a k time-step delay.
Even though neuron's dynamics is usually nonlinear, for the virtue of simplicity, this method has been used in a wide range of neural data analysis, in addition to calcium imaging and multiple electrode recording (Harrison et al. 2003;Valdés-Sosa et al. 2005;Franaszczuk et al. 1985;Smith et al. 2011).

Generalized linear model
When x i (t) is a binary variable indicating whether a spike is generated at the t th time step, the generalized linear models (GLM) (Song et al. 2013) provides a tractable extension of the AR models. A GLM describes spike generation as a point process as where Ber(·|p) denotes the Bernoulli distribution with an event probability of p. φ(·) is a so-called an inverse link function, for which the exponential function φ(x) = exp(x) or the sigmoid function φ(x) = (1 + exp(−x)) −1 ) is often used.

Stochastic leaky integrate-and-fire model
A stochastic leaky integrate-and-fire (LIF) model (Paninski et al. 2004;Koyama and Paninski 2010;Isomura et al. 2015), is one of the most widely used model for analyzing the behavior of spiking neural networks (Gerstner et al. 2014). The model assumes that the subthreshold membrane potential of the i th neuron, denoted by V i , evolves according to the following stochastic differential equation: where g i is the membrane leak conductance and dW i (t) is an increment of a Wiener process. I ij is an influence from the j th neuron to the i th neuron, which is often assumed to be where κ ij (s) represents the effect of the the j th neuron to the i th neuron after a time delay s and t (f ) j is the f th spike of the j th neuron. In the hard-threshold version (Paninski et al. 2004), whenever V i (t) goes over a threshold V th , the neuron generates a spike. In the soft-threshold case (Koyama and Paninski 2010;Isomura et al. 2015), the probability for the neuron to generates a spike in a small time interval dt is given by where f (·) is a nonnegative intensity function. When σ i (t) = 0 in Eq. (16), the solution is given by If we assume that κ ij (·) is the dirac's delta function and that t is much greater than 1/g i . Then, Eq. (19) can be approximated by By choosing f (V ) ∝ exp(−β(V − V th )) and discretizing the time so that dt = 1, the combination of Eqs. (20) and (18) reduces to the GLM (15) with A i0 = −βV th , A ij (k) = β exp(−g i k), and exponential inverse link function φ(·) = exp(·). For more detailed discussion, please refer to (Gerstner and Kistler 2002;Gerstner et al. 2014;Paninski et al. 2007).

Network likelihood models
Network likelihood models (NLMs) (Okatan et al. 2005;Stevenson et al. 2009) are often adopted as a generative model for spiketrain data in a continuous-time domain. Let N i (t) be the total number of the ith neuron's spikes observed before time t. In NLMs, it is assumed that N i (t) follows an inhomogeneous Poisson process P oisson(·|λ i (t)) with the conditional intensity function where is a convolution of a linear filter ξ k (·) and a spiking history of the j th neuron. A typical example of the linear filter is a rectangular windows of duration W defined as NLMs can be converted into GLMs if the spiking history N i (t) is discretized into bins with a fixed window W , and W is so small that any bin has at most a single spike.

Calcium fluorescence model
When x i (t) is a continuous variable indicating the intensity of calcium fluorescence, we have to consider the fact that the fluorescence signal has a fast initial rise upon an action potential followed by a slow decay. A standard model to reflect this feature is as follows (Mishchenko et al. 2011): where z i (t) denotes the intracellular calcium concentration and s i (t) is a spike indicator variable such that s i (t) = 1 if the i th neuron fires at time t and s i (t) = 0 otherwise. θ = {α i , a i , b i , τ i |i = 1, · · · , P } comprises the parameters of the model. Nonlinearities such as saturation can also be modeled as proposed by . In general, we cannot directly observe the variable s i (t) in this setting. To cope with this issue, Eqs. (24) and (25) are often combined with Eqs. (15a) and (15b) in which x i (t) is replaced by s i (t).
5.1.6. Hawkes process model  propose a Hawkes multivariate process to model a network of purely excitatory neurons. The model is considered as a generalized linear model (section 5.1.2) with Poisson observations and a linear link function. The network model is defines by a binary matrix that indicates the existence of a directed connection between two neurons, a second weight matrix to represent the strength of each connection, and a vector specifying the transmission delay distribution for each directed connection.  enhanced the computational performance of their former approach (Linderman and Adams 2014) with a new discrete time model to assume that neurons do not interact faster than an arbitrary time unit and a variational approximation of the former Gibbs sampling solution in order to make the model fully conjugate.

Dynamic Bayesian Network
Dynamic Bayesian Networks (DBN) (Murphy 2002) extend Bayesian Networks for time-series modeling. A DBN is usually defined as a directed acyclic graph where nodes represent random variables at particular times, and edges indicate a conditional probability dependence P (x i,t |x j,t−k ) where where x i,t denotes the i th neuron's activity at that time t. Eldawlatly et al. (2010) demonstrated the feasibility of using DBM to infer the effective connectivity between spiking cortical neurons simulated with a generalized linear model (see section 5.1.2). They use a simulated annealing algorithm to search over the space of network structures and conditional probability parameters. Patnaik et al. (2011) present a significantly faster fitting algorithm that consists of identifying, for each node, parent-sets with mutual information higher than a user-defined threshold. An upper bound of the Kullback-Leibler between the inferred distribution and the true distribution is defined as a function of the user-defined threshold.

Maximum entropy model
The maximum-entropy model (Yeh et al. 2010;Roudi et al. 2015) assumes that the network state probability distribution is given by an exponential function of the network energy E[S i ] = E[(x 1 , ..., x n )] such that the entropy is maximized while satisfying any statistical constraints. When the first and second-order statistics are given, the state probability distribution is given by (26) This is an extension of the Ising model (McCoy 2010) with spatial connections potentially occurring between any neurons as well as temporal correlations.
The main limitation of the maximum entropy model is its computational complexity. Recent studies demonstrated its applicability to a few tens of neurons using only first and second-order statistics without temporal correlations (Yeh et al. 2010).

Estimation of Model Parameters 5.2.1. Maximum likelihood method
The standard method to determine the model parameters is the maximum likelihood (ML) method. The likelihood of a parameter vector θ given a data set D is defined as p(D|θ), the probability of reproducing the data. Here we denote denotes the negative log-likelihood function as J(θ|D) = − log p(D|θ). In the ML method, the parameters are chosened such that When the generative model uses a Gaussian distribution for observed data, the ML method reduces to the least square method. For example, in the case of the AR model ( (12)), maximization of the log likelihood is equivalent to minimization of the sum of squared residuals where D ≡ {x i (t)|i = 1, . . . , P ; t = 1, . . . , T } denotes the observed data set.
This optimization can be achieved analytically in the case of least square problem, or in general by iterative optimization algorithms such as the gradient descent methods. When the model is represented using a set of hidden variables, a standard way is to use the expectation-maximization (EM) algorithms.

Regularization and Bayesian inference
A disadvantage of the ML method is that it often suffers from overfitting when the number of parameters is large relative to the amount of data. A common way to deal with this issue is to introduce regularization to the parameters. From Bayesian statistical viewpoint, it can be considered as assuming a prior distribution for the parameters.
The objective function with a regularization term is given as where R(·) is a non-negative function and λ is a constant that controls the strength of regularization. The most common regularizer is L2-norm regularizer, or the Ridge regularizer, where r indexes the element of the set of parameters. In the Bayesian framework, this is equivalent to assuming a Gaussian prior distribution for the parameters. Another common regularizer is the L1-norm regularizer, or least absolute shrinkage and selection operator (LASSO) regularizer, which favors sparse solution with many parameters being zero. From Bayesian viewpoint, this is equivalent to assuming a Laplace (exponential in absolute value) prior distribution. Beside introducing regularization, Bayesian inference has several advantages over the ML method. For example, if the model has hidden variables, Bayesian inference is often used for estimating them as well as the model parameters (Mishchenko et al. 2011). Also, the reliability of each value in the parameter space can be evaluated as the posterior distribution. The high density region of the posterior distribution is called the Bayesian credible region, which can be used as an alternate to the confidence interval in the statistical test (Bowman et al. 2008). Another advantage is that the marginal likelihood, defined by the Bayes' theorem, can be used for a criterion to select the best of several possible models (Friston and Penny 2011) (Section 2.5).

Approximate Bayesian inference methods
While Bayesian inference offers favorable features as above, exact computation for the posterior distribution and the marginal likelihood is often analytically intractable. Thus it is it in practice important to select an approximation method. In the following, we briefly review approximation methods for the Bayesian inference. Details of their derivations and algorithms is beyond the scope of this paper and can be found in other articles (Chen 2013;Murphy 2012).

Monte Carlo Sampling approximates the target distribution (or value)
as an aggregation (or average) of the finite number of random samples. Gibbs sampling (Casella and George 1992), Metropolis-Hastings (Hastings 1970) and sequential importance re/sampling (Liu and Chen 1998) algorithms are typical examples of this class. 2. Laplace Approximation (Raftery 1996) is a deterministic approximation applicable to cases in which the MAP estimator can be easily obtained. The central idea is to approximate the target distribution as a multivariate Gaussian distribution using the Hessian matrix (the second-order partial derivatives) of the logarithm of the posterior distribution; thus, the Laplace approximation is not suitable for cases in which the posterior distribution is multi-modal or asymmetric. 3. Variational Bayes (Attias 1999) is another deterministic approximation method derived from the variational mean-field theory in statistical mechanics. Let us consider a case in which we want to obtain the posterior distribution p(Z|D), where Z = (z 1 , . . . , z m ) is a set of all unknown variables (i.e. hidden variables and model parameters) such that z i (i = 1, . . . , m) is a partition of Z. Also, consider the probability distribution family of Z that can be factored as q(Z) = M i=1 q(z i ). In the variational Bayes method, we approximate p(Z|D) by optimizing q(Z) to minimize the KL divergence between q(Z) and p(Z|D). Optimization is achieved by an algorithm similar to the expectation and maximization (EM) algorithm. Variational Bayes is suitable for cases in which the joint distribution p(D, Z) is an instance of the exponential families (Wainwright and Jordan 2007).

Case Study
In this section, we review several examples of connectivity inference study, from both model-free and model-based approaches. We focus on the challenges addressed by each of the studies at each stage of the data processing pipeline (Figure 1).

Model-Free Approaches
The First Neural Connectomics Challenge (Guyon et al. 2014) encouraged the development of very diverse solutions. Remarkably, the top three solutions were all model-free methods. The organizers provided simulated neural network datasets for training and testing, as well as key references about the problem and sample code to get started. Neural network datasets were simulated using leaky integrate and fire neurons with short term synaptic depression implemented in the NEST simulator and the spike trains were transformed to fluorescence time-series using a fluorescence response model of calcium markers inside neurons. Each dataset consisted of one-hour timeseries of neural activities obtained from fluorescence signals sampled at 20ms intervals with values normalized in the interval [0, 1]. Organizers also provided information about the position of each neuron in an area of 1 mm 2 and inter-neuron connectivity labels (i.e. excitatory connection or no connection). A validation dataset was provided for performance comparison by the Area Under the ROC Curve (AUC; see section 2.5). Finally, the score on a separated test dataset was used for the final ranking of teams.
The three best solutions and the baseline method provided by the challenge organizers all consisted of several components: data pre-processing, main connectivity inference, and post-processing steps as discussed in section 2 (Figure 2 ). Below we review these top-ranked methods.

Partial-correlation and averaging
The first prize solution (Sutera et al. 2014) achieved a performance of 0.94161 (AUC score). Its pre-processing consisted of four steps: i) low-pass filtering to remove noise, excessive fluctuations, and light scattering artifacts, ii) backward differentiation for spike detection, iii) hard thresholding to eliminate small values below a positive parameter τ , and iv) weighting the importance of spikes as a function of the inverse of network overall activity. The motivation for this last step was to eliminate the effects of synchronized bursting periods, much as generalized transfer entropy does (see section 4.2).
The main connectivity inference employed the Partial-Correlation (PC) matrix (Section 4.1), using the real valued signal computed in the previous pre-processing steps. The post-processing consisted of two steps: i) Several PC matrices were averaged according to different τ s and different low pass filters to increase the robustness of the solution. ii) The symmetric average PC matrix was transformed into a directional indicator by multiplying each edge by an orientation weight computed according to the average activation delay between each neuron pair.
The main difficulty in scaling this solution to larger neural networks is the computation burden for calculating the inverse matrices. As an example, the computation of the challenge solution connection matrix with 1000 neurons took 30 hours on a 3 GHz i7 desktop PC with 7 GB of RAM.

Matrix deconvolution and adaptive averaging
The major features of the second prize solution (Magrans and Nowe 2014) are that it used matrix deconvolution for eliminating indirect connections and introduced learning to optimally combine several connectivity matrices rather than just using simple averaging. The pre-processing pipeline consisted of three steps: i) spike train inference based on the OOPSI algorithm ), ii) hard thresholding according to a parameter value τ , as in the first prize solution, and iii) removal of time segments with overall network activity above a given threshold θ.
Connectivity inference employed Pearson correlation between the real valued signal computed from the pre-processing. A network deconvolution algorithm (Feizi et al. 2013) was then used to eliminate the combined effect of indirect paths. If the effect of direct connection is given by matrix W d , under the assumption of linearity, the combined effects of all the direct and indirect connections follows From this relationship, the matrix deconvolution method estimates the direct connection matrix W d from the observed connections matrix W o by Finally, the connectivity matrices computed according to different values of τ and θ are combined, but unlike the first prize solution, using a function learned with a non-linear regression method for optimal performance. Computation of the challenge solution took nearly 48 hours on a 3 GHz i7 desktop PC with 32 GB of RAM. The most significant limitation of this solutions is its high computational cost. It does not try to infer self-connections. It does not try to identify the causal direction. It assumes that both training and testing networks have similar statistical properties.

Convolutional neural network approach
The third prize solution (Romaszko 2014) proposed to automatically extract features from binary spike train pairs using a Convolutional Neural Network (CNN). Pre-processing consisted of three steps: i) spike train inference using discrete differentiation, ii) hard thresholding as in the first prize solution, and iii) spike train normalization in [0,1]. Feature extraction was done with a CNN based on Lenet5 (LeCun et al. 1998) followed by a softmax layer that produced binary output in the event of a connection. This method was not designed to evaluate directions of connections.
Input data consisted of binary matrices of 3 by 330 where rows 1 and 2 corresponded to segments of the spike trains of two neurons. Row 3 contained information about the overall network activity, making it possible to identify synchronous bursting episodes. Network training was done with gradient descent.
The computational cost for this solution is very high. For instance, the challenge solution took more than 3 days on 8 server machines working in parallel, where each machine was equipped with 32 GB of RAM and GPU unit with 2496 cores. In addition to the extremely high computational cost, this solution has the same limitations than the second prize solution.
After the Connectomics Challenge, Veeriah et al. (2015) further advanced the CNN approach by integrating both the pre-processing step of spike train inference and connectivity inference into a single neural network architecture to surpass the first prize performance using the same dataset. Its architecture consisted of two sub-networks and a final classification layer: i) a convolutional neural network with max-pooling layers responsible for identifying relevant shapes in the fluorescence time-series with tolerance for minor time translations, ii) a recurrent neural network (RNN) to model temporal sequences of relevant events. These are duplicated to capture the features of each neuron pair. Finally, a dynamically programmed layer aligned the RNN outputs and computed a connection probability for each neuron pair. Remarkably, this method does not require a separate pre-processing step to handle the synchronized bursting phenomenon.
Although the concept of an end-to-end artificial neural network is appealing, the performance comparison in Veeriah et al. (2015) deserves further consideration. Their reconstructed performances for generalized transfer entropy and partial correlation methods are lower than the performance documented during the First Neural Connectomics Challenge using the same data sets (Guyon et al. 2014;Orlandi et al. 2014a;Sutera et al. 2014).
While these CNN-based approaches presented good performance with simulated data, it remains to see if such classifiers generalize well to real data, for which ground truth training data are rarely available.

Other model-free approaches
Besides the main results of the First Neural Connectomics Challenge, several in vivo and in vitro studies rely on model-free methods. Fujisawa et al. (2008) use silicon microelectrode arrays for in vivo recording from layers 2,3 and 5 of the medial prefrontal cortex in rats during a working memory task. Connectivity inference is performed by identifying sharp spikes or troughs in the cross-correlograms. The connectivity significance is assessed by first creating several sets of slightly perturbed spike trains and then computing the statistical significance of the original cross-correlograms with respect to those constructed with the perturbed data. (Cohen and Kohn 2011) discuss recent surge in cortical processing studies enabled by recording technologies like microelectrode arrrays and calcium imaging (see section 2.5.2). They summarize possible causes of discrepant findings in several correlation based studies due to technical reasons like poor spike sorting; experimental factors like different external stimulus, different time bins and/or spike train durations or confounded effects due to non-stationary internal states. They suggest that multivariate point process models provide a more complete and statistically principled approach to cope with the challenges of inferring the connectivity structure.
6.2. Model-Based Approaches 6.2.1. Generalized linear models Pillow and colleagues applied GLM frameworks for a stochastic integrateand-fire neuron model (Paninski et al. 2004) to the data of a complete neural recording from ganglion cells in a retina slice (Pillow et al. 2008). They could construct a model including both the stimulus response and cross-connections of all neurons and showed through model-based analysis that correlated firing can provide additional sensory information.
Stevenson and colleagues derived a Bayesian inference framework for a spiking neuron model with arbitrary synaptic response kernels (Stevenson et al. 2009). They applied the method to neural recording data from monkey motor cortex and used the log likelihood and correlation coefficient criteria for cross validation. They further used infinite relational model clustering (Kemp et al. 2006) to detect cluster structures among recorded neurons.
While Stevenson et al. (2009) used arbitrary synaptic kernel functions with discrete time bins, Song and colleagues (Song et al. 2013) proposed the use of a set of kernel functions, such as B-splines, with which smoothness of kernel functions are built-in with a relatively small number of parameters. Oba et al. (2016) propose a GLM framework combined with empirical Bayesian testing. Null samples are created by time-shifting real spike trains for a sufficiently large time lag. They show an improved computational performance without decreasing inference performance on a simulated neural network. They also apply this method to real calcium imaging data from a cultured slice of the CA3 region of a rat hippocampus.

Calcium fluorescence model
In applying a connection estimation method to calcium fluorescent imaging data, a basic way is to estimate spike trains from fluorescence waveforms by spike deconvolution and then to apply a spike-based connectivity inference algorithm. Mishchenko et al. (2011) proposed a framework combining both steps of spike estimation and connection estimation into a unified stochastic generative model. Fletcher and Rangan (2014) proposed a computationally less expensive variant of this approach.

Challenges and Solutions
While model-free methods usually address pre-and post-processing separately from connectivity inference, model-based methods often address issues such as noise and connection directionality in the main inference process by incorporating those factors into generative models.
The distinction between apparent and real connections, as well as their directionality, is incorporated into neural dynamics equations in model-based methods. Weights most consistent with the data are selected by the maximum likelihood or maximum posterior probability criterion. Among modelfree methods, transfer entropy methods address connection directionality directly during the inference, while removal of apparent connections can be addressed by matrix deconvolution in post-processing (Magrans and Nowe 2014). Partial-Correlation can also remove aparent connections but is not able to identify the causal direction. Sutera et al. (2014) proposes a postprocessing step to identify the causal direction. Table 3 summarizes the solutions that were devised to overcome challenges along the data processing pipeline.

Non-stationarity
Modeling synaptic plasticity is a key to understand how memory and learning mechanisms are realized in neural circuits. Recent solutions to this challenge incorporate synaptic dynamics into a GLM framework and improved scalability within Bayesian ) and convex optimization settings (Stevenson and Koerding 2011). Using a model-free approach to understand the evolution of synaptic weights, Wollstadt et al. (2014) proposed a transfer entropy estimator that requires an ensemble of spike trains from equivalent and independent experiments.
Synaptic plasticity is not the sole source of non-stationarity. Time periods when a large percentage of neurons fire at a high rate is an additional nonstationary phenomenon and a source of confusion for most network inference methods. Generalized transfer entropy (Section 4.2.6), screens out the use of non-informative time periods. A similar selection could be developed for other model-free methods.
In section 6.1 we also discussed how winning solutions of the First Neural Connectomics Challenge chose a modular approach with a dedicated preprocessing step to remove non-informative time periods before applying different connectivity indicators. We further discussed a deep neural network solution by (Veeriah et al. 2015) that surprisingly, despite the lack of any pre-processing step or any specific design feature to remove high-rate time periods, is able to achieve improved performance with respect to the winning solution. A generalization of the above bi-modal switching behavior can also be accommodated with a model-based approach using a hidden Markov model .

Architecture
Pre/post-processing methods have the potential to improve the network reconstruction performance across different methods. For instance, modular architectures like the top solutions of the connectomics challenge (Section 6.1) implement a number of pre-processing steps to reduce noise and to infer spike trains. However, in applying such modular approaches, care has to be taken not to lose valuable information about the underlying direct connectivity and the directionality. On the other side of the spectrum, the approach using a deep neural network (Veeriah et al. 2015) is unique in providing solutions for multiple pre-processing and post-processing steps within a coherent modelfree architecture.
In the model-based approach, Mishchenko et al. (2011) and later Fletcher and Rangan (2014) employed time-series of neural activities obtained from fluorescence signals as inputs, while the spike times at an arbitrary sampling rate were additional latent variables. Table 3 summarizes the solutions that were devised to overcome each of the challenges. We can observe that each solution was implemented either as a pre/pro-processing step or within the main inference method.

Scalability
A major drawback to model-based methods is their scalability. The more sophisticated the model, the more parameters need to be estimated, which requires more data sampled from a stationary distribution. Prior knowledge, such as sparseness in connections, can be addressed in model-based methods by assuming a prior distribution in the model parameters and by applying Bayesian inference. Introduction of a sparseness prior, or an equivalent regularization term, can make inference from smaller samples more reliable, but this often increases the computational burden.

Hidden Neurons
While neural recording methods are progressing rapidly allowing wholebrain imaging with cellular resolution in smalls animal like Caenorhabditis elegans (Nguyen et al. 2016) and zebra fish (Ahrens et al. 2013), monitoring all neurons in the brain is still not possible. In typical calcium fluorescent imaging, neurons in only one section of a cortical column are recorded. Neglecting hidden neurons within the network, or unobservable external inputs, can infer erroneous connections. Methods for addressing this hidden input problem are still in early stages of development (Roudi et al. 2015).
The issue of hidden nodes, however, is not the only problem in neural connectivity inference. Similar problems have been addressed in gene regulation and signaling cascade networks, for examples, and methods that have been developed in other fields of computational biology and network science may provide helpful intuition and guidance (Su et al. 2014).
The challenge of hidden nodes groups different application scenarios under the same name, but the common denominator here is a generative model that consists of observed and hidden components. The GLM (Section 5.1.2) and maximum entropy settings (Section 5.1.8) are the most common approaches to describe the observed neurons. However, the key differential contribution in all cases is the hidden neurons model. A single, latent process can model the hidden inputs as random effects, enhancing network inference among observed neurons by avoiding false connections due to common hidden inputs (Kulkarni and Paninski 2007;Vidne et al. 2012). A more sophisticated group of contributions proposes multiple latent processes, for instance to study the feasibility of biophysically plausible hypotheses about how multi-level neural circuits are able to learn and express complex sequences (Rezende et al. 2011). Switching behavior, described in the previous section, can also be seen as a special case of the hidden nodes challenge.

Incorporating Prior Knowledge
Detailed anatomical synaptic maps (Yook et al. 2013), and connectivity maps between cell types across cortical layers (Potjans and Diesmann 2014) are valuable sources of prior information that we should exploit to improve both inference and computational performance. A straightforward modelbased approach would be to exploit anatomical prior information using more sophisticated regularizers like the adaptive elastic net (Zou and Zhang 2009), by embedding anatomical information in the adaptive weights instead of computing them using ordinary least squares (Wu et al. 2016). Prior information could also be incorporated as a post-processing method using graph sparsification algorithms to preserve certain graph theoretical properties (Ebbes et al. 2008;Lindner et al. 2015).
A possible way to deal with hidden neurons when analyzing cortical microcircuits is to take advantage of the highly replicated structure across layers and cortical columns. A possible framework to implement this idea, proposed by (Kim and Leskovec 2011), infers the parameters of a Kronecker model. In other words, a low-dimensional connectivity graph that approximately selfreplicates across observed and the hidden parts. Within this application, the exploitation of prior knowledge could enhance both computational and inferential performance.
The network inference problem could also be considered a process of selecting for each neuron a sub-set of pre-synaptic input neurons. Therefore, it would be reasonable to explore the possibility of applying feature selection algorithms (Liu and Motoda 2012) to the problem at hand. From this point of view, multi-task feature selection algorithms (Obozinski et al. 2006;Wang et al. 2016;Zhou et al. 2010) may be another interesting way to take advantage of the highly replicated structure across layers and cortical columns. These algorithms propose clever ways to jointly learn features across several tasks, maximizing information sharing while minimizing negative transfer. Therefore, application of these methods has the potential to deliver superior inference and computational performance with respect to single-task learning approaches while using the same amount of data.
A common assumption of many inference methods is a simplistic model structure that does not represent the true synaptic and cellular diversity of local neural microcircuits (see section 3.1). Future research methods should aim for approaches able to fit more biophysically plausible models. Recent wide-field calcium imaging of thousands of neurons over millimeters of brain tissue (Mohammed et al. 2016), the large number of latent parameters to model hidden neurons and non-stationarity, are all sources of increasing computational complexity and a strong impetus to continue improving the computational efficiency of network inference methods (Lee et al. 2016). Table 3: Summary matrix of the solutions that were devised to overcome each of the challenges. We can observe that each solution was implemented either as a pre/proprocessing step or within the main inference method.

Conclusion
In this paper, we reviewed methods for inference of neural connectivity based on activity recordings from a large number of neurons. We first identified biophysical and technical challenges along the data processing pipeline and then formulated model-free and model-based approaches for the core process of connectivity inference. We further investigated in previous works how those challenges were addressed using what methods. As a result, we identified favorable methods issues that deserve further technical developments, most notably coping with hidden neurons.
Connectivity inference itself is an interesting and deep mathematical problem, but the goal of connectivity inference is not only to precisely estimate the connection weight matrix, but also to illustrate how neural circuits realize specific functions, such as sensory inference, motor control, and decision making. If we can perfectly estimate network connections from anatomical and activity data, then computer simulation of the network model should be able to reproduce the function of the network. But given inevitable uncertainties in connectivity inference, reconstruction of function in a purely data-driven way might be difficult. How to extract or infer a functional or computational network from a data-driven network, or even to combine known functional constraints as a prior for connectivity inference, is a possible direction of future research.