BrainFrame: A node-level heterogeneous accelerator platform for neuron simulations

Objective: The advent of High-Performance Computing (HPC) in recent years has led to its increasing use in brain study through computational models. The scale and complexity of such models are constantly increasing, leading to challenging computational requirements. Even though modern HPC platforms can often deal with such challenges, the vast diversity of the modeling field does not permit for a single acceleration (or homogeneous) platform to effectively address the complete array of modeling requirements. Approach: In this paper we propose and build BrainFrame, a heterogeneous acceleration platform, incorporating three distinct acceleration technologies, a Dataflow Engine, a Xeon Phi and a GP-GPU. The PyNN framework is also integrated into the platform. As a challenging proof of concept, we analyze the performance of BrainFrame on different instances of a state-of-the-art neuron model, modeling the Inferior- Olivary Nucleus using a biophysically-meaningful, extended Hodgkin-Huxley representation. The model instances take into account not only the neuronal- network dimensions but also different network-connectivity circumstances that can drastically change application workload characteristics. Main results: The synthetic approach of three HPC technologies demonstrated that BrainFrame is better able to cope with the modeling diversity encountered. Our performance analysis shows clearly that the model directly affect performance and all three technologies are required to cope with all the model use cases.

recent years has led to its increasing use in brain study through computational models. The scale and complexity of such models are constantly increasing, leading to challenging computational requirements. Even though modern HPC platforms can often deal with such challenges, the vast diversity of the modeling field does not permit for a homogeneous acceleration platform to effectively address the complete array of modeling requirements. Approach: In this paper we propose and build BrainFrame, a heterogeneous acceleration platform that incorporates three distinct acceleration technologies, an Intel Xeon-Phi CPU, a NVidia GP-GPU and a Maxeler Dataflow Engine. The PyNN software framework is also integrated into the platform. As a challenging proof of concept, we analyze the performance of BrainFrame on different experiment instances of a state-of-the-art neuron model, representing the Inferior-Olivary Nucleus using a biophysically-meaningful, extended Hodgkin-Huxley representation. The model instances take into account not only the neuronal-network dimensions but also different network-connectivity densities, which can drastically affect the workload's performance characteristics. Main results: The combined use of different HPC fabrics demonstrated that BrainFrame is better able to cope with the modeling diversity encountered in realistic experiments while at the same time running on significantly lower energy budgets. Our performance analysis clearly shows that the model directly affects performance and all three technologies are required to cope with all the model use cases. Significance: The BrainFrame framework is designed to transparently configure and select the appropriate back-end accelerator technology for use per simulation run. The PyNN integration provides a familiar bridge to the vast number of models already available. Additionally, it gives a clear roadmap for extending the platform support beyond the proof of concept, with improved usability and directly useful features to the computational-neuroscience community, paving the way for wider adoption.

Introduction
In-vivo and in-vitro experiments are a traditional tool of neuroscientific research. They are powerful experimentation methods, but are also time-consuming and not always reliable. A number of factors can contaminate results like, for example, the influence of anesthesia in in-vivo experiments. What is more, most systemic neuroscientific phenomena require the monitoring of biological systems of very large scale and many such techniques do not allow for this kind of study. Computational neuroscientists use Spiking Neural Networks (SNNs) to cope with such issues.
By incorporating SNN models of varied complexity (which themselves are derived by biological experiments) they create predictive simulators that can test their scientific hypotheses and drive more targeted, thus more reliable and refined, biological experimentation [1].
A major challenge of executing such simulations is the sheer computational complexity that many SNN models entail, compared to simpler modeling classes. Traditional methods of computing, in which the common simulation tool-flows (such as MATLAB or specific neuromodeling tools like NEURON or Brian) are executed, are not up to the task of simulating neural networks of realistic sizes and high detail within a reasonable timeframe for brain research. High-Performance Computing (HPC) has been recently recognized as being able to provide a variety of solutions to cope with this limitation [2][3][4][5][6][7]. Unfortunately, the challenge of executing such simulation applications does not stop just at providing the necessary computational power.
In scientific applications such as neuronal simulations, modeling accuracy has a direct impact on simulation speed. The variety of options of viable SNN models used in studies is significant. Every type of model has scientific merit, depending on the subject under study, and models exhibit different characteristics when treated as computational workloads [6,8]. Modeling features like the inter-neuron connectivity density (the modeling of which also varies according to the biological system under study) can break the embarrassingly parallel (data-flow compatible) nature that most neuron models have, significantly changing the behavior of the application.
Depending on the desired model characteristics, we identify two general types of simulations that are relevant in neuroscientific experiments. The first one has to do with highly accurate (biophysically accurate and even accurate to the molecular level) models of smaller-sized networks that requires realtime or close to real-time performance. These kinds of experiments can be used with artificial real-time setups or brain-machine interfaces (BMI) and are closely related to brain-rescue studies (TYPE-I experiments). The second type involves the simulation of largeor very large-scale networks in which accuracy can often be relaxed. These experiments attempt to simulate network sizes and connection densities closely resembling their biological counterparts (TYPE-II experiments) [5] [9]. This, in combination to the variety of models commonly used, makes for a class of applications that vary greatly in terms of workload, while also, depending on the case, requiring high throughput, low latency or both. A single type of HPC fabric, either software-or hardware-based cannot cover all possible use cases with optimal efficiency.
A better approach is to provide scientists with an acceleration platform that has the ability to adjust to the aforementioned variety of workload characteristics. A heterogeneous system that integrates multiple HPC technologies, instead of just one, would be able to provide this. In addition, a framework for a heterogeneous system using a popular user interface for all integrated technologies can also provide the ability to select a different accelerator, depending on availability, cost and performance desired.
Such a hardware back-end must overcome additional challenges to be used in the field. It requires a front-end which should provide two crucial features: • An easy and commonly used interface through which neuroscientists can employ the platform, without the constant mediation of an engineer.
• A front-end that can reuse the vast amount of models already available to the community.
In this paper, we propose a framework for an heterogeneous acceleration platform for computationally challenging neuroscientific simulations called Brain-Frame. By using this system, we demonstrate the effect of model characteristics on performance and thus make a concrete case for the significance of employing heterogeneity in HPC systems used the field of computational neuroscience. To this end, we use a state-of-the-art, extended Hodgkin-Huxley (biophysically-meaningful) model [10] of the inferior-olivary nucleus (abbrev. In-fOli) as a benchmark to evaluate the framework. We chose this model as a respective workload of such neuron representations, as their efficient simulation poses a significant engineering challenge. Even though this model is not the most biophysically accurate representation in the field, it is one of the most accepted and widely used models for brain simulations. We evaluate BrainFrame using three distinct instances of the workload, each differentiated by the presence and complexity of the neuron interconnectivity modeling, leading to vastly different computational requirements, while still reflecting realistic neuroscientific experiments. We propose a front-end for the framework based on the PyNN language [11]. PyNN has been widely adopted by the computational-neuroscience community and has direct integration with many other well-known neuron modeling frameworks, covering both aforementioned features that such a front-end would require.

The Inferior Olive
The inferior-olivary nucleus forms an intricate part of the olivocerebellar system, which is one of the most dense brain regions and plays an important role in sensorimotor control. Activity in the inferior olive probably only directly triggers movements, when it is synchronized among multiple neurons [12,13]. In addition, the olivary neurons can provide rhythm and coordination signals for motor functions [14]. It is considered to be imperative for the instinctive learning and smooth completion of motor actions [15]. The olive provides one of the two main inputs to the cerebellum through the climbing fibers.
What makes the inferior-olive neurons special is their dense interconnection through electrical connections called gap junctions (GJs), which differ from typical synapses in that they are purely electrical. The gap junctions facilitate the synchronization behavior between the olivary neurons and, subsequently, influence the synchronization and learning properties of the entire olivocerebellar system [14].

The InfOli Workload
In this work, a detailed inferior-olive (InfOli) model is considered, which was originally developed by De Gruijl et al. [1]. It implements a neuron with three distinct compartments, the dendrite, the soma and the axon. Within the dendrite, the model also includes gap junctions (thus the characterization as "extended"), while the cell output represents the input to the climbing fibers (Figure 1b). The GJs are associated with important aspects of cell behavior as they are not just simple connections; rather, they involve significant and intricate electrical processes, which is reflected in their modeling details.
Every compartment includes a number of state parameters denoting its electrochemical state and the neuron state as a whole. The neuron states are updated at each simulation step; every new state update is based upon: The neuron state of the previous simulation step of the executed neuron, the previous dendritic states coming from the GJ connectivity and the externally evoked input to the network, representing the input coming from the rest of the cerebellar circuit.  The three compartments and GJs are evaluated/updated concurrently at each simulation step. The model is calibrated with a simulation time step of δ = 50 µsec. Simulations steps are identical to each other in terms of operations performed. This simulation step also defines the real-time behavior of the whole network. Figure 1a depicts a representation of the InfOli network model. This network model defines effectively a transient simulator through computing discrete output axon values in time steps which, when integrated in time, recreate the output response of the axon (Figure 1c).
The InfOli network must be synchronized in order to guarantee the correct exchange of previous dendritic data within a step. Thus, the execution can only be parallelized in space (simultaneous evaluation of neurons within a simulation step), but not in time (parallelization of multiple simulation steps). The cells -even when not actively spikingpresent an oscillatory behavior, thus affecting network synchronization. As a result, event-driven execution of the network model is not an option.
By profiling the application using a operation and memory-access profiler [16], it is revealed that the GJs have great impact on the total model complexity. As seen in Table 1, the total number of floating-point (FP) operations needed for simulating a single step of a single cell including a single GJ are 871 ‡. For many complex experiments, it is not the number of connections but, rather, the connectivity density (C) that is indicative of neuron interconnectivity. That is, the average percentage of the total neuron inventory to which neuron cells are connected (measured in %), ‡ Table numbers have been updated to amend a profiling mistake reported in previous work [8].

Application Use Cases
For our analysis, we employ three use cases drawn from [8], which are representative of the memory and computational requirements of the InfOli workload. All of the use cases are realistic instances of the InfOli application and have neuroscientific merit. They can also be considered as plausible instances of multi-compartmental modeling using HH models with various cases of modeled inter-neuron connectivity. The application allows for the connectivity of the InfOli network to be programmable by the user before the simulation is deployed. Network connectivity (when present) is defined by an N × N connectivity matrix (where N : Network size) of FP weights signifying the weight of each connection. Weights are used in the GJ computations to calculate the connection impact on each neuron. The three use cases focus around the biological complexity of the GJs: (i) InfOli with Realistic Gap Junctions (RGJ) -InfOli HH cells modeled with (biophysically) realistic GJ kinematics as presented in [1]. The highest amount of detail is included in the GJ modeling.
(ii) InfOli with Simplified Gap Junctions (SGJ) -InfOli HH cells modeled with simplified GJ kinematics. This constitutes a simpler connectivity compared to the RGJ use case.  (iii) InfOli with No Gap Junctions (NGJ) -InfOli HH cells modeled without GJ kinematics modeling. This is the simplest use case, whereby the neurons are modeled as independent computational islands.
In Figure 2, we see the amount of FP operations, based on the aforementioned profiling of the InfOli application. The FP operations are calculated for each of the aforementioned use cases for different connectivity densities. From the same profiling run we can derive the compute (in FLOPS) to memory (in single-FP memory accesses) ratio for the application, that reveals whether each use case is computation-or memory-bound ( Figure 3).

InfOli with Realistic Gap Junctions (RGJ)
This use case represents a fully featured version of the InfOli application. The complex Gap Junction dominates the computation in this use case. GJs here are implemented as a very specific representation of the biological nucleus (Algorithm 1). Each cell C in a population of N cells accumulates the influence of an interconnected cell to it (through a GJ) by subtracting its own dendritic voltage (prevVdend) from The compute-to-memory-access ratio (from Figure 3) suggests also that this use case is strongly computation-bound for all connectivity cases: With increasing problem sizes, the computations increase at a much faster pace than the memory-access requirements.

InfOli with Simplified Gap Junctions (SGJ)
The level of detail as in the RGJ case is useful for many modeling experiments but is also an overkill in many other cases that more simple rudimentary connection are involved (like simple synapses that accumulate inputs). Lighter workloads are represented by the SGJ case. We assume a use case of the InfOli application that simplifies the connection between neurons to a few simple input accumulators. The computations per simulation step are now 859 * N + 4 * N 2 * C. The accumulation is parameterized using the weight that is assigned to each connection between two neurons, thus the connectivity information needs to be accessed the same way as is in the RGJ case. The actual FP operations are reduced by about one order of magnitude compared to the previous use case (see Figure 2). Yet, the connectivity aspect still disrupts the pure dataflow nature of the model. A high compute-to-memory ratio is seen here as well, since the computations still increase at a faster pace than the memory requirements.

InfOli with No Gap Junctions (NGJ)
This is the case where the application becomes purely dataflow and can achieve the greatest parallelism possible. The processing requirements scale almost linearly with the network size and, compared to the other use cases, fewer computations are needed, as shown in Figure 2 (computations per simulation step : 859 * N ). As we can see in Figure 3, although the NGJ use case shows that computation is still the most important aspect of the application, both computation and memory access scale linearly and at a similar pace.

HPC Fabrics and Implementation
Our heterogeneous platform incorporates three accelerator fabrics; a Maxeler Maia Data-Flow Engine (DFE) board [17], an Intel Xeon Phi 5110P CPU [18] and a Maxwell-based Titan X GPU by NVidia [19] (Table 2). All there boards are PCIe-based which is how they communicate with the host system. The use of PCIe interfaces ensures that composition of BrainFrameenabled machines can been easily tailored on a per-case basis depending on the availability of funds and hardware resources of a research laboratory. Different types and mixes of PCIe-based accelerators can be selected.
The Maia DFE is a Maxeler HPC technology based on reconfigurable hardware. Its tool flow is designed and optimized to accommodate the acceleration of dataflow applications; that is, applications with the bulk of their implementation using purely raw computations with the absence (partially or totally) of branching execution or feedback paths. The Maxeler tools can exploit the nature of dataflow applications to implement uniquely massive pipelines, maximizing the throughput and overall performance. The DFE boards also incorporate a high-bandwidth, multichannel, highly parallel, customizable interface to the onboard DRAM memory resources (up to 96 GBs) making it ideal for scientific applications. The DFE board used in our experimental setup is a 4th-generation Maia-DFE board implemented using an Altera Stratix V 5SGSD8 chip.
The Xeon Phi is a Many Integrated Core (MIC) architecture co-processor which features 61 cores, each capable of supporting up to 4 instruction streams. The generation of Phi cards used in this work, named Knight's Corner, are programmed using well-known programming tools such as OpenMP and OpenCL. However, and in contrast to GPU mentality, the Phi can also be thought of as an accelerator that can act as a stand-alone processor and even features its own Operating System. This is expected to increase memory-consistency and cache-coherency delays.
GP-GPUs have also been prominent in the HPC domain and in scientific computing in particular. The Titan X includes 3,072 CUDA micro-cores, which are used to parallelize computation execution, and 12 GB of on-board RAM. GPU implementations also benefit from the generally good adoption of the NVidia CUDA-library open environment that allows porting of applications with similar ease to the Phi OpenMP and OpenCL frameworks. GPUs also come at a relatively lower cost than the other two accelerator types. However, as opposed to the the Xeon Phi, a GPU cannot act as its own host increasing communication delays between host and accelerator during execution.
Lastly, it must be noted that BrainFrame is to be used in scientific research that is very dynamic  and fast-paced. The goal is not to over-optimize the different accelerator implementations, but to propose and maintain a balance between the programming effort and optimization needed, resulting in shorter development times for cutting-edge research tools. In real research, such development times should be kept short so as not to delay the scientific process.

Infoli on the Maia DFE
The DFE implementation of the InfOli application can be seen in Figure 4 and is a more advanced version of the work done in [20]. New features include the addition of programmable connectivity and programmable neuron state by the user between experiment runs without the need to resynthesize the design. The design implements three pipelines on the DFE hardware to accelerate the application, one for each part of a neuron (Dendrite, Soma, Axon), executing the respective computations. The state parameters for each neuron are stored on separate BRAM blocks for fast reading/updating of the network state, as they are the data that are most used throughout the experiment execution. Since every new neuron state is dependent only on the network state of the pre-vious simulation step, a single copy of each neuron state is required at any point during execution. The input stream to the DFE kernel originates in the on-board DRAM and represents the evoked (external) inputs, used in the dendritic computations comprising the network input. The initialization data are also streamed in from the on-board memory only once at the start of execution. The size of the connectivity matrix makes it impossible to store on the on-chip memory. It is, thus, placed on the on-board RAM and streamed in batches dictated by the computations. The kernel output is streamed back to the on-board memory and -at the same time -is updated in the (on-chip) BRAM blocks of the DFE. The program flow is tracked using hardware counters monitoring GJ loop iterations (except for the NGJ case), the neurons executed and the number of simulation steps concluded. The data flows through the DFE pipelines with each kernel execution step (or tick ) consuming the corresponding input or producing the respective output and new state at the correct execution points according to the hardware counters. DFE execution naturally pipelines the execution of different neurons within one simulation step. Simulation steps are not themselves directly parallelizable, as every neuron must have the previous state of all other neurons available for its GJ computations (only in the RGJ or SGJ cases) before a new step begins. The DFE pipeline is, thus, flushed before a new simulation step begins execution. This dependency is lifted when in the NGJ case. The GJ calculations form a loop that must finish before the rest of the dendrite-compartment state is calculated. The rest of the dendrite pipeline does not produce valid data for the operation ticks that the GJ influence is being calculated. This delay is partially amortized by using hardware loop unrolling on the GJ calculations, but only to the point that the available chip area allows it. Additionally, in use cases where programmable connectivity is included, the ticks for the evaluation and execution of a GJ connection are always spent regardless of whether a connection actually exists or not. Thus, this implementation cannot benefit from a smaller connectivity density in terms of performance. On the other hand, since one synthesized design can account for all possible connectivity scenarios, the DFE implementation can guarantee predictable performance under all use cases.

Infoli on the Xeon Phi
The InfOli application on the Intel Xeon Phi co-processor, depicted in Figure 5, is based on a typical shared-memory implementation. The application uses the OpenMP library to spawn threads, which can work in parallel. As the Xeon Phi 5110P uses one core to handle OSrelated tasks and each core features multithreading technology that can service up to 4 instruction streams simultaneously, the InfOli application on the Xeon Phi uses up to 60 × 4 = 240 OpenMP threads. Each thread is programmed to handle a part of the neuronal network (sub-network), which is partitioned as uniformly as possible to prevent workload imbalances.
In each simulation step, every OpenMP thread computes its sub-network's state. This process is further broken down into two tasks. Initially, the subnetwork needs updated information from the rest of the network, specifically the dendritic-membrane voltage of the other neurons connected with this sub-network (recall Algorithm 1). Thus, each OpenMP thread accesses memory space shared by all threads so as to collect data from other neurons, with the purpose of re-evaluating the state of its sub-network's GJs. In this task, shared-memory accessing can cause stalls in thread operations due to issues such as memory contention.
Upon completion of its first task, each OpenMP thread updates the compartmental states of each neuron in the sub-network. Each of the neuron's three compartments is re-calculated (dendrite, soma and axon). The dendritic compartment specifically uses  the updated GJ states evaluated in the previous task in order to assess the incoming current from connected neurons. As already explained, this particular process demands an amount of operations that increases significantly with neuron population in the case of densely connected networks, as we would expect with the increasing computational demands of GJs.
After performing its two tasks for the entirety of its sub-network, each OpenMP thread begins the process anew for the next simulation step, until there are none left. Under this paradigm, the threads operate constantly within the "timeframe" of the same simulation step. They sync before the execution of a new simulation step, so that stale data from previous simulation steps cannot be exchanged during GJ computation. This behaviour is enforced due to the stiffness of the eHH-model equations, which can be thrown off-balance even by small changes in the numerical data within a single time step. Under a more relaxed model (e.g. a typical HH model), some staleness in data exchange would be more tolerable and the user would be able perform thread synchronization less frequently in order to trade precision for execution speed. Furthermore, it should be noted that the implementation described assumes that the entire network is large enough to be partitioned in 240 parts. When dealing with smaller networks, the implementation utilizes less than the maximum amount of the platform's assets, since it is designed to require at assign one neuron on each OpenMP thread. Figure 6, we can see the InfOli implementation on the GPU. The execution flow includes two stages, a pre-compute and a compute stage.

Infoli on the Titan X GPU In
In the pre-compute stage, the host initializes the neuron states and the external input currents for the entire simulation duration. It allocates global memory on the device to store the current-step neuron states, next-step neuron states and the external input currents. At the end of this stage, the host copies the required data for simulation onto the GPU. Similarly to the other two accelerator implementations, the current-step dendritic voltages of all cells are accessed frequently as they are used to determine the GJ influence. To reduce memory latency, they are bound to the GPU texture memory. The texture memory is a cached memory on the GPU used to reduce memory latencies when the application has specific memory-access patterns. Writes to texture memory, during the compute stage, are conducted only after all computations of a simulation step have finished. It must be also noted that after the pre-compute stage, no data is transferred from the host to the GPU; the GPU contains all necessary information for the simulation.
During the compute stage, the neuron calculations are performed and the new states are persistently stored throughout the simulation duration.
To compute the new states for a single simulation step, the host launches a CUDA kernel on the GPU device. Before simulation, the kernel is configured for a particular use case (RGJ, SGJ or NGJ) and interneuron connectivity scheme (if applicable). The kernel is executed by a two-dimensional grid of CUDA threads on the device. Threads are executed in parallel by the CUDA micro-cores of the GPU. Every InfOli cell of the model is mapped to a corresponding thread that calculates the states of the neuron. On kernel completion, the host receives the calculated result of the simulation step from the device. The host uses two operation streams to issue the kernel execution and data-transfer operations to the GPU. A kernel in one stream is launched only when the kernel in the other stream has completed. Thus, when one stream is computing the currently executing simulation step, the other stream is performing the necessary data transfers to the host from the GPU. Since the texture memory is updated only after the kernel completes execution, data coherency is maintained. Thus, computation of the current-step neuron states and data transfer of the previously computed states overlap, effectively hiding Host-to-GPU transfer delays.

BrainFrame & the PyNN Front-End
PyNN is a Python package that facilitates the interchangeability and the study of different simulation environments within the computational neuroscience community [11]. It allows for simulator-independent specification of neuronal-network models and already supports many popular simulators like NEURON, NEST, PCSIM , Brian, and so on.
The PyNN API supports modeling at multiple levels of abstraction, both at the neuron level and the network level. It provides a library of standard neuron, synapse and synaptic-plasticity models and a set of commonly-used connectivity algorithms while also supporting custom user-defined connectivity in a simulator-independent fashion.
We integrated the three accelerator fabrics as back-ends on the BrainFrame system using PyNN as a front-end.
The PyNN integration provides the neuroscientific community with easy access on the accelerators without constant mediation from the acceleration engineer while also providing an interface for the already established models to be used with the new heterogeneous acceleration back-end. These characteristics of PyNN can have decisive impact on the adoption of BrainFrame by the community.
As a proof of concept for the front-end of the BrainFrame platform, we have added the InfOli model the library of standard PyNN models. Following the PyNN paradigm, the user initially selects the simulator -in our case our BrainFrame simulator -and then proceeds to select the neuron model, in our case the Inferior-Olive model. A population of neurons using the chosen model is then generated, determining the inter-neuron connectivity type and, finally, a projection of the specified neuronal network is created.
The main difference between the proposed PyNNbackend substrate and the typical simulator backends within the PyNN environment is an additional selection step. In this step, a decision about which of the three alternative acceleration fabrics will be used for a specific experiment is made, based on the available hardware and the characteristics of the simulated neural network.
A conceptional view of the architecture of the PyNN BrainFrame module is shown in Figure 7. For the simulator kernels to communicate with the PyNN frontend, a intermediate BrainFrame-specific PyNN module (pynn.brainframe) is required that implements and extends common methods and objects like the neuron models, synapse models and projection methods and objects. In the case of the proposed BrainFrame module, we implemented objects and methods: i) for the initialization of the simulator, ii) for the description of the neuronal network in PyNN, and iii) for controlling the simulation execution. In some cases, an additional interpreter module is needed to translate these Python objects and parameters to each simulator's native parameters and language. For our system, we developed PyHet -the BrainFrame-specific Python interpreter -which serves the aforementioned role and also implements the accelerator selection.
The final BrainFrame System will be implementing more generic kernel libraries that will be used by the PyNN front-end to simulate user defined models. That way, the accelerator implementation will be com-  pletely transparent to the user and predictions can then be made based on the analysis of the individual kernels that can guide the selection algorithm.

Results
In this section, we present a thorough performance analysis of our heterogeneous BrainFrame platform. The goal is to evaluate the platform and give a clear view on how each accelerator performs when running various instances of the InfOli use cases, validating the usefulness of a heterogeneous HPC simulation framework for computational neuroscience.
The performance analysis also acts as a guide for proposing an accelerator-selection algorithm.
To validate the correct functionality of the separate accelerator implementations, we use a simple experiment that recreates a typical response that is found in the inferior-olive network (axon response). In this experiment, each cell produces a so-called complex spike, seen in Figure 1c, from all simulated cells. 6 seconds of brain time are simulated, which translates to 120,000 simulation steps. The complex spike is produced by applying a small current pulse as input to all InfOli cells at the same instance after program onset, for about 500 simulation steps (or 25 ms, in brain time). Despite being rudimentary, this experiment is easy to validate, provided all neurons are initialized with the same state, and also gives a good indication whether synchronization between neurons is correct, thus validating cell interconnectivity (when present).
As mentioned in the introduction, we identify two distinct tracks that can be followed in conducting neuroscientific experiments, both covered in this evaluation. We perform one batch of measurements ranging from 96 to 960 neurons representing smallscale, real-time TYPE-I experiments, and a second batch ranging from 960 to 7,680 neurons representing larger-scale TYPE-II experiments). We consider (by consulting our neuroscience experts) the minimum meaningful network size for experiments to be around 100 neurons, thus our measurements for TYPE-I experiments begin at 96 neurons. The evaluation is focused the performance of single-node accelerators, thus a network-size cap is set by the smallest maximum network supported by each of the three accelerator fabrics: in this case, the DFE fabric limits network sizes to 7,680 cells.

Performance Evaluation
All performance measurements concerning the Xeon Phi have been carried out through the VTune Amplifier XE 2015 profiling and analysis tool by Intel. Timing measurements on the Maia DFE were taken by measuring the DFE-kernel time inlined within the host code using timestamps before and after the kernel call. Since, the host code (in the CPU) is blocking, only the DFE kernel is active during measurements. The time includes the kernel execution (processing and DRAM data-exchange delay) and the activation delay of the FPGA device. This activation takes about 1 ms, which is negligible compared to the overall execution time that takes several seconds to several minutes in our test experiment. GPU kernel-time measurements were taken using the CUDA Event API.

TYPE-I Experiments
Starting with the analysis for TYPE-I experimentation, in Figure 8 we plot the execution time of a single simulation time-step (50 µsec) for the most demanding use case, that of the RGJ with 100% connectivity density. Even though still not the most common case, a brain-simulation platform must support such high interconnectivity densities for certain TYPE-I experiments. The DFE exhibits the best performance for all tested network sizes. The     Xeon Phi is a close second due to the local-memory delays and the less efficient use of its parallel threads: These network sizes are not large enough to provide sufficient parallelism for the Phi threads to be fully utilized. The GPU, on the other hand, has difficulties to cope with the computational intensity of the GJs, which involve mostly division and exponent FP calculations. Since each CUDA thread executes a single neuron, it cannot exploit any potential parallelism in the GJ calculation. This, alongside the fact that the CUDA threads are underutilized at such network sizes, impacts performance drastically.
The inefficiency of the Titan X GPU in performing the realistic GJ computations is clearly revealed in the SGJ case, next (see Figure 9). In this use case, that the most demanding GJ calculations are dropped, the GPU presents excellent scalability as the problem size increases, compared to the RGJ case. The Xeon Phi, on the other hand, still suffers from core-to-local-memory synchronization delays even though the actual calculations are much simpler now. The DFE needs to spend the same amount of operation ticks as in the RGJ case to evaluate the connection influence, even though it does enjoy gains in performance because of the simpler calculations involved (achieving higher operation frequencies, larger GJ computation parallelism and shorter pipelines). As a result, both latter accelerators show similar scaling properties to the RGJ case. In contrast, the GPU scores performance benefits in the SGJ case compared to the robust DFE for network sizes above 480 neurons.
Next, it is interesting to evaluate the three accelerators for connectivities of lower than 100% density.
Although not relevant for the DFE which maintains the same implementation for any connectivity density, smaller densities can influence the Xeon Phi and the GPU performance considerably. In Figure 10, we plot the execution time of a single simulation time-step for 25%, 50% and 75% connectivity densities, under the RGJ case. The GPU delivers significant gains but the inefficient GJ execution still causes it to perform worse than DFE, even though the latter operates as in a 100%-density simulation. The Xeon Phi, on the other hand, manages to achieve enough performance gains to become faster than the DFE for sufficiently large problem sizes; that is, sizes ≥960 neurons for 75% density, ≥864 neurons for 50% density and ≥672 neurons for 25% density.
Under the SGJ use case (Figure 11), we see similar trends as for the 100% SGJ use case: The GPU exhibits great scalability and is the best option for network sizes higher than 480 neurons. Besides, the DFE remains the most beneficial option for networks smaller than 480.
Under the NGJ case (no connectivity), for TYPE-I experiments, the results point to the DFE as the uniformly best option.
In the complete absence of inter-neuron connectivity, the application becomes a purely dataflow workload, fully compatible for acceleration on a DFE, which is tailor-made for such cases, providing significant benefits over both the Xeon Phi and the GPU (see Figure 12).
Lastly, recall that for TYPE-I experiments, realtime speeds are often desired. Table 3 presents the real-time achievable networks for each use case. The results show that, for real-time experimentation, the DFE accelerator is the best option across the board. In contrast, and as mentioned in our previous analysis, the GPU and Xeon-Phi parallel threads tend to be underutilized at such small network sizes, even though most of the delays of using them are present. Thus the DFE -using fine-grain super-pipelined kernels -can achieve meaningful network sizes at real-time speeds under all use-case instances, according to the objective set in the introduction (≥ 100 cells). For low (≤ 50%) or zero densities, the GPU and Xeon Phi come close to the real-time objective, yet it is interesting to note that the DFE can even support real-time experimentation for TYPE-II experiments under the NGJ case.

TYPE-II Experiments
For TYPE-II experiments, the trends under the RGJ case with 100% connectivity change significantly (see Figure 13). Here, the massive explosion of the GJ computations begins to stress the parallelization capabilities of both the Xeon Phi and the DFE. The DFE's efficient parallelization of the GJs relies mostly on its ability to unroll the GJ loop on the FPGA hardware, allowing for more iterations to finish per operation tick. However, the achievable unrolling factor is limited by the available chip area. For network sizes above 1,000 neurons, the DFE compiler is forced to reuse a lot of resources in time (as the unrolling factor is reduced with increasing network sizes). In effect, the dataflow paradigm gradually degenerates to a sequential execution, making the application less scalable on the DFE. The Xeon Phi follows a similar trend, as the communication overhead between cores (which are interconnected through a moderately efficient ring topology [2]) increases, leading to similarly diminished scalability. Opposite to these accelerators, GPU scalability is largely improved. The GPU is underutilized until all CUDA cores are used (3,072) simultaneously, so for experiments over 3,000 neurons scalability is gradually improving. As a result, the GPU becomes the better performing solution (surpassing the DFE) for network sizes of 4,800 neurons and above.
For lower connectivity densities under the RGJ case, we observe similar trends, although the Xeon-Phi scalability is slightly better because of the lower interconnectivity (see Figure 15). Thus, the Xeon Phi retains the advantages it has for lower than 100% densities, compared to the DFE. Still, the effect of the inter-core communications is present allowing for the GPU to overtake the Xeon Phi for network sizes above 4,800 neurons (for densities of 50% and 75%) and above 3,840 neurons (for 25% density).
Under the SGJ case, the DFE and Xeon Phi follow similar trends, although they are less pronounced (see Figures 14 and 16). As in the RGJ case, the GPU maintains its lead over the other two accelerator types for all tested network sizes and connectivity densities. Finally, in the NGJ case, the situation is the same as with TYPE-I experiments: The purely dataflow nature of the application allows the DFE to once more score the best performance across the board (Figure 17).

Accelerator-Selection Algorithm
The performance analysis discussed above can now be used to formulate a simple accelerator-selection algorithm for BrainFrame automatically choosing the best-suited accelerator fabric based on the problem parameters: mainly, connectivity detail (biophysically realistic: RGJ, simple: SGJ and not present: NGJ), density and network size. Figure 18 shows the selection for our use-case instances. The RGJ case selection, which presents the most complex case in terms of accelerator choice, shifts between all three options depending on the connectivity density. For     the SGJ case, the GPU is always the accelerator of choice, while for the NGJ case the DFE yields optimal results under all experiment parameters. Lastly, if the experiment is flagged as a real-time experiment, the algorithm exclusively chooses the DFE to accelerate the application, as it is the only clearly viable accelerator for real-time experiments.
As a simple example of how this selection can speed up experiments, we can assume a scenario where several batches of RGJ experiments need to be executed for various network sizes. Let us assume that each batch includes 5 experiments each with gradually decreasing connectivity density (100%-75%-50%-25%-0%) and that each experiment in a batch simulates 40 seconds of brain time. The time saving in this example by using the BrainFrame system compared to homogeneous systems that integrate only a single accelerator type can be seen on Table 4.
The BrainFrame system can achieve significant benefits compared to the single-fabric systems that can range up to 86% execution-time reduction. On average, assuming the total runtime of all batches, the BrainFrame system can achieve 40% reduction compared to a DFE-only system, a 10.7% reduction to a GPU-only system and a 20.2% reduction compared to a Phi-only system. If we consider the nominal scenario of TDP power consumption we can also present an estimation of the energy benefits of using BrainFrame compared to the single node accelerators for our example. The energy saving for specific batches on the example are between 20.4% to even about 91.4%. For the all experimental batches the energy saving is between 27% to 45.9%. Reduction in energy consumption can greatly reduce operation and maintenance cost especially within a datacenter environment.
Although these figures will vary based on the particular accelerator instances used for the experiments, they give a rough estimate of the time savings that can be obtained by carefully selecting the accelerators for the various experiments. This selection can be easily extended/updated as new features and more generalized model libraries are added for acceleration (making the selection predictive for general cases) or as each acceleration technology is updated in the future.

Discussion
There are numerous related works that propose employing HPC solutions for the acceleration of SNNs. Such solutions include hardware-based solutions, like reconfigurable hardware, as well as software solutions using GPUs and less often many-core processors platforms, such as the Xeon Phi. Simpler modeling has found a good match on GPU-based systems, such as Izhikevich and I&F modeling [7,21]. Higher biophysically meaningful modeling, like the extended-HH model, seems to be a much more difficult problem to solve with GPUs, especially for real-time experimentation [3].
The Xeon Phi has also been used very successfully for bio-inspired neural networks, such as Convolutional Neural Networks for Deep Learning Systems [22]. On the other hand, similar difficulties to the GPUs in the acceleration of the complex HH models, are identified with Xeon Phi platforms even for less densely interconnected networks [2].
FPGA-based solutions have been especially prominent in accelerating neuron applications, with impressive results specifically for biophysically meaningful modeling and real-time performance for such networks [4,5,23]. It is also revealed in related works conducting performance analysis, that an FPGA's potential benefit varies greatly between SNN types, even without taking into account connectivity modeling that can decisively change the workload characteristics [6].
Recently, we have also seen use of DFEs for accelerating computational-neuroscience applications. On purely dataflow neuromodeling applications, the DFE can have great benefits for both large-scale networks and real-time networks performance [24]. Even in the cases of HH neurons that include highly accurate interconnectivity modeling (disrupting the purely dataflow nature), the DFEs can accomplish greater benefits than traditional control-flow-based FPGA acceleration [20].
These works, however, present just one-off implementations of specific application instances, on a specific acceleration platform and most also ignore the variety of synapse-modeling alternatives and its influence on the applications. Biophysically accurate models of biological systems, such as the ones using the HH formalism, are comprised mostly of a set of computationally challenging deferential equations often implementing an oscillatory behavior. If neurons are simulated as independent computational islands (NGJ case), then dependencies between the equations do not arise, allowing divide-and-conquer, data-flow and event-driven acceleration strategies to be used very efficiently. The moment interconnectivity between oscillating neurons is also modeled (like GJs, input integrators, STDP synapses etc), the cells become coupled oscillators. The embarrassingly parallel and dataflow nature of the application is then broken. All neuron states need to be completely updated at each simulation step to retain correct functionality. This requirement, in turn, enforces the use of cycleaccurate, transient simulators and forbids event-driven implementations. As a result, a single HPC fabric cannot be a universal solution to the problem since it is unable to cover all the aforementioned requirements efficiently, as our analysis also reveals.
The above difficulties strongly hint on why most of the computational-neuroscience community has so far meticulously avoided employing HH models and multicompartmental models with complex connections on large problem sizes using conventional computing machines. The eventual use of biophysically plausible neurons and connections on a larger scale is anticipated to contribute substantially in explaining biological behavior.
Even though the details of the most important system behaviors of the modeled systems must revolve around very specific characteristics of the networks, thus can possibly to be revealed by generally simpler representations, the computational neuroscientist cannot know beforehand which of the numerous dynamics revealed from the biological measurements (from which the models arise) can be safely abstracted. Studies seeking to reveal system behavior need to start with complex representations before they know enough to drop back to more simplified models.
Additionally, most related works seem to suffer from a limited re-usability value due to their (often inexistent) user interface. They ignore the challenge of the neuroscientific community adopting the proposed platform and very few propose solutions to that end. Beuler et al. [25] developed a graphical interface alongside their FPGA-based simulator. Although it does provide ease of use in experiments, it is still confined to only one platform and only one application with limited flexibility to be the basis of a more widely adopted system. Weinstein et al. [26,27] took the approach of developing their own modeling language to interface to their acceleration library, the DYNAMO compiler. Despite the limitation of using only FPGAs as the back-end platform, the DYNAMO compiler is a technically complete solution. Unfortunately, it failed to achieve wide adoption by the scientific community as it requires learning a new language and, additionally, the non-trivial process of porting all existing neuron models to the new coding paradigm.
PyNN has also been used in the past to tackle the issue of user interface. One such a system is SpiNNaker [28], which uses PyNN to interface to a neuromorphic network comprised by a many-core system based on ARM processors. SpiNNaker though is focused on simpler modeling paradigms that do not model specific biophysical neuron properties and have more tractable computational requirements. The most promising solution, both in terms of usability and computational ability, was proposed by Cheung et al. [29] with NeuroFlow. In this work, the researchers integrated PyNN to their DFE-based hardware library.
Neuroflow also provides a very complete library of IPs in the back-end, covering a great portion of possible applications. Yet, the system is still integrating a single acceleration platform. What is more, the performance and efficiency analysis is only presented for a single use case of a generally simpler model (Izhikevich) and with connectivity modeling of medium complexity (STDP) and relatively lower density (about 10%). The behavior and performance of the system for the rest of the supported features is not self-evident and is expected to be significantly different, especially for accurate modeling such as the HH and with high connectivity densities, as shown by our performance analysis on the DFE platform. Furthermore, many of the performance benefits are accomplished using event-driven simulations (neurons are evaluated only when their inputs are triggered), that cannot always be employed, as discussed earlier.
To the best of our knowledge, no prior work has considered an heterogeneous acceleration system for coping with the variability of the applications in the field. Additionally, the PyNN integration provides a familiar interface to the neuroscientific community, thus making BrainFrame a complete solution for a node-level heterogeneous system. Even though the current work introduces BrainFrame in a single-node setup that integrates all three accelerator fabrics using the PCI-e interface, the BrainFrame paradigm is primary designed to support multi-node setups. Such setups can be facilitated in the now up-and-coming heterogeneous datacenters, provided crucial aspects such as low-latency interconnects are tackled. Such a development would lead to a dramatic increase in the size of network populations supported at tractable simulation times, while also providing a way for smallmedium-sized labs to use BrainFrame as a service. Thus, enabling them to exploit the benefits of such an HPC platform without suffering the cost of creating and maintaining a local setup.

Conclusions
In this paper, we have proposed BrainFrame, an heterogeneous acceleration platform to serve computational-neuroscience studies in conducting the variety of real experimentation often required for the study of brain functionality. We have focused our analysis on biophysically-accurate neuron models, as such models are considered essential for the deeper understanding of the system properties of biological brain networks. In order for the BrainFrame system to cope with the demand for high ease of programming use as well as the computational requirements of the field, we have presented a proof-of-concept HPC platform that integrates three accelerator technologies already proven in brain simulations. The performance analysis of the system employing use cases that take into account connectivity density and modeling complexity, has revealed that all three fabrics are essential within such a powerful simulation platform so as to optimally serve all possible experimentation cases. The platform, thus, achieves efficient large-network experiments as well as real-time performance for meaningful network sizes (≥ 100 cells).
BrainFrame is complemented, finally, with a PyNN front-end so as to tackle the much sought usability objective.
The PyNN front-end makes the heterogeneous platform immediately accessible to a multitude of prior modeling works, which is an essential strategy for the wide adoption of complex HPC platforms in the neuroscientific community. Furthermore, building on the elegant PyNN infrastructure, a simple accelerator-selection algorithm has also been developed for automatically identifying the most suitable HPC fabric (Xeon Phi, GPU, DFE) per neuroscientific experiment and has been integrated in BrainFrame. Last but not least, all accelerators use PCIe slots to connect to the host system, which greatly amplifies the platform flexibility and permits adjusting the platform hardware depending on the funds and hardware resources available to a research lab wishing to use BrainFrame.