Neural Parallel Engine: A toolbox for massively parallel neural signal processing

https://doi.org/10.1016/j.jneumeth.2018.03.004

Highlights

  • A toolbox for massively parallel neural signal processing on GPU has been developed.

  • It provides multiple algorithms for efficient spike detection and sorting on thousands of channels.

  • It offers a 5× to 110× speedup compared with CPU implementations.

  • A MATLAB interface is provided for easy integration into existing workflows.

Abstract

Background

Large-scale neural recordings provide detailed information on neuronal activity and can help elucidate the underlying neural mechanisms of the brain. However, the computational burden of processing the huge data streams generated by such recordings is formidable.

New method

In this study, we report the development of Neural Parallel Engine (NPE), a toolbox for massively parallel neural signal processing on graphical processing units (GPUs). It offers a selection of the most commonly used routines in neural signal processing, such as spike detection and spike sorting, including advanced algorithms such as exponential-component-polynomial-component (EC-PC) spike detection and binary pursuit spike sorting. We also propose a new method for detecting peaks in parallel through a parallel compaction operation.

Results

Our toolbox offers a 5× to 110× speedup over its CPU counterparts, depending on the algorithm. A user-friendly MATLAB interface is provided to allow easy integration of the toolbox into existing workflows.

Comparison with existing methods

Previous efforts on GPU-based neural signal processing focus on only a few rudimentary algorithms, are not well optimized, and often do not provide a user-friendly programming interface that fits into existing workflows. There is a strong need for a comprehensive toolbox for massively parallel neural signal processing.

Conclusions

A new toolbox for massively parallel neural signal processing has been created. It can offer significant speedup in processing signals from large-scale recordings up to thousands of channels.

Introduction

The human brain contains billions of neurons linked by trillions of synaptic connections. Each neuron may play a specific role in neural information processing. Therefore, to understand neural mechanisms at the circuit level, it is important to monitor the activity of as many neurons as possible. Some studies have already achieved recordings of 512 channels in behaving animals (Fiáth et al., 2016, Xie et al., 2016). A prototype comprising 1028 electrodes has also been demonstrated in vivo (Torfs et al., 2011). For in-vitro MEAs, where area and power constraints are less of a concern, the total channel count has already reached 81,920 channels (Johnson et al., 2012, Obien et al., 2015).

One of the major challenges brought by this rapid increase in channel count is the computational burden of processing the acquired raw signals. The burden is further exacerbated by the high sampling frequency required for neural recordings (typically 40 kHz) and by long experimental sessions (a large number of trials is needed for statistically meaningful results). For example, a mere 15-minute experiment will generate 37 GB of data if sampled at 40 kHz with 16-bit resolution on 512 channels. Processing such large amounts of data will be a major bottleneck in the real-time analysis of neural signals.
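As a back-of-the-envelope check, the 37 GB figure follows directly from the stated recording parameters:

$$512\ \text{channels} \times 40{,}000\ \tfrac{\text{samples}}{\text{s}} \times 2\ \tfrac{\text{bytes}}{\text{sample}} \times 900\ \text{s} \approx 3.7 \times 10^{10}\ \text{bytes} \approx 37\ \text{GB}.$$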

Real-time analysis of neural signals is an indispensable part of many areas of neuroscience research. For example, a brain-computer interface demands that the motor intention of a subject be decoded quickly for real-time control (Libedinsky et al., 2016, Kao et al., 2017, Gilja et al., 2012). Studies in closed-loop neurophysiology can provide insights into the relationship between electrical stimulation and neural response (Keren, 2014, Rolston et al., 2010), but the acquired neural signals also need to be processed in real time to control the stimulation parameters. As the number of recording channels inevitably increases in the future, processing these signals efficiently is expected to become a major computational challenge.

Many spike detection and sorting algorithms have been proposed in the past (for reviews, please see Rey et al., 2015, Lewicki, 1998). However, both remain unsolved problems, because the signal-to-noise ratios and feature characteristics of neural recordings vary widely between preparations. Furthermore, sophisticated algorithms that aim to address this variability cannot process a large number of channels in real time. The unfortunate result is that most researchers still rely on simple techniques such as threshold crossing for real-time spike detection. Even these simple techniques may take a significant amount of time when analyzing large-scale recordings. An efficient way to process large amounts of data is strongly desired.

An efficient way to analyze a large number of channels is to process them in parallel. In terms of programming difficulty, the simplest method is to use a multi-core CPU, with each core working on one channel of data. However, since the CPU architecture was not originally designed for massively parallel processing, the number of available cores is very limited.
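As a minimal sketch of this per-channel approach (processChannel is a hypothetical routine, not code from the toolbox):

```cpp
#include <cstddef>
#include <vector>

void processChannel(float* samples, int n);  // e.g. filter + detect (hypothetical)

// Channels are independent, so they map naturally onto CPU cores.
void processAllChannels(std::vector<float>& data,
                        int numChannels, int samplesPerChannel)
{
    #pragma omp parallel for
    for (int ch = 0; ch < numChannels; ++ch)
        processChannel(data.data() + (std::size_t)ch * samplesPerChannel,
                       samplesPerChannel);
}
```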

Graphical processing units (GPUs), on the other hand, are designed from the ground up for massively parallel processing. Driven by the recent surge of interest in deep learning, many resources have been invested in developing more powerful GPUs. A commodity GPU can now have thousands of processing cores capable of performing tens of thousands of operations simultaneously. GPUs can be integrated efficiently into existing workflows, as most desktop computers have a built-in PCI-E slot for GPUs. Multiple GPUs can also be installed in the same computer to further boost its processing power. The GPU therefore has great potential to ease the computational burden in neural signal processing.

Some previous efforts have been made to accelerate neural signal processing on the GPU. Cao et al. used GPUs to mine firing patterns in MEA recordings, but the data input to the algorithm were spike trains rather than raw neural signals (Cao et al., 2010). Accurate spike detection on raw signals is very computationally intensive and is sometimes even more time consuming than the pattern mining itself. Wilson and Williams used a GPU to process EEG data for a BCI application (Wilson and Williams, 2009). Specifically, they parallelized spatial filtering and auto-regressive spectrum estimation on the GPU. However, EEG has a much lower sampling frequency, and its processing requires very different algorithms compared with spike recordings. Another study used a GPU for signal processing of spike signals (Wilson, 2011), but only simple algorithms, such as filtering and voltage crossing for spike detection, were implemented; it did not offer state-of-the-art algorithms. Chen et al. (2011) implemented NEO spike detection, spike interpolation and PCA projection on a GPU. However, they reported results only on a 32-channel dataset, and the maximum speedup they achieved on spike detection was only 3×. Whether the implemented algorithms can scale to thousands of channels for large-scale recordings remains unknown. Furthermore, they did not implement any spike sorting algorithm on the GPU, which is a critical part of the neural signal processing procedure. Information on memory transfer timing is also missing, even though such transfers can be a significant overhead when a large amount of data is processed. Most of these previous efforts implemented only one or two simple algorithms and avoided the complexity of advanced algorithms. Last but not least, they do not provide a simple interface to interoperate with other commonly used programming environments, e.g. MATLAB or Python, so it is very difficult to integrate them directly into existing workflows. Currently, no comprehensive toolbox offers advanced parallel neural processing algorithms to the neuroscience community in an easy-to-use package.

Converting a serial algorithm into a parallel version is non-trivial. In fact, many studies have been devoted to the parallelization of a single algorithm (Chen et al., 2011, Manavski, 2007, Llamocca et al., 2011, Zhao et al., 2009, Ruijters and Thévenaz, 2012, HewaNadungodage et al., 2017, Wang et al., 2017, Rossinelli et al., 2010). The unique architecture of a GPU poses special challenges and constraints on algorithm design if we want to fully exploit its potential. For example, a GPU core typically runs at a much lower frequency than a CPU core. It is also simpler, usually forgoing the aggressive out-of-order execution and branch prediction that CPUs use to reduce latencies in their instruction pipelines. As a result, a single GPU core is usually much slower than a CPU core; a GPU instead relies on executing hundreds of thousands of threads simultaneously to hide computation latency. In cases where the level of parallelism offered by a simple partition of the data is not enough (e.g. only hundreds of channels in a neural recording), a second level of parallelism must be exploited to fully utilize the GPU (e.g. processing multiple elements of a channel simultaneously). The original algorithms need to be carefully re-engineered to expose fine-grained parallelism in their procedures. A GPU also has a more complex memory hierarchy, with global, shared and register memory, each with its own size and latency constraints. For example, shared memory can be more than 10 times faster than global memory, but its size is much more limited. The efficient use of the various types of memory is crucial to the performance of parallel processing. Furthermore, because threads in a GPU operate independently from one another, they must be carefully synchronized so that all threads can see the intermediate values produced. However, synchronization points introduce overhead that adversely affects the performance of parallel operations, so they should be used only when necessary.
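To make these constraints concrete, the following minimal CUDA sketch (our illustration, not code from the toolbox) stages a tile of samples in fast shared memory, including halo samples needed by neighboring threads, and places a single synchronization point before the staged values are reused:

```cuda
// Minimal sketch: a 3-point smoothing kernel that stages data in shared
// memory. Assumes a launch with blockDim.x == 256; all names are ours.
__global__ void boxFilter3(const float* in, float* out, int n)
{
    __shared__ float tile[256 + 2];              // tile plus one-sample halo
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;                   // shift right for left halo

    if (gid < n) tile[lid] = in[gid];
    if (threadIdx.x == 0)                        // left halo sample
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)           // right halo sample
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;

    __syncthreads();  // make all staged values visible to every thread

    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```

Each sample is read from slow global memory once but reused three times from shared memory, and the single __syncthreads() is the only synchronization point, reflecting the guidelines above.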

In this study, we report the development of Neural Parallel Engine (NPE), a comprehensive GPU-based toolbox for massively parallel neural signal processing. It incorporates multiple spike detection and sorting algorithms, for which we devised efficient parallelization strategies. We also compare the performance of our GPU toolbox with existing implementations on CPUs. Finally, we analyze the bottlenecks limiting the performance of parallel neural signal processing. We plan to open-source the toolbox in the future to allow other researchers to build new parallel algorithms from the basic components of our toolbox. NPE marks the beginning of a larger project enabling real-time processing of large-scale neural recordings.

Section snippets

CUDA architecture

We have adopted the CUDA platform from NVIDIA to build our toolbox due to its more mature tooling and wider community acceptance. In the CUDA architecture, the basic processing unit is called a Streaming Multiprocessor (SM). Each SM has its own buffer, register space, execution units and shared memory. Fig. 1 shows the simplified architecture of a GP104-series GPU from NVIDIA.

In CUDA, threads are grouped together in what is called a thread block. All threads in the same block share the same register

Parallelization strategies for neural signal processing

Neural signals typically consist of many channels of data. Each channel can usually be processed independently, at least in the early stages of the signal processing pipeline. This property can be exploited via parallel processing to significantly speed up the computation. Many levels of parallelism exist in CUDA, and an appropriate scheme for mapping channels onto the CUDA programming structure is vital to the scalability of the program.

One simple method is to map each channel of data to one
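A minimal sketch of one such mapping, in which each channel is assigned to one thread block and the threads of that block stride across the channel's samples (kernel and variable names are ours, not the toolbox's actual kernels):

```cuda
// First tier: one thread block per channel (blockIdx.x selects the channel).
// Second tier: threads stride across that channel's samples in parallel.
__global__ void perChannelAbs(const float* data, float* result,
                              int samplesPerChannel)
{
    const float* ch  = data   + (size_t)blockIdx.x * samplesPerChannel;
    float*       res = result + (size_t)blockIdx.x * samplesPerChannel;

    for (int i = threadIdx.x; i < samplesPerChannel; i += blockDim.x)
        res[i] = fabsf(ch[i]);   // placeholder per-sample operation
}

// Launch with one block per channel:
// perChannelAbs<<<numChannels, 256>>>(d_data, d_result, samplesPerChannel);
```

Because the grid dimension scales with the channel count and the in-channel loop scales with the data length, the same kernel covers both many-channel and long-recording cases.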

Algorithms implemented in the toolbox

Our toolbox focuses mainly on the analysis of raw neural signals, as processing them is more computationally intensive than processing spike trains. It addresses the two most important tasks in neural signal processing: spike detection and spike sorting. We have parallelized multiple commonly used and advanced algorithms for spike detection and sorting to cater to the different needs of the research community. In the following, we will discuss the algorithms that have been implemented
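As one illustration of how such an algorithm can be parallelized, peak detection can be cast as a stream compaction, in the spirit of the parallel compaction operation mentioned in the Abstract. The sketch below uses the Thrust library (Hoberock et al., 2017) to keep only the indices of super-threshold local maxima; it is our simplified rendering of the idea, not the toolbox's implementation:

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>

// Predicate: sample i is a local maximum above the detection threshold.
struct IsPeak {
    const float* sig;
    float thresh;
    __host__ __device__ bool operator()(int i) const {
        return sig[i] > thresh && sig[i] >= sig[i - 1] && sig[i] > sig[i + 1];
    }
};

// Stream compaction: evaluate every index in parallel, keep only peaks.
thrust::device_vector<int> findPeaks(const thrust::device_vector<float>& sig,
                                     float thresh)
{
    int n = (int)sig.size();
    thrust::device_vector<int> peaks(n);
    auto end = thrust::copy_if(
        thrust::counting_iterator<int>(1),       // interior samples only
        thrust::counting_iterator<int>(n - 1),
        peaks.begin(),
        IsPeak{thrust::raw_pointer_cast(sig.data()), thresh});
    peaks.resize(end - peaks.begin());
    return peaks;
}
```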

MATLAB wrapper

For convenience and easy integration into existing analysis workflows, a MATLAB wrapper is created for all key functions of the GPU toolbox so that it can be used directly in MATLAB. The wrapper is created with the help of the mexplus C++ MATLAB development kit (Yamaguchi, 2017). The development kit provides some useful macros and functions to make creating a MATLAB wrapper easier. Custom data conversion routines have been created to convert the managed array class in MATLAB to the internal
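For illustration, a mexplus entry point typically looks like the following sketch; the command name npe_detect, the argument layout and the omitted GPU call are all hypothetical:

```cpp
// Hypothetical mexplus wrapper sketch; not the toolbox's actual interface.
#include <mexplus.h>
#include <vector>

using namespace mexplus;

MEX_DEFINE(npe_detect) (int nlhs, mxArray* plhs[],
                        int nrhs, const mxArray* prhs[]) {
    InputArguments input(nrhs, prhs, 2);    // expects: signal, threshold
    OutputArguments output(nlhs, plhs, 1);  // returns: spike sample indices

    std::vector<float> signal = input.get<std::vector<float>>(0);
    double threshold = input.get<double>(1);

    std::vector<int> spikeIndices;
    // ... copy signal to the GPU, run the detection kernel,
    //     and copy the detected indices back (omitted) ...

    output.set(0, spikeIndices);
}

MEX_DISPATCH
```

With mexplus's dispatch mechanism, the compiled MEX file routes on its first string argument, so the call from MATLAB would look roughly like spikes = npe_mex('npe_detect', signal, threshold), with the file name again hypothetical.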

Performance comparison

A synthetic neural signal was used to compare the performance of GPU-based and CPU-based neural processing. The simulated dataset from the wave_clus package is widely used in the literature for spike detection and sorting studies so we have also used it in our performance comparison (Quiroga et al., 2004). We used the same C_Easy1_noise04 data in all our speed comparisons on spike detection. The standard deviation of the noise in the simulated data is 0.4 times relative to the amplitudes of

Spike detection

We compared the performance of various neural signal processing routines implemented in our toolkit on the GPU and the CPU. Fig. 10 shows an example of the output from GPU spike detection. As can be seen from the figures, most spikes were detected correctly by the toolbox. The missed spikes are due to inherent limitations of the original algorithms rather than to the GPU parallelization. Table 1 shows the comparison of the spike detection rate between the GPU and CPU programs in the C_Easy1_noise025

Discussion

In this study, we have created the Neural Parallel Engine, a massively parallel neural signal processing toolbox on the GPU. It can process thousands of channels simultaneously. It adopts a multi-tier parallelism architecture to accelerate the processing of data with large numbers of channels or long data lengths. A MATLAB wrapper was created to make it easily usable in the MATLAB environment. We compared its performance with CPU implementations and found that our library can offer a 5× to 110×

Conclusion

In this work, we have created the Neural Parallel Engine, a MATLAB toolbox that can offer a 5× to 110× speedup in neural signal processing compared with CPU implementations. We have proposed a two-tier parallelization strategy to accelerate neural signal processing on the GPU and a scalable scheme to map neural recording channels to software constructs in CUDA. Our toolbox offers multiple options for spike detection and spike sorting to cater to different needs and preferences. It also offers

References (41)

  • D. Chen et al.

    Massively parallel neural signal processing on a many-core platform

    Comput. Sci. Eng.

    (2011)
  • R. Fiáth et al.

    Large-scale recording of thalamocortical circuits: in vivo electrophysiology with the two-dimensional electronic depth control silicon probe

    J. Neurophysiol.

    (2016)
  • V. Gilja et al.

    A high-performance neural prosthesis enabled by control algorithm design

    Nat. Neurosci.

    (2012)
  • C. HewaNadungodage et al.

    A GPU-oriented online recommendation algorithm for efficient processing of time-varying continuous data streams

    Knowl. Inf. Syst.

    (2017)
  • J. Hoberock et al.

    Thrust Parallel Algorithms Library

    (2017)
  • P.J. Huber et al.

    Robust Statistics

    (2009)
  • J.C. Kao et al.

    A high-performance neural prosthesis incorporating discrete state selection with hidden Markov models

    IEEE Trans. Biomed. Eng.

    (2017)
  • H. Keren

    Controlling neural network responsiveness: tradeoffs and constraints

    Front. Neuroeng.

    (2014)
  • M.R. Keshtkaran et al.

    Noise-robust unsupervised spike sorting based on discriminative subspace learning with outlier handling

    J. Neural Eng.

    (2017)
  • K.H. Kim et al.

    Neural spike sorting under nearly 0-dB signal-to-noise ratio using nonlinear energy operator and artificial neural-network classifier

    IEEE Trans. Biomed. Eng.

    (2000)