Spike sorting based upon machine learning algorithms (SOMA)

https://doi.org/10.1016/j.jneumeth.2006.08.013

Abstract

We have developed a spike sorting method, using a combination of various machine learning algorithms, to analyse electrophysiological data and automatically determine the number of sampled neurons from an individual electrode, and discriminate their activities. We discuss extensions to a standard unsupervised learning algorithm (Kohonen), as a simple application of this technique would only identify a known number of clusters. Our extra techniques automatically identify the number of clusters within the dataset, and their sizes, thereby reducing the chance of misclassification. We also discuss a new pre-processing technique, which transforms the data into a higher dimensional feature space revealing separable clusters. Using principal component analysis (PCA) alone may not achieve this. Our new approach appends to the features acquired using PCA further features describing the geometric shapes that constitute a spike waveform. To validate our new spike sorting approach, we have applied it to multi-electrode array datasets acquired from the rat olfactory bulb and from the sheep infero-temporal cortex, and to simulated data. The SOMA software is available at http://www.sussex.ac.uk/Users/pmh20/spikes.

Introduction

After over a century of neurophysiological research we still do not understand the principle by which a stimulus such as an odour, an image or a sound is represented by distributed neural ensembles within the brain. While large numbers of studies have made detailed analyses of response profiles of single cells in isolation, such techniques cannot address holistic issues of how large ensembles of neurones can integrate information both spatially and temporally. There is little doubt that much of the information processing power of the brain resides in the activities of co-operating and competing networks of neurons. If we can unlock the principles whereby information is encoded within these networks as a whole, rather than focusing on the activities of single neurones in isolation, we may come closer to understanding how the brain works (Petersen and Panzeri, 2004). While brain imaging techniques are making some progress towards understanding how this is achieved at a gross structural level, the only way to provide an understanding at the level of multiple cell–cell interactions in the brain is to record simultaneously from large numbers of cells within a defined system. The two main difficulties in achieving this step have been, firstly, the lack of appropriate hardware to record simultaneously the electrical activity of ensembles of neurones at the single cell level and, secondly, restrictions of current methods in analysing the resulting huge volume of multivariate, high frequency data. In the current paper, we aim to resolve a crucial issue in the second category: spike sorting.

There have been many automatic (see Harris et al., 2000; Harris KD. Klustakwik Software (http://klustakwik.sourceforge.net/), 2002; Lipa P. BubbleClust Software (http://www.cbc.umn.edu/∼redish/mclust/MClust-3.3.2.zip), 2005), manual (see Redish A. MClust Software (http://ccgb.umn.edu/∼redish/mclust/), 2005; Hazab L. Kluster Software (G. Buzsaki’s Lab) (http://klusters.sourceforge.net/UserManual/index.html), 2004; Spike2 Software (Cambridge Electronic Design Ltd., UK)), and semi-automatic methods, i.e. a combination of both, to sort spikes. None of these methods supersedes the rest as each has associated difficulties (Lewicki, 1998). Normally, the first step in these methods is to find the number of clusters, i.e. template waveforms, formed from the chosen set of waveform features. Manual methods require the user to select these features and to identify the template waveforms, which can be labour intensive and inefficient. Fully automated methods rely on a chosen pre-processing method, normally principal component analysis (PCA), to automatically extract differentiable features. A clustering technique is then used to find a number of clusters. This approach is less time-consuming. However, there are still problems associated with this approach; mainly that most clustering algorithms require a pre-determined number of groups to identify, and PCA does not always produce several distinguishable clusters. The semi-automatic methods have a combination of the above problems.

The next step in these methods is to identify the sets of waveforms that best represent each of the clusters, usually referred to as cluster cutting. A user can manually split the feature space into clusters by defining boundaries, but this technique is normally inefficient and time-consuming. An alternative is to define the clusters automatically, by identifying the closest set of spike waveforms to each template waveform (mean) using a distance measure. This forms a set of implicit boundaries that separate the clusters. Clustering using an automated method normally ignores the clusters’ distributions, and so classifies a high percentage of the dataset correctly only if the clusters are not close and have a spherical distribution. However, the likelihood of finding such clusters in real electrophysiological data is minimal, because the differing noise levels attributed to the waveforms create differently distributed clusters.
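The automatic cluster cutting just described, where each waveform is assigned to its closest template under a distance measure, can be sketched as follows. This is a minimal illustration only: Euclidean distance, the two-dimensional toy points and the function name `assign_to_templates` are assumptions made here, not details taken from any of the cited packages.

```python
import math

def assign_to_templates(points, templates):
    """Label each feature vector with the index of its nearest template (mean).

    Euclidean distance is an assumption. The implicit boundary between two
    templates is then the plane equidistant from both, which ignores how
    widely each cluster is spread.
    """
    labels = []
    for p in points:
        distances = [math.dist(p, t) for t in templates]
        labels.append(distances.index(min(distances)))
    return labels

# Two templates; the third point is assigned to template 0 purely on distance,
# even if it was generated by a wide cluster centred on template 1.
templates = [(0.0, 0.0), (4.0, 0.0)]
points = [(0.5, 0.2), (3.8, -0.1), (1.9, 0.0)]
labels = assign_to_templates(points, templates)
```

The third point shows the weakness discussed above: a purely distance-based boundary takes no account of a cluster's spread.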

The technique proposed in this paper has the three following objectives, resolving some of the problems stated above (see also Lewicki, 1998):

  1. To form distinguishable clusters (where each represents a biologically plausible firing rate) by appending extra feature components, which describe the geometrical structure of the waveform, to the PCA components.
  2. To identify, automatically, the number of clusters within the feature space.
  3. To reduce the number of waveforms classified to the incorrect groups.

We deal with these objectives by extending a version of an unsupervised neural network (Kohonen), resulting in a fully automated process, which is a vital consideration when dealing with multiple electrode recordings.

The first part of our spike sorting methodology pre-processes the waveform data using a new combined approach. First, we acquire the PCA components and, if necessary, append to them features that describe the waveform’s geometrical shape. Whether additional features are used, and how many, depends on the number of clusters formed and whether they represent neurons with a plausible firing rate.
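The combined pre-processing step can be illustrated with a short sketch that computes PCA scores and appends shape descriptors to them. The particular descriptors used here (peak height, trough depth, peak-to-trough lag) are hypothetical stand-ins, since this text does not list the geometric features actually used; the function names and the random stand-in data are likewise assumptions.

```python
import numpy as np

def pca_scores(waveforms, n_components=3):
    """Project mean-centred waveforms onto their leading principal components."""
    centred = waveforms - waveforms.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T

def geometric_features(waveforms):
    """Hypothetical shape descriptors: peak height, trough depth, peak-to-trough lag."""
    peak = waveforms.max(axis=1)
    trough = waveforms.min(axis=1)
    lag = waveforms.argmin(axis=1) - waveforms.argmax(axis=1)
    return np.column_stack([peak, trough, lag])

def combined_features(waveforms, n_components=3):
    """Append the shape descriptors to the PCA scores, one row per spike."""
    return np.hstack([pca_scores(waveforms, n_components),
                      geometric_features(waveforms)])

rng = np.random.default_rng(0)
waveforms = rng.normal(size=(200, 48))  # stand-in data: 200 spikes, 48 samples each
features = combined_features(waveforms)  # shape (200, 6)
```

In practice the appended columns would probably need rescaling so that no single feature dominates the distance calculations; that choice is left open here.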

The next part of the process identifies the number of clusters within the feature space, using a proposed extension to the Kohonen method. The standard Kohonen network requires a number of outputs that corresponds to the number of clusters (neurons) within the feature space. As we would have no prior knowledge of this, we acquire a sufficient number of outputs by calculating the maximum number of neurons that could feasibly be in the dataset and multiplying it by an arbitrary number. Thus, after training, every cluster will be represented by a subset of the outputs, one of which will be nearer the centre (denser area). We subsequently identify these outputs by comparing the distributions that each represents. Consequently, a set of non-central nodes arises which we use to reduce misclassification (next stage).
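To make the idea of an over-provisioned Kohonen network concrete, the sketch below trains a standard one-dimensional Kohonen map whose node count is a feasible maximum number of neurons multiplied by an arbitrary factor. It shows only the standard training loop, not the proposed extension that compares distributions to pick out central nodes. The learning-rate and neighbourhood schedules, the values of `max_neurons` and `factor`, and the Gaussian toy data are all assumptions chosen for illustration.

```python
import numpy as np

def train_kohonen(data, n_nodes, n_epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a 1-D Kohonen map; returns node weights of shape (n_nodes, n_features)."""
    rng = np.random.default_rng(seed)
    # Initialise the nodes on randomly chosen data points.
    weights = data[rng.choice(len(data), size=n_nodes, replace=False)].astype(float)
    positions = np.arange(n_nodes)
    n_steps = n_epochs * len(data)
    step = 0
    for _ in range(n_epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                   # learning rate decays linearly
            sigma = max(sigma0 * (1.0 - frac), 0.5)   # neighbourhood shrinks, with a floor
            winner = np.argmin(np.linalg.norm(weights - x, axis=1))
            influence = np.exp(-((positions - winner) ** 2) / (2.0 * sigma ** 2))
            # Move the winner and its map neighbours towards the sample.
            weights += lr * influence[:, None] * (x - weights)
            step += 1
    return weights

max_neurons = 3   # hypothetical upper bound on neurons per electrode
factor = 3        # the arbitrary multiplier; value chosen for illustration
n_nodes = max_neurons * factor

rng = np.random.default_rng(1)
centres = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 0.0], [0.0, 5.0, 5.0]])
data = np.vstack([c + rng.normal(scale=0.5, size=(100, 3)) for c in centres])
weights = train_kohonen(data, n_nodes)
```

With more nodes than clusters, several nodes are expected to settle within each cluster, which is the situation the extension described above then exploits.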

The final part of our proposed process (second extension) is to define the waveforms belonging to each cluster using all the nodes, i.e. central and non-central nodes. We first move (i.e. space out) the non-central nodes around and towards their cluster’s edge. We subsequently use them with the central nodes in the classification process, where implicit boundaries are formed to contain every cluster’s size and shape. This addresses the problem of misclassification: if several clusters are in close proximity and one has a wider distribution, an implicit boundary would encompass this area, reducing the likelihood of misclassification.
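The effect of classifying with both central and non-central nodes can be shown with a small sketch, in which each point takes the cluster of its nearest node. The node positions below are hand-placed for illustration and the procedure for moving non-central nodes towards a cluster's edge is not reproduced; Euclidean distance and the name `classify_with_nodes` are assumptions.

```python
import math

def classify_with_nodes(points, nodes, node_cluster):
    """Assign each point the cluster of its nearest node (central or non-central)."""
    labels = []
    for p in points:
        nearest = min(range(len(nodes)), key=lambda i: math.dist(p, nodes[i]))
        labels.append(node_cluster[nearest])
    return labels

# Cluster 0 is tight around (0, 0); cluster 1 is wide around (4, 0).
central_nodes = [(0.0, 0.0), (4.0, 0.0)]
central_cluster = [0, 1]

# Hypothetical non-central nodes of cluster 1, spaced out towards its edge.
all_nodes = central_nodes + [(2.0, 0.0), (4.0, 2.0), (4.0, -2.0), (6.0, 0.0)]
all_cluster = central_cluster + [1, 1, 1, 1]

point = [(1.7, 0.0)]  # lies in the wide cluster's tail, close to cluster 0
central_only = classify_with_nodes(point, central_nodes, central_cluster)
with_edges = classify_with_nodes(point, all_nodes, all_cluster)
```

Using the central nodes alone, the point falls on cluster 0's side of the boundary; once the edge nodes are included, the boundary shifts to enclose the wider cluster and the point is assigned to cluster 1.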

We test the new approach using datasets acquired from the rat olfactory bulb and the sheep temporal cortex. We demonstrate that our method out-performs others, and show that these processes can extract artifactual waveforms from the data, limiting inaccuracies in the analysis of the results.

Section snippets

Extracting experimental spike data

The spikes used in our spike sorting process were extracted from the continuous electrophysiological trace sampled at an electrode when the signal exceeded a given threshold (2 × background noise). Each spike was sampled for a period spanning from 0.4 ms before to 1.25 ms after the threshold was crossed. For each channel of data, the width of the spikes collected was identical throughout the period of recordings.
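Threshold-triggered extraction of this kind can be sketched as below. The sketch assumes upward threshold crossings, a noise estimate supplied by the caller, and window lengths given in samples rather than milliseconds; the function name, the toy trace and all numeric values are hypothetical, and how the background noise was actually measured is not stated here.

```python
def extract_spikes(trace, threshold, n_pre, n_post):
    """Cut a fixed window around each upward threshold crossing.

    n_pre samples are kept before the crossing and n_post samples from the
    crossing onwards; windows that would run off either end are skipped.
    """
    spikes = []
    i = 1
    while i < len(trace):
        crossed = trace[i - 1] < threshold <= trace[i]
        if crossed and i - n_pre >= 0 and i + n_post <= len(trace):
            spikes.append(trace[i - n_pre:i + n_post])
            i += n_post  # skip past this window so one spike is not cut twice
        else:
            i += 1
    return spikes

# Toy trace: flat baseline with two brief deflections.
trace = [0.0] * 40
trace[10:13] = [3.0, 5.0, 2.0]
trace[25:28] = [4.0, 6.0, 1.0]
noise_level = 1.0  # hypothetical background-noise estimate
spikes = extract_spikes(trace, threshold=2 * noise_level, n_pre=3, n_post=6)
```

Because every window has the same length, the extracted spikes can be stacked directly into a matrix for feature extraction.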

Each spike collected was comprised of 48 sample points (t = 1, …, 48), the signal

Feature extractions

The first stage of our methodology is to extract the features that differentiate spikes, thus revealing clusters within the feature space. Principal component analysis (PCA) will automatically extract these features (Bishop, 1995; Csicsvari et al., 1998; Jolliffe, 2002). However, it appears that this is not always reliable and can produce results which are not biologically plausible. For example (bottom panel, Fig. 5), PCA formed one cluster suggesting that one neuron was recorded with a firing

Simulated data

To be able to explain this methodology we simulated sets of feature components, where a set of three components (C1–C3) represents a waveform.
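A simulated feature set of this form, with three components per waveform, might be generated along the following lines. The Gaussian scatter, the cluster centres and spreads, and the function name `simulate_clusters` are assumptions made for illustration; the generating process actually used is not specified in this text.

```python
import random

def simulate_clusters(centres, spreads, n_per_cluster, seed=0):
    """Draw three-component points (C1, C2, C3) around each centre.

    Gaussian scatter is an assumption; each cluster gets its own standard
    deviation so that the cluster distributions can differ.
    """
    rng = random.Random(seed)
    points, labels = [], []
    for label, (centre, spread) in enumerate(zip(centres, spreads)):
        for _ in range(n_per_cluster):
            points.append(tuple(rng.gauss(c, spread) for c in centre))
            labels.append(label)
    return points, labels

# Two closely positioned clusters, one with a wider distribution.
points, labels = simulate_clusters(
    centres=[(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    spreads=[0.3, 1.0],
    n_per_cluster=100,
)
```

Keeping the true labels alongside the points makes it possible to count misclassifications when a sorting method is later applied to the simulated set.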

Noise extraction

Electrical noise is almost invariably incorporated into electrophysiological data. This often originates from electrical equipment, e.g. from electrical actuators for delivering stimuli. Such signals may exceed the spike-triggering threshold and may resemble spikes in their form and amplitude. They are often difficult to exclude from electrophysiological datasets. Here we introduce a process for extracting waveforms that have been caused by noise, i.e. artifacts, from the spike waveform

Results generated from using the spike sorting process on simulated data

To test the efficiency of our process, we created two simulated datasets. The first contained three clusters, which were positioned apart and had similar distributions. The second contained two closely positioned clusters, where one had a wider distribution. We used SOMA on these datasets with an arbitrary number of nodes. We used 10 as this exceeded the number of clusters. The analysis on both datasets using all 10 nodes with SOMA appeared more effective and reliable than the method based on

Datasets

Electrophysiological data was acquired from two animal systems, the olfactory bulb (OB) of anaesthetized rats, and the inferotemporal cortex of awake behaving sheep. All experimental procedures involving animals were conducted in strict accordance with the Animals (Scientific Procedures) Act, 1986. Rats were anaesthetized (25% urethane, 1500 mg/kg) and fixed in a stereotaxic frame. A craniotomy was performed and the left eye enucleated to expose the left OB and allow a 30 channel MEA (6×5

Discussion

A comprehensive understanding of the principles employed by the brain in processing information requires sampling the activity of many individual neurons in neural systems. We have presented here a novel approach combining several machine learning techniques to identify the number of cells, and their activity, from multielectrode recordings.

SOMA is a fully automated process that requires minimal user involvement and is computationally inexpensive; these are vital considerations when sorting

Acknowledgements

J.F. was partially supported by grants from UK EPSRC (GR/R54569), (GR/S20574), and (GR/S30443). K.M.K. was supported by a BBSRC project grant.
