Classification of ECG signals using Hermite functions and MLP neural networks

Classification of heart arrhythmia is an important step in developing devices for monitoring the health of individuals. This paper proposes a three module system for classification of electrocardiogram (ECG) beats. These modules are: denoising module, feature extraction module and a classification module. In the first module the stationary wavelet transform (SWF) is used for noise reduction of the ECG signals. The feature extraction module extracts a balanced combination of the Hermit features and three timing interval feature. Then a number of multi-layer perceptron (MLP) neural networks with different number of layers and eight training algorithms are designed. Seven files from the MIT/BIH arrhythmia database are selected as test data and the performances of the networks, for speed of convergence and accuracy classifications, are evaluated. Generally all of the proposed algorisms have good training time, however, the resilient back propagation (RP) algorithm illustrated the best overall training time among the different training algorithms. The Conjugate gradient back propagation (CGP) algorithm shows the best recognition accuracy about 98.02% using a little amount of features.


Introduction
Development of accurate and quick methods for automatic ECG classification is vital for clinical diagnosis of heart disease [1].An arrhythmia is any abnormal cardiac rhythm [2].Heart arrhythmias result from any disturbance in the rate, regularity, and site of origin or conduction of the cardiac electric impulse [1].Among the various abnormalities related with functioning of the human heart, Premature Ventricular Contraction (PVC) is one the most important arrhythmias.PVC is the contraction of the lower chambers of the heart (the ventricles) that occur earlier than usual, because of abnormal electrical activity of the ventricles [2].This paper investigates the detection and classification of PVC arrhythmias.In the literature, several methods have been proposed for the automatic classification of ECG signals.The recently published papers are presented in [3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20].In [3], the authors used the discrete wavelet transform as the feature extractor and linear discriminants as the classifier for PVC beat classification and achieved the recognition accuracy (RA) about 95.6%.In [5] the authors used a feed forward neural network as classifier.They derived five features including the QRS width and offset, amplitude of R segment, the T segment slope and the R-R interval duration for PVC beat classification.In their work RA is about 96.85%.In [6], the authors used wavelet feature extraction in tandem with fuzzy neural network classification for PVC beat classification with RA about 98.20%.In [7], the authors used a multilayer perceptron (MLP) neural network classifier and achieved an accuracy of 88.3% in their testing set.In [8], the authors used morphological information as the features and a neural network classifier for differentiating the ECG beats including PVC beats.They achieved RA about 97%.The method presented in [15] is based on a hybrid fuzzy neural network that consists of a fuzzy self-organizing sub-network connected in a cascade with a multilayer perceptron.The authors proposed to use highorder statistics as input features for feeding their classifier for recognition of PVC beat from the others.They achieved RA about 98.20%.In [18], an automatic online beat segmentation and classification system based on a Markovian approach is proposed.According to these works, some important issues of designing an ECG classification system become clear.Addressing these factors suitably lead to development of more robust and efficient recognizers.One of these issues is related to the preprocessing module.Another issue is selection of the classification approach.In particular, the training algorithms of the neural network classifiers.Also, selection of the proper features plays an important role.Here, we present an efficient system for classification of the normal heartbeats, PVC and other abnormalities.In the preprocessing module, an un-decimated wavelet transform is used to provide an informative representation that is both robust to noise and tuned to the morphological characteristics of the waveform features.For the feature extraction module we have used the Hermite functions to represent ECG signals.Then we investigated several Multi-Layer Perceptron neural networks by varying number of hidden layers and neurons of each layer.Multiple algorithms were employed for training these neural networks.The performance of these algorithms was compared in several experiments.Figure 1 shows a general scheme of the proposed method.The paper is organized as follows.Section 2 describes the preprocessing module.Section 3 explains the feature extraction module.Section 4 presents the classifier.Section 5 describes the database and performance metrics.Section 6 shows some simulation results.Section 7 discusses the results and concludes the paper.

Signal preprocessing
Noise reduction is an important problem for analysis of ECG signals.The most troublesome noise sources are the electrical activity of muscles (EMG) and instability of electrode-skin contact [21].For removal of such noise, an advanced signal processing method, such as discrete wavelet transform (DWT) denoising technique [22], should be used.However, DWT is a timevariant transform.To overcome this problem, we used the stationary wavelet transform (SWT) which is also known as the undecimated wavelet transform (UWT) or translation-invariant wavelet transform.SWT uses the average of several denoised signals that are obtained from the coefficients of ε-decimated DWT [23].Suppose the signal

Preprocessing
The SWT is given by: where, 2 , , , is the complex conjugate of the mother wavelet.Figure 2 shows the block diagram of SWT.In this figure, the blocks of H(z) and Hr(z) are the decomposition and reconstruction high pass filters, respectively and the blocks of G(z) and Gr(z) are low pass filters.d(•, •) denotes the decomposition coefficients and a(•, •) denotes the approximation coefficients.This figure shows a three level decomposition.For denoising, we have used the Daubechies wavelet functions (db1) with decomposition level of five, based on our extensive experiments.Figure 3 shows ECG signals before and after denoising.For smoothing the ECG signals, we have used the Savitsky-Golay (SG) filtering method [23].The filter coefficients are achieved by the un-weighted linear least-squares fitting method using a polynomial function.For this reason, the Savitzky-Golay filter is also called a digital smoothing polynomial filter or a least-squares smoothing filter.A higher degree of polynomial makes it possible to achieve a high level of smoothing without attenuation of the data features.The Savitzky-Golay filtering method is often used for frequency information or spectroscopic information.For the first type, it preserves the high-frequency components of the signal and for the second type it preserves higher moments of the Peak.In this paper we have used the first type of SCG.

Feature extraction
Feature extraction plays an important role in any classification task.It is clear that there is a great variety of morphologies among the heartbeats which belongs to one class, even for the same patient.Moreover, beats belonging to different classes are morphologically similar to each other.They occupy a similar range of values and frequencies; thus, it is difficult to recognize one from the other on the basis of only time or frequency representations.Different feature extraction techniques have been applied.Traditional representations include features describing the morphology of the QRS complex, such as RR intervals, width of the QRS complex [3,14,15,16,2], wave interval and wave shape features [15].Some authors have processed features resulting from Fourier [25] or wavelet transformations [3] of the ECG.Clustering of the ECG data, using methods such as self-organizing maps [16] or learning vector quantization [4], as well as internal features resulting from the neural preprocessing stages [4] have been also exploited.
Other important feature extraction methods generate statistical descriptors [24] or orthogonal polynomial representations [16,25].None of these methods is of course perfect and fully satisfactory.
In this paper we will illustrate supervised classification applications that rely on the processing of features originating from the description of the QRS complex by using the higher-order statistics and Hermite basis functions expansion.The HOS description exploits the fact that the variance of cumulant functions is usually lower than the variance of the original signals.
On the other hand, the Hermite expansion takes advantage of the similarity of the individual Hermite functions and different fragments of QRS complex of the ECG waveform.Coefficients of the Hermit function can describe the ECG signals.
In the Hermite basis function expansion method, the QRS complex is represented by a series of Hermite functions.This approach successfully exploits existing similarities between the shapes of Hermite basis functions and QRS complexes of the ECG waveforms under analysis.Moreover, this characterization includes a width parameter, which provides good representation of beats with large differences in QRS duration.Let us denote the QRS complex of the ECG curve by x(t).Its expansion into Hermite series may be written in the following way: where, cn shows the expansion coefficients, σ is the width parameter, and φn(t, σ) shows the Hermite basis functions of the nth order defined as follows [16]: And Ht is the Hermite polynomial of the nth order.The Hermite polynomials satisfy the following recurrence relation: For example: ( ) 6 3 The higher order of the Hermite polynomials shows better quick changes in the time.The coefficients of Hermite basis functions expansion, cn, may be treated as the features used in the recognition process.They may be obtained by minimizing the sum square error, defined as: This error function represents the set of linear equations with respect to the coefficients cn.They have been solved by using singular value decomposition (SVD) and the pseudo-inverse technique [26].In numerical calculations, we have represented the QRS segment of the ECG signal by 181 data points around the R peak (90 points before and 90 points after).A data sampling rate equal to 360 Hz generates a window of 250 ms, which is long enough to cover a typical QRS complex.The data have been additionally expanded by adding 45 zeros to the end of each QRS segment.This additional information is added to reinforce the idea that the beats do not exist outside the QRS complex.Subtracting the mean level of the first and the last points normalizes the ECG signals.The width σ was chosen proportional to the width of the QRS complex.These modified QRS complexes of the ECG have been decomposed onto a linear combination of Hermite basis functions.We can solve the (2) with this matrix: S is number of data which are selected from one signal.i t (i=1,2,..s) shows the ith datum of one signal.Because the above matrix is not squared, we have used the pseudo-inverse technique.

Neural network classifier
We have used MLP neural networks as classifiers.
An MLP neural network has several layers.One input layer of source nodes and one or more hidden layers of computation nodes (neurons) and one output layer.The problem of learning algorithm is very crucial for MLP neural network.Back-propagation (BP) learning algorithm is still one of the most popular algorithms [27].In BP the weight values are updated by a simple gradient descent algorithm: where, k g is the current gradient, k a is the learning rate and k w is a vector of current weights and biases.With standard steepest descent, the learning rate is held constant throughout training, that causes the algorithm be very sensitive to the correct setting of the learning rate.If the learning rate is set too high, the algorithm may oscillate and we may have an unstable algorithm.On the other hand, if the learning rate is set too small, it may take a long time to converge.However, by allowing changes in learning rate change during the training process, the performance of the algorithm can be improved.An adaptive learning rate will attempt to keep the learning step size as high as possible while keeping the learning stable [28].In Variable Learning Rate Back-propagation (GDX) algorithm the adaptive learning rate with momentum training is used.Gradient descent with momentum is one of the versions of gradient descent that not only allows a network to respond to the local gradient, but also allows it to response to the recent trends in the error surface.Multilayer networks usually use sigmoid transfer functions in the hidden layers.These functions can compress an infinite input range into a finite output range.
Using the steepest descent training method causes a big problem, because the gradient may have a very small magnitude and, therefore, cause small changes in the weights and biases.To solve these problems, the resilient back-propagation (RP) algorithm is proposed [27].The training phase of the BP algorithm network is usually timeconsuming.The conjugate gradient algorithms are proposed to overcome this issue.In the conjugate gradient algorithms, in order to determining the step size, search is made along the conjugate gradient path, which will minimize the performance function along that line.This will make generally faster convergence than steepest descent directions.The conjugate gradient algorithms start out by searching in the steepest descent direction (negative of the gradient) on the first iteration: Then, a line search is performed to determine the optimal distance to move along the current search direction: The next search direction is then determined so that it is conjugate of the previous search directions.The global method for determining the new search direction is to combine the new steepest descent direction with the previous search direction: The way in which the k  constant is computed and distinguishes the various versions of conjugate gradient.For the Fletcher-Reeves update the method is: Equation ( 12), shows the ratio of the norm squared of the current gradient to the norm squared of the previous gradient.Another version of the conjugate gradient algorithm was introduced by Polak and Ribiére [29].As with the Fletcher-Reeves algorithm, the search direction at iteration is determined by: For the Polak-Ribiére update, the constant k  is caculated by: For all conjugate gradient algorithms, the search direction is periodically reset to the negative of the gradient.The standard reset point occurs when the number of iterations is equal to the number of network parameters (weights and biases), but there are other reset methods that can improve the efficiency of training.One such reset method was proposed in [30], the line search requires that the network response to all training inputs be computed several times for each search so that it is computationally expensive.The scaled conjugate gradient algorithm (SCG) was developed by Moller [29].Newton's method is an alternative to the conjugate gradient methods for fast optimization.The basic step of Newton's method is: where, A is the Hessian matrix (second derivatives) of the performance index at the current values of the weights and biases.Because of the expensive calculation of the Hessian matrix, usually some of algorithms which don't require the calculation of second derivatives are introduced.These are called quasi-Newton (or secant) methods.The quasi-Newton method, which has been most successful in published studies, is the Broyden, Fletcher, Goldfarb, and Shanno (BFGS) update [29].The one step secant (OSS) method is an attempt to bridge the gap between the conjugate gradient algorithms and the quasi-Newton (secant) algorithms.This algorithm does not store the complete Hessian matrix.The Levenberg-Marquardt (LM) algorithm [31] uses the approximation to the Hessian matrix in the following Newton-like update: where, J is the Jacobian matrix, e is a vector of network errors, and µ is a constant.

Database and performance metrics 5.1. Performance metrics
One of the significant issues in ECG beat classification is how to appropriately evaluate the performance of a proposed method.In this study, we have considered three statistical indices: Accuracy (Acc), Sensitivity (Se) and positive Predictivity (PP) which are computed in ( 17)- (19), respectively.Accuracy is the most important metric for determining overall system performance.We defined the overall accuracy of the classifier as follows: TE T NN Acc 100 N   (17) where, E N is the total number of errors in classification and T N is the total number of beats in classification.Equation (18), represents the ratio of the number of correctly detected events, TP (true positives), to the total number of events: Pp, is the ratio of the number of correctly detected events, TP, to the total number of events detected by the analyzer and is given by: Sensitivity measures how successfully a classifier recognises beats of a certain class without missing them, whereas positive predictivity measures how exclusively it classifies beats of a certain type [32].

MIT-BIH arrhythmia database
The MIT-BIH arrhythmia database [33] was used as the data source in this study.The database contains 48 recordings.Each record has a duration of 30 minutes and includes two leads; the modified limb lead II and one of the modified leads V1, V2, V4 or V5.The sampling frequency is 360 Hz, the data are bandpass filtered at 0.1-100 Hz and the resolution is 200 samples per mV.Twenty-three of the recordings are intended to serve as a representative sample of routine clinical recordings and 25 recordings contain complex ventricular, junctional, and supraventricular arrhythmias.There are over 109,000 labelled ventricular beats from 15 different heartbeat types.There is a large difference in the number of examples in each heart beat type.The largest class is "Normal beat" with about 75,000 examples and the smallest class is "Supraventricular premature beat" (SP) with just two examples.
The database is indexed both in timing information and beat classification.For more details about MIT-BIH Arrhythmia database see [34].We used a total of seven records marked as: 100, 101, 102, 104, 105, 106, and 107 in the database.We extracted a total of 15,506 beats; 8,405 normal beats, 625 abnormal PVC arrhythmia beats, and 6,476 other arrhythmic beats.We used the database index files from database to locate beats in ECG signals.

MLP neural network architectures and training algorithms
Various network architectures were evaluated to find an optimum solution for ECG signal diagnosis problem.In order to evaluate system performance, the number of hidden layers as well as the number of neurons in the hidden layers was varied in different experiments.The output (target) vector is defined with a combination of 1s or 0 s to represent each of the classes being recognized.To assign the ECG waveforms into one of three different classes, the number of neurons in the output layer is set to be three.Table 1 shows the ECG classes and representation of the desired neural network outputs.We have investigated four architectures of neural networks.Two architectures, NET1 and NET2 have a single hidden layer with X neurons which was set to 35 and 45 respectively.A tan-sigmoid transfer function was selected for hidden layers.The output layer transfer functions for fitting problems are generally selected to be linear, whereas for pattern recognition problems it may be a sigmoid function.However, we propose two other network architectures NET3 and NET4 with two hidden layer.NET3 and NET4 also have three neurons in output layer.Hidden layers of NET3 have 30 and five neurons respectively and these values for NET4 are 20 and five neurons.Other parameters for these four architectures are the same.In order to facilitate the performance comparison of different training algorithms, we have used their acronyms.Table 2 shows these training algorithms and their acronyms.

Figure 1 .
Figure 1.General scheme of the proposed method.

Figure 2 .
Figure 2. Block diagram of SWT.H(z) and Hr(z) are the decomposition and reconstruction high pass filters.G(z) and Gr(z) are low pass filters.Term d(•, •)denotes the decomposition coefficients and a(•, •)denotes the approximation coefficients.