Predicting Ventricular Defibrillation Results Using Learning Models: A Design Practice and Performance Analysis

This work proposes a learning model to predict the outcome of electrical defibrillation from ECG signals recorded during ventricular fibrillation (VF), a lethal condition that occurs when a patient suffers cardiac arrest. An animal experiment on rats is conducted to obtain the ECG signals and the necessary information for this study. The proposed model extracts only one feature from the ECG signals and has low computational complexity at both the training and testing stages. The statistics of this single feature are further analyzed, and closed-form formulas for several performance indices, including sensitivity, specificity, accuracy, precision, and area under the curve (AUC), are obtained to gain more insight into the proposed system. Moreover, the extracted feature can be treated as a linear combination of the individual frequency components of the ECG signal, where the combining coefficients may support clinical inference. Frequencies with large trained combining coefficients contribute more to distinguishing the defibrillation outcome, and vice versa. As a result, important frequencies of the ECG signals can be identified, and insignificant frequencies can be filtered out by the proposed training. Simulation results corroborate the analytical results and show that the proposed scheme greatly outperforms several competitive learning models and traditional methods in terms of testing accuracy and computational complexity.

To avoid frequent electrical defibrillation, one approach is to find the best timing for defibrillation or to predict its probability of success. Several studies have been conducted in this field. To name a few: the popular amplitude spectrum area (AMSA) was proposed by Young et al. in [2]. AMSA evaluates the area of the amplitude spectrum of the ECG signal and has been used for evaluating the effort of cardiopulmonary resuscitation (CPR), predicting the success of defibrillation, and suggesting the optimal defibrillation timing (see, e.g., [2]-[6]). In [7], the authors found that the shortest waveform that still retained predictive characteristics was 0.8 seconds for AMSA in the frequency domain and 1.8 seconds for median slope (MS) in the time domain, and they discussed that a 1-second waveform generally contains the information needed for deriving waveform measures. In [8], the authors also overcame the general reduction in performance when classifying VF with chest compression by using a support vector machine (SVM) to classify the combined information from 24 individual measures. The ECG of pigs was analyzed in [9] with a blind source separation (BSS) method that involved the short-time Fourier transform (STFT), singular value decomposition (SVD), and independent component analysis (ICA). The work in [10] distinguished ventricular tachycardia (VT) from VF and treated the overlapping case as VT-VF. In [11], the authors found that the minimum duration giving accurate prediction with a sample entropy predictor in the time domain was 1.5 seconds. A predictor called the smoothed nonlinear energy operator (SNEO) was proposed in [12], which performed well with segments as short as 2 seconds. The authors in [13] synthesized VF with white noise for trend analysis. A method called detrended fluctuation analysis (DFA) was proposed in [14] specifically to predict the result of first-shock defibrillation. 
The authors analyzed the variance of ECG in [15] and proposed a neural network model in [16] for first-shock result prediction.
In general, the signals used for analysis can be categorized as time domain, frequency domain, or both. In addition, some VF signals are recorded during chest compression. The related works can be summarized as follows: VF without chest compression was analyzed in [7]-[12], while the continuous waveform of VF with and without chest compression was analyzed together in [15] and [16]. The differences between analyzing VF with and without chest compression were compared in [8]. The time domain and the frequency domain were both analyzed in [7]-[11], whereas the time domain was the main focus in [12]-[14], and the frequency domain was the main focus in [15] and [16].
Recent research efforts have sought to solve this issue using machine learning; see, e.g., [15] and [16]. Most of these works attempt to build a learning model that best predicts the outcome of defibrillation. However, there is still room for improvement, including 1) reducing the number of extracted features and 2) decreasing the computational complexity of the learning model. For 1), too many features may lead to overfitting, while too few features may result in poor prediction. In the current literature, a suitable number of features is usually determined via simulation. Item 2) is important for implementation in devices with limited hardware and power. In addition to 1) and 2), few studies have built a mathematical model for the extracted features or provided clinical inference, via the extracted features and learning models, about what may cause a successful or failed outcome. These observations motivate us to explore this topic, to address 1) and 2), and to propose a statistical model for the extracted features that provides informative clinical inference for the defibrillation outcome.
In this paper, we propose a new learning model to predict the electrical defibrillation outcome. An animal experiment on rats was conducted to capture the ECG signals and the necessary information for this work. The proposed learning model contains two modules: a feature extraction module and a classification module. In the feature extraction module, the proposed model extracts only one feature. We analyze this extracted feature and find that its distribution is close to Gaussian for both successful and failed outcomes. Consequently, classical detection can be applied in the classification module, and the corresponding theoretical results are available for the proposed model. We derive closed-form formulas for several performance indices of the proposed system, including sensitivity, specificity, accuracy, precision, and area under the curve (AUC). These closed-form formulas help in gaining more insight into the proposed system. Moreover, the feature extraction module is linear and has low computational complexity at both the training and testing stages. Thanks to the linearity, the feature extraction module can be treated as an equivalent system that linearly combines the frequency-domain components of the input ECG signal. As a result, the extracted feature is a linear combination of the frequency components, where the combining coefficients after training can be regarded as the contributions of the individual frequency components. Large weights imply that the corresponding frequency components are important in distinguishing the successful and failed cases, and vice versa. The values of the trained weights may provide informative clinical inference. For instance, the weights at 0 Hz and 60 Hz are small; these frequencies correspond to DC (direct current) and the electrical supply, respectively. 
This means that the proposed training model "filters out" these large interference sources, which are irrelevant to the decision. On the other hand, we find that some frequencies have large weights, and the simulation results also indicate that the successful and failed cases differ significantly at these frequencies. Simulation results corroborate the analytical results and show that the proposed scheme greatly outperforms several competitive learning models as well as traditional methods such as AMSA [2] and DFA [14] in terms of testing accuracy and computational complexity.
The rest of this paper is organized as follows: In Section II, we propose the learning model and the corresponding mathematical derivations. Section III analyzes the proposed system and obtains theoretical results. Simulation results are provided in Section IV, and conclusions are summarized in Section V.

II. PROPOSED SYSTEM AND METHODS
A block diagram of the proposed system is shown in Fig. 1, which consists of the signal acquisition system, preprocessing, feature extraction, and classification modules.

A. ANIMAL EXPERIMENTS AND SIGNAL ACQUISITION
Fig. 2 shows a quick view of the lab setup for the animal experiments and signal acquisition. The whole experimental process, from VF induction to the cycles of CPR and electrical defibrillation, is shown in Fig. 3. The experimental subjects, male Wistar rats, were controlled at a weight of 450 ± 50 g and an age of 14 weeks. Each rat was anesthetized with an intraperitoneal injection of sodium pentobarbital at 50 mg/kg body weight so that it could stay still for at least 1 hour. Further injections were given as the situation required. A ventilator supplied air through tracheal intubation with a PE 200 catheter at a tidal volume of 0.65 mL/100 g body weight, a fraction of inspired oxygen (FiO2) of 1.0, and a frequency of 100 breaths/min. The original ECG signal was probed as lead III, a differential signal between the left arm and the left leg with the right leg connected to ground, so that we could observe the electrical activity of the interior of the heart. Body temperature was monitored through a thermodilution catheter, which was inserted through the left femoral artery and advanced into the abdominal aorta. Other physiological signal observations and environmental settings were the same as in [17]. While keeping the body temperature at 37 ± 0.5 °C, we induced cardiac arrest in the rat by applying a 60 Hz current, progressively increased to 1 mA, through the guidewire for 1.5 minutes, and then waited for 3.5 minutes to make sure that no spontaneous defibrillation occurred. When starting CPR, we first applied chest compression for 1 minute and then gave an electrical shock of 3 J for defibrillation. If VF persisted, each remaining CPR cycle consisted of 30 seconds of compression followed by an electrical shock, until ROSC, PEA, or asystole occurred. After 2 electrical shocks, the shock energy was raised to 5 J. The gap between compression and electrical shock was about 1.3 to 1.5 seconds. The judgment time for the defibrillation result was set to 5 seconds, the same as in [14], counted from the release of the electrical shock.
The results of defibrillation were separated into four types: 1) return of spontaneous circulation (ROSC), 2) pulseless electrical activity (PEA), 3) asystole, and 4) persistent VF. When any of the first three types occurred, there was no VF within the judgment time, so paramedics would stop electrical defibrillation. Hence, these results were categorized as successful defibrillations, while the last type was considered a failed one. A similar categorization was used in [18]. An example of successful defibrillation is shown in Fig. 4.
The electrical signal was collected and amplified to a scale that is easier to observe, then sampled and digitized. The signal was amplified by an instrumentation amplifier. The gain was set to 2000 V/V by adjusting a potentiometer, which was connected to specific pins, to about 25 Ω. The bandwidth was higher than 1 kHz, and the common-mode rejection ratio (CMRR) was about 130 dB.
The amplified signal was sampled and digitized by a National Instruments multifunction I/O device, the USB-6351 (NI 6351). The supply voltage of ±10 V was also provided by the NI 6351. When the analog signal was digitized at a resolution of 16 bits, the minimum recognizable voltage was ±0.31 mV. The peaks of the amplified ECG signals stayed within ±10 V, i.e., within the voltage limit of the NI 6351. Thus, there was no clipping during signal acquisition.
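The quantization step quoted above can be checked with a short calculation. The sketch below assumes an ideal 16-bit converter spanning the full ±10 V range; the input-referred figure, obtained by dividing by the amplifier gain, is our own illustrative derivation and is not stated in the experiment description.

```python
# Quantization step (LSB) of an ideal ADC, using the values given in the text.
full_scale_range_v = 20.0   # a +/-10 V input range spans 20 V
n_bits = 16                 # converter resolution
gain = 2000.0               # instrumentation amplifier gain in V/V

# One least-significant bit at the ADC input.
lsb_v = full_scale_range_v / (2 ** n_bits)

# Same step referred back to the electrodes (assumption: ideal, noiseless amplifier).
lsb_input_referred_v = lsb_v / gain

print(f"LSB at the ADC input: {lsb_v * 1e3:.3f} mV")
print(f"LSB referred to the electrodes: {lsb_input_referred_v * 1e6:.3f} uV")
```

The first figure is about 0.31 mV, in line with the resolution reported for the acquisition device.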
The collected data is divided into two non-overlapping groups. One is for training and the other is for testing. For the training data, 294 of them are labeled as fail and 67 as success; while in the testing data, 192 are labeled as fail and 18 as success.

B. PREPROCESSING
The raw data is preprocessed using the following procedure. The raw data is transformed from the time domain to the frequency domain, and its magnitude is used as the input to the next stage. Before the transform, the VF segments are extracted as follows. From Fig. 4, the VF period between pressing and charging lasts only 1.5 seconds. Hence, we extract the 1.3 seconds of VF data immediately before charging. From this 1.3-second segment, we further extract 1-second windows, shifting the window by 0.01 second each time and starting from the end of the data, until we obtain a total of 30 one-second pieces. The extraction procedure is shown in Fig. 5.
In the experiment, 1000 samples are gathered per second, i.e., a 1 kHz sampling rate, which is widely used for recording the ECG of rats. Hence, each extracted 1-second VF window has a data dimension of 1000 ($N = 1000$), and the data set is augmented by a factor of 30. To reduce the imbalance in prior probabilities between the failed and successful classes in the training and testing sets, we duplicate the success-labeled training data by a factor of 4 and the success-labeled testing data by a factor of 10. After all augmentation, the number of failed training cases is $M^{(0)} = 294 \times 30 = 8820$, and the number of successful training cases is $M^{(1)} = 67 \times 30 \times 4 = 8040$. Thus, the total number of training cases is $M = 16860$. Similarly, for testing, there are $192 \times 30 = 5760$ failed cases and $18 \times 30 \times 10 = 5400$ successful cases.
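The windowing and class-balancing arithmetic described above can be sketched as follows. This is a minimal illustration, assuming that each recording is available as a 1300-sample array; the function name `augment` and the synthetic ramp signal are hypothetical and stand in for the real data.

```python
import numpy as np

WIN = 1000        # 1-second window at 1 kHz -> 1000 samples
STEP = 10         # 0.01-second delay -> 10 samples
N_WINDOWS = 30    # windows kept per recording


def augment(vf_segment: np.ndarray) -> np.ndarray:
    """Return N_WINDOWS overlapping 1-s windows, starting from the end of the data."""
    windows = []
    for k in range(N_WINDOWS):
        end = len(vf_segment) - k * STEP      # window k ends k*STEP samples earlier
        windows.append(vf_segment[end - WIN:end])
    return np.stack(windows)


# Stand-in for 1.3 s of VF samples (hypothetical data, a simple ramp).
segment = np.arange(1300, dtype=float)
windows = augment(segment)
print(windows.shape)   # (30, 1000)

# Class sizes after augmentation and duplication, as stated in the text.
m_fail_train = 294 * N_WINDOWS
m_succ_train = 67 * N_WINDOWS * 4
m_fail_test = 192 * N_WINDOWS
m_succ_test = 18 * N_WINDOWS * 10
print(m_fail_train, m_succ_train, m_fail_test, m_succ_test)
```

Note that duplicating success-labeled data balances the class sizes but does not add new information, so the duplicated samples are identical copies.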
Let the $m$th augmented data be $s_A(m)$. Each of the 16860 augmented data is normalized according to (1) so that all of them contain the same energy. The Fourier transform is applied after normalization, and the magnitude response in (2) is used for further processing in the proposed model. After preprocessing, we obtain 501-dimensional data covering 0 Hz, i.e., direct current (DC), to 500 Hz, which is the Nyquist frequency for the 1 kHz sampling rate. The means and distribution ranges of the successful and failed classes, from the largest to the smallest value, are shown in Fig. 6, where the upper sub-figure shows the distribution ranges at all frequencies, while the lower one is an enlarged version that shows only the frequencies below 100 Hz.
Each bar covers the magnitude values of the successful and failed cases at a specific frequency and shows the range from the largest value to the smallest one. Point markers and cross markers represent the means of the failed and successful cases, respectively, at each frequency. In general, the magnitude at 0 Hz has the largest mean value and the second largest distribution range, while the magnitude at 60 Hz, which is the frequency of the electrical supply, has the largest distribution range and the second largest mean value. Since the magnitudes at these two frequencies are much larger than the others, they may have a significant effect on performance when all frequencies are analyzed together as a combination.
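As an illustration of the preprocessing chain, the sketch below normalizes a 1-second window and takes the magnitude of its one-sided spectrum. It assumes unit-energy normalization, which is only one way to give every window the same energy; the exact scaling used in (1) may differ. The function names and the random test signal are hypothetical.

```python
import numpy as np

FS = 1000  # sampling rate in Hz (from the text)


def normalize_energy(window: np.ndarray) -> np.ndarray:
    """Scale the window so that the sum of its squared samples equals 1 (assumed form)."""
    return window / np.sqrt(np.sum(window ** 2))


def magnitude_spectrum(window: np.ndarray) -> np.ndarray:
    """Magnitude of the one-sided FFT of the energy-normalized window."""
    return np.abs(np.fft.rfft(normalize_energy(window)))


rng = np.random.default_rng(0)
window = rng.standard_normal(1000)          # stand-in for one 1-s VF window
x = magnitude_spectrum(window)
freqs = np.fft.rfftfreq(len(window), d=1.0 / FS)

print(x.shape)                 # 501 bins
print(freqs[0], freqs[-1])     # 0 Hz (DC) up to 500 Hz (Nyquist)
```

With 1000 samples at 1 kHz, the bins are spaced 1 Hz apart, so element `k` of `x` corresponds to `k` Hz.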

C. FEATURE EXTRACTION
There are two main purposes for feature extraction in this paper. One is to find features that have implicit characteristics in the original data, and the other one is to reduce dimension so that unnecessary features are discarded.
The proposed feature extraction model consists of two popular and useful methods: principal component analysis (PCA) and discriminant analysis feature extraction (DAFE). Similar extraction models that used these two methods were introduced in [19]-[24] for different applications. PCA is an unsupervised machine learning method.
Thus the process can be done without using class labels. On the other hand, DAFE takes the class labels into consideration [25]. The dimension of the PCA features lies between 1 and the dimension of the input data and is chosen according to the performance of the model in the training process, whereas the dimension of the DAFE features is one fewer than the number of classes. In the proposed model, we also truncate the dimension of the original data in the frequency domain by applying a threshold and re-forming the preserved features. This is introduced later in Sections II-C and III-B.
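At the first stage, the PCA features are obtained by a linear transformation of the preprocessed data. Without considering the class labels, we first calculate the mean vector of the data obtained in (2). Let the data vector be denoted by $\mathbf{x}$, let $\mathbf{x}(m)$ be the $m$th data vector, and let $\mathbf{V}_P$ contain the eigenvectors of the covariance matrix of $\mathbf{x}$ over all data, sorted in descending order of eigenvalue. The dimension of the features is reduced by preserving only $N_P$ features, where $N_P$ is determined according to the performance of the model. After dimension reduction, $\mathbf{V}_P$ becomes $\tilde{\mathbf{V}}_P$, an $N$-by-$N_P$ matrix that comprises the PCA coefficient vectors. The PCA features are obtained by projecting the data vector $\mathbf{x}(m)$ onto the column space of $\tilde{\mathbf{V}}_P$, i.e.,

$$\mathbf{y}_P(m) = \tilde{\mathbf{V}}_P^{T}\, \mathbf{x}(m). \quad (3)$$

The second stage of feature extraction is DAFE, a part of linear discriminant analysis (LDA). The algorithm seeks the projection axis along which data with different class labels are separated the most. Since we have two class labels, 0 for failed cases and 1 for successful cases, only one feature is preserved after DAFE. The calculation begins with finding the means of the PCA features in each class,

$$\boldsymbol{\mu}_{y_P}^{(c)} = \frac{1}{M^{(c)}} \sum_{m \in \mathcal{C}_c} \mathbf{y}_P(m), \quad c \in \{0, 1\}, \quad (4)$$

where $\mathcal{C}_c$ denotes the set of indices of the data labeled $c$. Then, the within-class scatter is calculated by

$$\mathbf{S}_w = \sum_{c=0}^{1} \sum_{m \in \mathcal{C}_c} \left(\mathbf{y}_P(m) - \boldsymbol{\mu}_{y_P}^{(c)}\right)\left(\mathbf{y}_P(m) - \boldsymbol{\mu}_{y_P}^{(c)}\right)^{T}. \quad (5)$$

The between-class scatter is calculated by

$$\mathbf{S}_b = \left(\boldsymbol{\mu}_{y_P}^{(0)} - \boldsymbol{\mu}_{y_P}^{(1)}\right)\left(\boldsymbol{\mu}_{y_P}^{(0)} - \boldsymbol{\mu}_{y_P}^{(1)}\right)^{T}. \quad (6)$$

According to the Fisher criterion, the DAFE coefficient vector, $\mathbf{v}_D$, is designed to maximize the ratio of the between-class scatter to the within-class scatter after the projection:

$$\mathbf{v}_D = \arg\max_{\mathbf{v}} \frac{\mathbf{v}^{T} \mathbf{S}_b \mathbf{v}}{\mathbf{v}^{T} \mathbf{S}_w \mathbf{v}}. \quad (7)$$

As in [19]-[24], the problem in (7) reduces to a simple eigendecomposition,

$$\mathbf{S}_w^{-1} \mathbf{S}_b\, \mathbf{v}_D = \lambda_D \mathbf{v}_D, \quad (8)$$

where $\lambda_D$ is the eigenvalue of $\mathbf{S}_w^{-1} \mathbf{S}_b$ and $\mathbf{v}_D$ is the corresponding eigenvector. In (6), $\mathbf{S}_b$ is spanned by the single vector $\boldsymbol{\mu}_{y_P}^{(0)} - \boldsymbol{\mu}_{y_P}^{(1)}$, which implies that $\mathbf{S}_w^{-1} \mathbf{S}_b$ is a rank-one matrix with one non-zero eigenvalue. Hence, after PCA and DAFE, only one feature is preserved, given by

$$y_{P+D}(m) = \mathbf{v}_D^{T}\, \mathbf{y}_P(m). \quad (9)$$

From (3) and (9), the linear transformations of PCA and DAFE can be combined into a single vector given by

$$\mathbf{v}_{P+D} = \tilde{\mathbf{V}}_P\, \mathbf{v}_D. \quad (10)$$

For convenience, we call $\mathbf{v}_{P+D}$ the "coefficient vector"; it is used to judge how the individual frequency elements affect the decision. Large coefficients imply that the corresponding frequency components are important for clinical inference and should be preserved in predicting the defibrillation results. Hence, the single feature used for the decision after PCA and DAFE is simply the inner product of the coefficient vector and the magnitude response of the ECG signal:

$$y_{P+D}(m) = \mathbf{v}_{P+D}^{T}\, \mathbf{x}(m). \quad (11)$$

The final step of feature extraction is truncation. In this step, we directly eliminate the frequencies that do not contribute much to the extracted feature in order to reduce complexity. We apply a threshold $T$ to the coefficient vector, where the value of $T$ can be determined later according to the testing accuracy. Hence, the truncated version of the output $y_{P+D}(m)$ in (11) becomes

$$\bar{y}_{P+D}(m) = \bar{\mathbf{v}}_{P+D}^{T}\, \mathbf{x}(m), \qquad [\bar{\mathbf{v}}_{P+D}]_k = \begin{cases} [\mathbf{v}_{P+D}]_k, & |[\mathbf{v}_{P+D}]_k| \ge T, \\ 0, & \text{otherwise}. \end{cases} \quad (12)$$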
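The two-stage extraction and the truncation can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the implementation used in the experiments: the function names, the unit-norm scaling of the Fisher direction, and the rule that coefficients with magnitude below `T` are zeroed are assumptions. For two classes, the rank-one between-class scatter means that the Fisher direction is proportional to the inverse within-class scatter applied to the difference of the class means, so no explicit eigendecomposition is needed at the second stage.

```python
import numpy as np


def fit_pca_dafe(X: np.ndarray, labels: np.ndarray, n_pca: int) -> np.ndarray:
    """Return the combined coefficient vector for data X of shape (M, N).

    labels holds 0 for failed and 1 for successful cases.
    """
    # Stage 1 (PCA): eigenvectors of the covariance, sorted by descending eigenvalue.
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:n_pca]
    V_p = eigvecs[:, order]                         # shape (N, n_pca)
    Y = X @ V_p                                     # PCA features, shape (M, n_pca)

    # Stage 2 (DAFE): within-class scatter and class means of the PCA features.
    mu0 = Y[labels == 0].mean(axis=0)
    mu1 = Y[labels == 1].mean(axis=0)
    S_w = np.zeros((n_pca, n_pca))
    for c, mu in ((0, mu0), (1, mu1)):
        D = Y[labels == c] - mu
        S_w += D.T @ D
    v_d = np.linalg.solve(S_w, mu0 - mu1)           # Fisher direction (two classes)
    v_d /= np.linalg.norm(v_d)                      # scaling is arbitrary (assumption)

    # Combine both linear stages into one coefficient vector of length N.
    return V_p @ v_d


def truncate(v: np.ndarray, T: float) -> np.ndarray:
    """Zero every coefficient whose magnitude is below T (assumed rule)."""
    return np.where(np.abs(v) >= T, v, 0.0)


# Synthetic demo: 20 "frequency bins", the classes differ only at bin 2.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 200)
X = rng.standard_normal((400, 20))
X[labels == 1, 2] += 3.0

v = fit_pca_dafe(X, labels, n_pca=10)
feature = X @ v                                     # single feature per sample
v_trunc = truncate(v, T=0.5)
print(int(np.argmax(np.abs(v))))                    # bin with the largest weight
```

In this toy example, the largest coefficient falls on the bin where the two classes differ, which mirrors how the trained coefficient vector is read in the text.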

D. CLASSIFICATION
The extracted single feature can be approximated by a Gaussian random variable, which will be verified via simulation later. Since there are two classes, success or fail, the decision problem becomes the classical detection problem [26]. Under such circumstances, the maximum a posteriori (MAP) classifier can be used and the corresponding theoretical analysis can be utilized as well.
A linearized MAP classifier, the linear discriminant classifier (LDC), is used for classification of the feature. LDC is derived from the posterior probability function with a normal-density likelihood function, as mentioned in [27]. The variances of the two classes are regarded as the same, and both are replaced with a pooled variance following [28], so that the process simplifies to a linear transform. The feature used for classification is a scalar, as a result of the dimension reduction performed by the whole feature extraction process.
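To compare the two posterior probabilities $P(c = 0 \mid y(m))$ and $P(c = 1 \mid y(m))$, we first transform the posterior probability of the 0-labeled class into a sigmoid function as

$$P(c = 0 \mid y(m)) = \frac{1}{1 + \exp\left(-z^{(0)}(m)\right)}, \quad (14)$$

where

$$z^{(0)}(m) = \ln\frac{\sigma_y^{(1)}}{\sigma_y^{(0)}} - \frac{\left(y(m) - \mu_y^{(0)}\right)^2}{2\left(\sigma_y^{(0)}\right)^2} + \frac{\left(y(m) - \mu_y^{(1)}\right)^2}{2\left(\sigma_y^{(1)}\right)^2} + \ln\frac{P(c = 0)}{P(c = 1)}. \quad (15)$$

The two variances are treated as the same and replaced with the pooled variance,

$$\sigma_p^2 = \frac{\left(M^{(0)} - 1\right)\left(\sigma_y^{(0)}\right)^2 + \left(M^{(1)} - 1\right)\left(\sigma_y^{(1)}\right)^2}{M^{(0)} + M^{(1)} - 2}. \quad (16)$$

From (16), the term in (15) simplifies to

$$z^{(0)}(m) = w\, y(m) + b, \quad (17)$$

which is the feature $y$ multiplied by a weight plus a bias, with

$$w = \frac{\mu_y^{(0)} - \mu_y^{(1)}}{\sigma_p^2}, \quad (18)$$

$$b = \frac{\left(\mu_y^{(1)}\right)^2 - \left(\mu_y^{(0)}\right)^2}{2\sigma_p^2} + \ln\frac{P(c = 0)}{P(c = 1)}. \quad (19)$$

The posterior probability function in (14) then becomes

$$P(c = 0 \mid y(m)) = \frac{1}{1 + \exp\left(-\left(w\, y(m) + b\right)\right)}, \quad (20)$$

whereas the posterior probability of the 1-labeled class becomes

$$P(c = 1 \mid y(m)) = \frac{1}{1 + \exp\left(-z^{(1)}(m)\right)}. \quad (21)$$

By simply switching the class labels of the parameters, the value of $z^{(1)}(m)$ is derived in the same way as (15). The predicted defibrillation result is then determined by selecting the class with the larger posterior probability. Another MAP classifier, the quadratic discriminant classifier (QDC), can be derived in a similar way to (14) and (15). However, it keeps the two variances distinct, so all the parameters in (15) are preserved. The performance of the various classifiers is compared in the simulation results later.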
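A linear classifier of this kind reduces to one multiplication, one addition, and a sigmoid per sample. The sketch below assumes Gaussian class likelihoods with a shared pooled variance, class priors estimated from the label counts, and the degrees-of-freedom weighting of the pooled variance; these choices and all function names are illustrative and may differ in detail from the paper's equations. Under these assumptions, the log posterior ratio is linear in the feature, with weight (mean of class 0 minus mean of class 1) divided by the pooled variance.

```python
import numpy as np


def fit_ldc(y: np.ndarray, labels: np.ndarray):
    """Fit weight w and bias b so that P(c=0 | y) = sigmoid(w * y + b)."""
    y0, y1 = y[labels == 0], y[labels == 1]
    mu0, mu1 = y0.mean(), y1.mean()
    # Pooled variance shared by both classes (one common convention, assumed here).
    pooled = ((len(y0) - 1) * y0.var(ddof=1) + (len(y1) - 1) * y1.var(ddof=1)) / (len(y) - 2)
    prior0, prior1 = len(y0) / len(y), len(y1) / len(y)
    w = (mu0 - mu1) / pooled
    b = (mu1 ** 2 - mu0 ** 2) / (2.0 * pooled) + np.log(prior0 / prior1)
    return w, b


def posterior_fail(y, w, b):
    """Posterior probability of the failed class (label 0)."""
    return 1.0 / (1.0 + np.exp(-(w * y + b)))


def predict(y, w, b):
    """Pick the class with the larger posterior: 0 = failed, 1 = successful."""
    return np.where(posterior_fail(y, w, b) >= 0.5, 0, 1)


# Synthetic demo: failed cases centered at -1, successful cases at +1.
rng = np.random.default_rng(1)
y0 = rng.normal(-1.0, 1.0, 500)
y1 = rng.normal(1.0, 1.0, 500)
y = np.concatenate([y0, y1])
labels = np.repeat([0, 1], 500)

w, b = fit_ldc(y, labels)
midpoint = 0.5 * (y0.mean() + y1.mean())
print(posterior_fail(midpoint, w, b))   # equal priors put the boundary at the midpoint
```

With balanced classes, the two posteriors are equal exactly halfway between the class means, which matches the decision boundary used in the theoretical model.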

III. SYSTEM PARAMETERS AND THEORETICAL MODEL
To build a learning model that performs well at both the training and testing stages, we need to analyze the parameters, including the number of preserved PCA features and the threshold value $T$ in (12), and examine how the coefficient vector at individual frequencies affects the classification. After determining the parameters, we build a theoretical model to predict the performance of the proposed system.

A. PRESERVED FEATURES
The features are extracted from original data for disclosing specific characteristics. We know that signals at 0 Hz and 60 Hz have relatively larger magnitudes than other frequencies, and would like to know the effects of other frequency components. This can be done by observing the distribution of frequency response and features.
After the first stage of feature extraction, we examine the distributions of the PCA features $\mathbf{y}_P$ in (3) for the two classes. To see the differences more clearly, a Gaussian distribution is fitted to the distribution of each feature, and the means and variances are used in Welch's t-test to indicate the differences between the feature distributions of the two classes. Welch's t-test computes a t-value to test the null hypothesis. A t-value close to 0 indicates that the two distributions being compared are almost the same, whereas a t-value farther from 0 indicates larger differences between the distributions. For comparison, we take the absolute value of the t-value, as a modified Welch's t-test, instead of the original one:

$$|t| = \frac{\left|\mu^{(0)} - \mu^{(1)}\right|}{\sqrt{\dfrac{\left(\sigma^{(0)}\right)^2}{M^{(0)}} + \dfrac{\left(\sigma^{(1)}\right)^2}{M^{(1)}}}}, \quad (22)$$

where $\mu^{(c)}$ and $\left(\sigma^{(c)}\right)^2$ are the fitted mean and variance of the feature in class $c$. The training and testing performance of PCA is shown in Fig. 8. The left y axis represents the accuracy, while the right y axis represents the area under the curve (AUC). From this figure, the accuracy and AUC both reach satisfactory performance when the number of preserved PCA features is 56. Since the PCA features are sorted in descending order of their eigenvalues, it is reasonable to set the number of preserved PCA features, $N_P$, to 56 and keep only the first 56 PCA features for further feature extraction. Table 1 shows the parameters, including the eigenvalue, the t-value, the coefficients at 0 Hz and 60 Hz, and the maximum absolute value (ABS) of the coefficients of the PCA+DAFE features. From the table, there are twelve features in total after PCA+DAFE, but only the first feature has a non-zero eigenvalue and a large t-value. Hence, we keep only the first feature after PCA+DAFE, which is also what the theory in Section II-C predicts. The coefficients in Table 1 correspond to the coefficient vector $\mathbf{v}_{P+D}$ in (10). According to (11), the final scalar result $y_{P+D}$ is the inner product of $\mathbf{v}_{P+D}$ and $\mathbf{x}$, and the elements of $\mathbf{v}_{P+D}$ correspond to the scaling weights for the signal at particular frequencies. 
For example, the first element of $\mathbf{v}_{P+D}$ is the scaling weight for the frequency at 0 Hz, the second element corresponds to 1 Hz, and so on. Hence, elements of $\mathbf{v}_{P+D}$ with large coefficients imply that the corresponding frequencies are important and contribute more to the final result. In Table 1, we see that for the first feature (the single kept feature), the coefficients at 0 Hz and 60 Hz are small relative to the maximum absolute value over all coefficients. This result implies that although the signals at 0 Hz and 60 Hz have significantly larger magnitudes than the others, they are not important in making the final decision. Hence, they are "filtered out" by the small coefficients assigned to them in $\mathbf{v}_{P+D}$ by the proposed feature extraction scheme.
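The absolute Welch t-value used to rank the features is straightforward to compute. The following sketch assumes unbiased sample variances as the per-class variance estimates; the function name and the small arrays are hypothetical.

```python
import numpy as np


def abs_welch_t(a: np.ndarray, b: np.ndarray) -> float:
    """Absolute Welch t-value between two samples with possibly unequal variances."""
    var_a = a.var(ddof=1)   # unbiased sample variance (assumed estimator)
    var_b = b.var(ddof=1)
    return float(abs(a.mean() - b.mean()) / np.sqrt(var_a / len(a) + var_b / len(b)))


# Hypothetical feature values for the two classes.
failed = np.array([1.0, 2.0, 3.0])
successful = np.array([2.0, 3.0, 4.0, 5.0])
t_abs = abs_welch_t(failed, successful)
print(round(t_abs, 4))
```

A value near zero means the two class distributions of a feature are nearly indistinguishable, while larger values indicate a feature that separates the classes better.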

B. THRESHOLD FOR TRUNCATION
As mentioned in the previous section, we determine the truncation threshold $T$ in (12) to eliminate insignificant information. This is done by setting the insignificant elements of $\mathbf{v}_{P+D}$ to zero; these elements correspond to frequency components that are unimportant in making the decision, such as the 0 Hz and 60 Hz components as well as the baseline wander at 1 Hz [15], [16]. Fig. 9 shows the accuracy and AUC as functions of the threshold $T$. From the figure, we choose $T = 0.032$ in the following experiments because this threshold not only keeps the performance almost the same as that without truncation but also filters out the unimportant frequencies of DC, baseline wander, and the electrical supply. With this threshold, the removed coefficients include those at 0 Hz, 60 Hz and 1 Hz, which means that all coefficients within the dash-dot lines (green lines) in Fig. 10 are set to zero.
From the experiment, a significant performance degradation occurs when $T$ is set to 0.189, which also filters out the signal at 2 Hz. This significant effect of the 2 Hz component on performance suggests that it is worth investigating whether the 2 Hz component is a key to distinguishing successful and failed cases.
To see this more clearly, we conduct an experiment in which only one frequency component is removed at a time and observe how the performance is affected. The result is shown in Fig. 11. From this figure, we see that removing the 2 Hz component indeed leads to a much more serious performance degradation than removing any other component. Therefore, the signal at 2 Hz plays a crucial role in making the correct decision in the proposed system.
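A one-frequency-at-a-time ablation of this kind can be sketched as below. The sketch assumes that the coefficient vector and the decision boundary are kept fixed while a single coefficient is zeroed, i.e., no retraining after removal; whether the experiment retrained the model is not stated, so this is only one possible reading. All names and the synthetic data are hypothetical.

```python
import numpy as np


def accuracy(features: np.ndarray, labels: np.ndarray, boundary: float) -> float:
    """Features below the boundary are predicted failed (0), otherwise successful (1)."""
    predicted = (features >= boundary).astype(int)
    return float(np.mean(predicted == labels))


def ablate_each_frequency(X, labels, v, boundary):
    """Accuracy obtained when one coefficient of v at a time is set to zero."""
    scores = np.empty(len(v))
    for k in range(len(v)):
        v_k = v.copy()
        v_k[k] = 0.0                       # remove frequency bin k only
        scores[k] = accuracy(X @ v_k, labels, boundary)
    return scores


# Synthetic demo: 5 bins, the classes differ only at bin 2.
rng = np.random.default_rng(2)
labels = np.repeat([0, 1], 100)
X = rng.standard_normal((200, 5))
X[labels == 1, 2] += 4.0
v = np.array([0.05, 0.05, 1.0, 0.05, 0.05])   # hypothetical trained coefficients

scores = ablate_each_frequency(X, labels, v, boundary=2.0)
print(int(np.argmin(scores)))                  # the bin whose removal hurts the most
```

The bin whose removal causes the largest drop in accuracy is the one carrying most of the discriminative information, which is the role that the 2 Hz component plays in the text.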

C. THEORETICAL MODEL OF THE PROPOSED SYSTEM
We now build a theoretical model for the proposed system. From the discussion in the previous section, the single extracted feature $y_{P+D}$ in (11) is used to judge success or failure. The distributions of $y_{P+D}$ for the successful and failed cases at the training stage, with and without truncation, are shown in Fig. 12. We see from the figure that $y_{P+D}$ is well approximated by a Gaussian random variable for both successful and failed cases. Hence, it is reasonable to model $y_{P+D}$ as Gaussian random variables with different means and variances for the successful and failed cases. The detection problem then becomes a classical hypothesis testing problem; see, e.g., [26]. Moreover, the results with and without truncation have almost the same distribution shapes except for a shift of the mean values. Hence, the proposed truncation keeps almost the same detection performance while reducing the computational complexity and eliminating unnecessary information in advance.
Similarly, Fig. 13 shows the distribution of $y_{P+D}$ at the testing stage for both successful and failed cases, with and without truncation. Similar results, including the Gaussian approximation and the small effect of truncation on performance, are also seen at the testing stage. Next, we propose a theoretical performance model for the proposed system. Thanks to the data augmentation in the preprocessing stage, the amount of data labeled as failed is close to that labeled as successful, so the prior probabilities of the two classes can be eliminated. In addition, from Figs. 12 and 13, the variances for the successful and failed cases can be approximated as the same. Thus, the posterior probability function can be simplified to the likelihood function with a Gaussian distribution, and the decision boundary can be derived from the likelihood comparison in (23). Here, $y$ is an arbitrary value of the extracted feature. The null hypothesis, $H_0$, implies that the distributions on the two sides are the same, while the alternative hypothesis, $H_1$, implies that the distributions on the two sides are different. For further simplification, we take the logarithm of both sides and remove the common terms, as in (24). The decision boundary lies at the midpoint of $\mu_y^{(0)}$ and $\mu_y^{(1)}$, given by

$$y = \frac{\mu_y^{(0)} + \mu_y^{(1)}}{2}. \quad (25)$$

Extracted features with values smaller than the decision boundary are considered failed cases, while the others are considered successful cases. Various performance metrics are also derived and expressed in terms of the commonly used Q function, defined by

$$Q(z) = \frac{1}{\sqrt{2\pi}} \int_{z}^{\infty} e^{-u^2/2}\, du, \quad (26)$$

where the parameters $z^{(0)} = (y - \mu_y^{(0)})/K_y$ and $z^{(1)} = (y - \mu_y^{(1)})/K_y$ respectively represent the standard-deviation normalization of the feature in each class.
Various theoretical performance metrics are listed in terms of the Q function in Table 2. One representative metric, the accuracy, can be determined once the sensitivity and specificity are known for a specific decision boundary, as in [29]. In addition, the theoretical receiver operating characteristic (ROC) curve can be obtained by sweeping the decision boundary to create a fine-resolution relationship between sensitivity and specificity, with 1 − specificity on the horizontal axis and sensitivity on the vertical axis. Moreover, the area under the curve (AUC) can be obtained by integrating the ROC curve, and it is expressed by the formula shown in Table 2.
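To make the role of the Q function concrete, the sketch below evaluates the standard textbook expressions for two Gaussian classes with a common standard deviation. It does not reproduce Table 2; the choice of "successful" as the positive class, the common standard deviation `sigma`, and the function names are assumptions made for illustration.

```python
import math


def q_function(z: float) -> float:
    """Gaussian tail probability Q(z) = P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def theoretical_metrics(mu0, mu1, sigma, boundary, prior1=0.5):
    """Metrics for class 0 ~ N(mu0, sigma^2) (failed) and class 1 ~ N(mu1, sigma^2).

    Assumes mu0 < mu1 and that features at or above the boundary are predicted successful.
    """
    prior0 = 1.0 - prior1
    sensitivity = q_function((boundary - mu1) / sigma)          # P(y >= boundary | c=1)
    specificity = 1.0 - q_function((boundary - mu0) / sigma)    # P(y < boundary | c=0)
    accuracy = prior1 * sensitivity + prior0 * specificity
    true_pos = prior1 * sensitivity
    false_pos = prior0 * (1.0 - specificity)
    precision = true_pos / (true_pos + false_pos)
    # For equal variances, AUC = P(y1 > y0) with y1 - y0 ~ N(mu1 - mu0, 2 sigma^2).
    auc = 1.0 - q_function((mu1 - mu0) / (math.sqrt(2.0) * sigma))
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "auc": auc,
    }


# Hypothetical parameters: class means two standard deviations apart, midpoint boundary.
metrics = theoretical_metrics(mu0=0.0, mu1=2.0, sigma=1.0, boundary=1.0)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Sweeping `boundary` over a range of values and recording the pairs (1 − specificity, sensitivity) traces out the theoretical ROC curve under the same assumptions.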

IV. SIMULATION RESULT
In this section, we compare the performance of the proposed learning models with that of other models. The proposed theoretical results are also compared with the simulation results to show the accuracy of the analysis. In the simulation, PCA preserves 56 features, DAFE preserves one feature, and the truncation threshold, $T$, is set to 0.032. LDC and QDC are as described in Section II-D.

Experiment 1 (Theoretical vs. Simulation Results of Proposed Models):
In this experiment, we show the simulation results of the proposed learning models together with their theoretical results. Table 3 shows the details of the four learning and theoretical models to be demonstrated. For the theoretical models, the closed-form formulas in Table 2 are used.
We also applied the results in Section III-C and calculated the corresponding suggested decision boundaries, which are listed in Table 4.
The training results of the four models are shown in Table 5, and the testing results are shown in Table 6. From Tables 5 and 6, we have the following observations. First, the theoretical results in general match the simulation results quite well. For example, for the training performance in Table 5, the AUCs of the proposed model and the theoretical model are 0.91 and 0.90, respectively; when truncation is considered, the AUCs of the proposed model and the theoretical model are both 0.91. For the testing performance in Table 6, the AUCs of the proposed model and the theoretical model are both 0.88; when truncation is considered, they are 0.87 and 0.86, respectively. These results show the accuracy of the proposed closed-form theoretical results for the proposed systems.
Second, truncation does not degrade performance, and sometimes it even slightly improves it. For example, the training accuracy improves from 82.9% without truncation to 83.4% with truncation, and the testing accuracy improves from 79.8% to 80.1%. This is not surprising, because learning models sometimes suffer from overfitting. With the proposed truncation, insignificant information (frequency components) is filtered out in advance and only useful information is kept, which helps to avoid overfitting. As a result, the proposed truncation not only reduces the computational complexity but also helps to avoid overfitting.
We also show the ROC curves of these four models in Figs. 14 and 15. The operating points, which are obtained using the suggested decision boundaries, are included in the figures as well. These operating points can be obtained directly from Tables 5 and 6, because the true positive rate is the same as the sensitivity, and the false positive rate equals 1 - specificity. From Figs. 14 and 15, we again observe that the theoretical results match the simulation results well, especially at the training stage. For example, at the training stage in Fig. 14, the ROC gap between the proposed and the theoretical model is within 1%; at the testing stage in Fig. 15, the gap is within 2% near the operating point. Although there are minor performance differences at the testing stage, the trends of the theoretical and simulation results in general agree with each other in the potential operating regions. This again shows the accuracy and usefulness of the analytical results for the proposed systems.
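As a minimal illustration of how an operating point follows from the tabulated metrics, the Python sketch below computes the sensitivity, specificity, accuracy and precision from confusion counts and maps them to a point on the ROC plane. The counts, the class and function names, and the convention that a successful defibrillation is the positive class are all assumptions made for illustration; they are not taken from Tables 5 and 6.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Hypothetical counts at one decision boundary (positive = successful, assumed)."""
    tp: int  # successful cases predicted as successful
    fn: int  # successful cases predicted as failed
    tn: int  # failed cases predicted as failed
    fp: int  # failed cases predicted as successful


def metrics(c: ConfusionCounts) -> dict:
    """Standard performance indices computed from the confusion counts."""
    total = c.tp + c.fn + c.tn + c.fp
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "accuracy": (c.tp + c.tn) / total,
        "precision": c.tp / (c.tp + c.fp),
    }


def operating_point(sensitivity: float, specificity: float) -> tuple:
    """Return (false positive rate, true positive rate) on the ROC plane."""
    return (1.0 - specificity, sensitivity)


# Illustrative counts only; they do not come from the experiments.
counts = ConfusionCounts(tp=40, fn=10, tn=30, fp=20)
m = metrics(counts)
fpr, tpr = operating_point(m["sensitivity"], m["specificity"])
```

With these illustrative counts, the operating point lies at a false positive rate of 0.4 and a true positive rate of 0.8.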

Experiment 2 (Comparison with Other Learning Models and Traditional Methods):
In this experiment, we compare the performance of the proposed scheme with that of other learning models. Recall that the proposed scheme contains two stages. The first stage is feature extraction, which consists of PCA, DAFE and truncation. The second stage is classification, in which LDA is used to determine whether the input data corresponds to a successful or a failed case.
The learning models to be compared are modified from the proposed model. Some of them use the same feature extraction as the proposed one but apply different classifiers such as SVM, LDC or QDC. Some of the modified models simply use DAFE at the feature extraction stage and use SVM, LDC or QDC as classifiers, which gives three combinations. The parameters of those models are manually selected so that their best performance is achieved. In addition, we also include two widely used methods, namely AMSA [2] and DFA [14], in the comparison.
The comparison of training performance is shown in Table 7, and that of testing performance is shown in Table 8. We have the following observations from these two tables. First, some of the models achieve much better performance at the training stage. For instance, from Table 7, PCA without truncation + SVM achieves 100% accuracy and the highest scores in all performance metrics. However, its testing performance in Table 8 is quite poor. This is a classical over-fitting problem. Hence, even if some models have good training performance, they cannot be used in practical applications when their testing performance is poor.
Second, from Table 8, the proposed scheme has the best testing performance among the models. Although one model, i.e., PCA+DAFE with truncation + SVM or QDC, performs almost the same as the proposed model, using SVM as the classifier leads to higher computational complexity; at the same time, further effort is needed to build a theoretical model for PCA+DAFE+SVM. Therefore, the proposed model has advantages in computational complexity as well as in the interpretable theoretical results that explain it. Third, using PCA or DAFE alone for feature extraction in general leads to poor performance. One exception is PCA without truncation + LDC, which has an accuracy of 79.8% and an AUC of 0.87 in Table 8. Nevertheless, this scheme uses PCA alone and needs to preserve 56 features to achieve this performance. Also, the corresponding analytical result is not as simple as that of the proposed scheme. On the other hand, the proposed scheme keeps only one feature due to the use of DAFE after PCA. As a result, the proposed scheme has an advantage in terms of the number of preserved features.
Finally, Tables 7 and 8 also include the traditional AMSA [2] and DFA [14], which are non-learning-based approaches. One can see that the AUC values of these two schemes are around 0.53 at the training stage and around 0.56 at the testing stage, which are relatively poor compared to the learning-based methods. The two tables also show that the other performance indices of these two schemes are inferior to those of the learning-based approaches.
To see the testing performance more clearly, we show the ROC curves of these models at the testing stage in Fig. 16. We also indicate the operating point in this figure.
To have a higher AUC, the ROC curve should be closer to the top-left corner. Observe that the proposed scheme performs the best at the operating point, which can also be seen from Table 8. Although some schemes may slightly outperform the proposed system in other operating regions, the proposed scheme, as mentioned earlier, has advantages in computational complexity, concise theoretical results and the number of preserved features. Consequently, the proposed solution is preferable in practical situations.
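To make the relation between the ROC curve and the AUC concrete, the following Python sketch builds ROC points by sweeping a decision threshold over classifier scores and integrates them with the trapezoidal rule. It is only an illustration under stated assumptions: the scores are made up, the function names are hypothetical, and this is not claimed to be the procedure used to produce Fig. 16 or the AUC values in Tables 7 and 8.

```python
def roc_points(pos_scores, neg_scores):
    """ROC points (FPR, TPR) obtained by sweeping the decision threshold.

    A sample is declared positive when its score is >= the threshold.
    """
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Illustrative scores only (higher score = more likely positive).
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3, 0.2]
auc = auc_trapezoid(roc_points(pos, neg))
```

A curve that hugs the top-left corner passes through points with low false positive rate and high true positive rate, so the accumulated area approaches 1.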
Example of Computational Complexity: The computational complexity can be significantly reduced by the proposed scheme, i.e., PCA+DAFE with truncation (feature extraction) and LDC (classification). The reduction comes from two parts: (a) feature extraction with truncation, and (b) classification. They are discussed separately as follows.
Feature extraction: In (a), the proposed PCA+DAFE reduces the number of features from a high dimension to only one, which significantly reduces the subsequent computations in classification. This procedure is a simple vector inner product in (11), and the result is y_{P+D}(m). With truncation, the dimension of the vector inner product in (11) is further reduced to that in (13), and the result is y(m). The dimension of v_{P+D} in (11) is 501; setting the truncation threshold to 0.032, as determined by the experiments in Section III-B, reduces the dimension from 501 to 82 in (13). Hence, obtaining y(m) on the fly needs 82 multiplications.
Let us now consider the complexity of another feature extraction scheme, e.g., pure PCA. The experiments in Section III-A showed that the best number of features after PCA is 56. That is, there are 56 features, and the corresponding computations involve the matrix multiplication in (3). Hence, PCA without truncation needs 56×501 = 28056 multiplications. Therefore, its feature extraction complexity is 28056/82 ≈ 342 times that of the proposed scheme.
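The multiplication counts above can be checked with a short Python sketch. The truncation rule used here, which keeps a coefficient when its magnitude is at least the threshold, is an assumption made for illustration, as are the function names and the toy vectors; the exact rule is the one given with (13).

```python
def truncate(weights, threshold):
    """Indices of combining coefficients kept after truncation.

    Assumed rule: keep a coefficient when |w| >= threshold.
    """
    return [k for k, w in enumerate(weights) if abs(w) >= threshold]


def extract_feature(spectrum, weights, kept):
    """Single feature as an inner product over the kept indices only.

    This costs len(kept) multiplications per input segment.
    """
    return sum(weights[k] * spectrum[k] for k in kept)


def pca_multiplications(n_features, n_bins):
    """Multiplications for projecting n_bins inputs onto n_features components."""
    return n_features * n_bins


# Toy example (values are illustrative, not trained coefficients).
weights = [0.5, 0.01, -0.2, 0.03]
spectrum = [1.0, 2.0, 3.0, 4.0]
kept = truncate(weights, threshold=0.032)
y = extract_feature(spectrum, weights, kept)

# Counts quoted in the text: 56 PCA features over 501 inputs versus
# 82 multiplications for the truncated single-feature inner product.
pca_cost = pca_multiplications(56, 501)
ratio = pca_cost / 82
```

In the toy example only two of the four coefficients survive the threshold, so the feature costs two multiplications instead of four; with the figures quoted in the text, `ratio` evaluates to roughly 342.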
Classification: In (b), the classification is the one introduced in Section II-D. Obtaining z^(0)(m) in (17) needs one multiplication. Then, to obtain P(c = 0|y(m)) in (14), we need one extra exponential operation and one division. Hence, obtaining both P(c = 0|y(m)) and P(c = 1|y(m)) needs two multiplications, two exponential operations and two divisions in total. Let us now consider the complexity of another classification scheme, e.g., the SVM. Even if we exclude its iteration complexity at the training stage, the SVM still needs to perform a vector inner product at the testing stage. If the PCA in feature extraction preserves 56 features, the SVM needs at least 56 multiplications at the testing stage, which is still much larger than the number required by the proposed scheme. From the discussion above, the proposed scheme can indeed significantly reduce the computational complexity.
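The operation count for classification can be illustrated with the Python sketch below. It assumes a standard one-dimensional linear discriminant form, in which each class score is linear in the feature and the posterior is the normalized exponential of the scores, with Gaussian class-conditional densities sharing one variance. These forms, the parameter values and the function names are assumptions for illustration; the exact expressions are those in (14) and (17).

```python
import math


def lda_params(mean, variance, prior):
    """Slope and offset of a linear class score z = a * y + b.

    Assumes Gaussian class-conditional densities with a shared variance.
    """
    a = mean / variance
    b = -mean * mean / (2.0 * variance) + math.log(prior)
    return a, b


def posteriors(y, params0, params1):
    """Posterior probabilities of classes 0 and 1 given the single feature y.

    Online cost: two multiplications, two exponentials and two divisions.
    For large scores, subtracting max(z0, z1) first would be numerically safer.
    """
    z0 = params0[0] * y + params0[1]  # one multiplication
    z1 = params1[0] * y + params1[1]  # one multiplication
    e0 = math.exp(z0)                 # one exponential
    e1 = math.exp(z1)                 # one exponential
    total = e0 + e1
    return e0 / total, e1 / total     # two divisions


# Illustrative parameters only; they are not the trained values.
p_class0 = lda_params(mean=-1.0, variance=1.0, prior=0.5)
p_class1 = lda_params(mean=1.0, variance=1.0, prior=0.5)
p0, p1 = posteriors(2.0, p_class0, p_class1)
```

Because the slopes and offsets are computed once after training, only the scalar operations inside `posteriors` are needed for each new ECG segment.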

V. CONCLUSION AND FUTURE WORKS
We have proposed a learning model that extracts only one feature to predict the outcome of electrical defibrillation. The statistics of this feature have been analyzed, and several performance indices have been derived in closed form to gain more insight into the obtained results. We have found that the extracted single feature can be regarded as a linear combination of the individual frequency components of the input ECG signal. The trained combining weights provide informative clinical inference: frequencies corresponding to large weights have a significant impact on judging the outcome of defibrillation, and vice versa. We have observed that the trained model has large weights near 2 Hz and small weights at 0 Hz and 60 Hz. This result shows that the 2 Hz frequency component is important in distinguishing the outcome of defibrillation in the rat experiment, whereas the 0 Hz and 60 Hz components are not and have thus been "filtered out" via learning. Simulation results have confirmed the accuracy of the analytical results and have shown that the proposed system outperforms several competitive schemes in terms of testing accuracy and computational complexity. For future work, the proposed schemes and algorithms can be verified using human data. In addition, other algorithms for feature extraction and classification can be investigated to enhance performance and to build corresponding theoretical models for predicting ventricular defibrillation results.