Computer Methods and Programs in Biomedicine

Background: Cardiomyocytes differentiated from human induced pluripotent stem cells (iPSC-CMs) can be used to study genetic cardiac diseases. In patients these diseases are manifested, e.g., as impaired contractility and fatal cardiac arrhythmias, both of which can be due to abnormal calcium transients in cardiomyocytes. Here we classify different genetic cardiac diseases using Ca2+ transient data and different machine learning algorithms. Methods: By studying the calcium cycling of disease-specific iPSC-CMs and by using calcium transients measured from these cells, it is possible to classify diseases from each other, and also from healthy controls, by applying machine learning computation to peak attributes detected from the calcium transient signals. Results: In the current research we extend our previous study, which used Ca-transient data from four different genetic diseases, by adding data from two additional diseases (dilated cardiomyopathy and long QT syndrome 2). We also study, in the light of the current data, possible differences and relations when machine learning models and classification accuracies are computed using either the leave-one-out test or 10-fold cross-validation. Conclusions: Despite classification tasks more complex than in our earlier research, with more genetic cardiac diseases in the analysis, it is still possible to attain good disease classification results. As expected, the leave-one-out test and 10-fold cross-validation achieved virtually equal results. © 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license ( http://creativecommons.org/licenses/by/4.0/ )


Introduction
Since their discovery, induced pluripotent stem cells (iPSCs) have been widely utilized for scientific research purposes, and they hold great promise for use in biomedical research and development [1]. Patient-specific iPSC-derived cardiomyocytes (iPSC-CMs) offer an attractive experimental platform to model cardiac functionality and diseases.
Calcium cycling has an important role in the excitation-contraction coupling of cardiomyocytes, since it is the central regulator of cardiac contraction and relaxation. Cardiac diseases often cause variability and distortions in the calcium cycling of cardiomyocytes and affect their functionality. Such abnormalities in calcium transients can represent a patient's cardiac phenotype [2]. Detection and characterization of these distortions are important especially because they can be used to develop the recognition and diagnostics of cardiac diseases. Thus far, machine learning methods have rarely been used for data associated with iPSC-derived cardiomyocytes. Machine learning has been applied at least to analyze the electrophysiological effects of chronotropic drugs [3] and the mechanistic action of drugs in cardiology [4].
We started our preliminary research on calcium transients measured from iPSC-CMs generated from genetic cardiac disease patients by recognizing calcium transient peaks from calcium signals and classifying regularly, i.e., normally cycling calcium peaks from abnormally cycling ones by means of signal analysis and machine learning algorithms [5]. After this we discovered that it is possible to separate three different genetic cardiac diseases from each other and from controls (wild type, WT) [6]. These diseases were long QT syndrome 1 (LQT1), an electric disorder of the heart that predisposes patients to arrhythmias and sudden cardiac death [7]; hypertrophic cardiomyopathy (HCM), a disorder that affects the structure of heart muscle tissue leading to arrhythmias and progressive heart failure [8], here with a myosin-binding protein C gene mutation (HCMM); and catecholaminergic polymorphic ventricular tachycardia (CPVT), an exercise-induced malignant arrhythmogenic disorder [9]. We also found that in order to separate these diseases from each other and from healthy controls, it was not necessary to distinguish entirely normally cycling calcium signals from abnormally cycling ones. Next, we extended our existing signal data and added another HCM disease mutation [10], an α-tropomyosin mutation (HCMT). Our newest research showed that it is possible to study the effects of a drug on calcium transients and to classify and separate drug effects with machine learning methods [11].
In this research we extend our data set by adding transient signals of controls and cardiac diseases, with two additional genetic cardiac diseases: dilated cardiomyopathy (DCM), a disease of the heart muscle, and long QT syndrome 2 (LQT2), a disease with electrical problems in cardiomyocytes. Thus, there are seven different classes: two mutations for HCM, two types of LQTS (LQT1 and LQT2), one mutation for DCM, several mutations for CPVT, and healthy cardiomyocytes from control individuals, which makes the computational modelling more complex than in our previous research [5, 6, 10, 11]. We also compare and consider the use of leave-one-out and 10-fold cross-validation and their possible effects on classification accuracy while building models with several different machine learning methods.

Peak attributes computed from calcium transient signals
The categorization of iPSC-CM calcium transients into normal and abnormal signals was determined by an expert biotechnologist. The share of abnormally cycling signals was 69% in LQT1 transient signals, 37% in HCMM, 53% in CPVT, 44% in HCMT, 72% in LQT2, 62% in DCM and 14% in WT. Abnormality was defined according to remarkably irregular amplitude or duration of calcium peaks, whereas normally cycling signals had peaks of regular amplitude and duration. However, as said above and shown by our earlier research [6, 7], this property, i.e., whether transient signals were normal or abnormal, did not affect how well they could be classified into different classes. Therefore, we used them as such, without processing abnormal and normal transient signals separately. Fig. 1 exemplifies a normal and an abnormal LQT2 signal. Correspondingly, Fig. 1 also shows those of DCM signals. Examples of calcium transient signals of the other diseases and controls can be found in the figures of our two earlier articles [6, 10]. Often the forms and sizes of peaks may vary noticeably, especially in abnormal transient signals.
After recognizing all acceptable peaks in a signal, the attribute values of those peaks were computed. In our first research [6] on the machine learning classification of three genetic cardiac diseases and controls, we applied the first ten attributes given in the following. Later, for our most recent research [11] and the current research, we designed four additional attributes to obtain more information about the peaks of calcium transient signals. Thus, we applied 14 peak attributes, illustrated in Fig. 2. Left amplitude is equal to the difference between the peak maximum and the amplitude value at the peak beginning. Right amplitude was computed from the peak maximum and end. Left duration is equal to the time difference from the peak beginning to the maximum, and right duration is the time difference from the peak maximum to the peak end. Next, the maximum of the approximated first derivative values from the peak beginning to the peak maximum (the left peak side) was computed. Then the minimum of the first derivative values from the peak maximum to the peak end was evaluated; to use this as a positive value, its absolute value was taken. The maximum of the second derivative as well as its absolute minimum were computed from the right peak side only, since the left side of rather small peaks could sometimes be so low, containing only a few samples, that approximating the maximum and absolute minimum of the second derivative values would not succeed well. The surface area of a peak was computed as the sum of the amplitude differences between the peak curve values and the values of a line drawn from the peak beginning to its end. The peak-to-peak interval was formed as the time difference between the maximum of the current peak and that of the preceding peak; in the case of the first peak in a signal, it was estimated from the signal beginning. Next, a time difference was computed from the location of the first derivative maximum of the peak left side to the peak beginning.
A time difference was also computed from the location of the absolute minimum of the peak right side to the peak maximum. Next, an attribute called mean peak duration was computed in the following way. First, along the peak left and right amplitudes, their half amplitudes were approximated; second, the mean of those two half amplitudes was calculated; and then two intersection locations were computed for the horizontal line of this mean and the peak curve. If, rarely, a peak was very asymmetric with respect to its left and right side amplitudes, so that either the left side amplitude was smaller than half of the right side amplitude or vice versa, the half of the smaller side amplitude was not used, but its whole amplitude; in other words, the mean of the smaller side amplitude and the half of the greater side amplitude was computed. The time difference between those two intersection locations is the attribute value called mean peak duration. Ultimately, the peak curve length was approximated by running through the signal value by value from the sample location following the peak beginning to the peak end, estimating the Euclidean distance from the current location to the preceding one and summing up all these distances. The minimum, average and maximum lengths of all 1173 calcium transient signals were 7.7 s, 22.7 s and 46.5 s, respectively. The minimum, average and maximum numbers of peaks per signal were 1, 15.7 and 123.

Fig. 2 (caption fragment): ... (4) right duration D2, approximate location L1 of the computation of (5) first derivative max(s') for a calcium transient signal s, approximate location L2 for (6) absolute first derivative |min(s')|, location L3 for (7) second derivative |min(s'')| and location L4 for (8) second derivative max(s''), (9) surface area R, (10) time interval t from the maximum of the preceding peak, (11) duration d1 from the peak beginning to L1, (12) duration d2 from the peak maximum to L2, (13) mean peak duration m, and (14) approximated peak curve length l.
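The geometric attribute definitions above can be sketched in code. The following Python function (a sketch; the study's own implementation was in MATLAB) computes a subset of the 14 attributes for one peak, assuming the peak beginning, maximum and end indices come from a separate, hypothetical peak-recognition step:

```python
import numpy as np

def peak_attributes(t, s, i_beg, i_max, i_end):
    """Compute a subset of the 14 peak attributes for one detected peak.
    t: sample times (s), s: signal values; i_beg/i_max/i_end are indices of
    the peak beginning, maximum and end (assumed given by peak recognition)."""
    a1 = s[i_max] - s[i_beg]                       # (1) left amplitude
    a2 = s[i_max] - s[i_end]                       # (2) right amplitude
    D1 = t[i_max] - t[i_beg]                       # (3) left duration
    D2 = t[i_end] - t[i_max]                       # (4) right duration
    seg_t, seg_s = t[i_beg:i_end + 1], s[i_beg:i_end + 1]
    # (9) surface area: sum of differences between the peak curve and the
    # line drawn from the peak beginning to the peak end
    base = np.interp(seg_t, [t[i_beg], t[i_end]], [s[i_beg], s[i_end]])
    R = np.sum(seg_s - base)
    # (14) curve length: summed Euclidean distances between adjacent samples
    l = np.sum(np.hypot(np.diff(seg_t), np.diff(seg_s)))
    ds = np.gradient(seg_s, seg_t)                 # approximated first derivative
    max_d1 = ds[:i_max - i_beg + 1].max()          # (5) left-side max(s')
    abs_min_d1 = abs(ds[i_max - i_beg:].min())     # (6) right-side |min(s')|
    return dict(a1=a1, a2=a2, D1=D1, D2=D2, R=R, l=l,
                max_d1=max_d1, abs_min_d1=abs_min_d1)
```

For example, for an ideal triangular peak of height 1 and width 2 s, the function gives left and right amplitudes of 1 and a curve length of 2√2.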
Altogether, 18387 peaks were recognized in the six disease classes and controls; see Table 1. After the recognition of the peaks in all signals, the values of the above-mentioned 14 attributes were computed. In order to study the importance of the current attributes for disease classification, we computed importance values with the ReliefF algorithm (in MATLAB, as was all our data analysis and classification), as depicted in Fig. 3.
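The idea behind this family of algorithms can be illustrated with a minimal sketch of the original two-class Relief (a simplified ancestor of the ReliefF used in the study, which additionally handles multiple classes and k nearest neighbors): an attribute is rewarded when it separates an instance from its nearest neighbor of another class and penalized when it differs from the nearest neighbor of the same class.

```python
import numpy as np

def relief(X, y, n_iter=100, rng=None):
    """Minimal two-class Relief sketch: returns one weight per attribute."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # scale attributes to [0, 1] so differences are comparable
    span = X.max(0) - X.min(0)
    span[span == 0] = 1.0
    Xn = (X - X.min(0)) / span
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        dists = np.abs(Xn - Xn[i]).sum(axis=1)
        dists[i] = np.inf                          # exclude the instance itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dists, np.inf))    # nearest same-class
        miss = np.argmin(np.where(~same, dists, np.inf))  # nearest other-class
        w += np.abs(Xn[i] - Xn[miss]) - np.abs(Xn[i] - Xn[hit])
    return w / n_iter
```

Attributes that discriminate between the classes receive larger weights than pure noise attributes.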
In Table 2 the means and standard deviations of the 14 attributes are presented. The means of the different diseases mostly differ from each other. The attribute peak-to-peak interval t in Table 2 also reflects the different sampling frequencies used. Naturally, the use of different sampling frequencies is not an ideal situation, but it was unavoidable, since collecting the data took time and the software used was developed over time.

Table 2. Means and standard deviations for 20 cell lines: left and right side amplitudes a1 and a2, left and right side durations D1 and D2, left side maximum of first derivative max(s'), right side absolute minimum |min(s')|, right side maximum max(s'') and absolute minimum |min(s'')| of second derivative, peak surface area R, peak-to-peak time interval t, duration d1 from the peak beginning to the maximum of the left side first derivative, duration d2 from the peak maximum to the first derivative absolute minimum, mean peak duration m and peak curve length l. The two lowermost cell lines are from disease DCM.

The difference between the longest and shortest sampling intervals, 1/8 Hz − 1/38 Hz, is approximately 0.099 s. Thinking that this difference would be distributed over measured values, e.g., the beginning of a peak in a signal, the difference given by these two frequencies would be from 0 to 0.099 s. When this type of random measuring value follows a uniform distribution (any value between 0 and 0.099 s equally probable), its mean is 0.099 s/2. When the same holds for measuring the peak maximum, we would obtain a mean of 2 · 0.099 s/2 = 0.099 s while measuring the beginning and end for the duration D1 of the left side of a peak. D1 was one of six attributes directly depending on time. For the other five duration attributes D2, t, d1, d2 and m, the differences of their means in Table 2 are, however, mostly greater than those of D1. Therefore, in principle, D1 would be more critical than the others.
To be more exact, the distribution of the different sampling frequencies has to be taken into account: 90 signals at 8 Hz, 233 signals at 13 Hz and 54 signals at 14 Hz (these two close to each other), 601 signals at 23 Hz, 138 signals at 33 Hz and 67 signals at 38 Hz. Thus, only 90/1173 (total number of signals) or 7.7% and 67/1173 or 5.7% originated from the lowest and highest sampling frequencies. This means that the difference of 0.099 s could only occur in a small minority of all possible classification computations. More than half, 51.2%, of all signals were sampled at 23 Hz, namely the WT, HCMM and HCMT data. The approximate frequency of 23 Hz is around 15 Hz greater than 8 Hz and 15 Hz less than 38 Hz, i.e., approximately in the middle of the minimum 8 Hz and the maximum 38 Hz, giving approximate symmetry in the frequency distribution around the most used frequency of 23 Hz. The mean random time difference between 23 Hz and 38 Hz would be around 0.017 s according to 1/23 Hz − 1/38 Hz. All this means that any possible random difference caused by the different sampling frequencies used would be essentially smaller than the maximum of 0.099 s for the great majority (around 87%) of signals and, thus, their classifications. Since 51.2% originated from "the middle" frequency of 23 Hz, there is no difference between sampling frequencies inside this largest part of the signals. To conclude, any actual average random effect on the time attributes of all signals is much smaller than that largest value of 0.099 s.
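The sampling-interval arithmetic above can be checked with a few lines (a sketch; the frequency values are those given in the text):

```python
# Numerical check of the sampling-interval reasoning in the text.
f_min, f_mid, f_max = 8.0, 23.0, 38.0     # Hz, extreme and most used frequencies
max_diff = 1 / f_min - 1 / f_max          # largest per-sample offset, ~0.099 s
mean_onset = max_diff / 2                 # mean of a uniform offset in [0, max_diff]
mean_D1 = 2 * mean_onset                  # beginning and end both measured: ~0.099 s
mid_diff = 1 / f_mid - 1 / f_max          # 23 Hz vs 38 Hz, ~0.017 s
```

This reproduces the values 0.099 s for the extreme-frequency case and roughly 0.017 s for the 23 Hz versus 38 Hz case.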
The attributes depending on amplitude (all the rest beyond the six time-dependent attributes mentioned) are more complicated. However, all of them also depend on time, and therefore they partly follow the preceding analysis.
As an example, two different cell lines of the DCM disease are visualized in Fig. 4(A). In terms of the numbers of signals and their peaks, this was the smallest class: there were 639 peaks in 35 signals of cell line 12619.DCM and 533 peaks in 32 signals of cell line 12704.DCM. In Fig. 4(B) there are 3870 peaks of cell lines 0341.LQT2, 03417.LQT2, 03809.LQT2 and 03810.LQT2. The cell lines mostly cover the same areas in each visualization, which reflects distributions in which possible differences between the peaks of the cell lines of each disease class are minor in the current data. In addition, Fig. 5 shows the corresponding presentation computed for the data of all cell lines.

Technical specifications for classification
The genetic cardiac disease data examined in this study is challenging. We have seven classes (wild type and six genetic cardiac diseases) in our dataset, and the total number of signals is 1173. Since the amount of data is still relatively small in relation to the number of classes, we cannot yet talk about big data in this context, compared to many image classification tasks where tens of thousands of annotated images are available for research. The data collection procedure is a laborious task in practice and begins with finding voluntary patients and controls for the research. Then the actual iPSC reprogramming is performed and the iPSCs are differentiated into cardiomyocytes. From the cardiomyocytes the calcium transient signals are measured, which is the final stage of the data collection process. After that, preprocessing of the signals is carried out, including tasks such as finding peaks in the signals and extracting peak-based variables. When the preprocessing of the signals has been done, classification by means of machine learning methods can be performed. When considering the practical limitations behind the classification of genetic cardiac diseases, the classification method must be selected and fine-tuned carefully. Nowadays, deep learning solutions are popular in different domains and have become a standard approach. However, deep learning methods require large amounts of training data, which is a hard constraint to overcome. For genetic cardiac disease classification, interesting deep learning solutions would be, for example, RNN networks [14] and LSTM networks [15]. Nevertheless, we have omitted deep learning solutions in this paper due to the limited dataset and concentrated on methods which can handle small datasets well.
The majority of the methods used were applied also in our earlier studies [5, 6, 10, 11, 16, 17, 18], but we have included a new method which has not been used in the aforementioned studies. Furthermore, we examined two different test set-ups (leave-one-signal-data-out and 10-fold cross-validation), giving a new perspective to the classification of cardiac diseases compared to the earlier studies; these are explained in the subsequent subsections in detail. All experimental tests were performed using a Dell Precision Tower 7810 workstation with dual Intel Xeon E5-2640 v4 @ 2.40 GHz processors, 128 GB RAM and the Win10 Pro operating system. Tests were made using MATLAB 2019b with the Parallel Computing Toolbox and the Statistics and Machine Learning Toolbox.
As a first method, we applied the k-Nearest Neighbor (kNN) classifier [19, 20] with different (k value, distance measure, distance weighting) triplets. We tested altogether 19 different odd values of k (k ∈ {1, 3, 5, …, 37}). The justification for testing only odd k values is to decrease the possibility of ending up with a tie, which would again need special attention. If a tie nevertheless occurred in a classification, it was solved by random choice. In the case of distance measures, we examined eight alternatives: the Chebychev, Manhattan, correlation, cosine, Euclidean, standardized Euclidean, Mahalanobis and Spearman distance measures. Finally, we also tested three distance weighting schemes: equal weighting, inverse weighting and squared inverse weighting. Overall, 456 different triplets were tested with the kNN classifier, and all of them were examined with both test set-up approaches.
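A grid search over such (k, distance, weighting) triplets can be sketched as follows. This is not the study's MATLAB implementation: scikit-learn stands in, only a subset of the eight distance measures and two of the three weighting schemes are shown, and the synthetic data is a placeholder for the peak data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# placeholder data: 300 instances, 3 classes (not the actual peak data)
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=0)
param_grid = {
    "n_neighbors": list(range(1, 38, 2)),              # the 19 odd k values
    "metric": ["chebyshev", "manhattan", "euclidean"], # subset of the 8 measures
    "weights": ["uniform", "distance"],                # equal / inverse weighting
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5).fit(X, y)
```

After fitting, `search.best_params_` holds the best triplet and `search.best_score_` its cross-validated accuracy.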
Decision trees have performed well in our earlier studies, and in this paper we applied two tree-based algorithms. Firstly, we used the classical CART [21, 22] decision tree algorithm with default parameter values. Secondly, we examined the Random Forests classifier [23, 24], for which we tested 100 different values for the number of trees in a forest (#trees ∈ {1, 2, …, 100}). Both the Random Forests and CART classifiers were tested with similar parameter settings on both test set-ups.
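Sweeping the number of trees can be sketched as below (a scikit-learn stand-in for the MATLAB implementation; only a few of the 100 values are evaluated, on synthetic placeholder data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=0)  # placeholder for the peak data
# mean cross-validated accuracy for each candidate forest size
scores = {n: cross_val_score(RandomForestClassifier(n_estimators=n,
                                                    random_state=0),
                             X, y, cv=5).mean()
          for n in (1, 10, 50, 100)}
best_n = max(scores, key=scores.get)  # forest size with the highest accuracy
```

A single tree (n = 1) is essentially a CART-style baseline, so the sweep also shows how much the ensemble improves on it.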
Discriminant analysis is a well-known and traditional method for classification tasks. We investigated the suitability of three discriminant analysis methods for the genetic cardiac disease classification: linear discriminant analysis [25], quadratic discriminant analysis [25, 26] and Mahalanobis discriminant analysis, also known as the minimum Mahalanobis distance classifier [27, 28]. We used the default parameter settings in all discriminant analysis experiments.
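The minimum Mahalanobis distance classifier is easy to state directly: assign each instance to the class whose mean is nearest in Mahalanobis distance under a per-class covariance estimate. A minimal sketch (not the MATLAB implementation used in the study):

```python
import numpy as np

class MahalanobisClassifier:
    """Minimum-Mahalanobis-distance classifier sketch."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = [X[y == c].mean(axis=0) for c in self.classes_]
        # pseudo-inverse guards against (near-)singular covariance matrices
        self.inv_covs_ = [np.linalg.pinv(np.cov(X[y == c], rowvar=False))
                          for c in self.classes_]
        return self

    def predict(self, X):
        # squared Mahalanobis distance of each instance to each class mean
        d = np.array([[(x - m) @ S @ (x - m)
                       for m, S in zip(self.means_, self.inv_covs_)]
                      for x in X])
        return self.classes_[d.argmin(axis=1)]
```

With class-specific covariances this decision rule is closely related to quadratic discriminant analysis, but it omits the prior and log-determinant terms.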
Multinomial logistic regression (MLR) [29, 30] is an extension of the logistic regression applied in binary classification tasks. We used the MLR method with default parameter settings with both test set-ups. The next classification method was the Naïve Bayes classifier [31] with and without a kernel smoothing density estimate (KDE). Without KDE, Naïve Bayes assumes a Gaussian distribution. When KDE is applied with the Naïve Bayes classifier, there are several alternatives for kernel selection. In this study, we tested four kernels, namely the Epanechnikov, normal (Gaussian), triangle and box kernels. These four kernels have also been examined in our previous studies [6, 10, 11, 16, 17, 18]. Otherwise, we applied the default parameter settings with the Naïve Bayes classifier. The Support Vector Machine (SVM) is a well-known classification algorithm. Its original purpose was two-class classification, but it has been extended to cover also multiclass cases where the number of classes is larger than two. We have used binary and different multiclass extensions of Least-Squares Support Vector Machines (LS-SVMs) [32, 33] in our previous studies [5, 6, 10, 11, 16, 17, 18]. However, in this research we have made several modifications compared to the previous research. Firstly, we now apply the SVM algorithm [34] instead of LS-SVMs. Secondly, whereas our earlier LS-SVM approaches included the use of either a binary LS-SVM classifier or a tree-based multiclass LS-SVM, we now utilize the one-vs.-one (OVO) [35] approach, where M(M−1)/2 individual binary SVM classifiers are constructed when M is the number of classes. We use the error-correcting output codes (ECOC) framework [36] to model the OVO approach; due to the properties of OVO, ECOC uses ternary coding [37, 38], and a loss-weighted decoding scheme is applied. When training an individual binary SVM classifier, we utilize the Iterative Single Data Algorithm (ISDA) [39] in hyperplane optimization.
The ISDA algorithm is designed especially for use with large datasets. Moreover, the hinge loss function is applied with the SVM classifier. When dealing with SVMs, parameter and kernel selection are important issues. We tested four kernels (linear, quadratic, cubic and RBF) in this paper. A common parameter for all kernels is the box constraint C, for which we tested 21 values (C ∈ {2^−8, 2^−7, …, 2^12}). In the case of the RBF kernel we also examined the impact of the kernel scale parameter σ, with the same parameter value space as for the box constraint. Thus, all the polynomial kernels were tested with 21 parameter values and the RBF kernel with 21^2 parameter combinations.
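An analogous search can be sketched with scikit-learn, whose `SVC` handles multiclass problems with the same one-vs.-one decomposition (the MATLAB ECOC framework, ISDA solver and loss-weighted decoding are not reproduced here; only a subset of the 21 power-of-two values is swept, on placeholder data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=0)  # placeholder for the peak data
# power-of-two grids for the box constraint C and the RBF kernel parameter
param_grid = {"C": [2.0 ** p for p in (-8, -4, 0, 4, 8, 12)],
              "gamma": [2.0 ** p for p in (-8, -4, 0, 4)]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
```

For the RBF kernel the two grids combine, mirroring the 21 × 21 combinations of the study; for the polynomial kernels only the C grid would be needed.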
The first test set-up was leave-one-signal-data-out (LOSDO), which is a variant of the leave-one-out method. The classification process with the LOSDO technique can be described as follows:

1 Let the number of signals in the data be N.
2 Extract the data from the i-th signal for the test set. Notice that the data from one signal includes several rows of data, each row representing the data gained from one peak.
3 Leave the data from the other N−1 signals to the training set.
4 Perform z-score standardization (columns to zero mean and unit variance) for the training set.
5 Scale the columns of the test set with the scaling parameters gained from the training set in step 4.
6 Train the classifier with the training set and the given parameter settings.
7 Predict class labels for the test set data. Now, the prediction is made at peak level.
8 Take the mode of the predictions in order to get a signal-level prediction. If the mode is not unambiguously determined, take the smallest value occurring in the tie.
9 Repeat steps 2-8 for all signals in the dataset.
10 When the signal-level prediction for all signals has been made, evaluate the confusion matrix.
11 From the confusion matrix (CM), evaluate the accuracy and the TPR (true positive rate) for each class.
12 If a classification method requires parameter tuning, repeat the process with all parameter settings and select as the final result the parameter combination which receives the highest accuracy.
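The steps above can be sketched as follows (a simplified Python version; `X` holds the peak-level rows, `y` the class of each row's signal, `signal_id` tells which signal each row belongs to, and all names are illustrative):

```python
import numpy as np

def losdo(X, y, signal_id, make_classifier):
    """Leave-one-signal-data-out sketch returning signal-level predictions."""
    signal_pred, signal_true = {}, {}
    for s in np.unique(signal_id):
        test = signal_id == s                           # step 2: one signal out
        X_tr, y_tr = X[~test], y[~test]                 # step 3: the rest train
        mu, sd = X_tr.mean(axis=0), X_tr.std(axis=0)    # step 4: z-score params
        sd[sd == 0] = 1.0
        clf = make_classifier().fit((X_tr - mu) / sd, y_tr)   # step 6
        peak_pred = clf.predict((X[test] - mu) / sd)          # steps 5 and 7
        # step 8: signal label = mode of peak predictions; np.unique sorts the
        # labels, so argmax of the counts picks the smallest label on a tie
        vals, counts = np.unique(peak_pred, return_counts=True)
        signal_pred[s] = vals[counts.argmax()]
        signal_true[s] = y[test][0]
    return signal_pred, signal_true
```

Any classifier factory can be plugged in, e.g. `lambda: KNeighborsClassifier(5)` from scikit-learn; note that the standardization parameters are always estimated from the training set only, exactly as in steps 4-5.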
The other test set-up was to use 10-fold cross-validation in classification. 10-fold cross-validation is a standard way to perform classification experiments in machine learning, but in the case of signal classification there are some special issues that need to be taken into account. We used stratified sampling at the signal level when making the 10-fold cross-validation division. By this means, we ensured that all classes in the data are represented in each fold. Furthermore, the data from one signal is included in only one fold; in other words, data from one signal is not scattered over several folds. The same cross-validation division was used with all tested parameter settings and/or classification methods in order to ensure a fair comparison of the results. In detail, the classification procedure goes as follows:

1 Extract the i-th fold from the data to the test set.
2 Leave the rest of the folds to the training set.
3 Perform z-score standardization (columns to zero mean and unit variance) for the training set.
4 Scale the columns of the test set with the scaling parameters gained from the training set in step 3.
5 Train the classifier with the training set and the given parameter settings.
6 Predict class labels for the test set (the i-th fold). Now the predictions are at peak level.
7 Obtain a signal-level prediction for each signal in the test set by taking the mode separately from the predicted class labels of each signal's data. If the mode is not unambiguously determined, take the smallest value occurring in the tie.
8 When the signal-level prediction for all signals within the i-th fold has been made, evaluate the confusion matrix (CM_i).
9 Repeat steps 1-8 for all ten folds.
10 Sum the fold-wise confusion matrices CM_i into a cumulated confusion matrix (CCM).
11 Evaluate the accuracy and the TPR for each class from the CCM.
12 If a classification method requires parameter tuning, repeat the process with all parameter settings and select as the final result the parameter combination which receives the highest accuracy.
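The signal-level stratified division can be sketched by applying scikit-learn's `StratifiedKFold` to the signals rather than to the peak rows (a sketch with illustrative names; the study used MATLAB):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def signal_level_folds(signal_id, signal_class, n_splits=10, seed=0):
    """signal_id: signal index of each peak row; signal_class: mapping from
    a signal to its class. Returns boolean test masks over the peak rows."""
    signals = np.unique(signal_id)
    classes = np.array([signal_class[s] for s in signals])
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    masks = []
    # stratify over signals, then expand each signal fold to its peak rows,
    # so all peaks of a signal land in exactly one fold
    for _, test_idx in skf.split(np.zeros((len(signals), 1)), classes):
        masks.append(np.isin(signal_id, signals[test_idx]))
    return masks
```

Because the folds are formed once and reused, every parameter setting and classifier sees the identical division, as required for a fair comparison.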
Since the accuracy and TPRs are calculated from the CCM, we do not get a standard deviation for the accuracy and/or TPRs. If we calculated the accuracy and/or TPRs from each fold separately, the standard deviation obtained would not be comparable with the accuracy/TPRs obtained from the CCM, since the CCM represents evaluation measures obtained from the whole data, whereas the standard deviation would be calculated over the folds (each fold being around 10% of the data). The LOSDO procedure presents performance measures for the whole data and, thus, performance measures gained from the CCM are a more natural choice than the traditional 10-fold cross-validation practice where performance measures are presented as the mean over ten folds. Accuracy is defined in this study as the sum of the diagonal elements of the CCM divided by the sum of all its elements, i.e., the share of correctly classified signals.
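Computing the accuracy and class-wise TPRs from a cumulated confusion matrix is a few lines (a minimal sketch; rows are taken as true classes and columns as predicted classes):

```python
import numpy as np

def accuracy_and_tpr(ccm):
    """Accuracy and per-class TPR from a (cumulated) confusion matrix."""
    ccm = np.asarray(ccm, dtype=float)
    accuracy = np.trace(ccm) / ccm.sum()      # correctly classified share
    tpr = np.diag(ccm) / ccm.sum(axis=1)      # per-class true positive rate
    return accuracy, tpr
```

For instance, for a two-class matrix [[8, 2], [1, 9]] this gives an accuracy of 17/20 = 0.85 and TPRs of 0.8 and 0.9.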

Comparing model building with cross-validation or leave-one-out techniques
Obviously, the hold-out procedure is the simplest technique for computing machine learning models: an available data set is divided into two parts, a training set and a test set, typically of equal size, or sometimes with a larger training set, e.g., 2/3 of the original data set. If a data set is small, a larger training set is used in an attempt to build a model or classifier with as representative a training set as possible. Sampling for the training and test sets is, of course, made randomly. However, the stratified hold-out is recommended [40], since then the random sampling is executed so that every class of the data set is represented in both the training and test sets approximately according to the class distribution of the original whole data set. Cross-validation [40] can be seen as a more versatile division, where a data set is divided or randomly sampled into several (K) separate parts, called folds, of equal size, or as equal as possible if the number of data instances n is not divisible by K. Usually K is equal to 10, i.e., 10-fold cross-validation is applied, but 5- or 20-fold cross-validation may also be utilized; the choice of 10 is chiefly an established practice. Then, one by one, the K folds or subsets are used as the test set while the corresponding training set is formed from the other K−1 subsets. This way, a test set is around 10% of the original data set and its training set 90%, respectively. The leave-one-out technique is equal to the special cross-validation procedure in which every test set includes only one instance and its training set all the other n−1 instances [40]. In this sense, a maximal number of data instances is used for building a model. The number of models or classifiers built is equal to n (the size of the data set), which may cause long running times. Leave-one-out is mainly applied to small data sets only.
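That leave-one-out really is the K = n special case of K-fold cross-validation can be verified directly with scikit-learn's split generators (a toy illustration):

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X = np.arange(12).reshape(6, 2)              # six toy instances
loo = list(LeaveOneOut().split(X))
kf = list(KFold(n_splits=len(X)).split(X))   # K equal to n, no shuffling
# without shuffling, the two generators produce identical train/test splits
same = all(np.array_equal(t1, t2) and np.array_equal(r1, r2)
           for (r1, t1), (r2, t2) in zip(loo, kf))
```

Here `same` is true: both procedures build n = 6 models, each trained on n−1 instances and tested on the remaining one.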
A classification or other machine learning problem can rarely be fully solved even with a finite (or, in principle, infinite) data set, because the outcome also depends on the capacity of the algorithm applied to the learning problem. The resulting error, called bias, cannot be fully eliminated or calculated precisely in practice, but it can be approximated. The other error source stems from a practical limitation: the training set applied is always finite and limited in reality, and thus cannot represent the actual population of data instances very well. This error over all potential training sets of the same size (and the corresponding test sets) is the variance of a machine learning technique for a given problem. The expected error of a classifier is the sum of bias and variance [40].
Leave-one-out is seen as virtually unbiased but, on the other hand, as having high variance [41]. Classification accuracy is used to express how well a classifier can classify test data instances on average; in a way, it is the opposite of the prediction error. Assuming independent and identically distributed data instances, correct and incorrect classifications were expressed with loss function values 1 and 0 [41]. Then the accuracy value A of a classification could be computed as A = (1/|T|) Σ_{x_i ∈ T} δ(C(x_i), y_i), where T is the test set, x_i its instance, C a classifier and y_i the actual (known) class of x_i. Here δ(C(x_i), y_i) is equal to 1 if C(x_i) is equal to y_i and otherwise it is 0 [41].
K-fold cross-validation was considered in [42] by using the prediction error and its expected value over training sets. The authors studied variance in the context of cross-validation, starting from identically distributed (but possibly dependent) variables (with average Ĥ) having the property of asymptotically converging to a normally distributed variable, which is characterized by its expectation E(Ĥ) and its variance Var(Ĥ).
Further, they used the covariance matrix of the cross-validation errors and found much similarity in its contents, shown in the covariance matrix blocks [42], which was natural, since it was based on cross-validation using the same data set divided into folds. They formed a linear combination of three moments derived from the covariance matrix. Using the moments derived, it was possible to show that no unbiased estimator of the variance Var(Ĥ) exists; this was also proved for leave-one-out [42]. Independence assumptions for K-fold cross-validation and leave-one-out were introduced in [43], and the assumptions were utilized to derive sampling distributions for the cross-validation and leave-one-out estimators. It was shown that K-fold cross-validation should not be executed repeatedly, since this cannot give a more reliable estimate: the mean accuracies produced by any two repetitions are dependent. Also, the sample variance of leave-one-out is constant by its characterization, which was interpreted as another reason not to execute K-fold cross-validation repeatedly [43].
Our current data contained 1173 signals originating from seven classes and comprising different numbers of peaks, from 1 to 123. Peak sizes and shapes varied considerably. A signal classification was made on the basis of its peaks: the peaks were first classified one by one with a classifier, and the class of the signal was decided according to the majority of its peak classifications. Since the actual classifiers tested were applied to peak data instances, the main process focused on them. On the other hand, the results gained are given as classification accuracies for entire signals, each determined by its own peaks. Accordingly, we examined their common influence.
The peak attribute distributions are right-skewed; Fig. 6 shows an example. All 14 attributes were right-skewed, which is natural, since they are biophysical attributes taking non-negative real values. According to the Kolmogorov-Smirnov test, none of them was normally distributed, so the attributes are not ideal in the sense of being independently and identically (normally) distributed.
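A one-sample Kolmogorov-Smirnov check of this kind can be sketched in a few lines. The sketch below only computes the D statistic against a normal distribution fitted to the sample (no critical-value lookup), and the synthetic exponential data merely stand in for a right-skewed, non-negative peak attribute:

```python
import random
from statistics import NormalDist, mean, stdev

def ks_statistic_vs_normal(sample):
    """One-sample Kolmogorov-Smirnov statistic against a normal
    distribution fitted to the sample's mean and standard deviation."""
    dist = NormalDist(mean(sample), stdev(sample))
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = dist.cdf(x)
        # empirical CDF deviates above and below at each sorted point
        d = max(d, abs((i + 1) / n - cdf), abs(cdf - i / n))
    return d

random.seed(1)
right_skewed = [random.expovariate(1.0) for _ in range(500)]  # non-negative, skewed
symmetric = [random.gauss(0.0, 1.0) for _ in range(500)]
# The skewed sample departs from normality much more strongly:
print(ks_statistic_vs_normal(right_skewed), ks_statistic_vs_normal(symmetric))
```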
Although 10-fold cross-validation and leave-one-out have been treated as rather different methods [40][41][42][43], they closely resemble each other in the sense that the majority, approximately 90%, of every 10-fold cross-validation training set consists of the same data instances as any leave-one-out training set. Because sampling is random, any 10% test set drawn from the whole data set in 10-fold cross-validation should mainly follow the same distribution as the corresponding training set. For these reasons we assumed from the outset that, for each classification technique, the differences between accuracies computed with 10-fold cross-validation and with leave-one-out would mostly be minor. The theoretical considerations above suggest that the variance associated with leave-one-out could be extensive. The literature commonly states that leave-one-out should be used for small data sets only; it is, however, rather indefinite what size of data set may or may not be considered small.
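The overlap figures quoted above follow from simple counting. The sketch below computes, for a data set of n instances, the guaranteed overlap between a 10-fold training set and a leave-one-out training set, and the share of instances common to the training sets of two different folds (the function names are ours):

```python
def shared_fraction(n, k):
    """Smallest possible overlap between a K-fold training set and a
    leave-one-out training set, as a fraction of the whole data set.
    At most one instance of the K-fold training set (the left-out
    test instance) can be missing from the LOO training set."""
    fold_train = n - n // k          # size of a K-fold training set
    return (fold_train - 1) / n

def two_fold_overlap(n, k):
    """Fraction of instances shared by the training sets of two
    different folds: each training set omits a disjoint n/k test fold."""
    return (n - 2 * (n // k)) / n

print(shared_fraction(1000, 10))   # -> 0.899 (about 90% of the data)
print(two_fold_overlap(1000, 10))  # -> 0.8
```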
In the present research we had only one data set. Typically more data sets are used when comparing machine learning methods, for example six [41] or more, and often tens of data sets are recommended to obtain a reasonably broad and general picture of the classification results. When several data sets are used, they may be kept rather small to restrict the execution times of the classifications. We instead used a single data set but ran numerous classification algorithms and their variations to obtain generality while keeping execution times reasonable. This restriction was made because the leave-one-out runs of the radial basis function support vector machine (RBF-SVM) alone took around eight weeks for the 441 control parameter value combinations used. The other SVM methods, multinomial logistic regression and random forests were also relatively slow, taking hours or even days, whereas the simpler methods finished in minutes. The more complicated methods nevertheless had to be included in the tests, since they model the data better and gave better results.

Table 3. Leave-one-out: classification results as sensitivities and accuracies for the seven disease classes, where C and σ are the control parameters of the radial basis function support vector machine (SVM RBF). The three highest accuracies are in bold.

Table 4. Leave-one-out: classification results as sensitivities and accuracies, where k is the number of nearest neighbors that produced the best results. The highest accuracy is in bold.

Results
Many machine-learning classification methods were run on the current data to analyze how efficiently the six diseases and the controls could be differentiated from each other on the basis of the peaks recognized from the calcium transient data. Sensitivities (true positive ratios) and accuracies obtained with the leave-one-out technique are shown in Tables 3 and 4. In Table 3, random forests generated the highest classification accuracy, 73.7 %; SVM with a radial basis function (RBF) kernel, at 72.2 %, and decision trees (CART), at 71.5 %, were almost as efficient. In Table 4, kNN with Mahalanobis distance and equal weighting produced the highest accuracy, 68.5 %, and several other variants were virtually as good.
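As an illustration of the distance-weighted kNN variants reported in Table 4, a minimal sketch follows. For brevity it uses the Euclidean distance in place of the Mahalanobis distance, and the feature vectors and class labels are invented:

```python
import math
from collections import defaultdict

def knn_predict(train, x, k=3):
    """k-nearest-neighbour vote with squared-inverse distance weighting.
    `train` is a list of (feature_vector, label) pairs."""
    nearest = sorted((math.dist(f, x), label) for f, label in train)[:k]
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 / (d * d + 1e-12)  # squared-inverse weight
    return max(votes, key=votes.get)

train = [((0.0, 0.0), "WT"), ((0.1, 0.2), "WT"),
         ((3.0, 3.1), "LQT1"), ((2.9, 3.0), "LQT1")]
print(knn_predict(train, (0.2, 0.1), k=3))  # -> WT
```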
Sensitivities and accuracies produced with 10-fold cross-validation are presented in Tables 5 and 6, using the same classification methods as in the leave-one-out tests. The cross-validation series was executed only once, so that every instance was used once for testing, as in the leave-one-out test. In Table 5 the best results are 72.5 % for random forests, 70.3 % for ECOC-SVM RBF and 69.4 % for CART. In Table 6 the best result is 67.9 % for kNN with Mahalanobis squared-inverse weighting.
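The single-pass 10-fold protocol described above, in which every instance is tested exactly once, can be sketched as a plain index generator:

```python
import random

def kfold_indices(n, k, seed=0):
    """Yield (train, test) index lists for one pass of k-fold
    cross-validation; each instance appears in exactly one test fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Every index 0..19 is tested exactly once over the 10 folds:
tested = sorted(j for _, test in kfold_indices(20, 10) for j in test)
print(tested == list(range(20)))  # -> True
```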
Compared with our earlier results, the current classification accuracies for seven classes, that is, six diseases or mutations and controls (wild type), are smaller than the accuracy of 78.6 % obtained with 527 calcium transient signals for four classes (three diseases and controls) [6] and than that obtained with 941 signals for five classes [10]. In Tables 3-6, LQT1 and HCMM are among the best separated classes, whereas the controls (WT) often had the poorest sensitivities (true positive ratios). LQT1 and LQT2 could be separated from each other very accurately, as could HCMM and HCMT, even though the latter two are mutations of the same disease.
Tables 3 and 4 together with Tables 5 and 6 report the accuracies of 40 different methods. For 36 of these methods, the leave-one-out accuracies in Tables 3 and 4 are slightly higher than the corresponding cross-validation accuracies in Tables 5 and 6, with differences from 0.1 % to 2.6 %. For the remaining four methods, cross-validation gave very slightly better accuracies, with differences from 0.1 % to 0.7 %. Altogether, the minimum difference is 0.1 %, the maximum 2.6 % and the mean 1.0 %; in other words, leave-one-out generated better accuracies, but the mean difference is rather insignificant.
Specificity values roughly resembled the sensitivity values; this was the case, for example, for the best accuracy result of 73.7 % in Table 3. For the sake of comparison, we also examined with unsupervised machine learning how the peak data were distributed, by clustering. The k-means++ clustering method was applied to the whole data set with different parameter values. As a preprocessing stage, the features were z-score standardized. Clustering was repeated with 7, 8, 9 and 12 clusters, using the squared Euclidean distance measure. The number of iterations was 1000, and 100 different initializations were tested for all clusterings.
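The preprocessing and seeding steps named above can be sketched as follows. This is a minimal illustration of z-score standardization and k-means++ seeding only, not the full clustering run, and the two-dimensional synthetic points are invented:

```python
import random

def zscore(column):
    """Standardize one feature column to zero mean and unit variance."""
    m = sum(column) / len(column)
    s = (sum((v - m) ** 2 for v in column) / (len(column) - 1)) ** 0.5
    return [(v - m) / s for v in column]

def kmeanspp_seeds(points, k, rng):
    """k-means++ seeding: each new centre is drawn with probability
    proportional to its squared distance from the nearest chosen centre."""
    centres = [rng.choice(points)]
    while len(centres) < k:
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres)
              for p in points]
        centres.append(rng.choices(points, weights=d2, k=1)[0])
    return centres

rng = random.Random(0)
pts = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)] + \
      [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(50)]
seeds = kmeanspp_seeds(pts, 2, rng)
print(len(seeds))  # -> 2
```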
Clustering was performed at the peak level, meaning that each peak of a signal is assigned to a certain cluster. To obtain a signal-level clustering, we take the mode (majority vote) of the peak-level cluster assignments of a signal; this determines the cluster to which the signal belongs. Table 7 shows the best results, obtained with 9 clusters. Its rows represent classes and its columns clusters, and each entry gives the number of signals from a specific class assigned to a specific cluster.

Table 7. k-means++ results computed with 9 clusters. The most frequent class or classes in each cluster are emphasized in bold, and their proportions in the cluster are given in percentages.
These results show that unsupervised machine learning also grouped the data peaks so that a majority class could be found for every cluster except LQT1. Note that more than one cluster may represent the same class; in Table 7 this is the case for HCMM, CPVT and WT, each represented by two clusters. Class LQT1 did not reach a signal majority in any cluster in Table 7, possibly in part because of its relatively small number of peaks among all classes (Table 1). The clustering results were poorer than those of the classification methods, which is natural and typical, since unsupervised clustering methods do not utilize the class information that supervised classification methods do. Nevertheless, these clustering results support our preceding classification results indicating that the genetic diseases included can be classified.

Conclusion
The classification accuracies are smaller than in our earlier studies: for example, the best accuracy in Table 3 is approximately 5 % smaller than the best accuracy for three disease classes and controls [6] and 4 % smaller than the best for four disease classes and controls [10]. This is rational, because the current task with six disease classes and controls is more complicated. Nevertheless, the best accuracy of above 70 % and the individual sensitivities above 59 % obtained with random forests in Table 3 show that the six disease classes and the controls can be separated from each other by machine learning techniques.
The comparison of leave-one-out and cross-validation indicated that, for the current data and test set-ups, leave-one-out produced on average 1 % better accuracies, which is virtually insignificant. This presumably follows from the essential difference between the two techniques: the larger and almost constant training sets of leave-one-out. Any two training sets in the leave-one-out process differ in two instances only, their respective test instances (one present in the first set and absent from the second, and vice versa); otherwise the training sets are identical. In contrast, the 10 training sets of cross-validation differ slightly more from each other (for 10-fold cross-validation, 80 % of the instances are shared between the training sets of two different folds), which may introduce a little more variation into the machine learning models computed with the same method. To summarize, the results show that in principle both techniques could equally well be applied to the current data of 1173 signals containing 18387 peaks from seven classes. Of course, applying leave-one-out to large data sets would require long running times, at least for such time-consuming methods as support vector machines with a radial basis function (RBF) kernel and random forests, which are also typically among the best; their use would then not be practical.
The relatively large data set utilized in this study included two to six hiPSC cell lines from each of the seven conditions (control or disease), and in spite of the complex classification task we obtained a classification accuracy above 70 %. We see this as an indicator that these diseases could be separated from each other by machine learning techniques; even larger data sets could nevertheless strengthen our finding. Abnormalities in calcium transients in many cases represent a patient's cardiac phenotype [7,8,9,12,13]. However, a known problem with hiPSC-CMs is their immature structural and functional characteristics, such as immature calcium handling, which differ from those of adult cardiomyocytes. Variation between hiPSC lines and in their phenotypic responses, e.g. in baseline action potential characteristics and drug responses, has been reported in the literature even between control cell lines [44][45][46]. This variability could exceed the differences between parameters of patient and control hiPSC-CMs. Since these cells can be phenotypically immature, and since culture and assay methods are not standardized, this can be a disadvantage for the development of predictive computational models [45]. One option for reducing variability between cell lines, in addition to increasing cell maturation with certain techniques, is offered by recent advances in genome-editing techniques such as the CRISPR/Cas9 method [47], which allows modification of the stem cell genome so that isogenically matched controls are generated for diseased hiPSC lines [45]. In the future, experimental data sets for machine learning should be broadened with isogenic controls and larger sets of hiPSC lines to better understand the variation between healthy and diseased lines.
The classification tests performed here strengthen our previous results by showing that good classification accuracy can be achieved with disease-specific iPSC-CM calcium transient data, even though the number of test signals was increased here with signals of two additional diseases. The result reinforces our previous finding that machine learning can be utilized to identify several genetic cardiac diseases, and it may also separate mutations in different genes that result in the same clinical phenotype. Machine learning classification of disease-specific CM calcium transients could thus be exploited to diagnose genetic cardiac disease and could even predict the type of disease mutation. For this, more advanced cardiac differentiation methods, e.g. direct differentiation of blood cells into CMs, would be needed in the future to achieve a realistic additional tool for diagnosing genetic cardiac diseases.

Declaration of Competing Interests
As to the manuscript 'On computational classification of genetic cardiac diseases applying iPSC cardiomyocytes' submitted to Computer Methods and Programs in Biomedicine, there is no conflict of interest.