A Neural Network Based on the Johnson $S_\mathrm{U}$ Translation System and Related Application to Electromyogram Classification

Electromyogram (EMG) classification is a key technique in EMG-based control systems. Existing EMG classification methods do not consider the characteristic that the distributions of EMG features exhibit skewness and kurtosis, causing drawbacks such as the requirement of hyperparameter tuning. In this paper, we propose a neural network based on the Johnson $S_\mathrm{U}$ translation system that is capable of representing distributions with skewness and kurtosis. The Johnson system is a normalizing translation that transforms non-normal data into a normal distribution, thereby enabling the representation of a wide range of distributions. In this study, a discriminative model based on the multivariate Johnson $S_\mathrm{U}$ translation system is transformed into a linear combination of coefficients and input vectors using log-linearization. This is then incorporated into a neural network structure, thereby allowing the calculation of the posterior probability of the input vectors for each class and the determination of the model parameters as weight coefficients of the network. The uniqueness of convergence of the network learning is theoretically guaranteed. In the experiments, the suitability of the proposed network for distributions with skewness and kurtosis is evaluated using artificially generated data. Its applicability to real biological data is also evaluated via an EMG classification experiment. The results show that the proposed network achieves high classification performance without the need for hyperparameter optimization.


I. INTRODUCTION
Biosignals such as electroencephalograms (EEGs), electrocardiograms (ECGs), and electromyograms (EMGs) strongly reflect a human's internal state and intentions, and have therefore been applied to human-machine interfaces and diagnosis [1]-[4]. In particular, EMG-based control systems have been widely studied, because EMGs can be voluntarily controlled. Many practical applications have been developed, typified by myoelectric prosthetics, which are prosthetic hands that can be controlled using surface EMGs [5], [6].
According to Oskoei and Hu [7], EMG-based control systems include four main stages: data segmentation, feature extraction, classification, and control. The raw data are first segmented, and then converted into feature vectors. These vectors are classified into predefined categories, before the controller generates output commands for the instruments based on the classification results.
To realize highly intuitive and dexterous control, it is particularly important to achieve a high level of classification performance, both in terms of accuracy and the speed of training and prediction. Classifiers such as the support vector machine (SVM) [8], multilayer perceptron (MLP) with backpropagation learning [9], and k-nearest neighbors algorithm (k-NN) [10] have been widely used. These popular techniques, however, are not always the most suitable for EMG classification despite their high classification abilities. For example, SVMs are computationally expensive for hyperparameter optimization, and the MLP requires a very long training time. Similarly, it is difficult to use k-NN in real-time applications because of the large computational cost of prediction.
To improve the classification ability for a certain purpose, stochastic models can be incorporated into the structure of the classifier if there is some prior knowledge of the input signals [11]-[14]. For instance, Tsuji et al. [14] proposed a Gaussian mixture model-based neural network, known as the log-linearized Gaussian mixture network (LLGMN), by assuming that the input signals obey a Gaussian mixture model.
The authors consider the following assumptions to be prior knowledge of the feature vectors obtained from EMG signals:
• The distribution of input signals for each class is unimodal.
• The distribution has skewness and kurtosis.
The derivation of these assumptions is explained in Section III.
To satisfy the above assumptions, we can use a flexible distribution known as the Johnson distribution [15], which represents the mean, variance, skewness, and kurtosis using four parameters. Its extension to higher dimensions is enabled by the multivariate Johnson translation system [16]-[18]. If we could construct a classifier that incorporates the multivariate Johnson distribution in its structure, it would be applicable to EMG classification and EMG-based systems.
This paper proposes a neural network (NN) based on the Johnson $S_\mathrm{U}$ translation system. The proposed NN represents a flexible distribution by including a discriminative model based on the multivariate Johnson $S_\mathrm{U}$ translation system, thereby supporting the accurate classification of data with skewness and kurtosis. The parameters of the model can be determined as weight coefficients of the proposed NN via learning.
This paper is related to our previous workshop paper [19]. Our previous work was preliminary, and has the following drawbacks:
• The network structure does not correctly represent the Johnson distribution due to the lack of the Jacobian.
• The training algorithm does not guarantee the uniqueness of convergence.
• The dataset variation and comparisons are limited in the EMG classification experiments.
In this paper, the above problems have been solved.
The rest of this paper is organized as follows: Related studies and their characteristic comparisons are described in Section II. Section III explains the derivation of the above assumptions, then describes a discriminative model based on the multivariate Johnson translation system and its transformation to linear combinations of weight coefficients and input vectors via log-linearization. The structure and learning algorithms of the proposed NN are presented in Section IV, and the results of a simulation experiment using artificial data are described in Section V. Section VI outlines the application potential for biosignal classification based on an EMG classification experiment. Finally, Section VII concludes the paper.

II. RELATED WORK
This section summarizes popular algorithms for EMG classification, and compares their characteristics. The algorithms compared in this section are:
• SVM [8]
• LLGMN [14]
• MLP [9]
• Linear logistic regression (LLR) [20]
• k-NN [10]
• Random forest
These algorithms are compared in terms of the following factors, which are significant for EMG classification and EMG-based control systems: (non-)requirement of hyperparameter optimization, speed of training, uniqueness of solutions, speed of prediction, nonlinearity, and computability of posterior probabilities. The first three factors are associated with the effort needed to construct the classifier. Hyperparameter optimization is conducted before training, and typically takes a very long time; hence, the usability of the system is enhanced if this step is not required. Fast training and a unique solution are also desirable to avoid effort and uncertainty in the training of the classifier. Systems that are to be deployed in an online manner require fast prediction, while the nonlinearity and computability of posterior probabilities are related to the accuracy of classification. Although EMG classification problems are unlikely to be linear, their nonlinearity is not overly complex, because each class of EMG signals can be clustered to some extent. Calculating the posterior probabilities has some powerful merits, such as minimizing risk, rejecting options, compensating for class priors, and combining models [20].
Table 1 summarizes the characteristics of the classification algorithms. An SVM is a distinguished classifier that realizes fast training and a unique solution. Its problem, however, is that two hyperparameters must be optimized. Additionally, because an SVM was originally developed as a binary classifier, multi-class classification can take a long time.
The LLGMN is a discriminative model that incorporates Gaussian mixture models into an NN structure, allowing the posterior probability to be accurately calculated. The number of components (how many Gaussian distributions are summed in the model) should be carefully determined, because the classification ability of the LLGMN for data following a non-Gaussian distribution decreases when there are few components.
The MLP is generally more compact than an SVM, and hence gives faster predictions, although training has a large computational cost. The number of layers and units should be determined as hyperparameters.
The LLR is a probabilistic discriminative model that can be trained using Newton's method. The structural limit of the LLR is its inability to solve nonlinear separation problems. The k-NN is a very simple algorithm that does not require training. However, predictions entail large computational expense, as k-NN compares the distance between the input vector and every training vector. The constant k, the number of nearest neighbors used in the voting, is a user-defined constant, and is generally selected by heuristic techniques.
The proposed NN is designed to optimize the above characteristics as much as possible. In particular, there is no need to optimize the hyperparameters, and the uniqueness of solutions and computability of posterior probabilities can be theoretically explained.

III. MODEL STRUCTURE
A. SKEWNESS AND KURTOSIS IN PROCESSED EMG SIGNALS
After EMG signals have been acquired, we extract significant features. Although the raw EMG signals can be considered to obey a Gaussian distribution with zero mean [21], the distribution of the extracted features may exhibit some skewness and kurtosis. We describe how the features acquire skewness and kurtosis in the process of feature extraction. Although many feature extraction methods have been proposed [22]-[24], we focus on the method of Fukuda et al. [6] because of its simplicity and universality.
Fukuda's method consists of two main parts: rectification and smoothing based on a Butterworth low-pass filter. Rectification takes the absolute value of the raw EMG signals, converting negative EMG values to positive values. Rectification is also used in methods such as integrated EMG (IEMG), mean absolute value (MAV), and modified mean absolute value (MMAV) [25], and is strongly related to the occurrence of skewness and kurtosis. Let x be a raw EMG signal that obeys a Gaussian distribution with a mean of 0 and a standard deviation of σ. The skewness and (excess) kurtosis of x are both 0. Fig. 1 (a) shows an example raw EMG signal x and its histogram.
The probability density function of the rectified EMG signal $y = |x|$ is represented as the half-normal density
$p(y) = \sqrt{\frac{2}{\pi\sigma^2}} \exp\left(-\frac{y^2}{2\sigma^2}\right) \quad (y \geq 0).$
The mean $M_y$ and the variance $V_y$ of $y$ are then calculated as
$M_y = \sigma\sqrt{2/\pi}, \qquad V_y = \sigma^2(1 - 2/\pi).$
(Fig. 1: the raw EMG $x$ follows a Gaussian distribution with zero mean, as also discussed in Hogan and Mann [21]; the histograms of (b) the rectified EMG $y$ and (c) the smoothed EMG $z$, however, become asymmetric and include skewness and kurtosis.)
In the rectified signal $y$, the skewness $S_y$ and the kurtosis $K_y$ are no longer 0. They can be calculated as follows [26]:
$S_y = \frac{\sqrt{2}\,(4 - \pi)}{(\pi - 2)^{3/2}} \approx 0.995, \qquad K_y = \frac{8(\pi - 3)}{(\pi - 2)^2} \approx 0.869,$
where $K_y$ denotes excess kurtosis. The influence of rectification is also visible in the smoothed signal $z$ (see Fig. 1 (c)).
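As a numerical check on these closed-form moments, the following sketch (NumPy/SciPy assumed; σ = 1 chosen for illustration) samples a zero-mean Gaussian "raw EMG", rectifies it, and compares the empirical moments with the half-normal values quoted above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sigma = 1.0
x = rng.normal(0.0, sigma, size=1_000_000)  # raw EMG model: zero-mean Gaussian
y = np.abs(x)                               # full-wave rectification

# Closed-form half-normal moments of y = |x|
mean_theory = sigma * np.sqrt(2.0 / np.pi)                       # ~0.798
var_theory = sigma**2 * (1.0 - 2.0 / np.pi)                      # ~0.363
skew_theory = np.sqrt(2.0) * (4.0 - np.pi) / (np.pi - 2.0)**1.5  # ~0.995
kurt_theory = 8.0 * (np.pi - 3.0) / (np.pi - 2.0)**2             # ~0.869 (excess)

print(y.mean(), mean_theory)
print(y.var(), var_theory)
print(stats.skew(y), skew_theory)
print(stats.kurtosis(y), kurt_theory)  # scipy reports excess kurtosis by default
```

With one million samples, the empirical skewness and excess kurtosis agree with the analytic values to about two decimal places.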
Because the extracted features include skewness and kurtosis, conventional Gaussian-based models cannot readily model the data. In the next subsection, we therefore adopt the multivariate Johnson translation system [16], which is suitable for data with skewness and kurtosis.

B. MULTIVARIATE JOHNSON TRANSLATION SYSTEM
The Johnson translation system is a family of transformations of a non-normal variate to a standard normal variate, proposed by N. L. Johnson in 1949 [15]. Based on this translation, Johnson derived a system of distributions that is suitable for representing distributions with skewness and kurtosis. Its multivariate extension has also been proposed [16].
Consider a d-dimensional continuous random vector $x \in \mathbb{R}^d$ with skewness and kurtosis. The multivariate Johnson translation system [16] involves the normalizing translation
$z_i = \gamma_i + \delta_i\, g_i\!\left(\frac{x_i - \xi_i}{\lambda_i}\right) \quad (i = 1, \ldots, d), \qquad (6)$
where $z$ is a random vector obeying a normal distribution with mean 0 and covariance $\Sigma$, $\xi$ is a location parameter, $\lambda$ is a scale parameter, and $g_i(\cdot)$ $(i = 1, \ldots, d)$ denotes the transformation function that determines the family of the system. $g_i(\cdot)$ is defined by one of the following four functions:
$g(y) = \ln y$ ($S_\mathrm{L}$: log-normal system),
$g(y) = \sinh^{-1} y = \ln\left(y + \sqrt{y^2 + 1}\right)$ ($S_\mathrm{U}$: unbounded system),
$g(y) = \ln\left(\frac{y}{1 - y}\right)$ ($S_\mathrm{B}$: bounded system),
$g(y) = y$ ($S_\mathrm{N}$: normal system). $\qquad (7)$
The domains of $x_i$ for $S_\mathrm{L}$, $S_\mathrm{U}$, $S_\mathrm{B}$, and $S_\mathrm{N}$ are $(\xi, +\infty)$, $(-\infty, +\infty)$, $(\xi, \xi + \lambda)$, and $(-\infty, +\infty)$, respectively. In (6), the parameters $\xi$ and $\lambda$ affect the location and scale of the distribution of $x$, respectively. The combination of $\gamma$ and $\delta$ is associated with skewness and kurtosis, and $g[\cdot]$ decides the shape of the distribution tails, i.e., whether the tails are bounded or extend to infinity.
Since EMG can be seen as a random process, the bounded systems are not suitable for EMG classification, because probabilities cannot be calculated if an observation falls outside the domain. Among the unbounded systems, $S_\mathrm{U}$ in particular can be regarded as an extension of the normal distribution, and has enough flexibility to fit data from arbitrary unimodal distributions. This paper therefore focuses on $S_\mathrm{U}$ as the form of the function $g_i(y)$. Fig. 2 shows an example of the Johnson $S_\mathrm{U}$ distribution, which can be calculated from the Johnson $S_\mathrm{U}$ translation system ($d = 1$). This asymmetric distribution represents skewness and kurtosis, and appears adaptable to the histogram of smoothed EMG shown in Fig. 1 (c).
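The scalar $S_\mathrm{U}$ translation and its inverse can be sketched as follows (the parameter values are illustrative, not taken from the paper). Passing $S_\mathrm{U}$-distributed data through the forward translation should recover a standard normal sample:

```python
import numpy as np

def su_forward(x, gamma, delta, lam, xi):
    """Johnson S_U normalizing translation: non-normal x -> standard normal z."""
    return gamma + delta * np.arcsinh((x - xi) / lam)

def su_inverse(z, gamma, delta, lam, xi):
    """Inverse translation: standard normal z -> Johnson S_U variate x."""
    return xi + lam * np.sinh((z - gamma) / delta)

rng = np.random.default_rng(0)
gamma, delta, lam, xi = 0.9, 1.2, 0.6, 0.05  # illustrative parameters

z = rng.standard_normal(200_000)
x = su_inverse(z, gamma, delta, lam, xi)       # skewed, heavy-tailed sample
z_back = su_forward(x, gamma, delta, lam, xi)  # should recover N(0, 1)

print(z_back.mean(), z_back.std())  # close to 0 and 1
```

The round trip is exact up to floating-point error, and the forward-translated sample is standard normal, which is the defining property of the normalizing translation (6).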

C. POSTERIOR PROBABILITY ESTIMATION
To classify the vector $x \in \mathbb{R}^d$ into one of the given C classes, we must examine the posterior probability $P(c|x)$ $(c = 1, \ldots, C)$. First, x is translated into a vector $z^{(c)}$ using (6) with the four class-specific parameters $\gamma^{(c)}$, $\delta^{(c)}$, $\lambda^{(c)}$, and $\xi^{(c)}$. Assuming that the translated vector obeys a normal distribution, the posterior probability of x for class c is calculated as
$P(c|x) = \frac{P(c)\, \mathcal{N}\!\left(z^{(c)}; 0, \Sigma^{(c)}\right) \left|\det J^{(c)}\right|}{\sum_{c'=1}^{C} P(c')\, \mathcal{N}\!\left(z^{(c')}; 0, \Sigma^{(c')}\right) \left|\det J^{(c')}\right|}, \qquad (8)$
where $P(c)$ is the prior probability of c, $\Sigma^{(c)}$ is the covariance matrix of $z^{(c)}$, and $J^{(c)}$ is the $d \times d$ Jacobian matrix of the translation, whose $(i, j)$th element is given by
$J^{(c)}_{i,j} = \frac{\partial z^{(c)}_i}{\partial x_j} = \frac{\delta^{(c)}_i}{\lambda^{(c)}_i \sqrt{1 + \left(y^{(c)}_i\right)^2}}\, \delta_{i,j},$
where $y^{(c)}_i = (x_i - \xi^{(c)}_i)/\lambda^{(c)}_i$ for the $S_\mathrm{U}$ system. The determinant of $J^{(c)}$ is therefore calculated as
$\det J^{(c)} = \prod_{i=1}^{d} \frac{\delta^{(c)}_i}{\lambda^{(c)}_i \sqrt{1 + \left(y^{(c)}_i\right)^2}}. \qquad (12)$
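A minimal sketch of this posterior computation for two hypothetical classes (d = 2; all parameter values are invented for illustration, and identity covariance is assumed for the translated vectors) is:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical per-class S_U parameters (d = 2, C = 2); Sigma is the
# covariance of the translated vector z^(c), taken as the identity here.
params = {
    1: dict(gamma=np.array([0.2, -0.1]), delta=np.array([1.0, 1.2]),
            lam=np.array([0.5, 0.6]), xi=np.array([0.1, 0.2]),
            Sigma=np.eye(2), prior=0.5),
    2: dict(gamma=np.array([-0.3, 0.4]), delta=np.array([0.8, 1.1]),
            lam=np.array([0.7, 0.4]), xi=np.array([0.3, 0.1]),
            Sigma=np.eye(2), prior=0.5),
}

def class_score(x, p):
    """Numerator of (8): P(c) * N(z^(c); 0, Sigma^(c)) * |det J^(c)|."""
    y = (x - p["xi"]) / p["lam"]
    z = p["gamma"] + p["delta"] * np.arcsinh(y)
    # Diagonal Jacobian of the S_U translation: dz_i/dx_i = delta_i / (lam_i * sqrt(1 + y_i^2))
    det_J = np.prod(p["delta"] / (p["lam"] * np.sqrt(1.0 + y**2)))
    return p["prior"] * multivariate_normal.pdf(z, mean=np.zeros(2), cov=p["Sigma"]) * abs(det_J)

def posterior(x):
    scores = {c: class_score(x, p) for c, p in params.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

post = posterior(np.array([0.4, 0.3]))
print(post)  # the two posteriors sum to 1
```

The normalization in `posterior` implements the denominator of (8), so the outputs sum to one by construction.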

D. LOG-LINEARIZATION
To incorporate the probabilistic model described above into a network structure, we transform the calculation of the Johnson translation and posterior probability estimation to linear combinations of coefficient matrices and input vectors.
First, let $y^{(c)}$ be the argument of the function $g(\cdot)$ in (6), i.e., $y^{(c)}_i = (x_i - \xi^{(c)}_i)/\lambda^{(c)}_i$. $y^{(c)}$ is then transformed as follows:
$y^{(c)} = {}^{(1)}W^{(c)\mathrm{T}} X. \qquad (13)$
Hence, $y^{(c)}$ is expressed by multiplying the coefficient matrix ${}^{(1)}W^{(c)} \in \mathbb{R}^{(d+1)\times d}$ and the augmented input vector $X = [x^\mathrm{T}, 1]^\mathrm{T} \in \mathbb{R}^{d+1}$.
Second, the translated vector $z^{(c)}$ is also transformed and expressed as the product of a coefficient matrix and an augmented vector as follows:
$z^{(c)} = {}^{(2)}W^{(c)\mathrm{T}} Y^{(c)}, \qquad (14)$
where ${}^{(2)}W^{(c)} \in \mathbb{R}^{(d+1)\times d}$ is a coefficient matrix and $Y^{(c)} = [g(y^{(c)}_1), \ldots, g(y^{(c)}_d), 1]^\mathrm{T}$ is the augmented transformed vector. Finally, setting $\zeta_c$ as the numerator of (8) and taking the log-linearization of $\zeta_c$ gives (16), where $s_{i,j}$ are elements of the inverse matrix $\Sigma^{(c)-1}$, and $\delta_{i,j}$ is the Kronecker delta (1 if $i = j$, 0 otherwise). Note that $(2\pi)^{-d/2}$ in (9) is omitted because it is canceled out in (8). Additionally, the coefficient vector ${}^{(3)}W^{(c)}$ is defined to collect the probabilistic parameters, such as $\ln P(c)$ and the elements $s_{i,j}$ of $\Sigma^{(c)-1}$. Taking the exponent of (16), $\zeta_c$ (i.e., the numerator of (8)) is ultimately expressed by (17). As outlined above, the Johnson translation and posterior probability estimation are calculated as linear combinations of coefficient matrices and nonlinearly transformed input vectors. If these coefficients are appropriately determined, the parameters and structure of the model can be defined, and therefore the posterior probability of the input vectors can be calculated for each class.
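The affine part of the log-linearization can be illustrated numerically. With hypothetical location and scale parameters, a coefficient matrix of the form described above (diag($1/\lambda$) stacked on a row of $-\xi_i/\lambda_i$) reproduces $(x - \xi)/\lambda$ exactly from the augmented input:

```python
import numpy as np

d = 3
lam = np.array([0.5, 0.8, 1.2])  # hypothetical scale parameters
xi = np.array([0.1, -0.2, 0.3])  # hypothetical location parameters

# Coefficient matrix (1)W in R^{(d+1) x d}: diag(1/lambda) stacked on -xi/lambda,
# so that y = W^T X with the augmented input X = [x; 1].
W1 = np.vstack([np.diag(1.0 / lam), -(xi / lam)])

x = np.array([0.4, 0.6, 0.2])
X = np.append(x, 1.0)            # augmented input vector

y_linear = W1.T @ X              # log-linearized form
y_direct = (x - xi) / lam        # original affine computation

print(np.allclose(y_linear, y_direct))  # True
```

This is exactly the trick that lets the affine step be absorbed into a single weight matrix between network layers.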
The next section describes how the NN weight coefficients ${}^{(1)}W^{(c)}$, ${}^{(2)}W^{(c)}$, and ${}^{(3)}W^{(c)}$ are determined via learning.
FIGURE 3. Structure of the proposed NN. This network is constructed by incorporating the posterior probability calculation based on the Johnson $S_\mathrm{U}$ translation system into the network structure, and consequently consists of five layers. Summation, identity, and multiplication units are denoted by the corresponding symbols in the figure. The weight coefficients between the first/second layers and the second/third layers correspond to the parameters of the Johnson translation system. The weight coefficients between the fourth/fifth layers correspond to the probabilistic parameters, such as the prior probability and the covariance matrix. Because of this structure, the output ${}^{(5)}O_c$ of this network estimates the posterior probability of each class c given x, $P(c|x)$.

IV. PROPOSED NEURAL NETWORK
A. NETWORK STRUCTURE
Fig. 3 shows the structure of the proposed NN. This is a five-layer feedforward network with weight coefficients ${}^{(1)}W^{(c)}$, ${}^{(2)}W^{(c)}$, and ${}^{(3)}W^{(c)}$ between the first/second, second/third, and fourth/fifth layers, respectively. Summation, identity, and multiplication units are denoted by the corresponding symbols in Fig. 3. Because of this structure, the output ${}^{(5)}O_c$ of this network estimates the posterior probability of each class c given x, $P(c|x)$.
The first layer consists of $d + 1$ units corresponding to the dimensions of the input data x. The relationship between the input and the output is defined as
${}^{(1)}O_i = {}^{(1)}I_i = x_i \quad (i = 1, \ldots, d), \qquad {}^{(1)}O_{d+1} = {}^{(1)}I_{d+1} = 1, \qquad (1)$
where ${}^{(1)}I_i$ and ${}^{(1)}O_i$ are the input and output of the ith unit, respectively. This layer corresponds to the construction of X in (13).
The second layer is composed of $C(d + 1)$ units, each receiving the output of the first layer weighted by the coefficient ${}^{(1)}w^{(c)}_{i,j}$. The relationship between the input ${}^{(2)}I_{c,j}$ and the output ${}^{(2)}O_{c,j}$ of unit $\{c, j\}$ $(c = 1, \ldots, C,\ j = 1, \ldots, d + 1)$ is described as
${}^{(2)}I_{c,j} = \sum_{i=1}^{d+1} {}^{(1)}w^{(c)}_{i,j}\, {}^{(1)}O_i, \qquad (2)$
where the weight coefficient ${}^{(1)}w^{(c)}_{i,j}$ is an element of the matrix ${}^{(1)}W^{(c)}$. This layer is equal to the multiplication of ${}^{(1)}W^{(c)}$ and X in (13), the construction of $Y^{(c)}$ in (14), and the non-coefficient part of the Jacobian in (12). The third layer is comprised of $Cd$ units. The relationship between the input ${}^{(3)}I_{c,k}$ and the output ${}^{(3)}O_{c,k}$ is defined as
${}^{(3)}I_{c,k} = \sum_{j=1}^{d+1} {}^{(2)}w^{(c)}_{j,k}\, {}^{(2)}O_{c,j},$
where the weight coefficient ${}^{(2)}w^{(c)}_{j,k}$ is an element of the matrix ${}^{(2)}W^{(c)}$. This layer corresponds to the multiplication of ${}^{(2)}W^{(c)}$ and $Y^{(c)}$ in (14).

B. LEARNING ALGORITHM
This subsection describes a learning algorithm that can acquire a unique optimal solution without any hyperparameters.
The learning algorithm consists of two steps.
In the first step, we estimate ${}^{(1)}W^{(c)}$ and ${}^{(2)}W^{(c)}$, which contain the parameters of the Johnson translation system. Although various parameter estimation algorithms have been proposed for the Johnson translation system, this paper adopts the percentile method [27], because it analytically estimates the parameters with a certain degree of accuracy. The percentile method calculates the Johnson system parameters by comparing distances in the tails with distances in the central portion of the distribution, using four percentiles of the data, where a value $z > 0$ defining these percentiles is chosen depending on the number of data points. For more detail, refer to [27]. ${}^{(1)}W^{(c)}$ and ${}^{(2)}W^{(c)}$ can then be determined by substituting the estimated $\gamma^{(c)}_i$, $\delta^{(c)}_i$, $\lambda^{(c)}_i$, and $\xi^{(c)}_i$ into (13) and (14).
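The percentile method itself is detailed in [27]. As an illustrative alternative estimator (not the paper's method), SciPy's `johnsonsu.fit` performs maximum-likelihood fitting under the same $(\gamma, \delta, \xi, \lambda)$ parameterization, and recovers parameters from synthetic $S_\mathrm{U}$ data reasonably well:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Generate S_U data with known parameters (gamma, delta, xi, lambda)
gamma_t, delta_t, xi_t, lam_t = 0.9, 1.5, 0.05, 0.6
data = xi_t + lam_t * np.sinh((rng.standard_normal(50_000) - gamma_t) / delta_t)

# scipy's johnsonsu uses the same parameterization:
#   z = a + b * asinh((x - loc) / scale), i.e. (a, b, loc, scale) = (gamma, delta, xi, lambda)
a, b, loc, scale = stats.johnsonsu.fit(data)
print(a, b, loc, scale)  # close to (0.9, 1.5, 0.05, 0.6)
```

Either estimator yields the per-class parameters that are substituted into the first two weight matrices; the percentile method has the advantage of being analytic and deterministic.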
The second step concerns the discriminative learning of the remaining weight ${}^{(3)}W^{(c)}$, which includes probabilistic parameters such as the prior probability $P(c)$ and covariance matrix $\Sigma^{(c)}$. A set of vectors $x^{(n)}$ $(n = 1, \ldots, N)$ is given for training, with the teacher vector $T^{(n)} = (T^{(n)}_1, \ldots, T^{(n)}_C)^\mathrm{T}$, where $T^{(n)}_c = 1$ if $x^{(n)}$ belongs to class c and $T^{(n)}_c = 0$ otherwise. Learning involves minimizing the energy function E, which is defined as
$E = -\sum_{n=1}^{N} \sum_{c=1}^{C} T^{(n)}_c \ln {}^{(5)}O^{(n)}_c,$
to maximize the log-likelihood. Here, ${}^{(5)}O^{(n)}_c$ is the output for an input vector $x^{(n)}$. The weight modification for ${}^{(3)}w^{(c)}_h$ based on Newton's method is defined as
${}^{(3)}W_\mathrm{new} = {}^{(3)}W_\mathrm{old} - H^{-1} \nabla E,$
where ${}^{(3)}W_\mathrm{old}$ and ${}^{(3)}W_\mathrm{new}$ are the weight coefficients before and after the weight modification, which have ${}^{(3)}W^{(c)}$ in the cth block. $\nabla E$ is the gradient vector, whose hth element in the cth block can be calculated as
$\frac{\partial E}{\partial {}^{(3)}w^{(c)}_h} = \sum_{n=1}^{N} \left( {}^{(5)}O^{(n)}_c - T^{(n)}_c \right) {}^{(4)}O^{(n)}_h,$
and H is the block-structured Hessian matrix, where the $(h, l)$ element of block $(c, k)$ is
$\sum_{n=1}^{N} {}^{(5)}O^{(n)}_c \left( \delta_{c,k} - {}^{(5)}O^{(n)}_k \right) {}^{(4)}O^{(n)}_h\, {}^{(4)}O^{(n)}_l.$
Note that H is positive semi-definite (see Appendix A). It follows that E is a convex function of ${}^{(3)}W^{(c)}$, and hence has a unique minimum. Using this algorithm, the process of training the network converges to a unique solution without the need for any hyperparameters.
(TABLE 2. Parameters used for data generation in Section V — Class 1: 0.15, 0.04, 0.9, -0.9, 0.6, 0.7, 0.05, 0.8, 0.5; Class 2: 0.5, 0.05, 0.8, 0.5, 0.9, 0.55, 0.01, 0.5, -0.5.)
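Because this final-layer training reduces to Newton's method on a convex cross-entropy, it can be sketched as multiclass logistic regression on the transformed features. The data below are a toy stand-in, and the small ridge term is a numerical-stability convenience, not part of the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N samples, H nonlinearly transformed features, C classes.
N, H, C = 200, 3, 2
Z = np.hstack([rng.normal(0, 1, (N, H - 1)), np.ones((N, 1))])  # last column: bias
true_w = rng.normal(0, 1, (H, C))
T = np.eye(C)[np.argmax(Z @ true_w, axis=1)]                    # one-hot teacher vectors

W = np.zeros((H, C))
for _ in range(20):
    logits = Z @ W
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)                           # network outputs (5)O_c
    grad = (Z.T @ (P - T)).ravel(order="F")                     # dE/dW, stacked per class
    # Hessian: block (c, k) has elements sum_n P_nc (delta_ck - P_nk) z_nh z_nl
    Hm = np.zeros((H * C, H * C))
    for c in range(C):
        for k in range(C):
            D = P[:, c] * ((c == k) - P[:, k])
            Hm[c*H:(c+1)*H, k*H:(k+1)*H] = Z.T @ (D[:, None] * Z)
    Hm += 1e-6 * np.eye(H * C)                                  # ridge for numerical stability
    W -= np.linalg.solve(Hm, grad).reshape((H, C), order="F")

E = -np.sum(T * np.log(P + 1e-12))                              # cross-entropy energy
print(E)  # small on this linearly separable toy problem
```

Convexity of E in W is what guarantees the unique convergence claimed above; the Newton iterations drive the energy down monotonically toward that unique minimum.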

V. SIMULATION EXPERIMENT
A. METHOD
To verify that the proposed network can properly calculate the posterior probability for data with skewness and kurtosis, we performed a simulation experiment using two-dimensional (d = 2), two-class (C = 2) data. The data were artificially generated using the inverse of the multivariate Johnson $S_\mathrm{U}$ translation system [16]. Table 2 lists the parameters used for each class in this generation. An example of a dataset used in the experiments is shown in Fig. 4. Each class has different skewness and kurtosis qualities, as well as a different mean and variance.
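Data generation via the inverse translation can be sketched as follows. The parameter values below are hypothetical placeholders, not the paper's Table 2 values; correlation between dimensions is introduced through the covariance of the underlying normal vector:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_su_samples(n, gamma, delta, lam, xi, corr):
    """Draw x by inverting the bivariate Johnson S_U translation:
    z ~ N(0, Sigma), then componentwise x_i = xi_i + lam_i * sinh((z_i - gamma_i) / delta_i)."""
    Sigma = np.array([[1.0, corr], [corr, 1.0]])
    z = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
    return xi + lam * np.sinh((z - gamma) / delta)

# Hypothetical parameters for two classes
x1 = generate_su_samples(100, np.array([0.9, -0.9]), np.array([0.6, 0.7]),
                         np.array([0.05, 0.8]), np.array([0.15, 0.04]), 0.5)
x2 = generate_su_samples(100, np.array([0.8, 0.5]), np.array([0.9, 0.55]),
                         np.array([0.01, 0.5]), np.array([0.5, 0.05]), -0.5)
print(x1.shape, x2.shape)  # (100, 2) (100, 2)
```

Each generated class then exhibits its own skewness, kurtosis, mean, and variance, as in the experiment described above.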
In the experiment, 100 samples were treated as training samples for each class. The function $g_i(y)$ was of type $S_\mathrm{U}$ (unbounded). After training, the proposed NN was tested using inputs in the range $0 \leq x_1 \leq 1$ and $0 \leq x_2 \leq 1$. The corresponding posterior probabilities were compared with those given by LLR and LLGMN [14]. The LLR was trained using Newton's method [20], and the LLGMN was trained by terminal learning [28] with an ideal convergence time of 1.0 and a learning sampling time of 0.001. The number of components $M_c$ in the LLGMN was varied from 1 to 10.

B. RESULTS AND DISCUSSION
Fig. 5 shows the posterior probability of class 1 ($P(c = 1|x)$) given by the proposed NN, LLR, and LLGMN. The probability of class 2 ($P(c = 2|x)$) is clear from this graph, because it can be calculated as $P(c = 2|x) = 1 - P(c = 1|x)$.
From Fig. 5 (a), it is clear that the posterior probability given by the proposed NN resembles the distribution shape of the experimental data of class 1 (Fig. 4). The probability given by LLR (Fig. 5 (b)) is totally different from the experimental data distribution. Although LLGMN with $M_c = 1$ (Fig. 5 (c)) also differs from the experimental data distribution, this classifier produces probabilities that become closer to the experimental data as the number of components increases.
It can be inferred that the proposed NN is capable of appropriately dealing with data including skewness and kurtosis, because it is based on the Johnson translation system. In contrast, LLR cannot be adapted to data with skewness and kurtosis, because it is a linear classifier. LLGMN is capable of handling data with skewness and kurtosis when the number of components is sufficiently large, but does not represent the data distribution well with few components. The above results demonstrate that the proposed NN can handle data including skewness and kurtosis without hyperparameters, whereas conventional methods require a hyperparameter optimization step.

VI. EMG CLASSIFICATION EXPERIMENT
A. METHOD
To evaluate the suitability of the proposed network for real biological data, a classification experiment was conducted using EMG data. Details of the data acquisition are described in the next subsection; Table 3 shows the characteristics of the six datasets. We compared the performance of the proposed NN with that of ν-SVM [29] with a one-vs-one classifier, LLGMN [14], MLP, LLR, and k-NN. The hyperparameters of ν-SVM (γ and ν) were optimized by 10-fold cross-validation (CV) and a 10 × 10 grid search (γ ranging from $5.0$ to $1.0 \times 10^{-5}$, and ν ranging from $\nu_{\max}$ to $1.0 \times 10^{-5}$, at even intervals in logarithmic space, where $\nu_{\max}$ depends on the ratio of labels in the training data). LLGMN was trained by terminal learning [28] with an ideal convergence time of 1.0 and a learning sampling time of 0.001. The number of components (from 1 to 5) in the LLGMN was determined using 10-fold CV. The number of nodes (from d to d + 10) in the hidden layer of the MLP was also determined using 10-fold CV, and the MLP was trained using the backpropagation algorithm with a learning rate of 0.1. The LLR was trained using Newton's method, and the value of k in the k-NN algorithm was chosen in the range 1 to 10 using 10-fold CV. All algorithms were programmed in C++ using the dlib C++ Library [30]. The experiments were run on a computer with an Intel Core(TM) i7-3770K (3.5 GHz) processor and 16.0 GB RAM for Datasets I-V, and an Intel Core(TM) i7-7700K (4.2 GHz) processor and 16.0 GB RAM for Dataset VI.
To evaluate the usefulness of a classifier for real-world applications, it is necessary to measure not only the classification accuracy, but also the training/preparation time and the prediction time. We therefore compared the performance of the above algorithms through four metrics: accuracy, CV time, training time, and prediction time. Accuracy is defined as $100 \times N_\mathrm{correct}/N_\mathrm{total}$, where $N_\mathrm{correct}$ is the number of correctly classified test samples and $N_\mathrm{total}$ is the total number of test samples.

B. DATASETS
For dataset I, a healthy 22-year-old male subject performed six successive motions in a relaxed state (C = 6; M1: hand opening; M2: hand grasping; M3: wrist extension; M4: wrist flexion; M5: pronation; M6: supination). EMG signals were recorded at 1 kHz and digitized using a 16-bit A/D converter. Feature extraction was then conducted according to the method of [6]. The signals were rectified and smoothed using a second-order Butterworth low-pass filter with a cut-off frequency of 1 Hz. These features were defined as $EMG_i(n)$ $(i = 1, \ldots, d;\ n = 1, \ldots, N$, where N is the number of data points) and normalized as follows:
$x^{(n)}_i = \frac{EMG_i(n) - \overline{EMG^{st}_i}}{\sum_{i'=1}^{d} \left( EMG_{i'}(n) - \overline{EMG^{st}_{i'}} \right)},$
where $\overline{EMG^{st}_i}$ is the mean of $EMG_i(n)$ in a state of muscular relaxation. $x^{(n)}_i$ was then used as the input for the network.
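The feature-extraction pipeline described above (rectification, second-order 1-Hz Butterworth smoothing, and normalization against a resting level) can be sketched as follows; the channel-sum normalization is an assumption based on [6], and the input signal here is synthetic noise rather than real EMG:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def extract_features(emg, fs=1000.0, cutoff=1.0, rest_mean=None):
    """Rectify and smooth multichannel raw EMG (N x d), then normalize so the
    channels of each sample sum to 1 (assumed normalization, after [6])."""
    rectified = np.abs(emg)                            # full-wave rectification
    b, a = butter(2, cutoff, btype="low", fs=fs)       # second-order, 1-Hz cutoff
    smoothed = filtfilt(b, a, rectified, axis=0)       # zero-phase smoothing
    if rest_mean is None:
        rest_mean = np.zeros(emg.shape[1])             # resting-state channel means
    shifted = smoothed - rest_mean
    return shifted / shifted.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
emg = rng.normal(0.0, [0.5, 1.0, 1.5], size=(4000, 3))  # synthetic 3-channel raw EMG
x = extract_features(emg)
print(x.shape)            # (4000, 3)
print(x.sum(axis=1)[:3])  # each row sums to 1
```

Zero-phase filtering (`filtfilt`) is used here for convenience; a real-time system would instead use a causal filter, at the cost of group delay.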
Datasets II and III are those used in [31] and [32], respectively. Dataset II contains measurements from eight subjects (aged 20-35 years) seated on an armchair with their hands on a steering wheel attached to a desk, performing twelve classes of finger pressures and two classes of finger pointing (i.e., a total of 14 classes (C = 14)). For dataset III, eight subjects (aged 20-35 years) performed fifteen classes of finger and hand movements while seated on an armchair with their arm supported and fixed in one position. In both of these datasets, EMG signals were recorded using eight-channel electrodes at 4 kHz and digitized using a 12-bit A/D converter. Feature extraction for these datasets was conducted by rectification and smoothing, as for dataset I.
Dataset IV contains measurements from a healthy 23-year-old male performing sixteen forearm motions (C = 16). EMG signals were recorded using thirteen pairs of electrodes (d = 13) at 1 kHz with a 60-Hz notch filter and a band-pass filter of 0.1-200 Hz. Details of the experimental conditions are described in [33]. Additionally, to evaluate the performance of the proposed NN on more difficult classification problems, we also examined the change in accuracy as the number of electrodes was reduced. Electrodes were eliminated in order, starting from those with the largest channel numbers, and the average classification accuracy and standard deviation were then calculated over all trials.
Dataset V is the Ninapro Database 3 exercise 1 [34], which is available on Ninaweb. EMG signals were recorded from 11 trans-radial amputated subjects using 12 electrodes (d = 12) at 2 kHz while they conducted 17 movements (C = 17). Each movement lasted five seconds and was repeated six times with a rest interval of three seconds. Because some electrodes were missing for two subjects, nine of the 11 subjects were used so that the number of channels was uniform in the classification experiment. Feature extraction for this dataset was conducted by rectification and smoothing, as for the other datasets.
Dataset VI is the sEMG for Basic Hand movements Data Set provided by Sapsanis et al. [35]. Five healthy subjects (two males and three females) were asked to perform six grasping movements (C = 6): holding a cylindrical tool, supporting a heavy load, holding a small tool, grasping with the palm facing the object, holding a spherical tool, and holding a thin and flat object. Each movement lasted six seconds and was repeated 30 times. EMG signals were collected from two forearm surface EMG electrodes (d = 2) at a sampling rate of 500 Hz. Feature extraction was conducted in the same way as for the other datasets, and the initial 1,000 samples for each movement were then discarded to remove transition states. For this dataset, classification accuracy was calculated using a 5 × 2 CV approach, following the original paper [35]. The number of training data was also limited for this dataset by randomly sampling 1% of the training set in each fold.

C. RESULTS
Table 4 summarizes the results of EMG classification. Values are the average and standard deviation of scores measured for each trial of each subject, and are presented as "average value ± standard deviation", or as "average value" if the standard deviation was 0. "**" in the accuracy column denotes a significant difference, based on the Holm method, between that algorithm and the proposed NN (p < 0.01); the absence of "**" denotes no statistically significant difference. Fig. 7 shows the confusion matrix of the classification results for dataset VI using the proposed NN.
Fig. 8 shows the accuracy for each number of electrodes as the electrodes were decreased for dataset IV. For comparison, the accuracies of ν-SVM are also plotted. Significant differences between the proposed NN and ν-SVM were confirmed when the number of electrodes was d = 2 and d = 3 (p < 0.05). Fig. 9 shows the relationship between accuracy and preparation time (CV time + training time), representing the time until the classifier becomes available, and between accuracy and prediction time for each classification method. Proximity to the upper-left corner indicates superior performance.

D. DISCUSSION
In terms of accuracy, the proposed NN, ν-SVM, and k-NN achieve the same level of performance, demonstrating their suitability for EMG signal classification. The proposed NN involves the Johnson distribution in its structure based on prior knowledge of EMG signals, enabling appropriate modeling of the EMG distribution in the network. ν-SVM showed strong generalization ability derived from margin maximization. k-NN can express arbitrarily complex decision boundaries through the choice of the parameter k, ensuring a fit to the skewness and kurtosis of EMG data. On the other hand, the accuracies of LLGMN, MLP, and LLR were notably low in some cases. This is because LLGMN and MLP require many parameters to fit data with skewness and kurtosis, which results in over-fitting, and LLR cannot solve nonlinear classification problems because it is a linear classifier. For dataset V, accuracies were relatively low for all of the algorithms. This is because this dataset was recorded from amputated subjects, so the EMG signals were unstable and the reproducibility of motions was low compared with the data recorded from intact subjects.
In Fig. 7, confusion between class 3 and class 6 is relatively frequent compared with other classes. This is reasonable, because the motions of these classes are similar (class 3: holding a small tool; class 6: holding a thin and flat object). However, there was no extreme bias toward a certain class; therefore, the proposed NN worked properly for multi-class classification. In Fig. 8, the accuracies of the proposed NN and ν-SVM both decreased as the number of electrodes decreased. This is because the elimination of electrodes caused the loss of information needed to classify the motions, making the classification problem more difficult. In particular, the accuracies dropped sharply when the number of channels was reduced from d = 3 to d = 2, although the proposed NN still exceeded ν-SVM. One possible explanation is that the substantial reduction of the input dimensions caused the class distributions to overlap, and thus the proposed NN could not model the data distribution precisely.
With respect to CV time, LLGMN and MLP took particularly long. In contrast, the proposed NN and LLR had CV times of 0, because they have a unique learning solution and therefore do not require hyperparameters such as a learning rate.
The training time of the proposed NN was relatively short for dataset I. For datasets II, III, IV, and V, however, significant training time was required. This is because the cost of calculating the Hessian matrix and finding its inverse (see (39) and (41)) increases with the number of classes and input dimensions. Although the overall time until the classifier becomes available is relatively short (because the CV time is zero), there is room for improvement by making the numerical calculations more efficient.
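This scaling behavior can be illustrated with a rough cost model. The sketch below is entirely illustrative: the parameter count p = C·d·H and the per-dimension term count H are assumptions for illustration, not the paper's exact formulas. The key point is that a Newton-type update must build a p×p Hessian and solve against it, which costs roughly O(p²) memory and O(p³) time, so the cost grows steeply with the number of classes C and input dimensions d:

```python
def newton_step_cost(C, d, H=4):
    """Rough cost model for one Newton-type update (illustrative).

    C: number of classes, d: input dimensions,
    H: nonlinearly transformed terms per dimension (assumed value).
    Returns (parameter count p, Hessian entries p^2, ~flops p^3).
    """
    p = C * d * H
    return p, p ** 2, p ** 3

# Cost grows steeply with classes and input dimensions, consistent
# with the longer training times observed on the larger datasets.
for C, d in [(4, 2), (6, 8), (17, 12)]:
    print((C, d), newton_step_cost(C, d))
```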
Regarding prediction time, ν-SVM and k-NN took particularly long. This is because ν-SVM is originally a binary classifier and therefore solves multi-class classification problems by evaluating every pairwise two-class classification. k-NN also has a long computation time because it calculates the distance between the input sample and every training sample. The prediction time of the proposed NN is relatively short because incorporating prior knowledge of the processed EMG characteristics yields a compact model for EMG classification.
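Both sources of prediction cost can be sketched in a few lines (a minimal illustration, not the experiments' implementations): one-vs-one decomposition trains and evaluates one binary classifier per unordered class pair, i.e. C(C−1)/2 of them, and brute-force k-NN measures the distance from the input to every training sample.

```python
from itertools import combinations
import math

def num_ovo_classifiers(C):
    """One-vs-one decomposition: one binary classifier per
    unordered pair of classes, C(C-1)/2 in total."""
    return len(list(combinations(range(C), 2)))

def knn_predict(x, train, labels, k=3):
    """Brute-force k-NN: computes the distance to every training
    sample, which is why prediction time grows with the training set."""
    dists = sorted((math.dist(x, t), lab) for t, lab in zip(train, labels))
    votes = [lab for _, lab in dists[:k]]
    return max(set(votes), key=votes.count)

# For a 6-motion EMG task, one-vs-one needs 15 binary classifiers.
print(num_ovo_classifiers(6))  # → 15
```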
Overall performance is summarized in Fig. 9. The plots for the proposed NN are concentrated toward the upper-left corner, demonstrating well-balanced performance in terms of accuracy and computation cost.
Finally, the performance of the proposed NN can be summarized as follows:
• High accuracy for classification of EMG signals
• Relatively short time until the classifier becomes available
• Shorter prediction time than ν-SVM and k-NN

VII. CONCLUSION
In this paper, we proposed a NN based on the Johnson $S_\mathrm{U}$ translation system. The NN includes a discriminative model based on the multivariate Johnson $S_\mathrm{U}$ translation system, with the model transformed into linear combinations of weight coefficients and nonlinearly transformed input vectors. This enables the representation of more flexible distributions for data with skewness and kurtosis. Parameters describing the shape of the distribution can be determined as network coefficients via network learning. The proposed NN can be trained without hyperparameter optimization, and the training converges to a unique solution. In addition, the posterior probability of the input vectors for each class can be calculated as the output of the NN.
In a simulation experiment, the proposed network was shown to be more suitable than a conventional GMM-based network and linear logistic regression for data with skewness and kurtosis. The applicability of the proposed NN to biosignal classification was also demonstrated by the results of an EMG classification experiment.
In future research, we plan to construct an expanded model of the proposed NN. As the function $g_i(y)$, which determines the shape of the distribution, was only examined in relation to $S_\mathrm{U}$ in this study, future work will investigate other functions. Despite the assumption of an $S_\mathrm{U}$ distribution, EMG data occasionally follow a different type of distribution, such as $S_\mathrm{B}$; in such situations, the $S_\mathrm{U}$ distribution serves as an approximation. Although the $S_\mathrm{U}$ distribution worked well even in such situations in terms of classification accuracy, a more detailed comparison with other types of functions and the development of selection criteria for the distribution type are needed. Using a different type of function for each dimension will also enable the classification of multivariate biosignals, such as the combination of EMG and EEG. Furthermore, the learning algorithm will be improved in future work, and the training time will be shortened by devising more efficient numerical calculations for the Hessian matrix. Complete discriminative learning for ${}^{(1)}W^{(c)}$ and ${}^{(2)}W^{(c)}$ will also be developed using backpropagation-based learning.

FIGURE 1. Examples of the time-series signal and histogram of (a) raw EMG x, (b) rectified EMG y, and (c) smoothed EMG z. (a) Raw EMG x obeys a Gaussian distribution with zero mean (as also discussed in Hogan and Mann [21]). The histograms of (b) rectified EMG y and (c) smoothed EMG z, however, become asymmetric and include skewness and kurtosis.
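The effect shown in Fig. 1 can be reproduced with a short simulation (a hedged sketch: the sample size, random seed, and smoothing window length are arbitrary choices for illustration, not the paper's settings). Rectifying zero-mean Gaussian noise produces a clearly skewed distribution:

```python
import random
import statistics as st

random.seed(0)

def skewness(data):
    """Sample skewness: the third standardized moment."""
    mu = st.mean(data)
    sd = st.pstdev(data)
    return sum((v - mu) ** 3 for v in data) / (len(data) * sd ** 3)

# Raw EMG modeled as zero-mean Gaussian noise (cf. Hogan and Mann [21]).
x = [random.gauss(0.0, 1.0) for _ in range(20000)]
# Full-wave rectification and moving-average smoothing, as in Fig. 1.
y = [abs(v) for v in x]
win = 50
z = [sum(y[i:i + win]) / win for i in range(len(y) - win)]

print(round(skewness(x), 2))  # raw: near 0 (symmetric)
print(round(skewness(y), 2))  # rectified: clearly positive (skewed)
# The smoothed signal z is likewise nonnegative and asymmetric,
# so a symmetric Gaussian model no longer describes it well.
```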

$i = 1, \ldots, d$ and $c = 1, \ldots, C$ are indices corresponding to the dimension and class, respectively. The variable $x^{(c)}_{\zeta,i}$ ($\zeta = -3z, -z, z, 3z$) is the $P_\zeta$th percentile of the $i$th dimension of the training data for class $c$, where $P_\zeta$ is the percentage of the area in the normal distribution corresponding to $\zeta$. Using percentiles, m
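The percentile extraction described above can be sketched for a single feature dimension as follows (a minimal illustration under stated assumptions: the value z = 0.5 and the helper names `norm_cdf`, `percentile`, and `johnson_percentiles` are not from the paper; $P_\zeta = 100\,\Phi(\zeta)$ with $\Phi$ the standard normal CDF):

```python
import math

def norm_cdf(t):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def percentile(data, p):
    """p-th percentile (p in [0, 100]) with linear interpolation
    between order statistics."""
    s = sorted(data)
    k = (len(s) - 1) * p / 100.0
    f, c = math.floor(k), math.ceil(k)
    if f == c:
        return s[int(k)]
    return s[f] + (s[c] - s[f]) * (k - f)

def johnson_percentiles(data, z=0.5):
    """Return the four percentiles x_{-3z}, x_{-z}, x_{z}, x_{3z}
    of a 1-D sample, taking P_zeta = 100 * Phi(zeta)."""
    return [percentile(data, 100.0 * norm_cdf(zeta))
            for zeta in (-3 * z, -z, z, 3 * z)]
```

In practice these four values would be computed per dimension and per class from the training data.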

TABLE 2. Parameters for data generation in the simulation experiment.

FIGURE 4. Scattergram of a dataset used in the simulation experiment. Each class has different skewness and kurtosis, which must be considered to accurately calculate the posterior probabilities.

FIGURE 6. Locations of electrodes for dataset I.

TABLE 1. Characteristics of classification algorithms (columns: Hyperparameter-free, Fast training, Unique solution, Fast prediction, Nonlinearity, Posterior probability; rows: the algorithms, beginning with SVM).

TABLE 3. Summary of the datasets for the EMG classification experiment: number of motions, number of electrodes, number of trials for each subject, and number of samples for each trial. The number of motions and the number of electrodes correspond to the number of classes and the number of input dimensions, respectively. The training samples were randomly chosen from the available samples for each trial, with the remaining samples used for testing. Because it is difficult to procure many training samples in real-world applications, only 1% of the available samples were selected for training to evaluate the validity of the proposed NN for learning with limited training data.

TABLE 4. Results of the EMG classification. **: significant difference from the proposed NN (p < 0.01).

FIGURE 7. Confusion matrix of the classification results for dataset VI. Values are normalized by the number of test samples for each class.