String Grammar Unsupervised Possibilistic Fuzzy C-Medians for Gait Pattern Classification in Patients with Neurodegenerative Diseases

Neurodegenerative diseases that cause serious gait abnormalities include Parkinson's disease (PD), amyotrophic lateral sclerosis (ALS), and Huntington disease (HD). These diseases lead to gait rhythm distortion that can be determined from the stride time intervals of footfall contact times. In this paper, we present a new method for gait classification of neurodegenerative diseases. In particular, we utilize the symbolic aggregate approximation algorithm to convert the left-foot stride-to-stride interval into a sequence of symbols. We then find string prototypes of each class using the newly proposed string grammar unsupervised possibilistic fuzzy C-medians. In the testing process, the fuzzy k-nearest neighbor is used as the classifier. We implement the system on three 2-class problems, i.e., the classification of ALS against healthy subjects, that of HD against healthy subjects, and that of PD against healthy subjects. The system is also implemented on one 4-class problem (the classification of ALS, HD, PD, and healthy subjects altogether) called NDDs versus healthy. The system yields good detection results: the average correct classification rate for ALS versus healthy is 96.88%, that for HD versus healthy is 97.22%, and that for PD versus healthy is 96.43%. On the 4-class problem, the average accuracy is approximately 98.44%. The system can also provide prototypes of gait signals that are more understandable to humans.


Introduction
Neurodegenerative diseases (NDDs) are diseases of neuronal destruction in the central nervous system. NDDs cause the volume of the brain and the number of nerve cells to deteriorate over time. The diseases reduce the abilities of the patient and destroy tissue and nerves of the brain, because neurons in the brain normally cannot reproduce themselves. Some neurodegenerative disorders such as Parkinson's disease (PD), Huntington disease (HD), and amyotrophic lateral sclerosis (ALS) usually occur at an older age and can lead to serious gait abnormalities [1]. Since balancing and sequencing of movement are controlled by the central nervous system, the gait of a patient with a neurodegenerative disorder becomes abnormal. The main symptoms of PD are trembling legs, slowed movement, and impaired posture and balance, which may grow worse over time [2]. The main symptoms of HD are mood changes, muscle coordination problems, uncontrolled movement, and difficulty in walking. Patients with HD may lose their intellectual and behavioural abilities and may also experience psychiatric symptoms [3]. In ALS patients, part of the nerve cells that control muscle function are destroyed. The disease is characterized by continuous muscle atrophy, which causes muscle weakness and tenderness. The general symptoms of ALS are difficulty in walking, swallowing, breathing, and speaking [4]. The authors of [5] found that patients with neurodegenerative diseases had decreased stride length compared with healthy control subjects. For these reasons, stride-to-stride gait information is utilized for gait pattern classification in patients with neurodegenerative diseases, exploiting the gait pattern difference between healthy and NDD subjects.
In recent related studies, the information from time series of stride intervals, swing intervals, and stance intervals of stride-to-stride measurements has been utilized to classify the gait patterns of patients with NDDs and healthy control subjects. Some research works involved detecting either PD or ALS only [9,13,14]. Some of them involved HD, ALS, and PD classification [8, 10-12]; however, information from both the left and right feet was used in those systems. A few of them utilized only right-foot information to classify HD, ALS, and PD [7]; however, this method only distinguished patients with one specific disease from healthy subjects and did not address detecting a patient with any one of the diseases against healthy subjects. All previous studies utilized a regular numeric classifier, e.g., the support vector machine and the K* classifier. Hence, these methods cannot provide a prototype signal for each disease.
In this paper, we propose a syntactic method for gait pattern classification from time series information. In particular, we introduce a string grammar unsupervised possibilistic fuzzy C-medians (sgUPFCMed) to recognize PD, ALS, and HD from the left-foot stride interval. It is worth noting that the sgUPFCMed is a brand new algorithm proposed by our research group. It is a part of the recent doctoral thesis of one of our group members [6] and has never been published elsewhere. In the thesis, it was implemented on some standard data sets that are syntactic by nature, e.g., the Copenhagen chromosomes data set [15][16][17], the MNIST handwritten digit data set as described in [18][19][20][21], and the USPS handwritten digit data set [18][19][20][21]; the latter two were collected by Professor Simon M. Lucas and downloaded from http://algoval.essex.ac.uk/data/sequence/. An example from each data set is shown in Figure 1. The histogram of each image in the Copenhagen chromosomes data set was encoded into a string. It should be noted that we downloaded the encoded versions of these three data sets, not the images. The experimental results on both 10-fold cross validation and the blind test data sets from all three data sets are shown in Table 1. They show that the algorithm is capable of classifying syntactic data sets and provides good classification results.
Since our algorithm is not a numeric classifier but a syntactic classifier, we transform the gait time series into a string using the symbolic aggregate approximation (SAX) [22]. The sgUPFCMed is utilized to find a string prototype(s) for each disease. Then the fuzzy k-nearest neighbor [23] is utilized to find the best match for a test data sample. The paper is structured as follows. The description of the NDDs detection system is introduced in Section 2. The results of gait classification are shown in Section 3. Finally, we draw the conclusion in Section 4.

System Description
In this section, we introduce the details of our system for gait pattern classification of patients with neurodegenerative diseases (NDDs). We take the gait data set from the gait dynamics in neurodegenerative disease database (http://www.physionet.org/physiobank/database/gaitndd/). This data set consists of 64 subjects: 15 subjects with PD, 20 subjects with HD, 13 subjects with ALS, and 16 healthy control subjects [24]. Subjects were requested to walk along a 77-meter-long hallway for 5 minutes without stopping. The signals from force-sensitive switches underneath each subject's feet were recorded at a 300 Hz sampling rate. The raw data are obtained using force-sensitive resistors, with the output roughly proportional to the force under the foot. Stride-to-stride measures of footfall contact times, i.e., the time series of the stride time, stance time, and swing time, are derived from these signals as shown in Figure 2. To eliminate the startup effects, we follow the same method as in [25]: the first 20 values of each sample are removed. The 3-SD median filter is then utilized to eliminate the outliers that are far away from the median value [25]. In the experiment, we only use the left-foot stride-to-stride interval data. The proposed scheme of the detection system is shown in Figure 3. We transform each time series into a sequence of symbols (a string) using the symbolic aggregate approximation (SAX) representation [22].
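The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `preprocess_stride_series` and the sample values are hypothetical, and we assume that the 3-SD median filter drops (rather than replaces) any value lying more than three sample standard deviations from the median.

```python
import statistics

def preprocess_stride_series(intervals, n_startup=20, n_sd=3.0):
    """Drop start-up strides, then remove outliers far from the median."""
    # Discard the first n_startup values to suppress start-up effects.
    trimmed = list(intervals[n_startup:])
    if len(trimmed) < 2:
        return trimmed
    med = statistics.median(trimmed)
    sd = statistics.stdev(trimmed)  # sample standard deviation
    # Assumption: a value is an outlier if it lies more than n_sd standard
    # deviations away from the median; outliers are dropped, not replaced.
    return [x for x in trimmed if abs(x - med) <= n_sd * sd]

# Hypothetical stride intervals in seconds (not taken from the data set).
series = [1.30] * 20 + [1.10, 1.12, 1.08, 1.11, 1.09, 1.10, 1.13,
                        1.07, 1.10, 1.11, 1.09, 1.12, 2.90]
cleaned = preprocess_stride_series(series)
```

In this toy series the 20 start-up values are discarded first, and the single long stride of 2.90 s is then rejected by the median-based threshold.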
The gait time series $\vec{x}$ of length $n$ is converted into its symbolic representation as follows. The time series ($\vec{x}$) is normalized to have zero mean and unit standard deviation. It is then divided into frames of size $w$, and each frame is converted to its piecewise aggregate approximation (PAA) value $\bar{x}_i$. Each $\bar{x}_i$ (for $i = 1, \ldots, \lfloor n/w \rfloor$) is then mapped into a symbol. In our experiment, w is set to be equal to the length of the time series. There are 8 symbols used in the experiment. An example of the string generation is shown in Figure 4. In this figure, the gait time series is transformed to "fbfdbcaddfgh. . . . . .dffhdd".
We are now ready to create prototypes with the string grammar unsupervised possibilistic fuzzy C-medians (sgUPFCMed). The sgUPFCMed is a modified version of the unsupervised possibilistic fuzzy C-means (UPFCM) [26], a combination of the possibilistic fuzzy C-means (PFCM) [27] and the unsupervised possibilistic clustering (UPCM) [28]. The UPFCM was designed to solve the problem of the UPCM generating coincident clusters. It is developed based on the characteristics of both fuzzy and possibilistic C-means. Hence, the UPFCM should be able to deal more effectively with noise, overlapping, and outliers. Since the sgUPFCMed is modified from the UPFCM, it should have the same properties as the UPFCM. A brief description of the algorithm is as follows. Let $S = \{s_1, s_2, \ldots, s_N\}$ be a set of $N$ strings. Each string ($s_k$) is a sequence of symbols (primitives). For example, $s_k = (x_1 x_2 \ldots x_l)$ is a string with length $l$, where each $x_j$ is a member of a set of defined symbols or primitives. Suppose $V = (sc_1, sc_2, \ldots, sc_C)$ represents a $C$-tuple of string prototypes, each of which characterizes one of the $C$ clusters. $\mathrm{Lev}(s_k, sc_i)$ is the Levenshtein distance [29][30][31][32] between string $s_k$ and string prototype $sc_i$. $U$ is a membership matrix $[u_{ik}]_{C \times N}$ and $T$ is a possibilistic matrix $[t_{ik}]_{C \times N}$.
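The SAX conversion described above (normalize, average over frames, map to one of 8 symbols) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `sax_string` is hypothetical, the population standard deviation is used for normalization, and the breakpoints are taken as the equiprobable quantiles of the standard normal distribution, as is conventional for SAX.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev

def sax_string(series, frame_size=1, alphabet="abcdefgh"):
    """Convert a numeric time series into a SAX string."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("constant series cannot be z-normalized")
    # Normalize to zero mean and unit standard deviation.
    z = [(x - mu) / sigma for x in series]
    # PAA: average each non-overlapping frame (floor(n / w) frames).
    n_frames = len(z) // frame_size
    paa = [mean(z[i * frame_size:(i + 1) * frame_size])
           for i in range(n_frames)]
    # Breakpoints cut N(0, 1) into len(alphabet) equiprobable regions.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    # Each PAA value is mapped to the symbol of the region it falls in.
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)

# Hypothetical input: a linear ramp visits every region once.
word = sax_string([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
```

With a frame size of 1 every normalized value yields one symbol; larger frame sizes shorten the string by averaging neighbouring strides first.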
In the objective function of the sgUPFCMed, $u_{ik}$ is the membership value of string $s_k$ in cluster $i$, $t_{ik}$ is the possibilistic value of string $s_k$ in cluster $i$, $m$ is the fuzzifier (normally $m > 1$), $\eta > 1$, $a > 0$, $b > 0$, $\beta > 0$, $\sum_{i=1}^{C} u_{ik} = 1$ for $k = 1, \ldots, N$, and $0 \le u_{ik}, t_{ik} \le 1$. In the numeric case, $\beta$ is defined as the sample covariance [23] based on the Euclidean distance. Since our data set is a string data set, the calculation of $\beta$ uses the Levenshtein distance to Med, where Med is the median string of the data set. The theorem for the sgUPFCMed and its corresponding proof are shown in Theorem 1. This theorem gives the update equations for $u_{ik}$ and $t_{ik}$; in particular, $t_{ik}$ is an exponential function of the negative scaled Levenshtein distance $\mathrm{Lev}(s_k, sc_i)$.
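To make the string-based quantities concrete, the following sketch computes the Levenshtein distance, a median string, and a value of $\beta$. It is a sketch under assumptions rather than the paper's exact formulas: the median is restricted to strings in the data set (a set median), and `beta_from_median` assumes that the string analogue of the sample covariance is the mean Levenshtein distance to Med. All function names are hypothetical.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]

def set_median(strings):
    """String from the set with the smallest total distance to the set."""
    return min(strings,
               key=lambda c: sum(levenshtein(c, s) for s in strings))

def beta_from_median(strings):
    """Assumed analogue of the sample covariance: mean distance to Med."""
    med = set_median(strings)
    return sum(levenshtein(s, med) for s in strings) / len(strings)

# Hypothetical SAX-like strings.
data = ["abc", "abd", "abcd", "xbc"]
med = set_median(data)
beta = beta_from_median(data)
```

Here "abc" is within one edit of every other string, so it is selected as the median, and the resulting $\beta$ is the average of the distances 0, 1, 1, and 1.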
To update a cluster center, we utilized the fuzzy median string [23, 33-36]. However, it has been proved in [35,36] that the modified median string provides a better classification than the regular median string. Hence, in [23, 33-36], the modified fuzzy median string is used. Let $\Sigma^*$ be the free monoid over the alphabet set $\Sigma$ and $S \subseteq \Sigma^*$ be a set of strings. Then the cluster center $sc_i$ is the modified fuzzy median, i.e., an approximation of the fuzzy median using edit operations (insertion, deletion, and substitution) over each symbol of the string, taking the arg min over the resulting candidate strings. The cluster center update equation of the sgUPFCMed is shown in Algorithm 1. The sgUPFCMed algorithm is summarized in Algorithm 2.
Algorithm 2
  Store unlabeled finite strings $S = \{s_k; k = 1, \ldots, N\}$
  Initialize string prototypes for all $C$ classes
  Set $m$, $\eta$, $a$, $b$
  Compute $\beta$ using fuzzy median equation (3)
  Do {
    Compute Levenshtein distance between input string and cluster prototype ($\mathrm{Lev}(s_k, sc_i)$)
    Update membership value using equation (5)
    Update possibilistic value using equation (6)
    Update center string of each cluster ($sc_i$) using equation (10) and (11)
  } Until (stabilize)
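The idea of approximating a median string by edit operations over each symbol can be sketched as below. This is a simplified greedy pass and not the authors' Algorithm 1: the function names are hypothetical, and the per-string `weights` are an assumed stand-in for whatever combination of membership and possibilistic values the cluster center update actually uses.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

def weighted_cost(candidate, strings, weights):
    """Weighted sum of distances from a candidate center to all strings."""
    return sum(w * levenshtein(candidate, s)
               for s, w in zip(strings, weights))

def modified_median(start, strings, weights, alphabet):
    """Greedy refinement of `start` by single-symbol edits per position."""
    best, best_cost = start, weighted_cost(start, strings, weights)
    for i in range(len(start) + 1):
        if i > len(best):
            break
        # Candidate edits at position i: insertion, deletion, substitution.
        candidates = [best[:i] + c + best[i:] for c in alphabet]
        if i < len(best):
            candidates.append(best[:i] + best[i + 1:])
            candidates += [best[:i] + c + best[i + 1:] for c in alphabet]
        cand = min(candidates,
                   key=lambda c: weighted_cost(c, strings, weights))
        cost = weighted_cost(cand, strings, weights)
        if cost < best_cost:  # keep an edit only if it lowers the cost
            best, best_cost = cand, cost
    return best

# Hypothetical cluster members with equal weights.
members = ["abcd", "abcd", "abed"]
center = modified_median("abed", members, [1.0, 1.0, 1.0], "abcde")
```

Starting from "abed", the substitution of "e" by "c" at the third position lowers the weighted cost from 2 to 1, so the refined center becomes "abcd". Unlike a set median, the result of such a pass need not be a member of the data set.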
Afterwards, the multiprototype set, i.e., $SC = \{sc^1_1, \ldots, sc^1_{M_1}, sc^2_1, \ldots, sc^2_{M_2}, \ldots, sc^C_1, \ldots, sc^C_{M_C}\}$, where $sc^i_j$ is the $j$th string prototype of class $i$, is created. The fuzzy k-nearest neighbor (FKNN) [23,37] is used as a classifier. The membership value $u_i(s)$ of a test string $s$ in class $i$ is computed from its $K$ nearest prototypes, where $u_{ij}$ is the membership value of the $j$th nearest prototype in class $i$, $C$ is the number of classes, and $K$ is the number of nearest neighbors. The decision rule for the test string $s$ is as follows: $s$ is assigned to class $i$ if $u_i(s) > u_j(s)$ for $j \neq i$. (13) Because the class of each prototype is known, we set the membership value of each prototype to 1 in its own class and to zero in all other classes.
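A fuzzy k-nearest neighbor classifier over string prototypes can be sketched as follows. This is an assumption-laden illustration: it uses the inverse-distance weighting with exponent $2/(m-1)$ from the commonly used numeric FKNN formulation, applied here to Levenshtein distances, which may differ from the exact membership equation of the paper. The function names and the toy prototypes are hypothetical.

```python
def levenshtein(s, t):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]

def fknn_memberships(test, prototypes, labels, n_classes, k=3, m=2.0):
    """Class memberships of `test` from its k nearest string prototypes."""
    nearest = sorted((levenshtein(test, p), y)
                     for p, y in zip(prototypes, labels))[:k]
    eps = 1e-9  # avoids division by zero for an exact prototype match
    weights = [1.0 / (d + eps) ** (2.0 / (m - 1.0)) for d, _ in nearest]
    total = sum(weights)
    memberships = [0.0] * n_classes
    for w, (_, y) in zip(weights, nearest):
        # Each prototype has membership 1 in its own class, 0 elsewhere.
        memberships[y] += w / total
    return memberships

def classify(test, prototypes, labels, n_classes, k=3, m=2.0):
    """Assign the class with the largest membership value."""
    u = fknn_memberships(test, prototypes, labels, n_classes, k, m)
    return max(range(n_classes), key=u.__getitem__)

# Hypothetical prototypes: class 0 and class 1, two prototypes each.
protos = ["aabb", "aabc", "hhgg", "hhgf"]
proto_labels = [0, 0, 1, 1]
u = fknn_memberships("aabd", protos, proto_labels, n_classes=2)
predicted = classify("aabd", protos, proto_labels, n_classes=2)
```

The test string is one edit away from both class 0 prototypes and four edits away from the class 1 prototypes, so nearly all of the membership mass falls on class 0.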

Experiment Results
We implement three 2-class problems, i.e., the classification of ALS against healthy subjects, HD against healthy subjects, and PD against healthy subjects. We also implement one 4-class classification, i.e., the classification of all three NDDs (ALS, HD, and PD) against healthy subjects. In all of the experiments, we use 4-fold cross validation to evaluate our proposed algorithm. The parameters $m$ and $\eta$ are set to 2, and the parameters $a$ and $b$ are set to 1 and 6, respectively. These parameters were chosen by trial and error over extensive experiments. The stopping criterion of the sgUPFCMed is set to 0.01 with a maximum of 100 iterations. To create multiple prototypes for each class, the sgUPFCMed is used to cluster each class into 2, 3, 4, and 5 clusters. In the testing process, the FKNN is utilized with $K$ = 1, 3, and 5. Tables 2-5 and Table 6 report the classification results.
Computational Intelligence and Neuroscience
Figures 5-8 show the time series that are closest to the prototypes of the best model of the ALS, HD, PD, and NDDs classification experiments, respectively. We can see that the shape of each prototype is not exactly similar to the others. Although there is some overlap between the prototypes of the disease gait signals and those of the healthy gait signals, the detection system can provide a good classification rate. For example, in Figure 6, the prototypes of HD gait signals overlap with the healthy control prototypes. However, the shapes are different, so the string sequences are different as well. Hence, the classification result is still good.
Table 7: Comparison of the proposed method with the existing methods.
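The evaluation protocol (4-fold cross validation over a grid of cluster counts and $K$ values) can be organized as in the sketch below. The fold construction is an assumption: the paper does not state how the folds were drawn, so a simple stratified split is shown, and the function name `stratified_folds` is hypothetical. The class sizes correspond to the ALS versus healthy problem.

```python
import random
from itertools import product

def stratified_folds(labels, n_folds=4, seed=0):
    """Split sample indices into n_folds folds, balancing classes."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    offset = 0
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # Deal the shuffled indices of each class round-robin to the folds.
        for pos, idx in enumerate(idxs):
            folds[(pos + offset) % n_folds].append(idx)
        offset += len(idxs)
    return folds

# 13 ALS subjects and 16 healthy control subjects, as in the data set.
labels = ["ALS"] * 13 + ["healthy"] * 16
folds = stratified_folds(labels)

# Settings explored: clusters per class and FKNN neighbors.
grid = list(product([2, 3, 4, 5], [1, 3, 5]))

splits = []
for test_idx in folds:
    # Prototypes would be generated from train_idx and scored on test_idx.
    train_idx = [i for f in folds if f is not test_idx for i in f]
    splits.append((train_idx, test_idx))
```

Each of the 12 settings in `grid` would be evaluated on the same four splits, and the mean and standard deviation of the fold accuracies reported.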

Conclusions
In this paper, a detection system for NDDs, i.e., Parkinson's disease (PD), amyotrophic lateral sclerosis (ALS), and Huntington disease (HD), is introduced. In particular, the left-foot gait time series (left-foot stride-to-stride interval) is transformed into a sequence of symbols, i.e., a string. The string grammar unsupervised possibilistic fuzzy C-medians (sgUPFCMed), first introduced in this paper, is utilized to generate prototypes of each disease. The fuzzy k-nearest neighbor is then used as a classifier in the testing process. We found that the best validation results of the 2-class problems, i.e., ALS versus healthy, HD versus healthy, and PD versus healthy, are 96.88±6.25%, 97.22±5.56%, and 96.43±7.14%, respectively. For the 4-class problem (three NDDs versus healthy), the best classification rate is 98.44±3.13%. From the indirect comparison, we found that our algorithm performs better than the existing algorithms on average. In addition, our system can provide prototype signals that are more understandable to humans than the outputs of the previous methods, which are based on numeric algorithms.

Data Availability
The data set is downloaded from http://www.physionet.org/physiobank/database/gaitndd/. It is a public data set provided by physionet.org.