Classification of micro-Doppler signatures of human motions using log-Gabor filters

: In recent years, Doppler radar has been used as a sensing modality for human gait recognition, due to its ability to operate in adverse weather and penetrate opaque obstacles. Doppler radar captures not only the speed of the target, but also the micro-motions of its moving parts. These micro-motions induce frequency modulations that can be used to characterise the target movements. However, a major challenge in Doppler signal processing is to extract discriminative features from the radar returns for target classification. This study presents a feature extraction method for classification of human motions from the micro-Doppler radar signal. The proposed method applies the log-Gabor filters at multiple spatial frequencies and orientations on a joint time – frequency representation. To achieve invariance to the target speed, features are extracted from local patches along the torso Doppler shift. Then, the (2D) 2 PCA (two-directional two-dimensional principal component analysis) method is applied to create a compact feature vector. Experimental results based on real radar data obtained from multiple human subjects demonstrate the effectiveness of the proposed approach in classifying arm motions.


Introduction
With the advances in radar technology, there is an increasing interest in using Doppler radar for human gait recognition and activity monitoring. A modern Doppler radar detects not only the velocity of a target but also the local dynamics of its moving parts. The micro-movements induce frequency modulations around the main Doppler shift are commonly known as micro-Doppler (μ-D) effects. Several studies have been conducted to analyse μ-D signatures of moving targets [1][2][3][4][5]. An early study on μ-D effects investigated the jet engine modulation of radar returns for target identification [1]. Later, Chen et al. developed mathematical models and performed simulations to analyse μ-D effects of targets under translation, rotation, and vibration [2]. Other researchers conducted numerical simulations and real experiments, which demonstrated that the μ-D signature represents the kinetic motions of an object and provides a viable means for object identification [3][4][5]. μ-D signals have been used for classifying rigid targets, such as helicopters and aircrafts [6], and wheel and track vehicles [7]. They have also been used to differentiate rigid and non-rigid targets, for example, humans and vehicles [8][9][10]. In recent years, the research on μ-D signals has been focused on analysing human movements [11][12][13][14][15][16][17][18][19][20][21]. In these applications, one common challenge is to extract discriminative features from the radar returns for classification.
In this paper, we propose a μ-D feature extraction method for classifying human movements. First, the μ-D radar signature is obtained using time-frequency (T-F ) analysis. Then, two-dimensional (2D) filters are applied on the T-F representation. Since the μ-D modulations induced by the arm and leg motions appear close to the main Doppler frequency, a local T-F patch centred on the torso frequency shift is located for feature extraction. This step makes the proposed method stable against variations in the target speed. Then, log-Gabor filters are used to extract features from the T-F patch. This type of filters has neither DC component nor bandwidth limitation. Therefore, a small set of filters is adequate to cover the desired frequency spectrum. The rest of the paper is organised as follows. Section 2 presents a brief description of the μ-D signal and the related work on human μ-D radar signature classification. Section 3 describes the proposed feature extraction method. Section 4 presents the experimental results, and finally, Section 5 gives the concluding remarks.

Related work
In this section, a brief description of μ-D radar signal is given in Section 2.1, followed by a review of existing μ-D radar signature classification approaches in Section 2.2. Then, three T-F analysis techniques used to depict μ-D radar signature are presented in Section 2.3.

μ-D radar signal
When a radar signal is backscattered from a target moving at a constant radial velocity, the carrier frequency of the radar signal is shifted according to the target velocity. If the target is a point scatterer, the received signal can be expressed as where A(t) is the time-varying amplitude and f d (t) is the instantaneous Doppler phase [11]. Let l be the transmitted signal wavelength and v(t) be the target velocity. The instantaneous Doppler phase is given by A complex target such as a human can be represented as a set of point scatterers. For example, Bilik and Tabrikian [11] modelled a human body as a set of K segments; each segment moves at its own velocity and has M points. The signal received by the antenna can be written as where A k,m is the amplitude for the mth point-target on the kth segment, β k (t) is the instantaneous angle from the zenith of the segment,b k (t) is the instantaneous angular velocity, and l k,m is the distance of the mth point along the segment. In practice, the human locomotion is much more complex than the model given in (3). Therefore, advanced modelling tools, for example, electromagnetic wave scattering model, have been employed to simulate the μ-D signals exhibited by different parts of a moving person [22].

Existing μ-D classification approaches for human motions
The various μ-D classification approaches require first the extraction of salient features from the radar signal. The features are calculated from the time, frequency, or joint T-F domain. In the time domain, Fairchild and Narayanan [12] employed the empirical mode decomposition (EMD) algorithm to decompose the radar signal into a set of intrinsic mode functions (IMFs) and computed the IMF energies as features. In the frequency domain, Bilik and Khomchuk [13] applied speech and audio processing techniques to extract three types of features: real cepstrum, linear predictive coding coefficients, and Mel-frequency cepstrum coefficients (MFCC). Molchanov et al. [14] computed discrete cosine transform features from μ-D signals for target classification. In the joint T-F domain, Kim and Ling [15] defined six features from the spectrogram of a moving person: (i) the torso Doppler frequency, (ii) the total bandwidth of the Doppler signal, (iii) the offset of the total Doppler, (iv) the bandwidth without μ-D, (v) the normalised standard deviation of the Doppler signal strength, and (vi) the period of the limb motion. Orovic et al. [16] applied the Hermite S-method to convert the radar signal into a T-F representation and developed an envelope detection method to capture the evolution of the arm swing. Mobasseri and Amin [17] extracted features from the spectrogram by applying principal component analysis (PCA). Li et al. [18] employed the matrix-based PCA technique on spectrogram to classify different arm motions of a walking person. Bjorklund et al. [19] computed the cadence velocity diagram (CVD) and extracted the cadence frequencies and velocity profiles as μ-D features. Tivive et al. [20] proposed a machine learning method to extract μ-D features from the spectrogram. Groot et al. [21] proposed to use particle filters to differentiate walking from running motions and estimate the person's height from the spectrogram. Given the importance of converting the μ-D signal into a joint T-F representation for classification, the next section describes three T-F analysis methods.

T-F representations
We review three main T-F analysis representations for depicting μ-D radar signature: the short-time Fourier transform (STFT), the pseudo Wigner-Ville distribution (PWD), and the S-method. Consider a time-domain signal x(t). The STFT of the signal is given by where w(t) is a time window. The spectrogram (SP) is the squared magnitude of the STFT The STFT has a simple implementation, but it generally provides a low resolution. In comparison, the PWD produces a high-resolution T-F representation, and is related to the STFT as However, the PWD produces cross-terms which may cause difficulties in interpreting the T-F distribution. By contrast, the S-method achieves similar auto-term concentration, but it does not suffer from the cross-terms. The T-F distribution of the S-method (SM) is given by where Q(θ) is a finite frequency domain window. The S-method behaves as the STFT when Q(θ) = πδ(θ), and as the PWD when Q(θ) = 1. The discrete form of the S-method can be written as follows where n is the discrete time index, k is the discrete frequency index, and N is the number of frequency samples. For a rectangular window, that is, Q(i) = 1 for |i| ≤ J and zero otherwise, the S-method with J terms can be written as The parameter J is usually set to a small value (J ∈ [3,10]) since most of the auto-term energy is located around the maximum value of the auto-term [23]. As an illustration, the above three T-F analysis techniques are applied to a μ-D signal; the respective T-F representations are shown in Fig. 1. The Doppler frequency shift induced by the torso is around 200 Hz. The main peak represents the leg swing. Although the PWD generates a high-resolution T-F representation, shown in Fig. 1b, it is hard to discern the μ-D modulations, not to mention the cross-terms. Therefore, in this paper only the STFT and S-method will be investigated to generate the joint T-F representation.

μ-D feature extraction
In this section, we describe the proposed approach for signal classification using μ-D radar signatures. Fig. 2 shows the main steps of the feature extraction, followed by the classification step. After the T-F analysis, the μ-D radar signature is converted into a low-dimensional feature vector by using log-Gabor filters and a dimensionality reduction technique.

T-F patch and contrast enhancement
When a person is walking, the μ-D modulations induced by the arm and leg motions occur mainly around the human torso frequency. Therefore, we extract local patches centred on the torso frequency instead of the entire T-F representation. This strategy improves invariance to translational speed of the target. The torso frequency can be easily determined by locating the main peak in the frequency profile. From the joint T-F representation X(n, k), the frequency profile can be expressed as where M is the number of time samples. The frequency profile is passed to a median filter and then normalised by dividing by the maximum value to obtain F p (k). The location of the torso frequency is computed as The patch height is determined relative to the height of the main peak, whereas the patch width is determined based on the task at hand; here, it is chosen so as to maximise the classification rate (CR) on individual patches. Assuming the person is inducing positive μ-D modulations, the height of the main peak is estimated by finding the smallest frequency index k o that satisfies the following condition where N is the number of frequency samples and η is a fixed threshold (0 < η < 1). The threshold η is chosen based on the T-F representation noise level. When the T-F representation is noisy, the threshold is set to a low value and vice versa. In this paper, it is set to 0.99. The vertical span of the patch is given by the Since different individuals swing their legs at different speeds, the patch height is fixed to a predefined value N y through down-sampling or up-sampling operation on the columns of the patch. The patch width is determined by the length of the input signal and the length and the step of the window in the T-F representation. Furthermore, the patch is aligned with respect to the main peak. Let T i denote the time of the ith main peak, ΔT denote the time duration between two consecutive main peaks, and N x be the length of the input signal in samples. The horizontal span of the patch is given by the time interval Next, contrast enhancement is performed on the patch. We use the Naka-Rushton equation [24] because it not only enhances the weak μ-D amplitudes but also suppresses the small amplitudes, which usually represent noise. Furthermore, it constrains the input to the range [0, 1), facilitating the subsequent feature extraction. Let W(i, j) denote the patch containing the absolute value of the T-F representation, where i ∈ [1, N y ] and j ∈ [1, N x ]. The contrast-enhanced patch is computed as where μ is the mean value of the patch. In this paper, the parameter r is set to 1 (r = 1). Fig. 3 presents the patches extracted from the spectrogram that are contrast-enhanced by the logarithmic scale and Naka-Rushton equation. The difference between Figs. 3b and c shows that the Naka-Rushton equation is better than the logarithmic scale at enhancing weak μ-D modulations and suppressing clutter.

Log-Gabor filtering
Log-Gabor filters, which were proposed by Field [25], are used to analyse the pre-processed T-F patch. A log-Gabor filter has similar shape to a Gabor filter on the logarithmic frequency scale with an extended tail in the high-frequency region. Due to the singularity at the origin, a log-Gabor filter is designed in the frequency domain and is computed as where f k is the kth centre frequency of the filter, θ l is the lth orientation angle, N f is the number of scales, N θ is the number of orientations, β is the bandwidth of the filter, and σ θ is the angular bandwidth. Here, the bandwidth of the filter is set to two octaves (β = 0.55), and the angular bandwidth σ θ is set to 1.5 for even spectrum coverage. Fig. 4 shows examples of the log-Gabor filter.
The filtering operation is performed in the frequency domain. Let W F denote the 2D Fourier transform of the normalised patch W , see (13). The output of the (k, l)th log-Gabor filter is computed as where IFFT2 denotes the 2D inverse Fast Fourier transform. The output map is further normalised as This normalisation step provides some degree of intensity invariance. Next, each output map is partitioned into R = d 1 × d 2 non-overlapping sub-regions, and the means of all the sub-regions are concatenated to form a mean vector μ k,l = [μ 1 , …, μ R ] T . Then, the mean values of the L output maps (where L = N f N θ ) are Finally, the feature matrix A is normalised to the range [0,1] before applying a dimensionality reduction technique to generate a compact feature vector.

Dimensionality reduction
Two different matrix-based subspace techniques are considered to reduce the size of the feature matrix A: two-directional 2D PCA, (2D) 2 PCA [26] and two-directional 2D linear discriminant analysis, (2D) 2 LDA [27]. In comparison to standard PCA and LDA, the matrix-based methods do not require a matrix-to-vector conversion to compute the image covariance matrices, thereby reducing the computational cost significantly. Unlike LDA, (2D) 2 LDA does not suffer from the singularity problem for small training sets. Both (2D) 2 PCA and (2D) 2 LDA will be investigated in the proposed feature extraction. A brief description of each method is given in the following subsections.
where Φ r and Φ c are the projection matrices with orthonormal components. Then, the columns of matrix D are concatenated to form a feature vector for classification. Let Y = AΦ c . The projection matrix Φ c can be determined by maximising the following criterion [26] V The image covariance matrix can be defined as , which is an L × L non-negative definite matrix. Suppose that the training set comprises P patches {A 1 ,…, A P }. The image covariance matrix G c can be computed as where A is the global mean given by The criterion in (19) can be expressed as The criterion function Ω(Φ c ) is maximised when Φ c is composed of the m c most dominant eigenvectors of G c : F c = [f 1 , . . . ,f m c ]. The number of eigenvectors m c is determined using the following condition where l i denotes the ith eigenvalue and g is a fixed threshold. Similarly, the other projection matrix Φ r contains the m r most dominant eigenvectors of the image covariance matrix G r , which is given by The number of eigenvectors m r in the projection matrix Φ r is estimated using a similar condition to (23).

Two-directional 2D LDA, (2D) 2 LDA:
The LDA technique considers the class information when forming the projection matrices. Its principle is to find a linear transformation that maximises the between-class scatter and minimises the within-class scatter of the training set. For (2D) 2 LDA, the optimisation of the between-class and within-class scatters is performed in both row and column directions simultaneously as where Ψ r and Ψ c are projection matrices. Let C be the number of classes and N i be the number of training samples in the ith class. Let P be the total number of training samples, P = C i=1 N i . Let A i p denote the pth sample and A i be the mean of all samples of the i th class, The between-class and within-class scatter matrices for the row direction are given, respectively, by and where A is the global mean given in (21). Similarly, the between-class and within-class scatter matrices for designing the projection matrix Ψ r are, respectively and The projection matrices Ψ r and Ψ c are obtained by maximising the following Fisher criteria and The discrimination vectors in the projection matrices Ψ r and Ψ c are the eigenvectors of G −1 wr G br and G −1 wc G bc , and the number of eigenvectors in these matrices is estimated using a similar condition to (23).

Classification stage
Support vector machines (SVMs) are used as a classifier because they possess good generalisation capability. The key concept of SVMs is to determine an optimal hyperplane that separates two different classes. A hyperplane is optimal if it maximises the margin between the two classes, where the margin is defined as the distance between the hyperplane and a closest training vector. Consider a training set {x i , y i } P i=1 , where x i ∈ R n is the ith input vector and y i ∈ {1,−1} is the corresponding class label. Training an SVM classifier involves solving the following optimisation problem min w,b,j Here, f(x i ) maps an input vector x i into a higher-dimensional space so that the classification problem becomes simpler (i.e. linearly separable). Parameter C is a positive regularisation constant to control the trade-off between the margin and the misclassification rate. A dual optimisation problem is given by where Once α i has been calculated, the decision function is given by and N s is the number of support vectors. In this paper, the linear kernel, K(x i , x j ) = x T i x j , is used and parameter C is obtained from the training set via a cross-validation procedure.

Experimental results
In this section, we first describe the experimental setup, and then investigate the effects of different steps in feature extraction. Finally, we compare the classification performance of the proposed method with other feature extraction methods.

Experimental setup
A 24 GHz frequency modulated continuous wave radar was used for data acquisition. The beam width of the radar antenna is 7°h orizontal and 25°vertical. The radar was positioned at a height of 0.7 m from the ground. The Doppler data were acquired in two environments (outdoor and indoor) from 18 subjects (7 females and 11 males). Each subject performed three motion types: (i) walking with both arms swinging; (ii) walking with one arm swinging; and (iii) walking with no arm swinging. Each subject walked towards the radar at azimuth angles of 0°and 3.5°, and repeated each motion type three times. The radar signal was recorded at a sampling rate of 7.812 kHz. Overall, 234 Doppler signals of length 10 s were recorded. Fig. 5a shows images of a subject walking with different arm motions and their respective T-F representations.

Analysis of feature extraction steps
In the proposed method, several adjustable parameters need to be considered, for example, the window length in the T-F representation, the number of scales and orientations of the log-Gabor filters, and the dimensionality reduction settings. A six-fold cross-validation is performed to investigate the effect of these parameters on the CR. In each validation fold, five subsets are used for training, and the remaining subset is used for testing. This is repeated six times for different choices of the test subset. The final CR is computed as the percentage of correctly classified samples, which are aggregated across all the validation folds. Initially, a signal length of 1 s is used to determine the adjustable parameters of the proposed method.

T-F representation:
In the STFT and S-method, the choice of the window length can affect the performance significantly. In the first experiment, the number of discrete Fourier transform points and the overlap between consecutive windows are set to 2048 and 90%, respectively. Then a variable window length is used to form the T-F representation. When the window length is smaller than 2048 samples, the signal is padded with zeros prior to T-F analysis. The extracted patch is fixed to a height of 128 pixels (N y = 128) and contrast enhanced by the Naka-Rushton equation. Initially, a set of 32 log-Gabor filters (four scales and eight orientations) are used. To compute the mean values of the filtered map as features, the height d 1 of the non-overlapping sub-region is set to 8, whereas the width d 2 is chosen to have a time duration of 125 ms. Therefore, for an input signal length of 1 s, d 2 is equal to 8. In the following two experiments, the feature vector is directly classified by the SVMs without dimensionality reduction. Fig. 6 depicts the CR as a function of the window length. Both T-F analysis methods reach a peak CR with a window length of 139.3 ms. Further increasing the window length reduces the CR; this is due to poor time resolution. Both T-F representations will be used in the succeeding experiments.

4.2.2
Log-Gabor filtering: Discriminative features are extracted by convolving the pre-processed patch with log-Gabor filters of different scales and orientations. To design a set of log-Gabor filters, we vary the number of scales from 3 to 5, and the number of orientations from 4 to 10. Figs. 7a and b show the CR as a function of the number of scales and orientations for the STFT and S-method, respectively. The CR improves markedly when the number of scales increases from 3 to 4; it reaches a steady state when the number of scales reaches 4. With four scales, the STFT obtains a peak CR of 90.0% with eight orientations, whereas the S-method achieves a CR of 90.4% with nine orientations. Therefore, we use a set of 32 log-Gabor filters (i.e. four scales and eight orientations) for the STFT, and 36 log-Gabor filters (i.e. four scales and nine orientations) for the S-method.

Dimensionality reduction:
Two subspace methods, (2D) 2 PCA and (2D) 2 LDA, are evaluated for feature compression. Fig. 8 presents the CRs as a function of the number of features obtained by dimensionality reduction. When the number of features is small, (2D) 2 LDA performs better than (2D) 2 PCA. However, increasing the number of features improves the CR of (2D) 2 PCA to the same level as (2D) 2 LDA. For this classification problem, (2D) 2 PCA achieves the highest CR; therefore, it will be used to reduce the number of features in the following experiments.

Input signal length:
The input signal length is an important factor in the classification of μ-D radar signature. A too short signal will not contain adequate cycles of the arm/leg swings to differentiate between the three human motions, whereas a too long signal will lead to data redundancy. To investigate the effects of input signal length, we vary the signal length from 0.5 to 3.0 s with a step of 0.5 s. For each input signal length, a new SVM classifier is trained. Fig. 9 presents the CRs as a function of input signal length. The CR improves when the input signal length increases from 0.5 to 3 s and reaches a plateau for a signal length of 1.5 s for S-method and 2 s for STFT. Based on these results, we use a signal length of 2 s in the following experiments.

Effects of azimuth angle and clutter:
So far in Section 4.2, the proposed method has been tested on radar data for subjects walking towards the radar at an azimuth angle of 0°in an outdoor environment. We also evaluate the proposed method in different configurations, especially in the presence of clutter. Here, the clutter or noise in the received signals includes multi-path propagations and reflections from other stationary targets. Four datasets are used. Datasets outdoor-0.0 and outdoor-3.5 are acquired from subjects walking in an outdoor environment, at an  azimuth angle of 0°and 3.5°, respectively. Datasets indoor-0.0 and indoor-3.5 are acquired from subjects walking along a corridor inside a building, at azimuth angles of 0°and 3.5°, respectively. Table 1 lists the CRs of the proposed method on different datasets using a two-fold cross-validation. The proposed method has a CR of 91.6% for the indoor-0.0 and 91.4% for the outdoor-0.0. For subjects walking obliquely towards the radar at an azimuth angle of ±3.5°, the CR reduces by 0.4% for the outdoor-3.5 and 2.5% for the indoor-3.5.
In summary, based on the experiments presented in Section 4.2, we select the following configurations for the proposed method: (i) S-method for T-F analysis; (ii) the Naka-Rushton equation for pre-processing the patch; (iii) a set of nine orientations and four scales log-Gabor filters for generating the feature map; (iv) (2D) 2 PCA for dimensionality reduction; (v) input signal length of 2 s; and (vi) SVM classifier.

Comparison of feature extraction methods
A six-fold cross-validation is employed to compare different feature extraction methods. The indoor radar dataset (indoor-0.0) is used to generate the training and test sets. The dataset is partitioned so that no human subject appears simultaneously in the training set and the test set. The radar signals are divided into segments of 2 s. Each segment is aligned so that it comprises two full gait cycles.
For comparison, four different feature extraction methods are also tested: (i) MFCC, (ii) CVD, (iii) EMD, and (iv) Gabor filters. In the MFCC-based method, 40 triangular bandpass filters are used to produce 64 mel-scale cepstral coefficients. The analysis window length and the overlap between successive windows are set to 0.5 and 0.01 s, respectively. In the CVD-based method, the first three harmonic frequencies and the velocity profile at each harmonic frequency are used to form a feature vector. Then, the feature vector is compressed using the standard PCA technique. In the EMD method, the energies of the IMFs are used as features. The Gabor filter-based method employs the same number of filters and applies the same step of converting the filtered outputs into a feature vector. All the feature extraction methods employ SVMs as classifier. Furthermore, they are implemented using MATLAB software and executed on a PC with a 2.9 GHz i7-CPU. Table 2 presents the CRs and the processing times of different feature extraction methods. The proposed method achieves the highest CR of 91.3%, followed by the Gabor filter-based method with a CR of 79.9%. The EMD method gives the lowest CR, which indicates that using only the energy of the IMF is not sufficient to discriminate the subtle differences of the arm swings. In terms of processing time, the proposed method takes about 0.2405 s on average to extract a feature vector, and its most time consuming stage is the filtering operation. We should note that no code optimisation was used in the implementation of the proposed method; the processing time can be reduced by optimising the implementation or using a different programming language. Compared with the proposed method, the Gabor filter-based method is about 1.14 times faster, but it yields a CR of 11.4% lower. The MFCC-based method is about 8.84 times faster than the proposed method, but its CR is 18.6% lower. The CVD-based method is about 11.5 times faster than the proposed method, but its CR is 29.0% lower. The EMD method is significantly slower and less accurate compared with the proposed method. The EMD high processing time is due to the iterative technique for extracting the IMFs.
The results shown in Table 2 are obtained using different number of features for each method. To compare the classification   performances of the different methods using the same number of features, Fig. 10 illustrates the CRs of the feature extraction methods as a function of the number of input features. Clearly, the CRs of the proposed method are still higher than those of the other methods, when using the same number of features.

Conclusion
This paper presents a 2D feature extraction method for classifying μ-D radar signature of human motions. The radar return is transformed into a T-F representation using the S-method. Instead of processing the entire T-F representation, a small patch centred on the torso frequency is automatically extracted to improve stability against the target speed. The T-F patch is then contrast normalised to highlight the weak μ-D modulations. Log-Gabor filters are employed to detect salient features at multiple scales and orientations, and (2D) 2 PCA is applied for dimensionality reduction. The proposed method is validated using μ-D radar signals obtained from human subjects walking with various arm motions. Experimental results show that it achieves promising results in classifying different human motions.