DeepActivity: a micro-Doppler spectrogram-based net for human behaviour recognition in bio-radar

Abstract: The movements of the human body and limbs result in unique micro-Doppler signatures, which can be exploited for classifying human activities. In this work, the authors propose a Convolutional Gated Recurrent Units Neural Network (CNN-GRU) to classify human activities of varying duration based on the micro-Doppler spectrogram. Unlike conventional deep learning approaches, which often treat the micro-Doppler spectrogram the same way as a natural image, the authors extract local features of micro-Doppler signatures via convolutional layers and encode temporal information with gated recurrent units. Through this unified framework, the temporal evolution of body motions within a short time can be better utilised. The model avoids the resolution limitation caused by the fixed-size time window of the input data and identifies human activities of duration shorter than the time window length. Experiments show that the CNN-GRU model is capable of recognising and temporally localising an activity sequence contained in the spectrogram.


Introduction
Human activity classification is a field of particular interest to researchers in areas ranging from physical security to intelligent interfaces. Visual perception of human body motion can be affected by distance, variations in lighting, deformations of clothing, and occlusions of body segments [1]. Owing to their excellent day and night performance and their ability to penetrate obstacles, radar systems are an appropriate alternative for analysing human behaviour. The radar echoes from a non-rigid body contain valuable information related to human behaviour, known as micro-Doppler signatures [2]. Taking advantage of this distinctive effect, various human activities can potentially be classified.
Motivated by the successful application of deep neural networks in various fields, deep convolutional neural network architectures have been proposed for activity classification based on the micro-Doppler spectrogram [3][4][5][6] and significantly outperform the previous state-of-the-art schemes that mainly relied on domain knowledge-based features. In addition to convolutional neural networks (CNNs) trained from scratch, transfer-learned neural networks are also used for this task [7, 8]. Based on pre-trained models, much deeper convolutional networks, such as VGG, Inception-Net, and ResNet, can also be used to classify human behaviour without the need to be trained on a huge database.
Although convolutional neural network architectures have been proved effective in the micro-Doppler classification task, current approaches differ little from natural image classification. Unlike visual imagery, micro-Doppler signatures carried by radar backscattering echoes contain the temporal evolution of body motions within a short time (varying from a few seconds to a few minutes). This spatio-temporal evolution of human motions is transformed into the time-frequency plane and presented as a spectrogram via joint time-frequency analysis. Owing to the physical structure of the human body, the position and velocity of human motion cannot change abruptly, so the past state is crucial for predicting and analysing the next state. Considering the sequential characteristic behind the micro-Doppler spectrogram, the pipeline should include a component that exploits temporal correlations of the input data to improve processing performance.
In this work, we combine convolutional layers with a gated recurrent neural network to propose a novel deep neural network architecture, namely the Convolutional Gated Recurrent Units Neural Network (CNN-GRU), to classify various human motions. The convolutional part consists of convolutional layers and pooling layers, which convolve and compress the Doppler spectrogram, respectively. The outputs of the CNN are fed to gated recurrent units (GRUs). The GRUs encode the temporal information of the feature maps using both the current observation and memory, modelling the activities' temporal progression. Through this unified framework, both local features and temporal information can be used to recognise and temporally localise an activity sequence contained in the spectrogram. Experiments based on the Motion Capture (MOCAP) database demonstrate that although CNN-GRU is trained on samples with a fixed-size window, it can recognise and temporally localise activities of varying duration, showing finer temporal resolution and better generalisation than most current approaches. In contrast to previous works in the bio-radar field, we not only recognise activities but also detect their start and end time points.

Models
A micro-Doppler spectrogram-based human activity classification framework embodying the CNN-GRU model is shown in Fig. 1. The raw human backscattering radar echo is first transformed into a micro-Doppler spectrogram by applying short-time Fourier transforms (STFTs). The spectrogram is then processed with convolution and pooling operations to extract local features, and a two-layer GRU encodes the temporal patterns. The output of the GRUs flows into a fully-connected layer with a softmax activation function to classify the identity of the micro-Doppler spectrogram.
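As a concrete illustration of how the spectrogram shrinks on its way to the GRU, the following sketch traces the feature-map dimensions through two conv/pool stages. The 128 × 128 input size, 'valid' convolutions, and 2 × 2 non-overlapping pooling are assumptions for illustration; the paper specifies only the 5 × 5 filters and max-pooling.

```python
def conv2d_out(h, w, k=5):
    # a 'valid' convolution with a k x k filter shrinks each side by k - 1
    return h - k + 1, w - k + 1

def maxpool_out(h, w, p=2):
    # non-overlapping p x p max-pooling downsamples each side by p
    return h // p, w // p

def cnn_gru_input_shape(h, w):
    """Trace spectrogram dimensions through conv(5x5) -> pool(2x2), twice."""
    for _ in range(2):
        h, w = conv2d_out(h, w)
        h, w = maxpool_out(h, w)
    return h, w  # feature-map size whose columns are fed to the GRU

print(cnn_gru_input_shape(128, 128))  # -> (29, 29)
```

Each column of the final feature map is then treated as one element of the temporal sequence read by the GRU.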

Motivation
The core idea behind the designed model is that we treat the micro-Doppler spectrogram as time-sequence data rather than as a general RGB image. The micro-Doppler spectrogram, which is the STFT of the returned radar echo signal, represents exactly the time-varying velocity of the separate body parts. As human activity is a complex combination of the time-varying motion of whole-body segments, it is possible to analyse human behaviour based on the backscattering echo more effectively by using regularities in the temporal domain. However, due to the backscattering phenomenon of the human body, the time-varying velocities of separate body parts may overlap to some extent in the time-frequency plane (as can be seen from Fig. 2), which makes it difficult to analyse the temporal information conveyed in the spectrogram directly. To address this problem, we combine a convolutional neural network and GRUs in one model: convolutional and pooling layers are used to extract local features, and GRUs are introduced to analyse the feature maps along the time axis. Compared with the stacked convolutional layers or auto-encoders generally used for this task, this unified model fits the time-varying nature of the micro-Doppler spectrogram better in theory.

Local feature learning with convolutional neural network
The convolutional neural network used for local feature learning is a two-layer architecture. We use a 5 × 5 frequency-time filter for the first convolutional layer, followed by a filter of the same size for the second convolutional layer. These filters slide to every position of the spectrogram and compute a new element as a weighted sum of the elements they cover. The aim of the convolution operation is to extract robust features from the blurred curves in the joint velocity-time plane. Unlike natural images, each 'pixel' in one column of the spectrogram corresponds to the echo intensity and radial velocity of the illuminated parts of the human body at one time step, and each row records the time-varying characteristics of human motion over the observation interval. The max-pooling function replaces the output of the convolutional layer at a certain location with the maximum output within a neighbourhood. It ensures that the representation becomes approximately invariant to small translations of the input. This operation is useful because the motion speed of different humans is usually not the same, and this difference can be seen as a translation of the corresponding spectrogram. Through the max-pooling operation, we can focus more on whether a feature is present than on exactly where it is, and learn more robust and informative representations of the spectrogram.
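The translation tolerance gained from max-pooling can be seen in a minimal 1-D sketch (our illustration, not from the paper): two feature rows that differ only by a small shift pool to the same representation when the shift stays within a pooling window.

```python
def maxpool_1d(xs, p=2):
    # non-overlapping max-pooling over a 1-D feature row
    return [max(xs[i:i + p]) for i in range(0, len(xs), p)]

row = [3, 0, 0, 7]
shifted = [0, 3, 7, 0]  # the same two activations, shifted within their windows
print(maxpool_1d(row))      # -> [3, 7]
print(maxpool_1d(shifted))  # -> [3, 7], identical despite the shift
```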

Temporal encoding with GRUs
The proposed temporal encoder is a GRU network with recurrent connections between hidden units. It reads an entire pooled sequence and then produces an output for classification. Built on a simpler architecture than the long short-term memory (LSTM) network, the GRU alleviates the computational burden to some extent and is promising for embedded platforms. Moreover, in emotion classification from noisy speech, LSTM has been found to perform better in the case of continuous noise, while the GRU performs better for noise that is not continuous [9]. Since the background noise in radar backscattering echoes is often non-periodic, we choose the GRU block to encode temporal information.
The information modulation process inside the GRU is similar to that of the LSTM, except that the GRU does not have a separate memory cell. The GRU block diagram is illustrated in Fig. 3. The hidden state h_t, which is also the activation of the GRU at time t, is a linear interpolation between the previous activation h_{t−1} and the candidate activation h̃_t. The update gate z_t decides how much the unit updates its activation. The candidate activation h̃_t is computed similarly to that of a conventional recurrent unit. The value of the reset gate r_t decides whether to remember or forget the previously computed state. Equations (1)-(4) describe the mathematical model of the GRU:

z_t = σ(W_z x_t + U_z h_{t−1})   (1)
r_t = σ(W_r x_t + U_r h_{t−1})   (2)
h̃_t = tanh(W x_t + U(r_t ⊙ h_{t−1}))   (3)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t   (4)

Here, z_t and r_t are the update gate and reset gate, respectively, h_t is the hidden state, the W and U terms denote weight matrices, σ is the logistic sigmoid function, and ⊙ denotes element-wise multiplication.
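A scalar toy implementation of the standard GRU equations (biases omitted, weights chosen purely for illustration) makes the gating behaviour concrete: driving the update gate towards zero makes the unit carry its previous state forward almost unchanged.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, W, U):
    """One scalar GRU step following the standard update/reset-gate equations."""
    z = sigmoid(Wz * x_t + Uz * h_prev)              # update gate
    r = sigmoid(Wr * x_t + Ur * h_prev)              # reset gate
    h_cand = math.tanh(W * x_t + U * (r * h_prev))   # candidate activation
    return (1 - z) * h_prev + z * h_cand             # interpolated hidden state

# With the update gate forced shut (z -> 0), the previous state survives intact
h = gru_step(1.0, 0.5, Wz=-100.0, Uz=0.0, Wr=0.0, Ur=0.0, W=1.0, U=1.0)
print(round(h, 6))  # -> 0.5
```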

Training for implemented model
The model is trained in a fully-supervised way, calculating the error between the predicted outputs and the ground-truth labels with a cross-entropy loss function, then backpropagating the gradients from the softmax layer to the convolutional layers. We choose Adaptive Moment Estimation (Adam) as the gradient descent optimisation method. The mini-batch size is 32 and the learning rate is 0.001. Drop-out is introduced during the training phase: as a form of regularisation, it is applied to the inputs of every dense layer, setting the activation of randomly-selected units to zero with a probability of 0.5 during training. The open-source toolkit PyTorch is used for training the network. The training process is accelerated by an NVIDIA GTX 1080 GPU and the CUDA library (cuDNN).
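The loss used above can be sketched in a few lines of plain Python (a toy re-implementation for clarity, not the PyTorch code actually used): softmax turns logits into class probabilities, and cross-entropy is the negative log-likelihood of the ground-truth class.

```python
import math

def softmax(logits):
    # subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, true_idx):
    """Negative log-likelihood of the ground-truth class."""
    return -math.log(softmax(logits)[true_idx])

# A confident, correct prediction incurs a small loss...
print(cross_entropy([5.0, 0.0, 0.0], 0))
# ...while a confident, wrong prediction is penalised heavily
print(cross_entropy([5.0, 0.0, 0.0], 1))
```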

Results
In this section, we evaluate the proposed CNN-GRU model on MOCAP-based non-parametric simulation.

Empirical non-parametric human model
We examine the performance of the classification method on simulated micro-Doppler spectrograms derived from the MOCAP database, which is also used in [10][11][12]. The MOCAP database, collected by the Graphics Laboratory at Carnegie Mellon University, is available in the public domain and is very useful for studying human movements. It contains 2605 different motion clips of full-body MOCAP data (performed by 144 subjects). Each file records the time-varying coordinate information of 30 body segments. In our work, these human body segments are modelled by prolate ellipsoids (see Fig. 4). It should be mentioned that no shadowing or multiple interactions are accounted for in this model. The segment volume measurements, for both males and females, are given in [13]. We take the referenced measurements as mean values and generate diverse human models whose parameters obey a Gaussian distribution. In our simulation, the bandwidth of the radar sensor was 1.5 GHz and the carrier frequency was set at 4.0 GHz. We interpolated the MOCAP data to obtain a pulse repetition frequency of 400 MHz. To obtain varied simulated data, the 3-D coordinate position of the radar was changed in every simulation. We used the STFT to calculate the micro-Doppler spectrograms. The simulated micro-Doppler spectrograms comprise running, walking, boxing, kicking, and standing casually, with 600 simulated samples per activity. The time window size of each sample is 1 s. For illustration, the micro-Doppler images of walking, running, boxing, and kicking are shown in Fig. 5.
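The principle behind the STFT processing can be illustrated with a toy, stdlib-only sketch (the sampling rate, window size, and tone frequencies are illustrative choices, not the simulation's parameters): a naive DFT-based STFT applied to a synthetic echo whose Doppler shift jumps half-way through the record recovers the frequency step frame by frame.

```python
import cmath
import math

def stft_peak_bins(signal, win=64):
    """Magnitude STFT via a naive DFT on non-overlapping windows;
    returns the dominant frequency bin of each frame."""
    peaks = []
    for start in range(0, len(signal) - win + 1, win):
        frame = signal[start:start + win]
        mags = []
        for k in range(win // 2):
            s = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / win)
                    for n in range(win))
            mags.append(abs(s))
        peaks.append(mags.index(max(mags)))
    return peaks

fs = 512  # assumed sampling rate of the toy echo
# Doppler shift jumps from 32 Hz to 96 Hz half-way through the 2 s record
sig = [math.cos(2 * math.pi * (32 if n < fs else 96) * n / fs)
       for n in range(2 * fs)]
print(stft_peak_bins(sig))  # bin spacing is fs/win = 8 Hz: bin 4, then bin 12
```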

Fine temporal resolution of human activity sequence
In activity classification, the input data are trimmed to a fixed size that only contains the time-varying information depicting a single human action, and most classifiers' temporal resolution is thus limited by the time window size of the input. In the proposed CNN-GRU model, the GRU recurrent cells take the CNN features of multiple neighbouring elements as input to detect actions at every time step. By feeding the hidden states of the GRU cells to the softmax layer, a class probability distribution for every single time step can be obtained, which avoids the limitation caused by fixed-size training samples. Fig. 6 shows a test sample that contains two kinds of motion (boxing and running) in one spectrogram, together with the corresponding classification performance of the CNN-GRU. The activity sequence is composed of boxing and running: the former lasts for 0.4 s and the latter for the remaining 0.6 s.
A plain CNN or stacked auto-encoder cannot handle this kind of input, since such models are trained for one action per image and are limited to classifying a single action in the fixed-size input. Fig. 6b shows that the CNN-GRU is able to recognise both the boxing and running motions and to detect their approximate start and end time points. The misclassification on the test data at one time step (0.18 s) may be attributed to the similarities between the local spectrogram images of boxing and walking. Since an action cannot last for such a short time, the walking motion can be excluded. Fig. 7 demonstrates the classification accuracy of test samples (Fig. 7a is a walking sample and Fig. 7b a running sample) at each time step. The classifier makes a correct recognition when it has observed only a few components of the input. Compared with previously applied deep learning approaches, whose temporal resolution cannot be shorter than the time window length, the CNN-GRU model achieves finer temporal resolution.
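The duration argument used to exclude the spurious walking label can be expressed as a simple post-processing heuristic (our illustration, not a step described in the paper): per-step label runs shorter than a minimum plausible action duration are absorbed into the surrounding segment.

```python
def suppress_short_runs(labels, min_len=3):
    """Replace label runs shorter than min_len with the preceding label,
    implementing the 'an action cannot last that short' heuristic."""
    # group consecutive identical labels into (label, length) runs
    runs = []
    for lab in labels:
        if runs and runs[-1][0] == lab:
            runs[-1][1] += 1
        else:
            runs.append([lab, 1])
    out = []
    for lab, n in runs:
        if n < min_len and out:
            out.extend([out[-1]] * n)  # absorb the glitch into the previous run
        else:
            out.extend([lab] * n)
    return out

# a one-step 'walk' glitch inside a boxing segment is removed
steps = ['box'] * 4 + ['walk'] + ['box'] * 3 + ['run'] * 6
print(suppress_short_runs(steps))  # -> ['box'] * 8 + ['run'] * 6
```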

Generalisation ability
In the previous section, we tested the proposed model's temporal resolution on a human activity sequence. We were also interested in the model's classification accuracy. We introduced a leave-one-user-out (LOUO) cross-validation scheme to verify the ability of the classifier. In the LOUO scheme, each actor's data are used for testing once, while the data of the remaining actors are used for training. A performance comparison is conducted with the CNN proposed by Kim et al. in [6] on our dataset. Confusion matrices for the classification methods are presented in Fig. 8. Each row represents the true label of the input spectrograms, each column represents the label predicted by the classifier, and the diagonal of the confusion matrix shows the accuracy of correct classification. Our network outperforms the plain CNN by 7%. Among the activities, boxing is the most difficult to classify and is often mistaken for kicking. The generalisation ability of our method is thus better than that of the competing neural network architecture. Fig. 9 shows the loss/accuracy curves of the two neural networks. We observe that CNN-GRU converges faster than the CNN: although it runs through the same number of training iterations, CNN-GRU captures the various levels of abstract features more effectively and therefore learns better.
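The LOUO protocol can be sketched as a plain Python fold generator (the subject names and records here are hypothetical toy data): each subject's samples are held out exactly once while all other subjects' samples form the training set.

```python
def louo_splits(samples):
    """Leave-one-user-out folds over (subject, label) records:
    each subject's data is held out for testing exactly once."""
    subjects = sorted({subj for subj, _ in samples})
    for held_out in subjects:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test

# hypothetical (subject, activity-label) records
data = [('s1', 'walk'), ('s1', 'run'), ('s2', 'box'), ('s3', 'kick')]
for held_out, train, test in louo_splits(data):
    print(held_out, len(train), len(test))
```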

Conclusion
In this work, we have developed a CNN-GRU neural network that combines convolution operations and temporal encoding to classify human activities based on the micro-Doppler spectrogram. Experiments based on the MOCAP database show that the proposed model avoids the resolution limit caused by fixed-size training samples and yields a class probability distribution for every single time step. In contrast to previous works, it is able to recognise and temporally localise activities of varying duration, showing finer temporal resolution and better generalisation. With its fine temporal resolution and generalisation ability, this approach is not only suitable for classification tasks but also promising for capturing context and reasoning about human pose.