On view-invariant gait recognition: a feature selection solution

The authors present an improved feature selection solution for the view-invariant gait recognition problem, based on their previously proposed method called view-invariant feature selector (ViFS), which automatically reconstructs an optimised gallery template from a set of multi-view gallery templates. They improve ViFS by introducing a constraint that ensures the reconstructed features have the same scale as the original features, thus reducing the number of misclassifications caused by data misalignment. They evaluate the improved ViFS on the CASIA B and OU-ISIR large population datasets by performing a wide range of comparative studies in order to explore and confirm its effectiveness. Evaluation results indicate that the proposed framework is very effective for view-invariant gait recognition tasks.


Introduction
The ever-growing demand for reliable human identification systems for law enforcement, national security and commercial use has promoted the development of biometrics in the past 50 years. Advanced biometric techniques provide solutions to a wide range of challenges with the overarching intention of preventing imposters from accessing protected resources. Various biometric systems, such as fingerprint, face, and iris recognition, are capable of offering solid performance in real-world applications.
Despite the fact that biometric traits are intrinsic to humans, they cannot always be easily captured by closed-circuit television (CCTV) cameras or other types of sensors. Furthermore, it is highly preferable that the biometric analysis and authentication be performed at a distance in a non-invasive and non-obtrusive manner to avoid the cyber arms race (i.e. the development of deception and anti-deception techniques). Gait, which is considered to be a behavioural biometric trait, can be measured unobtrusively at a moderate distance, thus it plays a predominant role in remote human tracking and identification tasks. The past two decades have witnessed an important development of gait recognition systems. However, there are still important challenges that confine the practical application of gait analysis, one of which is concerned with the view angle variation between gallery data (data with known identities) and probe data (query data with unknown identities). It is therefore imperative to develop view-invariant gait recognition systems so that gait analysis remains competitive in practical applications.
Gait recognition approaches can be broadly classified into two categories: model based and appearance based. Model-based gait recognition refers to identifying people by modelling their distinctive gait characteristics with underlying mathematical structures [1]. Most model-based methods rely on high-quality gait sequences captured under controlled environments (e.g. indoor environments, a close distance between subject and camera, multi-view cameras, and depth cameras), thus they are effective at handling occlusions and changes in scale, as well as view-angle changes. However, the restrictions imposed by the underlying sensors used to acquire the data and their low tolerance to low-quality video make these methods less applicable for outdoor gait recognition.
Appearance-based methods usually adopt gait silhouettes as the feature source to build effective gait templates. The silhouettes are obtained by separating the subject profile from the background using gait sequences acquired by video cameras. The classification is usually performed by measuring the pixel-to-pixel distance between gallery and probe templates. A commonly used appearance-based template is the gait energy image (GEI), which is computed as the average of the binary silhouettes from a gait cycle [2]. Experiments on several large gait datasets (over 4000 subjects) suggest that GEI is the most statistically stable and efficient template for gait recognition, while other templates such as the chrono-gait image [3] and the gait entropy image [4] failed to show such robustness across various datasets [5,6]. The advantages of appearance-based approaches based on GEIs are their robustness to low-quality video and low computational complexity. However, GEIs are usually not robust to view angle and scale changes.
View-invariant gait recognition is one of the major challenges in people identification. Many researchers have evaluated view angle transformation techniques, discriminant analysis and manifold learning approaches for cross-view gait recognition. Their proposals usually share a common goal, namely to establish a cross-view mapping between gallery and probe templates. However, the effectiveness of many of these proposals is restricted to small view-angle variances. A promising approach to perform view-invariant gait recognition is through multi-view feature learning.
We have previously proposed the view-invariant feature selector (ViFS) [7], which is a linear regression-based feature selector that reconstructs gallery templates from arbitrary view angles, thus minimising the cross-view variance between gallery and probe features. Within the context of multi-view gait recognition, this is equivalent to reducing the intra-class variance. Subspace learning methods, e.g. linear discriminant analysis (LDA), have been applied to ViFS as feature enhancers to reduce the computational cost and improve the recognition accuracy. In this work, we introduce a feature scaling process to ViFS to further improve its performance by making sure that all the templates, original and reconstructed, are regularised before the similarity measurement. The scaling process reduces the noise in the reconstructed templates and thus minimises misclassifications. We test the proposed framework on the CASIA dataset B (CASIA B) and the OU-ISIR large population (OU-ISIR LP) dataset. The average recognition accuracy of our framework over 11 different views exceeds 99%.
This paper is organised as follows: Section 2 reviews the literature on view-invariant gait recognition. Section 3 explains the formulation of ViFS, as well as other fundamental concepts. Section 4 explains in detail the improvements proposed to ViFS. Section 5 presents the evaluations on the CASIA B and the OU-ISIR large population datasets. Section 6 concludes the paper.

Related work
In general, there are two types of view-invariant gait recognition: cross-view recognition, where only a single view angle is available in both the gallery and probe sets (and the two view angles are different), and multi-view recognition, where templates from multiple view angles are available in the gallery set while the probe templates are from a single view angle, or vice versa (i.e. multi-view templates are available in the probe set, while the gallery templates are from a single view angle).
Depending on the underlying algorithms used, current view-invariant gait recognition algorithms can be classified into one of three categories: (i) those based on human models, (ii) those based on view-invariant features, and (iii) those based on unitary projections. Methods within the first category are concerned with creating models that represent the human anatomy. When multi-view gait templates can be obtained, or depth information is available, it is possible to reconstruct 3D or 2.5D models, from which arbitrary views of gait sequences can be obtained by projection, and the parameters associated with various body parts can be easily determined. Tang et al. [8] propose to construct parametric 3D gait models from three cameras and use partial similarity matching to improve recognition rates. Their method achieves promising results on several major gait datasets. Similarly, Luo et al. [9] propose to use 3D gait models and sparse representation-based classification to perform view-invariant classification. Their framework and its performance are very similar to those of Tang et al. [8].
Methods within the second category attempt to obtain view-invariant features from single-view gait silhouette sequences to perform recognition under lateral views, i.e. those views different from the frontal and back views. For example, Kusakunniran et al. [10] and Goffredo et al. [11] employ view-invariant gait features for cross-view recognition. In [10], the authors propose the gait texture image and apply domain transformation obtained through invariant low-rank textures to obtain common canonical side-view gait features (i.e. the walking trajectory is perpendicular to the camera's viewpoint) from other view angles. Despite the good performance of this method, it is difficult to project features from the front or back views to the side view. In [11], the authors propose model-based view-invariant gait features, which use lower limb pose estimation to perform view angle rectification. However, as with other model-based methods, it is difficult to extract the model's parameters (height, length of limbs, joint angles etc.) from gait sequences acquired from a distance at low resolutions and with occlusions.
Methods within the third category usually adopt appearance-based features, e.g. GEIs, and learn the mapping relationship of features from two different views. Makihara et al. [12] propose the view transformation model (VTM) to effectively project gait features between two different views. However, as with any other singular value decomposition (SVD)-based method, VTM is sensitive to noise in the training dataset and requires a huge amount of memory and high computational power to compute the matrix factorisation if the training set is large. To solve these issues, Kusakunniran et al. [13] use truncated SVD and LDA to enhance the performance. In [14], Muramatsu et al. further enhance the VTM by matching gait features locally, i.e. by separating the gait into head, torso, thigh and shank regions, in order to avoid overfitting and reduce the influence of local-feature corruption. However, VTM-based methods still fail to provide satisfying results with large view-angle differences (over 30°) between gallery and probe.
An alternative to feature mapping is to learn a unitary subspace where features from the same subject but at different view angles are clustered together, while features from different subjects but at the same view angle are far from one another [15][16][17]. After learning such a subspace, features from various views can be projected into it for distance matching. Within this context, Hu et al. [18] propose a novel unitary linear projection method named ViDP, which enables cross-view gait recognition to be conducted without knowing the query view angle. The recent work by Zhang et al. [19] proposes a list-wise constrained discriminative projection framework on a novel gait representation to tackle the view angle variance. Apart from reporting results for cross-view matching, they also report results for the multi-view case, which outperform other conventional subspace learning methods.
Convolutional neural networks (CNNs) have been recently used to tackle gait recognition challenges. Alotaibi et al. apply a full CNN with four convolutional layers and a softmax layer for simple gait recognition tasks, i.e. matching gallery and probe data under the same view angle [20]. Yan et al. use a five-layer CNN with three convolutional layers and two fully connected layers for gait recognition. They also introduce a multi-task learning approach, which performs gait recognition, view angle prediction and scene prediction simultaneously. According to their findings, multi-task learning can accelerate the convergence of CNNs in the training process. However, the cross-view recognition performance of their network shows only small improvements compared with traditional approaches using principal component analysis (PCA) + LDA. Shiraga et al. successfully use a four-layer CNN, consisting of two convolutional layers and two fully connected layers, for large-scale gait recognition on the OU-ISIR large population dataset [21]. Their network has important advantages over other approaches on large-scale datasets when the view angle difference between gallery and probe data is small (< 30°). They also show that CNN-based methods can significantly reduce the equal error rates and thus improve the gait verification accuracy. Feature maps learned by CNNs have strong discriminant power and thus provide robustness in gait recognition. However, the performance of cross-view gait recognition with large view angle variations (> 54°) is still not ideal. Wu et al.'s work in [22] represents the state of the art of CNN-based cross-view gait recognition.
In our previous work [7], we introduce ViFS to reconstruct gallery templates from arbitrary view angles, which helps to turn the cross-view gait recognition problem into an identical-view gait recognition problem. However, despite its very good performance, some important aspects of ViFS require improvements. For example, the selected features occasionally degrade the performance compared with the performance attained when using single-view features. This is due to the fact that ViFS does not force the features to be normalised, thus the reconstructed templates do not align with other templates. This misalignment of data inevitably introduces noise in the reconstructed templates and thus leads to misclassifications. Based on these observations, in this work we introduce a feature scaling process to ViFS to make sure that all the templates, original and reconstructed, are regularised before the similarity measurement. This is explained in detail in Section 4.

Gait recognition pipeline using GEIs
GEI-based gait recognition focuses on both the human body shape in the spatial domain and the movement in the temporal domain. A conventional gait recognition pipeline based on GEIs consists of three steps:
• Acquisition of the gait signature: This involves extracting the binary silhouette of the subject from a video sequence. Several well-known techniques have been adopted for this task, including least median of squares [23], Gaussian mixture models [24], and most recently, fully convolutional networks [25].
• Construction of the GEI templates: Fig. 1 shows an example of the binary silhouettes extracted from a video sequence for a complete gait cycle (the two columns on the left-hand side) and the corresponding GEI computed as the average of the binary silhouettes (right-hand side). By compressing spatial and temporal information into one image, a GEI is able to reduce the effect of noise and increase computational efficiency.
• Similarity measurement: After modelling gallery and probe data using GEI templates, the distance between them is measured to find the matching identity.
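The template construction step can be illustrated with a minimal NumPy sketch. It assumes the silhouettes of one gait cycle are already extracted, aligned and stacked into a single array; the function name `gait_energy_image` and the toy data are hypothetical and only serve to show the averaging.

```python
import numpy as np

def gait_energy_image(silhouettes: np.ndarray) -> np.ndarray:
    """Average a stack of binary silhouettes (frames, height, width) into a GEI."""
    frames = np.asarray(silhouettes, dtype=np.float64)
    if frames.ndim != 3:
        raise ValueError("expected an array of shape (frames, height, width)")
    # Pixel-wise mean over the frames of one gait cycle.
    return frames.mean(axis=0)

# Toy gait cycle: four frames of 2x2 binary silhouettes (hypothetical data).
cycle = np.array([
    [[1, 0], [1, 1]],
    [[1, 0], [0, 1]],
    [[1, 1], [0, 1]],
    [[1, 0], [0, 1]],
])
gei = gait_energy_image(cycle)
```

Pixels that are foreground in every frame keep the value 1, while pixels covered only occasionally (typically the limbs) take intermediate values, which is how a GEI encodes motion in a single image.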

View-invariant feature selector
Let us assume that $h$ samples (i.e. GEIs) from $h$ different unknown view angles are available in the gallery set $G = \{x_i\}_{i=1}^{h}$, as well as one probe sample, $y$, from an unknown view angle in the probe set, $P$. Due to the view angle difference between gallery and probe samples, the intra-class distance can be larger than the inter-class distance for the same subject, leading to misclassifications. To reduce the negative effects of view angle differences on the classification results, one can minimise the cross-view distance between gallery and probe samples. If the view angles of the gallery and probe samples are unknown, one would like to find a feature vector $w = \{w_i\}_{i=1}^{h}$ that minimises the objective function:

$$\hat{w} = \arg\min_{w} \lVert y - Gw \rVert_2^2 \tag{1}$$

The minimiser $\hat{w}$ satisfies the normal equations $G^\top G \hat{w} = G^\top y$ [26]. Then $\hat{w}$ can be calculated as follows:

$$\hat{w} = (G^\top G)^{-1} G^\top y \tag{2}$$

Since the gallery set $G$ and its covariance matrix $G^\top G$ are highly unlikely to be upper-triangular, we cannot solve (2) directly. Instead, we use QR-factorisation, i.e. $G = QR$, to generate an orthogonal matrix $Q$ and an upper-triangular matrix $R$ from $G$. Thus (2) can be formulated as

$$\hat{w} = (R^\top Q^\top Q R)^{-1} R^\top Q^\top y = R^{-1} Q^\top y \tag{3}$$

We can obtain $\hat{w}$ by solving $R\hat{w} = Q^\top y$ with back substitution. We call the minimiser $\hat{w}$ the ViFS, as it selects features from the multi-view gallery samples to reconstruct an optimal template $\hat{G} = G\hat{w}^\top$ that accurately matches probe sample $y$.

In Fig. 2, we present a set of examples to demonstrate the effectiveness of ViFS for feature reconstruction. We take four samples of the same subject from the gallery set at view angles 18°, 72°, 126° and 180°. We denote this set of samples by $G$. We train ViFS so as to reconstruct gallery samples representing 11 different view angles. For example, for the reconstruction of the 0° gallery sample, we use the four gallery samples and one probe sample from 0° to generate the ViFS for 0°, denoted as $\hat{w}_0$, and obtain the reconstructed template $\hat{G}_0 = G\hat{w}_0^\top$. The reconstructed gallery samples in the third row of Fig. 2 are visually similar to the ground truth samples in the fourth row of Fig. 2, suggesting that ViFS achieves an accurate view transformation on gallery samples.
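The QR-based solution described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: it assumes the vectorised gallery GEIs are stored as the columns of a matrix with full column rank, treats the weights as a column vector, and uses synthetic data. The function name `vifs_minimiser` is hypothetical.

```python
import numpy as np

def vifs_minimiser(G: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Least-squares weights w such that G @ w approximates y, via QR."""
    Q, R = np.linalg.qr(G)              # reduced QR: Q is (d, h), R is (h, h)
    rhs = Q.T @ y
    h = R.shape[0]
    w = np.zeros(h)
    # Back substitution on the upper-triangular system R w = Q^T y.
    for i in range(h - 1, -1, -1):
        w[i] = (rhs[i] - R[i, i + 1:] @ w[i + 1:]) / R[i, i]
    return w

rng = np.random.default_rng(0)
d, h = 64, 4                            # hypothetical feature length and number of gallery views
G = rng.random((d, h))                  # columns stand in for vectorised gallery GEIs
true_w = np.array([0.1, 0.4, 0.3, 0.2])
y = G @ true_w                          # synthetic probe lying in the span of the gallery
w_hat = vifs_minimiser(G, y)
reconstructed = G @ w_hat               # reconstructed gallery template
```

Because the synthetic probe is an exact combination of the gallery columns, the recovered weights match `true_w`; with real GEIs the probe is only approximated and the residual is non-zero.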

2D PCA:
Yang et al. [27] propose the 2D extension of PCA. Consider the training set $\{I_i \mid i = 1, \dots, n\}$, where $I_i$ is a single sample (e.g. a GEI) in 2D form with size $d_r \times d_c$, and $n$ is the total number of samples. The image covariance matrix, $C$, is then calculated as

$$C = \frac{1}{n} \sum_{i=1}^{n} (I_i - \bar{I})^\top (I_i - \bar{I}),$$

where

$$\bar{I} = \frac{1}{n} \sum_{i=1}^{n} I_i.$$

By performing the eigen-decomposition of $C$, we can obtain the 2D PCA projection basis. In this work, we use $I_i$ to represent sample $i$ in 2D form. In the following, all samples are assumed to be vectorised into feature vectors instead of being in 2D form. Therefore, we denote feature vector $i$ by $x_i$ and $y_i$ for gallery and probe, respectively.
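A minimal NumPy sketch of the 2D PCA computation follows. It assumes the standard image covariance definition (mean of $(I_i - \bar{I})^\top (I_i - \bar{I})$ over the training samples) and random placeholder images; the helper names `image_covariance` and `projection_basis` are hypothetical.

```python
import numpy as np

def image_covariance(images: np.ndarray) -> np.ndarray:
    """Image covariance matrix C (d_c x d_c) of a stack of shape (n, d_r, d_c)."""
    centred = images - images.mean(axis=0)
    # Sum of (I_i - mean)^T (I_i - mean) over all samples, divided by n.
    return np.einsum("nrc,nrd->cd", centred, centred) / images.shape[0]

def projection_basis(C: np.ndarray, p: int) -> np.ndarray:
    """Eigenvectors of C belonging to the p largest eigenvalues, as columns."""
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:p]
    return eigvecs[:, order]

rng = np.random.default_rng(1)
images = rng.random((10, 6, 5))              # hypothetical: n = 10 samples, d_r = 6, d_c = 5
C = image_covariance(images)
V = projection_basis(C, p=3)
projected = images @ V                       # each sample becomes d_r x p
```

Working on the $d_c \times d_c$ image covariance instead of the covariance of fully vectorised images keeps the eigen-decomposition small, which is consistent with the lower training cost reported for 2D PCA later in the paper.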

Proposed improved ViFS
In this section, we present the methodology followed to improve ViFS. Specifically, we introduce a constraint in the objective function to achieve data normalisation by performing feature scaling in the minimiser $\hat{w}$.

Feature scaling
The reconstructed feature template of a specific view, $v$, is denoted as

$$\hat{G}_v = G\hat{w}_v^\top.$$

As presented in Section 3, the QR-factorisation solution does not guarantee that the $\ell_1$ norm of $\hat{w}$ is equal to 1, thus it is not guaranteed that $\hat{G}_v$ and $G_v$ have an identical feature scale, i.e. $\max(G_v) \neq \max(\hat{G}_v)$ (generally, the sparsity of the feature matrix results in $\min(G_v) = \min(\hat{G}_v) = 0$). Therefore, we reformulate the objective function of ViFS as

$$\hat{w} = \arg\min_{w} \lVert y - Gw \rVert_2^2 \quad \text{s.t.} \quad \lVert w \rVert_1 = 1. \tag{5}$$
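The effect of the unit-norm constraint can be illustrated with a short sketch. It assumes, as a simplification, that the constraint is met by rescaling the QR solution by its $\ell_1$ norm; this assumption reproduces the two-view values quoted in the evaluation on CASIA B, but it is not confirmed to be the authors' exact procedure. The function name `scale_minimiser` is hypothetical.

```python
import numpy as np

def scale_minimiser(w: np.ndarray) -> np.ndarray:
    """Rescale w so that its l1 norm equals 1 (assumed normalisation)."""
    norm = np.abs(w).sum()
    if norm == 0:
        raise ValueError("cannot normalise an all-zero weight vector")
    return w / norm

# Two-view minimiser reported in the paper for the 0 degree probe.
w0 = np.array([0.49, 0.47])
w0_scaled = scale_minimiser(w0)
rounded = np.round(w0_scaled, 2)   # values after rounding to two decimals
```

The original weights sum to 0.96, so a template reconstructed with them is slightly darker than the gallery and probe templates; after rescaling the weights sum to 1 and the reconstructed template shares their scale.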

Feature enhancement using subspace learning
In order to further enhance the extracted features and increase the inter-class variance, we apply subspace learning. Since subspace learning methods are designed to project the input features into another space with lower dimensionality, the redundant information is removed and the discriminant features are preserved. Furthermore, since they are linear transformations, the computational cost and processing times are very low. In our previous work, we have shown that when used as feature enhancers, locality preserving projection and LDA attain a nearly identical performance. Hence, in this paper, we only implement LDA for its low computational cost and popularity.

Before applying LDA, we reduce the dimensionality of the data and make sure that the matrices are non-singular by applying PCA. Let us denote the 3D matrix containing $n_T$ training samples by $T = \{T_i\}_{i=1}^{n_T}$. The eigenvectors, as well as the corresponding eigenvalues, are obtained by eigen-decomposition of the covariance matrix. We select the first $p$ eigenvectors according to the eigenvector ratio, e_ratio. Thus, we obtain $V_{pca} = \{v_i\}_{i=1}^{p}$, a $d_c \times p$ subspace projection matrix. The subspace projection is then $T_i = T_i V_{pca}$, which results in matrix $T = \{T_i\}_{i=1}^{n_T}$. We reshape the 3D matrix $T$ to 2D form with dimensions $d_{pca} = d_r \times p$ and $n$ samples. We then use $T$ and the corresponding class labels to train the LDA projection matrix, $V_{lda}$.

Let us assume that there are $h$ views in gallery set $G$, and $n_G$ samples in total. After obtaining the ViFS, $\hat{w}$, the subspace projection matrices, $V_{pca}$ and $V_{lda}$, and the reconstructed gallery set, $\hat{G} = G\hat{w}^\top$, we project $\hat{G}$ onto the subspace matrices to obtain an enhanced gallery feature set, $G_{lda}$. Following the same procedure, we also obtain the enhanced probe set $P_{lda}$. For simplicity, we use $G$ and $P$ to represent the enhanced gallery and probe feature sets, respectively, in the formulation of the similarity measurement.
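The eigenvector selection step can be sketched as follows. The exact selection rule is not recoverable from the text, so this sketch assumes a common convention: keep the smallest number of leading eigenvectors whose eigenvalues account for at least a fraction e_ratio of the total eigenvalue sum. Both this rule and the function name `select_by_ratio` are assumptions made for illustration.

```python
import numpy as np

def select_by_ratio(eigvals: np.ndarray, e_ratio: float) -> int:
    """Smallest p whose top-p eigenvalues hold at least e_ratio of the total.

    Assumed rule (cumulative eigenvalue ratio); the paper's exact criterion
    may differ.
    """
    ordered = np.sort(eigvals)[::-1]
    cumulative = np.cumsum(ordered) / ordered.sum()
    p = int(np.searchsorted(cumulative, e_ratio) + 1)
    return min(p, len(ordered))

eigvals = np.array([5.0, 3.0, 1.0, 1.0])   # hypothetical spectrum
p = select_by_ratio(eigvals, 0.85)         # cumulative ratios: 0.5, 0.8, 0.9, 1.0
```

Under this assumed rule, raising e_ratio keeps more eigenvectors, which matches the qualitative behaviour discussed in the eigenvector-ratio evaluation: too few eigenvectors discard useful information, while too many may retain redundant directions.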

Similarity measurement
We use the Euclidean distance to obtain matching scores between gallery and probe. The Euclidean distance between gallery feature set $G$ and probe feature set $P$ is calculated as

$$D(G_i, P_l) = \lVert G_i - P_l \rVert_2, \quad i = 1, \dots, c,$$

where $c$ is the number of classes, and $l$ denotes the unknown probe data. If $D(G_k, P_l) = \min_{i=1}^{c} D(G_i, P_l)$, the probe feature vector is assigned to the same class label $k$ of the gallery feature.
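The matching rule amounts to a nearest-neighbour search under the Euclidean distance. The sketch below assumes one enhanced feature vector per class in the gallery; the function name `classify`, the labels and the toy values are hypothetical.

```python
import numpy as np

def classify(gallery: np.ndarray, labels: np.ndarray, probe: np.ndarray):
    """Return the label of the gallery row closest to the probe, plus all distances."""
    distances = np.linalg.norm(gallery - probe, axis=1)
    k = int(np.argmin(distances))
    return labels[k], distances

# One enhanced feature vector per class (toy values).
gallery = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])
labels = np.array(["A", "B", "C"])
probe = np.array([0.9, 1.2])
predicted, distances = classify(gallery, labels, probe)
```

Here the probe is closest to the second gallery vector, so it is assigned label "B".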

Performance evaluation
In this section, we validate the performance of the improved ViFS on two datasets: the CASIA Gait B and OU-ISIR Large Population datasets. As part of these evaluations, we also analyse the trade-off between accuracy and speed of different feature enhancers when used in conjunction with the improved ViFS.

Experiments on the CASIA B dataset
The CASIA B dataset is a multi-view gait dataset that contains 124 subjects in total [28]. The size of each silhouette image is normalised to 128 × 88; one video sequence produces a single GEI. Since this work focuses on studying the performance of the improved ViFS across different view angles, we only employ those sequences that are not affected by changes in clothes or carrying objects. The sequences of the first 74 subjects are used for training, and the other 50 subjects are used for testing. In the testing set, each subject has six sequences; the first four sequences are regarded as gallery sequences, and the remaining two sequences as probe sequences.
We first present the performance of PCA + LDA for view-invariant gait recognition on the CASIA B dataset. This performance can be used as a baseline to assess the effectiveness of ViFS. Table 1 presents the cross-view matching results when using PCA + LDA. The diagonal values in the table are the highest in each row. As expected, the accuracy drops dramatically when the view angle difference is equal to or above 36°. The same pattern can be identified in Table 2, which corresponds to the case of 2D PCA + LDA.
We then evaluate the improved ViFS, denoted by OViFS, with gallery data from two view angles. As shown in our previous work [7], widely spread view angles usually result in higher matching accuracy. Thus, we select two views that differ from each other as much as possible. Specifically, we evaluate view angles {72°, 180°} and {18°, 108°} in the gallery set. In other words, we make sure that we have gallery data captured from one frontal view and one lateral view. Tables 3 and 4 tabulate the results, along with the baseline performance. Baseline_1D refers to PCA + LDA, while Baseline_2D refers to 2D PCA + LDA. The matching accuracy values for Baseline_1D and Baseline_2D are the average of each corresponding row in Tables 1 and 2, respectively. When gallery data from two views are available, OViFS outperforms the baseline framework by ∼28%, on average. In Table 3, note that when the probe data are at 0° or 36°, OViFS outperforms the original ViFS by 2%. The underlying reason for this improvement is that the minimiser of the original ViFS for 0° is $\hat{w}_0 = \{0.49, 0.47\}$, with $\lVert \hat{w}_0 \rVert = 0.96$, thus the reconstructed gallery features do not align with the probe features. By normalising $\hat{w}_0$, as in (5), we obtain $\hat{w}_0 = \{0.51, 0.49\}$ with $\lVert \hat{w}_0 \rVert = 1.00$. The improvement attained for the probe data at 36° can be explained in the same way. These results suggest that the proposed feature scaling method works effectively when the original ViFS minimiser fails to retain the same feature scale.
We also observe similar improvements when gallery data from three view angles are available. Let us denote this case by ViFS1_3 for the original ViFS and by OViFS1_3 for the improved ViFS. Table 5 tabulates results for gallery view angles 72°, 126° and 180°. Note that the feature scaling method manages to improve the recognition accuracy for the 0° view angle from 84.00 to 88.00%. For this view angle, the minimiser of the original ViFS is $\hat{w}_0 = \{0.23, 0.45, 0.29\}$, with $\lVert \hat{w}_0 \rVert = 0.97$, which represents a situation similar to the one explained before. It is interesting to note the results for view angles 0°, 18° and 144° in Tables 3 and 5. It is well known that the frontal and back view GEIs differ more strongly from the GEIs of other views. The additional gallery data, i.e. the 126° view angle, strengthens the reconstructed features for other views (except for the front view, 0°), thus improving the overall performance.
Table 6 tabulates the highest accuracy that OViFS1_3 can achieve with gallery data from view angles 18°, 108° and 180°. The average accuracy over all 11 views reaches 99.27%, with very good results for view angles 0°, 144° and 162°.
As expected, the performance of our improved ViFS increases as the number of different view angles available in the gallery set increases, as depicted in Fig. 3. Note that OViFS1_2 achieves the poorest performance, especially for the view angles that are the most different from the two available view angles in the gallery set. OViFS1_3 achieves a high accuracy across all views, while OViFS1_4, OViFS_5 and OViFS_6 perform very similarly and achieve the best results. Therefore, one can conclude that if gallery data contains samples from more than three different view angles that are widely spread, our improved ViFS has a very good performance across all views, i.e. it achieves robust view-invariant gait recognition.

Evaluation of different eigenvector ratios:
We evaluate the effect of using different eigenvector ratios for the feature enhancers on the improved ViFS. This evaluation assumes that gallery data from three view angles are available: 18°, 108° and 180°. For the case of PCA + LDA, denoted by OViFS1_3, the evaluated eigenvector ratios are e_ratio ∈ {0.8, 0.9, 0.95, 0.99}. Table 7 tabulates these results. As e_ratio increases, the performance improves significantly, indicating that a sufficient number of eigenvectors is required to preserve useful information from the original feature space. For the case of 2D PCA + LDA, denoted by OViFS2_3, the evaluated eigenvector ratios are e_ratio ∈ {0.7, 0.8, 0.85, 0.9, 0.95}. Table 8 tabulates these results. Unlike the results presented in Table 7, the average performance of OViFS2_3 has very small variance for different values of e_ratio. Note that when e_ratio is set to 0.95 or higher, the performance drops, indicating that the additional eigenvectors might introduce redundant information to the feature subspace, thus decreasing the performance. One can conclude that OViFS2_3 has a higher tolerance to this parameter change, thus providing a more stable performance than OViFS1_3. However, a well-adjusted OViFS1_3 can achieve higher accuracy, which may be suitable for tasks under controlled environments. Another benefit of using OViFS2_3 is the low computational cost. We evaluate both OViFS1_3 and OViFS2_3 on a laptop with an Intel i7-6820HK and 16 GB of DDR4 memory. The average training time for OViFS1_3 is 51.19 s, while OViFS2_3 takes 1.57 s, i.e. it is ∼32 times faster than OViFS1_3. Therefore, PCA + LDA is suitable for high-accuracy tasks, while 2D PCA + LDA is suitable for real-time processing tasks.

Evaluation of different dataset partitions:
We evaluate the influence of dataset partition on the proposed framework using OViFS1_3 with view angles 18°, 108° and 180°. It is well known that for machine learning approaches, including the adopted subspace learning method, the amount of available training data has a huge impact on the performance. Furthermore, different dataset partition strategies are adopted by the state-of-the-art view-invariant methods, e.g. Wu et al. [22] evaluate the performance of CNNs with 74 subjects for training and 50 for testing (74-50, hereafter), and with 24 subjects for training and 100 for testing (24-100, hereafter). Therefore, in this section, we evaluate OViFS using different dataset partitions. With only 24 training subjects (Training # 24), the average accuracy drops from 99.27% (see Table 6) to 93.36%, which is mainly caused by the poor recognition accuracy on probe data at 72°. However, as the number of training samples increases to 34, the accuracy on the 72° probe data increases to over 90%. One can then conclude that the poor accuracy of Training # 24 on the 72° view angle is due to the lack of sufficient good-quality training data for this view angle, i.e. noisy data may be present, causing overfitting. However, if the training set is not badly affected by noise, even with a small amount of training samples, the proposed framework can still attain a reliable performance, as the accuracy for other view angles is relatively high. Fig. 4 shows the variation in the average performance of OViFS1_3 across all views when the number of training samples varies from 24 to 74. In general, our proposed framework is robust to training set size variation, and as expected, as the number of training samples increases, the recognition rate increases.

Comparison with the state of the art:
Finally, we compare our improved ViFS against the state-of-the-art multi-view gait recognition approaches. Table 10 tabulates the recognition accuracy of our improved ViFS and the recently proposed methods by Tang et al. [8]. Tang_9 refers to the experiment that uses nine training views from 18° to 162°. Tang_4 refers to the experiment that uses four training views: 36°, 72°, 108° and 144°. It is important to note that no results for the 0° and 180° view angles are reported in their work. We use four different settings for comparison:
• OViFS1_3: PCA + LDA, e_ratio = 0.99, available views: 18°, 72° and 162°.
• OViFS2_2: 2D PCA + LDA, e_ratio = 0.9, available views: 18° and 108°.
From Table 10, one can observe that OViFS achieves higher accuracy, while requiring fewer view angles in the gallery set compared to Tang et al.'s work. Note that for the three-view setting, PCA + LDA achieves higher accuracy than 2D PCA + LDA, on average, while in the two-view setting, both attain identical average results.
Table 11 tabulates the recognition accuracy of OViFS_2 and different state-of-the-art methods using a 24-100 dataset partition with a gallery view angle of 54° and probe data with view angles of {36°, 72°}, {18°, 90°} and {0°, 108°}. Table 12 tabulates the recognition accuracy for the case of a gallery view angle of 126° and probe data with view angles of {108°, 144°}, {90°, 162°} and {72°, 180°}. In these tables, Zhang et al. (1) refers to the feature-level fusion adopted by [19], Zhang et al. (2) refers to the score-level fusion from the same work, and Zhang et al. (3) refers to their multi-view DPLCR (discriminative projection with list-wise constraints with rectification, the framework proposed by Zhang et al. in their paper [30], whose reported performance on the mainstream datasets is the state of the art). From these two tables, one can observe that OViFS1_2 outperforms the state-of-the-art methods. Let us recall that ViFS is designed to match gallery data from multiple view angles with single-view angle probe data. However, it is possible for ViFS to work in the opposite situation, since the feature selection is a feature mapping process from a multi-view set to a single-view template. Note that OViFS1_2 is also more robust to large view angle variances according to the tabulated results. For example, in Table 11, when the view angle of the gallery data is 54° and the view angles of the probe data are {18°, 90°}, our framework outperforms Zhang et al.'s method by 8%. In Table 12, when the view angle of the gallery data is 126° and the view angles of the probe data are {72°, 180°}, OViFS_2 outperforms other methods by up to 12%.

Experiments on the OU-ISIR LP dataset
The OU-ISIR large population dataset includes more than 4000 subjects, each recorded using cameras from four view angles: 55°, 65°, 75° and 85°. Among all the datasets commonly used for gait recognition evaluation, this dataset is one of the largest in terms of the number of subjects, thus it is more statistically reliable for performance evaluation. According to the existing evaluation protocols [21,29,30], a common experiment setting is to use a subset of 1912 subjects, which is divided into two groups, where 956 subjects are used for training and the rest for testing. We refer to this subset as the OU-ISIR LP dataset in the following discussions and results. As shown in Section 5.1, our improved ViFS, when used with 2D PCA + LDA, attains a strong performance on the CASIA B dataset, can be trained faster and is less affected by parameters (e_ratio values). Hence, we focus here on evaluating OViFS2. Table 13 tabulates the recognition accuracy using 2D PCA + LDA (without OViFS) on the OU-ISIR LP dataset. The purpose of these results is to set a baseline to measure the improvements attained by OViFS2. From this table, one can observe that when the view angle difference between gallery and probe data is small (e.g. 10° or less), 2D PCA + LDA achieves a high accuracy, close to the identical-view matching results. However, when the view angle difference is larger than 20°, the accuracy decreases quickly.

Multi-view evaluation:
We evaluate OViFS2 for multi-view matching on the OU-ISIR LP dataset. We use four views, 55°, 65°, 75° and 85°, to train OViFS2_4. Here, we compare the performance of OViFS2_4 with two CNN-based approaches proposed by Wu et al. [22] and Shiraga et al. [21]. As done in [21,22], five-fold cross-validation is employed to reduce the effect of randomness. Specifically, the training and testing sets (each containing 956 subjects) are randomly selected five times; each time we record the recognition accuracy, and the final accuracy is the average over the five experiments.
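As an illustration only, the repeated random-split protocol described above can be sketched as follows. The names `repeated_split_accuracy` and `dummy_evaluate` are hypothetical and do not come from the evaluated frameworks; `dummy_evaluate` is a placeholder standing in for training a model on one subject group and measuring its recognition accuracy on the other.

```python
import random
from statistics import mean


def repeated_split_accuracy(subject_ids, evaluate, n_train=956, n_repeats=5, seed=0):
    """Average the accuracy over several random training/testing splits.

    `evaluate(train_ids, test_ids)` is a caller-supplied (hypothetical) function
    that trains on the first group and returns the accuracy on the second.
    """
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    accuracies = []
    for _ in range(n_repeats):
        shuffled = list(subject_ids)
        rng.shuffle(shuffled)
        train_ids = shuffled[:n_train]   # e.g. 956 subjects for training
        test_ids = shuffled[n_train:]    # the remaining subjects for testing
        accuracies.append(evaluate(train_ids, test_ids))
    return mean(accuracies), accuracies


def dummy_evaluate(train_ids, test_ids):
    # Placeholder only: checks that the two groups are disjoint and returns
    # the fraction of subjects assigned to testing instead of a real accuracy.
    assert not set(train_ids) & set(test_ids)
    return len(test_ids) / (len(train_ids) + len(test_ids))


average, per_run = repeated_split_accuracy(range(1912), dummy_evaluate)
```

In a real experiment, `dummy_evaluate` would be replaced by the full training and matching pipeline, and `average` would be the figure reported in the results table.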

Conclusion
In this paper, we introduced feature scaling to improve ViFS, which is a feature selector that achieves robust view-invariant recognition by reconstructing gallery data at different view angles to be matched with single-view probe data. The improvements introduced to ViFS in this paper normalise the associated minimiser so that the reconstructed gallery features are aligned with the probe features. To enhance the reconstructed features, our improved ViFS, denoted by OViFS, employs PCA + LDA. We evaluated OViFS with a wide range of gallery view angles and for different numbers of eigenvectors. Our results showed that, for high-precision tasks, OViFS with PCA + LDA is the most appropriate choice, as it can attain an average recognition accuracy of 99% with gallery data from three widely spread view angles. For real-time processing, our results showed that OViFS with 2D PCA + LDA is the best choice due to its low sensitivity to parameter changes and low computational cost. Our results also showed that OViFS achieves a better performance than the state-of-the-art methods, while requiring less gallery data from different view angles. OViFS has the potential to be used in practical scenarios where more than two cameras are available, such as border control, smart homes and surveillance systems.

Acknowledgment
This work was supported by the EU Horizon 2020 - Marie Sklodowska-Curie Actions through the project entitled Computer Vision Enabled Multimedia Forensics and People Identification (Project No. 690907, Acronym: IDENTITY).

Fig. 2 Example reconstruction by ViFS of gallery templates for missing view angles. The ground truth shows the gallery templates from all views provided by the CASIA B dataset.

Fig. 3 Recognition rates of the proposed improved ViFS when data from various view angles are available in the gallery set

as the p orthonormal eigenvectors corresponding to the p largest eigenvalues. Compared with the canonical PCA, 2D PCA is much more computationally efficient. For example, for GEIs of size 128 × 88, the covariance matrix of vectorised samples using canonical PCA has a complexity of $O(d^2)$, $d = d_r \times d_c = 11264$, while the complexity of calculating the image covariance matrix, C, is only $O(d_r^2)$, $d_r = 128$.
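To make the size difference concrete, the following minimal sketch computes an image covariance matrix and its leading eigenvectors with NumPy. It is not the authors' implementation: it assumes the image covariance is accumulated along the row direction so that it has size d_r × d_r (matching d_r = 128 above), it uses random arrays in place of real GEIs, and the function names and the choice p = 10 are illustrative.

```python
import numpy as np


def image_covariance(images):
    """Row-direction image covariance for a stack of shape (n, d_r, d_c).

    Returns a (d_r, d_r) matrix, i.e. the mean over samples of A A^T for the
    mean-centred images A (assumed variant, chosen to match d_r = 128).
    """
    centred = images - images.mean(axis=0)
    return np.einsum("nij,nkj->ik", centred, centred) / images.shape[0]


def top_p_eigenvectors(cov, p):
    """Orthonormal eigenvectors of `cov` for its p largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:p]    # indices of the p largest
    return eigvecs[:, order]


rng = np.random.default_rng(0)
geis = rng.random((20, 128, 88))             # synthetic stand-ins for 128 x 88 GEIs

C = image_covariance(geis)                   # 128 x 128 instead of 11264 x 11264
W = top_p_eigenvectors(C, p=10)              # projection matrix, 128 x 10
features = np.einsum("ri,nrc->nic", W, geis) # W^T A for each image: (20, 10, 88)
```

With vectorised samples, the covariance matrix would instead have 11264 × 11264 entries, which is what makes the canonical PCA route far more expensive for images of this size.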

Table 1
Cross-view matching accuracy (%) using PCA + LDA on the CASIA B dataset

IET Biom., 2018, Vol. 7, Iss. 4, pp. 287-295. This is an open access article published by the IET under the Creative Commons Attribution-NonCommercial-NoDerivs License (http://creativecommons.org/licenses/by-nc-nd/3.0/)

5.1.1 Evaluation of feature scaling: We examine the effect of feature scaling on the recognition accuracy. The feature enhancer is PCA + LDA, and the eigenvalue ratio of PCA is set to 99%. The feature dimensions are reduced from 11,264 (128 × 88) to 207. Since ViFS is designed to be applied to a multi-view gallery set, the combination of gallery data from different views is verified and analysed. Let us denote by ViFS1_2 the case of the original ViFS with PCA + LDA with data from two different view angles in the gallery set (here, the 1 refers to 1D PCA and the 2 refers to the two views available in the gallery set). Let us denote by OViFS1_2 the case of the improved ViFS with PCA + LDA with data from two different view angles in the gallery set. As shown in our previous work

Table 9 tabulates results for three cases: Training # 24, Training # 34 and Training # 44, where the number indicates the number of subjects used for training the framework. Note that when we reduce the number of training samples from 74 to 24, the average
Table 14 tabulates the recognition accuracy (%) of OViFS2_4 and the CNN-based approaches. Wu et al. (i) refers to the case of identical-view matching, while Wu et al. (a) is the average matching accuracy of the gallery view angles with a certain probe view angle. The same notation applies to Shiraga et al.'s method. We notice that Wu et al.'s method achieves the highest accuracy for identical-view matching, which confirms the effectiveness of CNNs in extracting discriminative features, especially when a sufficient number of training samples is available. Shiraga et al.'s method uses a shallower network than that used by Wu et al., and it does not use the pair-image approach to train the network; thus, its performance is lower than that attained by Wu et al.'s method. OViFS2_4 attains a very good performance. Moreover, its overall performance is very close to that attained by Wu et al.'s method for the identical-view scenario. It is worth mentioning that the training and testing time of OViFS2_4 is shorter than that required by the CNN-based approaches.

Table 12 Recognition accuracy (%) of various methods for 126° gallery data. Two view angles are available in the probe. The bold values indicate the values that outperform the previous works.