Towards Reading Beyond Faces for Sparsity-aware 3D/4D Affect Recognition

In this paper, we present a sparsity-aware deep network for automatic 3D/4D facial expression recognition (FER). We first propose a novel augmentation method to combat the data limitation problem for deep learning, specifically given 3D/4D face meshes. This is achieved by projecting the input data into RGB and depth map images and then iteratively performing randomized channel concatenation. We also introduce an effective way to capture the facial muscle movements encoded in the given 3D landmarks from three orthogonal planes (TOP): the TOP-landmarks over multi-views. Importantly, we then present a sparsity-aware deep network to compute sparse representations of the convolutional features over multi-views. This is not only effective for a higher recognition accuracy but is also computationally convenient. For training, the TOP-landmarks and sparse representations are used to train a long short-term memory (LSTM) network for 4D data, and a pre-trained network for 3D data. The refined predictions are achieved when the learned features collaborate over the multi-views. Extensive experimental results achieved on the Bosphorus, BU-3DFE, BU-4DFE and BP4D-Spontaneous datasets show the significance of our method over the state-of-the-art methods and demonstrate its effectiveness by reaching a promising accuracy of 99.69% on BU-4DFE for 4D FER. © 2021 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Facial expressions (FEs) are one of the most important, powerful and natural ways to convey and understand human emotions. Many works have been reported to automate facial expression analysis due to its tremendous potential in many applications [1], such as medical diagnosis, social robots, self-driving cars, educational well-being and several other human-computer interaction settings. This paved the way for the rise of many facial expression recognition (FER) systems [2] using recently trending computer vision technologies with deep learning.
Despite the promising progress reported in the literature by 2D-based FER methods, they suffer from inevitable issues. Specifically, such methods face the challenging problems of sensitivity to pose variations, lighting condition changes and occlusions. With the advent of high-speed and high-resolution 3D data acquisition equipment, an essential alternative has been provided to achieve robust FER performance. This is due to the fact that 3D data additionally provides information about facial expressions in the form of deformations of various muscle movements on facial surfaces [19]. 3D face models also offer immunity to variations in viewpoint and lighting. As a result, research has recently steered towards FER systems based on static and dynamic 3D face models.
More recently, the rapid development of imaging systems for 4D data acquisition has enabled access to dynamic flows of 3D shapes of improved quality for a detailed investigation of facial expressions. Importantly, being more robust to the aforementioned problems and providing additional geometry information [24], the 4D data conveniently stores the facial deformations when a facial expression is triggered. In this regard, the release of large facial expression datasets containing 3D/4D face scans (e.g., the Bosphorus [25], BU-3DFE [26], BU-4DFE [27] and BP4D-Spontaneous [28] datasets) has enabled 3D/4D affect recognition by fetching facial deformation patterns both spatially and temporally.

Advances in 3D facial expression recognition
To learn from the underlying 3D facial geometry, a number of methods have been reported that rely only on static 3D data, i.e., apex frames. Generally, the most popular approaches can be categorized as local region or feature-based, template-based, 2D projections-based, and curve-based approaches.
In local region or feature-based techniques, different topological and geometric characteristics are extracted from several regions of a 3D face scan, and quantization of such characteristics is adopted to represent various expressions for better classification. Some typical examples are local normal patterns (LNP) [29], low-level geometric feature pools [30], and histograms of differential geometry quantities [18]. On the other hand, template-based methods require fitting a deformable or rigid generic face model to an input face scan under a certain criterion, and the computed coefficients are then regarded as extracted features of facial expressions. Some representative methods in this category are the annotated face model [15], bi-linear models [31], and the statistical facial feature model [32].
With the added advantage of reusing conventional solutions for 2D FER, 2D projections-based approaches exploit various 2D representations of the given 3D face models. In such works, deep features are used for learning the facial attributes in 2D images projected from the 3D face models. For example, the authors in [16] presented a 3D face model in terms of six types of projected 2D maps of facial attributes. These maps are then fed into a deep learning model for learning distinctive features along with fusion learning. Similarly, Oyedotun et al. [33] proposed a deep learning model for joint learning of robust facial expression features from fused RGB and depth map latent representations.
In comparison with the above-mentioned methods, curve-based approaches have also shown effective performance. They represent deformations caused by different facial expressions by computing various shapes via a set of sampling curves in a Riemannian space. For instance, the authors in [34] presented a framework for representing facial surfaces with an indexed set of 3D closed curves. These curves correspond to level curves of a surface distance function, defined as the shortest distance between a point of the computed curve on the facial surface and a reference nose tip point. Such a representation allows comparing 3D shapes via their corresponding curves. Similarly, in [12], the authors used the curve-based idea for 3D FER. They computed the length of the geodesic path between corresponding regions on a 3D face. This method produces quantitative information from the surfaces of various facial expressions, which ultimately helps the classification performance.
Generally, 3D-based FER approaches rely on the apex frame only. This means that despite having the capability to extract additional features from the depth, the temporal information of facial muscle movements is not available. In such scenarios, an ideal method should look beyond the apparent feature representations. One promising way to achieve this is the additional use of landmarks and sparse features to fundamentally encode the underlying facial features, which the existing methods do not take into account.

Towards 4D facial expression recognition
In the past decade, 4D FER has been widely investigated due to its significance for better performance. This is mainly because 4D data provides complementary spatio-temporal patterns, allowing deep models to better understand and predict facial expressions. For instance, Sun et al. [35] and Yin et al. [27] proposed methods based on Hidden Markov Models (HMM) to learn the facial muscle patterns over time. In another attempt, using random forests, Ben Amor et al. [24] presented a deformation vector field based on Riemannian analysis to benefit from local facial patterns. Likewise, Sandbach et al. [36] proposed free-form deformations as representations of 3D frames and then used HMM and GentleBoost for classification. Moreover, the authors in [13] used a support vector machine (SVM) and represented geometrical coordinates and their normals as feature vectors, and used dynamic local binary patterns (LBP) in another work [20]. Similarly, to extract features from polar angles and curvatures, a spatio-temporal LBP-based feature was proposed in [37].
By using the scattering operator [38] on 4D face scans, Yao et al. [39] applied multiple kernel learning (MKL) to produce effective feature representations. Similarly, the authors in [40] proposed a statistical shape model with local and global constraints to recognize FEs. They claimed that the local shape index and the global face shape can be combined to build the required FER system. For automatic 4D FER via a dynamic geometrical image network, an interesting model was proposed by Li et al. [22]. They generated geometrical images after estimating differential quantities, where the final emotion prediction was a function of score-level fusion from different geometrical images.
In a similar work [21], a collaborative cross-domain dynamic image network was proposed for automatic 4D facial expression recognition. The authors computed geometrical images and combined their correlated information for better recognition. Another recent method exploits the sparse coding-based representation of LBP differences [41]. The authors first extracted appearance and geometric features via the mesh-local binary pattern difference (mesh-LBPD), and then applied sparse coding to recognize FEs.
Importantly, 4D data offers several aspects that improve a system's performance. For instance, 4D face scans offer additional temporal information along with extended spatial information. Such extended spatio-temporal data carries the potential for superior recognition performance when effective features/descriptors are used to capture better facial cues. However, this comes at the cost of computational expense, demanding more resources. Moreover, the limited availability of such 4D data can create bottlenecks for a network's learning and, therefore, calls for effective augmentation methods.

Motivation
Despite many attempts to automate 3D/4D FER, we believe there exist many gaps that need attention. For example, the data available to train a deep network for 3D/4D FER is very limited. This calls for efficient data augmentation techniques to satisfy the data-hungry nature of deep learning. Moreover, despite the fact that the multi-views of 3D/4D faces jointly capture facial deformations, and that landmarks also encode specific movement patterns, their role is often ignored. Importantly, the choice of appropriate facial feature representations is vital to the success of a 3D/4D FER system. Specifically, sparse representations have been shown to offer tremendous performance [42]. Therefore, we offer a solution that uses sparse features effectively to perform 4D FER in a collaborative fashion. The sparse features not only provide effective cues to a deep learning network for better learning, but are also handy in reducing the computational time, as demonstrated in the later subsections. Importantly, we must highlight that, with reduced computational costs, our proposed sparse approach allows processing more samples obtained with the augmentation method, which together provide a computationally elegant solution.

Contributions
In the light of the above discussion, our three-fold contributions in this work are as follows: (1) we present ∞-augmentation, a simple yet efficient method to combat the 3D/4D data limitation problem; (2) we introduce TOP-landmarks over multi-views to extract landmark cues from three orthogonal planes; and (3) we compute sparse representations from the extracted deep features, which outperform the state of the art with reduced computational complexity.
Our contribution and novelty can be emphasized by noting that some recent works [22,23] also attempted to compute projected 2D images from the given 3D/4D point cloud data. However, to the best of our knowledge, this is the first time a multi-view projection of 3D landmarks, as TOP-landmarks, is used in the literature. Regarding sparse representation, it has shown promising results in the literature [42], and the work closest to using sparse features in 3D/4D FER is [41]. However, the authors in [41] apply sparse coding to extracted mesh-LBPD features, whereas we compute sparse features from multi-views and then proceed with collaborative prediction. This is the novelty of our work and has not been reported before. Also, the existing 3D/4D FER methods either do not address the issue of limited data availability, or rely on regular augmentation methods without exploring this aspect of 3D/4D FER further. Therefore, we can fairly claim that our augmentation method has not been reported before and that we are the first to propose it.
The rest of the paper is organized as follows: our proposed 3D/4D FER method is explained in Section 2. In Section 3, we present and discuss our extensive experimental results to validate the efficiency of the proposed method. Finally, Section 4 concludes the paper.

Proposed method
In this section, we explain our proposed method for automatic 3D/4D FER. First, we augment our data to extend the dataset size for better deep learning. Second, we extract the facial patterns encoded in landmarks on three orthogonal planes to aid our affect recognition system. We then extract deep features, which are collaboratively learned with the landmarks for accurate expression recognition. To leverage the correlations between ∞-augmentation, TOP-landmarks and our sparse representations, the proposed sparsity-aware deep learning framework is explained as follows.

TOP-landmarks
Inspired by LBP-TOP [43], we propose TOP-landmarks to extract effective landmark cues from three orthogonal planes and then use them in our deep learning network over multi-views (left, front and right). For each of the given multi-views, we first project all the given 3D landmarks onto the three orthogonal planes XY, XZ and YZ, as shown in Fig. 1. This means that for a given set of 3D landmarks with m vertices, L = {(x_i, y_i, z_i)}, i = 1, ..., m, the corresponding projections onto the three orthogonal planes are

L_XY = {(x_i, y_i)},  L_XZ = {(x_i, z_i)},  L_YZ = {(y_i, z_i)},  i = 1, ..., m.  (1)

To avoid data anomalies in the model's training, we use normalization to make sure the projected points appear similar across different samples. For this, we use a regular normalization method and scale values from 0 to 1 to ensure correspondence across different frames. Next, we compute the Euclidean distance of each projected landmark point from the origin and store these as distance vectors d_XY, d_XZ and d_YZ. Finally, we concatenate the three distance vectors from the orthogonal planes to compute the resultant TOP-landmarks, denoted by X, as

X = g(d_XY) ⊕ g(d_XZ) ⊕ g(d_YZ),  (2)

where g(·) and ⊕ denote the normalization and concatenation operators, respectively.
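As an illustration, the following is a minimal Python sketch of Eqs. (1) and (2), assuming the landmarks arrive as an (m, 3) array and that min-max scaling serves as the normalization g(·):

```python
import numpy as np

def top_landmarks(landmarks):
    """Minimal sketch of the TOP-landmarks vector X, Eqs. (1)-(2)."""
    planes = [(0, 1), (0, 2), (1, 2)]        # XY, XZ and YZ index pairs
    vectors = []
    for a, b in planes:
        proj = landmarks[:, [a, b]]          # projection onto one plane
        d = np.linalg.norm(proj, axis=1)     # distance from the origin
        d = (d - d.min()) / (d.max() - d.min() + 1e-8)  # g(.), scale to [0, 1]
        vectors.append(d)
    return np.concatenate(vectors)           # X, a vector of length 3m

# e.g., the 83 landmarks of BU-3DFE/BU-4DFE give a 249-d TOP vector
X = top_landmarks(np.random.rand(83, 3))
```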

Sparsity-aware deep learning
Sparsity has recently played a remarkable role in reducing computation time and boosting classification performance in deep learning [42]. Therefore, we also propose our deep network to be sparsity-aware, as shown in Fig. 2. As shown in this figure, once we project the 3D/4D data as RGB texture and depth maps over multi-views, we augment the data to increase the training set. Note that instead of using the frontal view only, we resort to multi-views and then apply score-level fusion to incorporate the facial muscle movements from the side views as well. For computational convenience, we use GoogLeNet [44] to extract deep features from the augmented data. From a given 4D face scan, consider an input deep feature vector x_k of length P; we aim to transform it into the equivalent sparse representation as

x_k = A h_k,  (3)

where k is the index of the current 3D face and A ∈ R^{P×Q}, with Q ≫ P, is an overcomplete dictionary with a wavelet basis. As discussed in the literature [45], we use a wavelet basis since most natural and synthetic images are better represented in the wavelet domain. Keeping in mind the scope of the paper, and for computational reasons as followed in the literature, we do not train a dictionary from scratch but rather employ a pre-determined dictionary. Furthermore, h_k ∈ R^Q is the equivalent sparse representation of the deep feature vector x_k. For the sparse reconstruction, we let ĥ_k denote an estimate of the sparse vector h_k obtained via a sparse estimation algorithm [45], and let S_k represent the set of active indices in the sparse vector, i.e., its support set. For better estimates, each feature vector is processed individually. The estimate of the sparse vector is computed as

ĥ_k = Σ_{S_k} p(S_k | x_k) E[h_k | x_k, S_k].  (4)

The following explains how the sum, the posterior p(S_k | x_k), and the expectation E[h_k | x_k, S_k] in (4) are evaluated. Given the support S_k, (3) becomes

x_k = A_{S_k} h_{S_k},  (5)

where A_{S_k} is the matrix containing the columns of A indexed by S_k. Likewise, h_{S_k} is formed by the entries of h_k indexed by S_k. Since the distribution of h_k is unknown, making E[h_k | x_k, S_k] very difficult to compute, we use the best linear unbiased estimate (BLUE),

E[h_k | x_k, S_k] = (A_{S_k}^H A_{S_k})^{-1} A_{S_k}^H x_k,  (6)

where (·)^H denotes the Hermitian conjugate operator. Using Bayes' rule, we can write the posterior as

p(S_k | x_k) = p(x_k | S_k) p(S_k) / p(x_k).  (7)

We can ignore p(x_k) in (7) as it is a factor common to all posterior probabilities. Since the entries of h_k are activated according to a Bernoulli distribution with success probability p, we have

p(S_k) = p^{|S_k|} (1 − p)^{Q − |S_k|}.  (8)

If h_{S_k} were Gaussian, p(x_k | S_k) would also be Gaussian, which is easy to compute. On the contrary, evaluating p(x_k | S_k) is difficult for an unknown or non-Gaussian h_{S_k} distribution. To tackle this, it can be noted that x_k is formed by a vector in the subspace spanned by the columns of A_{S_k}. By projecting x_k onto the orthogonal complement of this subspace, the distribution can be computed. This is achieved using the projection matrix

P_{S_k}^⊥ = I − A_{S_k} (A_{S_k}^H A_{S_k})^{-1} A_{S_k}^H.  (9)

Dropping some exponential terms and simplifying gives us

p(S_k | x_k) ∝ exp(−‖P_{S_k}^⊥ x_k‖² / 2σ²) p(S_k),  (10)

where σ² denotes the variance of the residual error. In this way, we can evaluate the sum in (4), and the sparse representations are computed conveniently as depicted in Fig. 2.
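As an illustration of Eqs. (4)-(10), the following toy Python sketch enumerates all supports of a fixed size for a small real-valued dictionary. The residual variance sigma2, the activation probability p_act and the exhaustive search are illustrative assumptions; practical sparse estimation algorithms such as [45] search the supports greedily instead.

```python
import numpy as np
from itertools import combinations

def sparse_estimate(A, x, size=2, p_act=0.1, sigma2=1e-2):
    P, Q = A.shape
    h_hat = np.zeros(Q)
    log_weights, estimates = [], []
    for S in combinations(range(Q), size):
        A_S = A[:, S]
        # BLUE of the active entries given the support, Eq. (6)
        h_S = np.linalg.inv(A_S.T @ A_S) @ A_S.T @ x
        # The residual equals x projected on the orthogonal
        # complement of span(A_S), so it scores Eqs. (9)-(10)
        resid = x - A_S @ h_S
        log_like = -(resid @ resid) / (2 * sigma2)
        log_prior = size * np.log(p_act) + (Q - size) * np.log(1 - p_act)
        log_weights.append(log_like + log_prior)   # Eqs. (7)-(8)
        estimates.append((S, h_S))
    w = np.exp(np.array(log_weights) - max(log_weights))
    w /= w.sum()                                   # drops p(x_k), Eq. (7)
    for w_i, (S, h_S) in zip(w, estimates):        # the sum in Eq. (4)
        h_hat[list(S)] += w_i * h_S
    return h_hat

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 24))         # overcomplete: Q = 24 >> P = 8
h_true = np.zeros(24); h_true[[3, 17]] = [1.0, -0.5]
x = A @ h_true + 0.01 * rng.standard_normal(8)
print(np.round(sparse_estimate(A, x), 2))  # peaks near indices 3 and 17
```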
This is not only useful for accurate FER due to the effective sparse representations, but is also computationally convenient, since the deep features are reduced to fewer samples as expressed in (3), i.e., by computing their sparse representations. For experimental convenience, we use a sparse feature vector with thirty samples for each frame. Contrary to the approaches employed by existing methods in the literature, we resort to this sparse approach in order to better capture the facial information in the sparse domain. As mentioned previously, this not only helps in reducing the computational cost, since a lower number of samples is processed, but also allows the deep network to learn better from the sparse domain and, hence, recognize expressions effectively in a collaborative fashion. Note that our pre-determined dictionary contains wavelet basis functions, which are suitable for the deep features, as also validated by the results in the next section. Finally, we create a long short-term memory (LSTM) network with a sequence input layer, a Bi-LSTM layer with 2000 hidden units, and a 50% dropout layer, followed by fully connected (FC), softmax and classification layers. For 3D FER, we use the provided apex/key frames, whereas for 4D FER, we use the entire sequences. Consequently, we use the extracted TOP-landmark vectors and sparse features, all over multi-views, to first train the LSTM network. Afterwards, score-level fusion is performed to collaboratively recognize expressions.
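A sketch of the described recognition head is given below; the per-frame feature length and the use of the last time step's output are our assumptions for illustration:

```python
import torch
import torch.nn as nn

class SparsityAwareLSTM(nn.Module):
    """Sequence input -> Bi-LSTM (2000 hidden units) -> 50% dropout
    -> fully connected -> softmax, as described above."""

    def __init__(self, feat_dim, num_classes=6, hidden=2000):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden,
                              batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, seq):               # seq: (batch, frames, feat_dim)
        out, _ = self.bilstm(seq)
        return torch.softmax(self.fc(self.dropout(out[:, -1])), dim=-1)

# e.g., thirty sparse samples per frame for one view of a 4D sequence
model = SparsityAwareLSTM(feat_dim=30)
scores = model(torch.randn(4, 60, 30))    # softmax scores for 4 clips
```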
It is worth mentioning that there exist end-to-end networks in the literature, albeit with lower performance. However, given the scope of our paper, we do not propose an end-to-end framework and rather focus on improving the ability of the network to perform well under generalized scenarios. We would also like to clarify that we do not compute landmarks in our experiments; the landmarks (24 facial points for the Bosphorus dataset, and 83 facial points for the BU-3DFE, BU-4DFE and BP4D-Spontaneous datasets) are provided within the datasets. Therefore, in this paper, we do not rely on computing the landmarks. However, there exist a number of methods and resources in the literature for efficiently computing both 2D landmarks [46] and 3D landmarks [47-49].

∞-Augmentation
To overcome the unavailability of large-scale 3D/4D datasets (e.g., only 606 videos in the BU-4DFE [27] dataset), we propose a novel augmentation method, as shown in Fig. 3. With our simple yet efficient method, the 3D/4D data can theoretically be augmented an infinite number of times, hence the name ∞-augmentation. For a given 4D dataset with N examples, we process each 3D point-cloud independently. Therefore, we define

I^{4D} = {I^{3D}_{nt}},  ∀ t = 1, 2, 3, ..., T_n and ∀ n = 1, 2, 3, ..., N,  (11)

where I^{4D} is the set of 4D face scan examples, and I^{3D}_{nt} denotes the tth temporal frame of the nth example. Note that (11) implies |I^{4D}| = N, and that each example n contains T_n frames. Consequently, for a given 3D point-cloud with M vertices, let us denote its corresponding mesh as V = {v_j}, j = 1, ..., M, where v_j = (x_j, y_j, z_j) represents the face-centered Cartesian coordinates of the jth vertex. Let us define f_T : I^{3D} → I^T to compute the projected RGB texture images, and f_D : I^{3D} → I^D to compute the projected depth images from the given mesh via 3D-to-2D rendering, where f_T and f_D map the 3D mesh to the texture image I^T ∈ R^{K²} and the depth image I^D ∈ R^{K²}, respectively, with K² being the number of pixels. For an image with richer details of the facial deformations, we apply contrast-limited adaptive histogram equalization (CLAHE) on the depth images to get sharper depth images as I^S = f_S(I^D), where f_S represents the sharpening operator. Consequently, we get the three projected images as shown in Fig. 3.
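For the sharpening step, a minimal sketch using OpenCV's CLAHE is shown below; the clip limit, tile size and the synthetic stand-in image are assumed values, not taken from the paper:

```python
import cv2
import numpy as np

# CLAHE applied to a projected depth image, i.e., I_S = f_S(I_D)
depth = (np.random.rand(256, 256) * 255).astype(np.uint8)  # stand-in I_D
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
sharp_depth = clahe.apply(depth)                           # sharper I_S
```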
Once the projected 2D images are computed, the next step is to separate the R/G/B and gray channel(s) of these images and stack them together to form a channels train. Afterwards, randomly selected channels from the channels train are concatenated to generate an augmented image I_G. Inspired by the RGB color model, we then propose to extract the luminance information I_L from the selected channels using a weighted sum as

I_L = a_1 R + a_2 G + a_3 B,

where a_i is the weight for each of the three R/G/B channels of the generated image I_G. Apart from the standard weights used for extracting luminance when converting a texture RGB image to a gray image (a_1 = 0.3, a_2 = 0.59, a_3 = 0.11), the flexibility of a wide range of weights allows us to perform ∞-augmentation conveniently. Additionally, the R/G/B color channels of the texture, depth and sharper depth images come along when these images are projected from the 3D mesh to 2D images. Importantly, by varying the order in which the selected channels are concatenated, and by including the weighted luminance information of an obtained augmented image as an input channel for the next rounds, we iteratively achieve ∞-augmentation. Note that this process is repeated for each 3D point-cloud or video frame. It is worth mentioning again that our augmentation method concatenates the channels in a shuffled order. This means that the concatenated channels act as R/G/B to generate an augmented image. A new image with extra information, created by applying the weighted luminance (a linear combination of the channels) to the generated image, is then added to the pool of projected 2D images, as shown in Fig. 3. This concatenation triggers the inherent correlated information for a better learning experience in the deep network. Additionally, we report results in the subsequent sections for augmenting up to 25× samples, and we do not compute an optimal number of augmented samples. The optimal number is a trade-off between the recognition performance, the computational complexity and how well the network can generalize, i.e., how prone it is to overfitting. With the help of detailed experimental results, we also validate that our augmentation method is much better than a regular data augmentation method in terms of providing new information to train a deep network. A minimal sketch of one augmentation round follows.
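The sketch below illustrates one round of the procedure; the channel order, the gray-channel construction and the random weight range are assumptions left open by the description above:

```python
import numpy as np

def augment_once(texture, depth, sharp_depth, rng):
    """One round of ∞-augmentation for one view (H x W x 3 inputs)."""
    images = (texture, depth, sharp_depth)
    # Channels train: the R/G/B channels plus a gray channel per image
    train = [img[..., c] for img in images for c in range(3)]
    train += [img.mean(axis=-1) for img in images]
    # Concatenate three randomly selected channels as R/G/B -> I_G
    idx = rng.choice(len(train), size=3, replace=False)
    I_G = np.stack([train[i] for i in idx], axis=-1)
    # Weighted luminance I_L = a1*R + a2*G + a3*B with flexible
    # weights (standard values: a1 = 0.3, a2 = 0.59, a3 = 0.11)
    a = rng.uniform(0.0, 1.0, size=3)
    I_L = (I_G * a).sum(axis=-1)
    # Feeding I_L back into the pool lets later rounds iterate the
    # process indefinitely, hence the name ∞-augmentation
    return I_G, I_L

rng = np.random.default_rng(0)
imgs = [rng.random((128, 128, 3)) for _ in range(3)]
I_G, I_L = augment_once(*imgs, rng)
```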
In Fig. 4, we show augmented images generated using our proposed ∞-augmentation method for each of the six universal expressions. (For a detailed illustration of our ∞-augmentation method, we suggest the readers watch the video at https://youtu.be/pnmzjpGLkb0 for a better understanding of how the augmented images/videos appear in real time.) The rows in this figure represent the rotation angles at which the expressions are generated, while the columns correspond to the facial expressions. For each expression, we show three different profiles (right, front and left) and twenty-five randomly selected samples to validate the effectiveness of our proposed augmentation method. We use a rotation angle of 20 degrees to extract the left and right profiles. In this figure, the different colors help in visualizing how the underlying structure of the generated faces differs while still exhibiting similarities, which fundamentally helps a deep learning network identify meaningful patterns. Additionally, apart from the standard weights, the flexibility in choosing the weights, the rotation angles and the order of channel concatenation leverages our proposed ∞-augmentation method and introduces new paradigms for the augmentation community.

Datasets and experimental settings
To validate the efficiency of our proposed method, we conduct experiments on the Bosphorus [25] and the BU-3DFE [26] datasets for 3D FER. The BU-3DFE dataset contains 56 females and 44 males (100 subjects in total), each performing all six basic facial expressions (anger, disgust, fear, happiness, sadness, and surprise). Following the experimental protocols for 3D FER in previous works [16,33], the dataset is divided into two subsets: Subset I and Subset II. The first subset (Subset I), which is the standard dataset used for 3D FER, includes expressions with the two higher levels of expression intensity. The second subset (Subset II), which is seldom applied for 3D FER, contains all four levels of intensity except the 100 neutral samples. In the Bosphorus dataset, only 65 subjects perform the six prototypical expressions, and each person has only one sample per expression.
For experiments on 4D FER, we use the BU-4DFE [27] and the BP4D-Spontaneous [28] datasets. The BU-4DFE dataset contains posed video clips of 58 females and 43 males (101 subjects in total), each showing all six human facial expressions. Each clip has a frame rate of 25 frames per second (fps) and lasts approximately 3 to 4 s. On the other hand, the BP4D-Spontaneous dataset contains video clips of 23 females and 18 males (41 subjects in total) showing spontaneous expressions, with nervousness and pain as two additional expressions. Note that for 4D FER, we do not use key-frames [39] or employ any kind of sliding window [36]. In Table 1, we summarize the four datasets used for our experiments on 3D/4D FER.
For the experimental settings, we use a 10-fold subject-independent cross-validation (10-CV). For extracting deep features, we use the pre-trained GoogLeNet [44]; however, we also report results with other pre-trained models. All of our experiments are carried out on a GP100GL GPU (Tesla P100-PCIE), and the approximate training times are as follows: 2 days each for the Bosphorus and BU-3DFE datasets, 3 days for the BU-4DFE dataset, and 4 days for the BP4D-Spontaneous dataset. Additionally, unless otherwise stated, no augmentation is applied and three views (right, front and left) are used for all the experiments.
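For reference, a minimal sketch of the deep-feature extraction step is shown below; torchvision's ImageNet weights stand in for the pre-trained GoogLeNet, and dropping the final classifier to keep 1024-dimensional features is our illustrative choice:

```python
import torch
from torchvision import models

# Pre-trained GoogLeNet as a frozen feature extractor
net = models.googlenet(weights=models.GoogLeNet_Weights.DEFAULT)
net.fc = torch.nn.Identity()       # drop classifier, keep 1024-d features
net.eval()
with torch.no_grad():
    feats = net(torch.randn(8, 3, 224, 224))   # (8, 1024) deep features
```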

Effect of ∞-augmentation
We evaluate our proposed ∞-augmentation method over different numbers of samples to prove its effectiveness for deep networks. The accuracies of recognizing facial expressions in the BU-4DFE dataset are compared when different numbers of samples are generated while augmenting the data. For comparison, we also conduct experiments with a traditionally used simple/regular augmentation solution [50]. For augmenting the data with this regular solution, we specifically use horizontal and vertical shifting, horizontal and vertical flipping, random rotations and zoom angles, as sketched below. For our proposed ∞-augmentation method, we use random weights for each of the three channels in the weighted luminance, since the choice of weights is very flexible from 0 to 1. Importantly, to show the effectiveness of our augmentation method alone, we only perform recognition over multi-views, and skip using TOP-landmarks or sparse representations.
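A minimal sketch of this regular baseline, with assumed parameter ranges, is:

```python
from torchvision import transforms

# The regular augmentation baseline [50] used for comparison:
# shifts, flips, rotations and zooms; parameter ranges are assumed.
regular_aug = transforms.Compose([
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1),
                            scale=(0.9, 1.1)),    # shift, rotate, zoom
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
])
```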
As shown in Table 2, the results achieved by our proposed method indicate an accuracy jump of 3.52% when augmentation with 5× samples is used. This is related to the effective augmentation strategy, where the underlying facial patterns are captured and restored to exhibit similar fundamental patterns that help a deep network learn efficiently. In comparison, the performance gain by the regular augmentation method is only marginal and is, therefore, ineffective for the task at hand. Similarly, in Fig. 5, we show the performance comparison of our augmentation method and the regular method as bar charts to illustrate the increase in accuracy. As observed in this figure, our method demonstrates superior performance and, therefore, provides a desirable solution. We also observe that as we increase the augmentation size, the gains in accuracy become smaller. It is worth mentioning again that we use a 10-fold subject-independent cross-validation for our experiments. This indicates that our augmentation method is robust to overfitting and performs well in generalized experimental settings. Note that we have used a maximum of 25× augmented samples for conducting our experiments. More augmented samples can be generated at the cost of increased training time.

Importance of sparse representations and TOP-landmarks
To highlight the importance of sparse representations and TOP-landmarks, we show the confusion matrices of our experiments in Fig. 6. As shown in Fig. 6a, when we use the dense representations only, an accuracy of 93.70% is achieved. In comparison, the upshot of using sparse representations instead of dense representations is an increased accuracy of 94.50%, as shown in Fig. 6b. An added benefit of the sparsity-aware learning is the smaller computational time needed to process the fewer samples of the sparse representations. Importantly, as shown in Table 3, we also report quantitative comparisons of the computational cost. We conclude that the network with dense features takes approximately 3× more training time. Therefore, our sparse features not only enable better learning, but also a faster one. A similar trend can be seen with the TOP-landmarks. For instance, as shown in Fig. 6c, a promising accuracy of 98.78% is achieved when a joint recognition is performed with the help of the effective cues computed via TOP-landmarks. Despite the strong similarities among the angry, disgust and fear expressions, the figure shows that our method predicts the expressions accurately, validating its effectiveness. Finally, as shown in Fig. 6d, a much higher accuracy of 99.69% is achieved when employing sparsity-aware deep learning and TOP-landmarks over multi-views.
We further demonstrate the effectiveness of our sparse representations on multiple datasets in the upcoming subsections, specifically in Table 4, Table 7 and Table 8. In Table 4, higher performance is achieved by all the tested pre-trained models when using sparse features. Table 7 shows that the sparse representations work better than the dense ones not only when used alone (accuracy improved from 93.70% to 94.50%), but also when used together with TOP-L (accuracy improved from 98.78% to 99.69%). The findings are further supported by the results in Fig. 7 and Table 8(a). All these results show that using sparse representations leads to substantial performance improvements, and the advantage is consistent across various models and datasets.

Comparison with other pre-trained models
In Table 4, we compare the results achieved by extracting deep features from other pre-trained models to validate the effectiveness of our method. Specifically, we use AlexNet [51], VGG16 [52], VGG19 [52], Inception-v3 [53], ResNet18 [54], ResNet50 [54] and ResNet101 [54]. The table gives the overall impression that TOP-landmarks significantly improve the recognition accuracy. It also shows a similar trend, as reported previously, that sparse features lead to a better system performance. The superior performance of GoogLeNet is due to its several very small convolutions, which drastically reduce the number of parameters.

Comparison with the state-of-the-art 3D FER methods
To show the effectiveness of our proposed method, we perform extensive experiments to compare it with several state-of-the-art methods [16,33,55,56,14,18] for 3D FER. From Table 5, we can see that our method outperforms the existing methods by a considerable margin in terms of recognition accuracy on the BU-3DFE Subset I dataset. This validates that by incorporating multi-views of a 3D face, more information can be given to the deep network for a better learning experience. More importantly, this table shows that the incorporation of TOP-landmarks and sparse features helps the network learn the expressions better by finding significant correlations in important facial regions.
Similarly, the results on the BU-3DFE Subset II and Bosphorus datasets are presented in Table 6 for 3D FER. For Li et al. [16], the results are reproduced, since these two datasets have rarely been used for 3D FER in the literature. The results in this table indicate a behavior analogous to the results shown in Table 5. Note that although our solution outperforms the competing methods, a decrease in the recognition accuracy is observed relative to the BU-3DFE Subset I. An explanation for this is the low-intensity expressions present in Subset II, which make it more challenging to accurately learn the expressions. For the Bosphorus dataset, since it contains only a limited number of samples for training, the accuracy is lower than that achieved on the BU-3DFE Subset I dataset.

Comparison with the state-of-the-art 4D FER methods
For 4D FER, we also conduct several experiments to report the effectiveness of our method in detail. For this purpose, we resort to the widely used BU-4DFE dataset. In Table 7, we compare the accuracy of our method with several state-of-the-art methods [19,35,24,36,13,20,39,22,57,23,21] on the BU-4DFE dataset. As illustrated, our method outperforms the existing methods in terms of correct expression recognition. This is mainly due to the extensively collaborative scheme of our method, where the prediction scores are refined by its collaborators. Importantly, we show the effect of using dense vs. sparse representations and also how the TOP-landmarks assist in achieving substantial improvement. As shown, the sparse representations not only reduce the computational burden but also lead to a more accurate system through less over-fitting and better generalization, raising the accuracy from 93.70% to 94.50%.
Additionally, we report that by using TOP-landmarks only, an accuracy of 71.34% is achieved. However, this behavior is expected. The reason is that TOP-L alone provide insufficient information for a significant performance gain, since they contain only the landmark information and miss the important facial information stored in the texture and depth facial images. On the other hand, by using TOP-landmarks in collaboration with the rest of the framework, our method reaches a promising accuracy of 99.69%. Additionally, the accuracy of 98.78% achieved by Dense + TOP-landmarks also indicates the effectiveness of TOP-landmarks, showing that they provide a significant amount of information from multi-views and are, therefore, a key component of the network. Lastly, our solution is not limited by the length of the input video frames and, therefore, we do not conduct experiments using key-frames only.

Towards spontaneous 4D FER
Due to its complex nature, the BP4D-Spontaneous dataset is often used in the literature to challenge the performance of FER systems. This is mainly because of the spontaneous facial expressions it presents, which is why several methods compete to validate their ability to generalize. For a similar reason, in Table 8, we report our experimental results in the form of two sub-tables. The recognition results on this dataset are compared against the competing methods in Table 8(a). Although our Dense + TOP-L method offers a comparable recognition performance, the superior performance obtained with our Sparse + TOP-L method highlights its dominance.
More importantly, following the settings in [28,60], we evaluate the results of our cross-dataset experiments in Table 8(b). For this purpose, the BU-4DFE dataset is used for training, while a subset of the BP4D-Spontaneous dataset (i.e., Task 1 and Task 8, consisting of happy and disgust faces) is used for testing. As shown in the accuracy column of Table 8(b), a similar conclusion can be drawn in favor of our method, showing its effective performance. The results indicate that our method carries the potential to generalize to spontaneous situations better than other methods.

Role of multi-views
The results depicted in Fig. 7 compare the contribution and effect of each of the multi-views. For the BU-4DFE dataset, the results shown in Fig. 7a indicate that our method performs better when all the views are taken into account. For example, when we use only the dense representations over multi-views in our model, a promising accuracy of 93.70% is still achieved, which shows the effectiveness of the multi-views. A similar performance trend is shown for the BP4D-Spontaneous dataset in Fig. 7b. Note that the lower recognition accuracy on the BP4D-Spontaneous dataset, compared to the BU-4DFE dataset, is due to its complicated nature and spontaneous expressions. Importantly, although the frontal view is more effective than the left and right views, more effective results are achieved when a joint recognition is performed with the help of all the multi-views. More importantly, as discussed previously, a higher accuracy is achieved when both TOP-landmarks and sparse representations are utilized.
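For completeness, a minimal sketch of one plausible score-level fusion rule over the three views is given below; averaging the per-view softmax scores is our assumption, as the exact fusion rule is not spelled out in this excerpt:

```python
import numpy as np

def fuse_views(left, front, right):
    """Average the per-view class scores, then pick the best class."""
    return np.stack([left, front, right]).mean(axis=0).argmax(axis=-1)

# e.g., softmax scores for 4 clips over 6 expressions from each view
scores = [np.random.dirichlet(np.ones(6), size=4) for _ in range(3)]
labels = fuse_views(*scores)    # fused expression label per clip
```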

Conclusions
We proposed a sparsity-aware deep learning approach to automate 3D/4D FER. First, we combated the problem of data limitation for deep learning by introducing ∞-augmentation. This method uses the projected RGB and depth map images and then proceeds to randomly concatenate their channels in an iterative process. Second, we explained the idea of TOP-landmarks to capture the facial deformations encoded in the 3D landmarks. TOP-landmarks store the facial features from three orthogonal planes using a distance-based approach. Importantly, we presented our sparsity-aware deep network, where the convolutional deep features are used to compute deep sparse features, which are then used to train an LSTM network and recognize expressions collaboratively with TOP-landmarks. With a promising accuracy of 99.69% on the BU-4DFE dataset, our method outperformed the existing state-of-the-art 3D/4D FER solutions in terms of expression recognition accuracy.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 2.
Fig. 2. Overview of the proposed sparsity-aware deep learning for 4D affect recognition. Given 3D/4D data, we first compute multi-views (left, front and right) of the 2D projected images in the depth and texture domains. Our framework optionally augments the incoming data to ensure better model training. The resultant data is then fed forward to first extract deep features and then compute the sparse features, which are learned alongside the extracted TOP-landmarks. A final collaborative decision is then performed to predict facial expressions.

Fig. 3.
Fig. 3. Workflow of the proposed ∞-augmentation method to generate new images.

Fig. 4.
Fig. 4. Augmented images generated using our proposed ∞-augmentation method for each of the six facial expressions. The rows represent the rotation angles at which the expressions are generated, while the columns correspond to the facial expressions. For each expression, we show three different profiles (right, front and left) and twenty-five randomly selected samples to validate the effectiveness of our proposed augmentation method. The different colors help in visualizing how the underlying structure of the generated faces differs while still exhibiting similarities, which fundamentally helps a deep learning network identify meaningful patterns.

Fig. 5.
Fig. 5. Performance trend of the regular augmentation and the proposed ∞-augmentation method on the BU-4DFE dataset for 4D facial expression recognition (FER).

Table 1
Summary of the datasets used in our experiments for 3D/4D FER.

Table 2
Evaluation of the proposed ∞-augmentation method on the BU-4DFE dataset for 4D facial expression recognition (FER). The values in brackets with up-arrows refer to the increase in accuracy (%) due to augmentation.

Table 3
Comparison of computational costs (training time to completion) on different datasets using a GP100GL GPU (Tesla P100-PCIE).

Table 5
Performance (%) comparison of 3D facial expression recognition with the state-of-the-art methods on the BU-3DFE Subset I dataset.

Method                   Features                             Accuracy (%)
[56]                     coords., normals, shape index        84.50
Yang et al. [14]         depth, normals, curv./scattering     84.80
Li et al. [18]           meshHOG/SIFT                         86.32
Li et al. [16]           depth, RGB, deep feature             86.86
Oyedotun et al. [33]     depth, RGB, deep feature             89.31
Ours (Sparse + TOP-L)    depth, RGB, deep feature             93.64

Table 6
Performance (%) comparison of 3D facial expression recognition with the state-of-the-art methods on the BU-3DFE Subset II and Bosphorus datasets.

Table 7
Performance (%) comparison of 4D facial expression recognition with the state-of-the-art methods on the BU-4DFE dataset. [TOP-L = TOP-landmarks]

Method                   Protocol, Data              Accuracy (%)
[20]                     10-CV, Full sequence        75.82
Xue et al. [57]          10-CV, Full sequence        78.80
Sun et al. [35]          10-CV, -                    83.70
Zhen et al. [19]         10-CV, Full sequence        87.06
Yao et al. [39]          10-CV, Key-frame            87.61
Fang et al. [13]         10-CV, -                    91.00
Li et al. [22]           10-CV, Full sequence        92.22
Ben Amor et al. [24]     10-CV, Full sequence        93.21
Zhen et al. [23]         10-CV, Full sequence        94.18
Bejaoui et al. [41]      10-CV, Full sequence        94.20
Zhen et al. [23]         10-CV, Key-frame            95.

Table 8
Performance (%) comparison of 4D facial expression recognition with the state-of-the-art methods on the BP4D-Spontaneous dataset. [TOP-L = TOP-landmarks]

Method                   Accuracy (%)
[60]                     75.60
Zhen et al. [60]         81.70
Ours (Sparse + TOP-L)    82.86