SeqFace: Learning discriminative features by using face sequences

Deep convolutional neural networks (CNNs) have greatly improved face recognition (FR) performance in recent years. Almost all CNNs in FR are trained on carefully labeled datasets containing plenty of identities. However, such high-quality datasets are very expensive to collect, which prevents many researchers from achieving state-of-the-art performance. In this paper, a framework called SeqFace is proposed for learning discriminative face features. Besides a traditional identity training dataset, SeqFace can train CNNs by using an additional dataset that includes a large number of face sequences collected from videos. Moreover, label smoothing regularization (LSR) and a newly proposed discriminative sequence agent (DSA) loss are employed to enhance the discriminative power of deep face features by making full use of the sequence data. With only a single ResNet model, the method achieves very competitive performance on several face recognition benchmarks, including LFW, YTF, CFP, AgeDB, and MegaFace. The code and model are publicly available at the website https://github.com/

usually combined together for learning face features. All these loss functions aim to maximize the inter-identity variations and minimize the intra-identity variations under a certain metric space. No matter which loss functions are applied, these methods share the same type of training data, called identity data in this paper.
An identity dataset includes M faces of N identities, and each face in the dataset is clearly labeled as the image of the i-th (0 ≤ i < N) identity. Currently, most public or private datasets for training deep face features, such as CASIA [11], MS-Celeb-1M [12], and CelebFaces [13], are identity datasets. However, a large-scale high-quality identity dataset is very expensive to construct, since it costs a great deal of effort and money. Identity data need two kinds of information: face images and identity annotations. Identities in most public and private datasets are celebrities, because celebrity photos are rather easily crawled from the Internet and annotated. However, a celebrity dataset might not be a satisfactory training dataset if there are obvious differences between the evaluated faces and the celebrity faces in age, race, pose, and so on. Besides facial images collected from the Internet, videos (movies, TV programs, surveillance videos, etc.) can also provide large quantities of face images, but few works have utilized these face images so far because labeling identities is relatively difficult. Face detection and tracking on videos can automatically generate data with many face sequences, and each sequence contains several faces of one identity. We call this type of data sequence data.
A sequence dataset includes M faces of N sequences, and each face is labeled as the image of the i-th (0 ≤ i < N) sequence. A large-scale high-quality sequence dataset can be constructed efficiently and automatically by using state-of-the-art face detection and tracking methods. Although face sequences are broadly used in video FR applications, previous works have rarely utilized these unlabeled sequence datasets as training data to learn face features, because identity labels cannot be easily assigned. However, we know that faces in one sequence should belong to one identity, and this property can be exploited to reduce the intra-identity variations while training CNN models for face features.
In this paper, we propose SeqFace to learn discriminative face features on both identity data and sequence data (see Figure 1). SeqFace aims to make full use of sequence data in training. In SeqFace, a CNN model is jointly supervised by two loss functions. The first one is a chief softmax or angular margin softmax loss with label smoothing regularization (LSR). The second one is a discriminative sequence agent (DSA) loss. Together, these two losses maximize the inter-identity variations and minimize the intra-identity variations. With the help of sequence data, CNN models can be trained to produce highly discriminative features in SeqFace.
To summarize, our major contributions are as follows:
1) We present SeqFace to learn discriminative face features. Besides the traditional identity data, unlabeled sequence data are used as training data for the first time to enhance the discriminative power of face features, since faces in a sequence should belong to one identity.
2) To make full use of sequence data, we employ LSR to help the softmax-like loss deal with sequence data. A new DSA loss function, which contributes to the intra-class compactness and the inter-class dispersion of the features, is also proposed to train CNNs. Experiments demonstrate that both LSR and the DSA loss boost FR performance greatly.
3) We conduct experiments on several popular and challenging FR benchmark datasets with a single ResNet-64, and these experiments demonstrate that the proposed SeqFace obtains state-of-the-art performance on these benchmarks.
The paper is organized as follows. Section 2 gives a brief introduction to previous works related to the proposed method. Section 3 provides a detailed description of the proposed SeqFace for learning face features on sequence data. Section 4 analyzes the performance of SeqFace on public datasets. Finally, Section 5 draws the conclusions and provides a discussion.

RELATED WORKS
In this section, we briefly review works on deep face recognition and works related to face sequences in FR.

Deep face recognition
Deep face recognition is one of the most active research fields and has achieved a series of breakthroughs in recent years thanks to the great success of CNNs [14-17]. Many methods [3,4,13,18-20] have shown that CNNs outperform humans in FR on some benchmark datasets.
Face features are discriminative if their intra-class compactness and inter-class separability are well maximized. The loss function should be carefully designed to accomplish this goal. FR was first treated as a multi-class classification problem, with CNN models supervised by the classical softmax loss in many methods [3,4], but the softmax loss cannot fully guarantee the above goal in theory. Later, metric learning loss functions, such as the contrastive loss [10,11] and the triplet loss [9,18], were applied and boosted FR performance greatly. However, their reliance on well-designed sample mining strategies restricts the application of these losses. Recently, angular margin softmax [5,6,8] and normalization [7,21,22] based methods have been proposed and achieve superior performance, since they encourage larger inter-class and smaller intra-class variance at the same time and are built on the classical softmax loss. Other loss functions [23,24] based on metric losses also perform effectively on FR. Moreover, some auxiliary losses [25,26] are employed to train models together with classification loss functions. More recently, some approaches focus on hard examples [27] or AutoML search [28,29] to obtain better loss functions. Although these loss functions achieve better and better performance on FR, they all depend on fully labeled identity datasets, such as MS-Celeb-1M and CASIA, and cannot treat face sequences as training data to learn discriminative face features.

Sequences in face recognition
In many applications, sequences or image sets are the most natural form of input to the FR system. Video face recognition methods [30-37] based on face sequences, or face sets, are also expected to achieve better performance than those based on individual images. Most of these studies attempt to utilize the redundant information in face sequences/sets to improve recognition performance, but not to learn discriminative features from sequence data. Recently, some approaches [34,36-39] aim to learn deep video features for video face recognition. In [37], large-scale unlabeled face sequences are employed as training data, but these sequence data are only utilized to learn transformations between the image and video domains. The authors of [38] trained a self-supervised Siamese network to obtain the features of a face cluster instead of a single face. The work in [39] also aims to cluster faces in videos more accurately. To conclude, learning discriminative face features still depends on traditional large-scale identity datasets in these deep CNN approaches to video FR.

SeqFace framework
The proposed SeqFace is a framework for learning discriminative face features on identity datasets and sequence datasets simultaneously.
In the identity dataset, faces of one identity are labeled with the same $ID_{identity}$. In the sequence dataset, faces in one sequence are labeled with the same $ID_{sequence}$. Two faces with different $ID_{identity}$ must belong to different identities, but two faces with different $ID_{sequence}$ might (or might not) belong to one identity. Considering these two datasets together, there are two circumstances: identity overlap between the two datasets either exists or does not exist.
SeqFace can deal with both circumstances. However, in order to achieve better performance, we recommend removing this identity overlap before training, because more constraints can then be added to the loss functions, as discussed later. Fortunately, removing the identity overlap is not a time-consuming task in many real scenarios. For example, it is almost certain that people in an Asian street surveillance video will not appear in the MS-Celeb-1M [12] dataset.
In SeqFace, a CNN model (ResNet-like models in our implementation) is jointly supervised by one chief classification loss and one auxiliary loss. The final loss can be formulated as

$$\mathcal{L} = \mathcal{L}_S + \lambda \mathcal{L}_D, \quad (1)$$

where $\mathcal{L}_S$ is the chief classification loss, $\mathcal{L}_D$ is the auxiliary loss, and $\lambda$ is a parameter used to balance the two loss functions. Similar to many methods, we also treat the FR problem as a classification task to train CNNs, and CNNs are mainly supervised by a chief classification loss, such as the softmax loss or the A-Softmax loss in SphereFace [8]. However, although each face in identity data is labeled as belonging to one class (identity) in the classification loss, a face in sequence data cannot be assigned to any class (identity) in the classification loss. A traditional classification loss therefore cannot deal with sequence data. In SeqFace, we employ LSR in the chief classification loss to solve this problem, as discussed in Section 3.2.
We know that faces in one sequence certainly belong to one identity. Therefore, if a loss encourages intra-sequence feature compactness and does not penalize inter-sequence feature compactness, it can supervise CNNs to learn discriminative face features on sequence data, and it can naturally deal with identity data too. Because this loss mainly affects the intra-sequence and intra-identity compactness, it has to be an auxiliary loss. The center loss [25] is such a loss, but it only concerns the intra-identity and intra-sequence compactness. In order to make full use of sequence information, a DSA loss is presented as the auxiliary loss in SeqFace, as discussed in Section 3.3.

Label smoothing regularization
The softmax loss is applied to supervise CNN classification, and its simplicity and probabilistic interpretation make it widely adopted in FR. The softmax loss is the combination of a softmax function and a cross-entropy loss, and the cross-entropy loss is formulated as

$$\mathcal{L}_S = -\sum_{i=1}^{C} q(i) \log p(i), \quad (2)$$

where $C$ is the number of classes, $p(i) \in [0, 1]$ is the predicted probability (the output of the softmax function) of the input belonging to class $i$, and $q(i)$ is the ground-truth distribution defined as

$$q(i) = \begin{cases} 1, & i = y \\ 0, & i \ne y \end{cases} \quad (3)$$

where $y$ is the ground-truth class label of the input.
In the FR problem, q(y) should be set to 1 for each face belonging to the y-th identity in identity data. The softmax loss cannot be directly employed to deal with sequence data because two sequences might belong to the same identity. We employ the label smoothing regularization (LSR) introduced in [40] to deal with sequence faces in the softmax loss. In LSR, the value of q(i) can be a floating-point value between 0 and 1 for an input that cannot be clearly labeled as any class.
If identity overlap does not exist, we define $q(i) = 1/C$ as in [41] for faces in sequence data, in order to keep $\sum_{i=1}^{C} q(i) = 1$ (required in the softmax loss). Since $C$ is usually a large number, $1/C$ is close to 0. Furthermore, $q(i) = 1/C$ means that the feature of a face in sequence data should not be close to the feature of any identity in the identity data. Therefore, the cross-entropy loss is rewritten as

$$\mathcal{L}_S = -(1 - Z) \log p(y) - \frac{Z}{C} \sum_{i=1}^{C} \log p(i), \quad (4)$$

where $Z = 0$ for an input face from identity data and $Z = 1$ for an input face from sequence data. If we are not sure whether identity overlap exists, all $q(i)$ are set to 0 for all input faces in sequence data; $\mathcal{L}_S$ is then always 0 for these faces, so they contribute nothing to the classification loss.
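The two cases of the LSR cross-entropy can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' Caffe implementation; the function name `lsr_cross_entropy` and the convention that `label=None` marks a sequence face (Z = 1) are assumptions made for this example.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log of the softmax probabilities p(i)."""
    z = logits - np.max(logits)
    return z - np.log(np.sum(np.exp(z)))

def lsr_cross_entropy(logits, label=None):
    """Cross-entropy with label smoothing for sequence faces.

    label is an int for an identity face (Z = 0, one-hot target).
    label is None for a sequence face (Z = 1, uniform target 1/C).
    The None convention is a hypothetical choice for this sketch.
    """
    log_p = log_softmax(np.asarray(logits, dtype=np.float64))
    if label is None:
        # Uniform target: -(1/C) * sum_i log p(i)
        return float(-np.mean(log_p))
    # One-hot target: -log p(y)
    return float(-log_p[label])
```

For a sequence face the loss is smallest when the prediction is uniform over all identities, which matches the intent that such a face should not be pulled towards any labeled identity.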
LSR can also be integrated into other softmax-like classification loss functions. In our implementation, a feature-normalized SphereFace (L2-SphereFace for short in this paper, the same as F-Norm SphereFace in [5]) is applied as the chief classification loss. An additional L2 constraint is added to the regular SphereFace [8]; that is, the input feature $\vec{x}_k$ is first normalized and then scaled by a scalar parameter $s$ ($s \cdot \vec{x}_k / \|\vec{x}_k\|_2$). Therefore, the decision boundary of L2-SphereFace under binary classification is $\cos(m\theta_1) - \cos(\theta_2) = 0$ for class 1 and $\cos(\theta_1) - \cos(m\theta_2) = 0$ for class 2. In our implementation, the scale parameter $s$ and the margin $m$ are set to 32.0 and 4, respectively. Experiments in Section 4.2 demonstrate the effectiveness of LSR.
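The L2 constraint described above amounts to projecting each feature onto a sphere of fixed radius before the classification layer. The following NumPy sketch illustrates only that normalization step; the function name, the `scale` argument name, and the `eps` guard against zero vectors are assumptions for this example, and the default of 32.0 follows the value reported in the text.

```python
import numpy as np

def normalize_and_scale(features, scale=32.0, eps=1e-12):
    """L2-normalize each row of `features`, then multiply by `scale`."""
    features = np.asarray(features, dtype=np.float64)
    norms = np.linalg.norm(features, axis=-1, keepdims=True)
    # eps avoids division by zero for an all-zero feature (assumption).
    return scale * features / np.maximum(norms, eps)
```

After this step every feature has the same norm, so the classification loss can only separate classes through the angles between features and class weights.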

DSA loss
In this section, we further propose a new auxiliary loss, namely discriminative sequence agent loss (DSA Loss), which concerns the intra-class compactness and the inter-class dispersion, and deals with sequence data simultaneously.
First, considering the traditional classification problem with an identity dataset, we define $d_{k,n}$ (Equation 5) as the distance between the feature $\vec{x}_k$ of the $k$-th training sample and the feature center $\vec{c}_n$ of the $n$-th class (identity); $d_{k,n}$ is equivalent to the Euclidean distance. Note that if $\vec{x}_k$ and $\vec{c}_n$ are normalized, $d_{k,n}$ can be reformulated in terms of $\theta_{k,n}$ (Equation 6), where $\theta_{k,n}$ denotes the angle between $\vec{x}_k$ and $\vec{c}_n$, so that $d_{k,n}$ can be regarded as the angular distance. Since our target is to reduce the distance between $\vec{x}_k$ and $\vec{c}_{y_k}$ and to enlarge the distances between $\vec{x}_k$ and $\vec{c}_n$ for all $n \ne y_k$, where $y_k$ is the label of the $k$-th training sample, a discriminative loss $\mathcal{L}_{k,n}$ (Equation 7) is formulated with two parameters, $\alpha \in [1, +\infty)$ and $\beta \in [0, +\infty)$, that adjust the discriminative power of the learned features. In the final loss function $\mathcal{L}_D$ (Equation 8), the parameter $\gamma$ balances the intra-class compactness and the inter-class dispersion, and $N$ is the number of identities (classes) in the identity dataset. We introduce another parameter $p$ as the probability that the $n$-th center is employed in computing the final loss, because $N$ might be huge and computing all $\mathcal{L}_{k,n}$ in each iteration would be time-consuming. $b(1, p)$ denotes the Bernoulli distribution with probability $p$.
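The link between the Euclidean and the angular form of the distance can be made explicit with a short worked step. The derivation below assumes, purely for illustration, that the distance is the squared Euclidean distance; any constant factor used in the original definition would simply carry through.

```latex
\begin{aligned}
\|\vec{x}_k - \vec{c}_n\|_2^2
  &= \|\vec{x}_k\|_2^2 + \|\vec{c}_n\|_2^2 - 2\,\vec{x}_k^{\top}\vec{c}_n \\
  &= 2 - 2\cos\theta_{k,n}
  \qquad \text{when } \|\vec{x}_k\|_2 = \|\vec{c}_n\|_2 = 1 .
\end{aligned}
```

Under this assumption the distance depends only on the angle between the feature and the center once both are normalized, which is why it can be read as an angular distance.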
The gradients of $\mathcal{L}_D$ with respect to $\vec{x}_k$ and the update equation of $\vec{c}_n$ are computed similarly to those in the center loss, where $\delta(\text{condition}) = 1$ if the condition is satisfied and $\delta(\text{condition}) = 0$ otherwise. According to Equation (8), the feature $\vec{x}_k$ is pulled towards the feature center $\vec{c}_{y_k}$ of its identity and is pushed away from the feature centers of other identities randomly selected in each training iteration.

DSA for sequence data
Taking sequence data into account, Equation (8) is slightly modified to compute the final DSA loss. We assume that there are $C$ identities (labeled $0$ to $C-1$) in the identity dataset and $H$ sequences (labeled $C$ to $C+H-1$) in the sequence dataset. The discriminative loss (Equation 7) is reformulated with an additional Boolean indicator that states whether the distance between the $n$-th and the $y_k$-th features should be enlarged. If identity overlap exists between the identity and sequence datasets, the indicator is set to 1 only when the $k$-th sample is an identity sample and $n < C$. If no identity overlap exists, it is always set to 1. Moreover, the final loss function (Equation 8) is rewritten with two further Boolean values. If the $k$-th sample is selected from the identity dataset, both are set to 1. If the $k$-th sample is selected from the sequence dataset, only the first is set to 1. That is to say, if the $k$-th sample is in the identity dataset, $\vec{x}_k$ is pushed away from the feature centers of other identities and of all sequences; otherwise, $\vec{x}_k$ is pushed away only from the feature centers of identities. Figure 2 illustrates two examples.
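The rules for which centers a sample is pushed away from can be summarized as a Boolean mask over the C identity centers and H sequence centers. The sketch below encodes only these selection rules as read from the text, not the loss value itself; the function `push_mask` and its argument names are hypothetical.

```python
import numpy as np

def push_mask(label, from_identity_data, num_identities, num_sequences,
              overlap=False):
    """Return a Boolean array marking the centers a sample is pushed from.

    Centers 0..C-1 belong to identities, C..C+H-1 to sequences.
    """
    n = np.arange(num_identities + num_sequences)
    is_identity_center = n < num_identities

    # Identity samples may be pushed from identity and sequence centers;
    # sequence samples only from identity centers.
    if from_identity_data:
        gate = np.ones_like(is_identity_center)
    else:
        gate = is_identity_center.copy()

    # With identity overlap, only identity samples are pushed, and only
    # from identity centers; without overlap the indicator is always on.
    if overlap:
        if from_identity_data:
            indicator = is_identity_center
        else:
            indicator = np.zeros_like(is_identity_center)
    else:
        indicator = np.ones_like(is_identity_center)

    # A sample is never pushed away from its own center.
    return gate & indicator & (n != label)
```

With three identities and two sequences, an identity sample with label 0 is pushed from all four other centers when there is no overlap, while a sequence sample with label 4 is pushed only from the three identity centers.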
There are four parameters ($\gamma$, $\alpha$, $\beta$, and $p$) in the DSA loss function. The parameter $\gamma$ can be set to 0.5, since we are concerned with both the intra-class compactness and the inter-class dispersion. The parameters $\alpha$ and $\beta$ are used to adjust the discriminative power of the features. Larger values are preferred, but they make convergence in training more difficult. According to our experiments, $\alpha = 2.0$ and $\beta = 1.0$ can be applied in most applications. The parameter $p$ is used to select a subset of the identities/sequences when computing $\mathcal{L}_{k,n}$, in order to reduce the computing cost. The value of $p$ can be set flexibly according to the computing resources available in real applications.
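The role of the sampling probability p can be illustrated with a small sketch in which each center is kept independently with probability p in a given iteration. This is an assumed reading of the Bernoulli selection described above, not the authors' code; `sample_centers` and its arguments are hypothetical names.

```python
import numpy as np

def sample_centers(num_centers, p, own_label, rng):
    """Pick the centers used for the push term in one iteration.

    Each center is kept with probability p (one Bernoulli draw per center).
    The sample's own center is excluded because it belongs to the pull term.
    """
    selected = rng.random(num_centers) < p
    selected[own_label] = False
    return np.flatnonzero(selected)

# Usage: on average about num_centers * p centers are visited per sample.
rng = np.random.default_rng(0)
chosen = sample_centers(num_centers=100_000, p=0.001, own_label=42, rng=rng)
```

With a very large number of centers, a small p keeps the per-iteration cost roughly proportional to the expected number of selected centers rather than to the total number of centers.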

MNIST example
We perform a toy experiment on the MNIST dataset [42] with our DSA loss. LeNet++ [25], a deeper and wider version of LeNet, is employed. The output of the last hidden layer of the model is restricted to 2 dimensions for easy visualization (see Figure 3). For comparison, we train 4 models supervised by a softmax loss, a softmax loss and a center loss, a softmax loss and a DSA loss, and a softmax loss and a DSA loss with normalized $\vec{x}_k$ and $\vec{c}_n$, respectively. We set $\gamma = 0.5$, $\alpha = 2.0$, $\beta = 1.0$, and $p = 1.0$ in the DSA loss. The loss weights of the center/DSA loss are set to 0.04. All models are trained with a batch size of 32. The learning rate begins at 0.01 and is divided by 10 at 14K iterations. Training finishes at 20K iterations. As shown in Figure 3, the features learned with the DSA loss are more discriminative. The feature dispersion in Figure 3(b) and Figure 3(c) demonstrates that the DSA loss enlarges inter-class distances and that the feature centers of different classes are pushed away from each other. Table 1 lists the classification accuracies of the 4 models on the MNIST test set. From the results, we make the following observations: (1) the center loss and the DSA loss both improve the classification performance; (2) as an auxiliary loss, the proposed DSA loss outperforms the center loss.

Implementation details
In our experiments, all the face images and their landmarks are detected by MTCNN [43]. The faces are aligned by a similarity transformation as in [44] and are cropped to 144 × 144 RGB images (randomly cropped to 128 × 128 in training). Each pixel in the RGB images is normalized by subtracting 127.5 and then dividing by 128.
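The cropping and pixel normalization above can be sketched as follows. Face detection and alignment are not covered; the input is assumed to be an already aligned 144 × 144 RGB array. The function name `preprocess` is hypothetical, and the center crop used when no random generator is supplied is an assumption, since the text only specifies random cropping in training.

```python
import numpy as np

def preprocess(face, crop=128, rng=None):
    """Crop an aligned face and normalize pixels to roughly [-1, 1].

    face: uint8 array of shape (144, 144, 3).
    rng:  numpy Generator for a random crop; None gives a center crop
          (the center crop is an assumption of this sketch).
    """
    height, width = face.shape[:2]
    if rng is not None:
        top = int(rng.integers(0, height - crop + 1))
        left = int(rng.integers(0, width - crop + 1))
    else:
        top = (height - crop) // 2
        left = (width - crop) // 2
    patch = face[top:top + crop, left:left + crop].astype(np.float32)
    # Subtract 127.5, then divide by 128, as described in the text.
    return (patch - 127.5) / 128.0
```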

Training and testing
Caffe [45] is used to implement the CNN models. Different CNN models are employed in the experiments and are introduced in the corresponding sections. All weights of the auxiliary losses ($\lambda$ in Equation 1) are set to 0.04 in the experiments. Euclidean distances (that is, $\vec{x}_k$ and $\vec{c}_n$ in Equation 5 are not normalized) are used in the DSA loss functions in this section. At the testing stage, features are extracted directly from the last fully connected layer of the CNN using only the original image, and the cosine similarity is used to measure the feature distance. More details are presented in the corresponding sections. The code and model are publicly available. All verification accuracies of SeqFace are calculated by averaging the results of 5 trained models.
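The test-time comparison of two extracted features by cosine similarity can be written in a few lines. This is a generic sketch of the measure; the function name and the `eps` guard are assumptions for illustration.

```python
import numpy as np

def cosine_similarity(feature_a, feature_b, eps=1e-12):
    """Cosine of the angle between two feature vectors."""
    a = np.asarray(feature_a, dtype=np.float64)
    b = np.asarray(feature_b, dtype=np.float64)
    denom = max(np.linalg.norm(a) * np.linalg.norm(b), eps)
    return float(np.dot(a, b) / denom)
```

A verification decision would then compare this score against a threshold chosen on held-out pairs; the threshold itself is benchmark specific and not given here.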

Exploration experiment
In this section, the employed CNN is a ResNet-20 network similar to that in [8], and it is trained on the publicly available CASIA-WebFace dataset [11]. To demonstrate the effectiveness of LSR and the DSA loss on a dataset with partial annotations, we train 7 models (see Table 2) for comparison.
First, we use a regular softmax loss (Model I) and an L2-SphereFace loss (Model II) to train 2 CNN models, respectively. Only dataset A is used as the training dataset. The verification accuracies demonstrate that L2-SphereFace greatly boosts the performance.
Then, the center loss and our DSA loss are applied as the auxiliary loss to jointly supervise the CNN models (Model III and Model IV, respectively) together with an L2-SphereFace loss. The reported results demonstrate that both auxiliary loss functions have a positive effect on FR performance, and that our DSA loss outperforms the center loss.
Moreover, sequence data (dataset B) are added to train a CNN model supervised by an LSR-based L2-SphereFace loss (Model V). According to the results, we can conclude that LSR also plays a positive role in training. LSR and the DSA loss then work together to train a CNN model (Model VI), and we make the following observation: SeqFace greatly enhances the discriminative power of the learned features.
Last, we also train a model (Model VII) on the full CASIA-WebFace dataset with an L2-SphereFace loss. Comparing the accuracies of Model VI and Model VII, we can conclude that complete identity annotation is naturally preferred in training datasets, but the small gap shows that competitive performance can also be achieved by making full use of sequence information.
A ResNet-27 model (the architecture is shown in Figure 4) and a ResNet-64 [8] are employed for evaluation. To accelerate the training process, we first train a baseline model under the supervision of the regular L2-SphereFace loss on the identity dataset only, and then fine-tune the baseline model by using SeqFace. Our models are trained with a batch size of 128 on 4 Titan X GPUs. The learning rate begins at 0.01 and is divided by 10 at 300K and 600K iterations. Training finishes at 800K iterations. The models are jointly supervised by an LSR-L2-SphereFace loss and a DSA loss, and are trained on the MS-Celeb-1M and our Celeb-Seq datasets described below. In the DSA loss, $\gamma = 0.5$, $\alpha = 2.0$, and $\beta = 1.0$. The parameter $p$ is set to 0.001 because of the large number of sequences in the Celeb-Seq dataset.
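The step schedule described above (start at 0.01, divide by 10 at 300K and 600K iterations) can be expressed as a small helper. This is an illustrative sketch rather than the Caffe solver configuration used by the authors; `step_lr` is a hypothetical name, and applying each drop exactly at the milestone iteration is an assumption.

```python
def step_lr(iteration, base_lr=0.01, milestones=(300_000, 600_000), factor=0.1):
    """Learning rate after all milestones reached by `iteration`."""
    drops = sum(1 for milestone in milestones if iteration >= milestone)
    return base_lr * factor ** drops
```

The same helper covers the MNIST schedule reported earlier by passing `milestones=(14_000,)`.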

Training datasets
A refined MS-Celeb-1M (4M images and 79K identities) provided by [44] is used as the identity dataset. Since there are no public sequence datasets for training deep CNNs, we first construct a sequence dataset, Celeb-Seq-Overlap, which includes about 2.5M face images of 550K face sequences. We first extract about 800K face sequences by using MTCNN [43] and the Kalman-Consensus Filter (KCF) [51] to detect and track video faces from 32 online TV channels, and then compute image features with the model provided by SphereFace [8]. Noisy faces and nearly duplicate faces within one sequence are discarded from the dataset automatically. In Celeb-Seq-Overlap, some celebrities can also be found in MS-Celeb-1M. Faces of identities that overlap with MS-Celeb-1M are then removed to generate another dataset, Celeb-Seq. We also remove face images belonging to identities that appear in the LFW and YTF test sets. Some face sequences in the Celeb-Seq dataset are shown in Figure 5.

Evaluation on LFW and YTF
LFW [46] and YTF [47] are challenging benchmarks released for face verification. The LFW dataset contains 13,233 faces of 5749 different identities, with large variations in pose, expression, and illumination. The YTF dataset includes 3425 videos of 1595 identities. We follow the unrestricted with labeled outside data protocol. To evaluate performance on YTF, the simple average feature of all faces in a video is used to compute the final score. Table 3 reports the verification performance of several methods and some commercial systems. To demonstrate the effectiveness of SeqFace, the performance of our baseline ResNet-27 is also reported in the table. SeqFace achieves the highest accuracies on these two benchmarks, even compared with commercial systems. Note that ArcFace employs improved ResNets [52]; it is reported in ArcFace that a regular 50-layer ResNet achieves 99.71% accuracy on LFW. Moreover, our ResNet-27 and ResNet-64 models achieve 99.50% and 99.67% at VR@FAR=0 on LFW. The models will be made publicly accessible online in the near future. Figure 6 gives an overview of all failure cases of our ResNet-64 model. The results also show that identity overlap between sequence data and identity data should be removed to achieve better performance. SeqFace is only a framework for making use of sequence data. We believe that new loss functions and modern networks (such as a deeper ResNet with improved residual units [52]) can be employed in SeqFace to further improve the performance.
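The YTF scoring rule, in which a video is represented by the simple average of its per-frame features, can be sketched as below. The function names are hypothetical, and scoring the two averaged features with cosine similarity is an assumption carried over from the testing setup described earlier.

```python
import numpy as np

def video_feature(frame_features):
    """Average the per-frame features of one video into a single vector."""
    return np.mean(np.asarray(frame_features, dtype=np.float64), axis=0)

def video_pair_score(frames_a, frames_b, eps=1e-12):
    """Cosine similarity between the averaged features of two videos."""
    fa = video_feature(frames_a)
    fb = video_feature(frames_b)
    denom = max(np.linalg.norm(fa) * np.linalg.norm(fb), eps)
    return float(np.dot(fa, fb) / denom)
```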

Evaluation on CFP and AgeDB
The CFP dataset [48] contains 500 identities, each with 10 frontal and 4 profile images. In this paper, we use the most challenging subset CFP-FP, which includes

Evaluation on MegaFace
MegaFace [50] is a challenging benchmark for evaluating the performance of FR methods at the million scale. It includes a probe set and a gallery set. The gallery set consists of more than 1 million images from 690K different identities. The probe set consists of two datasets: FaceScrub and FGNet. We evaluate the model on one of the three gallery sets (set 1) with the provided code for both the face identification and verification protocols. Table 5 shows that our models achieve competitive performance on the MegaFace benchmark. For a fair comparison with other methods, the features of FaceScrub are extracted directly from the faces with the landmarks provided by MegaFace. Moreover, since removing noisy images from MegaFace can dramatically improve the identification performance (from 82% to 98% in [5]), we report the performance of our models on both the original and the cleaned versions of the MegaFace dataset. The results in Table 5 demonstrate that SeqFace is a competitive method.

CONCLUSION
A large-scale high-quality dataset for training CNNs in FR is very expensive to construct. Face features learned on the datasets that are publicly available to researchers might not achieve satisfactory performance in some circumstances, for example, when evaluating Asian people in surveillance videos. Although a large number of face images can be collected in real situations, assigning labels to these images is still time-consuming. Fortunately, a dataset containing a large number of face sequences can be constructed efficiently by using face detection and tracking methods.
In this paper, we proposed a framework named SeqFace, which can utilize identity and sequence data together to learn highly discriminative face features. A chief classification loss and an auxiliary loss are combined to learn features from these two types of datasets. LSR is employed to help the chief loss deal with sequence input. The DSA loss was also proposed to supervise CNNs as an auxiliary loss. We achieved good results on several popular face benchmarks with only a simple ResNet model. We also believe that higher performance can be obtained if more advanced loss functions [5,6] and CNN architectures [52] are employed.
Compared with other face feature learning approaches, SeqFace can employ face sequences in addition to the traditional identity data to learn discriminative face features. As far as we know, SeqFace is the first framework to employ face sequences as training data to learn face features. Although identity overlap between identity and sequence data is prohibited in SeqFace, avoiding such overlap is not a time-consuming task in many real scenarios, as discussed in Section 4. Removing this limitation is also part of our future work. Moreover, SeqFace also has great potential to be applied in other similar fields, such as person re-identification.

FIGURE 1 In our SeqFace framework, the CNN model is trained on an identity dataset and a sequence dataset, and is supervised jointly by a chief LSR classification loss and an auxiliary DSA loss. Different sequences can belong to the same identity in sequence data.

FIGURE 2 Illustration of forces on sample features of identity data and sequence data. The $i$-th sample is from the identity dataset, and the $j$-th one is from the sequence dataset. $\vec{c}_1$, $\vec{c}_2$, and $\vec{c}_3$ are feature centers of the corresponding identities, and $\vec{c}_4$ and $\vec{c}_5$ are feature centers of the corresponding sequences. $y_i = 1$ and $y_j = 5$.

FIGURE 3 Visualization of the 2D feature distribution for the MNIST test set. The features of samples from different classes are denoted by points with different colors. Four CNNs are supervised by the following loss functions: (a) softmax loss; (b) softmax loss + center loss; (c) softmax loss + DSA loss with Euclidean distance; (d) softmax loss + DSA loss with angular distance.

FIGURE 4 The ResNet-27 architecture for the experiments. The CNN is jointly supervised by the LSR-L2-SphereFace loss and the DSA loss. ID denotes the input identity data, SEQ denotes the input sequence data, C denotes a convolution layer, P denotes a max-pooling layer, and FC denotes a fully connected layer.

TABLE 1 Accuracy on MNIST test set

WebFace to generate one in order to facilitate reproduction and comparison. To evaluate the effectiveness of sequence data, the 10,575 identities in the CASIA-WebFace dataset are randomly divided into two parts:

TABLE 2 Face verification accuracy on LFW dataset

Faces in dataset B are then randomly split into 32,996 sequences. Datasets A and B are treated as the identity dataset and the sequence dataset, respectively. It is clear that the CASIA-WebFace dataset contains full annotations, whereas the dataset A+B contains only partial annotations of the 10,575 identities.