AGE ESTIMATION USING SPECIFIC DOMAIN TRANSFER LEARNING

Nowadays, the use of deep neural networks in computer vision makes it possible to achieve higher accuracy in many learning tasks, such as face recognition and detection. However, automatic estimation of human age is still considered one of the most challenging facial tasks, demanding extra effort to reach an accuracy acceptable for real applications. In this paper, we attempt to obtain a satisfactory model that overcomes the overfitting problem by fine-tuning a CNN model, pre-trained on a face recognition task, to estimate real age. To make the model more robust, we evaluated it for real-age estimation on two types of datasets: on the constrained FG_NET dataset we achieved an MAE of 3.446, while on the unconstrained UTKFace dataset we achieved an MAE of 4.867. The experimental results of our approach outperform other state-of-the-art age estimation models on the benchmark datasets. We also fine-tuned the model for the age group classification task on the Adience dataset, where our model achieved an accuracy of 61.4%.


INTRODUCTION
To characterize human identity, different attributes can be derived from the facial image. Age is a crucial trait that can complement other significant biometrics, such as fingerprint and iris, to build a more realistic system for identification and verification of human identity [1]. Age estimation refers to the automatic process of predicting the real age of a person as an exact value or classifying the image into age groups represented by age ranges [2]. Estimating a person's real age is therefore much harder than merely classifying which age category the person belongs to. The age estimation problem is considered one of the most challenging facial tasks, affected by various internal factors, such as gender and race, as well as external factors, such as environmental conditions and facial expressions. Recently, there has been growing interest in automating age estimation systems, and great efforts have been made to advance this challenging task.
Deep learning is one of the new technologies that have been increasingly used in the field of computer vision, and there is no doubt that it outperforms traditional machine learning algorithms. Nonlinear features can be automatically extracted using Convolutional Neural Networks (CNNs) [3], whose power stems from their capability of hierarchically learning concepts across several layers. Despite the success of deep learning in many tasks, the accuracy of age estimation systems is still practically unacceptable.
Transfer learning [4] offers many benefits for deep learning-based models. It speeds up training on new data, requires less data than training from scratch and improves the performance of the network. Generally, there are two types of transfer learning: in general-domain transfer learning, knowledge is transferred from a general task to an unrelated target task, whereas specific-domain models are pre-trained on a task related or similar to the target task.
Face recognition and age estimation are different tasks; however, we can argue that they are correlated. Since VGGFace has already learned facial features and landmarks across a large number of images of the same person under different conditions, this learning can serve as an initialization for an age estimation system instead of random weights. To gain the advantage of reusing specific-domain pre-trained models, our model adopts VGGFace [5] as the CNN architecture. This network was chosen for several reasons. Firstly, it has achieved state-of-the-art results in the face recognition task. Secondly, VGGFace has been pre-trained on a large database containing 2.6 M images of 2.6 K people, which can prevent overfitting when fine-tuning the same model on small datasets for a related task. Thirdly, using a pre-trained model makes the training process faster while increasing the overall performance. Finally, age can be considered a facial trait, so the learning process will be easier than with general-domain transfer learning.
In this paper, we show that selecting good online data augmentation helps improve performance. Using a model pre-trained on a large dataset for face recognition has a positive impact on extracting age-related features and can prevent overfitting. Moreover, combining Global Average Pooling (GAP) with Fully Connected (FC) layers to build the classifier increases its ability to perform the age estimation task. Finally, treating age as a multi-class classification problem works well when the distribution over the dataset's classes is balanced.
The remainder of this paper is divided into five sections as follows: Section 2 gives a brief overview of related work on age estimation. The methodology, with the baseline network, is outlined in the third section. Section 4 presents the different experiments conducted to achieve the best performance, with the related results. A discussion of the results is given in the fifth section, and our conclusions are drawn in the final section.

RELATED WORK
To estimate real age, two main steps should be followed. Firstly, the image representation is extracted and the feature vector is formed. Secondly, an encoding algorithm is used to estimate the age [6]. Traditional methods carry out age estimation by extracting age-related features and then forming the feature vector. Before the different facial details are represented, the faces should be detected and aligned. Then, for the extraction of local and global features, a descriptor such as the Active Appearance Model (AAM) [7], Biologically Inspired Features (BIF) [8] or Local Binary Patterns (LBPs) [9] should be carefully chosen. The dimension of the resulting feature vector can be very high, so a reduction algorithm such as Principal Component Analysis (PCA) is used to reduce it. According to [10], the traditional techniques fall into several groups. Anthropometric-based methods [11][12][13] rely on face geometry; to measure geometric ratios correctly, faces should be in frontal view, since computing ratios from 2D images is sensitive to pose. Texture-based models [8][9] compute texture information directly from pixel intensities; effective descriptors such as BIF have been utilized in several works [14][15][16]. AAM [7] is another effective algorithm that combines shape and texture: models are learned from a number of training images, after which a parametric face model is generated by PCA. Aging pattern subspace methods [17][18] define the aging pattern as a sequence of facial images belonging to the same person. Finally, aging manifold methods [19][20] learn the aging pattern as a common trend across several subjects at different ages.
The features from different descriptors can also be fused to obtain a more robust system [21][22].
Since the development of deep learning technology, many researchers have turned to replacing the traditional techniques with it. CNNs [3] have been widely used in the computer vision field due to their effective ability to learn nonlinear features. Some recent works on age estimation using deep learning [23][24][25] have focused on models pre-trained on the ImageNet [26] dataset for the object classification task. Other approaches [27][28][29] found that using a model pre-trained on a specific domain related to age prediction, such as face recognition, can achieve better performance on age estimation. Yang et al. [30] trained their CNN model from scratch to derive different information from human faces, such as age, gender and race; their results were inferior to those of previous age estimation works based on traditional feature extraction techniques. Instead of extracting the features from the top layer, Wang et al. [18] obtained features from different layers and enhanced the system by adopting a manifold algorithm on top of the basic model. Liu et al. [28] built a multi-path model that combines a VGGFace network with two shallow VGG16 [31] networks; the feature vectors of the three networks were normalized and fed to the age estimator. Hu et al. [32] learned age from pairs of images belonging to the same person, finding the age difference by using the Kullback-Leibler divergence as the loss function. Rodríguez et al. [33] placed more attention on VGGFace [5], pre-trained on the face recognition task. Their model consists of two CNNs: a patch network that receives the low-resolution image, and an attention CNN fed with the high-resolution image to focus on the important regions of the face.
Estimating real age can be implemented as a multi-class classification task [34][35]. In this case, ages are represented as separate labels, and the classifier predicts the age class after learning on one of the available benchmarks annotated with age labels. Alternatively, a regression algorithm [36][37] can be used, since human ages are by nature continuous values. A newer approach that shows good results in age estimation is the ranking algorithm [38][39][40]. Instead of treating age as a multi-class task, the ranking algorithm converts the problem into several binary classification tasks, and the resulting age ranks are the aggregation of the outputs of the different classifiers. Rothe et al. [27] pre-trained VGG16 on a large unconstrained dataset, IMDB_WIKI [27], for the real-age estimation task; they treated age as a multi-class problem with a linear regression activation.
Antipov et al. [29] used VGG16 pre-trained on the face recognition task and encoded age with a soft classification. They found that pre-training on face recognition is more suitable for age and gender classification than general-task pre-training, while the multi-task pre-training strategy is useful when training the model from scratch. To overcome the problem of sample imbalance, Li et al. [25] used AlexNet for feature extraction with a cumulative hidden layer, whose main advantage is learning ages from faces of neighboring ages. Shang and Ai [41] separated the age-related features into different groups using the k-means++ clustering algorithm and retrained the network for each group to estimate the final age of each subject. Later, Zhang et al. [42] used the DEX method [27] to estimate real age, improving the system by extracting fine-grained features with an attention mechanism. A new loss function was proposed in [43] that derives the age distribution from the mean and variance of the ground-truth age.
A deep network can also be utilized as a feature extractor. Duan et al. [15] extracted features using CNNs and combined classification and regression: images were first classified into age groups using an Extreme Learning Machine (ELM) and the final age value was then regressed with an ELM regressor. Chang and Chen [38] used the scattering transform to extract Gabor coefficients and treated the age labels with a ranking algorithm, where aggregating the results of a series of binary classifiers yields the age ranks. Chen et al. [39] trained a set of CNNs on ordinal age labels; the different outputs were aggregated to predict the final age. Recently, Li et al. [44] extracted features using CNNs and fed them to BridgeNet, which consists of local regressors and gated networks. The gated networks weigh the regression results, so the final age is calculated as the weighted sum of the outputs of the local regressors.
The literature shows that reusing a model pre-trained on a general task, such as on ImageNet, requires more effort and deeper networks to achieve reasonable results for age estimation. A more powerful option is to pre-train the network on a specific task, such as face recognition or gender classification. Inspired by this observation, we use a model pre-trained on face recognition, which increases the system's ability to extract age-related features.

METHODOLOGY
In deep learning, transfer learning can be defined as the process of reusing a model that has already been trained for a specific task to perform a similar or related task [4]. The aim of using a pre-trained model is to take advantage of the features that have been extracted in the early layers instead of developing the model from scratch. Moreover, a pre-trained model reduces the computation time for training. Different policies can be followed when reusing pre-trained models:
1. Setting some late layers as trainable, so that their weights are fine-tuned for the new task. This policy can be used when a dataset with plentiful labels is available.
2. Freezing the convolutional base and adding classification layers, in case only a small dataset is available that is similar to the source dataset used in the pre-training stage.
3. Training the entire model from scratch, which needs extra computation time and power as well as a very large dataset.
To build the classification layers, we can use one of the following approaches:
1. Adding an FC layer, or a set of stacked FC layers, followed by a Softmax-activated layer for a classification task or a linearly-activated layer for a regression task.
2. Adding GAP, as proposed by Lin et al. [45], and connecting this layer directly to the output layer.
The main concept of GAP is that it reduces the dimension of each tensor by taking the average of each feature map. For example, given a tensor of dimension (h×w×d), GAP reduces it to (1×1×d) by averaging each (h×w) feature map. Moreover, this layer has a similar effect to an FC layer, except that it can avoid overfitting, since it has no parameters to optimize.
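The reduction performed by GAP can be illustrated with a small NumPy sketch (the tensor shape here is an arbitrary example, not taken from the paper):

```python
import numpy as np

# A feature tensor of shape (h, w, d): d feature maps of size h x w.
h, w, d = 7, 7, 512
features = np.random.rand(h, w, d)

# GAP averages each h x w feature map, reducing (h, w, d) to (1, 1, d),
# i.e. one scalar per feature map -- with no parameters to learn.
gap = features.mean(axis=(0, 1))

print(gap.shape)  # (512,)
```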
In this study, different approaches were investigated for reusing the base model and building the classifier, adopting deeper and wider schemes. The model was also examined in terms of the power of concatenating GAP with FC layers to improve performance. Both classification and regression were implemented to estimate real age, while classification was used for the age group task.

Network Architecture
The VGGFace model is based on the VGG16 architecture, which consists of thirteen convolutional layers arranged in five blocks, followed by the classifier layers. In the classification block there are two FC layers, each with a 4096-dimensional output, with a rectification activation layer (Rectified Linear Unit, ReLU) between them. After each convolutional block, a max pooling layer down-samples the feature map. The last layer is the output layer with 2622 classes, reflecting the number of subjects in the database; its activation is the Softmax function for the multi-class classification problem. The base model with the top connected layers, pre-trained on the face recognition task, is shown in Table 1.
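To get a feel for where the parameters of this classifier block reside, the sizes can be computed directly. This assumes the standard VGG16 geometry for a 224×224 input, where the last max pooling layer yields a 7×7×512 feature map:

```python
# Flattened output of the last max pooling layer for a 224x224 input.
flat = 7 * 7 * 512            # 25088

# FC6 and FC7 each produce a 4096-dimensional output (weights + biases).
fc6 = flat * 4096 + 4096      # 102,764,544
fc7 = 4096 * 4096 + 4096      # 16,781,312

# Output layer: 2622 classes, one per subject in the VGGFace database.
fc8 = 4096 * 2622 + 2622      # 10,742,334

total = fc6 + fc7 + fc8
print(f"classifier parameters: {total:,}")  # 130,288,190
```

Most of the network's capacity therefore sits in the FC layers, which is why the choice of classifier topology matters so much when fine-tuning.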

Adding Batch Normalization between FC Layers
A technique that has proven its efficiency in avoiding the overfitting problem is regularization. Inserting Batch Normalization (BN) [46] between the layers regularizes the model and makes it more stable. A BN layer takes the output of the preceding activation layer and normalizes it by subtracting the mini-batch mean and dividing by the mini-batch standard deviation. In other words, the normalizing transform fixes the means and variances of the layer inputs, adding two trainable parameters per layer. Moreover, it reduces the network's dependency on the initialization of each layer, which allows a higher learning rate to be used.
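The BN transform described above can be sketched in NumPy (γ and β are the two trainable parameters per layer; ε avoids division by zero; the batch shape is an arbitrary example):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of activations per feature, then scale and shift."""
    mu = x.mean(axis=0)                   # mini-batch mean, per feature
    var = x.var(axis=0)                   # mini-batch variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta           # gamma, beta are learned during training

# Activations from a hypothetical FC layer: batch of 16, 4096 features.
batch = np.random.randn(16, 4096) * 3.0 + 7.0
out = batch_norm(batch)
print(out.mean(), out.std())  # close to 0 and 1
```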

Improving the Model with Online Data Augmentation
One of the most important techniques for enhancing the performance and robustness of deep learning models is to train the neural network with a large amount of data. Unfortunately, most image-based applications have limited datasets, or the conditions under which the images were taken do not reflect real-world scenarios. Age estimation is a challenging computer vision task affected by many internal and external factors [28], [47], and there are no general patterns of aging common to all humans. A more realistic age estimation system should therefore be able to learn age-related patterns under different conditions.
Data augmentation [48] is a regularization technique that feeds the neural network with additional synthetic images reflecting more realistic conditions and perspectives. By applying data augmentation, the network can avoid the overfitting caused by small datasets. Various transformations can be applied to create modified images, such as translation, rotation, scaling and brightness changes. In this work, we select online augmentation, applied to the images during the training process as the batches are passed. Thus, training is quicker, and there is no need to load both the original and the augmented data into memory, as in offline augmentation [48]. We evaluated different transformation functions of online data augmentation to select the most appropriate ones for enhancing the performance of the model.

Fine-tuning VGGFace Model for Age Estimation
A single VGGFace network is used and fine-tuned for real-age estimation. We propose two approaches to reusing the basic model. Approach-1: In this approach, we keep the base convolutional layers together with the top classification layers and remove the final Softmax activation layer. The resulting feature vector from the FC8 layer is then connected to extra FC layers. We freeze all the layers except the newly added ones. The model is examined when adding different numbers of FC layers with different numbers of neurons. The last layer is the output layer, connected to the model as a dense layer. This approach is shown in Figure 1.
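Approach-1 can be sketched in Keras along these lines. A small stand-in model plays the role of the pre-trained VGGFace (loading the actual weights is omitted, and the layer sizes are illustrative, not the real 4096/2622 dimensions); the FC8 features feed one hypothetical extra FC layer, and everything except the new layers is frozen:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Stand-in for the pre-trained VGGFace network (conv base + FC7/FC8).
# In practice these layers would carry the face recognition weights.
inp = layers.Input(shape=(224, 224, 3))
x = layers.Conv2D(8, 3, activation="relu")(inp)
x = layers.MaxPooling2D(4)(x)
x = layers.Flatten()(x)
x = layers.Dense(64, activation="relu", name="fc7")(x)
fc8 = layers.Dense(32, name="fc8")(x)      # Softmax activation removed
base = models.Model(inp, fc8)
base.trainable = False                     # freeze all pre-trained layers

# New trainable head: an extra FC layer plus the output layer
# (70 nodes for FG_NET classification, or 1 node for regression).
h = layers.Dense(16, activation="relu", name="extra_fc")(base.output)
out = layers.Dense(70, activation="softmax", name="age_out")(h)
model = models.Model(base.input, out)
print(model.output_shape)  # (None, 70)
```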

Approach-2:
In this approach, we combine the GAP layer and the FC layers. We keep the base convolutional layers and remove the top classification layers, then connect the last max pooling layer to a GAP layer. The output of GAP is fed to the FC layers. The model is examined when adding different numbers of FC layers with different numbers of neurons. The last layer is the output layer, connected to the model as a dense layer. Figure 2 shows the proposed structure of Approach-2 for fine-tuning the VGGFace network on age estimation.
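A corresponding Keras sketch of Approach-2 (again with a small stand-in convolutional base instead of the real VGGFace weights, and illustrative layer sizes): the top classification layers are dropped, GAP is attached to the last pooling output, and new FC layers are added on top:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Stand-in for the frozen VGGFace convolutional base.
inp = layers.Input(shape=(224, 224, 3))
x = layers.Conv2D(32, 3, activation="relu")(inp)
x = layers.MaxPooling2D(4)(x)
base = models.Model(inp, x)
base.trainable = False

# GAP collapses each feature map to one value, then feeds the new FC layers.
g = layers.GlobalAveragePooling2D()(base.output)
h = layers.Dense(16, activation="relu", name="new_fc")(g)
out = layers.Dense(1, name="age_out")(h)    # linear output for regression
model = models.Model(base.input, out)
print(model.output_shape)  # (None, 1)
```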

Classification vs. Regression
Age can be estimated using one of the age encoding algorithms. If age is considered a multi-class classification problem, then each age represents a single class. However, there is a correlation and continuous behavior across different ages, so ages can also be treated as a regression problem. In this work, we investigated both classification and regression algorithms to encode age. Considering age as a multi-class problem, we added an age classifier on top of the network as an output. We used one-hot encoding to represent the age labels: each sample has a probability of 1.0 for the correct class and 0.0 for the other classes. To predict this probability, Softmax was applied as the activation function. The Softmax function [49] can be defined as in Equation 1:

P(y_i | x; W) = e^{W_i^T x} / Σ_j e^{W_j^T x}    (1)

where P is the probability assigned to the correct label given the image and parameterized by the weights W. Furthermore, to monitor the training process, categorical cross-entropy was selected as the loss function.
For regression, a single linearly-activated output node predicts the age directly, following the linear model in Equation 2:

Y = a + bX    (2)

where X is the explanatory variable and Y is the dependent variable. The slope of the line is b and a is the intercept (the value of Y when X = 0).
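Both encodings can be sketched numerically. For classification, the Softmax of Equation 1 turns raw output scores into class probabilities and categorical cross-entropy penalizes the probability assigned to the true one-hot label; for regression, a linear output predicts the age directly (all numbers below are made-up examples):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift scores for numerical stability
    return e / e.sum()

scores = np.array([1.2, 0.3, 2.5])   # raw outputs for 3 hypothetical age classes
probs = softmax(scores)
print(probs.sum())                   # 1.0 -- a valid probability distribution

one_hot = np.array([0.0, 0.0, 1.0])  # the true age is the third class
cross_entropy = -np.sum(one_hot * np.log(probs))

# Regression instead fits Y = a + b*X on continuous ages (Equation 2).
a, b = 5.0, 2.0
y_pred = a + b * 3.0                 # 11.0
```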

EXPERIMENTS AND RESULTS
We practically examined the efficiency of our proposed approaches for fine-tuning a pre-trained VGGFace network for real-age estimation, conducting many experiments to minimize the error. The learning process was further enhanced using online data augmentation, such as rotation, shearing and flipping. After trying many image transformation functions, we selected the appropriate augmentation functions and used them in all experiments: each image in the training set is flipped horizontally, sheared randomly up to 0.5 and rotated randomly up to 45 degrees.
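These online augmentation settings can be reproduced, for example, with Keras's `ImageDataGenerator` (a sketch under the assumption that the shear and rotation values map to `shear_range` and `rotation_range`; the paper does not specify the exact API it used):

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Online augmentation applied on the fly as batches are drawn:
# horizontal flip, shear up to 0.5, rotation up to 45 degrees.
datagen = ImageDataGenerator(horizontal_flip=True,
                             shear_range=0.5,
                             rotation_range=45)

images = np.random.rand(16, 224, 224, 3)   # a dummy training batch
labels = np.arange(16)
batch_x, batch_y = next(datagen.flow(images, labels, batch_size=16))
print(batch_x.shape)  # (16, 224, 224, 3)
```

Because the transformed images are produced per batch, the original dataset never needs to be duplicated in memory, which is the advantage of online over offline augmentation noted above.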

Evaluation Metrics
The most common metric for real-age estimation is the MAE. This metric is more representative than classification accuracy for evaluating an age estimation system, since it measures the difference between the estimated age and the real age. It can be defined mathematically as in Equation 3:

MAE = (1/N) Σ_{i=1}^{N} |ŷ_i − y_i|    (3)

where ŷ_i is the estimated age, y_i is the real age and N is the number of testing samples. The comparison between different models is based on obtaining a minimum value of MAE, which implies a good performance [2].
For the age group classification task, accuracy is used as the evaluation metric, calculated using Equation 4:

Accuracy_x = n_x^c / n_x    (4)

where n_x^c is the number of images correctly classified into a specific group x and n_x is the number of testing samples of group x [17].
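Both metrics are straightforward to compute; a small NumPy example with made-up predictions and labels:

```python
import numpy as np

# MAE (Equation 3): mean absolute difference between estimated and real ages.
real = np.array([23, 30, 45, 8])
estimated = np.array([25, 30, 41, 10])
mae = np.abs(estimated - real).mean()
print(mae)  # 2.0

# Accuracy (Equation 4): correctly classified images over all test images.
true_groups = np.array([0, 1, 2, 2, 3])
pred_groups = np.array([0, 1, 1, 2, 3])
accuracy = (pred_groups == true_groups).mean()
print(accuracy)  # 0.8
```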

Benchmark Datasets
FG_NET Dataset [50] contains 1002 images belonging to 82 subjects; grayscale versions of all images are available in addition to the colored ones. The dataset covers ages from 0 to 69 years, but most of the images are of subjects younger than 40. On average, there are 12 images per person. Gender and race annotations are also provided. FG_NET is considered a constrained dataset, where all the images have frontal head poses captured under restricted conditions. Moreover, for the task of face modelling, the database provides 68 landmark points. Because of the highly biased classes in this dataset, we adopted a different protocol. Many recent age estimation models use Leave One Person Out (LOPO), which tests the model each time on the images belonging to one person and averages over all 82 subjects. To make the testing process more realistic, we instead split the dataset into 80% for training, 10% for validation and 10% for testing. Figure 3 shows some samples of the dataset.
UTKFace Dataset [51] contains over 20,000 face images labelled with age, gender and ethnicity. The images cover a large span of ages, from 0 to 116 years, with huge variations in head pose, illumination and occlusion. They were captured in unconstrained conditions with correct real-age annotations. Figure 4 shows some images from the dataset. We used this dataset to evaluate the model on unconstrained data for real-age estimation, with a split of 80% for training, 10% for validation and 10% for testing.
Adience Dataset [22][52] was collected from Flickr.com albums and is used for age group and gender classification. Some images from the dataset are shown in Figure 5. It consists of 26,580 images belonging to 2,284 subjects, covering 8 age groups from 0 to 60 years and older. The database poses big challenges, as the images have low resolution with extreme blurring, occlusion and varying head poses and expressions. Five models were trained, one per fold: as the Adience dataset is already divided into five folds, four were used for training and validation while fold 0 was used as a hold-out testing set. Accuracy is then reported as the mean of the accuracies obtained from the five folds. The faces were detected and the images cropped to remove the distracting background.

Data Pre-processing
For the Adience dataset, we detected the faces from the images and cropped them to remove the noisy background, as shown in Figure 6.

Figure 6. Pre-processing step for images in Adience dataset: the original image, and the result after face detection and cropping.
For the UTKFace dataset, we used the aligned and cropped version offered on the dataset's website. For the FG_NET dataset, we used the images without any pre-processing. For all datasets, we rescaled the images to 224×224 resolution to be compatible with the VGGFace input size.
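The rescaling step can be done, for example, with Pillow (the input image here is a random dummy array; real images would be loaded from the dataset files):

```python
import numpy as np
from PIL import Image

# A dummy 300x250 RGB image standing in for a dataset sample.
img = Image.fromarray((np.random.rand(300, 250, 3) * 255).astype("uint8"))

# Rescale to 224x224, the input resolution of VGGFace.
resized = img.resize((224, 224))      # PIL expects (width, height) order
arr = np.asarray(resized)
print(arr.shape)  # (224, 224, 3)
```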

Hardware and Software Tools
The system and its stages were executed on a High-Performance Computing (HPC) machine with an NVIDIA Tesla GPU. Our model is implemented in Python using Keras with a TensorFlow backend.

Experimental Results
This subsection summarizes the results obtained from the experiments for both encoding algorithms, classification and regression, using the proposed approaches. Section 4.5.1 shows the results on the constrained FG_NET dataset, and Section 4.5.2 the results on the unconstrained UTKFace dataset. One of the main concerns with a large effect on performance when reusing a pre-trained model is finding the appropriate topology of the classifier with strong generalization to the task. In our experiments, we studied different schemes for connecting the pre-trained model with the output layers, varying the number of hidden layers (depth) and the number of neurons (width) to construct the structure best suited to the age estimation task. We analyzed the proposed approaches using the following schemes:
- Scheme-1: We fine-tuned the base model without adding any FC layers. For Approach-1, we connected FC8 directly to the output layer; for Approach-2, we connected the GAP directly to the output layer.
- Schemes-2, 3 and 4: Varying width; a single layer with varying feature size. We expanded the model by adding one layer with different neuron sizes: 1024, 5000 and 6000.
- Schemes-5, 6, 7 and 8: Varying depth; adding a set of two layers instead of one. Within these schemes, we examined several combinations of BN with FC layers. After each BN layer, we added a ReLU activation layer connected to the FC layer, in the order FC1 + BN + ReLU + FC2 + BN + ReLU. The reverse order, BN + ReLU + FC1 + BN + ReLU + FC2, was also tested to determine which is more effective. A further scheme with FC layers and ReLU activations was also considered, where each FC layer is followed by a ReLU layer.
After conducting a number of experiments with different values of the training parameters, we fixed them for both datasets as follows: epochs = 64, batch size = 16, learning rate = 0.0001 and a split ratio of 80% training, 10% validation and 10% testing, with Stochastic Gradient Descent (SGD) as the optimizer. To boost performance, we used online data augmentation.
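This training configuration corresponds to a Keras setup along these lines (sketched with a trivial stand-in model, since the actual fine-tuned network is described in the approaches above):

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

# Training parameters used for both datasets.
EPOCHS, BATCH_SIZE, LEARNING_RATE = 64, 16, 1e-4

# Trivial stand-in for the fine-tuned network.
model = models.Sequential([
    layers.Input(shape=(8,)),
    layers.Dense(1),
])

# SGD optimizer with the chosen learning rate; MAE matches the
# evaluation metric used for real-age estimation.
model.compile(optimizer=optimizers.SGD(learning_rate=LEARNING_RATE),
              loss="mae")
# model.fit(x_train, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE,
#           validation_data=(x_val, y_val))
```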

Results on the Constrained FG_NET Dataset
In this section, we show the results of evaluating the proposed approaches on the constrained FG_NET dataset. We used the images from the dataset without any pre-processing. For classification, the number of nodes in the output layer was set to 70, corresponding to the number of classes in the dataset; for regression, it was set to 1.

Approach-1: Fine-tuning the Base Convolutional Layers with the Top Included
The model was constructed by freezing the entire base model and removing the Softmax activation layer. Then, we connected the feature vector from FC8 to extra dense layers of different sizes. Table 2 shows the results of the classification and regression algorithms when connecting different schemes to the base model. It is clear from the results that using Approach-1 with regression is better than with classification and results in a model more robust for age estimation. The lowest MAE was achieved with the (BN + ReLU + Two FC (4096)) scheme, i.e. adding BN and activation before each FC layer.

Approach-2: Fine-tuning the Base Convolutional Layers with the Top Excluded
The fine-tuning process was handled by freezing the base convolutional layers and removing the classification layers. We used GAP to connect the base convolutional layers with the new FC layers. Table 3 shows the results of Approach-2 for the real-age estimation task on the FG_NET dataset, using classification and regression to encode age. The lowest MAE was achieved when using (Two FC (4096) + ReLU) with the regression algorithm. Figure 7 shows the results of classification and regression encoding for the different proposed schemes. We observed that, with Approach-1, the regression algorithm obtained a lower MAE on average than the classification algorithm. We can also see the good impact of using Approach-2 with the structure of stacked FC layers, both with activation layers (Two FC (4096) + ReLU) and without (Two FC (4096)).
Reusing the model with the (No FC) scheme was the worst case for both approaches. With the (BN + ReLU + Two FC (4096)) scheme, Approach-1 obtained the lowest MAE, while with the (Two FC (4096) + ReLU) scheme, Approach-2 obtained the lowest MAE. The better results of regression over classification when fine-tuning on FG_NET are related to the classes of the dataset being biased toward younger ages, with inadequate data to represent each class; thus, learning age as continuous values rather than discrete age labels is more efficient.

Results on the Unconstrained UTKFace Dataset
To bring the model closer to real applications, we should evaluate it under rougher conditions. UTKFace is considered an unconstrained dataset, as its images contain a diversity of head poses, illumination and occlusion. In this subsection, we show the results of evaluating the proposed approaches on the unconstrained UTKFace dataset, using its aligned and cropped version. For classification, the number of nodes in the output layer was set to 101, corresponding to the number of classes in the dataset; for regression, it was set to 1. Figure 8 shows the age estimation results when using Approach-1 and Approach-2 on the UTKFace dataset. The results show the effectiveness of using the classification algorithm with the proposed approaches to encode age. The lowest MAE was achieved by the (Two FC (4096) + ReLU) scheme. It can be observed from these experiments how the performance of Approach-2 outperforms that of Approach-1 for both encoding algorithms. Both approaches show unstable performance with the (BN + ReLU + Two FC (4096)) and (Two FC (4096) + BN + ReLU) schemes under the regression algorithm, which is related to using a small batch size with a large dataset, in which case BN may have an adverse effect. The lowest MAE was achieved when using FC layers with ReLU activation for both approaches. It is also noticeable that Approach-1 shows more stable performance than Approach-2 across all schemes.
Figure 8. Age estimation results on UTKFace dataset using proposed approaches.

Results on the Unconstrained Adience Dataset
In this section, we show the results of reusing the base model with the second approach on the Adience dataset for the age group classification task. The images were pre-processed by detecting and cropping the faces to remove the background. We fine-tuned VGGFace using Approach-2, which showed better performance than Approach-1, by connecting the GAP layer with different schemes to perform the age group classification task. The output layer was set to 8 neurons, corresponding to the number of classes, with Softmax as the activation function. We considered accuracy as the evaluation metric for this model. As shown in Figure 9, classifying 8 age-group classes needs a less complex, less deep network than the real-age estimation task. The best accuracy was achieved by the base model without any extra layers, directly connecting the GAP layer to the output layer.
Figure 9. Age group classification results on Adience dataset using Approach-2.

DISCUSSION
In this section, the proposed model is examined according to several criteria, such as the effectiveness of specific-domain transfer learning and the robustness and complexity of the model. Moreover, some success and failure cases are presented with further analysis.
The effectiveness of using specific-domain transfer learning for the age estimation system: the typical structure of VGGFace is based on the VGG16 network pre-trained on the VGGFace dataset for the face recognition task. To show the effectiveness of our proposed method, the general-domain VGG16 model pre-trained on ImageNet was tested with Approach-2 using scheme (no FC). From the results shown in Table 4, we can see how Approach-2 benefits from specific-domain transfer learning, compared with general-domain transfer learning, for the age estimation task.
We compare the pre-trained VGGFace and VGG16 models in terms of how well age estimation can benefit from the features they extract from faces. The feature maps are visualized from the first block. Figure 10 shows the ability of the VGGFace model, pre-trained on faces, to attend to and extract more features related to the face. On the other hand, we can see how VGG16, pre-trained on object classification on the ImageNet dataset, requires a more complex network to be capable of extracting more face-related features. From this point, we can conclude that using models pre-trained on a task related to age, such as face recognition, is more effective than using models pre-trained on a general task.
The complexity of the model: adding two FC layers means extra computational effort and a large number of parameters. Nowadays, the common trend is to design less complex models that can be adapted to real applications. For both datasets used, the system achieved reasonable results when using scheme (one FC with 6000 inputs) with the classification encoding algorithm. In this case, the resulting models are smaller than models constructed using two FC layers, so they can be deployed on low-memory devices with limited capacity, such as mobile devices.
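The first-block feature-map inspection described above can be reproduced with a short Keras sketch. The backbone is built without weights so the snippet runs offline; in practice one would load the VGGFace and ImageNet weights respectively before comparing activations. The layer name `block1_conv2` follows Keras's VGG16 naming.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import models

# Build a VGG-style backbone (weights=None keeps this self-contained;
# load VGGFace / ImageNet weights here for a real comparison).
base = tf.keras.applications.VGG16(weights=None, include_top=False,
                                   input_shape=(224, 224, 3))

# Truncate the network at the end of the first block to read out its
# feature maps for a given face image.
feat = models.Model(base.input, base.get_layer("block1_conv2").output)
maps = feat.predict(np.random.rand(1, 224, 224, 3), verbose=0)
print(maps.shape)  # 64 block-1 feature maps at full spatial resolution
```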
The robustness of the model: using data augmentation in deep neural networks increases the ability of the network to learn from the same images under different conditions, created using different transformation functions. Moreover, testing the model only on constrained datasets with clean data does not indicate how well the system would perform in real applications. Thus, using unconstrained datasets such as UTKFace and Adience, which contain a large diversity of illumination, facial expressions and occlusion, improves the robustness of the system to real conditions. Also, an age estimation system needs a large number of images covering a wide range of ages to become more robust.
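The augmentation transformations discussed above (and the rotation, shearing and flipping mentioned later in the paper) could be configured in Keras as sketched below; the exact parameter values are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation pipeline: small rotations, shearing and
# horizontal flips, as discussed in the text. Values are assumptions.
augmenter = ImageDataGenerator(
    rotation_range=15,     # random rotations up to +/- 15 degrees
    shear_range=0.1,       # shearing
    horizontal_flip=True,  # flipping (faces are roughly symmetric)
    fill_mode="nearest",
)

# Apply the augmenter to a dummy batch of four 224x224 RGB images.
batch = next(augmenter.flow(np.random.rand(4, 224, 224, 3),
                            batch_size=4, shuffle=False))
print(batch.shape)  # augmented batch keeps the input shape
```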
Success and failure cases of the model: we tested our best model for real age estimation on different images from both datasets (FG_NET and UTKFace). As shown in Figure 11, our model can successfully estimate the real age for both constrained and unconstrained images. In some cases, our system failed to estimate the real age, since a person may look younger or older than his/her actual age.
Figure 11. Successful and failed samples tested by our best model.

COMPARISON WITH THE STATE OF THE ART
Consistent with the earlier experiments, transfer learning from a model pre-trained on a specific domain, such as face recognition, can improve age estimation performance and overcome the overfitting problem. In addition, correct selection of data augmentation can enhance performance. In this section, a comparison of the results of the proposed model with the state of the art is presented. Table 5 summarizes the results of age estimation on the constrained FG_NET dataset against the state of the art. Some models, such as [42], needed a deeper network to reach reasonable results. Antipov et al. [29] benefited from pre-training on face recognition to improve the performance of real-age estimation. Chen et al. [39] pre-trained their ranking CNNs to extract facial features on a small unconstrained dataset. Chang and Chen [38] used hand-crafted features in their model to extract age features from facial images. A classification model for age estimation was proposed by Rothe et al. [27]; they used a VGG16 network and pre-trained it on the unconstrained IMDB_WIKI dataset for the age estimation task. IMDB_WIKI contains about 500K images, a small number compared with the 2.6M images of the VGGFace dataset used to pre-train VGGFace. The results show the effectiveness of specific-domain pre-training for extracting features related to age. Moreover, the combination of the GAP and FC layers in Approach-2 with scheme (Two FC (4096) + ReLU) regularized the model and prevented the overfitting problem. Using scheme (BN + ReLU + Two FC (4096)) also had a good influence on model stability. Table 6 summarizes the results of age estimation on the unconstrained UTKFace dataset against other models. Niu et al. [53] constructed their multi-output CNN model with fewer layers than VGGFace. Cao et al. [54] used VGG16 as the base network, pre-trained on a general domain.
It is clearly observed from the results that using VGGFace (based on VGG16) for the age estimation model is more effective than using other models. For our proposed model, the lowest error was obtained when using classification as the age encoding algorithm with Approach-2 for fine-tuning the base model under scheme (Two FC (4096) + ReLU). It is noticed that our proposed model obtained a lower MAE with classification than with regression. This is likely because UTKFace is a large dataset with a balanced distribution over ages. Thus, treating age estimation as a multi-class classification task yields good results when each class in the dataset has an adequate number of images, with a balanced distribution over a wide range of ages. For the age group classification task, as shown in Table 7, we achieved good accuracy despite the simple fine-tuning scheme and approach used in this experiment. Although Zhang et al. [42] achieved a higher accuracy than other methods, they used a very deep residual network with 152 layers, compared with a shallower network such as VGG16, which was the base model for [27] and was pre-trained on a large dataset. Rodríguez et al. [55] used a Wide Residual Network (WRN) pre-trained on a general-domain task. In their previous work [33], despite adding an attention mechanism to improve the performance of VGGFace, they achieved only a minor improvement in accuracy compared with our model, which uses the GAP layer to reduce the overfitting problem. Other models used only shallow CNNs to extract age features.
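When classification is used to encode age, the 101-way softmax output must be turned into a single age estimate before computing the MAE. The paper does not state which readout it uses, so the NumPy sketch below shows the two common options as assumptions: the argmax class, and the probability-weighted expectation over ages (the readout used by DEX [27]).

```python
import numpy as np

def age_from_probs(probs, mode="argmax"):
    """Turn 101-way softmax outputs into age estimates.

    'argmax' takes the most likely class; 'expect' takes the
    probability-weighted mean over ages 0..100 (DEX-style readout).
    Both readouts are illustrative assumptions, not the paper's stated choice.
    """
    ages = np.arange(probs.shape[-1])
    if mode == "argmax":
        return probs.argmax(axis=-1).astype(float)
    return (probs * ages).sum(axis=-1)

def mae(pred, true):
    """Mean absolute error, the metric reported in Tables 5 and 6."""
    return np.abs(np.asarray(pred) - np.asarray(true)).mean()

# Toy check: a distribution concentrated around age 30.
p = np.zeros((1, 101))
p[0, 30], p[0, 31] = 0.8, 0.2
print(age_from_probs(p, "argmax"))  # [30.]
print(age_from_probs(p, "expect"))  # approximately [30.2]
```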

CONCLUSIONS
This paper proposed an age estimation model based on VGGFace, which was pre-trained on a specific domain. Age is an attribute derived from the face; thus, reusing a model pre-trained on a task related to age can extract discriminative age-related features effectively and avoid the overfitting problem when data are limited. In this work, we utilized different approaches for fine-tuning the basic VGGFace model. The first approach kept the top classification layers together with the base convolutional layers and connected the model to different strategies of adding extra FC layers. The second approach joined the base convolutional layers with the GAP layer, connected to different strategies of adding extra FC layers.
The proposed approaches were tested under different schemes, by varying feature size and model depth.
We also investigated the effectiveness of two algorithms for encoding age: classification and regression. The minimum error was obtained when using a balanced dataset like UTKFace, where each class is represented by enough data at a specific age. On the other hand, in the case of unbalanced datasets like FG_NET, regression performed better than classification. Furthermore, selecting appropriate data augmentation, such as rotation, shearing and flipping, can improve performance. We evaluated our model on two kinds of datasets: the constrained FG_NET and the unconstrained UTKFace and Adience. Our model achieved state-of-the-art results on FG_NET when using regression to encode age, while the lowest MAE on the UTKFace dataset was obtained when using classification.
For the age group classification task, the model was fine-tuned using the second approach, and good results were achieved on the Adience dataset despite the simple fine-tuning approach that was implemented. The lower complexity of classifying 8 age groups, compared with predicting an exact age over a wide range, may explain the good performance obtained.
For future work, we plan to explore a hybrid system that combines classification and regression in one model. The idea is to first classify images into age groups; then, for each group, a regression CNN will be trained to estimate the exact age.
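The two-stage hybrid idea above can be outlined in a few lines. The code below is purely a structural sketch of the proposed future direction: the group boundaries are illustrative, and the classifier and per-group regressors are toy stand-ins for the CNNs that would be trained.

```python
import numpy as np

# Illustrative age-group bins; the actual grouping is an open design choice.
GROUPS = [(0, 12), (13, 19), (20, 39), (40, 59), (60, 100)]

def hybrid_estimate(x, classify_group, regressors):
    """Stage 1: pick an age group. Stage 2: regress the exact age
    within that group, clipped to the group's range."""
    g = classify_group(x)
    lo, hi = GROUPS[g]
    return float(np.clip(regressors[g](x), lo, hi))

# Toy stand-ins for the group-classification CNN and one group regressor.
cls = lambda x: 2                 # always predicts group (20, 39)
regs = {2: lambda x: 27.4}        # within-group regressor for that group
print(hybrid_estimate(None, cls, regs))  # 27.4
```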