
Joint Face Detection and Facial Landmark Estimation Based on Multi-task Cascaded Convolutional Neural Networks

Advisor: 易志孝

Abstract


This thesis proposes a joint face detection and facial landmark estimation method based on multi-task cascaded convolutional neural networks (MTCNN) and investigates the relationship between the face bounding-box regression vectors and the facial landmark locations under multi-task learning. The proposed model consists of three convolutional neural networks: a proposal network (P-Net), a refine network (R-Net), and an output network (O-Net). First, P-Net performs a coarse detection of face candidate boxes, calibrates the candidates with the predicted bounding-box regression vectors, and merges highly overlapping candidates with non-maximum suppression (NMS). The remaining candidates are passed to R-Net, which rejects a large number of false candidates and, like P-Net, calibrates the survivors with regression vectors and merges them with NMS. The boxes that survive R-Net are sent to O-Net, which is trained with both face bounding boxes and facial landmark annotations (left eye, right eye, nose, left mouth corner, and right mouth corner). Finally, O-Net outputs the detected face bounding boxes together with the five estimated facial landmark locations.

During training, we adopt multi-task learning and online hard sample mining to improve performance. The WIDER FACE dataset is used to train the network for face detection, and the CelebA dataset is used to train the same network for facial landmark estimation.

We evaluate face detection on the easy, medium, and hard validation sets of WIDER FACE and plot the precision-recall curves. The trained network reaches recall rates of 0.695, 0.670, and 0.402 at precision rates of 0.6, 0.78, and 0.8 on the easy, medium, and hard sets, respectively, outperforming the Faceness, Two-stage CNN, and ACF methods. To evaluate facial landmark estimation, we use the validation sets of the Deep Convolutional Network Cascade for Facial Point Detection (DCNN) dataset and the CelebA dataset and report the per-landmark mean error and failure rate. On the DCNN validation set, the mean errors (%) of the left eye, right eye, nose, left mouth corner, and right mouth corner are 10.89, 11.88, 14.27, 15.6, and 16.06, and the failure rates (%) are 27.71, 28.11, 45.38, 53.82, and 54.22, respectively. On the CelebA validation set, the mean errors (%) are 10.60, 10.95, 13.36, 14.15, and 14.44, and the failure rates (%) are 27.72, 28.67, 43.30, 49.96, and 51.04, respectively.
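
Each stage of the cascade merges highly overlapping candidate boxes with non-maximum suppression. The following is a minimal NumPy sketch of greedy NMS; the 0.5 IoU threshold is an illustrative default, not a value taken from this thesis.

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, then drop every
    remaining box whose IoU with it exceeds the threshold.

    boxes:  (N, 4) array of [x1, y1, x2, y2]
    scores: (N,) confidence scores
    Returns the indices of the surviving boxes.
    """
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of box i with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Keep only boxes that overlap box i less than the threshold
        order = order[1:][iou <= iou_threshold]
    return keep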
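
The bounding-box regression vectors calibrate candidate boxes by shifting each corner by a predicted offset. The sketch below assumes the common MTCNN parameterization, in which the offsets are expressed as fractions of the candidate's width and height; the abstract does not spell out the exact parameterization used.

import numpy as np

def calibrate_boxes(boxes, reg):
    """Apply predicted regression vectors to candidate boxes.

    boxes: (N, 4) candidate boxes [x1, y1, x2, y2]
    reg:   (N, 4) predicted offsets as fractions of box width/height
           (assumed MTCNN-style parameterization)
    """
    w = (boxes[:, 2] - boxes[:, 0])[:, None]
    h = (boxes[:, 3] - boxes[:, 1])[:, None]
    scale = np.hstack([w, h, w, h])   # per-coordinate scaling
    return boxes + reg * scale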
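
Online hard sample mining back-propagates only the hardest examples in each mini-batch. A hedged PyTorch sketch follows; the 0.7 keep ratio follows the original MTCNN paper and is an assumption here, since the abstract does not state the value used.

import torch

def hard_sample_loss(per_sample_loss, keep_ratio=0.7):
    """Online hard sample mining: in each mini-batch, keep only the
    samples with the largest losses and average over them, so that
    gradients come from the hard examples only.

    per_sample_loss: (N,) unreduced per-sample losses
    keep_ratio:      fraction of the batch treated as hard (assumed)
    """
    n_keep = max(1, int(per_sample_loss.numel() * keep_ratio))
    hard, _ = torch.topk(per_sample_loss, n_keep)   # largest losses
    return hard.mean()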
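
The reported landmark results are per-landmark mean errors and failure rates. The sketch below assumes one common convention, normalizing the point-to-point error by a per-face distance (for example the inter-ocular distance) and counting errors above a fixed threshold as failures; the normalization and threshold actually used in the thesis are not stated in the abstract.

import numpy as np

def landmark_error(pred, gt, norm_dist, fail_thresh=0.1):
    """Per-landmark mean error and failure rate.

    pred, gt:    (N, 5, 2) predicted / ground-truth landmark coordinates
                 (left eye, right eye, nose, left/right mouth corner)
    norm_dist:   (N,) per-face normalization distance, e.g. the
                 inter-ocular distance (assumed convention)
    fail_thresh: normalized error above which a prediction counts as
                 a failure (illustrative value)
    Returns per-landmark mean error (%) and failure rate (%).
    """
    err = np.linalg.norm(pred - gt, axis=2) / norm_dist[:, None]  # (N, 5)
    mean_error = 100.0 * err.mean(axis=0)
    failure_rate = 100.0 * (err > fail_thresh).mean(axis=0)
    return mean_error, failure_rate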

