Hierarchical Modified Fast R-CNN for Object Detection

Arindam Chaudhuri

Abstract


In object detection, the visual separability of object categories is highly skewed: some categories are easy to tell apart, while others are difficult to distinguish and demand dedicated classifiers. Deep convolutional neural networks (CNNs) are typically trained as flat N-way classifiers, so considerable work is still required to leverage hierarchical category structures. We present the Modified Fast region-based CNN (Mod Fast R-CNN) and the Hierarchical Modified Fast region-based CNN (HMod Fast R-CNN), in which deep CNNs are embedded into a category hierarchy. Easy classes are separated by coarse classifiers, and difficult classes are classified by fine classifiers. HMod Fast R-CNN is trained by first training its components and then fine-tuning globally using multiple group discriminant analysis, with regularization based on coarse category consistency. For large-scale recognition tasks, scalability is achieved through conditional execution of the fine category classifiers and compression of layer parameters. We obtain good results on the benchmark MS-COCO, CIFAR100 and VisualQA datasets. We build several HMod Fast R-CNN variants that significantly reduce the top-1 error of the standard CNNs. We also highlight HMod Fast R-CNN's superior performance over other object detectors on the PASCAL VOC 2007 and VOC 2012 datasets.
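
The coarse-to-fine idea with conditional execution can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name `hierarchical_predict`, the `threshold` parameter, and the choice to combine fine-classifier outputs as a probability-weighted average are all assumptions made for illustration. The classifiers are stand-in callables, not Fast R-CNN heads.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical type: a fine classifier maps a feature vector to a
# probability distribution over all fine classes.
FineClassifier = Callable[[Sequence[float]], List[float]]


def hierarchical_predict(
    features: Sequence[float],
    coarse_probs: Sequence[float],
    fine_classifiers: Dict[int, FineClassifier],
    num_fine_classes: int,
    threshold: float = 0.1,  # assumed cut-off for conditional execution
) -> List[float]:
    """Combine fine-classifier outputs, weighted by coarse probabilities.

    Fine classifiers whose coarse probability falls below `threshold`
    are skipped; the remaining coarse weights are renormalized.
    """
    active = [k for k, p in enumerate(coarse_probs) if p >= threshold]
    if not active:
        # Fall back to the single most likely coarse category.
        active = [max(range(len(coarse_probs)), key=lambda k: coarse_probs[k])]

    total = sum(coarse_probs[k] for k in active)
    combined = [0.0] * num_fine_classes
    for k in active:
        weight = coarse_probs[k] / total
        fine_probs = fine_classifiers[k](features)  # only run when active
        for j in range(num_fine_classes):
            combined[j] += weight * fine_probs[j]
    return combined
```

With a confident coarse prediction such as `coarse_probs = [0.95, 0.05]` and the default threshold, only the first fine classifier is evaluated, which is where the saving in a large-scale setting would come from under these assumptions.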




DOI: https://doi.org/10.31449/inf.v45i7.3732

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.