
Neurocomputing

Volume 195, 26 June 2016, Pages 23-29

Multimodal learning for view-based 3D object classification

https://doi.org/10.1016/j.neucom.2015.09.120

Abstract

With the growing use of machine learning and pattern classification methods, view-based 3D object classification has emerged as a major research focus. However, most existing work relies on only a single modality of image features, although recent studies have shown that different kinds of features can provide complementary information for 3D object classification. In this paper, we propose a multimodal support vector machine that combines three modalities of image features, i.e., the SIFT descriptor, the outline Fourier transform descriptor, and the Zernike moments descriptor, to discriminate among multiple object classes, where each kernel corresponds to one modality. In this way, both the independence of each modality and the interrelations between modalities are taken into account. After feature extraction, we further employ multi-task feature selection via l2-norm regularization to improve the final classification performance. Experiments conducted on the ETH-80 image set demonstrate the effectiveness and superiority of our method.

Introduction

With the rapid development of the Web and the widespread popularity of mobile devices, view-based 3D object classification and retrieval have been drawing increasing attention, owing to the growing availability of multi-view data on 3D objects, their low computational cost, and their strong performance in object classification and retrieval. In practice, view-based methods achieve effective and efficient 3D object classification and retrieval in many application fields, such as the entertainment, medical, and architectural design industries [1].

View-based 3D object classification is one of the key tasks for image understanding in computer vision; it is concerned with describing an object as belonging to a natural class of similar objects by utilizing multi-view data. However, it is quite difficult for artificial computer vision systems to perform visual classification well. The main difficulty lies in capturing object features that are sufficient to reflect both the similarity among intra-class objects and the variation among inter-class ones.

To overcome the problem of variability, one generally accepted strategy is to find or design a feature descriptor that is highly invariant to the variations present within each class. At the same time, the descriptor representing an object should distinguish it from objects in other classes as far as possible. However, no single feature descriptor is perfectly suited to classifying all classes. For example, a color histogram may perform well on classes with obvious color distinctions, yet be sensitive to objects in a class whose color varies widely. It is therefore widely expected that combining diverse and complementary features will yield more accurate classification than any single type of feature.

Combining multiple features as multiple modalities is a recent trend in class-level object recognition and image classification. One effective and popular approach to multimodal classification is multiple kernel learning (MKL), which uses different learning methods to determine the kernel combination function. That is, the approach linearly combines similarity functions between images such that the combined similarity function yields improved classification performance; the different kernels may correspond to different notions of similarity, or may use information coming from multiple sources (different representations or different feature subsets) [2], [3].
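As a minimal illustration of this linear kernel combination, the sketch below builds one base kernel per modality and sums them with fixed weights. The RBF base kernel, the feature dimensions, and the weights are assumptions for the example, not the paper's actual configuration:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Gaussian RBF kernel matrix between the rows of X and the rows of Y
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

def combined_kernel(feature_sets, weights, gamma=1.0):
    # MKL-style linear combination: K = sum_m beta_m * K_m,
    # one base kernel K_m per modality
    return sum(w * rbf_kernel(X, X, gamma) for w, X in zip(weights, feature_sets))

rng = np.random.default_rng(0)
X_sift = rng.normal(size=(6, 10))    # hypothetical bag-of-words SIFT features
X_fourier = rng.normal(size=(6, 8))  # hypothetical outline Fourier features
X_zernike = rng.normal(size=(6, 5))  # hypothetical Zernike moments features
K = combined_kernel([X_sift, X_fourier, X_zernike], weights=[0.5, 0.3, 0.2])
```

Because each RBF base kernel has unit diagonal, the combined kernel's diagonal equals the sum of the weights; in full MKL the weights would themselves be learned rather than fixed.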

To improve generalization when many irrelevant features are present in the dataset, we employ multi-task feature learning via l2-norm regularization, as studied in [4], which encourages the predictors of different tasks to share similar parameter sparsity patterns.

Overall the contributions of this paper could be summarized as follows:

  • (1)

    We employ a multimodal SVM to perform view-based 3D object classification, where each kernel corresponds to one modality.

  • (2)

    We employ multi-task joint feature selection for multimodal classification, where by penalizing the sum of l2-norms of the blocks of coefficients associated with each feature across different tasks, we encourage multiple predictors to have similar parameter sparsity patterns.
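The penalty in contribution (2) can be sketched as follows: each row of a coefficient matrix W (features × tasks) groups one feature's coefficients across all tasks, and the regularizer sums the l2-norms of these rows. The proximal (group soft-thresholding) step shown here is one standard way such a penalty is optimized; the matrix shapes and values are illustrative assumptions, not the paper's solver:

```python
import numpy as np

def l21_penalty(W):
    # Sum of the l2-norms of the rows of W: zeroing a whole row
    # removes that feature from every task jointly
    return np.sum(np.linalg.norm(W, axis=1))

def l21_prox(W, tau):
    # Group soft-thresholding: shrink each row's l2-norm by tau,
    # setting rows with norm below tau exactly to zero
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return W * scale

W = np.array([[3.0, 4.0],    # strong feature: survives shrinkage
              [0.1, 0.1]])   # weak feature: zeroed out across both tasks
W_new = l21_prox(W, tau=1.0)
```

The weak feature's entire row is driven to zero at once, which is exactly the shared sparsity pattern across tasks that the text describes.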

The rest of the paper is organized as follows: Section 2 presents the background of object classification and reviews related work on multimodal classification and kernel learning. Section 3 introduces our object classification method. Section 4 describes the experimental results and the corresponding evaluation. Finally, conclusions and future work are given in Section 5.


Kernel learning

Kernel learning was originally proposed in [3], which showed how the kernel matrix can be learned from data via semi-definite programming (SDP) techniques. Some studies focused on the efficiency of kernel learning [5], [6], [7], using boosting to select training instances for each base kernel. Others addressed the efficiency of kernel learning through feature selection [8], [9]. Multi-class kernel learning has also received considerable attention [10], [11]; the former exploited the…

Method

The proposed framework is illustrated in Fig. 1. First, the SIFT, outline Fourier, and Zernike moments features are extracted from the dataset, and a bag-of-words representation is constructed for the SIFT features. Multi-task feature selection is then applied to select features over these three modalities. Finally, we employ a multimodal SVM with multiple kernels to perform the object classification.
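A toy end-to-end sketch of this pipeline is given below. The random per-modality features, the fixed kernel weights, and the kernel nearest-neighbor rule (standing in for the multimodal SVM) are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def rbf(X, Y, gamma=0.5):
    # Gaussian RBF kernel between the rows of X and the rows of Y
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

def multimodal_kernel(train_feats, test_feats, weights):
    # One base kernel per modality, linearly combined
    return sum(w * rbf(Xte, Xtr)
               for w, Xtr, Xte in zip(weights, train_feats, test_feats))

def kernel_nn_predict(K_test_train, y_train):
    # Toy stand-in for the multimodal SVM: label each test sample by
    # its most similar training sample under the combined kernel
    return y_train[np.argmax(K_test_train, axis=1)]

rng = np.random.default_rng(1)
y_train = np.array([0, 0, 1, 1])

def make_modality(dim):
    # Two well-separated synthetic classes for one modality
    centers = np.array([[-2.0] * dim, [2.0] * dim])
    return centers[y_train] + 0.1 * rng.normal(size=(4, dim))

train = [make_modality(d) for d in (10, 8, 5)]   # stand-ins for SIFT / Fourier / Zernike
test = [Xtr + 0.05 for Xtr in train]             # slightly perturbed copies
K = multimodal_kernel(train, test, weights=[0.5, 0.3, 0.2])
pred = kernel_nn_predict(K, y_train)
```

Feature selection is omitted here for brevity; in the full pipeline the l2-norm-regularized selection step would run on each modality's features before the kernels are computed.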

Experiments

In our experiments, the objectives are three-fold: (1) whether using multiple modalities improves the accuracy of object classification compared to a single modality, (2) whether the multimodal SVM is a better choice for modeling multi-modality data, and (3) whether multi-task feature selection contributes to the improvement in classification. In the following, we report our experimental settings, implementation details, and quantitative results.

Conclusion

In this paper, instead of using a single modality of image features for view-based 3D object classification, we employ a multimodal support vector machine to combine three modalities of image features (the SIFT descriptor, the outline Fourier descriptor, and the Zernike moments descriptor) to discriminate among multiple object classes, where each kernel corresponds to one modality. In this way, both the independence of each modality and the interrelations between modalities are taken into…

Acknowledgments

This work is supported by the Special Fund for Earthquake Research in the Public Interest No. 201508025, the Natural Science Foundation of China (Nos. 61402388, 61422210 and 61373076), the Fundamental Research Funds for the Central Universities (Nos. 20720150080 and 2013121026), the CCF-Tencent Open Research Fund, and the Open Projects Program of the National Laboratory of Pattern Recognition.

Fuhai Chen received B.S. Degree in Information and Computing Science in 2014 from Xiamen University, Fujian, China. He is currently working towards his Master Degree at Xiamen University. His research interests include machine learning, hyperspectral remote sensing image analysis, and computer scene understanding.

References (40)

  • G. Obozinski, B. Taskar, M. Jordan, Multi-Task Feature Selection, Technical Report, Statistics Department, UC...
  • B. Siddiquie, S.N. Vitaladevuni, L.S. Davis, Combining multiple kernels for efficient image classification, in: 2009...
  • K. Crammer, J. Keshet, Y. Singer, Kernel design using boosting, in: Advances in Neural Information Processing Systems,...
  • T. Hertz, A.B. Hillel, D. Weinshall, Learning a kernel function for classification with small training samples, in:...
  • A. Rakotomamonjy, F. Bach, S. Canu, Y. Grandvalet, More efficiency in multiple kernel learning, in: Proceedings of the...
  • R. Xiao, W. Li, Y. Tian, X. Tang, Joint boosting feature selection for robust face recognition, in: 2006 IEEE Computer...
  • A. Zien, C.S. Ong, Multiclass multiple kernel learning, in: Proceedings of the 24th International Conference on Machine...
  • C.H. Lampert, M.B. Blaschko, A multiple kernel learning approach to joint multi-class object detection, in: Pattern...
  • K. Knight et al., Asymptotics for lasso-type estimators, Ann. Stat. (2000)
  • D.L. Donoho, For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution, Commun. Pure Appl. Math. (2006)


Rongrong Ji is a professor at Xiamen University, where he directs the Intelligent Multimedia Technology Laboratory (http://imt.xmu.edu.cn) and serves as a dean assistant in the School of Information Science and Engineering. He was a postdoctoral research fellow in the Department of Electrical Engineering, Columbia University, from 2010 to 2013, working with Professor Shih-Fu Chang. He obtained his Ph.D. degree in Computer Science from Harbin Institute of Technology, graduating with a Best Thesis Award at HIT. He was a visiting student at the University of Texas at San Antonio working with Professor Qi Tian, a research assistant at Peking University working with Professor Wen Gao in 2010, and a research intern at Microsoft Research Asia working with Dr. Xing Xie from 2007 to 2008. He is the author of over 40 tier-1 journal and conference papers, including IJCV, TIP, TMM, ICCV, CVPR, IJCAI, AAAI, and ACM Multimedia. His research interests include image and video search, content understanding, mobile visual search, and social multimedia analytics. He received the Best Paper Award at ACM Multimedia 2011 and a Microsoft Fellowship in 2007. He is a guest editor for IEEE Multimedia Magazine, Neurocomputing, and ACM Multimedia Systems Journal. He has been a special session chair of MMM 2014, VCIP 2013, MMM 2013, and PCM 2012, a program chair of ICIMCS 2016, and Local Arrangement Chair of MMSP 2015. He serves as a reviewer for IEEE TPAMI, IJCV, TIP, TMM, CSVT, TSMC A/B/C, and IEEE Signal Processing Magazine, and is on the program committees of over 10 top conferences including CVPR 2013, ICCV 2013, ECCV 2012, and ACM Multimedia 2010–2013.

Liujuan Cao is currently an assistant professor at the Department of Computer Science, Xiamen University. Before that, she obtained her Ph.D. degree from Harbin Engineering University. Her research is in the field of multimedia analysis, geo-science and remote sensing, and computer vision. She has published extensively at CVPR, Neurocomputing, Signal Processing, ICIP, and VCIP.
