
Classifier and Exemplar Synthesis for Zero-Shot Learning

Abstract

Zero-shot learning (ZSL) enables solving a task without the need to see its examples. In this paper, we propose two ZSL frameworks that learn to synthesize parameters for novel unseen classes. First, we propose to cast the problem of ZSL as learning manifold embeddings from graphs composed of object classes, leading to a flexible approach that synthesizes “classifiers” for the unseen classes. Then, we define an auxiliary task of synthesizing “exemplars” for the unseen classes to be used as an automatic denoising mechanism for any existing ZSL approaches or as an effective ZSL model by itself. On five visual recognition benchmark datasets, we demonstrate the superior performances of our proposed frameworks in various scenarios of both conventional and generalized ZSL. Finally, we provide valuable insights through a series of empirical analyses, among which are a comparison of semantic representations on the full ImageNet benchmark as well as a comparison of metrics used in generalized ZSL. Our code and data are publicly available at https://github.com/pujols/Zero-shot-learning-journal.

Notes

  1. In this work, classifiers are taken to be the normals of hyperplanes separating different classes (i.e., linear classifiers).

  2. In the context of deep neural networks for classification, one can think of \(\varvec{w}_c\) as the vector corresponding to class c in the last fully-connected layer and \({\varvec{x}}\) as the input to that layer.

  3. In practice, we found these initializations to be highly effective—even keeping the initial \(\varvec{b}_r\) intact while only learning \(\varvec{v}_r\) for \(r = 1,\ldots ,\mathsf {R}\) can already achieve comparable results. In most of our experiments, we thus only learn \(\varvec{v}_r\) for \(r = 1,\ldots ,\mathsf {R}\).

  4. There is one class in the ILSVRC 2012 1K dataset that does not appear in the ImageNet 2011 21K dataset. Thus, we have a total of 20,842 unseen classes to evaluate.

  5. http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2 on September 1, 2015.

  6. http://www.image-net.org/api/xml/structure_released.xml.

  7. In SynC, \(f_c({\varvec{x}}) = \varvec{w}_c^{\text {T}}{\varvec{x}}= (\sum _{r=1}^\mathsf {R}s_{cr}\varvec{v}_r)^{\text {T}}{\varvec{x}}\) [cf. Sect. 2.1.3 and Eq. (10)]. In EXEM, \(f_c({\varvec{x}}) =- \text {dis}_{NN}(\varvec{M}{\varvec{x}}, \varvec{\psi }(\varvec{a}_c))\) if we treat \(\varvec{\psi }(\varvec{a}_c)\) as data and apply a nearest neighbor classifier [cf. Sect. 2.2.2 and Eq. (13)].

  8. Wang et al. (2018) and Kampffmeyer et al. (2019) extracted word vectors of class names by averaging the vectors of words in the synset name, enabling all 20,842 unseen classes to have word vectors. The number of 2-hop, 3-hop, and All classes are thus 1,589, 7,860, and 20,842, respectively.

  9. For interested readers, if we set the number of attributes as the number of phantom classes (each \(\varvec{b}_r\) is the one-hot representation of an attribute), and use the Gaussian kernel with an isotropically diagonal covariance matrix in Eq. (3) with properly set bandwidths (either very small or very large) for each attribute, we will recover the formulation in Akata et al. (2013), Akata et al. (2015) when the bandwidths tend to zero or infinity.

  10. https://code.google.com/p/word2vec/.

  11. For GoogLeNet features, we follow Changpinyo et al. (2017) to set \(\lambda =1\) and \(\mathsf {d}=500\) for all experiments.

  12. For CV-distance, we set \(\mathsf {d}=500\) for all experiments. This is because the smaller \(\mathsf {d}\) is, the smaller the distance is, so CV-distance values would not be comparable across different \(\mathsf {d}\).

  13. We treat rows of each distance matrix as data points and compute the Pearson correlation coefficients between matrices.

References

  • Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I. J., Harp, A., Irving, G., Isard, M., Jia, Y., Józefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D. G., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P. A., Vanhoucke, V., Vasudevan, V., Viégas, F. B., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., & Zheng, X. (2016). Tensorflow: Large-scale machine learning on heterogeneous distributed systems. In: OSDI.

  • Akata, Z., Perronnin, F., Harchaoui, Z., & Schmid, C. (2013). Label-embedding for attribute-based classification. In CVPR.

  • Akata, Z., Reed, S., Walter, D., Lee, H., & Schiele, B. (2015). Evaluation of output embeddings for fine-grained image classification. In CVPR.

  • Al-Halah, Z., & Stiefelhagen, R. (2015). How to transfer? Zero-shot object recognition via hierarchical transfer of semantic attributes. In WACV.

  • Argyriou, A., Evgeniou, T., & Pontil, M. (2008). Convex multi-task feature learning. Machine Learning, 73, 243–272.

  • Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6), 1373–1396.

  • Bucher, M., Herbin, S., & Jurie, F. (2018). Zero-shot classification by generating artificial visual features. In RFIAP.

  • Changpinyo, S., Chao, W.-L., Gong, B., & Sha, F. (2016). Synthesized classifiers for zero-shot learning. In CVPR.

  • Changpinyo, S., Chao, W.-L., & Sha, F. (2017). Predicting visual exemplars of unseen classes for zero-shot learning. In ICCV.

  • Chao, W.-L., Changpinyo, S., Gong, B., & Sha, F. (2016). An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV.

  • Chen, C.-Y., & Grauman, K. (2014). Inferring analogous attributes. In CVPR.

  • Crammer, K., & Singer, Y. (2002). On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2, 265–292.

  • Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In CVPR.

  • Duan, K., Parikh, D., Crandall, D., & Grauman, K. (2012). Discovering localized attributes for fine-grained recognition. In CVPR.

  • Elhoseiny, M., Saleh, B., & Elgammal, A. (2013). Write a classifier: Zero-shot learning using purely textual descriptions. In ICCV.

  • Farhadi, A., Endres, I., Hoiem, D., & Forsyth, D. (2009). Describing objects by their attributes. In CVPR.

  • Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Ranzato, M. A., & Mikolov, T. (2013). DeViSE: A deep visual-semantic embedding model. In NIPS.

  • Fu, Y., Hospedales, T. M., Xiang, T., Fu, Z., & Gong, S. (2014). Transductive multi-view embedding for zero-shot recognition and annotation. In ECCV.

  • Fu, Y., Hospedales, T. M., Xiang, T., & Gong, S. (2015). Transductive multi-view zero-shot learning. TPAMI.

  • Fu, Y., Xiang, T., Jiang, Y.-G., Xue, X., Sigal, L., & Gong, S. (2018). Recent advances in zero-shot recognition: Toward data-efficient understanding of visual content. IEEE Signal Processing Magazine, 35, 112–125.

  • Gan, C., Lin, M., Yang, Y., Zhuang, Y., & Hauptmann, A. G. (2015). Exploring semantic interclass relationships (SIR) for zero-shot action recognition. In AAAI.

  • Gan, C., Yang, T., & Gong, B. (2016). Learning attributes equals multi-source domain generalization. In CVPR.

  • Garcia, S., & Herrera, F. (2008). An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. JMLR, 9, 2677–2694.

  • Gavves, E., Mensink, T., Tommasi, T., Snoek, C. G., & Tuytelaars, T. (2015). Active transfer learning with zero-shot priors: Reusing past datasets for future tasks. In ICCV.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.

  • Hinton, G. E., & Roweis, S. T. (2002). Stochastic neighbor embedding. In NIPS.

  • Jayaraman, D., & Grauman, K. (2014). Zero-shot recognition with unreliable attributes. In NIPS.

  • Jayaraman, D., Sha, F., & Grauman, K. (2014). Decorrelating semantic visual attributes by resisting the urge to share. In CVPR.

  • Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., & Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia.

  • Kampffmeyer, M., Chen, Y., Liang, X., Wang, H., Zhang, Y., & Xing, E. P. (2019). Rethinking knowledge graph propagation for zero-shot learning. In CVPR.

  • Karessli, N., Akata, Z., Bulling, A., & Schiele, B. (2017). Gaze embeddings for zero-shot image classification. In CVPR.

  • Kipf, T. N., & Welling, M. (2017). Semi-supervised classification with graph convolutional networks. In ICLR.

  • Kodirov, E., Xiang, T., Fu, Z., & Gong, S. (2015). Unsupervised domain adaptation for zero-shot learning. In ICCV.

  • Kodirov, E., Xiang, T., & Gong, S. (2017). Semantic autoencoder for zero-shot learning. In CVPR.

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In NIPS.

  • Kumar Verma, V., Arora, G., Mishra, A., & Rai, P. (2018). Generalized zero-shot learning via synthesized examples. In CVPR.

  • Lampert, C. H., Nickisch, H., & Harmeling, S. (2009). Learning to detect unseen object classes by between-class attribute transfer. In CVPR.

  • Lampert, C. H., Nickisch, H., & Harmeling, S. (2014). Attribute-based classification for zero-shot visual object categorization. TPAMI, 36(3), 453–465.

  • Lei Ba, J., Swersky, K., Fidler, S., & Salakhutdinov, R. (2015). Predicting deep zero-shot convolutional neural networks using textual descriptions. In ICCV.

  • Li, X., Guo, Y., & Schuurmans, D. (2015). Semi-supervised zero-shot classification with label representation learning. In ICCV.

  • Long, Y., Liu, L., Shao, L., Shen, F., Ding, G., & Han, J. (2017). From zero-shot learning to conventional supervised classification: Unseen visual data synthesis. In CVPR.

  • Lu, Y. (2016). Unsupervised learning of neural network outputs. In IJCAI.

  • Mansimov, E., Parisotto, E., Ba, J. L., & Salakhutdinov, R. (2016). Generating images from captions with attention. In ICLR.

  • Mensink, T., Gavves, E., & Snoek, C. G. (2014). COSTA: Co-occurrence statistics for zero-shot classification. In CVPR.

  • Mensink, T., Verbeek, J., Perronnin, F., & Csurka, G. (2013). Distance-based image classification: Generalizing to new classes at near-zero cost. TPAMI, 35(11), 2624–2637.

  • Mikolov, T., Chen, K., Corrado, G. S., & Dean, J. (2013a). Efficient estimation of word representations in vector space. In ICLR Workshops.

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In NIPS.

  • Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39–41.

  • Morgado, P., & Vasconcelos, N. (2017). Semantically consistent regularization for zero-shot recognition. In CVPR.

  • Norouzi, M., Mikolov, T., Bengio, S., Singer, Y., Shlens, J., Frome, A., Corrado, G. S., & Dean, J. (2014). Zero-shot learning by convex combination of semantic embeddings. In ICLR Workshops.

  • Palatucci, M., Pomerleau, D., Hinton, G. E., & Mitchell, T. M. (2009). Zero-shot learning with semantic output codes. In NIPS.

  • Parikh, D., & Grauman, K. (2011). Interactively building a discriminative vocabulary of nameable attributes. In CVPR.

  • Patterson, G., Xu, C., Su, H., & Hays, J. (2014). The SUN Attribute Database: Beyond categories for deeper scene understanding. IJCV, 108(1–2), 59–81.

  • Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In EMNLP.

  • Rebuffi, S.-A., Kolesnikov, A., Sperl, G., & Lampert, C. H. (2017). iCaRL: Incremental classifier and representation learning. In CVPR.

  • Reed, S., Akata, Z., Lee, H., & Schiele, B. (2016a). Learning deep representations of fine-grained visual descriptions. In CVPR.

  • Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., & Lee, H. (2016b). Generative adversarial text to image synthesis. In ICML.

  • Ristin, M., Guillaumin, M., Gall, J., & Van Gool, L. (2016). Incremental learning of random forests for large-scale image classification. TPAMI, 38(3), 490–503.

  • Rohrbach, M., Stark, M., & Schiele, B. (2011). Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In CVPR.

  • Rohrbach, M., Stark, M., Szarvas, G., Gurevych, I., & Schiele, B. (2010). What helps where–and why? semantic relatedness for knowledge transfer. In CVPR.

  • Romera-Paredes, B., & Torr, P. H. S. (2015). An embarrassingly simple approach to zero-shot learning. In ICML.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. IJCV.

  • Salakhutdinov, R., Torralba, A., & Tenenbaum, J. (2011). Learning to share visual appearance for multiclass object detection. In CVPR.

  • Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press.

  • Schölkopf, B., Smola, A. J., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural computation, 12(5), 1207–1245.

  • Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In ICLR.

  • Socher, R., Ganjoo, M., Manning, C. D., & Ng, A. Y. (2013). Zero-shot learning through cross-modal transfer. In NIPS.

  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In CVPR.

  • Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. JMLR, 9, 2579–2605.

  • Van Horn, G., & Perona, P. (2017). The devil is in the tails: Fine-grained classification in the wild. arXiv preprint arXiv:1709.01450.

  • Verma, V. K., & Rai, P. (2017). A simple exponential family framework for zero-shot learning. In ECML/PKDD.

  • Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.

  • Wang, Q., & Chen, K. (2017). Zero-shot visual recognition via bidirectional latent embedding. IJCV, 124, 356–383.

  • Wang, X., Ye, Y., & Gupta, A. (2018). Zero-shot recognition via semantic embeddings and knowledge graphs. In CVPR.

  • Xian, Y., Akata, Z., Sharma, G., Nguyen, Q., Hein, M., & Schiele, B. (2016). Latent embeddings for zero-shot classification. In CVPR.

  • Xian, Y., Lampert, C. H., Schiele, B., & Akata, Z. (2018a). Zero-shot learning - a comprehensive evaluation of the Good, the Bad and the Ugly. TPAMI.

  • Xian, Y., Lorenz, T., Schiele, B., & Akata, Z. (2018b). Feature generating networks for zero-shot learning. In CVPR.

  • Xian, Y., Schiele, B., & Akata, Z. (2017). Zero-shot learning - the Good, the Bad and the Ugly. In CVPR.

  • Xiao, J., Hays, J., Ehinger, K., Oliva, A., & Torralba, A. (2010). SUN Database: Large-scale scene recognition from abbey to zoo. In CVPR.

  • Xu, X., Hospedales, T., & Gong, S. (2015). Semantic embedding space for zero-shot action recognition. In ICIP.

  • Yan, X., Yang, J., Sohn, K., & Lee, H. (2016). Attribute2Image: Conditional image generation from visual attributes. In ECCV.

  • Yang, Y., & Hospedales, T. M. (2015). A unified perspective on multi-domain and multi-task learning. In ICLR.

  • Yu, F. X., Cao, L., Feris, R. S., Smith, J. R., & Chang, S.-F. (2013). Designing category-level attributes for discriminative visual recognition. In CVPR.

  • Zhang, L., Xiang, T., & Gong, S. (2017). Learning a deep embedding model for zero-shot learning. In CVPR.

  • Zhang, Z., & Saligrama, V. (2015). Zero-shot learning via semantic similarity embedding. In ICCV.

  • Zhang, Z., & Saligrama, V. (2016). Zero-shot learning via joint latent similarity embedding. In CVPR.

  • Zhou, B., Lapedriza, A., Khosla, A., Oliva, A., & Torralba, A. (2018). Places: A 10 million image database for scene recognition. TPAMI, 40, 1452–1464.

  • Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In NIPS.

  • Zhu, X., Anguelov, D., & Ramanan, D. (2014). Capturing long-tail distributions of object subcategories. In CVPR.

  • Zhu, Y., Elhoseiny, M., Liu, B., Peng, X., & Elgammal, A. (2018). A generative adversarial approach for zero-shot learning from noisy texts. In CVPR.

Acknowledgements

This work is partially supported by USC Graduate Fellowships, NSF IIS-1065243, 1451412, 1513966/1632803/1833137, 1208500, CCF-1139148, a Google Research Award, an Alfred P. Sloan Research Fellowship, gifts from Facebook and Netflix, and ARO# W911NF-12-1-0241 and W911NF-15-1-0484.

Author information

Corresponding author

Correspondence to Soravit Changpinyo.

Additional information

Communicated by Christoph H. Lampert.

Appendices

Appendix A: Details on How to Obtain Word Vectors on ImageNet

We use the word2vec package (see footnote 10). We preprocess the input corpus with the word2phrase function so that we can directly obtain word vectors for both single-word and multiple-word terms, including those terms in the ImageNet synsets; each class of ImageNet is a synset: a set of synonymous terms, where each term is a word or a phrase. We impose no restriction on the vocabulary size. Following Frome et al. (2013), we use a window size of 20, apply the hierarchical softmax for predicting adjacent terms, and train the model for a single epoch. As one class may correspond to multiple word vectors by the nature of synsets, we simply average them to form a single word vector for each class.
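The following is a minimal sketch of this pipeline using the gensim library in place of the original word2vec tool; the toy corpus, the skip-gram choice, and the vector dimensionality are placeholders rather than the exact settings of our released code.

```python
# Minimal sketch (not the released code): gensim stands in for the word2vec
# C tool; the corpus and the vector dimensionality below are placeholders.
import numpy as np
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

# Toy stand-in for the tokenized Wikipedia dump.
sentences = [["the", "killer", "whale", "swims"],
             ["a", "killer", "whale", "breaches", "near", "the", "boat"]]

# word2phrase analogue: merge frequent bigrams so that multi-word terms
# (e.g., "killer_whale") receive their own vectors.
bigram = Phrases(sentences, min_count=1, threshold=1)
phrased = [bigram[s] for s in sentences]

# Window size 20, hierarchical softmax, a single epoch, and no vocabulary
# restriction, as described above.
model = Word2Vec(phrased, vector_size=500, window=20, sg=1,
                 hs=1, negative=0, min_count=1, epochs=1)

def class_vector(synset_terms):
    """Average the vectors of a synset's terms (words or phrases)."""
    vecs = [model.wv[t] for t in synset_terms if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else None

print(class_vector(["killer_whale", "whale"]))
```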

Fig. 8 Data splitting for different cross-validation (CV) strategies: (a) the seen-unseen class splitting for (conventional) zero-shot learning, (b) the sample-wise CV, (c) the class-wise CV

Appendix B: Hyper-parameter Tuning

1.1 B.1 For Conventional Zero-Shot Learning

The standard approach for cross-validation (CV) in a classification task splits training data into several folds such that they share the same set of class labels. This strategy is less sensible in zero-shot learning as it does not imitate what actually happens at the test stage. We thus adopt the strategy in Elhoseiny et al. (2013), Akata et al. (2015), Romera-Paredes and Torr (2015), and Zhang and Saligrama (2015). In this scheme, we split training data into several folds such that the class labels of these folds are disjoint. We then hold out data from one fold as pseudo-unseen classes, train our models on the remaining folds (which belong to the remaining classes), and tune hyper-parameters based on a certain performance metric on the held-out fold. For clarity, we denote the standard CV as sample-wise CV and the zero-shot CV scheme as class-wise CV. Figure 8 illustrates the two scenarios.
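The following is a minimal sketch of class-wise splitting; the labels, the number of folds, and the model-fitting step are placeholders.

```python
# Minimal sketch of class-wise CV: folds are formed over class labels, so
# each held-out fold plays the role of pseudo-unseen classes.
import numpy as np

def class_wise_folds(labels, num_folds=5, seed=0):
    """Yield (train_idx, heldout_idx) pairs with disjoint class labels."""
    rng = np.random.RandomState(seed)
    classes = rng.permutation(np.unique(labels))
    for fold_classes in np.array_split(classes, num_folds):
        heldout = np.isin(labels, fold_classes)
        yield np.where(~heldout)[0], np.where(heldout)[0]

labels = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])   # toy example
for train_idx, val_idx in class_wise_folds(labels):
    pass  # fit a ZSL model on train_idx; evaluate zero-shot on the val_idx classes
```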

We use this strategy to tune hyper-parameters in both our approaches (SynC and EXEM) and the baselines. In SynC, the main hyper-parameters are the regularization parameter \(\lambda \) in Eq. (6) and the scaling parameter \(\sigma \) in Eq. (3). When learning semantic representations [Eq. (9)], we also tune \(\eta \) and \(\gamma \). To reduce the search space during CV, we first fix \(\varvec{b}_r = \varvec{a}_r\) for \(r = 1,\ldots ,\mathsf {R}\) and tune \(\lambda , \sigma \). Then we fix \(\lambda \) and \(\sigma \) and tune \(\eta \) and \(\gamma \). The metric is the classification accuracy.

Table 10 Expanded results of Table 4. The metric is “per-sample” accuracy for F@K to aid comparison with previous published results

In EXEM, we tune (a) the projected dimensionality \(\mathsf {d}\) for PCA and (b) \(\lambda \), \(\nu \), and the RBF-kernel bandwidth in SVR (see footnote 11). Since EXEM is a two-stage approach, we consider the following two performance metrics. The first one minimizes the distance between the predicted exemplars and the ground-truth (average of the hold-out data of each class after the PCA projection) in \(\mathbb {R}^{\mathsf {d}}\). We use the Euclidean distance in this case. We term this measure “CV-distance.” This approach does not assume the downstream task at training and aims to measure the quality of predicted exemplars by their faithfulness. The other approach, “CV-accuracy,” maximizes the per-class classification accuracy on the hold-out fold. This measure can easily be obtained for EXEM (1NN) and EXEM (1NNs), which use simple decision rules that have no further hyper-parameters to tune. Empirically, we found that CV-accuracy generally leads to slightly better performance. The results reported in the main text for these two approaches are thus based on this measure. On the other hand, EXEM (ZSL method) (where ZSL method = SynC, ConSE, ESZSL) requires further hyper-parameter tuning. For computational purposes, we use CV-distance (see footnote 12) for tuning hyper-parameters of the regressors, followed by the hyper-parameter tuning for ZSL methods using the predicted exemplars. As SynC and ConSE construct their classifiers based on the distance values between class semantic representations, we do not expect a significant performance drop in this case.
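As an illustration, the sketch below computes both metrics on a held-out fold; the arrays (PCA-projected held-out features, their labels, and predicted exemplars whose rows follow the order of the pseudo-unseen classes) are placeholders.

```python
# Sketch of the two CV metrics on a held-out (pseudo-unseen) fold; all arrays
# below are toy placeholders (X_val lives in the PCA-projected space).
import numpy as np

def cv_distance(pred_exemplars, X_val, y_val, classes):
    """Mean Euclidean distance between predicted and ground-truth exemplars,
    where the ground truth is the per-class mean of the held-out data."""
    gt = np.stack([X_val[y_val == c].mean(axis=0) for c in classes])
    return np.linalg.norm(pred_exemplars - gt, axis=1).mean()

def cv_accuracy(pred_exemplars, X_val, y_val, classes):
    """Per-class accuracy of a 1-NN classifier using predicted exemplars."""
    d = ((X_val[:, None, :] - pred_exemplars[None, :, :]) ** 2).sum(-1)
    pred = classes[np.argmin(d, axis=1)]
    return np.mean([np.mean(pred[y_val == c] == c) for c in classes])

classes = np.array([0, 1])                      # pseudo-unseen class ids (toy)
X_val = np.random.rand(6, 4)
y_val = np.array([0, 0, 0, 1, 1, 1])
pred_exemplars = np.random.rand(2, 4)           # psi(a_c) for the two classes
print(cv_distance(pred_exemplars, X_val, y_val, classes))
print(cv_accuracy(pred_exemplars, X_val, y_val, classes))
```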

Table 11 Expanded results of the third section of Table 7

1.2 B.2 For Generalized Zero-shot Learning

To perform class-wise CV in the generalized zero-shot learning (GZSL) setting, we further separate each fold into two splits, each with either 80% or 20% of data. We then hold out one fold, train models on the \(80\%\) splits of the remaining folds, and tune hyper-parameters based on a certain performance metric on (i) the \(80\%\) split of the hold-out fold and (ii) the \(20\%\) splits of the training (i.e., remaining) folds. In this way we can mimic the GZSL setting in hyper-parameter tuning. Specifically, for metrics with calibration (cf. Table 6), we first compute AUSUC using (i) and (ii) to tune the hyper-parameters mentioned in Sect. B.1, and select the calibration factor \(\gamma \) that maximizes the harmonic mean. For the uncalibrated harmonic mean, we follow Xian et al. (2018a) to tune hyper-parameters in the same way as in the conventional ZSL setting.
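A minimal sketch of this tuning loop is shown below. It assumes the calibrated stacking rule of Chao et al. (2016), i.e., subtracting \(\gamma \) from the scores of seen classes, and all arrays (joint-label-space scores, a seen-class mask, and a grid of candidate \(\gamma \) values) are placeholders.

```python
# Hedged sketch of calibration for GZSL: scores span the joint label space,
# seen_mask marks seen classes with 1, and gammas is a candidate grid.
# Every class is assumed to have at least one test sample.
import numpy as np

def seen_unseen_accuracies(scores, y_true, seen_mask, gamma):
    """Per-class accuracies on seen and unseen classes after calibration."""
    pred = np.argmax(scores - gamma * seen_mask, axis=1)  # penalize seen-class scores
    def per_class_acc(class_ids):
        return np.mean([np.mean(pred[y_true == c] == c) for c in class_ids])
    return (per_class_acc(np.where(seen_mask > 0)[0]),
            per_class_acc(np.where(seen_mask == 0)[0]))

def tune_calibration(scores, y_true, seen_mask, gammas):
    """Return the gamma maximizing the harmonic mean, plus an AUSUC estimate."""
    curve = [seen_unseen_accuracies(scores, y_true, seen_mask, g) for g in gammas]
    a_s, a_u = map(np.array, zip(*curve))
    order = np.argsort(a_s)
    ausuc = np.trapz(a_u[order], a_s[order])      # area under the A_U-vs-A_S curve
    hmean = 2 * a_s * a_u / np.maximum(a_s + a_u, 1e-12)
    return gammas[int(np.argmax(hmean))], ausuc

scores = np.random.rand(20, 5)                   # toy scores over 5 classes
y_true = np.arange(20) % 5
seen_mask = np.array([1, 1, 1, 0, 0], dtype=float)
print(tune_calibration(scores, y_true, seen_mask, gammas=np.linspace(-1, 1, 21)))
```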

Table 12 Expanded results of Table 7 with “per-sample” accuracy used to differentiate this accuracy from the “per-class” one in Table 7

Appendix C: Experimental Results on ImageNet with Previous Experimental Setups

The first ZSL work on ImageNet and much of its follow-up consider only the 2-hop, 3-hop, and All test sets, together with evaluation metrics different from the per-class accuracy used in our main results. We include our results here in Tables 10, 11, and 12 to aid comparison with such work. As mentioned in Sect. 3.4.1, we also consider Flat hit@K (F@K) and Hierarchical precision@K (HP@K). F@K is defined as the percentage of test images for which the model returns the true label in its top K predictions. HP@K is defined as the percentage of overlap (i.e., precision) between the model’s top K predictions and the ground-truth list. For each class, the ground-truth list of its K closest categories is generated based on the ImageNet hierarchy. Note that F@1 is the per-sample multi-way classification accuracy.

When computing Hierarchical precision@K (HP@K), we use the algorithm in the “Appendix” of Frome et al. (2013) to compute the ground-truth list, a set of at least K classes that are considered to be correct. This set is called hCorrectSet and it is computed for each K and class c. See Algorithm 1 for more details. The main idea is to expand the radius around the true class c until the set has at least K classes.

Algorithm 1 Computing hCorrectSet for Hierarchical precision@K by expanding the radius around the true class c (following Frome et al. 2013)

Note that validRadiusSet depends on which classes are in the label space to be predicted (i.e., on whether we consider 2-hop, 3-hop, or All). We obtain the label sets for 2-hop and 3-hop from the authors of Frome et al. (2013), Norouzi et al. (2014). We implement Algorithm 1 to derive hCorrectSet ourselves.
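The sketch below restates Algorithm 1 in code form; the within_radius helper, which returns the set of classes within a given number of hops of a class in the ImageNet hierarchy, is a hypothetical interface and not part of our released code.

```python
# Code form of Algorithm 1, following the description above. The within_radius
# callable (classes within a given number of hops of a class in the ImageNet
# hierarchy) is a hypothetical interface, not our released code.

def h_correct_set(true_class, k, within_radius, label_space):
    """Smallest hierarchy neighborhood of true_class with >= k valid classes."""
    assert len(label_space) >= k, "label space must contain at least k classes"
    radius = 0
    while True:
        valid_radius_set = within_radius(true_class, radius) & label_space
        if len(valid_radius_set) >= k:
            return valid_radius_set
        radius += 1

def hp_at_k(topk_predictions, true_class, k, within_radius, label_space):
    """Fraction of the top-K predictions that fall inside hCorrectSet."""
    correct = h_correct_set(true_class, k, within_radius, label_space)
    return len(set(topk_predictions[:k]) & correct) / float(k)
```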

Appendix D: Analysis on SynC

In this section, we focus on SynC\(^\text {o-vs-o}\) together with GoogLeNet features and the standard split (SS). We look at the effect of modifying the regularization term, learning base semantic representations, and varying the number of base classes and their correlations.

Table 13 Comparison between regularization with \(\varvec{w}_c\) and \(\varvec{v}_c\) on SynC\(^\text {o-vs-o}\)
Table 14 Effect of learning semantic representations
Fig. 9 We vary the number of phantom classes \(\mathsf {R}\) as a percentage of the number of seen classes \(\mathsf {S}\) and investigate how much that will affect classification accuracy (the vertical axis corresponds to the ratio with respect to the accuracy when \(\mathsf {R}= \mathsf {S}\)). The base classifiers are learned with SynC\(^\text {o-vs-o}\) (Color figure online)

Fig. 10 Percentages of basis components required to capture 95% of variance in classifier matrices for AwA and CUB (Color figure online)

Table 15 We compute the Euclidean distance matrix between the unseen classes based on semantic representations (\(\varvec{D}_{\varvec{a}_u}\)), predicted exemplars (\(\varvec{D}_{\varvec{\psi }(\varvec{a}_u)}\)), and real exemplars (\(\varvec{D}_{\varvec{v}_u}\))

1.1 D.1 Different Forms of Regularization

In Eqs. (6) and (9), \(\left\| \varvec{w}_c\right\| _2^2\) is the regularization term. Here we consider modifying that term to \(\left\| \varvec{v}_r\right\| _2^2\)—regularizing the bases directly. Table 13 shows that \(\left\| \varvec{v}_r\right\| _2^2\) leads to better results. However, we find that learning with \(\left\| \varvec{v}_r\right\| _2^2\) converges much slower than with \(\left\| \varvec{w}_c\right\| _2^2\). Thus, we use \(\left\| \varvec{w}_c\right\| _2^2\) in our main experiments (though it puts our methods at a disadvantage).

1.2 D.2 Learning Phantom Classes’ Semantic Representations

So far we have adopted the version of SynC that sets the number of base classifiers to the number of seen classes \(\mathsf {S}\) and sets \(\varvec{b}_r = \varvec{a}_c\) for \(r=c\). Here we study whether we can instead learn the semantic representations of the phantom classes that correspond to the base classifiers. The results in Table 14 suggest that learning these representations could have a positive effect.

1.3 D.3 How Many Base Classifiers are Necessary?

In Fig. 9, we investigate how many base classifiers are needed. So far, we have set that number to the number of seen classes for convenience. The plot shows that, in fact, a smaller number (\(\sim 60\%\)) is enough for our algorithm to reach the plateau of the performance curve. Moreover, increasing the number of base classifiers beyond this does not seem to have a significant effect.

Note that the semantic representations \(\varvec{b}_r\) of the phantom classes are set equal to \(\varvec{a}_r\) for all \(r\in \{1, \ldots , \mathsf {R}\}\) at 100% (i.e., \(\mathsf {R}=\mathsf {S}\)). For percentages smaller than 100%, we perform K-means clustering and set each \(\varvec{b}_r\) to be a cluster centroid after \(\ell _2\) normalization (in this case, \(\mathsf {R}= K\)). For percentages larger than 100%, we set the first \(\mathsf {S}\) representations \(\varvec{b}_r\) to be \(\varvec{a}_r\), and the remaining ones to be random combinations of the \(\varvec{a}_r\) (also with \(\ell _2\) normalization on \(\varvec{b}_r\)).
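A sketch of this initialization, under one reading of the description above and with placeholder names, is given below.

```python
# One reading of the initialization above: K-means centroids for R < S, the
# a_r themselves for R = S, and random combinations of the a_r for the extra
# phantom classes when R > S, with l2 normalization applied when R != S.
import numpy as np
from sklearn.cluster import KMeans

def init_phantom_representations(A_seen, R, seed=0):
    """A_seen: (S, dim) seen-class semantic representations; returns (R, dim)."""
    S = A_seen.shape[0]
    if R == S:
        return A_seen.copy()                       # b_r = a_r
    if R < S:
        B = KMeans(n_clusters=R, n_init=10,
                   random_state=seed).fit(A_seen).cluster_centers_
    else:
        rng = np.random.RandomState(seed)
        extra = rng.rand(R - S, S) @ A_seen        # random combinations of the a_r
        B = np.vstack([A_seen, extra])
    return B / np.linalg.norm(B, axis=1, keepdims=True)   # l2 normalization

B = init_phantom_representations(np.random.rand(50, 85), R=30)   # toy usage
```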

We have shown that even with fewer base (phantom) classifiers than seen classes (e.g., around 60%), we obtain comparable or even better results, especially for CUB. We surmise that this is because CUB is a fine-grained recognition benchmark with higher correlations among its classes, and we provide an analysis in Fig. 10 to justify this.

Table 16 Overlap of k-nearest classes (in %) on AwA, CUB, SUN. We measure the overlap between those searched by real exemplars and those searched by semantic representations (i.e., attributes) or predicted exemplars
Fig. 11 t-SNE (Van der Maaten and Hinton 2008) visualization of randomly selected real images (crosses) and predicted visual exemplars (circles) for the unseen classes on (from left to right, then from top to bottom) AwA, CUB, SUN, and ImageNet. Different colors of symbols denote different unseen classes. Perfect predictions of visual features would result in well-aligned crosses and circles of the same color. Plots for CUB and SUN are based on their first splits of SS. Plots for ImageNet are based on 48 randomly selected unseen classes from 2-hop and word vectors as semantic representations. Best viewed in color (Color figure online)

Table 17 Comparison between EXEM (1NN) with support vector regressors (SVR) and with 2-layer multi-layer perceptron (MLP) for predicting visual exemplars
Table 18 Accuracy of EXEM (1NN) on AwA, CUB, and SUN when predicted exemplars are from original visual features (No PCA) and PCA-projected features (PCA with \(\mathsf {d}\) = 1024, 500, 200, 100, 50, 10)

We train one-versus-other classifiers for each value of the regularization parameter on both AwA and CUB, and then perform PCA on the resulting classifier matrices. We then plot the number (as a percentage) of PCA components required to capture 95% of the variance in the classifiers. Clearly, AwA requires more. This explains why we see a drop in accuracy for AwA but not CUB when using even fewer base classifiers. In particular, the low percentage for CUB in Fig. 10 implies that fewer base classifiers suffice. Given that CUB is a fine-grained recognition benchmark, this result is not surprising in retrospect, as its classes are highly correlated.
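This analysis amounts to a short PCA computation; in the sketch below, the classifier matrix is a random stand-in for the learned one-versus-other classifiers.

```python
# Sketch of the analysis in Fig. 10: fraction of principal components needed
# to capture 95% of the variance of a classifier matrix W (rows = classifiers).
import numpy as np
from sklearn.decomposition import PCA

def fraction_of_components_for_variance(W, target=0.95):
    ratios = PCA().fit(W).explained_variance_ratio_
    k = int(np.searchsorted(np.cumsum(ratios), target) + 1)
    return 100.0 * k / len(ratios)

W = np.random.rand(200, 1024)   # toy stand-in for a learned classifier matrix
print(fraction_of_components_for_variance(W))
```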

Appendix E: Analysis on EXEM

In this section, we provide more analysis on EXEM. We focus on GoogLeNet features and the standard split (SS). We provide both qualitative and quantitative measures of predicted exemplars. We also investigate neural networks for exemplar prediction functions and the effect of PCA.

1.1 E.1 Quality of Predicted Exemplars

We first show that predicted visual exemplars better reflect visual similarities between classes than semantic representations. Let \(\varvec{D}_{\varvec{a}_u}\) be the pairwise Euclidean distance matrix between unseen classes computed from semantic representations (i.e., \(\mathsf {U}\) by \(\mathsf {U}\)), \(\varvec{D}_{\varvec{\psi }(\varvec{a}_u)}\) the distance matrix computed from predicted exemplars, and \(\varvec{D}_{\varvec{v}_u}\) the distance matrix computed from real exemplars (which we do not have access to). Table 15 shows that the Pearson correlation coefficient (see footnote 13) between \(\varvec{D}_{\varvec{\psi }(\varvec{a}_u)}\) and \(\varvec{D}_{\varvec{v}_u}\) is much higher than that between \(\varvec{D}_{\varvec{a}_u}\) and \(\varvec{D}_{\varvec{v}_u}\). Importantly, we improve this correlation without access to any data of the unseen classes.

Besides the correlation used in Table 15, we can also use %kNN-overlap, defined in Sect. 4.5.1, as further evidence that predicted exemplars better reflect visual similarities (as defined by real exemplars) than semantic representations do. Recall that %kNN-overlap(A1, A2) is the average (over all unseen classes u) of the percentages of overlap between the two sets of k-nearest neighbors kNN\(_{A1}(u)\) and kNN\(_{A2}(u)\). In Table 16, we report %kNN-overlap (semantic representations, real exemplars) and %kNN-overlap (predicted exemplars, real exemplars). We set k to be 40% of the number of unseen classes, but we note that the trends are consistent for different values of k.
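Both diagnostics are straightforward to compute. The sketch below uses random toy arrays in place of the real semantic representations and exemplars, and averaging row-wise Pearson coefficients is one reading of footnote 13.

```python
# Sketch of both diagnostics with toy arrays standing in for the real data.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import pearsonr

def distance_correlation(D1, D2):
    """Average Pearson correlation between corresponding rows of D1 and D2."""
    return np.mean([pearsonr(r1, r2)[0] for r1, r2 in zip(D1, D2)])

def knn_overlap(D1, D2, k):
    """Average % overlap of the k nearest classes under the two distances."""
    nn1 = np.argsort(D1, axis=1)[:, 1:k + 1]    # column 0 is the class itself
    nn2 = np.argsort(D2, axis=1)[:, 1:k + 1]
    return 100.0 * np.mean([len(set(a) & set(b)) / k for a, b in zip(nn1, nn2)])

A = np.random.rand(10, 85)             # semantic representations of unseen classes
P = np.random.rand(10, 500)            # predicted exemplars psi(a_u)
V = P + 0.1 * np.random.rand(10, 500)  # real exemplars v_u (toy)
D_sem, D_pred, D_real = cdist(A, A), cdist(P, P), cdist(V, V)
print(distance_correlation(D_pred, D_real), knn_overlap(D_pred, D_real, k=4))
```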

We then show t-SNE (Van der Maaten and Hinton 2008) visualizations of the predicted visual exemplars of the unseen classes. Ideally, we would like them to be as close to their corresponding real images as possible. In Fig. 11, we demonstrate that this is indeed the case for many of the unseen classes. For those unseen classes (each denoted by a color), their real images (crosses) and our predicted visual exemplars (circles) are well aligned.

The quality of predicted exemplars (here based on the distance to the real images) depends on two main factors: the predictive capability of semantic representations and the number of semantic representation-visual exemplar pairs available for training, which in this case is equal to the number of seen classes \(\mathsf {S}\). On AwA where we have only 40 training pairs, the predicted exemplars are surprisingly accurate, mostly either placed in their corresponding clusters or at least closer to their clusters than predicted exemplars of the other unseen classes. Thus, we expect them to be useful for discriminating among the unseen classes. On ImageNet, the predicted exemplars are not as accurate as we would have hoped, but this is expected since the word vectors are purely learned from text.

We also observe relatively well-separated clusters in the semantic embedding space (in our case, also the visual feature space since we only apply PCA projections to the visual features), confirming our assumption about the existence of clustering structures. On CUB, we observe that these clusters are more mixed than on other datasets. This is not surprising given that it is a fine-grained classification dataset of bird species.

1.2 E.2 Exemplar Prediction Function

We compare two approaches for predicting visual exemplars: kernel-based support vector regressors (SVR) and a 2-layer multi-layer perceptron (MLP) with ReLU nonlinearity. The MLP weights are \(\ell _2\)-regularized, and we cross-validate the regularization constant.

Similar to Zhang et al. (2017), our multi-layer perceptron is of the form:

$$\begin{aligned} \frac{1}{\mathsf {S}} \sum _{c=1}^{\mathsf {S}} \left\| \varvec{v}_c - \varvec{W}_2 \cdot \text {ReLU}(\varvec{W}_1 \cdot \varvec{a}_c)\right\| _2^2 + \lambda \cdot R(\varvec{W}_1, \varvec{W}_2), \end{aligned}$$
(19)

where R denotes the \(\ell _2\) regularization, \(\mathsf {S}\) is the number of seen classes, \(\varvec{v}_c\) is the visual exemplar of class c, \(\varvec{a}_c\) is the semantic representation of class c, and the weights \(\varvec{W}_1\) and \(\varvec{W}_2\) are parameters to be optimized.

Following Zhang et al. (2017), we randomly initialize the weights \(\varvec{W}_1\) and \(\varvec{W}_2\), and set the number of hidden units for AwA and CUB to 300 and 700, respectively. We use the Adam optimizer with a learning rate of 0.0001 and a minibatch size of \(\mathsf {S}\). We tune \(\lambda \) on the same splits of data as in the other experiments with class-wise CV (Sect. B). Our code is implemented in TensorFlow (Abadi et al. 2016).
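For reference, a minimal tf.keras sketch of this regressor is given below; the toy data, the number of epochs, and the regularization constant are placeholders (\(\lambda \) is tuned by class-wise CV in our experiments).

```python
# Minimal tf.keras sketch of the 2-layer MLP regressor in Eq. (19); the toy
# data, the number of epochs, and the l2 constant below are placeholders.
import numpy as np
import tensorflow as tf

S, sem_dim, exem_dim, hidden = 40, 85, 500, 300           # AwA-like sizes (toy)
A_seen = np.random.rand(S, sem_dim).astype("float32")     # semantic representations a_c
V_seen = np.random.rand(S, exem_dim).astype("float32")    # visual exemplars v_c
A_unseen = np.random.rand(10, sem_dim).astype("float32")  # unseen-class representations

lam = 1e-3  # placeholder for the cross-validated l2 constant
model = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden, activation="relu", use_bias=False,
                          kernel_regularizer=tf.keras.regularizers.l2(lam)),
    tf.keras.layers.Dense(exem_dim, use_bias=False,
                          kernel_regularizer=tf.keras.regularizers.l2(lam)),
])
# Squared-error objective of Eq. (19) (MSE is proportional to it), Adam with
# learning rate 0.0001, and full-batch updates of size S.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss="mse")
model.fit(A_seen, V_seen, batch_size=S, epochs=1000, verbose=0)

predicted_exemplars = model.predict(A_unseen)   # psi(a_c) for the unseen classes
```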

Table 17 shows that SVR performs more robustly than the MLP. One explanation is that the MLP is prone to overfitting due to the small training set size (the number of seen classes) as well as the model selection challenge imposed by ZSL scenarios. SVR also comes with other benefits: it is more efficient and less sensitive to initialization.

1.3 E.3 Effect of PCA

Table 18 investigates the effect of PCA. In general, EXEM (1NN) performs comparably with and without PCA. Moreover, we see that our approach is extremely robust, working reasonably well over a wide range of (large enough) \(\mathsf {d}\) on all datasets. Clearly, a smaller PCA dimension leads to faster computation, as fewer regressors need to be trained.

Cite this article

Changpinyo, S., Chao, WL., Gong, B. et al. Classifier and Exemplar Synthesis for Zero-Shot Learning. Int J Comput Vis 128, 166–201 (2020). https://doi.org/10.1007/s11263-019-01193-1
