Method for improving zero-shot image classification

Abstract: To improve the robustness of the similarity metric used in image classification and to reduce the complexity of the metric function, the Pearson correlation coefficient is introduced into zero-shot image classification. First, a mapping matrix from the visual space to the semantic space is learned from the training dataset and used to align the visual features. Then, in the semantic space, the similarity between features is calculated by the metric function, and the label of an unseen-class sample is predicted by nearest-neighbour search. Experiments show that zero-shot image classification based on the Pearson correlation coefficient outperforms classification based on Euclidean distance and cosine similarity.


Introduction
The similarity metric [1] measures the closeness between individuals and is of great significance to algorithms such as clustering and neighbourhood search. In zero-shot classification [2], the prediction for unseen samples is realised through the similarity metric, which directly affects the classification accuracy of the model.
Zero-shot learning aims to recognise classes that have not been seen in the training dataset, which relieves the dependence on labelled samples, and the area has received increasing attention. Li [3] proposed an end-to-end deep embedding framework for zero-shot learning using multimodal fusion technology. Owing to the hubness problem, projecting from the semantic space to the visual space performs better. To address the domain discrepancy across datasets, the work in [4] presented a zero-shot classification method based on a dual visual-semantic mapping path, which optimised the transfer of the visual-semantic mapping using the intrinsic structure of the semantic manifold. The zero-shot method based on the semantic autoencoder (SAE) proposed by Kodirov [5] alleviated domain drift.
The similarity metric is the basis for many visual tasks, such as target tracking, classification, and registration in robotic applications [6]. Ciobanu [7] proposed a new similarity metric to evaluate the effect of image denoising. A similarity metric was also used to estimate the health quality of the skeleton [8]. Chen [9] proposed an online event detection and tracking method based on similarity learning to address the limitations of semantic representation and the poor performance of event detection and tracking. Gai et al. [10] applied deep nonlinear metric learning to face verification. The work in [11] introduced deep transfer metric learning to solve the problem of cross-domain visual recognition.
The zero-shot image classification methods above aim to improve the mapping from the visual space to the semantic space by adding constraint terms, thereby enhancing recognition accuracy and alleviating the domain drift problem. However, the similarity metric in zero-shot learning has attracted less attention. Models usually adopt Euclidean distance or cosine similarity to calculate the similarity, which limits classification performance. Deep metric learning has good generalisation ability, but it requires a long training time. Taking these factors into account, this paper introduces the Pearson correlation coefficient into zero-shot image classification, improving accuracy while reducing training complexity.

Improvement method of zero-shot image classification algorithm
The overall framework of zero-shot learning is shown in Fig. 1. It contains a visual model, a natural language model, and a multimodal fusion model. The visual model is a trained deep convolutional neural network that extracts the image features. The semantic word vectors corresponding to the class labels are obtained through the natural language model. The visual features are then projected into the semantic space by the multimodal fusion model. Finally, the unseen class is predicted by comparing the similarity between the features.

Feature alignment based on SAE
Multimodal fusion aims to project feature data of different modalities into the same space. A simple linear mapping can easily cause cross-domain drift.
In feature learning, an autoencoder obtains a meaningful feature representation by minimising the reconstruction error of the input data, thereby achieving feature alignment. The traditional autoencoder is trained by unsupervised learning. For zero-shot image classification, constraints between the visual and semantic features are added. The network structure of the SAE is shown in Fig. 2.
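Suppose the input data $X = (x_1, x_2, \ldots, x_N) \in \mathbb{R}^{d \times N}$ consist of $N$ $d$-dimensional feature vectors. The latent features $S = (s_1, s_2, \ldots, s_N) \in \mathbb{R}^{k \times N}$ are obtained by the mapping matrix $W \in \mathbb{R}^{k \times d}$ and are projected back into the original feature space by $W^{*} \in \mathbb{R}^{d \times k}$, producing the reconstructed features $X' = (x_1', x_2', \ldots, x_N') \in \mathbb{R}^{d \times N}$. Therefore, the objective function is defined as
$$\min_{W, W^{*}} \lVert X - W^{*} W X \rVert_F^2 \quad \text{s.t.} \quad WX = S$$
Since $W^{*} = W^{T}$, the objective function is simplified as
$$\min_{W} \lVert X - W^{T} W X \rVert_F^2 \quad \text{s.t.} \quad WX = S$$
As the constraint $WX = S$ is too strong, it is relaxed into a penalty term, and the objective function can be written as
$$\min_{W} \lVert X - W^{T} S \rVert_F^2 + \lambda \lVert WX - S \rVert_F^2$$
where $\lambda$ is the weight coefficient, which determines the relative importance of the two terms. This formulation is a convex quadratic function of $W$ with a global optimal solution. Using the trace property $\mathrm{Tr}(A) = \mathrm{Tr}(A^{T})$, the objective function is rewritten as
$$\min_{W} \lVert X^{T} - S^{T} W \rVert_F^2 + \lambda \lVert WX - S \rVert_F^2$$
Setting the derivative with respect to $W$ to zero gives
$$S S^{T} W + \lambda W X X^{T} = (1 + \lambda) S X^{T}$$
which has the form of a Sylvester equation $AW + WB = C$ with $A = SS^{T}$, $B = \lambda XX^{T}$, and $C = (1 + \lambda) SX^{T}$. Finally, the mapping matrix $W$ is obtained by the Bartels-Stewart algorithm.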
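As an illustration of how the mapping matrix could be computed, the following is a minimal sketch that solves a Sylvester equation of the form A W + W B = C with A = S Sᵀ, B = λ X Xᵀ and C = (1 + λ) S Xᵀ, which is the stationarity condition of the relaxed SAE objective. It assumes NumPy is available. For brevity it solves the equation by Kronecker-product vectorisation, not by the Bartels-Stewart algorithm named in the text, so it is only practical for small d and k. The function name `solve_sae_encoder` and the random toy data are hypothetical.

```python
import numpy as np

def solve_sae_encoder(X, S, lam):
    """Solve S S^T W + lam * W X X^T = (1 + lam) * S X^T for W (k x d).

    X: visual features, shape (d, N); S: semantic features, shape (k, N).
    Assumption: small d and k, so a dense Kronecker system is affordable.
    """
    d, k = X.shape[0], S.shape[0]
    A = S @ S.T                     # k x k
    B = lam * (X @ X.T)             # d x d
    C = (1.0 + lam) * (S @ X.T)     # k x d
    # With column-major vec: vec(A W) = (I_d kron A) vec(W)
    # and vec(W B) = (B^T kron I_k) vec(W).
    K = np.kron(np.eye(d), A) + np.kron(B.T, np.eye(k))
    w = np.linalg.solve(K, C.reshape(-1, order="F"))
    return w.reshape((k, d), order="F")

# Toy data (hypothetical sizes, not the dimensions used in the experiments).
rng = np.random.default_rng(0)
d, k, N, lam = 6, 3, 40, 0.5
X = rng.standard_normal((d, N))
S = rng.standard_normal((k, d)) @ X

W = solve_sae_encoder(X, S, lam)
C = (1.0 + lam) * (S @ X.T)
residual = np.linalg.norm(S @ S.T @ W + lam * (W @ X @ X.T) - C)
print("relative residual:", residual / np.linalg.norm(C))
```

For realistic feature dimensions, a dedicated Sylvester solver such as SciPy's `scipy.linalg.solve_sylvester(A, B, C)` would be the more suitable choice.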

Similarity metric analysis in semantic space
For different features, the chosen similarity metric function directly affects the classification results, and the appropriate choice depends on the target task. Generally, suppose the data points are p = (p 1 , p 2 , …, p k ) and q = (q 1 , q 2 , …, q k ), and the distance function is defined as d(p, q), which needs to satisfy certain rules, including the following: • Triangle inequality: d(p, q) ≤ d(p, r) + d(r, q) for any point r.
The most common way to measure the similarity between numerical points is the Euclidean distance, which is defined as
$$d(p, q) = \sqrt{\sum_{i=1}^{k} (p_i - q_i)^2}$$
The Euclidean distance is intuitive: the smaller the distance, the greater the similarity. However, it does not take the distribution of the data into account. If the amplitude along dimension $i$ is too large, the effect of that dimension is magnified, and the distance cannot accurately express the similarity between related data. The Mahalanobis distance removes the correlation between dimensions and the differences in scale. Suppose $p$ and $q$ follow the same distribution with covariance matrix $\Sigma$. The Mahalanobis distance is then defined as
$$d_M(p, q) = \sqrt{(p - q)^{T} \Sigma^{-1} (p - q)}$$
It avoids interference from correlated variables, but it is affected by the instability of the covariance estimate. Cosine similarity can also be used as a measure of similarity, expressed as
$$\cos(p, q) = \frac{\sum_{i=1}^{k} p_i q_i}{\sqrt{\sum_{i=1}^{k} p_i^2} \, \sqrt{\sum_{i=1}^{k} q_i^2}}$$
Cosine similarity depends only on the direction of the vectors, but it is influenced by a translation of the vectors. The Pearson correlation coefficient is therefore introduced to improve the similarity computation by centring the original data. For each vector, the mean over its elements, $\mu_p$ and $\mu_q$, is calculated, and the coefficient is defined as
$$\rho(p, q) = \frac{\sum_{i=1}^{k} (p_i - \mu_p)(q_i - \mu_q)}{\sqrt{\sum_{i=1}^{k} (p_i - \mu_p)^2} \, \sqrt{\sum_{i=1}^{k} (q_i - \mu_q)^2}}$$
The Pearson correlation coefficient has translation invariance and scale invariance, so it is not affected by the amplitude. Owing to these properties, it is used in this paper to calculate the similarity of the data features. The encoder matrix $W \in \mathbb{R}^{k \times d}$ is learned by the SAE from the training data $X_Y$, and it projects the visual features of the test samples $X_Z = (x_1, x_2, \ldots, x_u) \in \mathbb{R}^{d \times u}$ into the semantic space, giving $S_Z' = \{s_1', s_2', \ldots, s_u'\} \in \mathbb{R}^{k \times u}$. In this space, the Pearson correlation coefficient is used to calculate the similarity with $S$ to predict the label of an unseen sample, which can be defined as
$$j^{*} = \arg\max_{j} \rho(s_i', S_j)$$
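The invariance properties discussed above can be checked numerically. The following is a small sketch in plain Python with hypothetical toy vectors: it computes the Euclidean distance, cosine similarity, and Pearson correlation coefficient between p and q, and again after q has been scaled by a positive factor and shifted by a constant. Only the Pearson coefficient is expected to stay the same.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine(p, q):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def pearson(p, q):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    mu_p = sum(p) / len(p)
    mu_q = sum(q) / len(q)
    return cosine([a - mu_p for a in p], [b - mu_q for b in q])

# Hypothetical toy vectors; q_moved is q scaled by 3 and shifted by 10.
p = [1.0, 2.0, 3.0, 5.0]
q = [2.0, 1.0, 4.0, 6.0]
q_moved = [3.0 * b + 10.0 for b in q]

for name, metric in [("euclidean", euclidean), ("cosine", cosine), ("pearson", pearson)]:
    print(name, metric(p, q), metric(p, q_moved))
```

Because the Pearson coefficient is the cosine similarity of the centred vectors, the shift is removed by the centring and the positive scale factor cancels in the normalisation.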

Zero-shot image classification based on
Here, s i ′ is the visual feature of the ith test sample in the semantic space, and S j denotes the semantic ground truth of the jth class. The label assigned to a test sample is the class j with the maximum similarity.
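The prediction rule described here can be sketched as a nearest-neighbour search under the Pearson correlation coefficient. The example below is a minimal illustration in plain Python. The function names, the class names, and the toy five-dimensional attribute vectors are hypothetical and are not taken from the dataset used in the paper.

```python
import math

def pearson(p, q):
    """Pearson correlation coefficient of two equal-length vectors.

    Assumption: neither vector is constant, otherwise the denominator is zero.
    """
    mu_p = sum(p) / len(p)
    mu_q = sum(q) / len(q)
    num = sum((a - mu_p) * (b - mu_q) for a, b in zip(p, q))
    den_p = math.sqrt(sum((a - mu_p) ** 2 for a in p))
    den_q = math.sqrt(sum((b - mu_q) ** 2 for b in q))
    return num / (den_p * den_q)

def predict_label(projected, prototypes):
    """Return the class whose semantic vector correlates most with `projected`."""
    return max(prototypes, key=lambda label: pearson(projected, prototypes[label]))

# Hypothetical semantic vectors of three unseen classes.
prototypes = {
    "class_a": [1.0, 0.0, 1.0, 1.0, 0.0],
    "class_b": [0.0, 1.0, 0.0, 0.0, 1.0],
    "class_c": [0.0, 0.0, 1.0, 0.0, 1.0],
}

# A projected test feature: roughly 2 * class_a, shifted by a constant offset.
projected = [7.1, 5.2, 6.9, 7.0, 5.1]
predicted = predict_label(projected, prototypes)
print(predicted)
```

The toy feature differs from its class vector in both amplitude and offset, which is the situation in which the Pearson coefficient is expected to be more robust than the Euclidean distance.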

Dataset
In this experiment, the Animals with Attributes 2 dataset [12] is used, which includes 37,322 images in 50 classes. The distribution of the number of samples for each class is shown in Fig. 3. To meet the requirement of zero-shot learning, it is necessary to ensure that the classes of the training set and the test set have no intersection. The training dataset consists of 40 classes, and the remaining ten classes are used as the test dataset. Each class has an 85-dimensional attribute [13] vector, which serves as the knowledge transferred between the seen and unseen classes.
Owing to the generalisation ability of deep convolutional neural networks, the visual features in zero-shot learning are currently extracted by such networks. This experiment uses the 1024-dimensional visual features extracted by GoogLeNet [5].
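The class-disjoint split can be sketched as follows. This is only an illustration in plain Python. It assumes that class IDs run from 1 to 50 and that the ten test classes are those whose IDs are listed with the confusion matrix later in the paper. The helper `split_by_class` and the toy sample list are hypothetical.

```python
# Assumption: the 50 classes are numbered 1..50.
all_classes = list(range(1, 51))
# Class IDs reported for the confusion matrix, taken here as the unseen classes.
unseen_classes = [6, 13, 15, 18, 24, 25, 34, 39, 42, 48]
seen_classes = [c for c in all_classes if c not in set(unseen_classes)]

def split_by_class(samples, seen_ids, unseen_ids):
    """Split (feature, class_id) pairs into a seen-class and an unseen-class set."""
    seen_ids, unseen_ids = set(seen_ids), set(unseen_ids)
    train = [(x, y) for x, y in samples if y in seen_ids]
    test = [(x, y) for x, y in samples if y in unseen_ids]
    return train, test

# Hypothetical samples: (feature placeholder, class_id).
samples = [("img_0", 1), ("img_1", 6), ("img_2", 13), ("img_3", 50)]
train, test = split_by_class(samples, seen_classes, unseen_classes)
print(len(seen_classes), len(unseen_classes), len(train), len(test))
```

Splitting by class ID, and not by image, is what guarantees that no test class appears during training.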

Zero-shot image classification analysis
In the classification task, the accuracy of image classification can be enhanced by using meaningful and representative features. The features of the visual space are projected into the semantic space by the mapping matrix learned by the SAE, so that they are aligned with the features of that space while the original visual information is preserved. A t-SNE [14] visualisation of the visual space and the corresponding feature distribution in the semantic space is shown in Fig. 4.
It can be seen from Fig. 4 that the SAE not only aligns the features but also improves the distribution of the visual features in the semantic space: it shrinks the spread of the features of the same class, which helps to predict the labels of the unseen classes through similarity measures.
The similarity metric is the last step in zero-shot image classification, and it is also an important one. The choice of metric directly affects the classification result. In the semantic space, different metrics are selected in the experiment, and the results are shown in Table 1. Table 1 compares the Euclidean distance, cosine similarity, and Pearson correlation coefficient. It can be concluded that the accuracy of the zero-shot classification model based on the Euclidean distance is significantly lower than that of the other methods, although it is the least time-consuming. Comparing cosine similarity with the Pearson correlation coefficient, the latter is better because it has translation invariance and scale invariance and is therefore not affected by the amplitude. Classification accuracy reflects only the ratio of correctly classified samples to the total number of test samples. It cannot intuitively show which samples are misrecognised. Therefore, the confusion matrix is used to evaluate the models based on the Euclidean distance and the Pearson correlation coefficient, as shown in Fig. 5.
In the confusion matrix, the class IDs are [6,13,15,18,24,25,34,39,42,48], and the correctly classified samples are represented by the boxes on the main diagonal. The darker a box is, the larger the number of correct classifications for that class. From Fig. 5, the Pearson correlation coefficient effectively improves the classification accuracy.
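The relationship between the confusion matrix and the accuracy can be illustrated with a short sketch in plain Python. The labels below are hypothetical toy values restricted to three of the class IDs, not results from the experiments.

```python
def confusion_matrix(y_true, y_pred, class_ids):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(class_ids)}
    matrix = [[0] * len(class_ids) for _ in class_ids]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Ratio of the main-diagonal count to the total number of samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical toy labels for three classes.
class_ids = [6, 13, 15]
y_true = [6, 6, 13, 13, 15, 15]
y_pred = [6, 13, 13, 13, 15, 6]

matrix = confusion_matrix(y_true, y_pred, class_ids)
for row in matrix:
    print(row)
print("accuracy:", accuracy(matrix))
```

The off-diagonal entries show which classes are confused with each other, which is the information that the accuracy alone does not provide.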

Conclusion
This paper studies the similarity metric in zero-shot classification. The Pearson correlation coefficient is introduced to measure the similarity of features, and it is independent of the amplitude of the features. In addition, it can accurately express the similarity of data and is not affected by the correlation between variables. The similarity matrix between the data points better reflects the distribution of the data in the semantic space and is then used to predict the labels of unseen samples. The experimental results show that zero-shot classification based on the Pearson correlation coefficient is clearly superior to that based on the Euclidean distance, that its accuracy is close to that of cosine similarity, and that its computational complexity remains low. However, the Pearson correlation coefficient is a fixed metric. In future work, metric learning methods that adapt to the data will be studied.