A Cross-Modal Alignment for Zero-Shot Image Classification

Different from mainstream classification methods that rely on large amounts of annotated data, we introduce a cross-modal alignment for zero-shot image classification. The key idea is to use text-attribute queries learned from seen classes to guide local feature responses in unseen classes. First, an encoder aligns visual features with their corresponding text attributes. Second, an attention module produces response maps by activating feature maps with the text-attribute queries. Finally, a cosine distance metric measures the matching degree between each text attribute and its corresponding feature response. Experimental results show that the method outperforms existing embedding-based and generative zero-shot learning methods on the CUB-200-2011 dataset.


I. INTRODUCTION
To extract objects or parts of interest from images, most supervised object detection methods achieve good results. When you ask a robot ''what is this?'', the embedded automated Q&A system can answer correctly about 95% of the time. This success rests on training with large amounts of data. However, fine-grained recognition remains difficult when annotated data is limited, and some data, such as medical images, is genuinely hard to obtain. How can the data-deficiency problem be solved? Few-shot learning (FSL) is one method for dealing with limited data. The essence of FSL is learning to learn, and its main difficulty is data imbalance during transfer. Different from FSL, zero-shot learning (ZSL) pays more attention to aligning visual features with their text attributes, as when classifying unseen images in the test data; ZSL is one of the methods proposed to bridge this gap. On the one hand, ZSL learns its parameters from training data, with no overlap between training and testing data. On the other hand, it reasons over the attribute descriptors of different categories to perform classification. Currently, a great number of studies focus on the first case to build a classification model, while the latter leans more on sophisticated Natural Language Processing (NLP). In this paper, we focus on image classification using ZSL, as shown in Figure 1. Per-category text attributes are adopted to assist classification; NLP itself is not our focus. Our contributions are as follows:
1) ZSL typically combines the global appearance with its related text attributes to locate the salient object, which overlooks the unique importance of fine-grained features during attention transfer. In this paper, we propose to map both global and local features to their corresponding class attribute vectors.
2) ZSL methods that regularize the semantic response area using only global features tend to produce multi-regional dispersion. We integrate visual appearance with text attribute vectors to overcome this shortcoming: the text-attribute query interactively guides the feature map to respond on the right channel.
3) Attributes are distributed differently within a class and across classes. We model both the attribute distribution within each class and the distribution of object parts across classes through a cosine distance metric.

II. RELATED WORK
ZSL [1] aims to recognize objects from seen to unseen classes via a shared semantic space, in which both seen and unseen classes have their own semantic descriptors. In other words, class attribute descriptors bridge the training and testing images. Early ZSL [2], [3] learns a mapping between the visual appearance domain and the word attribute/semantic domain. Paper [4] treats the unseen classes as a clustering problem over their attributes. However, these methods usually achieve relatively unsatisfactory results, since they only use global features or shallow models, and features that share a text attribute but belong to different classes are hard to distinguish [5]. This domain shift problem requires extra constraints on attribute allocation. Paper [6] introduces dual visual-semantic mapping paths to recognize zero-shot objects. Recent advances [7], [8], [9], [10], [11] in ZSL mainly focus on learning better visual-semantic embeddings to transfer semantic knowledge from seen to unseen classes. Paper [12] proposes to span the structure space with a set of relations among the attributes; its objective functions preserve these relations in the embedding space, thereby inducing semanticity. But embedding-based methods suffer from over-fitting to the seen classes due to data imbalance [13]. To mitigate this, generative ZSL models learn a semantic/visual mapping that synthesizes visual features of unseen classes for data augmentation [14]. Current generalized zero-shot learning (GZSL) is usually based on prototype learning [15], [16], [17], [18], [19], variational autoencoders (VAEs) [20], or generative adversarial nets (GANs) [2], [21] to predict whether a target image belongs to a seen or an unseen category. However, these methods still yield relatively undesirable results, since they cannot capture the subtle differences between seen and unseen classes. Attention-based ZSL methods therefore use attribute descriptions as guidance, uniting goal-oriented gaze estimation to locate fine-grained visual parts [22]. Although existing methods have achieved great success on GZSL, the original visual feature space lacks discriminating ability and is sub-optimal for GZSL classification. Our work therefore pays more attention to the semantic alignment of local features with their proper text attributes through an encoder, which is a new perspective in ZSL.

III. METHOD

A. PROBLEM DEFINITION
In ZSL, we have two disjoint sets of classes: seen classes with label set $C^S$ and unseen classes with label set $C^U$. The image space of seen and unseen classes is defined as $X = X^S \cup X^U$, and the prior attribute vectors (ground truth) divide as $G = G^S \cup G^U$. The training set consists of labeled images and attributes from seen classes, i.e.,

$$S^t = \{(x^t, c^t, g_{c^t}) \mid x^t \in X^S,\; c^t \in C^S,\; g_{c^t} \in G^S\}.$$
Here, $x^t$ is an image in $X^S$, $c^t$ denotes its class label, which is available during training, and $g_{c^t} \in \mathbb{R}^K$ is the class semantic embedding, i.e., a class-level attribute vector annotated with $K$ different visual attributes. The test set consists of $S^u$, labeled images and attributes from unseen classes, or $S^a$, from both seen and unseen classes, where

$$S^u = \{(x^u, c^u, g_{c^u}) \mid x^u \in X^U,\; c^u \in C^U\}, \qquad S^a = S^u \cup \{(x^s, c^s, g_{c^s}) \mid x^s \in X^S\}.$$

The goal of ZSL is to predict the labels of images from unseen classes, i.e., $X^U \to C^U$; in the more realistic and challenging setting, the aim is to predict images from both seen and unseen classes, i.e., $X^U \to C^S \cup C^U$. In addition, the images from the different sets are disjoint, i.e., $x^t \cap x^u \cap x^a = \emptyset$ and $x^t \cup x^u \cup x^a = X$. The labels of $S^t$ and $S^u$ are also disjoint, i.e., $c^t \cap c^u = \emptyset$ and $c^t \cup c^u \cup c^s = C$.

Our end-to-end trained model consists of three parts, an Encoder Module (EM), an Attention Module (AM), and a Classification Module (CM), as shown in Figure 2. EM encodes the global features of images and the attribute words so that they are aligned in the same dimension, which is convenient for the subsequent modules. AM acquires visual-semantic knowledge of local attributes from the response of the different attribute words over the global feature map, and then embeds the words into the images. CM is a classifier with a cosine metric space for nearest-neighbor search.

B. ENCODER MODULE
To make a connection between visual features and semantic information, we design an encoder to align them.

1) WORD ENCODER
To learn local features, GloVe is used to produce the attribute semantic vectors $W = [w_1, w_2, \cdots, w_K]$, which guide the learning of localized, discriminative attributes. A fully connected layer then transforms the attribute word vectors $W$ into the visual space, i.e., $\mathbb{R}^{K \times n} \to \mathbb{R}^{K \times C}$, where $K$ is the number of attributes, $n$ the word-vector dimension, and $C$ the number of feature channels. This can be represented as

$$Q(w) = W V,$$

where $V \in \mathbb{R}^{n \times C}$ is a learnable transition matrix.
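As a concrete illustration, the following is a minimal PyTorch sketch of the word encoder, assuming pre-extracted GloVe vectors are available as a tensor; the class name `WordEncoder` and the 300-dimensional GloVe size are our own illustrative choices, not fixed by the paper.

```python
import torch
import torch.nn as nn

class WordEncoder(nn.Module):
    """Projects K GloVe attribute vectors (dim n) into the visual
    channel space (dim C) via a learnable transition matrix V."""
    def __init__(self, glove_dim: int, num_channels: int):
        super().__init__()
        # Fully connected layer playing the role of V
        self.V = nn.Linear(glove_dim, num_channels, bias=False)

    def forward(self, W: torch.Tensor) -> torch.Tensor:
        # W: (K, n) attribute word vectors -> Q(w): (K, C)
        return self.V(W)

# Example: 312 CUB attributes, 300-d GloVe vectors, 2048 ResNet channels
W = torch.randn(312, 300)          # placeholder for real GloVe vectors
Q = WordEncoder(300, 2048)(W)      # Q.shape == (312, 2048)
```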

2) VISUAL ENCODER
To extract features, we use ResNet101 to encode the images, i.e., $F(x) \in \mathbb{R}^{H \times W \times C}$, where $H \times W$ is the spatial size (height and width) of the output feature map and $C$ is the number of channels.
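A minimal sketch of the visual encoder using torchvision, under the assumption that ResNet101 is kept up to the last residual block (dropping the average pooling and fully connected head) to obtain the $H \times W \times C$ feature map:

```python
import torch
import torch.nn as nn
from torchvision import models

# Keep everything up to the last residual block; drop avgpool and fc
backbone = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
feature_extractor = nn.Sequential(*list(backbone.children())[:-2]).eval()

with torch.no_grad():
    x = torch.randn(1, 3, 224, 224)    # a dummy input image
    F = feature_extractor(x)           # (1, C=2048, H=7, W=7)
```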

C. ATTENTION MODULE
To learn semantic features, an AM is proposed to parse the object features. As shown in Figure 3, we embed the attribute word vectors $Q(w)$ into the visual features $F(x)$ as a guide for semantic information. After transferring the feature attributes among images, informative parameters are acquired, which is an important step for the subsequent classification. First, the unfolded visual features $F_1(x) \in \mathbb{R}^{C \times HW}$ are obtained by transposing the image features $F(x)$ into $C \times H \times W$ and reshaping them into $C \times HW$. Then the response range and strength of each text attribute vector in the image is computed by matrix multiplication, analyzing how important each attribute is. After transposing and reshaping, the response map $Y(x) \in \mathbb{R}^{K \times H \times W}$ is acquired, such that

$$Y(x) = \mathrm{reshape}\big(Q(w)\, F_1(x)\big), \qquad Q(w) \in \mathbb{R}^{K \times C},\; F_1(x) \in \mathbb{R}^{C \times HW}.$$

The response of an attribute semantic vector is expected to gather in a salient region rather than scatter everywhere. To narrow the response region, we propose an attribute localization loss

$$\mathcal{L}_{loc} = \frac{1}{K} \sum_{k=1}^{K} \sum_{i,j} R_{i,j}\, Y_k(i,j),$$

where $R_{i,j}$ denotes the distance from coordinate $(i, j)$ to the center of the response peak.
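The following sketch shows one plausible implementation of the response-map computation and a distance-weighted localization loss. The exact normalization of the loss is not specified above, so the spatial softmax and peak-centered distance map below are our assumptions:

```python
import torch

def response_map(Q: torch.Tensor, F: torch.Tensor) -> torch.Tensor:
    """Q: (K, C) encoded attribute queries; F: (B, C, H, W) features.
    Returns Y: (B, K, H, W), the response map of each attribute."""
    B, C, H, W = F.shape
    F1 = F.reshape(B, C, H * W)              # unfold to (B, C, HW)
    Y = torch.einsum('kc,bcm->bkm', Q, F1)   # (B, K, HW)
    return Y.reshape(B, -1, H, W)

def localization_loss(Y: torch.Tensor) -> torch.Tensor:
    """Distance-weighted compactness penalty: spatially normalize each
    response map, then penalize mass that lies far from its peak."""
    B, K, H, W = Y.shape
    P = torch.softmax(Y.reshape(B, K, -1), dim=-1).reshape(B, K, H, W)
    peak = P.reshape(B, K, -1).argmax(dim=-1)        # flat peak index
    pi = torch.div(peak, W, rounding_mode='floor')   # (B, K) peak rows
    pj = peak % W                                    # (B, K) peak cols
    ii = torch.arange(H, dtype=torch.float32).view(1, 1, H, 1)
    jj = torch.arange(W, dtype=torch.float32).view(1, 1, 1, W)
    R = ((ii - pi[..., None, None]) ** 2 +
         (jj - pj[..., None, None]) ** 2).sqrt()     # distance to peak
    return (P * R).sum(dim=(-2, -1)).mean()
```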

D. CLASSIFICATION MODULE

1) ATTRIBUTE VECTORS EXTRACTION
In order to filter global noise effectively, max pooling rather than average pooling is used to sample the response map. This can be expressed as

$$\hat{g}_k(x) = \max_{i,j} Y_k(i,j), \qquad k = 1, \ldots, K,$$

where $\hat{g}(x) \in \mathbb{R}^K$ is the predicted attribute vector. The $k$-th dimension of the attribute vector represents the probability that the $k$-th attribute is present in the image.
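Continuing the sketch above, the max-pooled attribute vector is a one-liner; `Y` is the response map from the attention module:

```python
# Y: (B, K, H, W) response maps -> g_hat: (B, K) attribute vectors,
# one max-pooled response per attribute (filters global noise better
# than average pooling, per the text above)
g_hat = Y.amax(dim=(-2, -1))
```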

2) COSINE SIMILARITY SCORE
Different from previous methods that rely on Mean Squared Error (MSE) or similar distance formulas for logistic regression on the attribute vector, we adopt cosine similarity, which avoids learning from images lacking counter-examples. This reduces the variance of the output and produces a model with better generalization ability. We define the similarity score as

$$score_c = \frac{\hat{g}(x) \cdot g(c)}{\|\hat{g}(x)\|\, \|g(c)\|},$$

where $g(c)$ is the ground-truth attribute vector of the $c$-th class and $score_c \in \mathbb{R}^B$ is the cosine similarity score for class $c$ over a batch of size $B$.
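A direct rendering of the cosine score in PyTorch; the function name and batch layout are illustrative:

```python
import torch
import torch.nn.functional as F_nn

def cosine_scores(g_hat: torch.Tensor, G: torch.Tensor) -> torch.Tensor:
    """g_hat: (B, K) predicted attribute vectors for a batch of B images;
    G: (num_classes, K) ground-truth class attribute vectors.
    Returns (B, num_classes) cosine similarity scores."""
    return F_nn.normalize(g_hat, dim=-1) @ F_nn.normalize(G, dim=-1).T
```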

3) CLASSIFICATION LOSS
The softmax function is used to classify the unseen images. The classification loss is defined as

$$\mathcal{L}_{cls} = -\log \frac{\exp(\sigma \cdot score_{c^*})}{\sum_{c} \exp(\sigma \cdot score_c)},$$

where $\sigma$ is a scaling factor and $c^*$ is the ground-truth class. While reducing within-class variance, the results show that the cosine similarity improves accuracy on unseen classes.
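This loss reduces to a cross-entropy over $\sigma$-scaled cosine scores; the value $\sigma = 20$ below is purely illustrative, since the paper does not report it:

```python
import torch
import torch.nn.functional as F_nn

def classification_loss(scores: torch.Tensor, labels: torch.Tensor,
                        sigma: float = 20.0) -> torch.Tensor:
    """Softmax cross-entropy over sigma-scaled cosine scores.
    scores: (B, num_classes); labels: (B,) ground-truth class indices.
    sigma = 20.0 is an illustrative value, not taken from the paper."""
    return F_nn.cross_entropy(sigma * scores, labels)
```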

E. ZERO-SHOT LEARNING TEST
After training the model, the learned cosine metric space is used for zero-shot recognition. A test image $x$ is embedded into the attribute space by the visual-semantic embedding, and the cosine-similarity classifier searches for the highest score via

$$c^* = \arg\max_{c \in C^U} score_c.$$

In GZSL the test images may belong to both seen and unseen classes, and the results have a large bias towards seen classes since only seen classes are used during training. To fix this problem, we employ a calibration factor $\gamma$ to constrain the seen-class scores. Specifically, the revised classifier is formulated as

$$c^* = \arg\max_{c \in C^S \cup C^U} \big(score_c - \gamma\, \mathbb{1}[c \in C^S]\big),$$

where $\gamma$ denotes the bias applied to seen-class scores, and $\mathbb{1}[c \in C^S] = 1$ if $c \in C^S$ and 0 otherwise.
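A sketch of the calibrated GZSL classifier; `seen_mask` encodes the indicator $\mathbb{1}[c \in C^S]$, and $\gamma = 0.7$ follows the implementation details reported below:

```python
import torch

def gzsl_predict(scores: torch.Tensor, seen_mask: torch.Tensor,
                 gamma: float = 0.7) -> torch.Tensor:
    """scores: (B, num_classes) cosine scores over all classes;
    seen_mask: (num_classes,) bool, True where c is a seen class.
    Subtracts the calibration factor gamma from seen-class scores
    (the indicator term) before taking the argmax."""
    calibrated = scores - gamma * seen_mask.float()
    return calibrated.argmax(dim=-1)
```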

IV. EXPERIMENTS

A. DATASET
We conduct our experiments on a public ZSL benchmark dataset: CUB-200-2011. CUB includes 11,788 images of 200 bird classes (seen/unseen classes = 180/20) with 312 attributes. We randomly select half of the images in each seen class as the training set; the rest are used as the test set for the classification task. Moreover, we adopt the Proposed Split (PS) to divide the classes into seen and unseen.

B. EVALUATION PROTOCOLS
Under the conventional ZSL scenario, we only evaluate the per-class Top-1 accuracy on unseen classes. In GZSL, since the test set contains both seen and unseen images, we evaluate the Top-1 accuracy on seen and unseen classes separately, denoted $R_S$ and $R_U$. The overall GZSL performance is their harmonic mean:

$$R_H = \frac{2 \times R_S \times R_U}{R_S + R_U}.$$

C. IMPLEMENTATION DETAILS
We implement our method in PyTorch. ResNet101 pretrained on ImageNet is taken as the backbone to extract features for each image, without fine-tuning. The SGD optimizer is used with momentum 0.9 and weight decay $10^{-5}$. The learning rate and batch size are set to 0.001 and 300, respectively. The calibration factor $\gamma$ is set to 0.7.
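For reference, these hyperparameters translate directly into a PyTorch optimizer; the `model` below is a stand-in, since the full EM/AM/CM model is defined elsewhere:

```python
import torch
import torch.nn as nn

model = nn.Linear(2048, 312)   # stand-in; the real model is EM + AM + CM
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=1e-5)
```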

D. COMPARISON WITH STATE-OF-THE-ART METHODS

1) CONVENTIONAL ZERO-SHOT LEARNING
We first compare our cross-modal alignment ZSL with state-of-the-art methods, i.e., generative methods, common space learning, and embedding-based methods. Table 1 presents the ZSL results on the CUB dataset. Our cross-modal alignment achieves the best accuracy on CUB under the conventional ZSL scenario, which shows that the cross-modal alignment extracts local fine-grained features effectively. Table 1 also compares the different methods under GZSL: most state-of-the-art methods achieve good results on seen classes but fail on unseen classes. Our cross-modal method achieves a sub-optimal $R_U$, slightly lower than APN, but obtains the best harmonic mean $R_H$. These results stem from the fine granularity of the semantic refinement in the cross-modal alignment, which enables effective knowledge transfer from seen to unseen classes.

E. DISTRIBUTION OF ATTRIBUTES
To intuitively show the effectiveness of our AM, we compare GEM-ZSL and our cross-modal learning under the same experimental conditions. As shown in Table 2, our model fits the target class by learning the distribution of the various attributes among the classes. When the distance loss function is added on top of the cross-modal alignment method, the prediction of unseen classes becomes worse due to over-fitting.

F. ABLATION STUDY

1) CLASS PEAK RESPONSE ANALYSIS
To verify the influence of distance, we add the distance loss function to the cross-modal alignment method. As shown in Table 3, the recognition accuracy on the seen classes is similar, but the results on the unseen classes differ considerably. Since the distance loss limits the range of the response, the learning of local attributes is narrowed; the range of the peak response shrinks, which reduces the scalability of the model.

2) EFFECT OF THE TRAINING METHOD
An episode-based training method is employed to make the model converge faster. We sample M categories and N images per category in each mini-batch, then vary M and N to observe how the results change at different values; a minimal sampler sketch follows below. To further analyze the performance of the episode-based training method, we compare it against a random-sampling training method with a mini-batch size of 64. Table 4 shows that the model achieves the best accuracy when M = 4 and N = 2. In fact, the accuracy correlates positively with the number of categories and images.
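A minimal sketch of the episode-based sampler described above; the function name and the toy label list are illustrative:

```python
import random
from collections import defaultdict

def sample_episode(labels, M=4, N=2):
    """Episode-based mini-batch: pick M classes, then N image indices
    per class (M=4, N=2 gave the best accuracy in Table 4)."""
    by_class = defaultdict(list)
    for idx, c in enumerate(labels):
        by_class[c].append(idx)
    classes = random.sample(sorted(by_class), M)     # M distinct classes
    batch = []
    for c in classes:
        batch += random.sample(by_class[c], N)       # N images per class
    return batch

# Example: labels of a toy training set with 5 classes of 3 images each
batch = sample_episode([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
```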

G. VISUALIZATION OF LOCAL RESPONSE MAP

1) LOCALIZATION OF BIRDS UNDER SELF-SUPERVISION
As shown in Figure 4, we randomly sample bird images from the CUB dataset. The response area may not be contiguous, since we use the attribute vector as an auxiliary guide to detect the object region in a self-supervised manner; the map is a superposition of the various attributes embedded in highly recognizable areas. Thus, the stronger the response in the heat map, the more distinguishable the part, such as the head or wings.

2) FINE-GRAINED PERFORMANCE
We select a set of response heatmaps for the class American Three-Toed Woodpecker to analyze the fine-grained performance of our AM. For text attributes that are actually present in the image, the response area is expected to focus on the corresponding visual semantic area, and vice versa. As shown in Figure 5, the attribute word guides the matching of the semantics represented by the peak region of the heatmap, where the concentration is significant. Thus, the cross-modal alignment method is relatively successful at extracting fine-grained semantic knowledge from the image.

V. CONCLUSION
In this paper, we argue that semantic matching is of great importance in zero-shot image classification. We proposed a new cross-modal learning framework that guides the local feature response according to text-attribute queries in the encoder module. In addition, the cosine metric score is verified to match our encoder model efficiently in semantic alignment. Compared with existing ZSL methods, the experimental results show that our cross-modal alignment for zero-shot image classification performs comparably to, and even better than, state-of-the-art methods in small-object classification. In the future, we will extend our cross-modal alignment method to more datasets with sparse features, as well as to small datasets. Furthermore, we will continue to explore the constraints that semantic alignment learning places on the shared space between visual appearance and text attributes.

ZHIHAO ZHAO received the bachelor's degree in electronic information science and technology from the Wuhan University of Technology, in 2017, where he is currently pursuing the master's degree in information and communication engineering with the School of Information. His research interests include few-shot learning and image processing.