Zero-Shot Learning (ZSL) is a technique that transfers knowledge from seen classes to unseen classes by establishing cross-modal mapping relationships. However, traditional ZSL methods rely heavily on large amounts of expensive labeled data, which may not be readily available in practical applications. In such label-scarce settings, the lack of effective supervised information during the transfer from seen classes can lead to a 'negative causality' problem between different modalities. Therefore, we propose an unsupervised counterfactual approach to address this problem.
In this paper, we propose an unsupervised learning model that uses a Counterfactual Causal Inference framework for Cross-modal Mapping Relationship Adjustment (CMRA). Specifically, we regard images as the cause and Wikipedia text as the effect, forming a causal graph. First, the model uses multiple-attribute attention to learn the visual semantic attributes of images and align them with the corresponding Wikipedia text description words, forming a cross-modal alignment. Then, we combine contrastive learning with a stop-gradient technique to build a novel cross-modal mapping relationship. Finally, we assess the consistency of the multiple-attribute attention over the distribution of visual semantic attributes before and after image transformation. To handle inconsistencies, we apply a deactivation strategy that removes the attention of visual semantic attributes showing noticeable distribution gaps across stages, along with the mapping relationships to their corresponding Wikipedia text description words. We evaluate classification accuracy on the AWA, CUB, APY, and SUN datasets. The experimental results show that our algorithm outperforms state-of-the-art approaches.
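The deactivation step can be sketched in a few lines. The sketch below is a minimal illustration under stated assumptions: the gap measure (total-variation distance between per-attribute attention distributions) and the threshold value are illustrative choices, not the paper's exact formulation, and the function name is hypothetical.

```python
import numpy as np

def deactivate_inconsistent_attributes(attn_before, attn_after, threshold=0.3):
    """Hypothetical sketch of the deactivation strategy.

    attn_before, attn_after: (num_attributes, num_regions) attention weights
    over image regions, measured before and after an image transformation.
    Attributes whose attention distribution shifts too much are deactivated
    (their cross-modal mappings to text words would be removed as well).
    Returns a boolean mask of attributes to keep.
    """
    # Normalize each attribute's attention into a probability distribution.
    p = attn_before / attn_before.sum(axis=1, keepdims=True)
    q = attn_after / attn_after.sum(axis=1, keepdims=True)
    # Total-variation distance per attribute as the distribution gap.
    gap = 0.5 * np.abs(p - q).sum(axis=1)
    return gap <= threshold

# Toy example: 3 attributes attending over 4 image regions.
before = np.array([[0.40, 0.30, 0.20, 0.10],
                   [0.25, 0.25, 0.25, 0.25],
                   [0.70, 0.10, 0.10, 0.10]])
after = np.array([[0.38, 0.32, 0.20, 0.10],   # small gap -> keep
                  [0.25, 0.25, 0.25, 0.25],   # unchanged -> keep
                  [0.10, 0.10, 0.10, 0.70]])  # large gap -> deactivate
keep = deactivate_inconsistent_attributes(before, after)
```

Here `keep` is `[True, True, False]`: the third attribute's attention relocates almost entirely after the transformation, so it and its associated text-word mapping would be dropped from the cross-modal alignment.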