Cross-Modal Search for Social Networks via Adversarial Learning

Cross-modal search has become a research hotspot in recent years. In contrast to traditional cross-modal search, cross-modal information search on social networks is restricted by data quality, namely arbitrary text and low-resolution visual features. In addition, the semantic sparseness of cross-modal data from social networks causes the text and visual modalities to mislead each other. In this paper, we propose a cross-modal search method for social network data that capitalizes on adversarial learning (cross-modal search with adversarial learning: CMSAL). We adopt self-attention-based neural networks to generate modality-oriented representations for further intermodal correlation learning. A search module is implemented based on adversarial learning, through which the discriminator is designed to measure the distribution of generated features from intermodal and intramodal perspectives. Experiments on real-world datasets from Sina Weibo and Wikipedia, which have similar properties to social networks, show that the proposed method outperforms state-of-the-art cross-modal search methods.


Introduction
With the rapid development of mobile networks and "we media" [1], cross-modal information search [2] has become a research hotspot. Users publish multimedia information on social network platforms such as Weibo and Twitter, where public opinion is expressed through natural language and visual information. Cross-modal information search meets users' needs for data diversity, especially on social networks. Various types of topics (e.g., news, tips, and stories) occur in multimedia forms on social networks, conveying valuable information for various users, including common people, companies, and regulators. The most direct way to fulfill users' diversified information needs is to maximally mine the resemblance and correlations of the information and present the content relevant to users' queries [3,4]. However, cross-modal correlation analysis faces the basic challenge of bridging the heterogeneity gap [5,6] between different media, which is also a key issue for cross-modal search.
Bridging the heterogeneity gap in multimodal data, which feature different statistical characteristics, is the major issue in analyzing and processing multimodal datasets with intelligent technologies [7]. In general, some current research addresses the problem by constructing multiple nonlinear transformations [8] to build a common semantic subspace for multimodal data through deep learning [9]. Within the subspace, the nonlinear transformations are learned to generate feature representations for correlation maximization [10]. The representative classical methods are canonical correlation analysis (CCA) [11] and variants such as deep CCA (DCCA) [12]. With the development of tabular learning and deep learning research, such methods have gradually been divided into two groups: real-valued representations and binary-valued representations [13]. Other works focus on selecting relevant features that are then used to construct correlations from multimodal features, achieving cross-modal search through feature selection and matching [14,15]. The methods following this strategy are designed to discover dense feature clusters with high similarity learned by algorithms for cross-modal data [16].
In addition, the semantic sparseness of cross-modal data from social networks results in misleading content in both the textual and visual modalities. Cross-modal data on social networks present characteristics that reflect many aspects of real-world events in quality-restricted forms [14]. The massive quantity of cross-modal data on social networks provides an opportunity to uncover relations between events and discover additional content related to the target event in a variety of media. The forms and characteristics of social network cross-modal data require many feature details, such as local correlations, to be mined and learned by intelligent algorithms. To overcome the semantic sparseness of cross-modal data from social networks, we adopt self-attention [17] to discover the differential importance of local semantic features according to the target topic throughout the global representation sensors. Self-attention can be used to assign weight values to different items in feature sequences to perceive significance. Li et al. [18] proposed a positional self-attention with co-attention (PSAC) architecture to capture long-range dependencies and position information.
Through the application of self-attention to perceive significance, PSAC significantly outperforms its predecessor. Gao et al. [19] presented hierarchical LSTMs with an adaptive attention method to perceive the spatial-temporal attention for visual regions or frames to predict related words. This method with adaptive attention outperforms the previous state-of-the-art methods.
In this paper, we propose a cross-modal search method for social network data that capitalizes on adversarial learning. In addition, we adopt self-attention-based neural networks to generate modality-oriented representations for further intermodal correlation learning. A search module is implemented based on adversarial learning, through which the discriminator is designed to measure the distribution of generated features from intermodal and intramodal perspectives. The discriminator is deployed as a compound neural network whose parameters are optimized under joint losses following the adversarial learning mechanism to generate the most appropriate representations of cross-modal data features. The contributions of the paper are summarized as follows.
(i) We propose a supervised cross-modal adversarial learning method integrated with self-attention. Under the self-attention mechanism, the method generates cross-modal representations that follow the original modality and topic label distributions, taking the characteristics of social network data into account.
(ii) The proposed method incorporates local semantic features, distributed as word groups in texts and blocks in images, to maximize the cross-modal correlations based on adversarial learning.
(iii) Part of the adversarial learning component in the designed adversarial learning framework is used effectively to rank the search results.
The unstandardized writing conventions of user-generated text and the frequently low quality of images submitted by users on social networks result in semantic sparseness. Semantic sparseness is the main obstacle to cross-modal information search in social networks based on global semantic features. Our proposed method, cross-modal search with adversarial learning (CMSAL), integrates self-attention to explore local semantic features expressing key semantic features of the target topics. Words (in text) and pixel blocks (in images) conveying target topics are the local semantic features to be explored and mined. The generated representations integrated with the local semantic features constitute the semantic space for social network cross-modal information search. The designed minimax losses are optimized based on adversarial learning to promote the efficiency of the generated representations for cross-modal search. The learning method is trained iteratively with the representation-generating process from intramodal and intermodal perspectives. In classical generative adversarial networks (GANs) [20], the optimal discriminator is of no further use in most cases [21]. We reuse the optimal intermodal and intramodal restriction to provide ranked search results based on distribution measures. In contrast to the existing methods, this paper takes the semantic sparseness of social network content into consideration for the specific task of cross-modal information search.

Social Network Cross-Modal Search.
With the development of information and mobile networks, social network platforms are becoming the most important source for multimedia data [22]. Cross-modal search strategies on social networks can be classified into two main groups: common semantic subspace learning and feature selection and matching. For multimodal data from social networks conveying more information [23], intelligent technologies are needed to excavate latent correlations within massive and complex cross-modal datasets from social networks. Cai et al. [24] proposed a joint topic model to track and search target social information based on cross-modal feature sequence analysis and learning. Fang et al. [25] proposed a data transformation method to handle heterogeneous data for cross-modal event analysis and searches in social networks. Qing et al. [26] proposed an event and content search method based on automatic identification and tracking from a large amount of cross-modal data from social networks. Lee et al. [27] provided a common search framework for online social network hotspot events. The method normalizes the data content of different media based on a graph-based algorithm that combines sorted event lists for content normalization. It unifies the stream-based media data and the registration-based cross-media data, which realizes the cross-media search for the target event. Zhang et al. [28] studied the hierarchical information quad-tree index structure based on spatiotemporal characteristics, including temporal proximity, spatial proximity, and visual relevance. The method is also used to solve cross-modal search problems in social networks. Deng et al. [29] proposed a deep hash network based on triplets for cross-modal retrieval of social networks. The method uses a triplet label, which describes the relative relationship among three instances, as supervision to capture a more general semantic correlation between cross-modal instances.
Social network cross-modal search is related to traditional cross-modal search in multimedia representation extraction and correlation analysis. Furthermore, methods for cross-modal content from social networks need to pay attention to global and local semantic associations under semantic sparseness, which is determined by the characteristics of social network data. The emergence of GANs [20] provides a series of methods for semantic extraction and representation under sparse semantic conditions that are gradually being applied to the field of cross-media search.

Adversarial Learning Cross-Modal Search.
Recently, GANs [20] have been widely used because of their ability to learn and process visual and sequenced features. A series of approaches have been proposed to reduce the gap between different modalities based on adversarial learning of the statistical characteristics of the transformed features. Following this strategy, He et al. [30] introduced a cross-modal retrieval method based on unsupervised adversarial learning. The method constructed an adversarial learning feature transformation for the statistical properties on cross-modal search. Peng et al. [5] proposed a method for common cross-modal representation based on GAN. Through well-learned cross-modality representations, many applications such as cross-modal similarity matching can be conducted. Gu et al. [4] provided a GAN-based method incorporating generative models into cross-modality embedding for cross-modal search. The method encouraged the textual features as the basis to generate an image similar to the ground truth, and vice versa for images to texts. Shang et al. [31] proposed a dictionary learning-based cross-modal search method. The method used a learned dictionary as a feature reconstructor, cooperating with adversarial learning to mine cross-modality statistical characteristics. Wen et al. [32] proposed a cross-modal search method based on similarity transferring. The method uses adversarial learning to build a semantic structure in the common representation subspace for preserving the semantic structure between unpaired items across different modalities. Wang et al. [33] proposed an adversarial learning retrieval method that imposed triplet constraints for feature generation to minimize the heterogeneous gap of cross-modal data with the same semantic labels. The greatest advantage of adversarial learning is cross-modal synthesis. Gao et al. [34] presented a method named the perceptual pyramid adversarial network (PPAN) to synthesize photorealistic images and texts based on adversarial learning. The method is composed of a generator optimized with perceptual loss to obtain diverse images and a discriminator for multiple purposes, such as semantic consistency, image fidelity, and class invariance.
For other strategies, deep quantization and deep hashing based on adversarial learning are also used for cross-modal search. Yang et al. [35] proposed a method known as shared predictive deep quantization (SPDQ). In this method, a shared semantic subspace is defined for cross-modal features. The method builds a joint deep network architecture to exploit compact cross-modal representations. The method preserves intramodal and intermodal similarities in an efficient way. Deep hashing also follows the strategy to learn compact binary code for cross-modal similarity computation efficiency. Li et al. [36] presented a self-supervised adversarial hashing (SSAH) method. The method learns the high-dimensional features and hash codes for cross-modality information through two adversarial networks. The search similarity is maximized according to the semantic relevance in a highly computationally efficient manner.
In contrast to traditional methods of latent semantic subspace learning [37], cross-modal search based on GAN or adversarial learning takes advantage of the capacity for feature distribution construction and discrimination learning [33].
There are also many methods that adopt adversarial learning for hashing to realize cross-modal search [38,39]. These methods convert the matching problem in cross-modal search to the Hamming distance calculation based on the multimedia effective binary representation. Such a calculation strategy improves the matching efficiency of cross-modal search. However, in the construction of binary representations, some semantic features of the original multimedia are lost. The proposed method in this paper focuses on local semantic feature extraction based on self-attention [17] and adversarial learning [20] to solve the problem of minimizing the heterogeneity gap for cross-modal data with the same semantic labels.

Problem Definition.
In general, we define cross-modal data as $P = \{C_1, C_2, \ldots, C_D\}$ with topics $C_d$, $1 \le d \le D$, meaning that there are D topics in total in the data domain. For each topic, related contents are expressed in the form of text and images. In each topic, there are M text instances and N image instances conveying the semantic information related to $C_d$, labeled by $l_d$. The special cases $(M \ge 1, N = 0)$ and $(M = 0, N \ge 1)$ degenerate into the unimodal problem. Another case is $(M = 1, N = 1)$, which agrees with most definitions in current works.
Raw text and images are preprocessed into representation features by word embedding [40] and VGGNet [41], according to the modality. The representation features for texts and images are interfaces for further complex computing in the learning procedure. For further correlation maximization learning, the representation features are explored to extract local features that are sensitive to modality characteristics. The features convey the same semantics in word groups and image blocks, represented as $b_t^{d,k}$ for text features with the parameters $\theta_t$ and as $b_v^{d,k}$ for image features with the parameters $\theta_v$. $S_t^d$ and $S_v^d$ are the generation processes that interact with the discriminator to optimize the parameters jointly by adversarial learning. A restriction is designed to measure the distribution of $S_t^d$ and $S_v^d$ from intramodal and intermodal aspects to guide the generation. $S_t^d$ and $S_v^d$ output more appropriate representation features episode by episode. The general framework of the proposed method is illustrated in Figure 1.

Constructions of Cross-Modal Representation Feature Generation.
Cross-modal representation feature generation is conducted to explore the local semantic relationships between features from different modalities and to reconstruct the representations so that they reflect these relationships in computational matrices. The procedure is designed under a supervised representation learning mechanism in which self-attention is adopted. Taking the text modality as an example, $f_t$, $g_t$, and $h_t$ are the functions that transform the original features (word features for text in fixed-size blocks) into a subspace, where $b_t^{d,k}$ denotes the word embedding feature of the k-th text block of a text document on topic d; the image-modality counterparts are parameterized by $w_v$. The original features of the two modalities are cut into fixed-size blocks. In general, we cut the original feature into K blocks. The blocks of original text features are composed of word vectors, while the blocks of original image features cover the CNN features of pixels. The attention between the i-th and the j-th blocks is denoted $\beta_t^{d,i,j}$, which indicates the attention the model pays to the j-th feature block when generating the representation features of the i-th block in the word embedding feature of the corresponding text on topic d. Similarly, for the image modality, $\beta_v^{d,i,j}$ is used for images in CNN feature blocks. The representation features of the i-th block of a specific piece of text are obtained from these attention values, and the representation features of a whole text about topic d are denoted $S_t^d$. The value of K is determined by experience and data context. In the experiment, we set the value of K according to the corresponding original cross-modality features. The value of K also determines the sizes of the parameters w. However, it has little impact on the actual representations produced by cross-modal representation feature generation.
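The block-wise attention described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes that f, g, and h are plain linear maps (the weight matrices `w_f`, `w_g`, `w_h` are hypothetical names), that the attention weights are a softmax over dot products of transformed blocks, and that each output block is the attention-weighted sum of the h-transformed blocks.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Apply the linear map w (one row per output dimension) to x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def block_self_attention(
    blocks: List[Vector], w_f: Matrix, w_g: Matrix, w_h: Matrix
) -> Tuple[List[Vector], List[Vector]]:
    """Return (outputs, beta) for K feature blocks.

    beta[i][j] is the weight given to block j when building the
    representation of block i (assumed softmax of f(b_i) . g(b_j)).
    """
    f = [matvec(w_f, b) for b in blocks]
    g = [matvec(w_g, b) for b in blocks]
    h = [matvec(w_h, b) for b in blocks]
    k = len(blocks)
    outputs, beta = [], []
    for i in range(k):
        scores = [sum(a * c for a, c in zip(f[i], g[j])) for j in range(k)]
        beta_i = softmax(scores)
        beta.append(beta_i)
        # Attention-weighted sum of the h-transformed blocks.
        outputs.append(
            [sum(beta_i[j] * h[j][d] for j in range(k)) for d in range(len(h[0]))]
        )
    return outputs, beta
```

With K blocks as input, the function returns K re-weighted blocks whose concatenation would play the role of the whole-document representation; the choice of K only changes the number of rows in `beta`.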

Learning Metric for the Proposed Method.
In this section, we propose the generation and discrimination losses used to train the proposed CMSAL. The generation loss guides the generation of representation features and consists of a label loss and a similarity loss. The label loss aims to minimize the distribution difference between the representation features and the corresponding topic semantic labels. The similarity loss is used to minimize the distance among intermodal samples about the same topic. These two loss terms together define the generation loss that guides the representation feature generation procedure. The discrimination loss is defined to distinguish modalities. The multiple losses are combined into a minimax loss to optimize the generation of representation features suited to cross-modal search.

The Generation Loss.
The generation loss is decomposed into two loss terms: the label loss and the similarity loss. The label loss ensures that the distributions of the generated representation features follow those of the semantic topics. In the label loss, equation (4), $y_t^i$ and $y_v^j$ are the topic labels for the corresponding features in the form of one-hot vectors, and a prediction function outputs the topic probability distribution for each text or image item of the representation features. M and N are the numbers of original features for text and images, respectively. As described in Section 3.1, we construct the collection with $M = N$ for clarity of expression. Therefore, equation (4) can be further simplified into equation (5). The label loss guides the training of the parameters $\theta_t$ and $\theta_v$ so that the generated representation features follow the topic distribution of the corresponding samples. The label loss is the intramodal loss used to maintain the intramodal data correlations. Based on the premise of $M = N$, the similarity loss is defined in equation (6). The similarity loss acts as the intermodal loss: it maximizes correlations between cross-modal samples with the same topic distribution by narrowing the distance between their representation features and topic labels. The losses presented in equations (5) and (6) form the basis for guiding representation feature generation by supervised learning, through which the parameters of the networks are adjusted. As parts of the generation loss, the label loss and the similarity loss are integrated by weighted summation, presented as equation (9). Here, α and β are the contribution weights of the label loss and the similarity loss, respectively; the optimization of the generation loss is directly affected by these two empirical values.
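As a hedged sketch of how the weighted generation loss could be computed for paired samples (M = N), the snippet below assumes that the label loss is a cross-entropy between one-hot topic labels and predicted topic probabilities, and that the similarity loss is the mean Euclidean distance between paired text and image representation features. Both choices, and all function names, are assumptions for illustration; the defaults alpha = 1 and beta = 0.1 mirror the values reported later for image search with text input.

```python
import math
from typing import Sequence


def cross_entropy(one_hot: Sequence[float], probs: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Cross-entropy between a one-hot label and a probability vector."""
    return -sum(y * math.log(p + eps) for y, p in zip(one_hot, probs))


def label_loss(labels, text_probs, image_probs) -> float:
    """Intramodal term: mean cross-entropy over both modalities."""
    n = len(labels)
    total = sum(
        cross_entropy(y, pt) + cross_entropy(y, pv)
        for y, pt, pv in zip(labels, text_probs, image_probs)
    )
    return total / n


def similarity_loss(text_feats, image_feats) -> float:
    """Intermodal term: mean Euclidean distance between paired features."""
    n = len(text_feats)
    return sum(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(s_t, s_v)))
        for s_t, s_v in zip(text_feats, image_feats)
    ) / n


def generation_loss(labels, text_probs, image_probs, text_feats, image_feats,
                    alpha: float = 1.0, beta: float = 0.1) -> float:
    """Weighted sum of the label loss (alpha) and similarity loss (beta)."""
    return (alpha * label_loss(labels, text_probs, image_probs)
            + beta * similarity_loss(text_feats, image_feats))
```

Raising beta pulls paired text and image features closer together at the expense of topic-label fit, which is the trade-off explored in the parameter study.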

The Discrimination Loss.
The discriminator is the key component for realizing cross-modal adversarial learning. It aims to discriminate the modalities of the constructed representation features about the same topic. In the discrimination loss, $m_i$ is the modality label as a one-hot vector, and p maps the generated representation features into the modality discrimination space under the parameters $\theta_p$. Different from the generation loss, the discrimination loss promotes representation feature generation indirectly: through parameter optimization and adversarial learning against the discriminator, the generator outputs more appropriate representation features.
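One plausible form of the discrimination loss, given here as a hedged sketch rather than the paper's own equation, is the cross-entropy of a modality classifier over the generated features. The symbols $L_{dis}$, $n$, and $s_i$ are assumptions introduced for illustration, while $m_i$, $p$, and $\theta_p$ follow the description above. The sign with which this term enters the minimax game depends on convention.

```latex
L_{dis}(\theta_p) = -\frac{1}{n} \sum_{i=1}^{n} m_i \cdot \log p(s_i; \theta_p),
\qquad s_i \in \{ S_t^{d} \} \cup \{ S_v^{d} \}
```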

The Adversarial Training Procedure.
To maximize the correlation of cross-modal representation features for the same topic distribution, cross-modal representation feature generation and intermodal discrimination interact through adversarial learning. We construct a minimax game [20] in which $\theta_t$, $\theta_v$, and $\theta_p$ are the optimized values for the joint losses. The minimax game minimizes the generation loss and maximizes the discrimination loss. The generation loss drives the construction of cross-modal representation features that maximize relationships for the same semantic topic distribution. The discrimination loss distinguishes modality discrepancies. The parameters $\theta_p$ are fixed while $\theta_t$ and $\theta_v$ are optimized during the minimization procedure, and $\theta_t$ and $\theta_v$ are fixed while $\theta_p$ is optimized during the maximization procedure.
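The alternating schedule, in which one side is frozen while the other is updated, can be illustrated on a toy problem. The sketch below is not the CMSAL objective: it assumes a scalar saddle function f(x, y) = x^2 - y^2 + x*y, where x stands in for the generator parameters and y for the discriminator parameters, and it uses plain gradient steps with a hypothetical learning rate.

```python
from typing import Tuple


def alternating_minimax(x: float, y: float, lr: float = 0.1,
                        steps: int = 200) -> Tuple[float, float]:
    """Alternate descent on x and ascent on y for f = x**2 - y**2 + x*y."""
    for _ in range(steps):
        # Minimization step: y is held fixed, x follows -df/dx = -(2x + y).
        x = x - lr * (2 * x + y)
        # Maximization step: x is held fixed, y follows +df/dy = x - 2y.
        y = y + lr * (x - 2 * y)
    return x, y
```

On this toy function both variables converge toward the saddle point at the origin; in the real setting each step would be a gradient update of the corresponding network under the joint losses.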
As presented in equation (6), the similarity calculation is included in the similarity loss. The similarity calculation is based on the optimized parameters $\theta_t$ and $\theta_v$ for appropriate results. The matching algorithm is shown in Algorithm 1. The similarities are sorted and the top K are picked as the evaluation scope, together with the corresponding representation features. The content corresponding to those representation features is then returned as a list ordered by the sorted top K similarities. The algorithm outputs cross-modal search results according to the query. The matching similarities are calculated with the trained model to obtain the most appropriate results.
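The sort-and-select step of the matching procedure can be sketched as follows. The sketch assumes that representation features are flat vectors and that `sim` is cosine similarity; the text does not fix the similarity function, so this is an illustrative choice, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_top_k(query_feature: Sequence[float],
                 doc_features: List[Sequence[float]],
                 k: int) -> List[Tuple[int, float]]:
    """Return (document index, similarity) for the k most similar items."""
    scored = [
        (cosine_similarity(query_feature, s), idx)
        for idx, s in enumerate(doc_features)
    ]
    # Sort by similarity, highest first, then keep the top k.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]
```

The returned indices would be mapped back to the original posts or images, giving the ranked list that the evaluation metrics score.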

Experiments and Analyses
Experiments on real-world datasets are conducted to verify the effectiveness of the proposed method on cross-modal search from social networks. The real-world datasets consist of text-image pairs collected from Sina Weibo. Without loss of generality, the widely used Wikipedia [42] and NUS-WIDE [43] cross-modal datasets are also used to verify the effectiveness of the proposed method. In this section, the effects of changing empirical hyperparameter values and cross-modal search efficiency are shown and analyzed.

Evaluation Metrics.
The mean average precision (MAP) for the top K and the precision-scope curve are adopted as evaluation metrics to measure the performance of the proposed method. Following [33], MAP can be calculated as follows:
$$\mathrm{MAP@}K = \frac{1}{Q} \sum_{q=1}^{Q} \mathrm{AP}(q)@K. \quad (12)$$
In equation (12), Q is the number of queries, and K is the number of contents to be returned as search results. The top-k search precision is denoted as P(k), which is also adopted as a measure of the search results for the scope K, presented as a precision-scope curve. The average precision is computed in equation (11) as a component of equation (12).
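A minimal sketch of MAP@K follows. It assumes a common definition of average precision, in which P(k) is accumulated at each rank k that holds a relevant item and the sum is divided by the number of relevant items found in the top K; the exact normalization used in the paper's equation (11) is not recoverable here, so treat this as one reasonable convention.

```python
from typing import List, Sequence


def average_precision_at_k(relevance: Sequence[int], k: int) -> float:
    """AP@K for one query; relevance[i] is 1 if the item at rank i+1 is relevant."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # P(rank) at a relevant position
    return precision_sum / hits if hits else 0.0


def mean_average_precision_at_k(all_relevance: List[Sequence[int]], k: int) -> float:
    """MAP@K: the mean of AP@K over all queries."""
    if not all_relevance:
        return 0.0
    return sum(average_precision_at_k(r, k) for r in all_relevance) / len(all_relevance)
```

For a precision-scope curve, the same relevance lists can be reused by plotting P(k) against k instead of averaging.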

Parameter Learning Results and Analyses.
We conduct an experiment to show the impact of the empirical values α and β in the generation loss on the search performance; the results in turn provide a basis for setting these values. MAP is used to evaluate the performance while the empirical values vary. The evaluations on the three datasets are presented in Figures 2-4. The empirical values alpha and beta are the weight parameters for the label loss and the similarity loss, respectively.
As shown in Figure 2, we evaluate the top 50 search results by computing MAP@50 for varying alpha and beta on the Sina Weibo dataset. The MAP@50 values show different distributions, with the common point that MAP@50 is better when beta = 0.1. This means that the similarity loss requires a smaller weight than the label loss for a high MAP@50 evaluation. As shown in Figure 3, the effects of the empirical values on search performance on the Wikipedia dataset are smaller than those on the Sina Weibo dataset. Different from Figure 2, MAP@50 fluctuates less as alpha and beta vary. The results presented in Figure 3 also provide a reference for alpha and beta. Considering Figures 2 and 3 together, the empirical values can be set to a group of suitable values for appropriate search results. Figure 4 presents the impact of the empirical values on cross-modal search on the NUS-WIDE dataset. The numerical distribution for the NUS-WIDE dataset is relatively flat, as in Figure 3.
The results show that the dataset property has a direct impact on the empirical value assignment. Similar to the Wikipedia dataset, the semantics of the cross-modal information in the NUS-WIDE dataset are more obvious, with less sparsity. Furthermore, the correspondence of cross-modal data in NUS-WIDE is clearer because simple text content is used as the semantic label. Therefore, the empirical values impacting the image-to-text search performance of MAP@50 in the NUS-WIDE dataset are greater than those in the Wikipedia dataset. The proposed method sets the empirical values of alpha and beta according to the dynamic evaluations as described.
The learning process is inseparable from appropriate empirical values. We adopted alpha = 1 and beta = 0.1 for image searches with text input and alpha = 0.1 and beta = 0.1 for text searches with image input on both the Sina Weibo dataset and the Wikipedia dataset. According to Figure 4, alpha = 0.1 and beta = 100 for image searches with text input and alpha = 100 and beta = 10 for text searches with image input are appropriate for the NUS-WIDE dataset.

Search Result Evaluations and Analyses.
The evaluations on the Sina Weibo dataset are presented in Table 1, while those on the Wikipedia dataset are presented in Table 2. The evaluations on the NUS-WIDE dataset are presented in Table 3.
In Table 1, txt2img means entering a text query with the target topic to search among images with the same topics (img2txt means the reverse). As shown, the proposed CMSAL method outperforms the selected baseline methods. For CMSAL itself, the img2txt task obtains better MAP evaluations for the top 5 than the txt2img task. The reason is that the original images contain abundant semantic information that is extracted and represented appropriately. The extracted CNN features can preserve and present the valuable local semantics in detail.

Algorithm 1: Matching algorithm.
Input: Query set $Q = \{q_1, q_2, \ldots, q_T\}$ about the target topic d; cross-modal representation features from social networks generated by the networks with optimized parameters $\theta_t$ and $\theta_v$.
(1) For q in the query set Q:
(2)   Distinguish the modality type of q
(3)   Preprocess q into the corresponding feature blocks (text blocks for q in text, image blocks for q in images)
(4)   Extract the representation features $S_q$
(5)   For s in the cross-modal representation features set $S_{doc}$:
(6)     Compute the similarity according to the query: similarity = sim($S_q$, s)
(7)   End For

The Sina Weibo dataset contains typical raw real-world data from various users, including casually written text and low-resolution images, which provide sparse cross-modal semantics. As expected, the results on the Sina Weibo dataset achieved higher evaluation values than the results on the Wikipedia dataset. The reason is that semantic features in the Sina Weibo dataset are relatively concentrated and prominent. As shown in Table 3, the proposed CMSAL method outperforms the selected standard methods. MAP evaluations on the NUS-WIDE dataset are smaller than those on the Sina Weibo dataset. The main reason is that the characteristics of the NUS-WIDE dataset are different from those of the Sina Weibo and Wikipedia datasets. On the NUS-WIDE dataset, images are labeled with relatively simple text content, which clarifies the correspondence between text and images. In addition, in terms of image data quality, the NUS-WIDE dataset has simplified semantic information, as is typical of public datasets. Therefore, the MAP evaluations of search results on the NUS-WIDE dataset are closer to those on the Wikipedia dataset.

Precision-Scope Evaluations and Analyses.
As precision-scope curves are an indispensable form of evaluation for information search experiments, we plot the precision-scope curves of the proposed CMSAL method and all the selected baseline methods. The experimental results on the Sina Weibo, Wikipedia, and NUS-WIDE datasets are shown in Figures 5-7.
As shown in Figures 5 and 6, the proposed CMSAL method shows better performance than any of the other methods. In general, the measures of all the methods show similar trends with a small numerical gap. Similar to the MAP evaluations, GAN-based methods, which rely on targeted adversarial learning integrated with the advantages of deep neural networks (DNNs), achieve better performance than DNN-based methods. The classical DCCA method shows the worst evaluation values, consistent with the MAP evaluations. In DCCA, the nonlinear mapping and the canonical correlation analysis learning are relatively independent. The GAN-based methods, however, overcome the disadvantages of traditional and DNN-based methods. The proposed method conducts appropriate representation feature generation to maximize correlations in adversarial learning. The results of the precision-scope curves demonstrate the effectiveness of the proposed method. Figure 7 presents the precision-scope curves on the NUS-WIDE dataset for the tasks of searching for images from text input and searching for text from image input. As presented in Figure 7, the proposed CMSAL method outperforms the other selected baseline methods. In addition, the precision-scope curves of CMSAL on the NUS-WIDE dataset outperform those on the Sina Weibo and Wikipedia datasets. The reason is that the cross-modal content in the NUS-WIDE dataset is simple and clear. As the semantic labels of images, the text has clear semantic features; thus, the tasks of text-to-image and image-to-text search show good computing properties in local semantic mining and matching for CMSAL.

Conclusions
In this paper, we propose a cross-modal search method for social network cross-modal data based on adversarial learning (CMSAL). The proposed method integrates self-attention based on adversarial learning to realize the cross-modal search for the social network. The method explores cross-modal semantic features from the perspective of global representations of images and texts for a specific topic. Through adversarial learning, the method reconstructs representations for cross-modal matching. The designed adversarial learning framework is effectively used to rank the search results. Experimental results validate the effectiveness of the proposed method.

Conflicts of Interest
The authors declare that they have no conflicts of interest.