DM$^2$S$^2$: Deep Multi-Modal Sequence Sets with Hierarchical Modality Attention

There is increasing interest in the use of multimodal data in various web applications, such as digital advertising and e-commerce. Typical methods for extracting important information from multimodal data rely on a mid-fusion architecture that combines the feature representations from multiple encoders. However, as the number of modalities increases, several potential problems with the mid-fusion model structure arise, such as an increase in the dimensionality of the concatenated multimodal features and missing modalities. To address these problems, we propose a new concept that considers multimodal inputs as a set of sequences, namely, deep multimodal sequence sets (DM$^2$S$^2$). Our set-aware concept consists of three components that capture the relationships among multiple modalities: (a) a BERT-based encoder to handle the inter- and intra-order of elements in the sequences, (b) intra-modality residual attention (IntraMRA) to capture the importance of the elements in a modality, and (c) inter-modality residual attention (InterMRA) to further enhance the importance of elements with modality-level granularity. Our concept exhibits performance that is comparable to or better than that of previous set-aware models. Furthermore, we demonstrate that the visualization of the learned InterMRA and IntraMRA weights can provide an interpretation of the prediction results.


I. INTRODUCTION
In industry (e.g., digital advertising [1], [2] and e-commerce [3], [4]), huge amounts of multimodal data are often obtained, in which each instance is described by a set of information from multiple modalities. Thus, handling multimodal information is a fundamental problem in most web applications. Even deep neural networks (DNNs) face technical challenges in handling multimodal data [5].
The mainstream approaches of multimodal models are based on a mid-fusion architecture, which encodes the data from each modality and then concatenates the multiple features [6], [7]. Previous studies have explored various mid-fusion models, such as gated multimodal units (GMU) [6] and neural architecture search (NAS)-based models [7]. However, mid-fusion models often face several challenges. First, as the number of modalities increases, the dimensionality of the concatenated features generally increases as well. Moreover, mid-fusion models require the impractical assumption that each instance in a dataset must have the same number of modalities without missing information.
To address the above multimodal problems, Reiter et al. [8] proposed deep multimodal sets (DMMS) based on Deep Sets [9], a class of models defined on sets. The underlying concept of the DMMS is that a function defined on sets should be invariant to the number and order of the elements in a set (i.e., the modalities). However, as the DMMS encodes/compresses each modality and then further compresses the features using a pooling operation, the architecture may discard important signals within a single modality.
Although set-aware architectures such as the DMMS [8] can deal with "unordered" sets of modalities, multimodal information takes various forms; certain types of data are inherently sequential (e.g., text tokens), whereas others are not (e.g., categorical data). Recent BERT [10]-based multimodal methods often convert an image into textual tokens using optical character recognition (OCR) and thus obtain a textual sequence [11]–[13]. In particular, such a preprocessing technique is effective in extracting rich semantic information that is difficult to capture by encoding raw modality data. These situations motivate us to handle sets of sequences.
As the interpretability and explainability of model predictions are indispensable in industry, it is necessary to provide an interpretation of the importance of each modality. The DMMS [8] enables explainable predictions in terms of the importance of the modalities. However, conventional mid-fusion models may ignore the higher-order interactions between elements across sequences, as they only focus on the fusion of encoded modality representations. Furthermore, owing to the black-box nature of DNN-based models, such "encode-and-concatenate" models do not provide a breakdown of the importance of factors within a modality. This is an insufficient diagnosis in a practical setting; for example, it is often desirable to know the element-level importance to investigate problems in text tokenization.
In this study, we propose a new concept for handling multimodal data, known as deep multimodal sequence sets (DM$^2$S$^2$). Our concept consists of three components that capture the relationships between multiple modalities: (1) a BERT-based encoder, (2) intra-modality residual attention (IntraMRA), and (3) inter-modality residual attention (InterMRA). For (1), the BERT-based encoder outputs a rich representation that handles the inter- and intra-orders of the elements in a modality. In (2), IntraMRA captures the importance of the elements in a modality. In (3), InterMRA further enhances the importance of the elements with modality-level granularity. We employ OCR methods to extract semantic features from multimodal data (especially for the visual modality) that may not be captured by image recognition. We empirically demonstrate that it is preferable to use semantic features derived from OCR tokens over features encoded directly from visual information. Our two-step MRA can capture fine-granularity representations that are not explicitly modeled by the DMMS [8].
Empirical experiments on various real-world datasets reveal that our sequence-set-aware concept achieves performance that is comparable to or better than that of previous multimodal and set-aware models. In particular, the model based on our concept outperformed the model deployed in the production environment on a large-scale production advertising (Ad) dataset. Furthermore, the model enables the visualization of the contributions of the elements in the modalities to the model predictions, based on the importance learned by the MRA.
The contributions of this study are summarized as follows:
• We propose a new model for capturing higher-order interactions and the inter- and intra-relationships of each modality, considering that multimodal data often include set-structured contents.
• Our set-aware model with a multimodal structure achieves equivalent or better prediction performance compared to previous multimodal and set-aware models.
• An evaluation on datasets including a real-world production dataset demonstrates that our DM$^2$S$^2$ can reasonably be applied in a real environment. The visualization of the learned MRA weights can be used to explain predictions, which is important in industrial fields.

II. RELATED WORK

A. MULTIMODAL MODELS
In general, DNN-based multimodal models include multiple streams of networks for the modalities [14]. The models often have a component for fusing the features [2], [6], [15] to make a prediction based on the intermediate features from the streams. Hence, considering the fusion stage in the model architecture, multimodal architectures can be categorized into early fusion [15], mid-fusion [6], [7], [14], [16], and late fusion [2], [17]. The early fusion approach integrates the features of different modalities as inputs and uses a unified feature for the downstream task [15]. The mid-fusion approach involves the concatenation of features encoded from the raw data of each modality into a single feature [6], [7], [14], [16]. Typical multimodal architectures rely on the mid-fusion approach to combine multimodal information. However, mid-fusion may suffer from the increased dimensionality of the concatenated features and the impractical assumption of complete modality information for each sample. The late fusion approach combines the prediction scores of different predictors for multiple modalities [2], [17]. The late fusion method often yields sub-optimal models because the dependencies between modalities are ignored.
To solve this problem of the mid-fusion architecture, which is the most common architecture in multimodal models, Reiter et al. [8] proposed the DMMS, a set-aware model based on Deep Sets [9] that represents a collection of features as an unordered set rather than an ever-growing fixed-size vector. The DMMS can deal with the set input by pooling an arbitrary number of modality features into a constant-size vector, without dependence on the number of modalities. Nevertheless, the DMMS may ignore the importance of signals in a single modality by compressing the modality information through the leading encoder. Several mid-fusion models introduced techniques at an earlier stage of the architecture to capture the higher-order interactions between modalities when constructing each modality representation [18], [19]. Owing to its set-aware architecture, the DMMS can handle an arbitrary number of modality features without any unnatural workarounds (e.g., padding for missing modalities).
This study presents an architecture for retaining the fine-grained signals in each modality while capturing the importance between the modalities. Our proposed mid-fusion architecture operates similarly to models in the class of Deep Sets with modality-level granularity, as in the DMMS, while leveraging the hierarchical structure of multimodal data, namely, sets of sequences. As recent BERT-based encoders output a sequence of embeddings rather than a single representation for an input sequence, we leverage this as a rich modality representation through our two-step attention mechanism.
This study employs an attention mechanism to capture the intra-modality signals and inter-modality importance based on the representation of sequence sets. Although our proposed MRA is closely related to the modality attention proposed by Moon et al. [21], our framework clearly differs in that it considers the hierarchical structure, namely, the inter- and intra-relationships of the modalities. Furthermore, we introduce a residual connection to preserve the information of the source feature representation.

III. METHODOLOGY
First, we briefly introduce Deep Sets. Our DM$^2$S$^2$ is based on Deep Sets, which is a class of models defined on sets. Subsequently, we describe the proposed DM$^2$S$^2$ and its key component, namely, the MRA.

A. DEEP SETS
For each modality $m \in M$, we extract a sequence $x^{(m)} \in V^{l_m}$ of length $l_m$ from the tokens contained in the vocabulary set $V$. Each token is expected to be extracted appropriately depending on the modality type; for example, a word or character for the textual modality and an OCR-detected word for the visual modality. Moreover, tokens can be obtained from images, audio sources, and videos; that is, visual patches as in ViT [25] from an image, a transcription from audio, and the results of the same image procedure applied to each frame of a video.
We wish to handle the set of sequences $X = \{x^{(m)} \mid m \in M\}$ as the input of a model; the loss functions should also be computed on such sets. To this end, we design our proposed architecture based on Deep Sets [9], as in the DMMS [8]. A model in the class of Deep Sets can be expressed as follows:

$$f(X) = \rho(\phi(X)),$$

where $\phi$ is the neural network that encodes the input set $X$, and $\rho$ is the prediction network for a downstream task.
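As a toy illustration, the Deep Sets composition $f(X) = \rho(\phi(X))$ with permutation-invariant pooling can be sketched as follows; the encoder and prediction networks here are hypothetical stand-ins for learned networks, not the paper's models:

```python
import numpy as np

def phi(x):
    # Toy per-element encoder (stand-in for a learned network).
    return np.tanh(x)

def rho(z):
    # Toy prediction head on the pooled set representation.
    return float(z.sum())

def deep_sets(X):
    # f(X) = rho(pool({phi(x) : x in X})); sum pooling makes the output
    # invariant to the order (and number) of elements in the input set.
    pooled = np.sum([phi(x) for x in X], axis=0)
    return rho(pooled)

a = deep_sets([np.array([1.0, 2.0]), np.array([0.5, -1.0])])
b = deep_sets([np.array([0.5, -1.0]), np.array([1.0, 2.0])])  # permuted set
```

Because the pooling is a sum, `a` and `b` coincide, demonstrating the invariance to the order of the set elements.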
As described above, we use the sets of sequences extracted from each modality as inputs to the model. Therefore, in contrast to the conventional models [8], [9], we aim to handle the order between the tokens $\{x^{(m)}_t\}_{t=1}^{l_m}$ within each modality.

B. DM$^2$S$^2$

Our DM$^2$S$^2$ consists of three components: (1) a BERT-based encoder, (2) IntraMRA, and (3) InterMRA. The input is the set of sequences from multiple modalities. Each set contains token representations, as described previously. Even if this unimodalized (tokenized) information is input, we consider this to be a multimodal setting.
First, to handle the sets of sequences appropriately, a BERT-based encoder is used as a building block of $\phi$. We construct a unified sequence by concatenating the sequences of the modalities:

$$x = (x^{(1)}; x^{(2)}; \ldots; x^{(M)}).$$

We assume that each sample has the same number of modalities (i.e., $M$). When a modality is missing for a sample, the corresponding entry $x^{(m)}$ can be filled with an empty sequence. Hence, by ensuring a fixed order of modalities in $x$, the BERT encoder can be viewed as permutation-invariant in terms of the order of the modalities, and permutation-sensitive in terms of that of the tokens within a single modality. The encoder maps $x$ to hidden states, from which we take the sub-matrix $H^{(m)} \in \mathbb{R}^{l_m \times d}$ corresponding to modality $m$.
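The fixed-order concatenation with empty sequences for missing modalities can be sketched as follows; the function and modality names are illustrative, not from the paper:

```python
def build_input(sequences, modality_order):
    # Concatenate modality token sequences in a fixed modality order;
    # a missing modality contributes an empty sequence, so the layout
    # of the unified sequence x is stable across samples.
    return [tok for m in modality_order for tok in sequences.get(m, [])]

# "category" is missing for this sample and is filled with an empty sequence.
x = build_input({"text": ["a", "movie"], "ocr": ["poster"]},
                ["text", "ocr", "category"])
```

Because the modality order is fixed, two samples with different missing modalities still produce consistently laid-out unified sequences.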
As the BERT encoder can handle the order of tokens appropriately, we further enhance both the intra- and inter-modality relationships. Intuitively, the importance of each token from multiple modalities can be decomposed into inter- and intra-modality factors; a token may be important because it belongs to a modality that is essential for the prediction, and/or because it is representative within its modality. We introduce a two-step MRA on the representations $H^{(m)}$ to model this hierarchical structure of token importance.
We apply $\mathrm{IntraMRA}: \mathbb{R}^{l_m \times d} \to \mathbb{R}^{l_m \times d}$ to quantify the token importance in each modality $m \in M$ (described in detail in Section III-C2).
We denote the concatenated matrix of the IntraMRA outputs for all modalities as $H_{\mathrm{Intra}} = (H^{(1)}_{\mathrm{Intra}}; \ldots; H^{(M)}_{\mathrm{Intra}}) \in \mathbb{R}^{L \times d}$, where $L = \sum_{m \in M} l_m$.

FIGURE 1. Overview of DM$^2$S$^2$. The input is the set of sequences from multiple modalities. Each sequence is encoded into hidden states by the BERT-based encoder. Our Intra- and InterMRA learn the relationships among the modalities and their contributions. The output is a hidden vector obtained through the encoder and the two attention mechanisms.

Subsequently, we apply $\mathrm{InterMRA}: \mathbb{R}^{L \times d} \to \mathbb{R}^{L \times d}$ to capture the importance of each modality (described in detail in Section III-C3):

$$H_{\mathrm{Inter}} = \mathrm{InterMRA}(H_{\mathrm{Intra}}).$$

The representation $H_{\mathrm{Inter}}$ is aggregated into the final hidden vector $h \in \mathbb{R}^d$ through a pooling function $\psi: \mathbb{R}^{L \times d} \to \mathbb{R}^d$. In this study, we use the following mean pooling as $\psi$:

$$h = \psi(H_{\mathrm{Inter}}) = \frac{1}{L} \sum_{i=1}^{L} H_{\mathrm{Inter},i},$$

where $H_{\mathrm{Inter},i}$ is the $i$-th token representation of $H_{\mathrm{Inter}}$. The projection defined by the above procedure can be considered a permutation-agnostic encoder $\phi$, because the Intra/InterMRAs and $\psi$ are evidently permutation-invariant in terms of the order of the modalities. Based on the representation of a sequence set $\phi(X) = h$, we obtain the prediction $\hat{y} = f(X)$ through a multi-layer perceptron (MLP) $\rho$: $\hat{y} = \rho(h)$.

C. MODALITY RESIDUAL ATTENTION (MRA) 1) Attention mechanism
Our concept of the MRA originates from the attention mechanism [20]. The attention mechanism estimates the contribution of the input to the prediction. Using these properties, we propose the MRA, which can learn the intra- and inter-relationships of multiple modalities. A possible extension is to replace our Intra- and InterMRA mechanisms by adding a Transformer module [26] on top of the BERT encoder; however, we leave this as future work.

We compute the attention score from the alignment function $A: \mathbb{R}^{l \times d} \to [0, 1]^{l}$ by passing the hidden state $H \in \mathbb{R}^{l \times d}$ to the attention mechanism, which consists of a similarity function $S: \mathbb{R}^{l \times d} \to \mathbb{R}^{l}$ followed by the softmax function:

$$A(H) = \mathrm{softmax}(S(H)).$$

The similarity function may be the additive attention [20] or the scaled dot-product attention [26]. In this study, we considered the following additive attention:

$$S(H) = \tanh(HW)\,q,$$

where $W \in \mathbb{R}^{d \times d}$ is a trainable weight matrix and $q \in \mathbb{R}^{d}$ is a trainable self-attention vector [27]. We demonstrate that the additive attention is preferable to the scaled dot-product attention in Section V-B2.
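The alignment function (softmax over an additive similarity) can be sketched as follows; the parameterization $S(H) = \tanh(HW)\,q$ is one common form of additive self-attention, and the shapes follow the definitions in the text:

```python
import numpy as np

def softmax(s):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(s - s.max())
    return e / e.sum()

def additive_attention(H, W, q):
    # S(H) = tanh(HW) q maps R^{l x d} -> R^l; softmax maps it to [0, 1]^l.
    scores = np.tanh(H @ W) @ q
    return softmax(scores)

rng = np.random.default_rng(0)
l, d = 4, 8
H = rng.normal(size=(l, d))   # hidden states from the encoder
W = rng.normal(size=(d, d))   # trainable weight matrix (random stand-in)
q = rng.normal(size=d)        # trainable self-attention vector (random stand-in)
a = additive_attention(H, W, q)
```

The output `a` is a length-`l` probability vector: non-negative entries summing to one, one weight per token.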

2) IntraMRA
IntraMRA learns the token-level attention score for modality $m$. Although it is possible to learn global attention (i.e., attention over the entire concatenated token sequence) for the token sequence that is input into the encoder, the attention score for each token would then be relatively small. IntraMRA instead learns the intra-modality relationships by calculating the attention over the token sequence within a modality. We ensure that the values of the feature representation do not become excessively small by employing a residual connection [28] that preserves the information of the source feature representation. For the hidden state $H^{(m)}_t$ obtained by the encoder, the IntraMRA score $a^{(m)}_{\mathrm{Intra},t}$ is calculated as follows:

$$a^{(m)}_{\mathrm{Intra}} = \mathrm{softmax}(\tanh(H^{(m)} W^{(m)})\,q^{(m)}),$$

where $W^{(m)} \in \mathbb{R}^{d \times d}$ and $q^{(m)} \in \mathbb{R}^{d}$ are trainable parameters for modality $m$. Using the attention score, we obtain the intra-attended representation $H^{(m)}_{\mathrm{Intra},t}$ with the residual function $F$:

$$H^{(m)}_{\mathrm{Intra},t} = F(H^{(m)}_t) = H^{(m)}_t + a^{(m)}_{\mathrm{Intra},t} H^{(m)}_t.$$

IntraMRA outputs the concatenated matrix for each element:

$$H^{(m)}_{\mathrm{Intra}} = (H^{(m)}_{\mathrm{Intra},1}; \ldots; H^{(m)}_{\mathrm{Intra},l_m}).$$

3) InterMRA

InterMRA learns the modality-level (e.g., visual-, textual-, and categorical-level) attention scores. The mechanism captures the modalities that contribute to the prediction, which is important for model training. We employ a residual connection for InterMRA with the expectation that the connection will provide the same effect as in IntraMRA. We first compute the summation of each modality representation to form $\bar{H} \in \mathbb{R}^{M \times d}$, whose $m$-th row contains the hidden representation for modality $m$:

$$\bar{H}^{(m)} = \sum_{t=1}^{l_m} H^{(m)}_{\mathrm{Intra},t}.$$

Thereafter, we calculate the InterMRA score $a_{\mathrm{Inter}}$:

$$a_{\mathrm{Inter}} = \mathrm{softmax}(\tanh(\bar{H} \bar{W})\,\bar{q}),$$

where $\bar{W} \in \mathbb{R}^{d \times d}$ and $\bar{q} \in \mathbb{R}^{d}$ are trainable parameters for InterMRA. We obtain the following inter-attended representation $H^{(m)}_{\mathrm{Inter},t}$ for modality $m$ with the residual function $F$:

$$H^{(m)}_{\mathrm{Inter},t} = F(H^{(m)}_{\mathrm{Intra},t}) = H^{(m)}_{\mathrm{Intra},t} + a^{(m)}_{\mathrm{Inter}} H^{(m)}_{\mathrm{Intra},t}.$$

Finally, InterMRA outputs the resulting concatenated matrix for each modality, as follows:

$$H_{\mathrm{Inter}} = (H^{(1)}_{\mathrm{Inter}}; \ldots; H^{(M)}_{\mathrm{Inter}}).$$
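A minimal sketch of the two-step MRA follows; for brevity it shares a single `W` and `q` across modalities and across both attention levels, whereas the method learns separate parameters per modality and per level:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attention(H, W, q):
    # Additive attention: softmax(tanh(HW) q), one weight per row of H.
    return softmax(np.tanh(H @ W) @ q)

def intra_mra(H_m, W, q):
    # Token-level attention within one modality, with a residual
    # connection: H_t + a_t * H_t keeps the source signal intact.
    a = attention(H_m, W, q)
    return H_m + a[:, None] * H_m

def inter_mra(H_mods, W, q):
    # Modality-level attention over summed modality representations,
    # again applied with a residual connection per modality.
    H_bar = np.stack([H.sum(axis=0) for H in H_mods])  # (M, d)
    a = attention(H_bar, W, q)
    return [H + a[m] * H for m, H in enumerate(H_mods)]

rng = np.random.default_rng(0)
d = 8
H_mods = [rng.normal(size=(l, d)) for l in (3, 5)]     # two toy modalities
W, q = rng.normal(size=(d, d)), rng.normal(size=d)     # random stand-ins
H_intra = [intra_mra(H, W, q) for H in H_mods]
H_inter = inter_mra(H_intra, W, q)
h = np.concatenate(H_inter).mean(axis=0)               # mean pooling psi
```

The final `h` corresponds to the pooled hidden vector that is passed to the prediction MLP.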

IV. EXPERIMENTS

A. MULTIMODAL DATASETS FOR EVALUATION
We used three multimodal datasets, namely, MM-IMDB [6], Ads-Parallelity [2], and Production Ad-LP datasets for the empirical experiments. The MM-IMDB [6] dataset contains 25,925 movies with multiple labels (genres). We used the original split provided in the dataset and reported the F1 scores (micro, macro, and samples) of the test set. The Ads-Parallelity [2] dataset contains 670 images and slogans from persuasive advertisements to understand the implicit relationship (parallel and non-parallel) between these two modalities.
The task is a binary classification to predict whether the text and image in the same ad convey the same message. We report the overall and per-class average accuracy and ROC-AUC scores from five-fold cross-validation [2], [8].
The Ad-LP dataset contains 257,235 search engine ads and landing pages (LPs) with search keywords, ad titles, descriptions, and LP text. The dataset was collected from CyberAgent Inc. 1 from August 1, 2020, to November 30, 2020. The task is to regress the conversion rate from the LP after clicking on the ad; the conversion rate expresses the number of clicks/total number of impressions of the ad. We divided the dataset into training, validation, and testing sets. The resulting splits comprised 179,922, 45,138, and 32,175 Ad-LP pairs, respectively.

B. INPUT FEATURES
We transformed the following multimodal information (i.e., visual, textual, and categorical data) into textual tokens and fed these into our proposed model. We used the Google Cloud Vision API for the visual features to obtain the following four pieces of information as tokens: (1) text from the OCR, (2) category labels from the label detection, (3) object tags from the object detection, and (4) the number of faces from the facial detection. We input the label and object detection results as a sequence in descending order of the confidence obtained from the API. We describe the visual, textual, and categorical features of each dataset below. For the Production Ad-LP dataset, we used the following features: (1) as visual features, the result of applying the API to a screenshot of the first view of the LPs; (2) as textual features, the search words, titles, and descriptions of the ads, and the URL path of the LPs; and (3) as categorical features, the match type of the search keywords as tokens. Refer to Appendix A for details on the production dataset.
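The confidence-ordered tokenization of detection results can be sketched as follows; the field names `description` and `score` are illustrative stand-ins, and actual API response fields may differ:

```python
def detections_to_tokens(detections):
    # Order label/object detections by confidence (descending) and keep
    # only the text, producing a token sequence for the visual modality.
    ranked = sorted(detections, key=lambda d: d["score"], reverse=True)
    return [d["description"] for d in ranked]

toks = detections_to_tokens([
    {"description": "dog", "score": 0.72},
    {"description": "animal", "score": 0.95},
])
```

The resulting token list can then be concatenated with the OCR text and the other modality sequences before encoding.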

C. IMPLEMENTATION DETAILS
We implemented our proposed concept using AllenNLP version 2.5.0 [29] with Hugging Face transformer version 4.5.1 [30]. We evaluated the test set only once in all experiments. The experiments were conducted on an Ubuntu 20.04 PC with an NVIDIA RTX A6000 GPU.
Following the DMMS of Reiter et al. [8], we used a pretrained RoBERTa large encoder [23] 4 in MM-IMDB and Ads-Parallelity as the BERT-based encoder. The encoder outputs the hidden state with dimension d = 1024. We froze the parameters of the encoder and used the output of the hidden representation. We used a pre-trained Japanese BERT 5 as the encoder in the Production Ad-LP dataset. This encoder outputs a hidden state with dimension d = 768.
We followed the settings of Reiter et al. [8] for the optimizer. Specifically, we used Adam [31] with decoupled weight decay [32]. The learning rate was warmed up linearly from 0 to 0.001 during the first five epochs and then decayed with a cosine annealing schedule over the remaining epochs. For the MM-IMDB and Ads-Parallelity datasets, which are designed for multi- and single-label classification tasks, respectively, we used the sigmoid cross-entropy with class weights [33] to train the model on $N$ training samples:

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N} \sum_{j=1}^{N} \sum_{k=1}^{K} w_k \left[ Y_{jk} \log \hat{Y}_{jk} + (1 - Y_{jk}) \log(1 - \hat{Y}_{jk}) \right],$$

where $K$ is the number of classes and $w_k = 1/N_k$ is the class weight for class $k$, calculated from the class frequency $N_k$. Here, $Y \in \{0, 1\}^{N \times K}$ is the target label matrix in which the $(j, k)$-entry indicates whether training sample $j$ is in class $k$, and $\hat{Y} \in [0, 1]^{N \times K}$ is the predicted label matrix. For the Production Ad-LP dataset, which is a regression task, we used the root mean squared error (RMSE) to train the model, as follows:

$$\mathcal{L}_{\mathrm{RMSE}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2},$$

where $y_i$ is the $i$-th ground truth and $\hat{y}_i$ is the $i$-th predicted value. Refer to Appendix B for further implementation details.
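The two training objectives can be sketched as follows; this is a simplified version in which the class weights are computed from the given batch, whereas the paper computes them from dataset-level class frequencies:

```python
import numpy as np

def weighted_bce(Y, Y_hat, eps=1e-7):
    # Sigmoid cross-entropy with per-class weights w_k = 1 / N_k,
    # where N_k is the frequency of class k (here: in the batch).
    N, K = Y.shape
    w = 1.0 / np.clip(Y.sum(axis=0), 1, None)          # (K,)
    P = np.clip(Y_hat, eps, 1 - eps)                   # avoid log(0)
    ll = Y * np.log(P) + (1 - Y) * np.log(1 - P)
    return float(-(w * ll).sum() / N)

def rmse(y, y_hat):
    # Root mean squared error for the regression task.
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

Y = np.array([[1, 0], [1, 1]])
loss = weighted_bce(Y, np.array([[0.9, 0.1], [0.8, 0.7]]))
err = rmse(np.array([0.1, 0.3]), np.array([0.2, 0.3]))
```

Both functions operate on batch matrices/vectors and return scalar losses suitable for monitoring or (in an autodiff framework) optimization.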

V. RESULTS

A. COMPARISON WITH BASELINES
For MM-IMDB and Ads-Parallelity, we compared our model, DM$^2$S$^2$, with state-of-the-art baselines based on NAS, Transformers, and set-aware models with a BERT-based encoder. Tables 1 and 2 present the evaluation results for MM-IMDB and Ads-Parallelity, respectively. We refer to the scores provided in each of the papers for comparison and report them along with the DM$^2$S$^2$ scores. For MM-IMDB, we compared our DM$^2$S$^2$ with five state-of-the-art models: GMU [6], Bilinear-Gated [34], multimodal fusion architecture search (MFAS) [7], the multimodal bitransformer (MMBT) [35], and the DMMS [8]. In addition to the above baselines, we compared our model with a pure BERT-based encoder (indicated in the tables as "RoBERTa encoder only") that uses token type IDs (also known as segment IDs) to differentiate tokens belonging to different modalities. This is a common approach in the multimodal literature [38].
As indicated in Table 1, DM$^2$S$^2$ consistently outperformed the baselines on all performance measures. The comparison between DM$^2$S$^2$ and the NAS-based MFAS reveals the advantage of the token- and modality-level fusion provided by our two-step MRA, in contrast to MFAS, which searches for the best mid-fusion architecture using a NAS algorithm. As the DMMS is closely related to our DM$^2$S$^2$ in terms of the set-aware fusion technique and the BERT-based encoder, the performance gain of our DM$^2$S$^2$ over the DMMS is remarkable. However, it should be noted that the DMMS uses a proprietary OCR method, and thus, a fair comparison between DM$^2$S$^2$ and the DMMS is difficult owing to the difference in OCR performance. Therefore, we reproduced the DMMS as faithfully as possible and trained the model on a dataset built with Tesseract [37], EasyOCR [39], and the Cloud Vision API.
According to Table 1, DM$^2$S$^2$ with the Cloud Vision API exhibited better results than with the other OCR methods. The performance of DM$^2$S$^2$ with EasyOCR, a neural model based on CRAFT [40] and CRNN [41], was comparable to that of the DMMS. When we conducted experiments using the same OCR method (the Cloud Vision API), our DM$^2$S$^2$ outperformed the DMMS. We also obtained good results on all performance measures for the RoBERTa encoder only model with token type IDs. Based on these results, we conclude that our MRA contributes more to learning from multimodal data than the pure BERT-based model.
For Ads-Parallelity, as indicated in Table 2, we compared DM$^2$S$^2$ with Combined Classifiers [2] and the DMMS [8]. Furthermore, we compared against the RoBERTa encoder only model, as described above. In terms of all measures, the DMMS and DM$^2$S$^2$ (with the Cloud Vision API) substantially outperformed Combined Classifiers, which uses features carefully designed for this dataset; moreover, a visual feature in Combined Classifiers is also based on the Cloud Vision API. This demonstrates the effectiveness of the BERT-based encoder and the set-aware architectures. Despite Combined Classifiers using visual features from the Cloud Vision API, our DM$^2$S$^2$ outperformed it, and also outperformed the DMMS when using the Cloud Vision API.

TABLE 1. Comparison of GMU, Bilinear-Gated, MFAS, MMBT, DMMS, and DM$^2$S$^2$ on the MM-IMDB dataset. Compared to state-of-the-art models that handle multiple input formats (i.e., images and text encoded by their respective encoders), the proposed method, which handles a single input format (i.e., multiple modalities as text), achieves comparable or better prediction performance.

In summary, our DM$^2$S$^2$ significantly outperformed the conventional methods on the public datasets. Compared with the baseline models that encode images directly, our model, which tokenizes even images, consistently exhibited superior results. We believe that our model can capture more semantic features by employing the OCR method to extract features explicitly, rather than implicitly, from images. Our model is expected to be effective for modalities other than images, such as audio and video, in which text tokens can be obtained via transcription. Compared to the DMMS, which is based on a set-aware architecture and a BERT-based encoder, our DM$^2$S$^2$ performed better on all evaluation measures. Considering that the DMMS uses proprietary visual features, we also examined the effect of the OCR methods in DM$^2$S$^2$ as far as possible.
Table 3 compares the prediction performance of the MRA variants. Although InterMRA and IntraMRA are beneficial even when used independently, DM$^2$S$^2$ achieved the best performance by leveraging them simultaneously. The performance gain of IntraMRA was slightly larger than that of InterMRA, possibly because intra-modality importance is essential for capturing the few important tokens in a sequence when the number of tokens is large. However, InterMRA quantifies the importance of modalities rather than tokens, and is thus complementary to IntraMRA.

We further examined the effectiveness of our concept on a real-world dataset, namely, Production Ad-LP. As a baseline model for this dataset, we considered a gradient boosting decision tree [42] model that provides daily predictions of conversion rates from ads and LPs in the production environment at CyberAgent Inc. The model was implemented using LightGBM [43] and used the same input features as those described in Section IV-B. Moreover, as ablated variants of DM$^2$S$^2$, we considered DM$^2$S$^2$ with and without Intra- and InterMRA. Table 4 presents the results of the models in terms of the RMSE, MAPE, and AUC. To avoid exposing the raw measurements of the production model, the performance of each model was divided by that of the LightGBM model, as indicated at the top of the table. We considered the RMSE as the main metric for the production scenario, as we were interested in the difference between the true and predicted probabilities; predicting the exact probability (i.e., the conversion rate) is crucial for cost-effective ads. DM$^2$S$^2$ with Intra- and InterMRA exhibited a substantial performance gain in terms of the RMSE compared to the LightGBM model. All measures revealed the performance gains of the Inter- and IntraMRA.
We observed the same trend for this dataset as for the MM-IMDB and Ads-Parallelity datasets: the gain of IntraMRA was larger than that of InterMRA, and DM$^2$S$^2$ achieved the best performance with both of them.

Table 5 presents a comparison of the DM$^2$S$^2$ prediction performance with different variations of the similarity function. We observed an average performance difference of approximately 1% between the additive attention and the scaled dot-product attention on both the MM-IMDB and Ads-Parallelity datasets. This result confirms that the additive attention in our DM$^2$S$^2$ is effective. An extension that replaces the MRAs could be realized by adding a Transformer module on top of the BERT encoder; however, because that module is based on scaled dot-product attention, these results suggest that such a replacement would yield limited performance gains.

Table 6 displays a performance comparison of DM$^2$S$^2$ with and without a residual connection in the MRA. On both the MM-IMDB and Ads-Parallelity datasets, the residual connection contributed an average performance improvement of approximately 2%. Therefore, the residual connection plays an important role in the MRA.

4) Erasing Modality Feature During Inference
We compared the prediction performance when one of the modalities was erased to identify the modalities that contribute to the prediction in DM$^2$S$^2$. Fig. 2 depicts the change in the prediction performance of our model when one modality was erased during inference. We confirmed that erasing any modality feature impaired the performance.
Remarkably, a significant performance degradation was observed when the OCR text was erased. This result implies that the OCR text obtained from the visual modality contributes substantially to the model's prediction. In the samples in Fig. 3(a) and 3(b), where the predictions were correct, the MRA captured the corresponding words (or subwords) that contributed to the prediction. For example, in Fig. 3(a), we can observe a strong response to words relating to the documentary film in IntraMRA (e.g., "documentary" in the movie plot). Furthermore, InterMRA could select important modalities; for example, it strongly focused on the movie plot and OCR text features that contained words directly related to the prediction. Remarkably, it can be observed from Fig. 3(a) that high IntraMRA weights were assigned to the split words "document" and "ary" in the OCR text. This is possibly because the BERT-based encoder encoded the two subwords by considering the context, and IntraMRA could assign importance to them based on the encoded representations. Consequently, errors in the OCR could be compensated for. This can be viewed as evidence that, even with EasyOCR, DM$^2$S$^2$ can achieve performance comparable to that of the DMMS with a proprietary OCR method (see Tables 1 and 2). In the samples in which the model made an incorrect prediction (as illustrated in Fig. 3(c)), although the model could capture important words such as "friends" in the movie plot, the overall importance was dominated by the other modalities.

VI. CONCLUSION AND FUTURE WORK
We have proposed a new concept for a multimodal set-input DNN architecture, namely, DM$^2$S$^2$. Our proposed concept consists of three components that capture the global and local relationships among multiple modalities. DM$^2$S$^2$ is specifically designed for sequence sets, which are structured data that are often collected in real multimodal applications. Whereas DM$^2$S$^2$ is compatible with the sequential embeddings provided by recent BERT-based encoders, the Intra- and InterMRA capture the hierarchical importance of the elements in the modality sequences. We empirically demonstrated the effectiveness of our concept compared to state-of-the-art multimodal models. Furthermore, we examined its applicability to real applications by confirming a substantial performance gain over production-running models using real-world Ad data.
In future work, we would like to further explore attention mechanisms for various structures of multimodal data, such as grouped modalities (i.e., sets of sets) and hierarchical fields (i.e., sets of trees). Moreover, we plan to investigate the theoretical relationship between our attention-based fusion and conventional multimodal methods, such as gradient blending [17] and factorization machines [44].

APPENDIX A THE DETAILS OF THE PRODUCTION AD-LP DATASET
The statistics for the Production Ad-LP dataset are listed in Table 4. The dataset was collected from CyberAgent Inc. from August 1, 2020, to November 30, 2020. This Japanese dataset contains 257,235 search engine advertisements (Ads) and landing pages (LPs), and these two modalities are called high-level modalities. Each high-level modality includes several low-level modalities. For example, the Ad modality contains search keywords, multiple titles (title_1, title_2, and title_3, where title_2 and title_3 are optional values), multiple descriptions of the ads (description_1 and description_2, where description_2 is an optional value), an LP URL path (path_1 and path_2, where both are optional), and a match type. Several low-level modalities exist in the LP, such as the LP text obtained by applying optical character recognition (OCR) to LP screenshot images, safe search results (five types: adult, spoof, medical, violence, and racy), labels, object and face detection results, and page performance scores from the Lighthouse tool. The statistics of the features of each row are shown in the Details column. We used a screenshot of the first view of the landing page at iPhone 8 resolution (1334 × 750). The Lighthouse performance score has a value between 0 and 100.

We divided the dataset in an ad group-based manner to train and evaluate the models. An ad group contains one or more ads that share similar targets. In most advertising systems, ads are served in campaign units, and multiple creatives with similar trends are developed during a campaign. Each ad campaign consists of one or more ad groups. Thus, to avoid data leakage owing to potential similarity, we divided the dataset into training, validation, and test sets in an ad group-based manner. Note that, as our experiments were based on open datasets that are widely available to the public for performance comparison and analysis, no privacy handling issues arise.
From the advertising data received from the company, any information that could identify individuals was discarded. Moreover, we took great care to ensure that the results of our analysis would not be disclosed in any manner that would violate privacy or ethics.

APPENDIX B IMPLEMENTATION DETAILS
We used the OCR, label detection, object detection, and facial detection services of the Google Cloud Vision API to obtain the visual features as tokens. We converted the obtained features, including numeric ones, into strings and treated them as tokens.
To segment the Japanese text of the ads into words in the Ad-LP dataset, we used MeCab [45], a morphological analyzer. We used the UniDic analysis dictionary for MeCab as the custom dictionary. This is in accordance with the standard preprocessing of the Japanese BERT.