Learning Multimodal Word Representations by Explicitly Embedding Syntactic and Phonetic Information

Word embedding (i.e., word representation) transforms words into computable mathematical expressions (usually vectors) according to semantics. Compared with human semantic representation, these purely text-based models are severely deficient because they lack perceptual information attached to the physical world. This observation promotes the development of multimodal word representation models. Multimodal models have been proven to outperform text-based models on learning semantic word representations, and almost all previous multimodal models only focus on introducing perceptual information. However, it is obvious that syntactic information can effectively improve the performance of multimodal models on downstream tasks. Therefore, this article proposes an effective multimodal word representation model that uses two gate mechanisms to explicitly embed syntactic and phonetic information into multimodal representations and uses supervised learning to train the model. We select Chinese and English as examples and evaluate the model using several downstream tasks. The results show that our approach outperforms the existing models. We have made the source code of the model available to encourage reproducible research.


I. INTRODUCTION
Word embedding is often used in natural language processing (NLP) tasks such as machine translation [59], text classification [1], and dialogue systems [50]. There are various word embedding models, such as word2vec [56], GloVe [20], etc. Well-performing word embedding should reflect semantics accurately. At present, most popular methods for learning word embeddings are based on the distributional hypothesis, which utilizes cooccurrence statistics from massive text datasets. However, the process of humans understanding semantics is known in psycholinguistics as language comprehension [16]. Humans are first stimulated by perceptual information (text, sound, etc.), then extract implicit syntactic information from the brain, and finally use the information from their brain to reprocess the perceptual information and understand semantics. Therefore, compared to human semantic representation, these purely text-based models are severely deficient because they lack perceptual information attached to the physical world. This observation has led to the development of multimodal word representation models that utilize both linguistic (e.g., text) and perceptual information (e.g., images and audio). Such models can learn better semantic word representations than text-based models, as evidenced by a range of evaluations [8], [33].
The associate editor coordinating the review of this manuscript and approving it for publication was Arianna Dulizia.
A typical example is that the meanings of concrete words, such as "bird" and "thunder", are mostly learned from perceptual experiences of seeing, touching and listening. In contrast, more abstract words, such as "obscure" and "lovely", are less associated with perceptual modalities and act as relatively fixed parts in the sentence structure. Depending on the type of word, information from different modalities contributes differently to word meaning, as has been found in cognitive psychology [21], [22] and computational experiments [10].
However, the existing multimodal models focus on processing perceptual information and ignore syntactic information. Syntactic information, such as part of speech, refers to the results obtained by classifying the combinatorial relations between words in a sentence according to specific standards. Recently, Vashishth et al. [57] and Wang et al. [52] treated the word cooccurrence information collected implicitly under the distributional hypothesis as syntactic information. This approach relies heavily on continuous context, that is, on the integrity of the training corpus. For a target word with multiple parts of speech, if the contexts of a specific part of speech are abundant in the language but occur rarely or never in the training data, then the corresponding semantics will be missing from the learned word representation. For example, a representation for "break" that carries only the verb sense and not the noun sense is clearly not ideal. For low-frequency words, it is even more difficult to obtain syntactic information through the distributional hypothesis.
These factors inspire us to build a multimodal word representation model that can embed syntactic and perceptual information effectively, and the model is called MSP. To this end, two fusion mechanisms have been added to the MSP: a modality-specific gate and a language-specific gate. After constructing the perceptual and syntactic representations, the modality-specific gate uses the seq2seq neural network [2], [32], [35] to explicitly embed syntactic and phonetic information in word representations and train the model based on the supervised method. The second mechanism is a language-specific gate. It uses dynamic fusion methods [52] to assign fusion weights to each modality to increase the adaptability of MSP to different language groups. The reason is that in MSP, phonetic information acts as perceptual information while different languages have different emphases on phonetic information. For example, phonetic languages (such as English) are more dependent on phonetic information than ideographic languages (such as Chinese). In addition, extensive analysis was conducted to clarify the principles of the proposed method. In summary, we have two major contributions:
• We propose the multimodal word representation model called MSP. Compared with the existing word embedding models, MSP explicitly embeds syntactic and phonetic information in the model, simulates multimodal information fusion through two gate mechanisms, and obtains a multimodal word representation model with excellent performance through supervised training. The core idea of this model is that it uses supervised training to learn a set of general language information fusion rules.
• The use of syntactic information can significantly improve the performance of the multimodal word representation model. On various NLP tasks, we use multiple word representation models and pre-trained language models as baselines and set MSP-, which does not process syntactic information, as a control. The task results confirm this conclusion.

II. RELATED WORKS
Researchers have been working on building multimodal representation models for many years, most of which can be divided into two types.

A. JOINT TRAINING MODELS
These models build multimodal representations with raw inputs of both linguistic and perceptual resources. Recent work extends the skip-gram model [56]. For instance, Hill et al. [10] propose a corpus fusion method that inserts the perceptual features of a word in the training corpus, which is then used to train the skip-gram model. Lazaridou et al. [31] propose the MMSkip model, which injects visual information in the process of learning linguistic representations by adding a max-margin objective function to minimize the distance between linguistic vectors and visual vectors. The joint training methods implicitly propagate perceptual information to word representations and simultaneously learn multimodal representations. However, the abovementioned models do not introduce syntactic information. This weakens the effect of introducing perceptual information and consequently leads to only limited improvement. Vashishth et al. [53] incorporate syntactic and semantic information in word representations by using graph convolutional networks, and explicitly embedding syntactic information effectively improves the performance of the model; however, this model does not introduce perceptual information.

B. SEPARATE TRAINING MODELS
These models independently learn linguistic and perceptual representations and integrate them afterwards. The simplest approach is concatenation, which fuses linguistic and visual vectors by concatenating them. Concatenation has been proven to be effective in learning multimodal models [8], [10], [11]. Variations of this method apply transformation and dimension reduction techniques, including the singular value decomposition (SVD) [8] and canonical correlation analysis (CCA), to the concatenation result [10]. In addition, Vashishth et al. [53] and Silberer et al. [54] use a stacked autoencoder to learn multimodal representations by embedding linguistic and visual inputs into a common space with the objective of reconstructing the individual inputs. However, the abovementioned methods can only generate multimodal representations of those words that have image information, thus drastically reducing the multimodal vocabulary. Wang et al. [52] build a multimodal model that can dynamically fuse semantic representations of different modalities according to different types of words. In the last two years, research on constructing multimodal word representations using phonetic information has also been carried out. Zhu et al. [58] propose an enhanced double-carrier word representation via phonetics and writing. It trains written embeddings based on phonetic embeddings, and the final word representation fuses the written and phonetic embeddings. Zhu et al. [63] use a synchronized approach that adopts an attention model to utilize both text and phonetic perceptual information in unsupervised learning tasks.
In terms of the two types of models discussed in this section, MSP is a separate training model. Existing research shows that the above methods are all effective for generating multimodal word representations. However, whether joint or separate, most of them introduce only one class of modality information during the learning process. In contrast, MSP uses gate mechanisms to introduce both perceptual information and syntactic information in one model.

VOLUME 8, 2020

FIGURE 1. The four numbers correspond to the four steps of the method. L_w, P_w and S_w are the linguistic representation, perceptual representation and syntactic information of the target word w, respectively. In the fourth step, w and w̃ are semantic relational word pairs.

III. PROPOSED METHOD
Fig. 1 shows the framework of our proposed MSP, which contains four stages:
• Build the perceptual representation: Language comprehension begins with receiving perceptual stimuli. Most linguists believe that sound is the primary perceptual form of language, so the model processes the phonetic features of words and treats the result as a perceptual representation.
• Construct the syntactic information: Janda [27] has experimentally demonstrated that syntactic information plays an irreplaceable role in language comprehension. In MSP, for each word, we construct the probability distribution of the part of speech as the syntactic information.
• Modality-specific gates and language-specific gates are used to explicitly embed syntactic information in training and fuse the linguistic representation and perceptual representation. We employ the GloVe and word2vec vectors as our linguistic representations, which are trained using global word cooccurrence statistics.
• We design the objective function and train the MSP model using supervised learning.

A. CONSTRUCT PERCEPTUAL REPRESENTATION
The goal of this phase is to build the perceptual representation P_w. According to linguistics, different kinds of perceptual information about a word convey different information about the underlying concept. For example, an image may include information such as shape and color. By contrast, speech carries less conceptual information, but the phonetic context and the text context are not redundant; they are complementary and enrich each other's semantics. For example, in the case of disambiguation, "minute" has two meanings. When "minute" is pronounced [ˈmɪnɪt], it denotes a unit of time, and when it is pronounced [maɪˈnjuːt], it means tiny. For words with similar sounds and different meanings (such as "ship" and "sheep"), the text can provide richer semantics for the model, and the difference in their writing helps us distinguish the meanings of the two words. Moreover, every word has a corresponding pronunciation, whereas not every word has a corresponding image. In this article, we choose sound, which is the primary perceptual stimulus, as the perceptual information; therefore, the model needs to obtain the phonetic representation of words. Specifically, the automatic segmentation of spoken words has been successfully trained and reported previously [3], [6]. The training audio corpus in the present work has been previously segmented into phonetic words. We use the Mel-scale Frequency Cepstral Coefficient (MFCC) method, a common approach to obtaining the phonetic features of audio, to convert the speech frames of words into vectors. Those vectors contain a considerable amount of noise, such as background noise and speaker characteristics; however, what we want to obtain is the phonetic structure [61], which is not changed by the environment or the speaker. To disentangle the phonetic structure from the noise, we use an end-to-end approach to process the phonetic vectors and take the results as perceptual representations [58].

B. CONSTRUCT SYNTACTIC REPRESENTATION
MSP uses part of speech (POS) information to construct syntactic representations. Part of speech is the most common syntactic structure. It is the result of the classification of words based on grammatical features (including syntactic functions and morphological changes) and helps people to collocate and understand the meanings of words. Modern English words can be divided into fourteen parts of speech, but only five are used most often: nouns, verbs, prepositions, adverbs and adjectives. In this model, GCNW uses WordNet to structure syntactic information. WordNet is an English dictionary based on cognitive linguistics in which the relationship between words is human annotated [14]. It can label the POS tag of a word in each specific context. Handling polysemy is the key to constructing POS features. The problem of obtaining the POS tag can be formulated as p = F(w, c), where F is the mapping function that obtains the corresponding POS tag p based on the target word w and specific context c. First, we use WordNet to label the POS tag of each word in the corpus. Note that the same word may be labeled differently in different contexts. Next, for the target word w in the corpus, we count the occurrences Occ_w^p of each POS p:

$$\mathrm{Occ}_w^p = \sum_{k=1}^{m} \mathbb{1}\left[F(w, c_k) = p\right] \tag{1}$$

In equation (1), m is the total number of times that word w has occurred in the corpus, and c_k is the context of its k-th occurrence. Then, Occ_w^p is normalized; thus, the probability distribution of the part of speech of the word w is obtained:

$$S_w(p) = \frac{\mathrm{Occ}_w^p}{\sum_{p'} \mathrm{Occ}_w^{p'}} \tag{2}$$

Finally, we treat the probability distribution of the POS as the syntactic information of word w and construct it into a feature vector that is used in the next phase.

FIGURE 2. Overview of the modality-specific gate, where L_w, P_w and S_w represent the linguistic representation, perceptual representation and syntactic feature, respectively, of the target word w.
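The counting and normalization steps of this subsection can be illustrated with a minimal sketch. It assumes the corpus has already been tagged in context, so the tagging function F is not implemented here; the tag inventory `POS_TAGS`, the function name `pos_distribution`, and the toy corpus are hypothetical and are not taken from the released code.

```python
from collections import Counter

# Hypothetical tag inventory: the five most frequent parts of speech named in the text.
POS_TAGS = ["noun", "verb", "adjective", "adverb", "preposition"]

def pos_distribution(tagged_corpus, target):
    """Return the POS probability vector of `target`, ordered as POS_TAGS."""
    # Count how often each POS tag was assigned to the target word.
    counts = Counter(tag for word, tag in tagged_corpus if word == target)
    total = sum(counts[tag] for tag in POS_TAGS)
    if total == 0:
        # Word never observed with a known tag: return an all-zero vector.
        return [0.0] * len(POS_TAGS)
    # Normalize the counts into a probability distribution.
    return [counts[tag] / total for tag in POS_TAGS]

# Toy (word, tag) pairs standing in for a context-tagged corpus (assumption).
corpus = [("break", "verb"), ("break", "verb"), ("break", "noun"), ("bird", "noun")]
break_vector = pos_distribution(corpus, "break")
```

In this toy corpus "break" is tagged twice as a verb and once as a noun, so both senses stay visible in the feature vector even though one of them is rare.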

C. GENERATE REPRESENTATION IN MSP
In this phase, the model explicitly uses two fusion mechanisms, fusing linguistic representation and perceptual representation, to introduce syntactic information in training.

1) MODALITY-SPECIFIC GATE
To simulate the role of syntactic information in language comprehension, namely, the reprocessing of perceptual information, we add a modality-specific gate to the model. The modality-specific gate is essentially a seq2seq model with an attention mechanism [2], [55], which transforms a sequence in one domain into a sequence in another. As shown in Fig. 2, the encoder encodes the input sequence X into an intermediate semantic vector c, and the decoder [39] decodes c to obtain the output sequence y. Given the previous outputs [y_1, y_2, ..., y_{i-1}] and the input X, y_i can be expressed as:

$$p(y_i \mid y_1, \ldots, y_{i-1}, X) = g(y_{i-1}, s_i, c_i) \tag{3}$$

For the M-dimensional phonetic representation used as the input, y_i is determined by three factors: the hidden state s_i at the i-th dimension, the intermediate semantic vector c_i, and the output y_{i-1} at the (i-1)-th dimension, where s_i depends on the hidden state s_{i-1}, and c_i is obtained by equation (4):

$$c_i = \sum_{j=1}^{M} \alpha_{ij} h_j, \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{M} \exp(e_{ik})} \tag{4}$$

In equation (4), h_j is the encoder hidden state at the j-th dimension, and e_ij is the alignment model in the attention mechanism, which measures the influence of the j-th dimension of the input sequence on the i-th dimension of the output sequence.
The encoder needs to initialize its parameters during training, and this is where the effect of the syntactic information is reflected. The model uses the syntactic feature vector of word w to initialize the forward and backward encoder hidden states $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$ in training. The end-to-end network output y_i is a probability distribution. A softmax is computed over each dimension of the sequence [y_1, y_2, ..., y_M], yielding p, whose dimension equals that of the input phonetic representation. Finally, the linguistic representation and p are concatenated to obtain Output_ms.
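To make the role of the alignment scores concrete, the following sketch computes one intermediate semantic vector c_i in the standard attention form of [2]: the scores e_ij are normalized with a softmax and used to average the encoder states. The function names and toy values are hypothetical, and the alignment model that would produce the scores is not implemented; this is an illustration under those assumptions rather than the implementation used in MSP.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of alignment scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(alignment_scores, encoder_states):
    """Weighted average of encoder states, with weights softmax(alignment_scores)."""
    weights = softmax(alignment_scores)
    dim = len(encoder_states[0])
    return [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]

# Toy values (assumption): three encoder states with equal alignment scores.
e_i = [0.0, 0.0, 0.0]
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c_i = context_vector(e_i, h)
```

With equal scores every encoder state receives the same weight, so c_i is the plain average of the states; a larger e_ij shifts c_i toward the j-th state.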

2) LANGUAGE-SPECIFIC GATE
In linguistics, languages can be divided into ideographic languages and phonological languages according to the dependence of text and sound. Ideographic languages (Chinese, etc.) focus more on text than phonological languages (English, etc.). The use of neural networks to dynamically fuse different modalities has been proven to be effective [52]. Based on this observation, in order to improve the applicability of the model, we add the language-specific gate to assign weights for the linguistic representation and the perceptual representation. In the joint training phase, the model uses a neural network to simulate the current language's dependence on different modalities, and the weighted linguistic representation and weighted phonetic representation will be concatenated to obtain Output ls .
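As an illustration of the weighted concatenation, the sketch below scales each modality by a normalized weight and concatenates the results. It is a simplification under stated assumptions: the two weights come from a softmax over two scalar logits, whereas in MSP they are produced by a neural network, and the name `language_specific_gate` is hypothetical.

```python
import math

def language_specific_gate(linguistic, perceptual, logits):
    """Scale each modality by its softmax weight, then concatenate the two vectors."""
    # Assumption: two scalar logits stand in for the output of the gating network.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    w_linguistic, w_perceptual = (e / total for e in exps)
    weighted_l = [w_linguistic * x for x in linguistic]
    weighted_p = [w_perceptual * x for x in perceptual]
    return weighted_l + weighted_p

# Toy vectors (assumption); equal logits give each modality a weight of 0.5.
output_ls = language_specific_gate([2.0, 4.0], [6.0], [0.0, 0.0])
```

Because the weights sum to one, they can be read directly as the relative dependence of a language on text and on sound.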

3) JOINT TRAINING
In this phase, the model integrates the outputs of the two gates. According to the literature [8], [52], dynamic weighted fusion is an effective method. Thus, we add a set of variable weights {w_ms, w_ls} to the network to weight the outputs and superimpose the results to generate Output_MSP, as in equation (5):

$$\mathrm{Output}_{MSP} = w_{ms}\,\mathrm{Output}_{ms} + w_{ls}\,\mathrm{Output}_{ls} \tag{5}$$
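The weighted superposition of the two gate outputs can be sketched as an element-wise weighted sum. The sketch assumes that both outputs have the same dimension and that the weights are plain scalars; the function name and the toy values are hypothetical.

```python
def fuse_gate_outputs(output_ms, output_ls, w_ms, w_ls):
    """Element-wise weighted sum of the two gate outputs."""
    if len(output_ms) != len(output_ls):
        raise ValueError("gate outputs must have the same dimension")
    return [w_ms * a + w_ls * b for a, b in zip(output_ms, output_ls)]

# Toy values (assumption): equal weights average the two gate outputs.
output_msp = fuse_gate_outputs([1.0, 2.0], [3.0, 4.0], 0.5, 0.5)
```

In training, w_ms and w_ls would be learned together with the other network parameters rather than fixed as they are here.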
To train the model, WordNet is introduced as the training dataset. WordNet can search the synonym set corresponding to the target word according to semantic conditions, and the semantic similarity is also human annotated. In the joint training phase, according to equation (6), the model first calculates the mean cosine similarity between MSP representations corresponding to words in the synonym set.
Then, according to the training objective, the model minimizes the loss, namely, the difference between the mean cosine similarity and the human-annotated similarity. The model performs iterative training, during which the MSP representations will be updated with the network.
Suppose the dictionary contains M words, each word w corresponds to N synonyms w̃, and the human-annotated similarity between w and w̃ is Sim(w, w̃). To train the model and learn the network parameters, we minimize an objective function that accumulates, over all M words and their N synonyms, the difference between the cosine similarity of the MSP representations and Sim(w, w̃).
Although WordNet provides a set of annotated synonyms for almost all words, some words have no synonym set. The model deletes such unqualified words before training.
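One possible form of this training objective is sketched below. The squared gap between the cosine similarity and the annotated similarity, averaged over all word and synonym pairs, is an assumption made for illustration; the text only states that the difference between the two similarities is minimized. All names and toy values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def synonym_loss(vectors, synonym_sets, human_sim):
    """Mean squared gap between cosine similarity and annotated similarity."""
    total, pairs = 0.0, 0
    for word, synonyms in synonym_sets.items():
        for synonym in synonyms:
            gap = cosine(vectors[word], vectors[synonym]) - human_sim[(word, synonym)]
            total += gap * gap  # assumption: squared difference as the penalty
            pairs += 1
    return total / pairs if pairs else 0.0

# Toy data (assumption): one target word with two annotated neighbours.
vectors = {"big": [1.0, 0.0], "large": [1.0, 0.0], "tiny": [0.0, 1.0]}
synonym_sets = {"big": ["large", "tiny"]}
human_sim = {("big", "large"): 1.0, ("big", "tiny"): 0.5}
loss = synonym_loss(vectors, synonym_sets, human_sim)
```

In an actual training loop this quantity would be computed on the network outputs and minimized by gradient descent, which updates the MSP representations together with the gate parameters.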

IV. TASK EVALUATION

A. BASELINE ALGORITHMS
Word2vec is the most common word representation model. It includes two training modes, CBOW and skip-gram. In the tasks, we compare MSP with word2vec implemented with the CBOW structure. GloVe [20] is another efficient word representation model that incorporates global word cooccurrence information. DFM [52] is a multimodal model that uses three novel dynamic fusion methods to assign importance weights to each modality, and the weights are learned under the weak supervision of word association pairs. DCWE [58] is an enhanced double-carrier word representation model based on phonetics and writing; it trains the written representation based on the phonetic representation, and the final word representation fuses the text and phonetic embeddings. DPWR [63] is trained in a synchronized way that adopts an attention model to utilize both linguistic and phonetic information in unsupervised learning tasks. SynGCN [57] incorporates syntactic and semantic information in word embeddings by using graph convolutional networks. GloVe-ph is a multiple information connection model that directly concatenates the linguistic representation and the perceptual representation. MSP is the multimodal word representation model generated by the method described in this article, in which the linguistic representation is represented by GloVe. MSP-w2v changes the linguistic representation in MSP from GloVe to word2vec. MSP- removes the modality-specific gate in MSP to verify the effectiveness of the method described in this article. We also compare against pre-trained language models, including ELMo and BERT; however, considering the constraints of pre-trained language models on task types, they are only used for the text classification task. ELMo [36] is a pretrained language model with multiple BiLSTM layers, and the output of the model is a sentence representation. BERT [19] is a pretrained transformer network model. In the comparative experiment, the BERT model has 12 layers, a hidden size of 768, 12 attention heads, and 110 M parameters.

B. EXPERIMENTAL SETUP
For the English linguistic representation, we use the 300-dimensional GloVe and word2vec, which are trained on the Common Crawl corpus consisting of 840 B tokens and a vocabulary of 2.2 M words. For the Chinese linguistic representation, we also use the 300-dimensional GloVe and word2vec; those vectors are trained on the Wikipedia data set and a web news corpus, with Jieba used for word segmentation. The dimension of the perceptual representation in the MSP is set to 100. To control the dimensions, other word representation models used for comparison are also retrained according to the dimensions of the MSP. The MSP model is implemented by using TensorFlow. We set the initial learning rate to 0.02 and the batch size to 100, and we randomly initialize the parameters of the model according to a normal distribution. We set the minimum word frequency to 5 by default. If a word appears in the document fewer than 5 times, it is discarded. The related data and code will be posted on GitHub for replication.
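The minimum-frequency rule can be illustrated with a short sketch. The function name `filter_vocabulary` and the toy token list are hypothetical; only the threshold of 5 comes from the setup described here.

```python
from collections import Counter

def filter_vocabulary(tokens, min_count=5):
    """Keep only the words that occur at least `min_count` times."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}

# Toy token stream (assumption): "model" occurs 5 times, "gate" only 4 times.
tokens = ["model"] * 5 + ["gate"] * 4
vocabulary = filter_vocabulary(tokens)
```

Here "gate" is discarded because it appears fewer than 5 times, while "model" is kept.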
We use four intrinsic and two extrinsic evaluation methods to evaluate MSP. The intrinsic evaluation methods are the concept categorization task, the word similarity task, the word analogy task and the part-of-speech tagging task. Those methods focus on measuring lexical internal pattern information, such as semantic information. However, a language model that performs well in an intrinsic evaluation does not necessarily perform similarly in an extrinsic evaluation. Therefore, we add the text classification task and the text similarity task as extrinsic evaluation methods to verify the applicability of MSP to different types of tasks.

C. CONCEPT CATEGORIZATION TASK
1) DATASET AND EVALUATION CRITERION
Concept categorization involves grouping nominal concepts into natural categories. For instance, computers and phones should belong to the electronic products class. In our experiments, we evaluate the models on the AP (Almuhareb, 2006), Battig (Baroni and Lenci, 2010), BLESS (Baroni and Lenci, 2011), and ESSLLI (Baroni et al., 2008) datasets. We calculate the classification accuracy σ (%) to evaluate the models, and a higher accuracy corresponds to a better model.

2) RESULTS AND DISCUSSION
Table 1 lists the results of the concept categorization task. Overall, we found that MSP is superior to existing word representation methods on all four data sets, and MSP-w2v also performs well. On average, we obtain an approximately 1.4% absolute increase in performance on the concept categorization task compared to the best performing baseline. The concept categorization task needs to calculate the topic similarity (topically related words) between different words rather than the functional similarity (in-place substitutable words). The supervised learning method used by MSP in training captures the topic similarity of words by utilizing the synonymous relationship between words, which benefits the performance of the model on the task.

D. WORD SIMILARITY TASK
1) DATASET AND EVALUATION CRITERION
Table 2 lists the information of the datasets. The task uses the cosine similarity between a pair of word representations as the similarity of semantics and employs the Pearson correlation ρ to evaluate the relation between the human-annotated semantic similarity and the cosine similarity. A larger ρ indicates a higher correlation and a better model.
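The evaluation criterion for the word similarity task can be sketched as follows: compute the cosine similarity for each word pair, then correlate the model scores with the human scores using the Pearson coefficient. The toy vectors and ratings are invented for illustration and do not come from the datasets used in the experiments.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length, non-constant samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

def evaluate_similarity(vectors, scored_pairs):
    """Correlate model cosine similarities with human similarity ratings."""
    model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in scored_pairs]
    human_scores = [score for _, _, score in scored_pairs]
    return pearson(model_scores, human_scores)

# Toy vectors and human ratings (assumption).
vectors = {"cat": [1.0, 0.0], "dog": [1.0, 1.0], "car": [0.0, 1.0]}
scored_pairs = [("cat", "dog", 8.0), ("cat", "car", 1.0), ("dog", "car", 6.0)]
rho = evaluate_similarity(vectors, scored_pairs)
```

A value of ρ close to 1 means that the model ranks and spaces the word pairs much as the annotators do.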

2) RESULTS AND DISCUSSION
The results are listed in Table 3 and Table 4. For English, when the Pearson coefficient ρ is the evaluation criterion, MSP and MSP-w2v perform the best on all four datasets, 1.1% to 5.9% higher than the state-of-the-art baseline models. For Chinese, MSP performs the best on both datasets. These results show that MSP performs better than the existing models. However, because word similarity information is introduced into the objective function, the results of the word similarity task cannot be used alone to prove the good performance of MSP. The word similarity task is included to validate the applicability of the model to different languages.
Further analysis shows that when the linguistic and perceptual representations are directly concatenated, the task performance is much lower than that of the text-based models. This indicates that direct concatenation increases the information in the word representation but is not suitable for the subsequent tasks.

E. WORD ANALOGY TASK
1) DATASET AND EVALUATION CRITERION
This task is to predict word b 2 given three words a 1 , a 2 , and b 1 such that the relation b 1 : b 2 is the same as the relation a 1 : a 2 . We compare models on SemEval-2012 (Jurgens et al., 2012) and MSR (Mikolov et al., 2013c) using the Pearson correlation.
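A common way to answer such analogy queries, shown here only as an illustrative assumption because the exact procedure is not specified in this section, is the vector-offset method: the predicted word is the vocabulary entry closest to the vector of a_2 minus a_1 plus b_1. The function name and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def solve_analogy(vectors, a1, a2, b1):
    """Return the word closest to a2 - a1 + b1, excluding the three query words."""
    target = [y - x + z for x, y, z in zip(vectors[a1], vectors[a2], vectors[b1])]
    candidates = [word for word in vectors if word not in {a1, a2, b1}]
    return max(candidates, key=lambda word: cosine(vectors[word], target))

# Toy two-dimensional vectors (assumption).
vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [2.0, 0.0],
    "queen": [2.0, 1.0],
    "apple": [0.0, -1.0],
}
b2 = solve_analogy(vectors, "man", "woman", "king")
```

In this toy space the offset from "man" to "woman" applied to "king" lands exactly on "queen".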

2) RESULTS AND DISCUSSION
The evaluation results on the word analogy task are summarized in Table 5. Overall, we find that MSP outperforms all the existing word representation models.
Compared to the best performing baseline model, on average, MSP obtains an approximately 3.6% increase in performance. The results demonstrate that the learned representations from MSP more effectively capture the semantic and syntactic properties of words.

F. PART-OF-SPEECH TAGGING TASK
1) DATASET AND EVALUATION CRITERION
Part-of-speech (POS) tagging aims to associate each word with a unique tag describing its syntactic role. To evaluate the word representation models, we use Lee et al.'s LSTM model [64] on the Treebank POS dataset (Marcus et al., 1994) and evaluate performance with tagging accuracy.

2) RESULTS AND DISCUSSION
Table 6 shows the experimental results of the part-of-speech tagging task. Compared with the existing word representation models, MSP performs better. MSP achieves an excellent result, comparable to that of the grammar-enhanced model SynGCN: it is 2.2% more accurate than the text-based word representation models and 1.5% more accurate than the multimodal models. The introduction of syntactic information effectively improves the performance of the multimodal model.
Combining the results of the other intrinsic evaluation tasks, it can be concluded that the word representations generated by the MSP model contain more semantic and syntactic information, and that such information can be used in relevant downstream tasks.

G. TEXT CLASSIFICATION TASK
1) DATASET AND EVALUATION CRITERION
We also perform a text classification task to check our method's applicability. The task is based on several public datasets, including scale, IMDB, and Yelp reviews. The scale v1.0 dataset, which we obtained from (Pang and Lee, 2005), is used as an evaluation dataset; it contains 5004 samples of review texts labeled with 1-4 stars. The IMDB dataset contains 50,000 film reviews, including 25,000 opinion-filled reviews for training and 25,000 reviews for testing, and can be used for classification. We also use the Yelp reviews dataset, which we obtained from (Zhang et al., 2015). This dataset contains 1,569,264 samples of review texts labeled with 1-4 stars. For the text classification task, we use the mean of the word representations to represent a sentence or document. The text classifier was trained with LIBLINEAR [65]. For a corpus that does not distinguish between training and testing sets, 75% of the corpus is selected as the training set, and the remaining 25% is used for testing. We calculate the classification accuracy σ (%) to evaluate the models.

2) RESULTS AND DISCUSSION
Table 7 and Table 8 list the results of the text classification task. Compared to the other baseline word representation models, MSP performs the best on all datasets, which shows that MSP not only significantly improves the model performance but is also applicable to different downstream tasks. Moreover, other models with embedded syntactic information, such as MSP-w2v and SynGCN, also perform well. This shows the effectiveness of the introduced syntactic information for this type of task. Compared with the pre-trained language models, MSP shows only a slight difference in performance on the text classification task. However, BERT and other language models are only applicable to tasks with larger granularity, such as those at the sentence level, and they require extremely large numbers of parameters and high training costs. Therefore, MSP has its own advantages in this application.

H. TEXT SIMILARITY TASK
1) DATASET AND EVALUATION CRITERION
The text similarity task calculates the similarity s_1 of a pair of sentences and then measures the performance of the model by comparing s_1 with the manually annotated similarity s_2.
We sum the word vectors in a sentence, take the average vector as the sentence representation, and use the cosine similarity between the two sentence vectors as the similarity s_1. The Pearson correlation coefficient is used to calculate the correlation between s_1 and s_2. We experimented with the SICK and STS datasets. The SICK data set contains 9,927 sentence pairs (4,500 training pairs / 4,927 test pairs / 500 validation pairs). The STS data set consists of 8,628 sentence pairs (5,749 training pairs / 1,500 test pairs / 1,379 validation pairs).

2) RESULTS AND DISCUSSION
Table 9 lists the results of the text similarity task. According to the results, MSP performs best across all data sets. Compared with the text-based word representations and the multimodal word representations that do not introduce syntactic information, the results obtained by MSP are improved by 0.016 and 0.012, respectively.
Based on the results of the extrinsic evaluation methods, it can be concluded that MSP performs well not only in the intrinsic evaluations but also in the extrinsic evaluations. This indicates that MSP both improves the internal pattern information captured by the word representations and applies well to different types of tasks.
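The sentence similarity s_1 used in this task can be sketched as follows: average the word vectors of each sentence and take the cosine similarity of the two averages. Skipping out-of-vocabulary tokens and returning 0.0 for a sentence with no known words are assumptions of this sketch, and the function names and toy vectors are hypothetical.

```python
import math

def sentence_vector(tokens, vectors):
    """Average the vectors of the known tokens; None if no token is known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]

def sentence_similarity(tokens_a, tokens_b, vectors):
    """Cosine similarity between the averaged sentence vectors."""
    u = sentence_vector(tokens_a, vectors)
    v = sentence_vector(tokens_b, vectors)
    if u is None or v is None:
        return 0.0  # assumption: no known words means no measurable similarity
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy vectors (assumption).
vectors = {"dogs": [1.0, 0.0], "bark": [0.0, 1.0]}
s1_same = sentence_similarity(["dogs", "bark"], ["bark", "dogs"], vectors)
s1_diff = sentence_similarity(["dogs"], ["bark"], vectors)
```

Averaging discards word order, so the two reordered toy sentences receive a similarity of 1; this is a known limitation of mean-vector sentence representations.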

V. MODEL ANALYSIS
Compared with the existing word embedding models, MSP achieves a clear improvement. Its gate mechanisms effectively integrate multimodal information, which is reflected in its performance. MSP consistently performs better than the MSP- model on all tasks. When the modality-specific gate is removed from MSP, performance decreases significantly but remains higher than that of GloVe-Ph. This suggests that, without the modality-specific gate, the model loses the reinforcement effect of syntactic information. However, the language-specific gates still adjust the weights of the modalities; without this mechanism, MSP would degenerate completely to GloVe-Ph.
For the text classification task, MSP outperforms MSP- as well as the other text-based and multimodal models, and achieves the best performance on three datasets. Moreover, the improvement is larger than on the other tasks, indicating that the introduced syntactic information makes MSP more suitable for tasks that utilize syntactic information.
The applicability of MSP to different languages is also quantitatively analyzed. Table 10 presents the combination weights of the linguistic and perceptual representations learned in the language-specific gates for English and Chinese. The ratio between linguistic information and perceptual information is 0.8225:0.1775 for English and 0.8976:0.1024 for Chinese. The linguistic representation has the higher weight for both languages, which indicates that text is more important for carrying information. However, phonetic languages such as English depend more strongly on phonetic information than ideographic languages such as Chinese, which is in line with the linguistic viewpoint. The above results indicate the following:
• MSP is a word embedding model with better overall performance because it includes extra multimodal information and uses effective mechanisms to process that information. This is demonstrated in a series of tasks.
• Adding syntactic information effectively improves the performance of the model on downstream tasks. Like perceptual information, syntactic information is needed for building multimodal representations.
• MSP is applicable to different languages. The learned weights show clear differences between phonetic and ideographic languages.
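To make the combination weights reported in Table 10 concrete, the sketch below applies them as a scalar-weighted sum of a linguistic vector and a perceptual vector. This is an illustrative simplification, not the MSP gate itself: the actual gate is learned and may operate per word or per dimension, and the names `GATE_WEIGHTS` and `fuse`, as well as the assumption that both vectors share the same dimensionality, are hypothetical.

```python
import numpy as np

# Linguistic and perceptual weights as reported in the text for each language.
GATE_WEIGHTS = {
    "english": (0.8225, 0.1775),
    "chinese": (0.8976, 0.1024),
}


def fuse(linguistic, perceptual, language):
    """Scalar-weighted sum of two equal-length vectors (illustrative only)."""
    w_linguistic, w_perceptual = GATE_WEIGHTS[language]
    linguistic = np.asarray(linguistic, dtype=float)
    perceptual = np.asarray(perceptual, dtype=float)
    return w_linguistic * linguistic + w_perceptual * perceptual


if __name__ == "__main__":
    # Toy vectors chosen so that each output coordinate equals one weight.
    for language in GATE_WEIGHTS:
        print(language, fuse([1.0, 0.0], [0.0, 1.0], language))
```

Under this reading, the perceptual component contributes about 17.8% of the fused vector for English and about 10.2% for Chinese, which matches the stronger dependence on phonetic information noted for English.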

VI. CONCLUSION
Based on the observation that almost all previous multimodal models focus only on introducing perceptual information and ignore syntactic information, we propose the new multimodal word representation model MSP. MSP uses two fusion mechanisms to embed explicit syntactic information and phonetic information and uses supervised training to learn performance-enhancing multimodal word representations. Experimental evaluations show that our proposed model achieves substantial gains on all benchmarks. Further analysis supports the validity and applicability of MSP. As one of the main research directions in the development of language representations, multimodal models depend not only on the source of the perceptual information but also on the method used to incorporate that information. Such an incorporation method should not be limited to two kinds of information and should also be capable of incorporating information from more than two modalities. Future work includes exploring better semantic word representations by combining information from other modalities. We believe that multimodal models are of great significance in promoting the development of applications related to natural language processing.
SHUANG LIU is currently pursuing the master's degree with Shanghai University. His main research fields include artificial intelligence, natural language processing, and machine learning.
CHAOMING LIU is currently pursuing the master's degree with Shanghai University. His main research fields include artificial intelligence, natural language processing, and machine learning.
XIAOYA YIN is currently pursuing the master's degree with Shanghai University. Her main research fields include artificial intelligence, natural language processing, and machine learning.
XIAPING XV is currently pursuing the master's degree with Shanghai University. Her main research fields include artificial intelligence, natural language processing, and machine learning. VOLUME 8, 2020