DeepFusion: Fusing User-Generated Content and Item Raw Content towards Personalized Product Recommendation

Personalized recommender systems, as effective approaches for alleviating information overload, have received substantial attention in the last decade. Learning effective latent factors plays the most important role in recommendation methods. Several recent works extracted latent factors from user-generated content such as ratings and reviews and suffered from the sparsity problem and the unbalanced distribution problem. To tackle these problems, we enrich the latent representations by incorporating user-generated content and item raw content. Deep neural networks have emerged as very appealing in learning effective representations in many applications. In this paper, we propose a novel deep neural architecture named DeepFusion to jointly learn user and item representations from numerical ratings, textual reviews, and item metadata. In this framework, we utilize multiple types of deep neural networks that are best suited for each type of heterogeneous inputs and introduce an extra layer to obtain the joint representations for users and items. Experiments conducted on the Amazon product data demonstrate that our approach outperforms multiple state-of-the-art baselines. We provide further insight into the design selections and hyper-parameters of our recommendation method. In addition, we further explore the relative importance of various item metadata information on improving the rating prediction performance towards personalized product recommendation, which is extremely valuable for feature extraction in practice.


Introduction
With the exploding growth of the network scale and the number of products, it is difficult for customers to deal with the large amount of available information. To alleviate information overload [1], personalized product recommendation has been utilized by e-commerce websites to present products that best meet customers' needs and expectations. The success of many e-commerce companies is due to their accurate and personalized product recommender systems, such as Amazon, eBay, Yelp, and Netflix [2].
Collaborative filtering (CF) [3][4][5][6] is one of the most successful recommendation algorithms in both industry and academic communities. The basic idea of the technique is that people who share similar ratings tend to have similar preferences. However, the CF-based method easily suffers from the cold-start problem [7] when there are only a few ratings for items, which severely deteriorates the accuracy of recommendation.
Recently, researchers have found that additional data sources beyond ratings are extremely helpful in personalized recommendation. However, most existing recommender systems take into account only the user-generated content such as textual reviews in user/item profiling, while ignoring the raw content of items. First, the user-generated content is very sparse in many applications. For example, in the Netflix dataset [8], the number of movies that have been rated by users is only approximately 1 percent of the total number of movies. Second, the distribution of user-item interaction data is highly unbalanced. According to Anderson [9], due to the long-tail effect, only a few users interact with a large number of items, while most users rarely, or never, interact with items. Besides, these additional data sources come in very different and heterogeneous forms, which make it difficult to fuse them in a unified way.
Fortunately, deep learning has shown powerful representation learning performance and outstanding scalability, which has shed light on this problem. Several deep learning-based recommender systems have been proposed. Yet, most of them [10][11][12][13][14] were restricted to limited data sources or learned the latent representations of users and items independently. As a result, these approaches cannot achieve fine-grained modeling of user preferences and item features. In addition, the metadata of a product (item), i.e., price, brand, title, and description, on e-commerce websites plays an important role in user buying behaviors. However, to the best of our knowledge, no study has been conducted in which the relative importance of various item metadata information was considered.
In this paper, we propose a novel personalized recommendation method based on deep learning and multiview fusion, called DeepFusion, for the rating prediction task. In this framework, each kind of data source is considered as a view, and different views describe different aspects of user preferences and item features. Because the data sources are heterogeneous, we utilize multiple types of deep neural networks, i.e., multilayer perceptrons (MLP) and convolutional neural networks (CNN), to make the best use of these inputs. Then, representations from each view are further mapped to a shared semantic space with a merged layer to obtain the integrated user/item representations. Finally, a multilayer perceptron layer is introduced to capture the complex relations between users and items. For model learning, the loss function is defined as the error between the predicted rating and the actual rating, and model parameters are adjusted via backpropagation. Our proposed model is evaluated over the Amazon product data and compared with classic and state-of-the-art recommendation approaches. The experimental results demonstrate that DeepFusion significantly outperforms all baselines. The main contributions of our work are summarized as follows:
(i) We propose a novel personalized recommendation method based on deep learning and multiview fusion, called DeepFusion, for the rating prediction task in product recommendation. The method is capable of incorporating user-generated content and item raw content including numerical ratings, textual reviews, and item metadata in a unified space.
(ii) We utilize multiple types of deep neural networks that are best suited for each type of heterogeneous data sources to jointly learn user and item representations, which is beneficial to tackle the sparsity problem and the unbalanced distribution problem.
(iii) We conduct a series of extensive experiments on a real-world dataset. The experimental results demonstrate that our approach outperforms all the baselines. We further study the impact of the design selections and hyperparameters of our recommender system.
(iv) To the best of our knowledge, we are the first to explore the relative importance of product (item) metadata, i.e., price, brand, title, and description, on user buying behaviors on e-commerce websites.
The remainder of the paper is organized as follows. In Section 2, we present an overview of related work. In Section 3, we describe our proposed model in depth. Then, we present the experimental results and analyses in Section 4. Finally, we conclude our work and introduce future work in Section 5.

Related Work
There have been extensive works on recommendation systems with a myriad of publications. In this section, we briefly review a representative set of approaches that are mostly related to our proposed approach.

Additional Data Sources for Recommender Systems.
In recent years, numerous works have been proposed to exploit additional data sources for personalized recommendation. A popular research line is the joint modeling of numerical ratings and textual reviews for recommendation. Textual reviews are able to express user opinions towards various item features. McAuley and Leskovec [15] proposed the hidden factors as topics (HFT) model. This model extracted latent topics from reviews via the latent Dirichlet allocation topic model [16] and associated topics with rating dimensions. Chambua et al. [17] introduced the linguistic similarity between review texts and incorporated it into the probabilistic matrix factorization (PMF) model [18]. Cheng et al. [19] emphasized the importance of considering users' varying attention to different aspects and applied the aspect-aware topic model (ATM) on the review text to estimate the aspect attention weights of a user towards an item. Although the mixture of aspects discovered by topic-based methods may describe a corpus fairly well, aspects often consist of unrelated or loosely related concepts [20]. Therefore, failing to preserve the original order of words and ignoring their semantic meaning, the above methods cannot achieve the successful modeling of a given review. To tackle these limitations, researchers have paid extensive attention to neural network methods. Zheng et al. [11] presented deep cooperative neural networks (DeepCoNN) for learning user behaviors and item properties by using two parallel CNNs. Chen et al. [12] aimed at exploiting the usefulness of reviews and developed a neural attention regression model for predicting ratings and selecting highly useful reviews simultaneously. Cheng et al. [21] developed a novel aspect-aware recommender model named A3NCF, which can capture a user's special attention on each aspect of the targeted item with an attention network.
In addition to textual reviews, additional data sources such as tags, item descriptions, item images, and user social networks have been used as supplemental information for sparse ratings. Ma et al. [22] investigated the combination of tags and genre information for identifying user interests via an augmenting matrix factorization approach. Kim et al. [23] utilized a CNN to capture contextual information of item description documents and integrated it into the PMF method with consideration of the Gaussian noise. Cheng et al. [24] applied a proposed multimodal aspect-aware topic model (MATM) on textual reviews and item images to model user preferences and item features from different aspects. Huang et al. [25] constructed a hybrid multigroup coclustering recommendation framework to cluster users and items into multiple categories simultaneously, which fully utilized various data sources including ratings and user social networks. Qian et al. [26] fused three types of representative heterogeneous information to comprehensively analyze user features, such as ratings, user social networks, and user review sentiments. Zheng et al. [27] considered the evolving nature of user preferences over time and developed a time-sensitive and tag-aware recommendation framework. Bougiatiotis and Giannakopoulos [28] presented a content-based movie recommender system that was based on textual information and audio and visual channels.

Deep Learning for Recommender Systems.
Recently, deep learning has realized tremendous success in recommender systems. In this section, we review several representative related approaches. There are extensive works that combine neural network structures with collaborative filtering. Li et al. [29] learned effective latent representations via a deep architecture for CF, which coupled PMF with marginalized denoising stacked autoencoders (mDAs). Wang and Wang [30] integrated the deep belief network and the probabilistic graphical model and used the hybrid model to learn features from audio content. Wu et al. [31] designed a collaborative denoising autoencoder (CDAE) for top-N recommendation by training on a corrupted version of the known user-item interactions.
Due to the strong performance of deep learning on feature representation and combination, many deep models have been developed for learning the latent representations of users and items. Wang et al. [32] proposed collaborative deep learning (CDL), which learned a deep representation from movie content by using the generalized stacked autoencoder (SAE) model. Yu et al. [33] proposed an interactive attention mechanism to learn the latent representations of users and items and provided interpretable item recommendation. Zhou et al. [34] developed the deep interest network (DIN) for adaptively learning the representation of user interests via a local activation unit.
Several recent studies focused on the interactions between features. He et al. [35] proposed a general neural network-based collaborative filtering (NCF) approach for modeling nonlinear interactions between users and items. After that, they [36] developed the neural factorization machine (NFM), which seamlessly combined the linearity of the factorization machine (FM) [37] and the nonlinearity of the neural network to capture second-order and higher-order feature interactions between users and items. Cheng et al. [38] designed the Wide & Deep learning model for enhancing the memorization and generalization performances of recommender systems. Guo et al. [39] presented DeepFM, which is an end-to-end learning model, for emphasizing both low- and high-order feature interactions. Chambua et al. [40] established a hybrid recurrent neural network-long short-term memory (RNN-LSTM) architecture for learning user preferences with item aspects. Zhou et al. [41] proposed a deep interest evolution network (DIEN) for click-through rate prediction, which can capture temporal interests via an interest extractor layer.
Recently, several deep learning recommendation methods with multiview fusion have been proposed. Chen et al. [42] argued that, in multimedia recommendation, there exists item- and component-level implicitness which blurs the underlying user preferences and proposed an attention mechanism in CF. In their model, the CNN was used to extract image features and video features. Zhang et al. [43] used three heterogeneous data sources including ratings, reviews, and image information to jointly model the user and item representations based on deep representation learning architectures. Gan et al. [44] adopted a convolutional neural network to extract the hidden feature from item description and then fused it with tag information. Elkahky et al. [45] proposed a content-based cross-domain recommender system, which learned rich user and item features according to user web browsing histories and search queries. Tal and Liu [46] presented a textual and contextual embedding-based neural recommender (TCENR) for point-of-interest (POI) recommendation. Multiple types of deep neural networks were utilized to analyze additional data sources including social networks, geospatial locations, and textual reviews. Guo et al. [47] presented a multimodal representation learning method to predict user preferences based on multimodal content, including visual features, text features, audio features, and user interactive history in short video understanding and recommendation. Chen et al. [48] proposed a novel neural architecture for fashion recommendation based on both image region-level features and user review information. Cui et al. [49] presented a visual and textual recurrent neural network (VT-RNN), which simultaneously learned the sequential latent vectors of user's interest and captured the content-based representations.
Though these models achieve better performance than models based on ratings alone, they learn the latent representations of users and items independently. Thus, it is difficult for them to effectively model user preferences and item features.

Overview of DeepFusion.
The architecture of our proposed method DeepFusion is illustrated in Figure 1. [...] factorization (MF) to acquire pair-dependent latent features on the basis of numerical ratings. As shown on the right side of Figure 1, the Item Metadata Modeling component introduces four deep neural networks to learn item complementary features based on item raw content, i.e., price, brand, title, and description. The outputs of the three components are further mapped to a shared semantic space with a merged layer to obtain the integrated user preferences and item features. Then we develop an MLP architecture to capture the complex relations between users and items and predict ratings. Finally, the loss function is defined as the error between the predicted rating and the actual rating, and model parameters are adjusted via a backpropagation process. Key notations adopted in this paper are summarized in Table 1.

Reviews Modeling.
To capture the underlying meanings of textual reviews, we use the Reviews Modeling component to improve the model's coverage. This process is conducted with two similar CNNs [11]: one network for users and one network for items. In the following, we describe the user network in detail.
The first layer is the word embedding layer, which receives user reviews as the input in their original order and outputs a c-dimensional distributed vector for each word. In this paper, the reviews of user u refer to all the reviews that were written by user u. As a widely used word representation in information retrieval, the bag-of-words model is based on the one-hot representation [50] and is usually used to transform a word into a feature vector. However, the one-hot representation of a word ignores semantic and grammatical information and tends to suffer from the curse of dimensionality [51]. To alleviate this problem, we resort to Word2Vec [52], a widely used natural language processing method, to convert the dictionary of words into formal and uniform vectors. The output of the word embedding layer can be represented as

$$V^u_{1:T} = [f_1, f_2, \ldots, f_T], \quad f_t = f(d^u_t), \tag{1}$$

where $d^u_t$ denotes the t-th word of the reviews of user u, T represents the length of the reviews of user u, $f(d^u_t)$ is a lookup function that returns the corresponding word vector, and $[f_1, f_2, \ldots, f_T]$ denotes the concatenation of word vectors.
The next layer is the convolution layer, which consists of m neurons. Each neuron is associated with a convolution kernel $K_n \in \mathbb{R}^{w \times c}$, where w is the window size. The convolution kernel performs multiple convolution operations on the word vectors $V^u_{1:T}$ and adds a bias to obtain the feature map. The features of neuron $K_n$ can be defined as

$$z_n = f(V^u_{1:T} * K_n + b_n), \tag{2}$$

where the symbol * denotes the convolution operation, $b_n$ is the bias, and f is the nonlinear activation function.
According to [33], we employ the rectified linear unit (ReLU), which has been widely used in neural networks, as the activation function:

$$f(x) = \max(0, x). \tag{3}$$

Then, we use the max-pooling operation to select the maximum value as the significant feature for extraction. The pooling layer not only reduces the dimension of the data but also retains the representative features. Suppose that the feature map of neuron $K_n$ is $z_n = [z_{n,1}, z_{n,2}, \ldots, z_{n,T-w+1}]$; then

$$o_n = \max\{z_{n,1}, z_{n,2}, \ldots, z_{n,T-w+1}\}. \tag{4}$$

After many applications of similar operations, the final output of the pooling layer is the concatenation of the features of the m neurons, and the output is

$$O = [o_1, o_2, \ldots, o_m]. \tag{5}$$

As expressed in equation (6), the output O is fed to the fully connected layer, which comprises a weight matrix $W \in \mathbb{R}^{m \times k}$ and a bias term $b \in \mathbb{R}^{k}$ and uses ReLU as the activation function. The outputs of the fully connected layer are user preferences or item features based on textual reviews and are denoted as $U_{Reviews}$ and $I_{Reviews}$, respectively:

$$U_{Reviews} = f(W^{\top} O + b). \tag{6}$$
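The layer sequence described above (embedding lookup, convolution with ReLU, max pooling, fully connected layer) can be traced on a toy input. The following is a minimal pure-Python sketch, not the authors' implementation: the function names, the toy vocabulary, and all weight values are illustrative assumptions, and a real model would use trained Word2Vec vectors and a tensor library.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def text_cnn_forward(word_ids, embeddings, kernels, kernel_biases, fc_weights, fc_bias):
    """Embedding lookup -> convolution + ReLU -> max pooling -> fully connected + ReLU.

    Assumes the text has at least as many words as the kernel window size.
    """
    # Embedding layer: one c-dimensional vector per word, original word order kept.
    V = [embeddings[w] for w in word_ids]                      # T x c
    T = len(V)

    pooled = []
    for K, b_n in zip(kernels, kernel_biases):                 # each K is w x c
        w = len(K)
        feature_map = []
        for start in range(T - w + 1):                         # slide the window over the text
            s = b_n
            for i in range(w):
                for j in range(len(K[i])):
                    s += V[start + i][j] * K[i][j]
            feature_map.append(relu(s))
        pooled.append(max(feature_map))                        # max pooling: one value per neuron

    # Fully connected layer: fc_weights is m x k, so the output has k entries.
    m, k = len(pooled), len(fc_bias)
    return [relu(sum(pooled[n] * fc_weights[n][d] for n in range(m)) + fc_bias[d])
            for d in range(k)]


# Toy example (all values hypothetical): 3-word vocabulary, c = 2,
# m = 2 kernels with window size w = 2, and k = 2 output factors.
embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
kernels = [[[1.0, 0.0], [0.0, 1.0]], [[-1.0, 0.0], [0.0, -1.0]]]
kernel_biases = [0.0, 0.5]
fc_weights = [[1.0, 0.5], [1.0, 1.0]]
fc_bias = [0.0, -2.0]
review_features = text_cnn_forward([0, 1, 2], embeddings, kernels,
                                   kernel_biases, fc_weights, fc_bias)
print(review_features)  # [2.0, 0.0]
```

In this toy run the first kernel produces the feature map [2.0, 1.0] and the second is zeroed by ReLU, so the pooled vector is [2.0, 0.0] before the fully connected layer.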

Ratings Modeling.
Matrix factorization is one of the most popular model-based CF methods and can learn linear interactions between users and items. Inspired by [35], we develop a full neural treatment of MF for deriving pair-dependent latent representations on the basis of user-item rating pairs, which is named the Ratings Modeling component.
As illustrated in Figure 1, the inputs of the neural network are the identities of users. The embedding layer projects the sparse representation to a dense vector via a lookup function ∅. Let $ID_1, ID_2, \ldots, ID_u$ be the ID embeddings of users. Then, the corresponding dense vector P is defined as

$$P = \varnothing(ID_1, ID_2, \ldots, ID_u).$$

The next layer is a fully connected layer whose outputs can be regarded as the latent vector $U_{Ratings}$ for the user in the context of numerical ratings. Here, we still use ReLU as the activation function:

$$U_{Ratings} = f(W P + b),$$

where W and b denote the weight matrix and the bias of this layer. The same process is adopted by the item network with corresponding layers, and we can acquire the latent vector $I_{Ratings}$ for the item in the context of numerical ratings.
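As a small illustration of the embedding step, the sketch below shows that looking up a row of an embedding table gives the same dense vector as multiplying a one-hot ID vector by that table, and then applies a ReLU layer to it. The table, weights, and function names are hypothetical values chosen for readability; they are not taken from the paper.

```python
def dense_relu(x, W, b):
    """Fully connected layer with ReLU; W has shape len(x) x len(b)."""
    return [max(0.0, sum(x[i] * W[i][j] for i in range(len(x))) + b[j])
            for j in range(len(b))]


def one_hot(index, size):
    """Sparse identity representation of a user or item ID."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed_by_matmul(index, table):
    """Multiply the one-hot ID vector by the embedding table (the slow way)."""
    x = one_hot(index, len(table))
    return [sum(x[i] * table[i][j] for i in range(len(table)))
            for j in range(len(table[0]))]


def embed_by_lookup(index, table):
    """Row lookup; gives the same vector without building the one-hot input."""
    return list(table[index])


# Hypothetical embedding table for 3 users with 2 latent dimensions.
user_table = [[0.2, -0.4], [1.0, -2.0], [0.0, 0.3]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.5, 0.5]

P = embed_by_lookup(1, user_table)
U_ratings = dense_relu(P, W, b)
print(U_ratings)  # [1.5, 0.0]
```

The item network would follow the same pattern with its own table and layer weights.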

Item Metadata Modeling.
The above two components are built based on user-generated content and easily suffer from the sparsity problem and the unbalanced distribution problem. Therefore, we develop the Item Metadata Modeling component to model item raw content. Different types of item metadata describe different aspects of item features. Considering feasibility and availability, we use four types of product (item) metadata including price, brand, title, and description as supplemental information for item features.
We regard "price" and "brand" as structured indicators, which are typically easy to understand. Thus, the network of price is analogous to that of the brand, and they differ only in terms of their inputs. In the first layer of the price network, an embedding function $\varphi: M \to \mathbb{R}^{k}$ maps the price range into a corresponding k-dimensional vector $V_i$. Then, we use a fully connected layer to learn the representation of the price feature $E_i$:

$$E_i = f(W V_i + b).$$

Similarly, we use the brand category as the input of the brand network and acquire the brand feature representation $F_i$.
In addition, title and description texts typically reflect an overall profile of an item and are written in natural language. Therefore, we use two parallel convolutional neural networks to learn the title and description feature representations $H_i$ and $D_i$, respectively. The CNNs are similar to the user network that is described in Section 3.2.
After that, to map the above four aspect features into a unified feature space, we concatenate them directly:

$$G_i = [E_i, F_i, H_i, D_i].$$

To prevent high dimensionality of feature vectors and emphasize the most important features of the item, we feed $G_i$ into a fully connected layer to obtain the item features in the context of the item metadata:

$$I_{metadata} = f(W G_i + b).$$

Multiview Fusion and Prediction.
To effectively use multiview information in recommender systems, we fuse the above three components by merging the outputs of their last layers. The final representations for users consist of two data sources, user-item numerical ratings and user-item textual reviews: $U = U_{Reviews} + U_{Ratings}$. In addition to ratings and reviews, the final representations for items comprise the item metadata: $I = I_{Reviews} + I_{Ratings} + I_{metadata}$.
Then, to better capture the complex relations between users and items, we utilize an MLP architecture [34], which is a widely used technique with excellent scalability. We concatenate U and I into a single vector $Z = [U, I]$ and feed Z into new hidden layers to predict ratings.
The number of hidden layers can be customized to better model the latent structure of user-item interactions. In our model, we use two hidden layers with 32 and 16 hidden units.
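The fusion and prediction steps can be sketched as follows: per-view vectors are summed into U and I, concatenated into Z, and passed through hidden layers of 32 and 16 units. This is a rough pure-Python sketch under stated assumptions: the final single-unit linear output layer, the random initialization, and every function name here are illustrative choices that the text does not specify.

```python
import random


def add_vectors(*vectors):
    """Element-wise sum of equally sized vectors (the '+' in the fusion step)."""
    return [sum(values) for values in zip(*vectors)]


def dense(x, W, b, activation=None):
    """Fully connected layer; W has shape len(x) x len(b)."""
    out = [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]
    return [max(0.0, v) for v in out] if activation == "relu" else out


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an arbitrary initialization)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return W, [0.0] * n_out


def predict_rating(user_views, item_views, layers):
    U = add_vectors(*user_views)        # U = U_Reviews + U_Ratings
    I = add_vectors(*item_views)        # I = I_Reviews + I_Ratings + I_metadata
    h = U + I                           # Z = [U, I] (list concatenation)
    for W, b in layers[:-1]:            # hidden layers use ReLU
        h = dense(h, W, b, "relu")
    W, b = layers[-1]                   # assumed linear output unit
    return dense(h, W, b)[0]


k = 16
rng = random.Random(0)
layers = [init_layer(2 * k, 32, rng), init_layer(32, 16, rng), init_layer(16, 1, rng)]
user_views = [[rng.random() for _ in range(k)] for _ in range(2)]
item_views = [[rng.random() for _ in range(k)] for _ in range(3)]
rating = predict_rating(user_views, item_views, layers)
```

With untrained weights the output is meaningless as a rating; the point is only the data flow from the view vectors to a single scalar.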

Learning.
According to [12,53], we adopt the squared loss as the objective function, which is commonly used in rating prediction problems. Suppose that $\hat{R}_{u,i}$ and $R_{u,i}$ represent the predicted value and the ground-truth value of the rating of user u on item i; the objective function can be defined as

$$L = \sum_{(u,i) \in \Gamma} (\hat{R}_{u,i} - R_{u,i})^2,$$

where Γ denotes the set of instances for training. Then, we optimize the model via adaptive moment estimation (Adam) [54] and adjust the model parameters via a backpropagation process. Because Adam adapts the learning rate automatically, the deep learning model converges quickly. In addition, to avoid overfitting, we adopt the dropout [55] strategy. After obtaining the merging vector Z, we randomly drop a fraction ρ of the neurons and their connections, where ρ is the dropout ratio.
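The three ingredients of this training setup can be written down compactly. The sketch below is illustrative only: `adam_step` follows the standard Adam update with its commonly used default constants, and `dropout` uses the "inverted" variant that rescales surviving units, which is an assumption since the text does not say which dropout variant was used.

```python
import math
import random


def squared_loss(predicted, actual):
    """Sum of squared errors over the training instances."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))


def dropout(vector, rho, rng, training=True):
    """Inverted dropout (assumed variant): zero each unit with probability rho
    and rescale the survivors so the expected activation is unchanged."""
    if not training or rho == 0.0:
        return list(vector)
    keep = 1.0 - rho
    return [v / keep if rng.random() < keep else 0.0 for v in vector]


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


loss = squared_loss([4.0, 3.0], [5.0, 3.0])
dropped = dropout([1.0] * 1000, 0.5, random.Random(0))
theta, m, v = adam_step(1.0, grad=1.0, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected moments both equal the gradient statistics exactly, so the parameter moves by approximately the learning rate regardless of the gradient's scale.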

Dataset.
To evaluate the performance of our model in terms of rating prediction, we conduct extensive experiments on three real-world datasets that were collected from Amazon product data (http://jmcauley.ucsd.edu/data/amazon): Musical_Instruments, Automotive, and Sports_and_Outdoors. These datasets consist of users' explicit ratings on items on a scale of 1 to 5 and textual reviews for various products. In summary, each rating record is a four-tuple (userID, itemID, rating, reviews). Each item contains itemID, price, brand, title, and description. Items that do not have metadata information such as the price, title, or description documents are removed from the dataset. The users and the items that have fewer than 5 ratings are also removed. The detailed statistics are presented in Table 2.
We preprocess textual documents including reviews, titles, and descriptions according to [12]. Since the length of the text and the size of the vocabulary exhibit a long-tail effect [9], we keep only a proportion p of the length of reviews, where p is set to 0.85, and a proportion r of the size of the vocabulary, where r is set to 0.7. We project the prices into specified price ranges with intervals of 10 and classify items for which the price exceeds 260 into the same class. Items that lack the brand attribute are classified as "other." We argue that users who purchase those items do not pay much attention to the brand attribute.
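The structured preprocessing steps (rating-count filtering, price bucketing, and the "other" brand class) could look roughly like the sketch below. Several details are assumptions that the text leaves open: the filter is applied in a single pass rather than repeated until stable, a price of exactly 260 falls in the last regular range, and the record field names mirror the tuple described above.

```python
from collections import Counter


def price_bucket(price, interval=10, cap=260):
    """Map a price to a range index; all prices above `cap` share one class.

    Boundary handling is an assumption: [0, 10) -> 0, ..., [250, 260] -> 25, > 260 -> 26.
    """
    if price > cap:
        return cap // interval
    return min(int(price // interval), cap // interval - 1)


def normalize_brand(brand):
    """Items without a brand attribute are assigned to the class 'other'."""
    return brand.strip() if brand and brand.strip() else "other"


def filter_min_ratings(records, min_count=5):
    """Drop users and items with fewer than `min_count` ratings (single pass)."""
    user_counts = Counter(r["userID"] for r in records)
    item_counts = Counter(r["itemID"] for r in records)
    return [r for r in records
            if user_counts[r["userID"]] >= min_count
            and item_counts[r["itemID"]] >= min_count]


# Hypothetical records: 5 users x 5 items, plus one user with a single rating.
records = [{"userID": f"u{u}", "itemID": f"i{i}", "rating": 4}
           for u in range(5) for i in range(5)]
records.append({"userID": "lonely", "itemID": "i0", "rating": 1})
kept = filter_min_ratings(records)
```

Under these assumptions the price attribute takes 27 distinct classes, which is the vocabulary size a price embedding layer would need.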

Evaluation Criteria.
We adopt the root mean square error (RMSE) as the evaluation criterion, which is a standard evaluation metric for rating prediction in recommender systems [12]. Given a predicted rating $\hat{R}_{u,i}$ and a ground-truth rating $R_{u,i}$ from user u to item i, the RMSE score is computed by

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{u,i} (\hat{R}_{u,i} - R_{u,i})^2}. \tag{13}$$

We also use the mean absolute error (MAE) to evaluate our model, which has been widely used in previous studies [18]. The MAE score is calculated as

$$\mathrm{MAE} = \frac{1}{N} \sum_{u,i} \left| \hat{R}_{u,i} - R_{u,i} \right|, \tag{14}$$

where N denotes the number of ratings between users and items. A lower RMSE (equation (13)) and a lower MAE (equation (14)) correspond to a better recommendation performance.
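Both metrics are short enough to compute by hand on a few ratings. The following sketch uses made-up predictions and ground-truth values purely to show the arithmetic; the function names are illustrative.

```python
import math


def rmse(predicted, actual):
    """Root mean square error over N rating pairs."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


def mae(predicted, actual):
    """Mean absolute error over N rating pairs."""
    n = len(actual)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / n


predicted = [4.0, 2.0, 5.0]   # hypothetical model outputs
actual = [5.0, 2.0, 3.0]      # hypothetical ground-truth ratings
rmse_value = rmse(predicted, actual)   # sqrt((1 + 0 + 4) / 3)
mae_value = mae(predicted, actual)     # (1 + 0 + 2) / 3
```

Here the errors are -1, 0, and 2, so the MAE is 1.0 while the RMSE is about 1.29: squaring gives the single large error more weight.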

Baseline Methods.
To evaluate the performance of our proposed DeepFusion method, we select four comparison algorithms and describe them briefly as follows: (i) PMF [18]: probabilistic matrix factorization based on ratings is a standard rating prediction method that initializes the latent factors for users and items from a Gaussian distribution. (ii) NeuMF [35]: neural matrix factorization based on ratings is a state-of-the-art model that fuses generalized matrix factorization (GMF) and MLP to jointly model the linear and nonlinear interactions between user preferences and item features. (iii) ConvMF [56]: convolutional matrix factorization is based on ratings and textual description documents. ConvMF is a strong baseline that integrates CNN into PMF to improve the rating prediction accuracy.
6 Complexity (iv) DeepCoNN [11]: the deep cooperative neural network method based on ratings and textual reviews utilizes two parallel neural networks to jointly learn user preferences and item features and enables these two latent factors to interact with each other in a manner similar to factorization machine. Since DeepCoNN was evaluated against strong topic models, such as collaborative topic regression (CTR) [57], hidden factors as topics (HFT) [15], and collaborative deep learning (CDL) [32], and demonstrated superior performance, we do not repeat those same comparisons in this paper.

Experiment Details.
Each dataset is split randomly into a training set (80%), a validation set (10%), and a test set (10%). The training set contains at least one rating on every user and item and is used to train our model. The validation set is used to tune hyperparameters and early stop the training phase. The test set is used to conduct the final performance comparison.
According to the results of parameter tuning, we set the number of latent factors to k = 50 and regularization parameters to λ_u = 0.1 and λ_v = 0.1 for PMF. For ConvMF, we leverage reviews on the items as item description documents and set the latent dimension of U and V to 50 and the regularization parameters to λ_u = 100 and λ_v = 10. For deep learning-based methods, NeuMF, DeepCoNN, and DeepFusion, we set the learning rate to lr = 0.001, the batch size to a = 128, the dropout ratio to ρ = 0.5, and the number of latent factors to k = 16. For the CNNs in DeepFusion, we reuse most of the hyperparameter settings that were presented by the authors of DeepCoNN, and we set the number of neurons to m = 100 and the window size to w = 3. We use Word2Vec as the pretrained word embedding model. We initialize word latent vectors via pretrained 300-dimensional word embeddings, which were trained on more than 100 billion words from Google News [41].
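One way to obtain an 80/10/10 split in which every user and item appears in the training set is sketched below. The text does not describe how the authors enforced this constraint, so the strategy here (first reserve a covering record for each unseen user or item, then fill the rest at random) is an assumption, as are the function name and record fields.

```python
import random


def split_with_coverage(records, train_frac=0.8, valid_frac=0.1, seed=0):
    """Random train/validation/test split where training covers all users and items."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)

    train, rest = [], []
    seen_users, seen_items = set(), set()
    for r in shuffled:
        # Reserve any record that introduces a user or item not yet in training.
        if r["userID"] not in seen_users or r["itemID"] not in seen_items:
            train.append(r)
            seen_users.add(r["userID"])
            seen_items.add(r["itemID"])
        else:
            rest.append(r)

    # Top up the training set to the target size, then split the remainder.
    n_total = len(shuffled)
    need = max(0, round(train_frac * n_total) - len(train))
    train.extend(rest[:need])
    rest = rest[need:]
    n_valid = min(len(rest), round(valid_frac * n_total))
    return train, rest[:n_valid], rest[n_valid:]


# Hypothetical dense toy data: 10 users x 10 items.
data = [{"userID": u, "itemID": i, "rating": 5} for u in range(10) for i in range(10)]
train, valid, test = split_with_coverage(data)
```

If the covering records alone exceed the training fraction (possible on very sparse data), the training set simply ends up larger than 80% and the other two sets shrink accordingly.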

Performance Evaluation.
The performances of our proposed algorithm and all baselines on the three datasets are reported in Tables 3 and 4. The experiments are repeated 3 times, and the averages are reported with the best results indicated in bold.
Among the methods that consider only ratings, NeuMF outperforms PMF on both evaluation criteria. The main limitation of PMF is that it learns the latent factors via a global optimization strategy and predicts an unknown rating by the dot product of the targeted user and item latent factors. As a result, the performance could be severely compromised locally for individual users or items. In contrast, NeuMF combines the linearity of MF and the nonlinearity of deep neural networks to learn user and item latent features. The hybrid architecture accurately models the interactions between them and thus offers better performance.
In addition, the experimental results demonstrate that all review-based rating prediction methods (ConvMF, DeepCoNN, and DeepFusion) outperform PMF. This is because a rating only reflects the overall satisfaction or judgment of a user towards an item. Relying solely on ratings makes it hard for PMF to explicitly and accurately model user and item features. Textual reviews indicate user opinion and emotion towards items' various features. Therefore, these methods can provide such fine-grained analysis. Moreover, DeepCoNN and DeepFusion outperform ConvMF. ConvMF only utilizes a CNN to acquire item features from reviews, while user preferences are just learned from numerical ratings. As a result, ConvMF suffers from poor rating prediction.
Overall, our method DeepFusion outperforms all the baseline methods. The reasons are as follows. First, our approach jointly learns user preferences and item features from multiview data: numerical ratings, textual reviews, and item metadata, which supports our hypothesis that incorporating user-generated content and item raw content will improve recommendation performance. Second, our model utilizes multiple types of deep neural networks that are best suited for each type of heterogeneous data sources, which is beneficial to tackle the sparsity problem and the unbalanced distribution problem. Third, our method adopts a multilayer perceptron architecture to learn the complex relationships between users and items and thus obtains a better rating prediction.

Model Analysis.
In this section, we discuss the effects of hyperparameters and several design selections on our model's performance.

Impact of k.
Our method encodes feature representations of users and items into two k-dimensional latent vectors, where k denotes the number of latent factors. To empirically study the effect of k, we compare the performances of DeepFusion on the validation set among the values of k in {4, 8, 12, 16, 32, 50, 64}. Figure 2 plots the RMSE and MAE changes in our model as functions of k on three datasets. Increasing k from 4 to 16 improves the performance, while increasing k from 16 to 64 does not boost the performance but rather causes degradation. Hence, a relatively low or high value of k may cause underfitting or overfitting, respectively. Therefore, a proper value of k can enhance the performance of the recommender system, and k = 16 is the optimal setting for our experiment.
(i) DeepFusion-dp: we use a simple dot product of the latent features of users and items as the rating predictor.
(ii) DeepFusion-lmf: the merged layer enables the latent features of users and items to interact with each other, similarly to latent matrix factorization (LMF) [6].
(iii) DeepFusion-fm: we utilize the factorization machine [37] in place of the original neural prediction layer.
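To make the difference between two of these prediction layers concrete, the sketch below contrasts a plain dot product with a second-order factorization machine score. It uses the standard FM formulation and its well-known linear-time rewriting of the pairwise term; the function names and parameter values are illustrative and do not come from the paper's experiments.

```python
def dot_product_score(u, v):
    """Rating predictor of the dot-product variant: inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def fm_score_naive(x, w0, w, V):
    """Second-order FM: w0 + sum_i w_i x_i + sum_{i<j} <V_i, V_j> x_i x_j."""
    score = w0 + sum(wi * xi for wi, xi in zip(w, x))
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            score += dot_product_score(V[i], V[j]) * x[i] * x[j]
    return score


def fm_score(x, w0, w, V):
    """Same score with the pairwise term computed in O(n * f) time."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for f in range(len(V[0])):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# Hypothetical input and parameters: 3 features, 2 latent factors per feature.
x = [1.0, 2.0, 3.0]
w0, w = 0.5, [0.1, 0.2, 0.3]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score = fm_score(x, w0, w, V)   # 0.5 + 1.4 + 9.0
```

The dot product only multiplies matching dimensions of the user and item vectors, whereas the FM adds a learned linear term and a weighted interaction for every pair of input features.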
As shown in Figure 3, DeepFusion-fm outperforms DeepFusion-dp and DeepFusion-lmf on three datasets. This is because the factorization machine models not only the first-order interactions but also the second-order interactions between the representations of users and items. In addition, the factorization machine automatically selects helpful features and thus offers better performance. More importantly, our original method DeepFusion significantly outperforms all variant methods. The results demonstrate that the multilayer perceptron architecture effectively learns the hidden intricate relationships between users and items and models the nonlinear interactions between them. Therefore, the multilayer perceptron architecture was integrated into our final model.
Figure 4 plots the prediction accuracy of the three variants and DeepFusion in terms of RMSE values. First, DeepFusion outperforms DeepFusion-Reviews on the three datasets. This is because the semantic and syntactic information of textual reviews compensates for a shortage of ratings. Moreover, this result demonstrates that textual reviews cover rich information, which is beneficial to reveal user preferences and item features.
Second, DeepFusion-Ratings performs worse than DeepFusion. Hence, applying the interaction-based learning process to user-item pairs is conducive to rating prediction. The pair-dependent latent representations complement the independent reviews and item metadata learning approaches. Meanwhile, MF easily captures linear interactions between users and items and is thus effective in improving the prediction performance.
Finally, DeepFusion-Metadata performs the worst by far on all three datasets. The results show that the item metadata information effectively reflects the comprehensive characteristics of items. Since user-item interaction data are extremely sparse in e-commerce recommender systems, approaches that use only user-generated content easily suffer from poor performance. By incorporating item metadata, richer features can be acquired for items that have few ratings and reviews, which helps tackle the sparsity problem and the unbalanced distribution problem. Since the three components complement each other, our proposed model DeepFusion can reap the benefits of additional data sources and achieve superior rating prediction accuracy.
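The comparisons above are reported in terms of RMSE, and the latent factor study also reports MAE. For reference, both metrics can be computed as in the following minimal Python sketch. The function names and the toy ratings are assumptions for illustration only and are not taken from the experiments.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ratings."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)


# Toy ratings on a 1-5 scale (assumed values, not experimental data).
true_ratings = [5.0, 3.0, 4.0]
predicted_ratings = [4.0, 3.0, 2.0]

print(rmse(true_ratings, predicted_ratings))
print(mae(true_ratings, predicted_ratings))  # 1.0
```

Lower values are better for both metrics. RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging.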
To further explore the relative importance of the individual product (item) metadata attributes for rating prediction, we conduct a set of experiments using four variants, DeepFusion-Price, DeepFusion-Brand, DeepFusion-Title, and DeepFusion-Des, one for each of the price, brand, title, and description attributes. DeepFusion-Brand performs the worst among these variants. Hence, the item brand attribute plays the most important role in product recommendation. This result suggests that customers pay more attention to the brand attribute of products; therefore, businesses should make efforts to enhance brand images. In addition, although DeepFusion-Des and DeepFusion-Title slightly outperform DeepFusion-Brand, they are both outperformed by DeepFusion. The results demonstrate that incorporating textual titles and textual descriptions into our model improves the rating prediction accuracy. Textual titles and textual descriptions typically reflect the overall profile of an item and help capture the item features. Finally, DeepFusion-Price outperforms the other variants but is outperformed by DeepFusion; hence, the price attribute contributes only slightly to the recommendation accuracy.
In summary, the metadata attributes differ in their importance to user buying behavior in e-commerce. The brand attribute plays the most important role in product recommender systems, followed by textual descriptions and textual titles, and the price attribute has a relatively slight effect on the rating prediction performance.

Conclusions
In this paper, we proposed a novel personalized recommendation method based on deep neural networks and multiview fusion, called DeepFusion, for the rating prediction task in product recommendation. The model is capable of incorporating user-generated content and item raw content including numerical ratings, textual reviews, and item metadata in a unified space. Meanwhile, we designed three neural network components to jointly learn user and item representations. Finally, a multilayer perceptron layer was employed to capture the complex relations between users and items. Evaluated over the Amazon product dataset, our proposed DeepFusion model achieved better performance than all the baselines. In addition, we further explored the relative importance of product (item) metadata on user buying behaviors on e-commerce websites.
In the future, we intend to evaluate our proposed method over additional datasets. In addition, we will consider incorporating more heterogeneous data sources such as item images and user social networks towards a unified, multiview informed practical recommender system.

Data Availability
The Amazon product data supporting the findings of this study are from previously reported studies and datasets, which have been cited. The processed data are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.