Collaborative filtering recommendation algorithm based on variational inference

Purpose – The purpose of this paper is to alleviate the problem of poor robustness and over-fitting caused by large-scale data in collaborative filtering recommendation algorithms.
Design/methodology/approach – Interpreting user behavior from the probabilistic perspective of hidden variables helps to alleviate the robustness and over-fitting problems. Constructing a recommendation network by variational inference can effectively solve the complex distribution calculation in the probabilistic recommendation model. Based on the aforementioned analysis, this paper uses a variational auto-encoder to construct a generating network, which can restore user-rating data to solve the problem of poor robustness and over-fitting caused by large-scale data. Meanwhile, for the existing KL-vanishing problem in the variational inference deep learning model, this paper optimizes the model by the KL annealing and Free Bits methods.
Findings – The effect of the basic model is considerably improved after using the KL annealing or Free Bits method to solve KL vanishing. The proposed models evidently perform worse than competitors on small data sets, such as MovieLens 1M. By contrast, they have better effects on large data sets such as MovieLens 10M and MovieLens 20M.
Originality/value – This paper presents the usage of the variational inference model for collaborative filtering recommendation and introduces the KL annealing and Free Bits methods to improve the performance of the basic model. Because variational inference training learns the probability distribution of the hidden vector, the problem of poor robustness and over-fitting is alleviated. When the amount of data is relatively large in the actual application scenario, the probability distribution of the fitted actual data can better represent the user and the item. Therefore, using variational inference for collaborative filtering recommendation is of practical value.


Introduction
With the continuous development of social networks, the problem of information overload has become increasingly serious. Although information overload can be alleviated by search engine retrieval, users are required to actively summarize the search content. However, meeting the needs of users is often difficult when using search results. Therefore, many applications currently use personalized recommendation combined with information search engines to serve the users. Personalized recommendation is the process of recommending items that may be of interest to users based on the historical behaviors of users or information contained in items (Adomavicius and Tuzhilin, 2005). To date, personalized recommendation is widely used in many fields, such as e-commerce (Yan et al., 2018), movie (Deldjoo et al., 2019), music (Schedl et al., 2018) and article reading (Cao et al., 2017).
In recent years, researchers have proposed a number of personalized recommendation algorithms based on neural networks, which can improve the accuracy of recommendations more effectively compared with traditional recommendation algorithms. However, most of these algorithms lack persuasive theoretical explanations. Several researchers combined the neural network with the Bayesian probability-based recommendation model to obtain a recommendation model with improved performance and strong theoretical support to enhance the interpretability of the recommendation. These models assume that user behavior data obey a specific probability distribution and then perform further mathematical derivation and modeling. Representative examples include collaborative topic regression (Wang and Blei, 2011) and collaborative deep learning.
However, the abovementioned models frequently need to calculate complex probability distributions, which can be computed only after several simplifying assumptions are added to the model. Furthermore, the neural network structure in these models is mostly a simple one, such as an auto-encoder (AE) or multi-layer perceptron, and cannot mine deep user relationships. Variational inference can effectively solve these two problems. The concept of variational inference is to approximate a complex probability distribution that is difficult to solve by transforming known simple probability distributions, such as the normal distribution and the Bernoulli distribution. With the development of deep learning in recent years, the calculation of variational inference no longer requires complex algorithms such as EM. Instead, a neural network can be constructed to train the parameters of the probability distribution. Therefore, variational inference has become a crucial technique in deep learning. Compared with the single-vector training of the traditional deep learning model, training a distribution represents the user and item vectors more comprehensively, thus mining a deep relationship.
Based on the preceding analysis, this paper uses variational inference to solve the problem of traditional collaborative filtering algorithms caused by large-scale data. First, the user score matrix is directly filled using variational inference and then Top-N recommendation is performed. For the existing KL-vanishing problem in the variational inference deep learning algorithm (Bowman et al., 2015), several available solutions are discussed. The KL annealing (Bowman et al., 2015) and Free Bits methods are then selected to construct the model. Finally, two collaborative filtering recommendation algorithms based on variational inference are obtained.
2. Related work
2.1 Recommendation algorithm
2.1.1 Collaborative filtering recommendation algorithm. The core idea of the collaborative filtering recommendation algorithm is to recommend items that the user may like according to the relationship between similar users or items. Sedhain et al. (2015) proposed a collaborative filtering recommendation algorithm based on AE, abbreviated as ACF.

IJCS 4,1
The main idea of ACF is to use AE as a data-filling tool: the scores are taken as input, and the filled score matrix is output after AE training. Then, a score prediction or top-N recommendation is performed. The advantage of this method is that the neural network can be used to mine the non-linear relationship between users. However, the main problem is that the network structure is very simple and noise resistance is weak. Wu et al. (2016) proposed a collaborative denoising auto-encoder (CDAE) model based on Sedhain et al. (2015), which replaces the AE in Sedhain et al. (2015) with a denoising auto-encoder (DAE) model (Vincent et al., 2010).
Three differences are observed between CDAE and ACF, which lead to the better performance of the CDAE model for top-N recommendations: (1) CDAE adds the same offset to each input variable; (2) CDAE uses implicit feedback recommendation; and (3) CDAE adds Gaussian noise to the training to improve the noise immunity of the model.
In the comparison experiments for CDAE, ACF was not compared directly; instead, the AE in ACF was replaced by a DAE for comparison. Experiments in Wu et al. (2016) show that DAE is superior to AE for Top-N recommendation and can enhance the robustness of the model.
2.1.2 Hybrid recommendation algorithm. The hybrid recommendation algorithm mainly considers how to combine the user registration information or the item content information more closely with the collaborative filtering algorithm. This method can alleviate the cold-start problem in the collaborative filtering algorithm and improve the recommendation.
At present, one way to use this user or item auxiliary information is to construct a model directly and train the auxiliary information and the user-scoring matrix together. A representative model is the sparse linear model (SLIM) (Ning and Karypis, 2011). Ning and Karypis (2012) proposed several methods of using auxiliary information in the recommendation model, the best of which is the collective sparse linear model (cSLIM).

Variational inference
Variational inference is a type of technique for approximating complex probability distributions in Bayesian estimation and machine learning. This technique can be applied in numerous fields, such as natural language processing (Duh, 2018; Wang and Blunsom, 2015), computer vision and robotics (Hu and O'Connor, 2018; Krishnan et al., 2018), and computational neurology (Daunizeau et al., 2014; Gershman et al., 2014). The core idea of variational inference comprises two steps: (1) assume a known probability distribution $q(z;\lambda)$; and (2) make $q(z;\lambda)$ approximate $p(z|x)$ by changing the parameter $\lambda$ of the distribution.
In this case, the calculation of $p(z|x)$ is converted into the optimization problem given as Formula (1). After Formula (1) converges, the actual probability distribution $p(z|x)$ can be replaced by $q(z;\lambda^{*})$ as the posterior distribution.
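Under the standard variational-inference setup, an objective consistent with the two steps above is the minimization of a KL divergence over $\lambda$. The form below is a sketch written from the surrounding definitions; the direction of the KL divergence is an assumption.

```latex
\lambda^{*} = \arg\min_{\lambda}\ \mathrm{KL}\big(q(z;\lambda)\,\|\,p(z|x)\big)
```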
Kingma and Welling (2013) proposed the use of deep learning to solve variational inference and introduced the variational auto-encoder (VAE) model.

Combination of variational inference and deep learning
2.3.1 Using deep learning to construct variational inference. For clarity, this paper starts from the KL divergence of the joint probability distributions $p(x,z)$ and $q(x,z)$, given as Formula (2). According to the probability formula, Formula (3) can be obtained. Formula (3) shows that $\ln p(x)$ in the first term is independent of $z$, and $p(x)$ is uniquely determined by the sampled data. Therefore, the first term is a constant. Thus, according to the Bayesian probability formula, the ultimate optimization target becomes the minimum value of Formula (4). From the perspective of probability, an expectation can be approximated by sampling. Therefore, in practical application, the calculation of the first term of Formula (4) is equivalent to sampling $z$ from $p(z|x)$ and then substituting it into $-\ln q(x|z)$. $q(x|z)$ is assumed to obey a specific distribution, whose parameters can be directly trained by constructing a neural network from $z$ to $x$. The same assumption is made for $p(z|x)$. The value of the KL divergence is then computed from the parameters obtained above. Figure 1 depicts the overall model structure. The model optimization target is Formula (4). The solid line in the figure represents the main flow of the model, and the dashed box indicates the optimization target that must be met at a certain step of the process.
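The steps of this derivation can be written out as follows. This is a sketch under the assumption that the joint distributions factor as $p(x,z) = p(x)\,p(z|x)$ and $q(x,z) = q(z)\,q(x|z)$; the labels (2) to (4) are matched to the text by inference.

```latex
\begin{aligned}
\mathrm{KL}\big(p(x,z)\,\|\,q(x,z)\big)
  &= \iint p(x,z)\,\ln\frac{p(x,z)}{q(x,z)}\,dz\,dx && (2)\\
  &= \mathbb{E}_{x\sim p(x)}\big[\ln p(x)\big]
   + \mathbb{E}_{x\sim p(x)}\Big[\int p(z|x)\,\ln\frac{p(z|x)}{q(x,z)}\,dz\Big] && (3)\\
\mathcal{L}
  &= \mathbb{E}_{x\sim p(x)}\Big[\mathbb{E}_{z\sim p(z|x)}\big[-\ln q(x|z)\big]
   + \mathrm{KL}\big(p(z|x)\,\|\,q(z)\big)\Big] && (4)
\end{aligned}
```

Dropping the constant first term of (3) and substituting $q(x,z) = q(z)\,q(x|z)$ into the second term gives (4).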
2.3.2 Variational Auto-Encoder. The VAE is an application of variational inference. In the VAE model of Kingma and Welling (2013), the number of samples drawn from $p(z|x)$ is directly taken as 1, because $z$ is regenerated by random sampling from the normal distribution at every iteration. When the number of iteration steps becomes sufficient, the sampling is also considered sufficient. Therefore, Formula (4) can be converted into Formula (5). If $p(z|x)$ is assumed to be normally distributed and $q(z)$ is taken as the standard normal distribution, then Formula (6) can be obtained. In Formula (6), $\mu(x)$ and $\delta(x)$ can be trained directly from $x$ through a deep learning network.
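With one sample of $z$ per input and a Gaussian $p(z|x)$, forms consistent with the description of Formulas (5) and (6) would be the following. The diagonal covariance and the component index $k$ over the $D$ latent dimensions are assumptions of this sketch.

```latex
\begin{aligned}
\mathcal{L} &= \mathbb{E}_{x\sim p(x)}\Big[-\ln q(x|z) + \mathrm{KL}\big(p(z|x)\,\|\,q(z)\big)\Big],
  \quad z\sim p(z|x) && (5)\\
\mathrm{KL}\big(p(z|x)\,\|\,q(z)\big)
  &= \frac{1}{2}\sum_{k=1}^{D}\Big(\mu_{(k)}^{2}(x) + \delta_{(k)}^{2}(x) - \ln\delta_{(k)}^{2}(x) - 1\Big) && (6)
\end{aligned}
```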
Assuming that $q(x|z)$ obeys the normal distribution with mean $\mu(z)$ and a fixed variance $\delta^{2}$, Formula (7) is derived. In this formula, $\mu(z)$ can be trained directly from $z$ through the deep learning network. Thus, the two parts of Formula (5) can be derived from Formulas (6) and (7). Therefore, a VAE model can be constructed as shown in Figure 2, and the model optimization target is Formula (5).
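The two terms of the VAE objective have closed forms under these Gaussian assumptions, which can be sketched in NumPy as follows. The function names, the log-variance parameterization and the diagonal covariance are assumptions made for illustration; they are not taken from the original implementation.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0, axis=-1)

def sample_z(mu, log_var, rng):
    """Draw one z per row: z = mu + std * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def gaussian_nll(x, mean, variance):
    """-ln q(x|z) for a Gaussian with the given mean and a fixed scalar variance."""
    d = x.shape[-1]
    sq_err = np.sum((x - mean) ** 2, axis=-1)
    return 0.5 * sq_err / variance + 0.5 * d * np.log(2.0 * np.pi * variance)

rng = np.random.default_rng(0)
mu = np.zeros((2, 3))        # toy encoder means for 2 inputs, 3 latent dimensions
log_var = np.zeros((2, 3))   # log-variance 0 means unit variance
z = sample_z(mu, log_var, rng)
kl = gaussian_kl(mu, log_var)  # zero here, since the posterior equals N(0, I)
```

With a fixed variance, the reconstruction term reduces to a scaled squared error plus a constant, which is why only the mean needs to be produced by the decoder network.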

Collaborative filtering recommendation algorithm based on variational inference
Most existing collaborative filtering algorithms have poor robustness and overfitting problems with the expansion of the recommendation data scale. This section mainly discusses the construction of a collaborative filtering recommendation algorithm through variational inference and optimizes the recommendation results to solve these problems.
The VAE in the variational inference model is a generative model. Thus, this paper first attempts to directly use the VAE for score matrix filling and then performs a top-N recommendation. Afterward, the findings indicate that the variational inference deep . ., I} denotes the item, $X \in \mathbb{N}^{U \times I}$ is the score matrix of the associated users and items, and $x_u = [x_{u1}, \ldots, x_{uI}]^{T}$ represents the score of the $u$-th user on all items, abbreviated as $x$. $z = [z_1, \ldots, z_D]^{T}$ denotes the hidden vector obtained from $x$, where $D$ is the dimension of $z$. Following Wu et al. (2016), the following model uses implicit feedback recommendation and processes the score data into only two values, namely, 1 and 0, to ensure the applicability of the model. A value of 1 indicates that the user likes the item, whereas a value of 0 indicates that the user does not like the item or has not rated the item.
The architecture of the entire generation model depends on the VAE model derived in Section 2.3.2. The optimization target of the model is given as Formula (8). In Formula (8), the KL divergence is solved in the same way as in Section 2.3.2. According to the foregoing discussion, $x$ takes only two values, namely, 0 and 1. Therefore, the actual probability distribution of $x|z$ is close to the Bernoulli distribution, as expressed in Formula (9). Thus, Formula (10) can be obtained. In Formula (10), $\rho_{(k)}(z)$ can be directly trained through the neural network constructed from $z$.
Meanwhile, according to the Bernoulli distribution constraint, $\rho_{(k)}(z) \in [0,1]$. Therefore, the final layer of the neural network must be activated using the sigmoid function. The model in Figure 3 can then be constructed.
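Written out under the Bernoulli assumption, forms consistent with the description of Formulas (8) to (10) would be the following. The notation $x_{(k)}$ for the $k$-th component of $x$ is an assumption of this sketch.

```latex
\begin{aligned}
\mathcal{L} &= \mathbb{E}_{x\sim p(x)}\Big[-\ln q(x|z) + \mathrm{KL}\big(p(z|x)\,\|\,q(z)\big)\Big],
  \quad z\sim p(z|x) && (8)\\
q(x|z) &= \prod_{k=1}^{I}\big(\rho_{(k)}(z)\big)^{x_{(k)}}\big(1-\rho_{(k)}(z)\big)^{1-x_{(k)}} && (9)\\
-\ln q(x|z) &= \sum_{k=1}^{I}\Big[-x_{(k)}\ln\rho_{(k)}(z) - \big(1-x_{(k)}\big)\ln\big(1-\rho_{(k)}(z)\big)\Big] && (10)
\end{aligned}
```

Formula (10) in this form is the binary cross-entropy between the input $x$ and the decoder output $\rho(z)$.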
The complete algorithm is shown in Algorithm 1.
Algorithm 1 Collaborative Filtering Recommendation Algorithm Based on Variational Auto-Encoder
Input: User rating information $x$
Output: Filled rating information $x'$
1. Process the score data $x$ and turn it into binary data.
2. Construct a neural network according to Figure 3.
3. Calculate the $-\ln q(x|z)$ term and the KL divergence during each training iteration.
4. Determine the final optimization target of the neural network, Formula (8).
5. Output $x'$ after training and further use $x'$ for Top-N recommendation.
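Steps 3 to 5 of Algorithm 1 can be sketched as a single forward pass in NumPy. This is a minimal illustration, not the original TensorFlow implementation: the single linear encoder and decoder layers, the parameter names and the toy data are all assumptions, and gradient updates are omitted.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bernoulli_nll(x, rho, eps=1e-7):
    """-ln q(x|z) for binary x and Bernoulli parameters rho, summed over items."""
    rho = np.clip(rho, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(x * np.log(rho) + (1.0 - x) * np.log(1.0 - rho), axis=-1)

def gaussian_kl(mu, log_var):
    """KL between the diagonal Gaussian posterior and the standard normal prior."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0, axis=-1)

def vae_forward(x, params, rng):
    """Encode x, sample z, decode to item probabilities, and compute the loss."""
    mu = x @ params["W_mu"] + params["b_mu"]
    log_var = x @ params["W_lv"] + params["b_lv"]
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    rho = sigmoid(z @ params["W_dec"] + params["b_dec"])  # sigmoid keeps rho in [0, 1]
    loss = np.mean(bernoulli_nll(x, rho) + gaussian_kl(mu, log_var))
    return rho, loss

rng = np.random.default_rng(0)
n_users, n_items, latent_dim = 4, 6, 2
x = (rng.random((n_users, n_items)) > 0.5).astype(float)  # toy binary ratings
params = {
    "W_mu": 0.1 * rng.standard_normal((n_items, latent_dim)),
    "b_mu": np.zeros(latent_dim),
    "W_lv": 0.1 * rng.standard_normal((n_items, latent_dim)),
    "b_lv": np.zeros(latent_dim),
    "W_dec": 0.1 * rng.standard_normal((latent_dim, n_items)),
    "b_dec": np.zeros(n_items),
}
x_filled, loss = vae_forward(x, params, rng)  # x_filled plays the role of x'
```

In a full training loop, the loss would be minimized with respect to `params`, and the rows of `x_filled` would be ranked to produce the Top-N list for each user.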
3.1.2 Algorithm analysis. The final optimization target of Algorithm 1 is Formula (8). The first term is the reconstruction error of the entire model, whereas the second term is the KL divergence between the probability distribution of the hidden vector $z$ and the standard normal distribution. Ultimately, reducing the result of Formula (8) is equivalent to minimizing both the reconstruction error $-\ln q(x|z)$ and the KL divergence. For the KL divergence, if the input vector $x$ is simply independent of the hidden vector $z$, then $p(z|x) = q(z)$ is satisfied. At this point, the posterior probability distribution degenerates into the prior probability distribution, and the KL divergence takes its minimum value of 0. For the reconstruction error part, given that the neural network can fit any distribution, relying on $z$ to train the distribution of $q(x)$ is no longer necessary once the decoder is fully trained. These two reasons ultimately cause the KL divergence to be optimized rapidly to 0 during training, after which it no longer plays a meaningful role in the training process. This problem is called the KL-vanishing problem.
3.2 Optimization method of KL vanishing problem
3.2.1 Existing optimization methods. Many researchers have proposed their own solutions to the KL-vanishing problem. Bowman et al. (2015) presented the KL-annealing method, whose optimization target is given as Formula (12). At the beginning of training, the parameter $\beta$ is set to 0 and is then gradually incremented to 1 as the training step increases. The advantage is that $p(z|x)$ obtains additional time to embed the information of $x$ into the hidden vector $z$.
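With $\beta$ acting as a weight on the KL term, an objective consistent with this description would be the following. It is a sketch in the notation of Section 2.3.2, and the label (12) follows the reference in Section 3.2.2.

```latex
\mathcal{L} = \mathbb{E}_{x\sim p(x)}\Big[-\ln q(x|z) + \beta\,\mathrm{KL}\big(p(z|x)\,\|\,q(z)\big)\Big],
  \quad z\sim p(z|x) \qquad (12)
```

At $\beta = 0$ the model trains as a plain auto-encoder, and at $\beta = 1$ the original objective is recovered.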
Unlike the KL-annealing method, the Free Bits method uses a technique of "reserving a little space" for each dimension of the KL divergence. Specifically, a threshold $\varepsilon$ is added to each dimension of the KL divergence, and the model only optimizes a dimension whose KL value is larger than $\varepsilon$. The resulting loss function is given as Formula (13). Alternatively, the threshold can be applied to the entire KL divergence without subdividing it into each dimension. However, this approach may result in only a few working dimensions, and most of the dimensions of the final $z$ will not contain the information of $x$.
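A loss consistent with this description would clamp each dimension's KL contribution from below by $\varepsilon$. The form below is a sketch; writing the per-dimension term as $\mathrm{KL}(p(z_k|x)\,\|\,q(z_k))$ assumes a factorized posterior and prior. The second line gives the variant that thresholds the entire KL divergence.

```latex
\begin{aligned}
\mathcal{L} &= \mathbb{E}_{x\sim p(x)}\Big[-\ln q(x|z)
  + \sum_{k=1}^{D}\max\big(\varepsilon,\ \mathrm{KL}\big(p(z_{k}|x)\,\|\,q(z_{k})\big)\big)\Big] && (13)\\
\mathcal{L}_{\text{whole}} &= \mathbb{E}_{x\sim p(x)}\Big[-\ln q(x|z)
  + \max\big(\varepsilon,\ \mathrm{KL}\big(p(z|x)\,\|\,q(z)\big)\big)\Big]
\end{aligned}
```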
In addition to the two aforementioned methods, the normalizing flow method and the auto-encoding method (Shen et al., 2018) have been proposed. Normalizing flow aims to obtain an improved prior probability distribution. The core of the normalizing flow method is to first sample the hidden vector from a simple distribution and then refine the hidden vector through successive invertible transformations. By contrast, the auto-encoding method is mainly used for dialogue generation. When a VAE and a recurrent neural network (RNN) are combined, their loss functions interfere with each other at the beginning of training. Moreover, when the VAE is used to model the dialogue, there is no guarantee that the hidden vector $z$ obtained from the prior probability contains the information of the input variable. Therefore, an AE can be explicitly constructed for $z$, and the VAE and AE can be trained separately to ensure easy convergence.
3.2.2 Using KL annealing to optimize collaborative filtering recommendations. Section 3.2.1 shows that the final optimization target of the KL-annealing method is Formula (12), which, compared with the optimization target of the original variational inference, introduces the balance parameter $\beta$ into the KL divergence. $\beta$ gradually increases during the training process. Therefore, the optimization step size of the KL divergence is gradually enlarged, such that the information of $x$ can be injected into $z$. Meanwhile, as $\beta$ increases, the relative weight of the generator part $-\ln q(x|z)$ decreases. Therefore, the generator must rely on $z$ to generate results.
Thus, the score matrix-filling algorithm obtained in Section 3.1.1 and the KL-annealing method are combined to obtain a new algorithm, as shown in Algorithm 2.
Algorithm 2 Collaborative Filtering Recommendation Algorithm Based on KL-Annealing Method
Input: User rating information $x$
Output: Filled rating information $x'$
1. Process the score data $x$ and turn it into binary data.
2. Construct a neural network according to Figure 3.
3. Calculate the $-\ln q(x|z)$ term and the KL divergence during each training iteration.
4. Determine the final optimization target of the neural network, Formula (12), where $\beta$ is defined as a variable, which is an input from outside the model.
5. Select the total annealing step total_anneal_steps and, for the $i$-th iteration, calculate $\beta$ as $\beta = i / \text{total\_anneal\_steps}$.
6. Take $x$ and $\beta$ as inputs, output $x'$, and further use $x'$ for Top-N recommendations.
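The schedule in step 5 can be sketched as a small helper. Capping the weight at 1 once the iteration count exceeds total_anneal_steps is an assumption, based on the statement in Section 3.2.1 that the parameter is incremented to 1; the function names are hypothetical.

```python
def kl_anneal_weight(step, total_anneal_steps):
    """Linear KL-annealing weight: step / total_anneal_steps, capped at 1."""
    if total_anneal_steps <= 0:
        return 1.0  # no annealing requested: use the full KL term
    return min(1.0, step / total_anneal_steps)

def annealed_loss(recon_nll, kl, step, total_anneal_steps):
    """Reconstruction term plus the annealed KL term for one iteration."""
    beta = kl_anneal_weight(step, total_anneal_steps)
    return recon_nll + beta * kl

# Weights for iterations 0..6 with a 4-step annealing period.
weights = [kl_anneal_weight(i, 4) for i in range(7)]
```

The helper relies on true division, so under Python 2 the operands would need to be converted to floating point first.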
3.2.3 Using free bits to optimize collaborative filtering recommendations. As described in Section 3.2.1, the final optimization target of the Free Bits method is Formula (13), which adds a constraint to each dimension of the KL divergence compared with the optimization target of the original variational inference. The optimization of a dimension is performed only when the value of the KL divergence in that dimension is sufficiently large. Such an approach does not strictly require that the probability distribution of the final hidden vector be completely close to the normal distribution. Instead, the approach allows for a certain artificially defined deviation. In this manner, the KL divergence is protected in the early stage of training, such that it is not quickly driven to 0 at the beginning of training.

Algorithm 3 Collaborative Filtering Recommendation Algorithm Based on Free Bits Method
Input: User rating information $x$
Output: Filled rating information $x'$
1. Process the score data $x$ and turn it into binary data.
2. Construct a neural network according to Figure 3.
3. Calculate the $-\ln q(x|z)$ term and the KL divergence during each training iteration.
4. Determine the final optimization target of the neural network, Formula (13), where $\varepsilon$ is defined as a variable, which is an input from outside the model.
5. Select $\varepsilon$, take $x$ and $\varepsilon$ as inputs, output $x'$, and further use $x'$ for Top-N recommendations.
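The thresholded KL term used in step 4 can be sketched as follows. The sketch assumes a diagonal Gaussian posterior parameterized by a mean and a log-variance, and it applies the threshold per example; the function names are hypothetical.

```python
import numpy as np

def kl_per_dimension(mu, log_var):
    """KL contribution of each latent dimension against a standard normal prior."""
    return 0.5 * (mu ** 2 + np.exp(log_var) - log_var - 1.0)

def free_bits_kl(mu, log_var, epsilon):
    """Sum over dimensions of max(epsilon, KL_k).

    Dimensions whose KL is at or below epsilon contribute the constant epsilon,
    so they receive no gradient from this term.
    """
    return np.sum(np.maximum(epsilon, kl_per_dimension(mu, log_var)), axis=-1)

mu = np.array([[0.0, 2.0]])   # first dimension is uninformative, second is not
log_var = np.zeros((1, 2))
kl_fb = free_bits_kl(mu, log_var, epsilon=0.5)
```

In this toy case the first dimension has a KL of 0 and is clamped to 0.5, while the second dimension has a KL of 2.0 and is left unchanged.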

Experiment results
The experiments in this section mainly verify the effects of the models proposed in Sections 3.1 and 3.2. This paper mainly discusses the Top-N recommendation and chooses the recall rate, which is frequently used in Top-N recommendation evaluation, as the accuracy indicator. In addition, this paper adopts the normalized discounted cumulative gain (NDCG) as the relevance evaluation indicator.
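Both indicators can be sketched for a single user with binary relevance as follows. Definitions vary in the literature: this sketch divides recall by the total number of relevant items and uses a base-2 logarithmic discount for NDCG, both of which are assumptions rather than the exact definitions used in the experiments.

```python
import numpy as np

def recall_at_k(scores, relevant, k):
    """Fraction of the user's relevant items that appear in the top-k by score."""
    top_k = np.argsort(-scores)[:k]
    n_relevant = int(relevant.sum())
    if n_relevant == 0:
        return 0.0
    return float(relevant[top_k].sum()) / n_relevant

def ndcg_at_k(scores, relevant, k):
    """DCG of the top-k ranking divided by the DCG of an ideal ranking."""
    top_k = np.argsort(-scores)[:k]
    discounts = 1.0 / np.log2(np.arange(2, top_k.size + 2))
    dcg = float(np.sum(relevant[top_k] * discounts))
    n_ideal = min(int(relevant.sum()), k)
    if n_ideal == 0:
        return 0.0
    idcg = float(np.sum(1.0 / np.log2(np.arange(2, n_ideal + 2))))
    return dcg / idcg

scores = np.array([0.9, 0.1, 0.8, 0.3])  # predicted scores for 4 items
relevant = np.array([1, 0, 0, 1])        # held-out items the user liked
recall_2 = recall_at_k(scores, relevant, 2)
ndcg_2 = ndcg_at_k(scores, relevant, 2)
```

Here the top-2 items are items 0 and 2, of which only item 0 is relevant, so one of the two relevant items is recovered.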

Context of the experiment
The experiment uses Ubuntu 16.04 64 bit as the operating system, Python 2.7.13 as the programming language, Google TensorFlow 1.1.0 as the deep learning framework, and a single-card NVIDIA GTX 1080 as the GPU. All comparison models in the following experiments are run in the same environment. The experimental data sets in this section include the MovieLens 1M, 10M and 20M data sets. For the MovieLens 1M and 10M data sets, the experiment uses only the "ratings.dat" files. For the MovieLens 20M data set, the experiment uses only the "ratings.csv" file. In the data processing, following the practice in Wu et al. (2016), this paper replaces scores of four points and above with 1, whereas scores below four points are replaced with 0.
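The binarization rule can be sketched on a dense rating matrix as follows. The function name and the convention of storing unrated entries as 0 are assumptions made for illustration.

```python
import numpy as np

def binarize_ratings(ratings, threshold=4.0):
    """Map ratings at or above the threshold to 1 and all others to 0.

    Unrated entries, stored here as 0, also map to 0.
    """
    return (np.asarray(ratings) >= threshold).astype(np.float32)

ratings = np.array([[5.0, 3.0, 0.0],
                    [4.0, 0.0, 2.5]])
binary = binarize_ratings(ratings)
```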
During the experiment, a stochastic gradient-based method, the Adam optimizer, is used to optimize the model objective function. The data batch size is 128. The initial learning rate of Adam is 0.001. The learning rate is attenuated once every 15 cycles, and the decay rate is 0.025. A total of 3,000 cycles are iterated.

Experiment results
In the following experimental results, DAE, CDAE and cSLIM represent the three aforementioned models in Section 2.1. VAE represents the VAE model built in Section 3.1. VAE_KL represents the VAE recommendation model using the KL-annealing method, while VAE_FB represents the VAE recommendation model using the Free Bits method. NDCG and Recall represent the evaluation indicators. In this paper, @20, @50 and @100 respectively indicate Top-20, Top-50 and Top-100 recommendations. Tables I, II and III provide the results of all comparison experiments. Based on the experimental results of the three tables, VAE does not perform well in terms of recommendation on either small (such as MovieLens 1M) or large-scale (such as MovieLens 10M and MovieLens 20M) data sets. The main reason for such performance is the KL-vanishing problem mentioned in Section 3.1.2.
The effect of the model is considerably improved after using the KL annealing or Free Bits method to solve KL vanishing. VAE_KL and VAE_FB evidently perform worse than DAE and CDAE on small data sets, such as MovieLens 1M. By contrast, VAE_KL and VAE_FB have better effects on large data sets such as MovieLens 10M and MovieLens 20M. This finding is mainly because when the amount of data is insufficiently large, the probability distribution fitted to the actual data is often inaccurate, and the fitted distribution can only accommodate a local situation. Constructing a probability distribution becomes suitable in reality only when the amount of data reaches a certain scale or when other auxiliary information is added for fitting. In this respect, filling the scoring matrix by directly using variational inference is suitable for real scenarios with large-scale data. The effectiveness of this approach is mainly because of the following reasons: (1) The DAE and CDAE methods adopt a fixed noise-adding method, which will reduce the robustness of the model. The noise of VAE_KL and VAE_FB is mainly derived from the sampling of the probability distribution of the hidden vectors, and the degree of human interference is small. (2) The distribution sampling of VAE_KL and VAE_FB is random. The same model parameters may eventually produce different hidden vectors, and thus different output data are trained. In this manner, overfitting is less likely to occur compared with the DAE and CDAE methods. (3) VAE_KL and VAE_FB analyze the specific distribution that the user score data obey from the perspective of probability, which gives them a stronger theoretical basis than DAE and CDAE. When dealing with large-scale data, the fitted probability distribution will be close to the real scene and is of practical value.

Conclusion
This paper presents the usage of the variational inference model for collaborative filtering recommendation. After introducing the KL annealing and Free Bits methods, the effect of the basic model is improved. In traditional methods, the fixed noise-adding scheme reduces the robustness of the model. By contrast, variational inference training learns the probability distribution of the hidden vector, and the model noise mainly comes from the sampling of the probability distribution, which requires no artificial noise. Therefore, the robustness of the model is improved. Moreover, the sampling of the probability distribution obtains different hidden vectors each time, which in turn yield different output data, so the occurrence of overfitting is reduced. When the amount of data is relatively large in the actual application scenario, the probability distribution of the fitted actual data can better represent the user and the item. Therefore, using variational inference for collaborative filtering recommendation is of practical value.