Collaborative Filtering Recommendation Algorithm Based on User Characteristics and User Interests

Among the many techniques used by e-commerce platforms, collaborative filtering is currently the most widely applied recommendation technology. To alleviate the deficiencies of the traditional user-based collaborative filtering algorithm in cold start and recommendation accuracy, this paper proposes a collaborative filtering recommendation algorithm based on user characteristics and user interests. The similarity used by the proposed algorithm combines user rating similarity, user attribute feature similarity, and user interest similarity. Attribute features are extracted from user registration information to calculate the user feature similarity; the number of times a user evaluates each item attribute measures the user's interest in different item attributes, and a similarity formula then yields the interest similarity between users. The user attribute feature similarity and user interest similarity are combined with the user rating similarity to obtain the final similarity used for recommendation. Finally, a simulation experiment is performed on the MovieLens movie data set. The experimental results show that the improved collaborative filtering algorithm based on user characteristics and user interests not only alleviates the cold start problem but also improves recommendation accuracy.


Introduction
In recent years, with the rapid development of information technology, the Internet produces enormous amounts of data every day. Take the well-known Douyin app as an example: anyone who registers an account can become a video producer and publisher, recording their own life and sharing it with others. Likewise, on microblogging platforms each of us can post a blog to express personal feelings, share knowledge, and communicate with others. Faced with so much data, researchers have proposed a number of approaches. For example, information retrieval technology, which requires manual input of key information, saves search time to a certain extent; but if the query cannot be stated accurately, the retrieved items often fail to match the user's interests, and with the amount of information growing exponentially every year, the volume of data to be retrieved has also increased rapidly, so this technology faces severe challenges [1]. The emergence of recommendation technology solves this problem well. It uses users' behavior and mathematical algorithms to predict what users may be interested in and actively recommends it to them [2], amounting to intelligent recommendation. Recommendation algorithms are applied very widely, including book, music, video, and map websites and news portals [3,4].
There are many types of recommendation algorithms, such as content-based recommendation, collaborative filtering (CF)-based recommendation, and association rule-based recommendation [5]. Among these many recommendation technologies, one of the earliest and most successful is CF [6].
CF includes model-based and memory-based algorithms [7]. Memory-based CF is divided into two categories, user-based and item-based [8]. CF usually uses nearest-neighbor techniques: it computes the similarity between users from their historical preference data, predicts the target user's preference for an item as a weighted combination of the neighbor users' ratings of that item, and finally makes recommendations to the target user on the basis of this predicted preference. Its strongest point is that it imposes no restrictions on the recommended objects and can handle unstructured, complex objects such as movies and music. However, CF recommendation still has shortcomings, such as data sparsity, scalability problems, new-user problems, poor recommendation quality at the start of the system, and cold start [9].

Related Work
Various improved algorithms keep emerging. Literature [10] investigates how to leverage the heterogeneous information in a knowledge base to improve the quality of recommender systems. Literature [11] proposed a new user similarity model that improves recommendation performance when only a few ratings are available to compute each user's similarities. Cao et al. [12] use knowledge graphs to analyze the reasons behind user preferences, understand users' points of interest, and recommend items they may like based on those points of interest. Ma [13] proposed a model that considers situations involving both ratings and trust. Literature [14] proposed a weight-based time similarity calculation method that effectively reflects user interest and improves recommendation accuracy. Literature [15] proposed a movie recommendation model based on word-vector features, which improves the recommendation effect, although a single text input may not be very effective for improving accuracy. Literature [16] proposed an ensemble-based k-NN collaborative filtering method that fuses the merits of the user-based and item-based k-NN algorithms and gives better recommendation quality. Literature [17] proposed a similarity measure model that considers users' preferences for item attributes, effectively improving the performance of the recommendation algorithm. Literature [18] proposed filling unrated entries of the user-item rating matrix with a linear regression model, alleviating data sparsity. Literature [19] introduced an enhanced recommendation algorithm based on modified user-based collaborative filtering that improves recommendation quality.
Based on the above research, this paper presents an improved collaborative filtering recommendation algorithm. For recommendation quality, it fully considers the influence of users' attribute characteristics on the cold start: by computing the similarity of user characteristics, the problem of poor recommendation quality at the start of the system is alleviated. For prediction accuracy, users' personalized interest information is fully utilized to build an interest matrix and compute interest similarity values. The user attribute feature similarity is merged with the user interest similarity and the user rating similarity to obtain the final similarity; the final similarity is then used to predict scores on the user-item rating matrix, the predictions are filtered with the Top-N algorithm [20], and the mean absolute error is used to measure whether the algorithm improves. The experimental results show that the proposed algorithm has clear advantages over the traditional CF algorithm.

Establish a scoring matrix
The data involved in the user-based CF algorithm is the users' historical rating data for items. The rating matrix N holds these data, where m is the number of users and n is the number of items; N_{ij} is the rating of user U_i for item I_j. The rating value is generally an integer from 1 to 5, with higher values indicating stronger preference.
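As a minimal sketch, the rating matrix can be held as a 2-D array indexed by user and item. The toy data below is hypothetical; in practice m and n come from the data set, and 0 is used here to mark an unrated entry:

```python
import numpy as np

# Toy rating matrix N: rows = users (m = 3), columns = items (n = 4).
# N[i, j] is user U_i's rating of item I_j; 0 marks "not yet rated".
N = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 5, 4],
])

m, n = N.shape          # number of users, number of items
rated = N > 0           # boolean mask of observed ratings
```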

Select nearest neighbor
This article takes user-based CF as an example to illustrate how the nearest-neighbor set is generated. First, the similarity between each user and the target user a is computed; from this, the nearest-neighbor set K_a = {a_1, a_2, ..., a_k} of target user a is found, consisting of the K users most similar to a. Three common similarity measures are given below.

Cosine similarity: let i denote the rating vector of user a in the n-dimensional item space and j the rating vector of user b. The calculation is shown in (1):

sim(a, b) = \cos(i, j) = \frac{i \cdot j}{\|i\| \, \|j\|}   (1)

Modified cosine similarity: plain cosine similarity does not account for differences in users' rating scales. For example, with ratings in [1, 5], user a may consider a rating of 3 or above as "liked", while user b only considers 4 or above as "liked". The modified cosine similarity balances such differences by subtracting each user's average rating from that user's ratings, as shown in (2):

sim(a, b) = \frac{\sum_{i \in N_{ab}} (N_{a,i} - \bar{N}_a)(N_{b,i} - \bar{N}_b)}{\sqrt{\sum_{i \in N_a} (N_{a,i} - \bar{N}_a)^2} \, \sqrt{\sum_{i \in N_b} (N_{b,i} - \bar{N}_b)^2}}   (2)

In (2), N_{ab} is the set of items rated by both user a and user b; N_a and N_b are the rating sets of users a and b; N_{a,i} and N_{b,i} are the ratings of a and b on item i; and \bar{N}_a and \bar{N}_b are the average ratings of users a and b.

Pearson correlation similarity: the strength of the correlation between two variables can be described by the Pearson correlation coefficient; the larger its absolute value, the stronger the correlation.
The mathematical definition of Pearson correlation similarity is shown in (3):

Sim_r(a, b) = \frac{\sum_{i \in N_{a,b}} (N_{a,i} - \bar{N}_a)(N_{b,i} - \bar{N}_b)}{\sqrt{\sum_{i \in N_{a,b}} (N_{a,i} - \bar{N}_a)^2} \, \sqrt{\sum_{i \in N_{a,b}} (N_{b,i} - \bar{N}_b)^2}}   (3)

where N_{a,b} is the set of items that users a and b have rated in common; N_{a,i} and N_{b,i} are the ratings of a and b for item i; and \bar{N}_a and \bar{N}_b are the averages of all ratings of users a and b. Under the same conditions the Pearson correlation similarity performs comparatively well, so this article uses only it in what follows.
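The Pearson similarity of formula (3) can be sketched as follows, using 0 to mark unrated entries; note that, as in the text, each user's average is taken over all of that user's own ratings:

```python
import numpy as np

def pearson_sim(N, a, b):
    """Pearson similarity between users a and b over their co-rated items.

    N is the user-item rating matrix with 0 meaning "unrated".
    """
    common = (N[a] > 0) & (N[b] > 0)          # the co-rated item set N_{a,b}
    if common.sum() == 0:
        return 0.0                            # no overlap, no evidence
    # Averages are taken over each user's own rated items
    mean_a = N[a][N[a] > 0].mean()
    mean_b = N[b][N[b] > 0].mean()
    da = N[a][common] - mean_a
    db = N[b][common] - mean_b
    denom = np.sqrt((da ** 2).sum()) * np.sqrt((db ** 2).sum())
    return float((da * db).sum() / denom) if denom else 0.0
```

The guard clauses (empty overlap, zero denominator) are defensive choices for the sketch, not part of formula (3) itself.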

Predict scores and generate recommendations
After obtaining the K-nearest-neighbor set, the unrated items of the target user can be predicted from the ratings of the similar users in the set, and the predicted scores are sorted from high to low to generate a Top-N recommendation list for the target user. The calculation is shown in (4):

P_{a,i} = \bar{N}_a + \frac{\sum_{b \in K_a} Sim(a, b) \, (N_{b,i} - \bar{N}_b)}{\sum_{b \in K_a} |Sim(a, b)|}   (4)

where K_a is the neighbor set of the target user, N_{b,i} is user b's rating of item i, \bar{N}_a and \bar{N}_b are the averages of all ratings of a and b, respectively, and P_{a,i} is user a's predicted rating of unrated item i.
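The mean-centered prediction of formula (4) and the Top-N filtering step can be sketched together. The similarity values are assumed precomputed; skipping neighbors who have not rated the item is a practical choice for the sketch:

```python
import numpy as np

def predict(N, sims, a, i, neighbors):
    """Predict user a's rating of item i from the neighbor set (formula (4)).

    sims[b] is a precomputed similarity sim(a, b); 0 in N means unrated.
    """
    mean_a = N[a][N[a] > 0].mean()
    num = den = 0.0
    for b in neighbors:
        if N[b, i] == 0:
            continue                      # neighbor has not rated item i
        mean_b = N[b][N[b] > 0].mean()
        num += sims[b] * (N[b, i] - mean_b)
        den += abs(sims[b])
    return mean_a + num / den if den else mean_a

def top_n(N, sims, a, neighbors, n=2):
    """Rank user a's unrated items by predicted score and keep the top n."""
    unrated = [i for i in range(N.shape[1]) if N[a, i] == 0]
    scored = [(i, predict(N, sims, a, i, neighbors)) for i in unrated]
    return sorted(scored, key=lambda t: -t[1])[:n]
```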

User feature similarity calculation
Generally, when a user registers an account on a website, some basic information must be filled in [21]. The required information differs across websites, but most include an account name, age, gender, occupation, and postcode. This seemingly unimportant basic information can actually reflect users' interests and hobbies: users with the same characteristics often have similar interests and become similar users. When new users arrive and there is no historical data, user characteristics are particularly important, so introducing the user's basic characteristics into the similarity calculation can well avoid the poor recommendations caused by the user cold start. This article uses the user's gender, age, occupation, and zip code to calculate the user feature similarity.

Age similarity: users of different ages like different items. As far as film and television works are concerned, children prefer cartoons, young people prefer idol dramas, and older people tend to prefer films of their own era. Therefore, age similarity is a key factor of user feature similarity. Let the ages of users a and b be A_a and A_b; formula (5) computes their age-attribute similarity from A_a and A_b.

Gender similarity: users of different genders can differ greatly in their item preferences. Since gender takes only two values here, this article computes the gender similarity of users a and b as in (6):

sim_g(a, b) = \begin{cases} 1, & G_a = G_b \\ 0, & G_a \neq G_b \end{cases}   (6)

Label similarity: the same kind of comparison is applied to the occupation label to give the label similarity of users a and b.

This paper combines the above three feature similarities with weight coefficients L_1, L_2, L_3 ∈ [0, 1], where L_1 + L_2 + L_3 = 1. Formula (8) gives the user attribute similarity Sim_f(a, b) of users a and b:

Sim_f(a, b) = L_1 \, sim_g(a, b) + L_2 \, sim_{age}(a, b) + L_3 \, sim_{label}(a, b)   (8)
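The weighted combination of formula (8) can be sketched as follows. The age-similarity form and the example weights are illustrative assumptions, not the paper's formula (5) or its tuned values:

```python
def age_sim(age_a, age_b):
    # Assumed form: similarity decays with the age gap (the paper's exact
    # formula (5) is not reproduced here).
    return 1.0 / (1.0 + abs(age_a - age_b))

def gender_sim(g_a, g_b):
    # Formula (6): binary match, since gender takes only two values here.
    return 1.0 if g_a == g_b else 0.0

def label_sim(o_a, o_b):
    # Binary match on the occupation label.
    return 1.0 if o_a == o_b else 0.0

def feature_sim(ua, ub, L1=0.3, L2=0.4, L3=0.3):
    """Formula (8): weighted sum of the three feature similarities,
    with L1 + L2 + L3 = 1 (example weights, not the tuned values)."""
    return (L1 * gender_sim(ua["gender"], ub["gender"])
            + L2 * age_sim(ua["age"], ub["age"])
            + L3 * label_sim(ua["job"], ub["job"]))
```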

User interest similarity calculation
Human interest consists of subjective expression and objective description, that is, explicit interest and implicit interest. Explicit interest is usually obtained in the form of conversations or questionnaires. For example, after a user registers on Sina Weibo, it recommends many bloggers and lets the user choose which ones to follow; Niuke.com, a well-known platform where programmers learn and grow, likewise lets new users choose their favorite languages and technologies at registration. However, because some users lack patience and give up or fill in the forms carelessly, the final results may be inaccurate. Implicit interest is mainly reflected in users' behavioral preference data, mostly back-end data such as clicks on commodities and evaluations of movies. From all of a user's behavior records, the recommendation system can compute recommendations in the background, which greatly saves the user's time while the user's potential interests are still captured. Therefore, this paper uses implicit interest to compute users' interest similarity. This article measures a user's degree of interest by counting how many of the user's evaluated items carry each attribute. For example, a person has seen many movies, and one movie can carry several attributes; the person's preference for romance movies can be measured by the number of watched movies that carry the romance attribute. The number of evaluations is positively related to the degree of interest. Therefore, an interest matrix is established to express each user's interest in each item attribute.
In the matrix, m is the total number of users, z is the total number of attributes, and w_{ij} is the total number of times user i has evaluated items carrying attribute j. Based on this user interest matrix, the similarity between users' interests in item attributes can be computed, as shown in (9):

Sim_i(a, b) = \frac{\sum_{z \in Z_{ab}} (w_{a,z} - \bar{W}_a)(w_{b,z} - \bar{W}_b)}{\sqrt{\sum_{z \in Z_{ab}} (w_{a,z} - \bar{W}_a)^2} \, \sqrt{\sum_{z \in Z_{ab}} (w_{b,z} - \bar{W}_b)^2}}   (9)

where w_{a,z} and w_{b,z} are the total numbers of times the items evaluated by users a and b carry attribute z, \bar{W}_a and \bar{W}_b are the average numbers of times users a and b evaluate all item attributes, and Z_{ab} is the set of all item attributes evaluated by both user a and user b.
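A sketch of the interest matrix and the Pearson-style interest similarity follows. The toy counts are hypothetical, and for simplicity the sketch correlates the full attribute vectors rather than restricting to the shared set Z_ab:

```python
import numpy as np

# Toy interest matrix W: rows = users (m), columns = item attributes (z).
# W[i, j] counts how many of user i's evaluated items carry attribute j.
W = np.array([
    [8, 2, 0],    # user 0: mostly attribute 0 (e.g. "romance")
    [7, 3, 1],    # user 1: a similar profile
    [0, 1, 9],    # user 2: a very different profile
])

def interest_sim(W, a, b):
    """Pearson-style interest similarity over attribute counts (a sketch of
    formula (9); means are taken over each user's own counts)."""
    za = W[a] - W[a].mean()
    zb = W[b] - W[b].mean()
    denom = np.sqrt((za ** 2).sum()) * np.sqrt((zb ** 2).sum())
    return float((za * zb).sum() / denom) if denom else 0.0
```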

Similarity Fusion
Formula (3) gives the users' rating similarity Sim_r(a, b); the feature similarity is Sim_f(a, b) of (8); and the users' interest similarity for item attributes is Sim_i(a, b) of (9). The three similarities are fused to obtain the final similarity Sim(a, b) using three weight parameters d, e, f ∈ [0, 1] with d + e + f = 1, as shown in (10):

Sim(a, b) = d \cdot Sim_r(a, b) + e \cdot Sim_f(a, b) + f \cdot Sim_i(a, b)   (10)
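The fusion of formula (10) is a simple convex combination; the default weights below are example values, not the tuned ones from Experiment 3:

```python
def fuse(sim_r, sim_f, sim_i, d=0.5, e=0.25, f=0.25):
    """Formula (10): final similarity as a convex combination of the rating,
    feature, and interest similarities, with d + e + f = 1."""
    assert abs(d + e + f - 1.0) < 1e-9
    return d * sim_r + e * sim_f + f * sim_i
```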

Improved algorithm flow
Input: user rating information, the item rating matrix, the item attribute matrix, the number of neighbors k, and user feature information. Output: the target user's predicted ratings. The brief steps of the algorithm are:
1. Convert the user rating information into a user rating matrix and calculate the rating similarity between users with (3), obtaining Sim_r;
2. Extract the users' characteristics from the user feature information and calculate the age, gender, and label similarities between users;
3. Substitute the obtained feature similarities into (8) to obtain the overall user feature similarity Sim_f;
4. Scan the item attribute matrix and use (9) to compute the interest similarity between users;
5. Combine the results of steps 1, 3, and 4 with (10) to obtain the final user similarity Sim(a, b);
6. Using (4), predict the target user's scores from the set of k neighbor users on the rating matrix.

Data Sources
To confirm the usefulness of the improved algorithm, this experiment uses the well-known MovieLens data set, which provides a user information table, a movie information table, and a rating table. The user information table includes each user's gender, age, occupation, zip code, etc. The movie information table includes the movie release time and movie type. The rating table includes 943 users, 1,682 movies, and 100,000 ratings; the rating range is 1 to 5 points, each user has rated at least 20 movies, and the score is positively correlated with the user's preference. The sparsity of this data set can be calculated with the sparsity formula shown in (11):

spa = 1 - \frac{num}{m \times n}   (11)

where spa is the sparsity of the data set, num is the number of ratings, m is the number of users, and n is the number of items. Using (11), spa = 1 - 100 000 / (943 × 1 682) = 93.69%. The data set is split randomly into a training set and a test set at a ratio of 4:1. The training set is used for the algorithm's prediction, and the test set is used to evaluate the predictions. The mean absolute error (MAE) is used as the criterion for judging the quality of the recommendation results; this measure of recommendation accuracy is easy to understand, as it is essentially the average deviation between the predicted values and the true values.
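The sparsity calculation of formula (11) for MovieLens 100K is a one-liner:

```python
# Sparsity of the MovieLens 100K rating matrix per formula (11).
num, m, n = 100_000, 943, 1_682
spa = 1 - num / (m * n)   # ≈ 0.937, the paper reports 93.69%
```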

Evaluation index
The smaller the MAE, the higher the recommendation quality. The corresponding calculation is shown in (12):

MAE = \frac{1}{N} \sum_{i=1}^{N} |p_i - q_i|   (12)

where p_i is the user's real rating, q_i is the rating predicted by the algorithm, and N is the number of items whose scores are predicted.
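Formula (12) translates directly into code:

```python
def mae(pred, true):
    """Formula (12): mean absolute error between predicted and real ratings."""
    assert len(pred) == len(true) and pred
    return sum(abs(p - q) for p, q in zip(pred, true)) / len(pred)
```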

Experiment 1. Determine the number of nearest neighbors for tuning
To select a suitable number of nearest neighbors for better recommendation results during parameter tuning, this experiment analyzes the MAE values for different numbers of neighbors. The traditional CF recommendation algorithm is run first, varying the number of nearest neighbors in increments of 5 and computing the MAE for each value. The experimental results are shown in Fig. 1.

Fig.1 The effect of K on MAE

As Fig. 1 shows, when the number of nearest neighbors reaches 40, the MAE tends to be stable and the effect is good, so 40 nearest neighbors are used in the subsequent parameter tuning.
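The neighbor-count sweep can be sketched as below. `evaluate_mae` is a hypothetical helper that runs the recommender on the test split for a given neighbor count, and the range of K values is an assumption, since the original range is not preserved in the text:

```python
# Sweep the neighbor count K in increments of 5 and keep the K with the
# lowest MAE (the paper settles on K = 40, where the curve levels off).
def sweep_k(evaluate_mae, ks=range(5, 65, 5)):
    results = {k: evaluate_mae(k) for k in ks}
    best_k = min(results, key=results.get)
    return results, best_k
```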

Experiment 2. Determination of user characteristic parameters
According to (8), L_1, L_2, and L_3 represent the weights of the gender, age, and label similarities in the user attribute similarity, respectively, with L_1, L_2, L_3 ∈ [0, 1] and L_1 + L_2 + L_3 = 1. With an increment of 0.1 and the weight L_1 on the abscissa, the MAE corresponding to different weights is calculated; the experimental results are shown in Fig. 2.

Experiment 3. Determination of d, e, f parameters
According to (10), d, e, and f represent the weights of the user rating similarity, user attribute similarity, and user interest similarity, respectively, with d, e, f ∈ [0, 1] and d + e + f = 1. With an increment of 0.1 and the weight d on the abscissa, the MAE corresponding to different weights is calculated; the experimental results are shown in Fig. 3.

Experiment 4. Comparison experiment of different recommendation algorithms
To better reflect the effect of the algorithm put forward in this article, the improved algorithm is compared with the traditional CF algorithm (Pearson similarity) and the improved algorithms of literature [17] and literature [21]. The number of nearest neighbors is varied over the same range for every algorithm with an increment of 10, and the MAE values of the algorithms are compared under different numbers of neighbors. The experimental results are shown in Fig. 4.

Fig.4 The comparison of different recommendation algorithms

As Fig. 4 shows, when the number of nearest neighbors is small, the proposed algorithm performs well. As the number of nearest neighbors increases, the MAE of the proposed algorithm rises; at 35 neighbors it matches literature [17], and beyond 35 it is slightly inferior to literature [17], but it still achieves a smaller MAE than the traditional algorithm and the algorithm of literature [21]. Therefore, the improved algorithm in this paper is better suited to cases with a small number of nearest neighbors.

Conclusions and Future Works
Aiming at the shortcomings of traditional CF recommendation in cold start and recommendation accuracy, this paper presents a CF algorithm based on user characteristics and user interests. The addition of user features effectively alleviates the trouble caused by cold start; at the same time, the introduction of a user interest model enhances personalized recommendation and further improves the accuracy of the similarity calculation. Finally, suitable fitting parameters for the user rating similarity, user feature similarity, and user interest similarity are found through separate experiments to obtain the final similarity value, and the reliability of the method is verified by experiments. Future work will study the impact of changes over time on user interest.