Nonindependent Session Recommendation Based on Ordinary Differential Equation



Introduction
Traditional recommendation algorithms assume that all items a user interacts with are independent of each other. These methods ignore the continuity of information in the user's behavioral sequence and increasingly fail to meet the user's individual needs. For example, item-based recommendation algorithms [1][2][3] focus on the similarity between items and use the similarity calculation to infer the user's preference.
User-based recommendation algorithms [4][5][6] are similar to item-based ones. The difference is that the similarity is calculated between the preferences of different users. Although many hybrid recommendation methods [7][8][9] compensate for the incompleteness of these two approaches, they still do not deliver personalized recommendations. In recent years, as the amount of data has grown, the limitations of traditional algorithms have become more and more visible, and neural networks have once again moved to the forefront of research. Network structures such as Long Short-Term Memory (LSTM) [10][11][12], Gated Recurrent Unit (GRU) [13,14], and Recurrent Neural Networks (RNNs) [15] have been widely used for modeling user behavior sequences and for personalized recommendation [15,16]. The environment of a recommendation system is more complex and varied than expected. The user's behavior sequences are often of indefinite length, and the items within each sequence are closely correlated, so Graph Neural Networks (GNNs) [17,18] have been introduced into recommendation systems. We use a graph-structured model to capture the transitions between items and generate accurate item embedding vectors accordingly. This clearly differs from traditional methods, because we are not building a single embedding per item. With this in mind, more and more researchers are beginning to use graph networks for session recommendation.
In recent years, more and more researchers have focused on recommendation methods based on user sessions. Most session-based recommendation studies share a definition of a session: a temporally ordered sequence of records that can be used to track a user's browsing, clicking, or purchasing behavior. Compared with methods that directly model the relationship between users and items, the session-based approach can bring more implicit feedback. However, these studies assume that user behaviors are independent of each other, which is not the case in real life.
In our research, we still use the idea of GNNs to model sessions, and Gated Graph Neural Networks (GGNNs) [19,20] capture the complex transitions between items within a session. However, because the discretization of such data is often undefined, directly using neural networks to learn from it can lead to missing data or inaccuracies in certain time intervals. We introduce Ordinary Differential Equations (ODEs) [21][22][23] to make up for this shortcoming.
In this article, we present our model, Sess-ODEnet. It uses Neural Ordinary Differential Equations to offer a novel approach to session recommendation. The original intent of this approach is to model the user's nonindependent intent. In other words, we can obtain the latent trajectory at any point in time by solving the Ordinary Differential Equations [24], which allows us to make forward or backward predictions at any point in time.
The contributions of this work are as follows:

Related Work
Re-examining the user's behavioral sequence has become very important. More and more researchers represent the sequence of user behavior as a session, so there is a growing body of research based on user sessions.
For the user, the items inside a session are more closely related to each other than to other items, so session-based recommendation is significant. Hidasi et al. [25] were the first to apply RNNs to session recommendation, treating the series of clicks occurring in a session as a sequence and performing item-based sequence modeling to predict user preferences. In 2017, Jannach and Ludewig [26] combined RNNs with kNN for session recommendation, recognizing user behavior with RNNs while taking into account the mixture of co-occurrence signals and sequence patterns.
Li et al. [27] studied a hybrid encoder (NARM), which has an attention mechanism to model the user's sequential behavior, capture the user's main purpose of the current session, and then combine it into a unified session representation. NARM's consideration of the primary purpose of the user's current action distinguishes it from previous sequence modeling methods.
However, the user's behavior sequences do not always have the same length, and many studies address such variable-length sequences. Following the idea of word2vec, Barkan and Koenigstein [28] proposed Item2vec. In Item2vec, an item in a session plays the role of a word in word2vec. If items appear in the same collection (a sequence purchased or clicked by the same user), they are considered positive examples. This method handles variable-length sequences through embedding and then computes the prediction probability from the similarity of the embedded vectors. Grbovic and Cheng [29] proposed a "listing embedding" method that provides a new perspective on item embedding. The internal relationship between the listings a user clicks is mined from the context of the user's click sequence.
These embedding-based methods handle sequences of variable length but inevitably make the elements inside the session more strongly sequential [30][31][32][33]. We need to consider the closeness of all the items inside the session, not only of items that are next to each other. Ideally, we keep the closeness of the items inside the session and prevent them from relying too much on temporal order and position. Graph Neural Networks can do this.
Graph Neural Networks have also been used to model recommendation scenarios [34,35]. Wu et al. [36] proposed SR-GNN in 2018, which models session sequences as graph-structured data and uses GNNs to extract the embedding vectors of items. The advantage is that the model does not rely on a representation of the user and uses the session embedding for the recommendation. It differs from the classic embedding methods: it not only solves the problem of embedding sequences of unequal length but also gives a different representation for the embedding of each item.

Description of Nonindependent Identically Distributed Intention Problems
In this section, we describe the problem we are trying to solve in intuitive terms, without algorithmic details. We also give the meanings of the symbols that appear in our work.

Problem Description.
The data in a recommendation system are multi-source and heterogeneous. In this case, analyzing and predicting user behavior under the traditional independent and identically distributed assumption causes problems. Therefore, processing such multi-source heterogeneous data is a fundamental problem that a recommendation system cannot avoid. It requires a model design that is specific to the actual problem and to the characteristics and complexity of the actual data.
Consider user behavior in real life. For example, a user has purchased the same item K times in a year. Previous recommendation algorithms treat the user's K actions as independent and identically distributed. However, it is easy to imagine that the user purchased the item for different reasons at different points in time. We refer to this behavior as "nonindependent identically distributed intent." Compared with the independent and identically distributed case, such user behavior is more common in real life.
This idea is easy to accept. We cannot observe the specific factors that affect user behavior, but we can mine this layer of information from user behavior data. One idea is to use the context of the user's click action to determine whether the same action arises from the same situation.
From the perspective of continuous time and space, we looked for another way to analyze the user's nonindependent behavior. The problem we are trying to solve is how to model such intentions. We try to achieve this by using Neural Ordinary Differential Equations [37]. In forward propagation, the originally discrete states inside the network are connected by time t, presenting a continuous state space. How, then, do we obtain the latent preferences of users in each segment? Reverse-mode automatic differentiation [38] can solve this problem: it helps us solve for the adjoint states at different times.
That is, we assume that user behavior differs between different points in time.
This means that we can "split" the user state in the continuous latent space.

Definition of Symbols.
We describe the key symbols used by the model and list them in Table 1 for quick reference.

Sess-ODEnet: Use ODE Solver to Make Session Recommendation
In this section, we introduce our model in detail. Our work can be divided into two parts: first, we obtain a neural network that can be solved; then we use an ODE solver to solve the neural network and obtain the predicted results.

Recognition Network and Ordinary Differential Equation Solver.
According to the description in Neural Ordinary Differential Equations [21], we need to define a recognition network, which can be arbitrary. We first consider the importance of user sessions. In actual application scenarios, a long, ordered history may not be relevant to the user, and user actions may occur at irregular points in time. Therefore, the internal continuity of the session and the irregular time points are the focus of our research.

Session Graph.
In our model, we still use the definition of the session graph in [36]. Each session graph is a graph structure derived from the user's click sequence. The session graph is defined as $G = (V, E)$. In the session graph, each node represents an item $v_{s,i} \in V$, and each click from one item to the next within the session forms an edge in $E$. We assign a normalized weight to each edge, calculated as the number of occurrences of that edge divided by the degree of the edge's starting node. This handles items that appear more than once in the user's session. We embed each item in a unified embedding space. The node vector $v \in \mathbb{R}^d$ is the latent vector of the item learned through the Graph Neural Network, where $d$ is the dimension. Thus, each session $s$ can be represented as an embedding vector $S$ consisting of the node vectors used in the session graph.
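The edge weighting described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `build_session_graph` and the item identifiers are hypothetical, and the "degree of the starting node" is assumed here to mean the number of outgoing clicks from that node.

```python
from collections import defaultdict


def build_session_graph(session):
    """Build a weighted directed graph from one click sequence.

    Each distinct item is a node, and each consecutive pair (a, b) is an
    edge a -> b. The weight of an edge is its occurrence count divided by
    the number of outgoing clicks of its start node (an assumed reading of
    "degree of the starting node").
    """
    counts = defaultdict(int)
    out_degree = defaultdict(int)
    for src, dst in zip(session, session[1:]):
        counts[(src, dst)] += 1
        out_degree[src] += 1
    nodes = sorted(set(session))
    edges = {edge: n / out_degree[edge[0]] for edge, n in counts.items()}
    return nodes, edges


# Hypothetical session in which item "v2" is clicked twice.
nodes, edges = build_session_graph(["v1", "v2", "v3", "v2", "v4"])
```

In this example the repeated item "v2" has two outgoing edges, so each of them receives weight 0.5, while the edges leaving "v1" and "v3" receive weight 1.0.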

GGNNs as Recognition Network.
Because the Ordinary Differential Equation solver is suitable for any form of neural network, we use Gated Graph Neural Networks (GGNNs) to learn session graphs. This allows information to propagate across the graph and models the complex relationships between items. The gated graph neural network embeds each session graph so that each node contains context information. The GGNNs can be represented by Equation (1). Here, $S_v^{(1)}$ is the initial $D$-dimensional state of node $v$, and $x_v^{\top}$ is the node feature. $A$ is an adjacency matrix that includes both in-degree and out-degree edges, and $a_v^{(t)}$ is a $2D$-dimensional vector that results from the interaction between a node and its adjacent nodes. Since the adjacency matrix $A$ contains both in-degree and out-degree, the result of the calculation is similar to that of a recurrent neural network; that is, it contains bidirectional information. $\widetilde{S}_v^{(t)}$ indicates the newly generated information, and $S_v^{(t)}$ indicates the final updated node state.
This shows that we take advantage of the "forgetting" and "update" mechanisms of the gated graph neural network. They essentially act as an attention mechanism, which lets us better consider the user's long-term and short-term preferences, because information with too small a weight is filtered out. Finally, we obtain the final representation of the individual node vectors in the graph.
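For reference, the propagation rule of a gated graph neural network in its commonly used form can be written with the state symbol $S$ of this section as follows. This is a sketch of the standard formulation, not a quotation from this article: the gate symbols $z_v^{t}$ and $r_v^{t}$, the weight matrices $W^{z}, U^{z}, W^{r}, U^{r}, W, U$, and the bias $b$ are notational assumptions.

```latex
\begin{aligned}
S_v^{(1)} &= [x_v^{\top}, \mathbf{0}]^{\top} \\
a_v^{(t)} &= A_{v:}^{\top} \left[ S_1^{(t-1)\top}, \ldots, S_{|V|}^{(t-1)\top} \right]^{\top} + b \\
z_v^{t} &= \sigma\left( W^{z} a_v^{(t)} + U^{z} S_v^{(t-1)} \right) \\
r_v^{t} &= \sigma\left( W^{r} a_v^{(t)} + U^{r} S_v^{(t-1)} \right) \\
\widetilde{S}_v^{(t)} &= \tanh\left( W a_v^{(t)} + U \left( r_v^{t} \odot S_v^{(t-1)} \right) \right) \\
S_v^{(t)} &= \left( 1 - z_v^{t} \right) \odot S_v^{(t-1)} + z_v^{t} \odot \widetilde{S}_v^{(t)}
\end{aligned}
```

In this form the reset gate $r_v^{t}$ plays the role of the "forgetting" mechanism and the update gate $z_v^{t}$ decides how much of the candidate state replaces the previous node state.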

ODEsessSolver.
We now detail the ordinary differential equation solver and its forward and backward propagation processes, which are the core of Neural Ordinary Differential Equations.

Forward Propagation on the Graph.
In GGNNs, the activation values of all layers need to be preserved after forward propagation, because these activation values are used to backpropagate gradients along the computational path. However, this takes up a large amount of memory, which limits the training of the network. Therefore, we use GGNNs to parameterize the derivative of the hidden state, rather than directly parameterizing the hidden state as usual. This brings two benefits: (1) the layers and parameters become continuous within the graph network; (2) the continuous graph network space eliminates the need to propagate gradients and update parameters layer by layer.
To make the network hierarchy continuous, the difference between adjacent hidden layers must become infinitesimally small. When the number of hidden layers in our GGNNs approaches infinity, the network can be considered continuous. We represent this continuous transformation as an Ordinary Differential Equation:
$$\frac{dS(t)}{dt} = g(S(t), t, \theta),$$
where $g$ denotes the Gated Graph Neural Network layer and $t$ runs from the initial time $t_{ini}$ to the terminal time $t_{ter}$. The change in $S(t)$ represents the forward propagation result. At this point, we only need to find the solution of the equation, which is equivalent to completing the forward propagation. Formally, we transform the above formula to find the solution we need. Given the initial state $S(t_{ini})$ and the Gated Graph Neural Network, the hidden state $S(t_{ter})$ at the end time is solved as
$$S(t_{ter}) = S(t_{ini}) + \int_{t_{ini}}^{t_{ter}} g(S(t), t, \theta)\,dt.$$
Note that both $S(t_{ini})$ and $\int_{t_{ini}}^{t_{ter}} g(S(t), t, \theta)\,dt$ can be solved by ODEsessSolver. Solving for the termination state $S(t_{ter})$ is equivalent to completing the forward propagation. This method is mature and well established in mathematics, so we only need to regard it as a "black box solver." Pseudocode for forward propagation is shown in Table 2. As it shows, we define a neural network and an ordinary differential equation solver. The solution is obtained by "feeding" the initial state, the final state, and the neural network to the solver.
Here, gatedNet is defined as a Gated Graph Neural Network, S and t represent state and time, respectively, and θ is a learnable parameter of the model.
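The forward pass described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the article's Table 2: `toy_gated_net` is a hypothetical stand-in for the GGNN layer g (a linear decay, chosen only because its exact solution is known), and a fixed-step fourth-order Runge-Kutta scheme is assumed for the solver, since the article treats the solver as a black box.

```python
import math


def toy_gated_net(S, t, theta):
    """Hypothetical stand-in for the GGNN layer g(S(t), t, theta).

    A real model would run one gated graph propagation step here; a linear
    decay is used only because its exact solution is known.
    """
    return [-theta * s for s in S]


def ode_sess_solver(g, S_ini, t_ini, t_ter, theta, steps=100):
    """Integrate dS/dt = g(S, t, theta) from t_ini to t_ter with classical RK4."""
    h = (t_ter - t_ini) / steps
    S, t = list(S_ini), t_ini
    for _ in range(steps):
        k1 = g(S, t, theta)
        k2 = g([s + 0.5 * h * k for s, k in zip(S, k1)], t + 0.5 * h, theta)
        k3 = g([s + 0.5 * h * k for s, k in zip(S, k2)], t + 0.5 * h, theta)
        k4 = g([s + h * k for s, k in zip(S, k3)], t + h, theta)
        S = [s + h / 6.0 * (a + 2 * b + 2 * c + d)
             for s, a, b, c, d in zip(S, k1, k2, k3, k4)]
        t += h
    return S


# Forward propagation: the terminal state is the only value that is kept.
S_ter = ode_sess_solver(toy_gated_net, [1.0, 2.0], 0.0, 1.0, theta=0.5)
# Swapping the time points integrates backward and recovers the initial state.
S_back = ode_sess_solver(toy_gated_net, S_ter, 1.0, 0.0, theta=0.5)
```

Because the solver only needs g, a state, and two time points, the same call serves for forward and backward prediction; only the order of the time points changes.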

Reverse-Mode Automatic Differentiation on the Graph.
To make the GGNNs' network hierarchy continuous, the challenge is how to pass the gradient through ODEsessSolver. Backpropagating the gradient along the computation path of forward propagation is intuitive, but the memory usage is large and the numerical error cannot be controlled. As stated, we treat the forward-propagating ODEsessSolver as a "black box operation": the gradient does not need to be passed through it at all and simply bypasses it. We learned in the previous section that GGNNs are used to parameterize the derivative of the hidden state; parameterizing the derivative of the hidden state amounts to constructing continuous layers and parameters, rather than discrete levels. The parameters therefore also live in a continuous space, and we do not need to propagate gradients and update parameters layer by layer. Note that the ODE solver does not store any intermediate results during forward propagation, so its memory cost is approximately constant.

Table 1
Symbol | Meaning
G = (V, E) | Session graph. V denotes the nodes in the graph, and E denotes the edges between the nodes
S | The embedding vector of the session graph
g | A single-layer representation of a gated graph neural network
t, θ | They represent each moment in the graph and the current gradient
a(t) | Adjoint. It represents the dependence of the descending gradient on the hidden state at each time point
S(t_ter) | Terminal state of time t
ODEsessSolver | Ordinary differential equation solver
I_t | It represents the sampled output at any time t obtained by the ODE solver
q(·) | It represents the posterior probability of the sampled output. Specifically, it represents the probability of the item to be recommended
In this way, taking the initial state $S(t_{ini})$ and the terminal state $S(t_{ter})$ as an example, the loss function used for backpropagation is
$$L(S(t_{ter})) = L\left(S(t_{ini}) + \int_{t_{ini}}^{t_{ter}} g(S(t), t, \theta)\,dt\right) = L\left(\mathrm{ODEsessSolver}(S(t_{ini}), g, t_{ini}, t_{ter}, \theta)\right).$$
Note that the input to the loss function is the result of ODEsessSolver. It follows from the above equation that the optimization problem reduces to a gradient-based optimization over θ.
We use the adjoint sensitivity method [39] to calculate the gradient in the backward pass. This method computes the gradient by solving a second, augmented Ordinary Differential Equation backward in time. It scales linearly with the size of the problem, has low memory cost, and can explicitly control numerical errors. In this method, the dependence of the loss on the hidden state $S(t)$ at each time point is defined as the adjoint $a(t)$, with $a(t) = \partial L / \partial S(t)$. At every instant,
$$\frac{da(t)}{dt} = -a(t)^{\top} \frac{\partial g(S(t), t, \theta)}{\partial S}.$$
The adjoint at the initial time point $t_{ini}$ can be solved directly by the ordinary differential equation solver. For $[t_1, \ldots, t_n]$, it can be calculated backward from its final value. For the parameter θ, the gradient depends on the current hidden state $S(t)$ and the adjoint $a(t)$:
$$\frac{dL}{d\theta} = -\int_{t_{ter}}^{t_{ini}} a(t)^{\top} \frac{\partial g(S(t), t, \theta)}{\partial \theta}\,dt.$$
Here, $a(t)^{\top}(\partial g/\partial S)$ and $a(t)^{\top}(\partial g/\partial \theta)$ are vector-Jacobian products [40]. They can both be evaluated by automatic differentiation [41][42][43]. The integrals for $S$, $a$, and $\partial L/\partial \theta$ are solved by the ODE solver. In the solution process, the original state, the adjoint state, and the other partial derivatives of the nodes in the layer at each moment are concatenated into a single vector. Note that when the loss depends on intermediate states, we use reverse-mode automatic differentiation [37,38,42] to decompose the derivative into a series of solves, one for each time interval between consecutive output pairs. Reverse-mode automatic differentiation here differs from the backward propagation of ordinary neural networks. It means that, from the user's initial state, we can obtain the termination state at any time [44,45].
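The adjoint computation can be checked on a toy problem. The sketch below is an illustration under stated assumptions, not the article's implementation: the dynamics g(S, θ) = -θS and the loss L = S(t_ter) are hypothetical choices made because the exact gradient is known in closed form, the helper names `rk4` and `adjoint_gradient` are invented for this example, and the partial derivatives of g are written by hand where a real model would use automatic differentiation.

```python
import math


def rk4(f, y, t0, t1, steps=200):
    """Classical fixed-step RK4 for dy/dt = f(y, t); also works when t1 < t0."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = f(y, t)
        k2 = f([a + 0.5 * h * k for a, k in zip(y, k1)], t + 0.5 * h)
        k3 = f([a + 0.5 * h * k for a, k in zip(y, k2)], t + 0.5 * h)
        k4 = f([a + h * k for a, k in zip(y, k3)], t + h)
        y = [a + h / 6.0 * (p + 2 * q + 2 * r + s)
             for a, p, q, r, s in zip(y, k1, k2, k3, k4)]
        t += h
    return y


def adjoint_gradient(S_ini, theta, t_ini, t_ter):
    """Gradient of L = S(t_ter) with respect to theta for g(S, theta) = -theta * S."""
    # Forward pass: only the terminal state is kept, no intermediate states.
    (S_ter,) = rk4(lambda y, t: [-theta * y[0]], [S_ini], t_ini, t_ter)

    def augmented(y, t):
        S, a, _ = y
        dg_dS = -theta           # partial g / partial S
        dg_dtheta = -S           # partial g / partial theta
        return [-theta * S,      # dS/dt = g(S, theta)
                -a * dg_dS,      # da/dt = -a * dg/dS
                -a * dg_dtheta]  # running integral that yields dL/dtheta

    # Backward pass: start from a(t_ter) = dL/dS(t_ter) = 1 and gradient 0,
    # then integrate the augmented system from t_ter back to t_ini.
    _, a_ini, grad = rk4(augmented, [S_ter, 1.0, 0.0], t_ter, t_ini)
    return S_ter, a_ini, grad


S_ter, a_ini, grad = adjoint_gradient(S_ini=2.0, theta=0.5, t_ini=0.0, t_ter=1.0)
# Closed form: S(t_ter) = S_ini * exp(-theta * T), so dL/dtheta = -T * S_ini * exp(-theta * T).
exact = -1.0 * 2.0 * math.exp(-0.5)
```

The state, the adjoint, and the gradient accumulator are concatenated into one vector and solved in a single backward call, which is why no intermediate activations from the forward pass need to be stored.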

Session Recommendation Using ODEsessSolver.
In the previous section, we mentioned that our goal is to model sessions in a way that differs from traditional sequential learning. Since user behavior is continuous or forms time-series data with different frequencies, durations, and starting points, the same behavior may occur at different points in time for different reasons. For example, a user goes to the pharmacy twice to buy painkillers, once because of a headache and once because of a toothache. The context of the user's behavior makes it easy to tell that these are two different situations. At present, this idea is rarely applied in modeling. Such irregular time-series data can be divided into discrete steps of weeks, days, or even hours by parallel sessions.
In Section 4.1, we gave some details of the gated graph neural network and the ordinary differential equation solver. They are critical tools for nonindependent and identically distributed session prediction. In this section, we use these ideas and tools to implement recommendation based on dependent sequence prediction. The model diagram (Figure 1) illustrates the recommendation process.
Given the observed times $T = \{t_{ini}, t_1, \ldots, t_n\}$ and an initial state $S(t_{ini})$, ODEsessSolver is again used to calculate the latent states $S = \{S_{t_{ini}}, S_{t_1}, \ldots, S_{t_n}\}$ at each time point, while generating the sampled output $I_t = \{I_{t_{ini}}, I_{t_1}, \ldots, I_{t_n}\}$ for each latent state. Our model can be defined as follows:
$$S_{t_1}, S_{t_2}, \ldots, S_{t_n} = \mathrm{ODEsessSolver}(S_{t_{ini}}, g, \theta_g, t_{ini}, t_1, \ldots, t_n),$$
where each layer of the Gated Graph Neural Network takes the corresponding $S$ at the current time point and outputs the gradient $\partial S(t)/\partial t = g(S(t), \theta_g)$. After the GGNNs consume the data in order, the posterior probability of each sample is output, that is, the probability of the item we use for prediction. For GGNNs, each graph network layer $g$ is time-invariant, and given any latent state $S_t$, the entire latent trajectory should be unique. At any time, we can make session predictions forward or backward. For example, if the initial session state $S_{t_{ini}}$ is the current input session, its latent trajectory to the termination state $S_{t_{ter}}$ should be unique. Extrapolating from this termination state allows the prediction of latent states at future times. That is, by solving for the termination state $S_{t_{ter}}$, we can predict the potential behavior of the user for a certain period of time after this point.
Mathematical Problems in Engineering
According to the results of the ordinary differential equation solver, we use softmax to obtain multi-class probabilities over the items. In practice, we only need the top few predicted items.
We give a diagram (Figure 2) of this example to illustrate the process visually.
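The softmax and top-item selection can be sketched as follows. This is a minimal illustration: the function names and the score values are hypothetical, and how the scores are derived from the solver output is not specified here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of item scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_items(scores, k):
    """Return (item index, probability) pairs for the k most probable items."""
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Hypothetical scores for four candidate items.
top = top_k_items([0.1, 2.0, -1.0, 1.5], k=2)
```

For these scores the two recommended items are those at index 1 and index 3, in that order.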

Experimental Settings.
Our experiment verifies the idea of applying Neural Ordinary Differential Equations to session recommendation. In the context of session-based sequence learning, we chose Gated Graph Neural Networks as the recognition network and explored the user's preferred sessions using the Latent Neural Ordinary Differential Equations model proposed in [21]. We trained with the Adam optimizer and a learning rate of 0.005. To alleviate overfitting, we use an L2 penalty and early stopping: training stops after ten epochs without improvement. Our experiments were implemented with Python 3.6 and PyTorch 1.0 on Windows 10 and accelerated using TITAN XP GPUs.
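The early-stopping rule can be sketched independently of any framework. This is an assumed reading of the rule, not the authors' training loop: the function name is hypothetical, the validation scores are made up, and "improvement" is taken to mean a strictly higher validation score than the best seen so far.

```python
def early_stopping_epoch(val_scores, patience=10):
    """Return the index of the last epoch that runs under early stopping.

    Training stops once `patience` consecutive epochs fail to improve on
    the best validation score seen so far (higher is better).
    """
    best = float("-inf")
    stale = 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_scores) - 1


# Hypothetical validation curve: the best score is reached at epoch 1.
stop_epoch = early_stopping_epoch([0.1, 0.2] + [0.15] * 20)
```

With a patience of ten, this curve stops at epoch 11, ten epochs after the last improvement.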

Experimental Datasets.
We tested on four real datasets: two large session datasets (Yoochoose (http://2015.recsyschallenge.com/challege.html) 1/64 and Diginetica (http://cikm2016.cs.iupui.edu/cikmcup)), a dataset containing users' music playback history separated by timestamp (Last.fm (https://grouplens.org/datasets/hetrec-2011/)), and a user behavior dataset of clickstream data (Retailrocket (https://www.kaggle.com/retailrocket/ecommerce-dataset)). The Yoochoose dataset is from RecSys Challenge 2015 and contains user clickstreams on an e-commerce site over 6 months. The Diginetica dataset is from the CIKM Cup 2016; we use only its transaction data. After removing sessions of length 1 and items with too few occurrences, we obtain the data shown in Table 3. These two datasets are classic and commonly used for session-based recommendation. We used a data preprocessing method similar to [36], because it gives us cleaner input.
Last.fm is a dataset for music recommendation. The data include a list of the most popular artists, as well as play counts and tags for songs. The Retailrocket dataset contains the behavioral data of users of a real e-commerce website. It covers a slightly shorter period than Yoochoose, with only 4.5 months of visitor behavior data. The dataset contains three types of user behavior: browsing, adding to the shopping cart, and transaction. Table 4 shows the details of these two datasets.
These two datasets are commonly used in recommendation systems based on user behavior analysis. They are characterized by easy access to user behavior in the form of sequences.
A baseline based on Recurrent Neural Networks (RNNs). It models the session using a deep recurrent neural network consisting of GRU cells.
NARM: the Neural Attentive Recommendation Machine. It adds an attention mechanism to a recurrent neural network. On top of the recurrent network's analysis of sequential behavior, it pays closer attention to the user's main behavior.
SR-GNN: this model was proposed by Shu Wu et al. in January 2019 to aggregate separated session sequences into graph-structured data. Through Graph Neural Networks, the global session preference and the local preference are considered jointly.

Metrics.
We use two evaluation metrics commonly used in recommendation systems, namely recall and mean reciprocal rank, to quantitatively evaluate the experiment.
Recall@S: one of the most important evaluation metrics for recommendation systems. It measures the proportion of test instances whose target item appears among the first S items of the recommendation list.
Recall@20: the proportion of correctly recommended items among the top 20 items.
Recall@50: the proportion of correctly recommended items among the top 50 items.
Mean Reciprocal Rank: it takes the rank of the true target item in the recommendation list for each test case, takes the reciprocal, and averages over all test cases as a measure of accuracy.
MRR@20: the average of the reciprocal ranks of the correctly recommended items within the top 20 items.
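Both metrics can be computed from ranked recommendation lists as sketched below. The function names and the toy data are hypothetical, and one assumption is made explicit: a target item that does not appear in the top k contributes zero to MRR@k, which is the usual convention.

```python
def recall_at_k(ranked_lists, targets, k):
    """Fraction of test cases whose target item is among the top-k items."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if target in ranked[:k])
    return hits / len(targets)


def mrr_at_k(ranked_lists, targets, k):
    """Mean reciprocal rank of the target; 0 if it is outside the top k."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        top = ranked[:k]
        if target in top:
            total += 1.0 / (top.index(target) + 1)
    return total / len(targets)


# Hypothetical ranked lists for three test sessions and their true next items.
ranked = [[3, 1, 2], [5, 4, 6], [7, 8, 9]]
targets = [1, 6, 0]
recall = recall_at_k(ranked, targets, k=3)
mrr = mrr_at_k(ranked, targets, k=3)
```

Here two of the three targets appear in the top 3 (at ranks 2 and 3), so Recall@3 is 2/3 and MRR@3 is (1/2 + 1/3) / 3.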

Results.
We tested on four datasets. To train the model's ability to predict at irregular time points, we randomly selected the time points to extract in each trajectory. In addition, each new input is concatenated with the time difference to the next predicted time point, to further improve the ability of the Gated Graph Neural Network to handle irregular observations. The experimental results are shown in Table 5.
Figure 2: The process by which the model processes the data. It depicts the flow from the initial state to the termination state. The input is nonlinearly transformed by gated graph neural networks (GGNNs) to obtain g. ODEsessSolver integrates the neural network g and adds the initial state value to obtain the final prediction. (Diagram labels: Initial state, Gated graph neural networks, ODEsessSolver, Terminal state S(t_ter).)

Our model gives a new definition of the specific latent state of a user session over a continuous period of time. Among our baselines, NARM and SR-GNN are session-based recommendation models proposed in the past two years. When these models predict user behavior, their results should be very similar, because they are all built on the independent and identically distributed assumption. In contrast, our model focuses on the different adjoint states that users present in different periods. These adjoint states indicate that although the user produces the same behavior at different points in time (e.g., clicks, purchases), it may be for different reasons.
The ability to model user sessions in complex spaces is enhanced. Most recommendation algorithms use RNNs or GRU to model user sequences. However, in a complex recommendation system, although the user's actions form a sequence, the items are still closely related in the form of a graph. In our model, the Gated Graph Neural Network combined with the Ordinary Differential Equation preserves the network's ability to model complex data while enabling it to propagate in a continuous space.
The model has a lower complexity in the solution process. When we use ordinary differential equations to solve the problem, we usually do not need the complete form of the solution of the equation, as long as the obtained solution gradually approaches the optimal value.
The Ordinary Differential Equation solver eliminates the need to propagate gradients and update parameters layer by layer by parameterizing the derivative of the network's hidden state [39]. We can obtain the desired result by solving the network once, without the need for a gradient-like approach. Therefore, our model usually has the constant storage cost of ordinary differential equations.

Conclusion
Session recommendation plays a vital role in mining the user's implicit preferences, but sessions cannot be treated as independent and identically distributed. We propose a recommendation model based on Neural ODEs: Sess-ODEnet. The model combines differential equations with gated graph neural networks to model complex sessions. It represents each session as a session graph and then derives the representation vector of each embedded item through the Graph Neural Network. On this basis, the ODE solver is used to predict and recommend nonindependent intentions at any point in time. Moreover, Neural ODEs differ from traditional neural network training methods, which gives our model not only low memory usage but also lower complexity in the solution process.
However, the focus of this work is still to regard the user session as a continuous state in time. In the future, we hope to apply this model to behavior sequences that lack time steps and to model users' nonindependent intentions at higher levels for recommendation [14].

Data Availability
The illustrative example data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare no conflicts of interest.