A Lambda Layer-Based Convolutional Sequence Embedding Model for Click-Through Rate Prediction

Abstract: In the era of the intelligent economy, a click-through rate (CTR) prediction system can evaluate massive service information based on user historical information and screen out the products that are most likely to be favored by users, thus realizing customized push of information and achieving the ultimate goal of improving economic benefits. Sequence modeling is one of the main research directions of CTR prediction models based on deep learning. The user's general interest hidden in the entire click history and the short-term interest hidden in the recent click behaviors both matter, and they influence the CTR prediction results differently. In terms of capturing the user's general interest, existing models pay more attention to the relationships between item embedding vectors (point-level), while ignoring the relationships between elements within item embedding vectors (union-level). The Lambda layer-based Convolutional Sequence Embedding (LCSE) model proposed in this paper uses the Lambda layer to capture features from the click history through weight distribution, and on this basis uses horizontal and vertical filters to learn the user's general preferences at the union-level and point-level. In addition, we incorporate the user's short-term preferences captured by the embedding-based convolutional model to further improve the prediction results. The AUC (Area Under Curve) values of the LCSE model on the Electronic, Movie & TV and MovieLens datasets are 0.8707, 0.9036 and 0.9467, improving on the Caser model by 0.45%, 0.36% and 0.07%, respectively, which demonstrates the effectiveness of the proposed model.


Introduction
With the advent of the smart economy, network services are growing in quantity, and service providers pay more and more attention to the user experience, taking user satisfaction as the guide for refining their services [1]. The development of e-commerce is likewise devoted to customizing the user experience. Many e-commerce platforms such as Amazon and Google Store have introduced a click-through rate (CTR) prediction system to improve economic benefits. The CTR prediction system can evaluate massive service information based on the user's historical information, screen out the products most likely to be favored by users, and push them to customers on the e-commerce platform. Existing research shows that users know little about other products that meet their personal preferences, and that the search process takes a lot of time and effort [2]. A CTR prediction system can filter and effectively retain information, enable users to access relevant data efficiently, and increase the time users stay on the platform, thus achieving the ultimate goal of improving economic benefits [3,4].
The main method of CTR prediction is to find suitable features from the user dimension, service dimension or time dimension based on the user profile, user history and other information, build a model on them, and then predict the probability of a user clicking on a service. The click behavior refers to the process of accessing or evaluating products according to the user's interest on the network service platform. In the field of computational advertising, the revenue of an advertising platform relies on the product of the cost per click [5] and the click-through rate (CTR) [6], so the accuracy of CTR prediction significantly affects the revenue of online advertising platforms [7]. In the field of recommender systems, CTR prediction often affects the performance of recommendations. CTR is one of the main evaluation indicators of recommender systems [8], and the CTR prediction model is frequently used in the ranking stage to recommend the items that users are more inclined to click on. Therefore, improving the accuracy of CTR prediction has become a key research goal, and it plays an essential role in promoting the development of e-commerce in the smart economy.
In actual e-commerce scenarios, users' short-term preferences reflect recent needs, and providing services similar to recent browsing records can indeed satisfy customers efficiently. However, a user's demand for similar products is often limited. For example, after browsing a large number of smartphone entries, a customer chooses one to buy but has no demand for a second. Similarly, after watching a historical documentary on a video website, a customer may want a change of taste and choose a science fiction film. This shows that focusing only on short-term preferences is not enough, because it limits the recommended services to the same field. To improve the user's overall experience, it is necessary to keep providing novel product information that attracts the user's interest, which requires capturing long-term preferences.
Sequence modeling is one of the main research directions of CTR prediction models based on deep learning. It is a common way to explore the influence of the user preferences hidden in the user's historical click behavior. The short-term preferences expressed by the user's recent behavior and the general preferences expressed by the entire historical behavior influence the user's subsequent behavior to different degrees. Ideally, the system should combine both when making predictions.
A variety of sequence models for CTR prediction have been proposed. The Markov chain [9,10] predicts the user's next click behavior based on recent click behavior, but this method ignores the influence of the user's general preference on the user's click behavior. Wang et al. [11] used embedding representation and fusion to express the user's general and short-term preferences, but the information captured by embedding alone is often limited. Tang et al. [12] used a vertical filter in a convolutional neural network to fuse the information of the user's recently clicked items (point-level, which means user information) to predict the user's next click behavior. In addition, the Caser model [12] proposed by them also fuses item information from the perspective of embedding elements (union-level, which means service information) through a horizontal filter. The above models focus more on users' short-term preferences, and the research on users' general preferences is somewhat insufficient. The user's general preferences are captured only by user information embedding or item preference fusion, whereas this capture can be further optimized by combining it with the attention mechanism [13].
To address the problem of combining the user's short-term and general preferences, this paper improves the capture of the user's general preferences on the basis of the Caser model. We add a convolutional sequence embedding module based on the Lambda layer, combine it with the information of the embedding layer to capture the user's general preferences, and propose a Lambda layer-based Convolutional Sequence Embedding (LCSE) CTR prediction model. Different from other CTR prediction models that combine convolutional neural networks and attention mechanisms (such as the DSIN model [14]), the LCSE model has the following advantages: (1) in terms of user preference capture, the LCSE model combines union-level and point-level information when using convolutional neural networks; (2) in terms of the attention mechanism, the LCSE model uses a Lambda layer with a time complexity of $O(n)$ to replace the common scaled dot-product attention mechanism, which improves both the prediction results of the model and the training efficiency. Experimental results on public datasets also demonstrate the effectiveness of the LCSE model.

Related Work

CTR Prediction Models Based on Feature Interaction Learning
Current research focuses on CTR prediction models based on deep learning. The main research directions include feature interaction learning and sequence modeling.
The Wide & Deep model [15] is a method proposed by Google in 2016 to improve the overall performance of CTR prediction. It combines the linear features of the linear (Wide) model and the high-order features of the Deep model. The model increased the app download rate in the Google Play Store, proving its effectiveness. The Deep & Cross model [16] replaces the linear model in the Wide & Deep model with the Cross network, which can extract high-order cross features with better interpretability, thus improving the expressiveness of the model and achieving better performance on metrics such as logloss. The DeepFM model proposed by Guo et al. [17] uses a factorization machine to replace the logistic regression (LR) model of the Wide part. Compared with the Wide & Deep model, DeepFM has a stronger learning ability for sparse data. It considers the interaction of both high-order and low-order features and requires no additional manual feature engineering. The model has also achieved good results in the Huawei AppGallery. Based on the DeepFM model and the Deep & Cross model, the xDeepFM model [18] proposed by the Social Computing Group of Microsoft Research Asia makes the feature interactions occur at the vector level. It has better memory and generalization ability and can automatically learn high-order feature interactions explicitly and implicitly at the same time. It also improves the accuracy of the prediction results to a certain extent.
Although these methods can explore the relationships between feature interactions, they ignore the preference information hidden in the user's click sequence, which also affects the user's click behavior.

CTR Prediction Model Based on Sequence Modeling
The preferences displayed at different times in the user's historical click sequence influence the user's subsequent behavior differently. Compared with the general preferences reflected in the entire click history, the short-term preferences reflected in recent click behaviors have a greater impact on subsequent behaviors, but the importance of general preferences cannot be ignored either.
By combining matrix factorization and the Markov chain, the Factorizing Personalized Markov Chains (FPMC) model [19] uses the Markov transition matrix to predict the user's next click behavior according to the user's click behavior at the previous moment, and has achieved good results in CTR prediction. The Hierarchical Representation Model (HRM) [19], which achieves better results on real-world datasets than the FPMC model, uses a hierarchical structure. It first fuses the recent service information consumed by the user, and then combines the result with the user representation to obtain the final fused representation and the prediction results. The Caser model applies two kinds of filters within a convolutional neural network: a vertical filter that merges point-level user click data and a horizontal filter that integrates service information across the embeddings, in order to predict user click behavior accurately. The above models focus more on the user's short-term preferences. The user's general preferences are extracted only by user information embedding or service preference fusion. Although these models achieve good prediction accuracy, the overall performance can be further improved by optimizing the capture of general preferences. Recently, many studies have introduced attention mechanisms to enhance the capture of general preference information.
The Deep Interest Network (DIN) model [20] introduces the attention mechanism to obtain the user's general preference by calculating the relationships between the target service and the user's click history, which affect the user's next click behavior. The Short-Term Attention Memory Priority Model (STAMP) [21] also introduces an attention mechanism to capture general preferences and proposes a novel short-term attention model. It calculates interest weights according to the context and uses the user's interests at different times to synthesize the attention vectors. These models consider only the relationships between services when using the attention mechanism, and better results can be achieved if information from the embedding layer is integrated, as in the Caser model. Moreover, the traditional self-attention mechanism has high memory requirements, which makes it hard to apply effectively to long sequences [22]. Therefore, we introduce the Lambda layer in the LCSE model to model the internal relationships of context elements at a lower memory cost. On the other hand, these two models are insufficient in capturing short-term preference information. The DIN model pays little attention to short-term preferences, while the STAMP model only emphasizes the importance of the latest click. If the last click was a mistake, the prediction results may be adversely affected. There is still room for optimization in the capture of short-term preferences.

The Lambda Layer-Based Convolutional Sequence Embedding Model
The LCSE model proposed in this paper handles the user's short-term preferences and general preferences separately. We use convolutional neural networks to capture the user's short-term preferences at the union-level and point-level, use a Lambda layer-based convolutional neural network to fuse the user's general preferences, and finally obtain the prediction result from the captured long-term and short-term preference information. The structure is shown in Fig. 1. The core of the proposed LCSE model consists of the following four parts: (1) embedding service features through the embedding layer; (2) capturing the short-term interest features of users through a convolutional neural network at the union-level and point-level; (3) capturing the long-term interest features of users through a Lambda layer-based convolutional neural network at the union-level and point-level; (4) obtaining the final CTR prediction results through the fully connected layer. Next, we discuss each part of the model in detail.

The Embedding Layer
The LCSE model predicts the probability that the user will next click on the target service $i_t$ mainly from the user's click sequence $I = \{i_{t-n}, i_{t-n+1}, \ldots, i_{t-1}\}$ (where $n$ is the length of the user's historical click sequence), so the click sequence $I$ and the target service $i_t$ are fed into the model, where the target service $i_t$ contains $k$ features.
The main function of the embedding layer is to encode attributes as vectors. Compared with the commonly used one-hot encoding, the embedded vectors have lower dimensionality and are updated while the neural network is trained. We assume that, after embedding, each service feature has dimension $d$, and the $k$ features of service $i$ are concatenated to form the embedding vector $e$.
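As a minimal sketch of the embedding step described above, the NumPy snippet below looks up a $d$-dimensional vector for each of the $k$ features of a service and concatenates them into $e$. The vocabulary sizes, the random initialization and the names `make_tables` and `embed_service` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_tables(vocab_sizes, d):
    """One lookup table per feature field; in training these would be learned."""
    return [rng.normal(scale=0.01, size=(size, d)) for size in vocab_sizes]

def embed_service(feature_ids, tables):
    """Concatenate the d-dimensional embeddings of the k features of one service."""
    return np.concatenate([tables[j][fid] for j, fid in enumerate(feature_ids)])

# Assumed feature fields: item id, category, brand (k = 3), each embedded with d = 8.
vocab_sizes = [1000, 50, 20]
d = 8
tables = make_tables(vocab_sizes, d)

e = embed_service([42, 7, 3], tables)                  # shape (k * d,)
history = [[1, 2, 3], [42, 7, 3], [5, 0, 19]]          # three clicked services
click_sequence = np.stack([embed_service(ids, tables) for ids in history])
```

Stacking the embedded services row by row gives the matrix form of the click sequence that later layers operate on.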

The Lambda Layer
The Lambda network [22] is a linear attention mechanism proposed by Bello. Compared with the self-attention mechanism proposed by Bahdanau et al. [23], it has lower time and space complexity. Compared with the recent linear attention mechanisms proposed by Katharopoulos et al. [24] and Choromanski et al. [25], it can model the inner relationships between elements within the data. Lambda has achieved good performance in areas such as image classification, demonstrating that it handles massive data efficiently. In an e-commerce scenario with a similarly large amount of data, where each user has a long historical click sequence, the Lambda layer can also give full play to its advantages in CTR prediction.
Similar to other attention mechanisms, the Lambda layer's query matrix $Q$, key matrix $K$ and value matrix $V$ are obtained by linear mapping from the input $X$ and the context $C$, respectively (equation (1)). Here we use the user's embedded click sequence as both the input $X$ and the context $C$, and add batch normalization after the linear mapping to improve the training of the model. For comparison, the common scaled dot-product attention is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $d_k$ is the key dimension.
The subsequent operations differ from other attention mechanisms in that the contribution of each context element $m$ consists of two parts: a content-based contribution $\mu^c_m$ (equation (2)) and a location-based contribution $\mu^p_m$ (equation (3)). The content information is obtained by normalizing $K$ with the softmax function (equation (4)), and the location information is obtained by encoding positions with sine and cosine functions (equation (5)). The Lambda function is the sum of the two parts (equation (6)).
Finally, we apply the Lambda function to the query matrix $Q$ to get the output $O$ of the Lambda layer (equation (7)). The time complexity of the entire Lambda layer is $O(n)$.
In implementing the Lambda layer, we also adopt a multi-head form, which enhances the learning ability of the model by splitting and merging. The multi-head mechanism learns features in $H$ subspaces separately and then concatenates the results of the $H$ subspaces to obtain better learning results, although a larger $H$ is not always better.
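The following NumPy sketch illustrates why a Lambda-style layer scales linearly with the sequence length: the keys and values are first summarized into one small matrix, which is then applied to every query. It is a single-head illustration under stated assumptions, not the paper's implementation: batch normalization and the multi-head split are omitted, the position term uses an absolute sine/cosine table added to the normalized keys (how the paper combines the position encoding is not recoverable from the text), and all function names and shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)

def sinusoidal_positions(m, k):
    """Sine/cosine position table of shape (m, k); k is assumed to be even."""
    pos = np.arange(m)[:, None]
    i = np.arange(k // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / k)
    table = np.zeros((m, k))
    table[:, 0::2] = np.sin(angles)
    table[:, 1::2] = np.cos(angles)
    return table

def lambda_layer(X, W_q, W_k, W_v):
    """Single-head sketch: X is the (n, d) embedded click sequence, used as input and context."""
    C = X
    Q = X @ W_q                      # (n, k) queries
    K = C @ W_k                      # (n, k) keys
    V = C @ W_v                      # (n, v) values
    K_bar = softmax(K, axis=0)       # normalize keys over context positions
    lam_content = K_bar.T @ V        # (k, v) content summary
    P = sinusoidal_positions(C.shape[0], K.shape[1])
    lam_position = P.T @ V           # (k, v) position summary (assumed form)
    lam = lam_content + lam_position
    return Q @ lam                   # (n, v); no n-by-n attention map is formed

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # n = 6 clicks, d = 4
W_q = rng.normal(size=(4, 2))
W_k = rng.normal(size=(4, 2))
W_v = rng.normal(size=(4, 3))
O = lambda_layer(X, W_q, W_k, W_v)   # shape (6, 3)
```

Because `lam` has a fixed size of k by v, both building it and applying it to the queries cost time proportional to the number of clicks, whereas scaled dot-product attention materializes an n-by-n weight matrix.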

The Convolutional Neural Network Layer
The structure shown in Fig. 2 includes two convolutional neural network (CNN) layers [26], which are good at processing sequential information. The data sequence can enter a CNN layer directly after being processed by the embedding layer, or it can be further processed by the Lambda layer before entering a CNN layer. The former CNN layer is used to capture short-term user preferences and the latter is used to capture long-term user preferences. Both layers capture the union-level and point-level information of the input sequence for feature extraction.
Figure 3 shows the structure of the convolution layer. The user click sequence, whether processed by the embedding layer or by the attention mechanism, is always a matrix $E \in \mathbb{R}^{L \times d}$, where $L$ is the length of the user's historical click sequence and $d$ is the dimension of a service after embedding. The matrix $E$ can therefore be convolved in a way similar to image processing.
The horizontal filter captures union-level information. It is mainly used to aggregate the joint features of multiple services along the embedding dimension, so that the historical information recently left by the user on the e-commerce platform jointly decides the next prediction result. $F^k_h \in \mathbb{R}^{h \times d}$ $(1 \le k \le n)$ is one of the $n$ horizontal filters, where $d$ is the dimension of a service after embedding. The convolution value $c^k$ obtained by sliding the $k$-th horizontal filter $F^k_h$ over the matrix $E$ for $i$ $(1 \le i \le L - h + 1)$ steps is shown in equation (8), where the product operator denotes the inner product and $\phi(\cdot)$ is the activation function. After the convolution operation, the most important feature $f_H$ (equation (9)) extracted by the horizontal filters is obtained by max pooling.
The vertical filter captures point-level information mainly by weighted summation. It splits the user's click history into multiple single events, which affect the prediction results separately. Suppose there are $\tilde{n}$ vertical filters, each $\tilde{F}^k \in \mathbb{R}^{L \times 1}$ $(1 \le k \le \tilde{n})$, where $L$ is the length of the user's historical click sequence. The convolution value $\tilde{c}^k$ obtained by sliding the vertical filter $\tilde{F}^k$ over $E$ for $d$ steps is shown in equation (10), where $l$ denotes the corresponding row of the matrix. Finally, $\tilde{c}^1, \tilde{c}^2, \ldots, \tilde{c}^{\tilde{n}}$ are combined and output as a feature vector $f_V$ of size $\tilde{n} \times d$, as shown in equation (11).
In the next step, we concatenate the feature $f_H$ extracted by the horizontal filters and the feature $f_V$ extracted by the vertical filters to get the output of the convolutional neural network layer.
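The two filter types can be sketched in NumPy as follows. This is an illustration of the description above rather than the paper's code: ReLU is assumed for the activation $\phi$, the vertical output is flattened to length $\tilde{n} \cdot d$ before concatenation, and the function names are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def horizontal_features(E, filters, act=relu):
    """Each (h, d) filter slides down the rows of E; max pooling keeps one value per filter."""
    L = E.shape[0]
    pooled = []
    for F in filters:
        h = F.shape[0]
        c = np.array([act(np.sum(E[i:i + h] * F)) for i in range(L - h + 1)])
        pooled.append(c.max())
    return np.array(pooled)                      # shape (n,)

def vertical_features(E, filters):
    """Each length-L filter produces a weighted sum of the rows of E, i.e. a d-vector."""
    return np.concatenate([F @ E for F in filters])   # shape (n_tilde * d,)

def conv_layer(E, h_filters, v_filters):
    """Concatenate union-level (horizontal) and point-level (vertical) features."""
    return np.concatenate([horizontal_features(E, h_filters),
                           vertical_features(E, v_filters)])

E = np.arange(12, dtype=float).reshape(4, 3)     # L = 4 clicks, d = 3
h_filters = [np.ones((2, 3))]                    # one horizontal filter with h = 2
v_filters = [np.array([1.0, 0.0, 0.0, 0.0])]     # one vertical filter selecting the first click
out = conv_layer(E, h_filters, v_filters)
```

With these toy filters, the horizontal branch returns the largest two-row window sum and the vertical branch returns the first row of `E`, which makes the difference between joint (union-level) and per-item (point-level) aggregation easy to see.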

The Fully Connected Network Layer
The input of the fully connected layer is the vector $v = [s, c, e_t]$ obtained by concatenating the output vector $s$ of the convolutional neural network that captures the user's short-term interest features, the output vector $c$ of the attention-based convolutional neural network that captures the user's long-term interest features, and the feature vector $e_t$ of the target service. The vector $v$ is processed by a Batch Normalization (BN) layer before being fed into the fully connected neural network. BN aims to alleviate the vanishing gradient problem during training by adjusting the data distribution, and to accelerate network convergence.
We use PReLU as the activation function (equation (12)) in the fully connected layer, and Sigmoid as the activation function (equation (13)) in the output layer after the fully connected layer to get the predicted probability $\hat{y}$ of a user clicking on the target service.

$$\mathrm{PReLU}(x) = \begin{cases} x, & x > 0 \\ a x, & x \le 0 \end{cases} \qquad (12)$$

Here $a$ is a learnable slope. Finally, we use the cross-entropy loss function to calculate the loss of the model and learn the model parameters, as shown in equation (14), where $y$ represents whether the user clicks on the target service and $Y$ represents the sample set.
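The activation and loss functions named above can be written out in a few lines of plain Python. This is a sketch under assumptions: the slope value `a=0.25` is only a common initial value (in practice it is learned), the loss is averaged over the samples, and the clipping constant `eps` is added for numerical safety; none of these details are stated in the paper.

```python
import math

def prelu(x, a=0.25):
    """PReLU: identity for positive inputs, slope a otherwise (a is learned in practice)."""
    return x if x > 0 else a * x

def sigmoid(z):
    """Maps the output-layer logit to a click probability (no overflow guard for very negative z)."""
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy between click labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)      # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

hidden = [prelu(x) for x in [-2.0, 3.0]]     # negative input is scaled, positive passes through
y_hat = sigmoid(0.0)                         # a zero logit gives probability 0.5
loss = cross_entropy([1, 0], [0.5, 0.5])     # uninformative predictions give ln 2
```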

Experiment and Analysis

Datasets
Public datasets are used in the experiments of this paper. To better simulate actual e-commerce scenarios, we selected two datasets from the Amazon dataset [27] and one from the IMDB dataset [28].
1) The Amazon dataset mainly records the service information and user comments on the Amazon e-commerce website. It contains product categories such as books and electronics. In this experiment, we use the two categories Movie & TV and Electronic as our datasets. We regard a user's comment on a product as a click behavior, and randomly add items that the user has not commented on as negative samples of the user's click events. The ratio of positive to negative samples is 1:1.
2) The movie rating dataset mainly records the movie information in IMDB and the users' movie ratings. In the experiments, we choose MovieLens-1M as the experimental dataset. Similarly, we take a user's rating of a movie as a click behavior, and randomly add movies that the current user has not rated as negative samples of the user's click events. The ratio of positive to negative samples is also 1:1.
Among the many kinds of network services, preferences for movies and electronic products are especially determined by personal taste, and user interest characteristics are more pronounced in these scenarios, which makes it easier for the model to learn users' diverse personal preferences [28]. Therefore, this paper conducts experiments in the application scenarios corresponding to these three datasets, so that the model can play its full role and its accuracy can be tested. The basic information of the three datasets is shown in Table 1.
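The 1:1 negative sampling used for both datasets can be sketched as below. The data layout (a dictionary from user to interacted items), the fixed seed and the function name `add_negative_samples` are assumptions made for illustration.

```python
import random

def add_negative_samples(user_items, all_items, seed=0):
    """For each user, pair every observed item (label 1) with one random unseen item (label 0)."""
    rng = random.Random(seed)
    samples = []
    for user, items in user_items.items():
        seen = set(items)
        candidates = [i for i in all_items if i not in seen]
        # Draw as many negatives as positives, capped by the number of unseen items.
        negatives = rng.sample(candidates, k=min(len(items), len(candidates)))
        samples += [(user, i, 1) for i in items]
        samples += [(user, i, 0) for i in negatives]
    return samples

user_items = {"u1": [1, 2], "u2": [3]}       # toy interaction log
all_items = list(range(10))
samples = add_negative_samples(user_items, all_items)
```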

Evaluation Criteria
Here, we use the Area Under the ROC Curve (AUC) and logloss as our evaluation indicators. AUC is the probability that the model scores a randomly selected positive sample higher than a randomly selected negative sample. Models with an AUC value closer to 1.0 perform better in prediction. Logloss measures how close the model's predictions are to the target results. The closer the logloss is to 0, the more accurate the model's predictions are.
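Both indicators follow directly from their definitions, as the plain-Python sketch below shows. The pairwise AUC computation is quadratic in the number of samples and is meant only to mirror the probabilistic definition; counting ties as one half and clipping probabilities with `eps` are conventional choices assumed here.

```python
import math

def auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive is scored higher; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def logloss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

auc_value = auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])   # 3 of 4 pairs ranked correctly
ll_value = logloss([1, 0], [0.9, 0.1])                # confident, correct predictions
```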

Comparative Experiment
This experiment mainly compares the LCSE model with the following models:
1) Attentional Factorization Machine (AFM) model [29]: The AFM model learns the importance of second-order combined features by introducing an attention mechanism.
2) Wide & Deep Learning (WDL) model [15]: The WDL model captures features through a machine learning-based Wide model and a deep learning-based Deep model, and then performs joint training to obtain prediction results.
3) Deep & Cross Network (DCN) model [16]: The DCN model mainly combines feature crossing through the Cross network with the results of a fully connected feed-forward neural network to obtain the final prediction probability.
4) Automatic Feature Interaction Learning via Self-Attentive Neural Networks (AutoInt) model [30]: The AutoInt model learns high-order feature interactions through multi-head self-attentive neural networks.
5) eXtreme Deep Factorization Machine (xDeepFM) model [18]: The xDeepFM model improves the memory and generalization ability of the prediction model through a compressed interaction network, which makes the feature interactions occur at the vector level.
6) Deep Interest Network (DIN) model [20]: The DIN model uses the attention mechanism to discover the user's interest preferences in the click sequence to predict the click rate.
7) Caser model [12]: The Caser model uses horizontal and vertical filters in a convolutional neural network to capture union-level and point-level information from the user's recent click sequence.
The first five comparison models are all based on feature interaction learning, whereas the last two comparison models and the LCSE model proposed in this paper are based on sequence modeling. By introducing a variety of feature learning models and comparing them with sequence modeling, we can judge the relative importance of feature learning and sequence modeling, and thus confirm the superiority of the LCSE model based on sequence modeling more comprehensively. On this basis, comparing LCSE with the DIN model reflects the impact of learning short-term user preferences on model performance, and comparing LCSE with Caser reflects the effect of general preferences on model performance. Compared with these two models, LCSE adds the extraction of union-level information to the convolution layer, optimizes the attention mechanism, and adopts an improved Lambda layer. The validity of the overall design of the LCSE model is tested in the comparative experiments below, and the effect of each component of the LCSE model is verified in the ablation study.
In terms of parameter settings, AFM, WDL, DCN, AutoInt, xDeepFM, DIN, Caser and LCSE use the same number and dimensions of fully connected layers: the number of layers is 3 and the dimensions are 256, 128 and 64, respectively. Each model uses the Adam optimizer with learning rate $\alpha = 0.001$, $\beta_1 = 0.9$ and $\beta_2 = 0.999$.
The results of the experiments on the three public datasets are shown in Table 2. The three sequence modeling models perform better than the feature learning models, indicating that sequence modeling performs better in these experimental scenarios. Regarding the evaluation index AUC, LCSE has higher values than the other comparison models, which shows the superiority of the LCSE model.
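To make the optimizer settings concrete, the sketch below performs one Adam update on a single scalar parameter with the stated hyperparameters. The stabilizing constant `eps=1e-8` is an assumed common default that the paper does not report, and the function name `adam_step` is hypothetical.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad            # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First step from zero moments: the update size is close to the learning rate itself.
theta, m, v = adam_step(theta=1.0, grad=2.0, m=0.0, v=0.0, t=1)
```

On the first step the bias-corrected ratio is approximately the sign of the gradient, so the parameter moves by about 0.001 regardless of the gradient's magnitude.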

Ablation Study
Ablation experiments evaluate the effectiveness of a model by modifying or removing parts of the model's structure. The ablation study designed in this paper consists of three parts: (1) unlike other attention-based convolutional neural networks, the LCSE model captures features at both the union-level and the point-level when using convolutional neural networks, so the effectiveness of the horizontal convolution kernels needs to be demonstrated; (2) the LCSE model uses convolutional neural networks to capture users' short-term preferences and a Lambda layer-based convolutional neural network to capture users' general preferences, so the influence of each kind of preference information on the prediction results needs to be shown; (3) the LCSE model uses the Lambda layer instead of the commonly used scaled dot-product attention mechanism, so the superiority of the Lambda layer needs to be shown.
To demonstrate the effectiveness of the union-level information captured by the horizontal convolution kernels, the LCSE model with its two horizontal convolution kernels is compared with the LCSE model with both horizontal convolution kernels removed (Model A) and the LCSE model with only the horizontal convolution kernel above the Lambda layer removed (Model B). Model A serves as the baseline. As Table 3 shows, going from Model A to Model B adds the capture of union-level information in the user click sequence coming from the embedding layer, and going from Model B to the LCSE model adds the capture of union-level information in the data passing through the Lambda layer. As the two horizontal convolution kernels are added step by step, both the AUC value and the cross-entropy improve significantly, proving the effectiveness of the two kernels. The experimental results also confirm that introducing horizontal convolution kernels to capture union-level information has a positive effect on the overall performance of the model.
To show the impact of short-term preferences and general preferences on CTR prediction results and the superiority of the Lambda layer, we designed the following three comparative models. Figure 4 shows the Caser model, Lambda+CNN model, ACSE model and LCSE model.
1) Caser model (Fig. 4(a)): a CTR prediction model that only focuses on capturing the short-term interests of users through convolutional neural networks. This model is the benchmark model for comparison in this ablation experiment.
2) Lambda+CNN model (Fig. 4(b)): a model that captures user interest only through the convolutional neural network on the improved Lambda layer, focusing on capturing the user's general preferences.
3) Attention-based Convolutional Sequence Embedding (ACSE) model (Fig. 4(c)): unlike the LCSE model, the ACSE model uses the more common scaled dot-product attention mechanism instead of the Lambda layer, with both short-term and long-term preferences captured.
4) Lambda layer-based Convolutional Sequence Embedding (LCSE) model (Fig. 4(d)): our proposed model.
The results of the contrastive experiments are shown in Table 4, and we can draw the following conclusions:
1) The LCSE model performs better than the Caser model and the Lambda+CNN model on the three datasets. The ACSE model, which also combines short-term and general preferences on the basis of the Caser model, outperforms the Caser model as well, confirming the effectiveness of combining short-term and general preferences.
2) Compared with the ACSE model, the LCSE model performs better on the three selected datasets, which shows the superiority of the Lambda layer in terms of prediction results.
In addition, we compared the training time of LCSE and ACSE under different embedding layer dimensions and maximum user history sequence lengths in the following environment: Windows 10; TensorFlow 2.1.0; AMD 3600 CPU; GTX 1080Ti GPU. The results in Fig. 5 show that the training time of LCSE is shorter, indicating that the LCSE model also trains more efficiently.

Conclusion
With the continuous development of the Internet, predicting the user's next behavior through click-through rates and delivering products or advertisements precisely are embodiments of economic intelligence. In the fields of computational advertising and recommender systems, e-commerce platforms hope to grasp users' short-term and long-term preferences, so as to push messages through customized business strategies.
To address the problem that the capture of general preferences has been neglected, this paper proposes a CTR prediction model based on Lambda layer convolutional sequence embedding. The LCSE model uses a convolutional neural network and a Lambda layer-based convolutional neural network to capture the user's short-term and general preferences respectively, and predicts the user's next click behavior by combining these two types of preferences. The Lambda layer in the LCSE model is a type of attention mechanism. Compared with the common attention mechanism, the Lambda layer has lower time complexity and yields better prediction results.
Future work can further optimize the capture of both short-term and general preferences. For short-term preferences, there is still room for development in the identification of false clicks and the removal of outliers. For long-term preferences, we can continue to draw on research on linear attention mechanisms and further optimize the Lambda layer to improve the overall performance of the model in terms of computational efficiency and prediction accuracy. Beyond general and short-term preferences, more detailed learning of how user interest drifts over time is also a future research direction, which can promote the customized development of network services and better demonstrate big data intelligence.