An LSTM-Hawkes hybrid model for posterior click distribution forecasting in the advertising network environment

In the field of advertising technology, forecasting the posterior click distribution is a key task, since 66% of advertising transactions depend on the cost-per-click model. However, due to the General Data Protection Regulation, machine learning techniques that forecast the posterior click distribution from the action sequences of an identified user are restricted in European countries. To overcome this barrier, we introduce a contextual behavior concept for the advertising network environment and propose a new hybrid model, the Long Short-Term Memory-Hawkes (LSTM-Hawkes) model, which combines a stochastic generative model with a machine-learning-based predictive model. In addition, to meet the computational efficiency required by the heavy demand of the mobile advertising market, we define a gradient exponential kernel with just three hyperparameters that minimizes residuals. We carefully tested the proposed model with production data and found that the LSTM-Hawkes model reduces the Mean Squared Error (MSE) by at least 27.1%, and by up to 83.8% on average, compared to the existing Hawkes-process-based algorithm, the Hawkes Intensity Process (HIP), as well as by 39.77% on average compared to Multivariate Linear Regression. We also found that the proposed model improves forecast accuracy by about 21.2% on average.


Introduction
For the first time, in 2017 global digital advertising expenditure reached 41% of the global advertising market, exceeding TV advertising expenditure by 6%. In particular, mobile advertising in digital marketing recorded a 37.6% global growth rate, an impressive increase that has drawn attention across the worldwide advertising market [1]. These reports [2][3][4] substantiate that the mobile advertising market is leading the growth of the global advertising market. However, despite this rapid market growth, there is a controversial issue that Ad Tech faces: the General Data Protection Regulation (GDPR) [5] has been actively enforced by the EU since 25 May 2018.

Related work
The typical strategy of user targeting is based on analyzing the behavior of users across the internet or across the individual members of an ad network [6,7,8]. Another strategy, which predicts whether someone is going to click on an impression, is matrix factorization, inferring the relationship between a user vector and a context vector where the output is the click-through ratio (CTR) [9,10,11]. However, both traditional methods require user information, or inferred user identification, as the essential asset to forecast click popularity or probability. Tracking user behavior between different ad tech parties has been forbidden in the EU, and personal information transfer is likewise banned. Thus, neither type of traditional method is technically feasible as a real-world application under GDPR. To overcome this barrier, we focus on an advertising network's posterior click distribution, not a user's CTR or an advertisement's CTR, since 1) an advertising network or a publisher is still an identifiable user group that keeps individual users anonymous, and 2) CTR only reflects short-term user engagement. We developed a model that can learn a general and efficient representation of the underlying dynamics from the event history in a hyper-parametric form.

Advertising network environment
An advertising network's environment consists of three main parts. First, there is the advertiser, who wants to advertise and provides the actual advertisement contents. Next, there is the publisher, who provides the landing pages for the advertisement contents. Lastly, there is the advertising network, which connects the advertiser and publisher together. Advertisers are grouped and managed in a system called the 'Demand Side Platform' (DSP) and publishers are grouped together in a system called the 'Supply Side Platform' (SSP). Fig 1 describes the Ad Tech environment and its layers. Due to the GDPR, user information, including advertising IDs, IP addresses, or any other unique device IDs from a mobile device, cannot be transferred between layers.
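The layered structure above can be sketched as a minimal data model. The class and field names below are our own illustrative choices, not part of any real ad-tech API; the point is simply that the network pairs demand and supply sides while no user-level identifier crosses layers.

```python
from dataclasses import dataclass, field

@dataclass
class Advertiser:          # demand side: supplies the ad contents
    name: str
    creatives: list = field(default_factory=list)

@dataclass
class Publisher:           # supply side: provides the landing pages
    name: str
    placements: list = field(default_factory=list)

@dataclass
class AdNetwork:
    """Connects DSP-side advertisers with SSP-side publishers.

    Note: no user-level identifiers (advertising ID, IP address,
    device ID) are stored or passed between layers, per GDPR.
    """
    demand_side: list = field(default_factory=list)   # advertisers (DSP)
    supply_side: list = field(default_factory=list)   # publishers (SSP)

    def match(self):
        # toy pairing of every advertiser with every publisher
        return [(a.name, p.name)
                for a in self.demand_side
                for p in self.supply_side]

net = AdNetwork(demand_side=[Advertiser("acme")],
                supply_side=[Publisher("news_app")])
print(net.match())  # [('acme', 'news_app')]
```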

Formulation and Preliminary Theory
When the number of trials in a Bernoulli process, whose random variable is the time of occurrence (or success) of a Bernoulli trial, approaches positive infinity, it is infeasible to calculate the probability of a specific event occurrence due to the computational complexity. Therefore, in modeling a posterior distribution for either prediction or simulation, most conventional statistical approaches adopt the 'Binomial Approximation to Poisson', which helps find the probability of an independent trial in various fields [13][14][15]. However, the Binomial Approximation to Poisson has the Memoryless Property [16], because each trial is independent (Independent Increments) and the probability does not change (Stationary Increments). Thus, only event processes with non-overlapping intervals can be modeled with the Binomial Approximation to Poisson.
In a click event distribution, where the event occurrence times are t_i of the set T = {t_1, t_2, . . ., t_n}, the inter-arrival time l_i of the set L = {l_1, l_2, . . ., l_n} can be written as t_{n-1} ≤ l_n < t_n ⇒ l_n = [t_{n-1}, t_n). We can then easily find concurrent occurrences in any real-world advertising click distribution. To address the memoryless property and handle the overlapping intervals, we propose a generative model based on a self-exciting process, a type of non-homogeneous process, called the Hawkes process [17]. In this chapter, we provide a mathematical induction from the binomial distribution to the Hawkes process to derive the memory kernel of Hawkes.
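The inter-arrival construction can be sketched in a few lines; `clicks` is a hypothetical sorted list of click timestamps, and a zero-length inter-arrival time is exactly the kind of concurrent occurrence that a memoryless model cannot accommodate.

```python
def inter_arrival_times(events):
    """Return l_n = t_n - t_{n-1} for a sorted list of event times."""
    return [t1 - t0 for t0, t1 in zip(events, events[1:])]

def has_concurrent_events(events):
    """Concurrent clicks produce zero-length inter-arrival intervals."""
    return any(l == 0 for l in inter_arrival_times(events))

clicks = [0.0, 0.5, 0.5, 1.75, 2.0]   # two clicks arrive at t = 0.5
print(inter_arrival_times(clicks))     # [0.5, 0.0, 1.25, 0.25]
print(has_concurrent_events(clicks))   # True
```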
Binomial approximation to Poisson. Suppose that a random variable X follows the binomial distribution B(n, p), and that the expected value of X is λ. When n approaches positive infinity, λ approximates np, and by the binomial approximation to the Poisson distribution, the probability of X approximates the probability mass function of a Poisson distribution.
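A quick numerical check of this approximation; the values of n and p are illustrative (e.g. λ = 3 expected clicks per time unit).

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)

# With n large and p small (lam = n * p fixed), the two PMFs converge.
n, p = 10_000, 0.0003
lam = n * p          # 3.0
for k in range(5):
    print(k, round(binom_pmf(k, n, p), 6), round(poisson_pmf(k, lam), 6))
```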
Process. When the time unit is expanded or reduced by a factor t (t > 0), the average number of event occurrences during the updated time unit becomes λt, where λ = np is the average number of occurrences per default time unit. A distribution with a time unit expanded (or reduced) by t is called a Process, and correspondingly, the distributions of the process become the Poisson and exponential distributions. Suppose that the waiting time for a specific event is a random variable t and that T is the time when the specific event X_T takes place. Under this circumstance, the probability that the event X_T occurs after t is equal to the probability that X_T does not take place within t. From the PMF of the Poisson distribution, P(N(t) = k) = (λt)^k e^(−λt) / k!, the probability that no event occurs within t is

S(t) = P(T > t) = P(N(t) = 0) = e^(−λt) (2)

Setting Eq (2) as S(t), the probability that the event X_T occurs within t is 1 − S(t), which is the cumulative distribution function (CDF) of the random variable t. Setting this function as F(t):

F(t) = 1 − S(t) = 1 − e^(−λt) (3)

Since the derivative of the CDF is the PDF, the PDF of the random variable t is

f(t) = λ e^(−λt) (4)

Finally, we conclude that the random variable t follows the exponential distribution, with f(t) and F(t) as PDF and CDF respectively.

Fig 1. Advertising network architecture. Using Charles, a debugging proxy server application [12], we captured and analyzed packets from an advertising network and found that the advertising network environment has a multilayer structure with the cycle flow shown above. The advertising network environment is composed of advertisers (DSP), suppliers (SSP), and advertising networks (Ad Networks). https://doi.org/10.1371/journal.pone.0232887.g001
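A small simulation can confirm both Eq (2) and the memoryless property; λ, the sample size, and the evaluation points are arbitrary illustrative choices.

```python
import random
from math import exp

random.seed(7)
lam = 3.0                               # expected events per time unit
waits = [random.expovariate(lam) for _ in range(200_000)]

# Survival S(t) = P(T > t) should match e^{-lam * t}  (Eq 2)
t = 0.4
empirical = sum(w > t for w in waits) / len(waits)
print(round(empirical, 3), round(exp(-lam * t), 3))

# Memorylessness: P(T > s + t | T > s) == P(T > t)
s = 0.2
survivors = [w for w in waits if w > s]
cond = sum(w > s + t for w in survivors) / len(survivors)
print(round(cond, 3))   # close to the empirical survival above
```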
Non-homogeneous process. The lambda intensity function λ(t) determines non-homogeneity. If the event dynamics have positive covariance, the intensity function becomes non-homogeneous, known as self-exciting; otherwise it is homogeneous.
Mathematically, the intensity function of a non-homogeneous process is defined as the instantaneous rate at which events occur [18]. However, in a real-world implementation of Hawkes, it technically has to be the conditional probability of a specific event occurrence during Δt, which can be interpreted as the time unit, at time t given that the event has not taken place by t. Thus, we suppose that the limit of the time unit does not go to zero. We specify each event with the index variable i and suppose that the set of event times is the history H_t. Therefore, we can derive the following equation, Eq (5), from Eqs (2), (3) and (4),
where H_t = {t_1, t_2, . . ., t_i} and N(T), a counting process, is the number of event occurrences during T.

LSTM-Hawkes hybrid model
The maximum likelihood estimation using the exponential kernel has a problem with residuals, because they become multiplied by real numbers depending on the batch size. We solved this problem by scaling the cumulative intensity value of the kernel with a differential coefficient γ that minimizes the residuals, defining the Gradient Exponential Kernel, Eq (10). Also, to forecast the posterior event time t_i, we used an LSTM [19] instead of the Thinning algorithm [20], which has been dominantly used with Hawkes, for accuracy enhancement.
In this chapter, we first list all variables and parameters used in the proposed model and define the memory kernel that we propose. Second, in the subsection 'Sampling Method', we describe the LSTM architecture and show test results that empirically prove the LSTM is more accurate than the Thinning algorithm. Lastly, we propose the LSTM-Hawkes Forecast algorithm [Flow Diagram 1], which combines the generative model with LSTM prediction to draw the posterior process of advertising click events.
Hawkes process and its memory kernel. The Hawkes process is self-exciting and satisfies Eq (5), in which the intensity rate Eq (6) explicitly and proportionally increases depending on past event occurrences. The Hawkes process is formed as Eq (7), where μ(t) is the background intensity based on the observed activities, which can be represented in two ways: 1) the expected intensity at t based on the observed endogenous events (in our case, clicks), or 2) the background intensity describing arrivals of events triggered by exogenous events (in our case, conversions). The hyperparameter α increases the intensity rate by α at each t_i, after which the intensity decays exponentially back toward μ(t). Both parameters are estimated by maximum likelihood, and the initial value μ(0) is set to the expected CTR per time unit in this research.
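A minimal sketch of this intensity, assuming the standard exponential-kernel form λ(t) = μ + α Σ_{t_i < t} e^(−δ(t − t_i)); the parameter values and event times below are illustrative, not fitted.

```python
from math import exp

def hawkes_intensity(t, history, mu, alpha, delta):
    """lambda(t) = mu + alpha * sum_{t_i < t} exp(-delta * (t - t_i)).

    mu:    background intensity
    alpha: intensity jump contributed by each past event
    delta: exponential decay rate of the memory kernel
    """
    return mu + alpha * sum(exp(-delta * (t - ti))
                            for ti in history if ti < t)

history = [1.0, 2.0, 2.1]   # past click times (illustrative)
print(hawkes_intensity(0.5, history, mu=0.2, alpha=0.8, delta=1.5))  # 0.2: no past events yet
print(hawkes_intensity(3.0, history, mu=0.2, alpha=0.8, delta=1.5))  # elevated by the burst near t = 2
```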
Commonly used memory kernels for the Hawkes process are the Exponential kernel, the original model proposed by Hawkes, and the Power Law kernel, frequently used in social network models [21][22][23] and proposed by Ozaki [24]. In this paper, we have chosen the exponential kernel, Eq (8), to achieve better-case time complexity, since the number of elementary operations performed increases with the number of variables to approximate. The compared algorithm, Eq (14), in our experiments adopts the power law kernel, Eq (9).
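The complexity advantage of the exponential kernel comes from a well-known recursion: the kernel sum at successive event times can be updated in O(1), so evaluating it at every event costs O(n) rather than O(n²). A sketch under that standard formulation (event times and δ are illustrative):

```python
from math import exp

def A_naive(m, times, delta):
    """A(m) = sum_{i < m} exp(-delta * (t_m - t_i)): O(n) per call, O(n^2) total."""
    return sum(exp(-delta * (times[m] - times[i])) for i in range(m))

def A_recursive(times, delta):
    """Same quantity for every m in one O(n) pass:
    A(m) = exp(-delta * (t_m - t_{m-1})) * (1 + A(m-1)),  A(0) = 0."""
    out = [0.0]
    for m in range(1, len(times)):
        out.append(exp(-delta * (times[m] - times[m - 1])) * (1.0 + out[-1]))
    return out

times = [0.3, 1.1, 1.4, 2.9, 3.0]
rec = A_recursive(times, delta=1.2)
print(all(abs(rec[m] - A_naive(m, times, 1.2)) < 1e-12
          for m in range(len(times))))  # True
```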
Gradient exponential kernel. The Gradient exponential kernel is defined as Eq (10) and implemented with consecutive while loops, notated in the pseudo code of Algorithms 1, 2, 3, and 4, where the domain of the memory kernel satisfies t ∈ eP. Hyper parameter estimation. The definition of the lambda intensity function λ(t) for non-homogeneous processes was presented in the subchapter 'Formulation and Preliminary Theory' as Eq (6), and we derived our own memory kernel as Eq (10). By minimizing the negative log likelihood (NLL) over the observed data, the hyper parameters of the memory kernel can be approximated. Eqs (2) and (3) can also be redefined in terms of f′(t) as S′(t) and F′(t) respectively. Correspondingly, the likelihood function for Hawkes can be derived as Eq (11) and its log likelihood function as Eq (12), following Rubin [25,26]. Therefore, when eP_j contains the domain elements for the random variable t with PDF f′(t) and the memory kernel is Eq (10), we can derive the NLL function for the Gradient Exponential Kernel, where A(m) = Σ_{t_i < eP_m} e^(−δ(eP_m − t_i)) for i ≥ 2, t_i denotes the event time, and A(1) is equal to 0. In our proposed model, optimization proceeds by minimizing the negative log likelihood, Eq (12).
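A minimal sketch of NLL minimization for the plain exponential-kernel Hawkes process (not the full Gradient Exponential Kernel with γ and the evaluation-point grid eP, which the algorithms above add). The standard NLL is the compensator over [0, T] minus the sum of log-intensities at event times; the event times and the parameter grid are illustrative, and a production fit would use a proper optimizer.

```python
from math import exp, log, isfinite

def hawkes_nll(mu, alpha, delta, times, T):
    """NLL = Lambda(T) - sum_i log lambda(t_i), with the kernel sum A(i)
    maintained by the O(n) recursion A(i) = e^{-delta*dt} * (1 + A(i-1))."""
    nll = mu * T                  # background part of the compensator
    A = 0.0
    for i, t in enumerate(times):
        if i > 0:
            A = exp(-delta * (t - times[i - 1])) * (1.0 + A)
        nll -= log(mu + alpha * A)                               # event term
        nll += (alpha / delta) * (1.0 - exp(-delta * (T - t)))   # kernel integral
    return nll

times = [0.5, 0.9, 1.0, 2.4, 2.5, 2.6, 4.0]
grid = [(mu, a, d) for mu in (0.5, 1.0, 2.0)
                   for a in (0.2, 0.8)
                   for d in (1.0, 2.0)]
best = min(grid, key=lambda p: hawkes_nll(*p, times, 5.0))
print(best, round(hawkes_nll(*best, times, 5.0), 3))
```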
Contextual behavior. To define the behavior model, we introduce the behavior concept from existing models [27,28]. User group behavior in advertising network environments is defined based on a set of activities. An action sequence, a set of endogenous and exogenous events, comprises each activity as an element. The behavior model describes how specific ad networks or publishers perform activities and how often the actions occur at different times.
LSTM. We derived the definition of Hawkes and the proposed kernel in closed form in the previous section. By introducing the required parameters, organized into sets, variables, and functions for Hawkes in Table 1, we were able to set up the NLL of our generative model. Now, the activity sequences defined in Fig 2 can be set as input data for an RNN to learn and forecast T′_i. We have chosen a one-layer LSTM and notated the required parameters as lw, the lookback window; H^(i), the history of activity sequences sampled lw in order; and g(t^(i)|H^(i)), the function in the Neural Network. In the experiments with the real data sets, we configured conversion data as the exogenous event. Fig 3 shows the architecture of the LSTM configured in our method. The forecast procedure consists of four steps, including minimizing the NLL, Eq (12); learning the differential coefficient γ, which minimizes the residual error; and forecasting the posterior event times during the prediction period to draw λ(t). These steps are described in Algorithms 1, 2, 3, and 4 as pseudo code below. In Algorithm 1, we choose evaluation points by dividing the interval [t_1, t_n) by the size of eP, which means that at the implementation level, Δt, which is t − t_i in Eq (8), is equal to τ_n. After that, the hyper parameters are defined with their initial values and optimization proceeds. In Algorithm 4, by calculating the lambda intensity with the results of the LSTM prediction T′_i, our model finally draws the cumulative intensity values for [t_1, t′_n).
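The lookback-window sampling that produces H^(i) can be sketched as below; feeding these (window, target) pairs into a one-layer LSTM is then routine. The sequence values are illustrative inter-arrival times, and `make_lstm_samples` is our own hypothetical helper name.

```python
def make_lstm_samples(sequence, lw):
    """Slice an activity sequence into (history window, next value) pairs.

    Each window H_i holds lw consecutive observations; the target is the
    value that follows, so a network can learn g(t_i | H_i)."""
    samples = []
    for i in range(len(sequence) - lw):
        samples.append((sequence[i:i + lw], sequence[i + lw]))
    return samples

# inter-arrival times of clicks (illustrative), lookback window lw = 3
seq = [0.4, 0.1, 0.3, 0.2, 0.5, 0.1]
for window, target in make_lstm_samples(seq, lw=3):
    print(window, "->", target)
# [0.4, 0.1, 0.3] -> 0.2
# [0.1, 0.3, 0.2] -> 0.5
# [0.3, 0.2, 0.5] -> 0.1
```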

Data
Criteo, an Ad Tech company, possesses cutting-edge user re-targeting technology and has led the development of prediction methods based on machine learning algorithms. As a leader in this field, Criteo also notably hosted Kaggle's CTR forecast competitions in 2014 and 2015. We applied the click and conversion data sets [29] provided by the Criteo Lab and compared the results between the proposed model/algorithm and the Hawkes Intensity Process (HIP) algorithm [21]. We evaluated the forecasting accuracy of the posterior distribution of a click event from a single advertising network.

Goodness of fit
Residual analysis [30] is a reliable measurement for the Hawkes model and has been widely used to evaluate the precision of a fit. Let t_i be a point process with intensity λ(t_i) whose PDF is Eq (10), and let s_i be equal to λ(t_i) of Eq (5); then s_i becomes a unit-rate Poisson process transformed by the Hawkes model. Thus, if the model fits well, the transformed process should resemble a unit Poisson process, and the residual inter-event times are supposed to be independent exponential variables. Therefore, the log-log plot of the residual inter-event times should be close to a linear line. Fig 4 shows the log-log plot of the residual inter-event times from the LSTM-Hawkes model; the shape of the distribution is very close to a linear line, and the residual inter-event times are concentrated in the first quintile. Thus, we can conclude that the LSTM-Hawkes model produces a precise fit.
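The transformation behind this residual analysis is the time-rescaling theorem: mapping each event time through the compensator Λ(t) should yield a unit-rate Poisson process, so the rescaled inter-event gaps should look like Exp(1) draws. A sketch for the plain exponential kernel, with illustrative parameters and event times:

```python
from math import exp

def compensator(t, history, mu, alpha, delta):
    """Lambda(t) = integral_0^t lambda(u) du for the exponential kernel:
    mu * t + (alpha / delta) * sum_{t_i < t} (1 - exp(-delta * (t - t_i)))."""
    return mu * t + sum(
        (alpha / delta) * (1.0 - exp(-delta * (t - ti)))
        for ti in history if ti < t
    )

def rescaled_gaps(times, mu, alpha, delta):
    """Inter-event times of the rescaled process s_i = Lambda(t_i)."""
    s = [compensator(t, times, mu, alpha, delta) for t in times]
    return [b - a for a, b in zip(s, s[1:])]

times = [0.5, 0.9, 1.0, 2.4, 2.5, 2.6, 4.0]
gaps = rescaled_gaps(times, mu=1.0, alpha=0.4, delta=1.5)
print(all(g > 0 for g in gaps))  # the compensator is strictly increasing -> True
```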

Compared algorithm HIP
The HIP model is a Hawkes-based model that uses the Power Law kernel, Eq (9). Since HIP mathematically induces the expectation function ξ(t) of λ(t), referring to exogenous events to predict the endogenous event, it is well suited for comparison with our model given the same input sequence. The only difference between the two models is that HIP supposes that exogenous events are given even for the predictive period, while LSTM-Hawkes does not.

Compared algorithm Multivariate Linear Regression (MLR)
MLR has been widely adopted as a baseline algorithm for popularity prediction in social media [31]. With the production data we used for the experiment, the next predicted intensity value based on the history H^(i) is calculated as below.
Commonly used error measurements such as the MSE, the Mean Absolute Error (MAE), and the Root Mean Square Error (RMSE) are designed only to measure the raw residuals. Different from these methods, the score metric [32] is devised to take negative error into account, reducing the error score when the residuals are less than a certain point. Not only does the score metric discover how many residuals the model produces, it is also able to discover how well the model fits and predicts. In the experiment, we set this certain point using the standard deviation of the observed data during the prediction time period. In the 1st test section, both a_1 and a_2 are 4.057, and in the 2nd test section both a_1 and a_2 are 7.495. The final score is the sum of each function conditioned on the distance mark.
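Since the exact score function of Eq (15) is not reproduced here, the following is only a hypothetical sketch of such an asymmetric metric: residuals under the threshold a contribute negatively, so a model with many near-misses scores lower (better) than one with the same count of large misses, which a plain MSE or MAE would not distinguish.

```python
def score_metric(residuals, a):
    """Hypothetical sketch of an asymmetric score: residuals below the
    threshold a (e.g. the std of the observed prediction window) count
    negatively, rewarding near-misses instead of only penalizing misses."""
    score = 0.0
    for r in residuals:
        err = abs(r)
        score += err if err >= a else err - a   # negative contribution below a
    return score

# two hypothetical forecasts with the same number of residuals
tight = [1.0, -2.0, 3.0, -1.5]    # mostly under the threshold
loose = [5.0, -6.0, 7.0, -5.5]
a = 4.057                          # the paper's 1st-section threshold
print(score_metric(tight, a) < score_metric(loose, a))  # True
```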

Experimental results
To compare the HIP, MLR and LSTM-Hawkes models, we selected two different sections, where the moving average trend of clicks is monotonically decreasing in the first section and increasing in the second section. Test results of the first section are presented in Figs 5, 6 and 7 and Table 2, and test results of the second section in Figs 8, 9 and 10 and Table 3. For an accurate performance assessment of the HIP model and MLR, we chose several correlation coefficients so that those models could show their accuracy in prediction, precision in fitting, as well as their limitations in both fitting and prediction. For the accuracy of our assessment we chose four equally dispersed correlation coefficients: 0.2, 0.4, 0.6, and 0.8. The observed click is the endogenous event, and the observed conversion is the exogenous event.
First, in the forecasting test of the first section, the forecast made using the LSTM-Hawkes model reduces MSE by 73.9% on average, and by at least 27.1% where the correlation coefficient is 0.8, compared to the HIP model, as well as by 30.16% on average compared to the MLR. Fig 5(a) and 5(b) also show the difference between the HIP and LSTM-Hawkes models where a sudden drop occurs at the beginning of the prediction period. In contrast to HIP, which does not follow the moving average due to the low correlation coefficient, LSTM-Hawkes follows the moving average closely, greatly reducing residuals. In Fig 6(a), HIP follows the moving average closely with a high coefficient (0.8) of exogenous data, but its MSE is still at least 20 points higher than that of LSTM-Hawkes.

Fig 6. As the correlation coefficient gets higher, HIP detects the sudden drop better. However, since HIP's ξ(t) is the expectation value of λ(t) with power-law decay, it tends to follow the trend somewhat quadratically rather than predicting the actual intensity value at each time unit. https://doi.org/10.1371/journal.pone.0232887.g006

MLR also shows great performance in fitting and learning. However, this fitting capability eventually causes overfitting, and Fig 10 shows that LSTM-Hawkes reduces residuals more than MLR does. The higher the variance of the observed data, the more residuals MLR produces. Thus, in any case where the dynamics have greater amplitude or higher frequency, the measured MSE of MLR should be greater than that of LSTM-Hawkes. The forecasting test for the second section shows that LSTM-Hawkes reduces MSE by 83.8% on average, and by at least 37.5% where the correlation coefficient is 0.6, compared to HIP, as well as by 49.39% on average compared to the MLR. Although both HIP (coefficient 0.6) and LSTM-Hawkes show great prediction performance in Fig 8, less-fitted parts are found with HIP between 50 and 70, while LSTM-Hawkes maintains the goodness of fit. This gap is shown in Fig 7(a) as well. From these tests, we verified that our proposed LSTM-Hawkes model significantly outperforms the HIP model in both the fitting and forecasting criteria. As in the first experiment, MLR again produces greater MSE due to overfitting: the higher variance of the second section's observed data yields an overfitted model. While the compared algorithms show similar prediction performance (MLR's fitting capability outperforms HIP's), LSTM-Hawkes outperforms both compared methods in every experimental case.
Pertaining to the compatibility test with various maximum likelihood estimation algorithms, based on both gradient descent and Newtonian methods, we could prove that LSTM-Hawkes has great compatibility across all of the tests. We set up our model for the compatibility test with Nelder-Mead [33], BFGS with Hessian Matrix [34], Conjugate Gradient Descent [35], and Simulated Annealing [36], and had our model optimized by estimating hyper parameters with the maximum likelihood estimation functions mentioned above. Fig 9 shows learning results based on these optimization functions, and we found that the full interquartile range (IQR) of MAE is much smaller than that of HIP (Fig 11). Also, since our model is not fully dependent on exogenous reference data, no outlier points are found. It is expected that compatibility with different optimization algorithms will always be guaranteed.

Lastly, when we assessed the score metric, Eq (15), of the models, we saw that the LSTM-Hawkes model always obtained a greater percentage of negative error score than the HIP model and MLR, as presented in Fig 12. This always resulted in a lower score metric for the LSTM-Hawkes model, proving its greater prediction accuracy and goodness of fit.

Conclusion
In a real production environment, an advertising network forecasting application requires prediction results within milliseconds or even microseconds [37]. Therefore, the approach of short-interval prediction using a Neural Network is not just scientifically important but also practically required, since it takes great advantage of fast forecasting. In addition, because Algorithms (1) and (2) can be performed in parallel or concurrently while Algorithm (3) is being conducted, the actual processing time in a production environment is expected to be even shorter. Thus, LSTM-Hawkes, using a stochastic generative model, guarantees time efficiency.

Discussion
These days, to attract more consumers, advertisers strongly focus on conversion events, which actually create sales. Thus, the percentage of mobile advertising bidding based on conversion events is increasing. However, since conversion events rarely occur, predicting the Conversion Ratio (CVR) based on a historical dataset is a great challenge for existing models, including Hawkes and other Neural Network models. Different from the original mathematical definition of the kernel, we did not suppose that the limit of Δt goes to zero. Therefore, we can easily expand the application of the model to longer-interval forecasts based on a bigger time unit. In future research, we plan to develop and implement a CVR prediction model, expanding LSTM-Hawkes, which overcomes the data sparsity problem.
Supporting information
S1 Dataset. To help researchers reproduce the results, we provide the code of the model. Having the original data sets, which can be found at http://labs.criteo.com/wp-content/uploads/2014/07/criteoconversionlogs.tar.gz, we first preprocess the data to conform to the behavior model (Fig 4), which can be found in the supporting information attachment. Then, we processed Algorithms 1 to 4 in order to achieve the experimental results, which can also be found in the supporting information attachment. (ZIP)
S1 File. (R)