Click prediction boosting via Bayesian hyperparameter optimization based ensemble learning pipelines

Online travel agencies (OTAs) advertise their website offers on meta-search bidding engines. Predicting the number of clicks a hotel would receive for a given bid amount is an important step in managing an OTA's advertisement campaign on a meta-search engine, because the bid multiplied by the number of clicks defines the cost to be generated. In this work, various regressors are ensembled to improve click prediction performance. Following the preprocessing procedures, the feature set is divided into train and test groups depending on the logging date of the samples. The data collection is then subjected to feature elimination via XGBoost, which significantly reduces the feature dimension. The optimal hyper-parameters are then found by applying Bayesian hyperparameter optimization to XGBoost, LightGBM, and SGD models. The trained models are tested separately as well as combined into ensemble models; four alternative ensemble solutions are suggested. The same test set is used to test both individual and ensemble models, and the results of 46 model combinations demonstrate that stack ensemble models yield the best R² score of all. In conclusion, the ensemble model improves prediction performance by about 10%.


Introduction
Millions of travelers book hotel accommodation over the Internet each year.
Modern travelers rely on peer opinions, electronic word of mouth (eWOM), and peer reviews. Popular online travel websites offer reliable reviews and prices [1]. Therefore, customers choose to inspect and compare different options on meta-search sites like Kayak.com, Trivago, and TripAdvisor before booking their accommodations.
Online travel agencies (OTAs) advertise their website offers on meta-search bidding engines. If the OTA chooses to run a Cost-Per-Click (CPC) ad campaign, the OTA promises to pay a certain amount for each click a certain hotel gets from the platform under predefined conditions. The amount to pay per click is the OTA's bid amount. Predicting the number of clicks a hotel would get for a certain bid amount is an important step in the OTA's advertisement campaign management on a meta-search engine, as bid × number of clicks defines the cost to be generated.
In one study, state-of-the-art prediction algorithms, including the Extreme Gradient Boosting (XGBoost) [2] regressor, together with the minimum Redundancy-Maximum Relevance (mRMR) [3] feature selection algorithm, were applied to predict the daily clicks to be received per hotel, using a large OTA's data from Turkey [4]. The data set received from the meta-search bidding engine contained both numerical and categorical features, with each column having missing and outlier values. The number of clicks was modelled as the multiplication of the predicted click-through rate (CTR) and the predicted hotel impressions. The highest R-squared values in predicting individual-hotel CTR and impression values were both achieved by XGBoost in this work.
Another study aimed to forecast how many impressions and clicks a hotel will acquire as well as how many rooms it will sell via a meta-search bidding engine [5]. The given model predicts how much money an OTA's hotels will make the following day. The authors demonstrate that by incorporating OTA-specific information into prediction models, the generalization of models improves and better results are obtained. In that study, the best results were obtained using tree-based boosting techniques.
Predicting hotel searches, clicks, and bookings is a challenging task due to many external factors such as seasonality, events, location, and hotel-based properties. Capturing such properties increases the accuracy of prediction models. Due to the high variance in daily OTA data, the use of non-linear prediction methods and creating relevant features with a time-delayed data preprocessing approach is adopted in a work trying to forecast daily room sales for each hotel in a meta-search bidding platform [6]. They applied XGBoost, random forest, gradient boosting, deep neural networks, and generalized linear models (GLM) [7]. The most successful model to predict bookings is gradient boosting, applied on a dataset enriched by features that can summarize the trends in the target variable well.
The demand for hotel rooms in the hotel industry in Turkey between the years 2002-2013 is estimated using ARIMA by Efendioglu and Bulkan [8]. In their study, they determined the hotel room capacity according to the cost of the unsold rooms and the ARIMA distribution. They also reported that the hotel room demand in the country could be affected by external factors such as political crises and warnings about terrorism. This work shows the non-deterministic nature of hotel room demand and how unpredictable factors suddenly affect the click prediction problem.
In the literature, several studies focus on predicting the CTR of a sponsored display advertisement shown on a search engine in response to a query. Click and CTR prediction is ongoing research for both industry and academia [9,10,11]. Our aim of predicting the number of clicks is closely related to the CTR prediction problem, hence those studies were investigated to better understand related work.
In order to predict ad clicks, Google makes use of logistic regression with improvements in the context of traditional supervised learning based on an FTRL-Proximal online learning algorithm [12] for better sparsity and convergence. Microsoft's Bing Search Engine proposes a new Bayesian online learning algorithm for CTR prediction for sponsored search [13], which is based on a probit regression model that maps discrete or real-valued input features to probabilities. The scalability of the algorithm is ensured through a principled weight pruning procedure and an approximate parallel implementation. Yahoo adopts a machine learning framework based on Bayesian logistic regression to predict click-through and conversion rates [14], which is simple, scalable, and efficient.
Facebook combines decision trees with logistic regression [15], generating 3% better results in click prediction, compared to other methods.
Ensemble learning [16] is a machine learning technique that combines the decisions of multiple models to enhance overall performance. The ensemble approach provides stable, low-variance predictions. It builds a set of decision-makers, namely classifiers and regressors, whose outputs are combined into final decisions with various techniques [17]. One study investigated whether ensemble learning techniques could increase the profitability of pay-per-click (PPC) campaigns [21]. They applied voting, bootstrap aggregation (bagging) [22], stacked generalization (or stacking) [23], and MetaCost [24] techniques to four base classifiers, namely Naïve Bayes, logistic regression, decision trees, and Support Vector Machines. The research analyzed a data set of PPC advertisements placed on the Google search engine, aiming to classify PPC campaign success.
They used average accuracy, recall, and precision metrics to measure the performance of both base classifiers and ensemble models. They also introduced the evaluation metric of total campaign portfolio profit and illustrated how relying on overall model accuracy can be misleading. They conclude that applying ensemble learning techniques in PPC marketing campaigns can achieve higher profits.
Eight ensemble methods were proposed by Ling et al. to accurately estimate the CTR in sponsored search ads [25]. A single model would lead to sub-optimal accuracy, and the regression models all have different advantages and disadvantages. The ensemble models are created via bagging, boosting, stacking, and cascading. The training data is collected from historical ads' impressions and the corresponding clicks. The Area under the Receiver Operating Characteristic Curve (AUC) and Relative Information Gain (RIG) metrics are computed against the testing data to evaluate prediction accuracy. They conclude that boosting is better than cascading for the given problem. Boosting neural networks with gradient boosting decision trees turned out to be the best model in the given setting. They conclude that the model ensemble is a promising direction for CTR prediction; meanwhile, domain knowledge is also essential in the ensemble design.
Etsy, an online e-commerce platform, displays promoted search results, which are similar to sponsored search results and our problem with meta-search bidding engines. CTR prediction is utilized in the system to determine the ranking of the ads [26]. They found out that different features capture different aspects, so they classified the features as being historical and content-based. They train separate CTR prediction models based on historical and content-based features, separately. Then, these individual models are combined with a logistic regression model. They reported AUC, Average Impression Log Loss, and Normalized Cross-Entropy metrics to compare the models to non-trivial baselines on a large-scale real-world dataset from Etsy, demonstrating the effectiveness of the proposed system.
In this study, we utilize ensemble learning pipelines to predict the number of clicks a hotel will receive the next day, and compare them against a substantial number of stand-alone prediction models.

Overview of the Proposed System
There are five primary components in the proposed system; the complete flow diagram is depicted in Fig. 1. To summarize, queries are used to retrieve the dataset from the database. In the next stage, preprocessing extracts time-domain seasonal decomposition features with suitable data cleaning. The XGBoost, LightGBM (LGBM) [27], and Stochastic Gradient Descent (SGD) [28] algorithms are then subjected to hyper-parameter tuning. In the final step, individual and ensembled models are trained and tested with the same train and test sets to generate click predictions. Each model's R² score is reported; 46 distinct models are trained and tested via the proposed system.

Dataset Generation and Data Preprocessing
The data is retrieved from a major OTA company based in Turkey. The closeness of the related day to the next public holiday and the length of the holiday are also added as additional numerical variables.
In order to improve the accuracy and generalization ability of the prediction model, additional features are generated from the data following a sliding-window (time-delay) approach; for example, the average and standard deviation of numerical values over specific time periods (such as the last 3 and 7 days). Next, one-hot encoding is applied to some of the string-based features, binarizing them. In the last step, the feature set is normalized with min-max scaling to force values to be between 0 and 1.
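The sliding-window idea above can be sketched as follows. The function and column names, and the choice of 3- and 7-day windows, are illustrative rather than the authors' actual pipeline; the sketch derives rolling mean/std features from a daily click series and min-max scales a feature column.

```python
from statistics import mean, stdev

def sliding_window_features(clicks, windows=(3, 7)):
    """For each day t, compute the mean and standard deviation of the
    clicks observed over the preceding N days (strictly before t)."""
    rows = []
    for t in range(len(clicks)):
        row = {}
        for w in windows:
            history = clicks[max(0, t - w):t]
            if len(history) >= 2:
                row[f"mean_{w}d"] = mean(history)
                row[f"std_{w}d"] = stdev(history)
            elif len(history) == 1:
                row[f"mean_{w}d"] = history[0]
                row[f"std_{w}d"] = 0.0
            else:  # no history yet on the first day
                row[f"mean_{w}d"] = None
                row[f"std_{w}d"] = None
        rows.append(row)
    return rows

def min_max_scale(values):
    """Normalize a feature column to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

In a real pipeline these rolling statistics would be joined back onto the per-hotel, per-day feature rows before the train/test split by logging date.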

XGBoost-based recursive feature elimination
XGBoost is a gradient boosting decision tree method that operates via regularization of the tree framework. By using gradient boosting to build the boosted trees and collect the feature scores efficiently, each feature's significance to the training model is indicated [30]. The feature importance of each feature F_n is calculated as shown in Eq. 1.
Within a single decision tree T, each internal node e splits the feature space on one feature n from the feature space F_n. The improvement î² at such a node measures the reduction in the squared-error cost function achieved by the split in the XGBoost regression outcome of an additive tree T_e. Summing these squared improvements over all split nodes on feature n, and averaging over all E trees, gives the squared importance of feature n; its square root, the root mean squared importance, is the absolute importance factor of the feature:

  Importance(F_n) = sqrt( (1/E) · Σ_{e=1..E} Σ_{splits on n in T_e} î² )
The estimation of such an improvement depends on replacing the actual feature value with random noise to determine the relative shift in the final regression performance. Running multiple trees simultaneously provides a better understanding of the feature's average importance.
In the next step, a customized recursive feature elimination algorithm is used to minimize the feature space [31]; Algorithm 1 shows the procedure. The goal is to find the subset of features (feature_subspace) that best covers the feature importance levels in descending order. To avoid the complexity of classical recursive feature elimination on the large feature space, the initial feature importance values are used as bias factors for the features. Since the randomization factor of the selected features is auto-biased in the subspace, this specialization significantly shortens the elimination process. The R² score of a new feature_subspace is calculated at every iteration until convergence occurs, i.e., until r2_score stops exceeding r2_temp. Again, the XGBoost regressor is selected as the feature subspace evaluator:

  if r2_score < r2_temp then
      return feature_subspace
  else
      r2_temp = r2_score
  end
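A minimal sketch of this importance-biased elimination loop is given below. The `score_fn` callback stands in for training the XGBoost evaluator on a candidate subspace and returning its R² score; all names are illustrative, not the authors' code.

```python
def select_features(importances, score_fn, features=None):
    """Greedy variant of importance-biased recursive feature elimination.

    Features are ranked by their importance (e.g. from XGBoost) in
    descending order and added one at a time; the loop stops as soon as
    adding the next feature no longer improves the R² score.
    """
    if features is None:
        features = list(importances)
    ranked = sorted(features, key=lambda f: importances[f], reverse=True)
    subspace = []
    r2_temp = float("-inf")
    for feat in ranked:
        r2_score = score_fn(subspace + [feat])
        if r2_score < r2_temp:  # convergence: score stopped improving
            return subspace
        r2_temp = r2_score
        subspace.append(feat)
    return subspace
```

The importance ranking acts as the bias factor from Algorithm 1: instead of retraining on every possible subset, only one ordered pass over the ranked features is needed.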

Bayesian Hyper-parameter Optimization
Hyper-parameter optimization is an essential step for many machine learning models to enhance prediction performance. A few algorithms exist for tuning hyper-parameters. One of them is grid search [32], which tries every combination of the given hyper-parameter candidates of a model. Another is random search [33], which samples hyper-parameter combinations at random and tries to reach a local optimum of a performance score. However, neither is able to reach a good local optimum in a short period. Bayesian hyper-parameter optimization [34] is a relatively more powerful and efficient algorithm for hyper-parameter tuning; it aims to reach a global optimum in much less time than grid search. A probabilistic model of f(x) is exploited to decide which point x to evaluate next. This procedure helps find the minimum of non-convex functions in a few epochs, which positively affects performance. The evaluation metric used to rank hyper-parameter combinations is R-squared (R²), a statistical measure representing the proportion of the variance of the dependent variable that is explained by the independent variable(s) in a regression model. The formula of R² is shown in Eq. 2:

  R² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²
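The R² metric of Eq. 2 translates directly into code; a minimal sketch (the function name is illustrative):

```python
def r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot: the proportion of the variance of the
    target that is explained by the model's predictions (Eq. 2)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total sum of squares
    return 1.0 - ss_res / ss_tot
```

A perfect predictor scores 1.0, while a predictor that always outputs the mean of the targets scores 0.0; models worse than that baseline score negative.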

Ensembling
If there are M models trained on the same dataset whose errors are uncorrelated with each other, the average error is theoretically reduced by some factor simply by averaging the model outputs. On the other hand, if some of the model outputs perform worse and do not fit the data as well as the others, the overall error may not be reduced, and may even increase in some cases.

Average & Weighted Average of Model Outputs
The first and most basic ensembling approach is to take an average of various model outputs. There are two different averaging techniques for ensembling.
The first one is taking the mean of the predicted values. It provides a lower variance of predicted values, since different algorithms capture different aspects of the input data set. The formula for the average of model outputs is shown in Eq. 3:

  Avg_i = (1/n) · Σ_{r∈R} p_{i,r}
where i denotes the i-th sample, r the regressor model, p_{i,r} the individual prediction of regressor r for sample i, and n the number of models used.
However, some machine learning models perform worse than others, culminating in an overall ensemble prediction performance poorer than some individual regressors'. The fundamental reason is that weak regressors are given the same weight as ones with decent individual performance. Consequently, in addition to the plain average of all estimations, a weighted average is also utilized in this study to eliminate the detrimental influence of low-performance models. Weights are produced from each model's individual R² score and scaled between 0 and 1, ensuring that the sum of all weights is 1. This method allows higher-performing models to have a greater impact on the final prediction than lower-performing ones.
The formula of the weighted average of model outputs is shown in Eq. 4.
  Wavg_i = Σ_{r∈R} w_r · p_{i,r},  for i = 1 … N

where r is the chosen regressor model, w_r is the normalized individual R² performance of regressor r, p_{i,r} is the prediction of regressor r for the i-th sample, and N is the number of samples.
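Eqs. 3 and 4 can be sketched directly in code. The function names are illustrative; `predictions` maps each regressor's name to its per-sample prediction list, and `r2_scores` holds each regressor's individual R² score.

```python
def average_ensemble(predictions):
    """Plain mean of model outputs (Eq. 3)."""
    n = len(predictions)
    return [sum(row) / n for row in zip(*predictions.values())]

def weighted_average_ensemble(predictions, r2_scores):
    """Weighted mean of model outputs (Eq. 4): each regressor's weight is
    its individual R² score, normalized so that all weights sum to 1."""
    total = sum(r2_scores[name] for name in predictions)
    weights = {name: r2_scores[name] / total for name in predictions}
    names = list(predictions)
    rows = zip(*(predictions[name] for name in names))
    return [sum(weights[name] * p for name, p in zip(names, row))
            for row in rows]
```

With equal R² scores the weighted average reduces to the plain average; otherwise the stronger regressor dominates the combined prediction.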

Blend Ensemble Model
The Stack ensemble method and the Blend ensemble algorithm [42] have similar structures: both train a level-1 meta-learner on the outputs of the base regressors, with blending fitting the meta-learner on a held-out validation split rather than on cross-validated out-of-fold predictions.

Experiments and Results
Instead of splitting the dataset into train and test by some percentage, daily click predictions for each hotel are estimated. Accordingly, the train set spans from the earliest day up to the test day whose clicks are to be predicted. It can be inferred from the results that simpler regressor models as meta-predictors overshadow tree-based regressors, because most of the work is already done by the level-0 learners; the level-1 regressor is essentially a mediator, and it makes sense to choose a rather simple algorithm for this purpose [43]. Simple linear models at this level are expected to work well, and the results prove it once again.
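The level-1 "mediator" idea can be sketched as follows, under the simplifying assumptions of exactly two base regressors and an ordinary least-squares linear meta-learner without intercept; this is an illustration of stacking's final stage, not the paper's actual configurations.

```python
def fit_linear_meta(level0_preds, y):
    """Fit a least-squares linear combination y ≈ w1*p1 + w2*p2 of two
    base regressors' out-of-fold predictions (the level-1 meta-learner),
    by solving the 2x2 normal equations directly."""
    p1, p2 = level0_preds
    a11 = sum(x * x for x in p1)
    a12 = sum(x * z for x, z in zip(p1, p2))
    a22 = sum(z * z for z in p2)
    b1 = sum(x * t for x, t in zip(p1, y))
    b2 = sum(z * t for z, t in zip(p2, y))
    det = a11 * a22 - a12 * a12
    w1 = (b1 * a22 - b2 * a12) / det
    w2 = (a11 * b2 - a12 * b1) / det
    return w1, w2

def stack_predict(weights, level0_preds):
    """Combine the base regressors' test-set predictions with the
    learned meta-weights."""
    w1, w2 = weights
    p1, p2 = level0_preds
    return [w1 * x + w2 * z for x, z in zip(p1, p2)]
```

Since the meta-learner only has to weigh a handful of already-strong level-0 outputs, a simple linear model like this is usually sufficient, consistent with the observation above.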

Conclusion and discussion
Assorted regressors are ensembled in the proposed study to improve click prediction performance. In the first phase, the feature set is divided into train and test groups depending on the logging date. The data collection is then subjected to feature elimination and Bayesian hyper-parameter optimization. The same test set is used to test both individual and ensemble models, and the results of 46 model combinations demonstrate that stack ensemble models produce the best R² score of all. The greatest R² score is 0.639, for the stack ensemble model combined with linear regression, whereas the best individual machine learning model had an R² score of 0.579. In conclusion, the ensemble model improves prediction performance by about 10%.
Various types of artificial neural network (ANN) models will be added to ensemble models in the future, with the goal of improving stack and blend ensemble models. Yandex's CatBoost machine learning model [44], which handles categorical information, can also be added to the list of regressors to examine.
Meta-learners are designed to produce the final outcome, yet they could be converted into intermediate learners by introducing additional hyperparameter optimization mechanisms or further meta-feature elimination, since they form an additive judgement on stacked predictions over an already-reduced feature dimension [45]. Articulating meta-learners as mediators would act as a regularizer for intercommunication between multiple meta-models in a single pipeline, recalibrating the incoming feature space with new model parameters.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.