An Ensemble Architecture Based on Deep Learning Model for Click Fraud Detection in Pay-Per-Click Advertisement Campaign

With the rapid development of online advertising, click fraud has become a serious issue for the internet market. Click fraud is a dishonest attempt to increase a website’s profit or deplete an advertiser’s budget by clicking on pay-per-click advertisements. This illegal act has long been a threat to the industrial sector. As a result, businesses hesitate to advertise their items on mobile apps and websites, as numerous groups attempt to take advantage of them. A robust mechanism for efficient click fraud detection is therefore needed so that businesses can safely advertise their services and products online. To tackle this issue, an ensemble architecture of machine learning and deep learning is proposed to detect click fraud in online advertisement campaigns. In the proposed ensemble architecture, a Convolutional Neural Network (CNN) and a Bidirectional Long Short-Term Memory network (BiLSTM) are used to extract hidden features, while a Random Forest (RF) is used for classification. The main objective of this study is to develop a hybrid DL model that automatically extracts features from click data, which an RF classifier then assigns to one of two classes: fraudulent or non-fraudulent clicks. Furthermore, a preprocessing module is developed that handles categorical attributes and imbalanced data to enhance the reliability and consistency of the click data. In addition, different evaluation criteria are used to evaluate and compare the performance of the proposed CNN-BiLSTM-RF with other ensemble and standalone models. The experimental results indicate that our ensemble architecture achieved an accuracy of 99.19 ± 0.08%, precision of 99.89 ± 0.03%, sensitivity of 98.50 ± 0.11%, F1-score of 99.19 ± 0.08% and specificity of 99.89 ± 0.03%. Furthermore, our proposed architecture produced superior results compared to the other ensemble and conventional models. Moreover, our proposed ensemble architecture can be used as a safeguard against click fraud in pay-per-click advertising to help industries promote their products safely and reliably.

Advertisements are delivered to these ad networks, which agree on a pre-decided price for each user click. The ad network compensates the content publisher based on the number of visitors it directs to the ads [4]. Unfortunately, a threat known as Click Fraud is associated with this payment method. There is roughly one fraudulent click among five clicks. The practice of click fraud is becoming more common, and a significant part of internet traffic is fake. In addition, advertisers typically suffer from economic losses due to click fraud activities [5].
Click fraud can be defined as deliberately clicking on a pay-per-click advertisement to misdirect or drain the advertiser's ad budget [6], [7]. Numerous groups or parties engage in click fraud. The most frequent offenders are as follows. Competitors account for the largest share of click fraud, which they direct at their rivals' adverts: they generate clicks and acquire a competitive edge by squandering their opponent's pay-per-click budget. Web administrators commit click fraud to make unjustified revenue from the advertisements displayed on their websites; rather than spending time growing and improving their website, they are enticed to click on these advertisements themselves to gain profits. A fraud ring is an organized group of fraudsters who target ad networks to obtain more money quickly [8].
Click fraud may be performed in a variety of ways. A brute force attack is an approach to click fraud using a single computing device; it might be as simple as repeatedly clicking on an advertisement. Publishers employ crowd-sourcing to boost ad clicks, intentionally or unintentionally using website visitors to click on their ads. Reward traffic, which compensates the user for clicking on the ad, is a more sophisticated form of click fraud that can create many clicks. A click farm is a click fraud technique in which people are paid to click on advertisements all day. Hit inflation is another type of click fraud in which real users are redirected through the ad before reaching the page they want to see. Botnets are networks of computers infected with malware. The malware takes control of several computers and instructs them to browse various websites and click on their advertising without the owners' knowledge [9].
Recently, machine learning (ML) and deep learning (DL) paradigms have been widely used for online fraud detection, including user behavior and item fraud [10], tax fraud [11], financial and transaction fraud [12], [13], and credit card fraud [14], [15], [16], to name a few. Conventional ML models, such as Random Forest (RF), Support Vector Machines (SVM) and Naive Bayes (NB), rely on a manually constructed feature space, which requires human intervention before the learning process. Furthermore, these conventional models do not adapt well to high-dimensional data [17]. In contrast, DL models such as Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) are robust and construct features automatically from large data samples [18], [19], [20]. These models are also leveraged to cope with high-dimensional and non-linear data and to produce reliable and superior performance compared to conventional learning models [21], [22]. Motivated by these studies, this study proposes an ensemble architecture based on a two-fold hybrid approach to classify normal and fraudulent clicks. First, the DL models CNN and BiLSTM are combined into a robust architecture that constructs the feature space automatically. Second, RF is employed as a supervised learning model that takes the features constructed by the hybrid DL model and classifies clicks into two classes: normal and fraudulent.
The following are the major contributions of this study.
• Develop an ensemble CNN-BiLSTM-RF architecture based on ML and DL models to detect click fraud in online advertisement campaigns with high accuracy.
• Develop a pre-processing module to investigate temporal characteristics of the dataset for click fraud detection.
• Comparative analysis of the proposed ensemble CNN-BiLSTM-RF architecture with the conventional learning and DL models to highlight the significance of the proposed research study.
• Various experiments are performed as a proof of concept to highlight the significance of the proposed ensemble architecture to facilitate the advertisement industry for reliable product promotions.
• A complete experimental study is provided to assess the performance of the proposed model, including Area Under Curve (AUC), precision, specificity, sensitivity, F1-score, confusion matrix and accuracy.
The remainder of this article is structured as follows. Section II discusses the latest and most closely related work. Section III describes the methodology and materials of the proposed architecture. Data description, pre-processing and analysis are given in Section IV. Section V contains the details of the experimental setup and the analysis of results. Section VI compares the proposed approach with other recent methods, and Section VII concludes the article.

II. RELATED WORK
This section discusses existing approaches for click fraud detection in pay-per-click advertising campaigns. Many studies in the literature focus on click fraud detection employing machine learning and deep learning approaches; this section discusses a few recent and related ones. In [23], the authors proposed Clicktok, which unifies the technical response and exploits the temporal aspects of click traffic to provide a protection approach that separates organic and non-organic click fraud attacks. In [24], the authors used AdSherlock to identify online click requests and then detect click fraud. In [25], a robust integrated local kernel embedding model was proposed to handle data sparsity and imbalance problems through a robust similarity function, which obtains the data embedding. In [26], the authors presented an efficient and deployable solution for detecting click fraud at the client side in mobile apps. Finally, in [27], the Fight Click-Fraud (FCFraud) method was proposed to detect click fraud from the user side; it can be incorporated into smartphone and computer operating systems. The method classifies ad requests from all user actions with 99.6% accuracy and detects click bots with a 100% success rate on mobile and computer devices.
VOLUME 10, 2022
Different supervised learning models have been developed to detect click fraud in online advertising environments. In [28], the authors proposed an ad-fraud-detection approach that utilizes features robust against attacker evasion. In [29], the authors developed new features based on statistics observed in an ad network, estimated from a considerable number of legitimate user ad requests, including the popularity of publisher websites and client environment tendencies [30]. These features are fed to an RF for detecting fraudulent ad requests. In [31], the authors assessed the user's click journey across their portfolio and flagged IP addresses that generate many clicks but never install apps. They used LightGBM and achieved 98% accuracy. In [32], the authors proposed a model based on XGBoost for distinguishing between legal and illegal users. In [33], the authors analyzed click patterns across a dataset to determine the user's click journey across their portfolio and flagged IP addresses that generate a high number of clicks but do not complete installation of the app. They used SVM, KNN, RF, and Gradient Tree Boosting (GTB) for classification. In [6], the RF algorithm was used to classify features to predict fraudulent click behaviour and achieved prediction accuracy higher than 91% on both positive and negative samples. However, all these models require a manual process to construct features from the given data samples. Furthermore, these conventional learning models are not versatile enough to cope with high-dimensional and non-linear data, which degrades model performance and causes poor generalization.
Furthermore, click fraud not only harms advertisers' budgets but also demonstrates how bots are being used to tamper with data. According to [34], it is therefore critical to be aware of such threats as they evolve and to devise solutions to avoid and prevent them. In [35], the authors proposed a click fraud detection model, abbreviated CFC, for classifying fraudulent clicks by incorporating some features and testing with KNN, ANN, and SVM. The experimental results show that the proposed CFC model achieved more than 93%. In [36], ADASYN was used with GTB to over-sample the data, which enhanced classification with an average precision score of 64.32%. Because the accuracy measure is not appropriate for an imbalanced distribution of class samples, the authors used the F1 score and AUC as evaluation measures to assess the performance of GTB. In [37], an algorithm-based detection technique for classifying targeted-advertising click fraud demonstrates how machine learning approaches can be integrated to maintain the viability of online advertising. In [38], the CFC (Click Fraud Crowd-sourcing) approach defends against dishonest clicks. In [39], the authors aimed to detect click fraud using various ML classifiers, such as RF, SVM, and k-Nearest Neighbor (KNN), as well as DL methods such as auto-encoders, CNN, and Restricted Boltzmann Machines (RBM). The limitation of this solution is that it only detects fraud in a supervised learning context.
Moreover, ensemble strategies have also been developed to detect click fraud and support reliable product promotion in the advertising industry. In [40], the authors suggested an ensemble approach based on RF to classify ad impressions into two classes: fraudulent or non-fraudulent. The authors achieved a precision of 96.29% using data acquired from a European commercial ad server. In [5], an ensemble approach was developed by integrating Cascaded Forest and XGBoost to detect click fraud, and multiple datasets were used to evaluate its performance. The authors achieved a maximum precision of 94.0%, recall of 94.0%, F1-score of 94.0%, and accuracy of 94.53%. Another Gradient Tree Boosting (GTB) based ensemble model was proposed to classify fraudulent behaviours of publishers from raw user click data [41]. The authors reported that the GTB model achieved a precision of 60.5%, which still needs improvement to support reliable online advertising. Similarly, in [42], a two-fold strategy was developed to segregate non-human clicks from online advertisement campaign data. Another ensemble model was presented in [43], employing XGBoost to detect click ad fraud using online advertising click data. The authors achieved an accuracy of 96%, helping advertisers block fake ads for reliable product promotions.
In [44], a hybrid deep learning method comprising a Neural Network (NN), a semi-supervised Generative Adversarial Network (GAN) and an Auto-Encoder (AE) was developed for click fraud detection. Furthermore, a multi-time scale forecasting technique was presented to deal with the imbalanced dataset. In [45], the authors proposed a deep learning approach called the cost-sensitive CNN model to identify fraudulent clicks in mobile advertisement data, using the feature matrix of a click to capture the pattern of click fraud. They obtained a classification accuracy and recall rate of over 93%. In [46], the authors proposed a unique weighted hybrid model to detect click fraud and identify fraudulent mobile advertising apps by integrating heterogeneous graph and DL approaches. The approach is based on the relationships among users, publishers and advertisements in the mobile ad system. Table 1 presents a critical analysis of the existing models for click fraud detection.
To the best of our knowledge, all the aforementioned studies used either ML or DL models to detect click fraud in advertisement data. Furthermore, most of the studies employed manual approaches for feature extraction. In addition, existing click fraud detection models have not achieved a detection rate high enough to support secure product promotion in the advertising industry. To sum up, no study has used an ensemble approach that automatically extracts features from click data using DL methods and then classifies clicks as real or fraudulent using a supervised learning algorithm. Therefore, in this study, an ensemble CNN-BiLSTM-RF is developed that extracts features automatically using a robust hybrid DL model and uses an RF classifier to classify real and fraudulent clicks. The proposed ensemble architecture aims to extract the most promising features automatically to build a robust classifier, enhancing click fraud detection performance and facilitating reliable product promotion for the advertising industry.

III. PROPOSED METHODOLOGY
This section presents a detailed methodology for click fraud detection architecture. It includes a general flow model of CNN architecture and detailed step-by-step architecture of the proposed method. In addition, it presents a comprehensive review of deep learning and conventional machine learning models.

A. OVERVIEW OF PROPOSED MODEL
This subsection gives an overview of the proposed click fraud detection model, which consists of the steps shown in Figure 1. The first step takes the pay-per-click data, which includes real and fake click data. Next, the raw click data is passed to the data cleaning module, which handles data imbalance, missing attribute values, selection of machine-readable features, etc. The prepared data is then given as input to the temporal feature extraction module, which is responsible for extracting hidden temporal patterns from the data. The extracted features are also analyzed visually to understand these patterns. Next, the prepared data is divided into training (learning) and testing (unseen) sample sets, and the proposed CNN-BiLSTM-RF model is trained using the learning samples. Once the model is trained, the testing samples are passed to it to evaluate its performance and compare it with existing baseline models. To evaluate the proposed and baseline models, different evaluation metrics are used, such as accuracy, precision, recall, F1 score, and area under the curve.

B. STEP BY STEP PROCESS OF PROPOSED METHODOLOGY
In this subsection, the step-by-step process of the proposed Pay Per Click (PPC) methodology, shown in Figure 3, is discussed. The process comprises several stages: collecting fake and real click data, processing the raw data, extracting features, building ML and DL models, and evaluating performance. First, we applied the Synthetic Minority Oversampling Technique (SMOTE) to balance the unbalanced dataset. Then, we divided the dataset into two classes: a real class and a fake class. Finally, different performance metrics are considered to evaluate the proposed approach and contrast its effectiveness with the other developed approaches.

1) PRE-PROCESSING DATA
The TalkingData dataset contains 200 million clicks with 8 features. The data must be pre-processed before detection so that the algorithm can handle it.
• Remove Null Value: Machine learning algorithms do not accept missing values, so missing data must be handled during dataset preparation. Missing values are represented by the floating-point NaN value or the Python None object; we remove them from the dataset by dropping the rows that contain them.
• Transformation of Data: Because DL/ML models require numeric input and output variables, we encode our categorical data as numeric data to fit and evaluate the model.
• Data Re-sampling: Data re-sampling is a pre-processing method used to improve the quality of the data. We used the SMOTE oversampling technique to balance the class distribution by generating artificial samples that increase the size of the minority class. The advantage of SMOTE is that, rather than duplicating existing samples, it produces artificial data points that differ only slightly from the actual data points.
• Data Normalization: This technique is used in machine learning and deep learning to reduce the sensitivity of the trained model to the scale of the features. It involves transforming real-valued numerical features to a 0-1 range, which helps the model converge to more precise weights. Because the features in our data have different ranges, we rescaled the numeric columns in our dataset to a common scale.
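The null removal, re-sampling and normalization steps listed above can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: `drop_missing`, `smote_like` and `min_max_scale` are hypothetical helper names, the toy rows are made up, and `smote_like` only mimics the core SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours) rather than reproducing a library implementation.

```python
import math
import random


def drop_missing(rows):
    """Keep only rows that contain neither None nor NaN."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))
    return [row for row in rows if not any(is_missing(v) for v in row)]


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def min_max_scale(rows):
    """Rescale every column to the 0-1 range (constant columns map to 0)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Toy minority-class rows (illustrative values only).
rows = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [float("nan"), 5.0], [5.0, 50.0]]
minority = drop_missing(rows)                        # rows 2 and 4 are dropped
augmented = minority + smote_like(minority, n_new=4)
scaled = min_max_scale(augmented)
print(len(minority), len(augmented), scaled[0])
```

Because each synthetic point lies on a segment between two existing minority samples, it stays inside the range of the original data, which is why the subsequent min-max scaling still maps every value into the 0-1 range.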

2) GENERAL CNN ARCHITECTURE
CNN is an improved version of the multi-layer perceptron, proposed by [56]. CNNs come in a variety of forms; we employed a 1-D CNN in this study. The typical structure of a 1-D CNN is visualized in Fig. 2. A CNN has three types of layers: convolutional layers, pooling layers, and dense (fully connected) layers. The CNN extracts implicit features from the input data by performing convolution and pooling operations [57]. The gathered features are then combined and routed through a dense layer. An activation function is used to introduce non-linearity into the neuron outputs.
The convolutional layer is a critical component of the CNN architecture. A convolutional layer consists of several convolutional kernels that convolve the input data to extract hidden features and build feature maps. The feature maps are then passed through a non-linear activation function to generate the convolutional layer output. Equation 1 represents the convolutional layer mathematically [58],
where $x_j$ indicates the convolution layer input, $c_j$ represents the $j$th output feature map, $w_j$ indicates a weight matrix, $*$ denotes the dot product, $b_j$ denotes the bias vector, and $f$ indicates the activation function. The rectified linear unit (ReLU) is frequently used as the activation function for CNNs. Mathematically, the ReLU function is defined in Equation 2 [59], where $h_j$ represents a feature map element produced by the convolution. Pooling layers are also known as down-sampling layers. The primary function of the pooling operation is to reduce the dimensionality of the feature maps and to avoid overfitting. Max pooling is a popular pooling method; it takes the maximum value of an allocated area in the feature maps and is calculated using Equations 3 and 4 [60], where $\gamma$ represents the max-pooling sub-sampling function, $\beta_j$ denotes the bias and $P_j$ indicates the output of the max-pooling layer. Finally, the convolutional and pooling feature maps are passed into the fully connected layer, which produces the final output vector, formulated in [60] as shown in Equation 5, where $y_j$ indicates the final output vector, $\delta_j$ represents the bias, and $t_j$ denotes the weight matrix.
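The displayed forms of Equations 1 to 5 do not appear in this copy of the text. A possible reconstruction from the symbol definitions above, assuming the standard forms of a convolutional layer, ReLU, max pooling and a dense layer, is sketched below. Equations 3 and 4 in particular are an assumption, as is the pooling region $R_j$, which the text does not name; the cited sources [58], [59], [60] remain authoritative.

```latex
\begin{align}
c_j &= f\left(x_j * w_j + b_j\right) \tag{1} \\
f(h_j) &= \max\left(0,\, h_j\right) \tag{2} \\
\gamma(c_j) &= \max_{i \in R_j} c_j^{(i)} \tag{3} \\
P_j &= \gamma(c_j) + \beta_j \tag{4} \\
y_j &= f\left(t_j P_j + \delta_j\right) \tag{5}
\end{align}
```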

3) GENERAL BiLSTM ARCHITECTURE
BiLSTM is a bidirectional variant of LSTM that learns in both the forward and backward directions to process long data sequences [61]. A conventional LSTM can only use the prior context and has no access to future information, which causes a loss of information about long-term dependencies in time-series data. BiLSTM therefore employs two separate hidden layers to learn long-term data sequences in the forward and backward directions. Capturing contextual dependencies in both directions is preferable for gaining access to long-range information. Put simply, it consists of two LSTMs: the first processes the input in the forward direction, whereas the second learns from the given inputs in the reverse (backward) direction. Fig. 4 shows the basic architecture of the BiLSTM. $\overrightarrow{h}$ and $\overleftarrow{h}$ denote the outputs of the forward and reverse hidden layers, respectively. Both LSTM units use the ordered input data sequences in the training process. A recursive process is carried out to estimate the outputs of the forward, $\overrightarrow{h}_t$, and reverse, $\overleftarrow{h}_t$, LSTM layers. The outputs of both LSTM layers are merged according to the mode attribute, which supports the following merge strategies: average, sum, multiply and concat. Our proposed architecture specifies the average mode to merge the outcomes of the forward and reverse LSTM units. Finally, a flatten layer, $f_{layer}$, takes the merged output and converts it into a one-dimensional vector $v$, which is passed to the softmax function to obtain the desired outcome.
The BiLSTM layer generates bi-directional sequences as a two-dimensional output vector, $Y$, in which the output sequences of both LSTM units are combined using the merge mode strategy, as shown in Equation 6 [18],
where $\alpha$ indicates the merge mode strategy applied to the output sequences $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$. The merge mode strategy can be a multiplication, summation, averaging or concatenation function; here, $\alpha$ is the average mode. Finally, the outcome of both LSTM units is represented as a one-dimensional vector, $Y = [y_1, y_2, \ldots, y_t]$, where the last element, $y_t$, indicates the best-predicted value for the next time iteration.
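To make the merge step concrete, the sketch below applies the four merge strategies to toy forward and backward hidden states. It assumes Equation 6 has the form $y_t = \alpha(\overrightarrow{h}_t, \overleftarrow{h}_t)$, which is inferred from the symbol definitions rather than taken from the source; `merge_step` and `merge_sequences` are hypothetical names and the numbers are illustrative.

```python
from typing import List

Vector = List[float]


def merge_step(h_fwd: Vector, h_bwd: Vector, mode: str = "ave") -> Vector:
    """Merge the forward and backward hidden states of one time step."""
    if len(h_fwd) != len(h_bwd):
        raise ValueError("forward and backward states must have equal size")
    if mode == "concat":
        return h_fwd + h_bwd  # output is twice as wide
    if mode == "sum":
        return [f + b for f, b in zip(h_fwd, h_bwd)]
    if mode == "mul":
        return [f * b for f, b in zip(h_fwd, h_bwd)]
    if mode == "ave":
        return [(f + b) / 2.0 for f, b in zip(h_fwd, h_bwd)]
    raise ValueError(f"unknown merge mode: {mode}")


def merge_sequences(fwd: List[Vector], bwd: List[Vector],
                    mode: str = "ave") -> List[Vector]:
    """Apply the merge strategy at every time step of the two sequences."""
    return [merge_step(f, b, mode) for f, b in zip(fwd, bwd)]


# Two time steps with two hidden units each (made-up values).
forward = [[1.0, 2.0], [3.0, 4.0]]
backward = [[3.0, 0.0], [1.0, 2.0]]

averaged = merge_sequences(forward, backward, mode="ave")
concatenated = merge_sequences(forward, backward, mode="concat")
print(averaged)
print(concatenated)
```

With the average mode the merged output keeps the width of a single LSTM, whereas concatenation doubles it; this affects the size of the feature vector that later layers receive.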

4) WORKFLOW MODEL OF RANDOM FOREST
This subsection presents the general workflow of a conventional ML model, Random Forest (RF). The RF algorithm is a well-known supervised machine learning algorithm. RF is based on the ensemble learning concept and is useful for both regression and classification problems. It combines multiple classifiers to tackle a complex problem and improve model performance. As the name implies, RF is a classifier that builds several decision trees on different subsets of a dataset and combines their outputs to enhance classification accuracy. Rather than relying on a single decision tree, the random forest aggregates the predictions from all trees and determines the final output by the majority vote of those predictions. In general, a larger number of trees improves accuracy and reduces the risk of over-fitting. This study uses classification to differentiate between real and fraudulent clicks; thus, we utilize the basic binary RF classifier. Fig. 5 illustrates the architecture of an RF classifier.
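The majority-vote aggregation described above can be sketched in a few lines. This shows only the voting step, not tree construction; `majority_vote` is a hypothetical helper and the per-tree predictions are made up (0 for a real click, 1 for a fraudulent click is an assumed encoding).

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(tree_predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-tree labels into one label per sample.

    tree_predictions[t][i] is the label that tree t assigns to sample i.
    With an odd number of trees and two classes, ties cannot occur.
    """
    n_samples = len(tree_predictions[0])
    final = []
    for i in range(n_samples):
        votes = Counter(tree[i] for tree in tree_predictions)
        final.append(votes.most_common(1)[0][0])
    return final


# Three hypothetical trees voting on four clicks.
votes = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
final_labels = majority_vote(votes)
print(final_labels)
```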

C. PROPOSED CNN-BiLSTM-RF ARCHITECTURE
A broad overview of the developed approach is illustrated in Fig. 6. The proposed technique has three major components. The 11 × 1 click fraud data is loaded into the deep learning network at the input layer. The network begins with a 1-D CNN layer followed by a max-pooling layer, which performs a sample-based discretization of the feature maps to retain the relevant features, reducing training time and helping to prevent over-fitting. After the max-pooling layer comes a batch normalization layer, which normalizes the activations between intermediate layers and prevents slow training. The 1-D CNN layer contains 64 filters with a kernel size of two and uses ReLU as its activation function. The max-pooling layer has a pooling length of 2. The resulting feature map is fed into the BiLSTM layer, which contains 128 memory blocks that learn the time-domain features. The BiLSTM layer is followed by a max-pooling layer with pooling length 2 and a batch normalization layer. Next, a flatten layer reshapes the output for the subsequent dense layers. Two dense layers with 128 and 64 units are added, both using ReLU as the activation function. A dropout layer with a rate of 0.5 is placed between the two dense layers. The dropout layer is included to counter over-fitting even though the model uses max pooling between layers, because CNN and BiLSTM used in combination have a higher probability of over-fitting and performing poorly on the testing set. Finally, the features are fed into the RF for the classification of real and fraudulent clicks.
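As a sanity check on the layer stack described above, the following sketch traces the tensor shape through each layer for the 11 × 1 input. Several details are assumptions not stated in the text: 'valid' padding and stride 1 in the convolution, non-overlapping pooling windows, and a BiLSTM that returns the full sequence with average merging (so 128 features per time step). The function and layer names are hypothetical.

```python
from typing import List, Tuple

Shape = Tuple[int, ...]


def conv1d_shape(steps: int, kernel_size: int, filters: int) -> Shape:
    # Assumes 'valid' padding and stride 1, which the text does not state.
    return (steps - kernel_size + 1, filters)


def maxpool1d_shape(steps: int, channels: int, pool_size: int) -> Shape:
    # Assumes non-overlapping windows; a trailing partial window is dropped.
    return (steps // pool_size, channels)


def trace_shapes(input_steps: int = 11) -> List[Tuple[str, Shape]]:
    """Return (layer name, output shape) pairs for the described stack."""
    shapes: List[Tuple[str, Shape]] = [("input", (input_steps, 1))]

    steps, channels = conv1d_shape(input_steps, kernel_size=2, filters=64)
    shapes.append(("conv1d_relu", (steps, channels)))

    steps, channels = maxpool1d_shape(steps, channels, pool_size=2)
    shapes.append(("maxpool_1", (steps, channels)))
    shapes.append(("batchnorm_1", (steps, channels)))  # shape unchanged

    # Assumed: sequence output, average merge keeps 128 features per step.
    channels = 128
    shapes.append(("bilstm_ave", (steps, channels)))

    steps, channels = maxpool1d_shape(steps, channels, pool_size=2)
    shapes.append(("maxpool_2", (steps, channels)))
    shapes.append(("batchnorm_2", (steps, channels)))

    shapes.append(("flatten", (steps * channels,)))
    shapes.append(("dense_1_relu", (128,)))
    shapes.append(("dropout_0.5", (128,)))  # dropout does not change shape
    shapes.append(("dense_2_relu", (64,)))
    return shapes


layer_shapes = dict(trace_shapes())
for name, shape in trace_shapes():
    print(f"{name:>14}: {shape}")
```

Under these assumptions the last dense layer yields a 64-dimensional vector per click, which would be the feature representation handed to the RF classifier.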
Furthermore, Algorithm 1 provides the step-by-step process of the proposed CNN-BiLSTM-RF. The algorithm comprises several steps that present the detailed flow of the model. It takes raw click data as input and predicts the click type, real or fake, as output. First, the data is pre-processed: categorical attributes are transformed into numeric format, and initial input features are computed from the timestamp feature. Next, the newly constructed features are added to the existing feature set. Once the features are combined, a correlation index is calculated to reduce the feature set by eliminating weakly correlated features. Then, SMOTE is applied to balance the distribution of data samples across class labels. Next, min-max normalization is used to scale feature values into a uniform range so that all features are considered equally in the learning phase. Once the data is cleaned, it is divided into K (K = 10) subsets, $[S_1, S_2, S_3, \ldots, S_K]$. The hybrid model is then trained using $f(X, y)$, where $X, y \in [S_1, \ldots, S_K]$. During each training epoch, training and validation loss and accuracy are computed, and the weights are updated for the next epoch using the Adam optimizer to minimize training and validation error. In addition, training and validation accuracy are reported for each of the K sets. Once the hybrid CNN-BiLSTM model is trained, the input samples are passed through it to extract hidden features, which are then used to train the RF model using $f(E_F, y_{train})$. Once the RF is trained on the extracted feature set, unseen samples are also passed through the hybrid DL model to extract their features, and $f(\bar{E}_F, y_{test})$ is used to obtain $y_{pred}$. Once the prediction results are obtained, different evaluation measures are employed to evaluate performance in terms of accuracy, precision, recall, F1 score and AUC.
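The threshold-based evaluation measures named above can be computed from the confusion-matrix counts, as in the following minimal sketch (AUC is omitted because it needs predicted scores rather than labels). The helper names and the toy labels are hypothetical, and no guard against empty classes is included. As a worked check, the F1 formula applied to the precision (99.89%) and sensitivity (98.50%) reported in the abstract gives the reported F1-score of 99.19%.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def metrics(tp, fp, tn, fn):
    # No zero-division guard: assumes both classes occur and are predicted.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "f1": f1_from(precision, recall),
    }


# Toy labels (1 = fraudulent, 0 = real is an assumed encoding).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
scores = metrics(*confusion_counts(y_true, y_pred))

# Consistency check against the precision and sensitivity in the abstract.
reported_f1 = f1_from(0.9989, 0.9850)
print(scores, round(reported_f1, 4))
```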

IV. DATASET PRESENTATION AND ANALYSIS
This section presents the data preparation and analysis used to clean the raw data and highlight the hidden patterns of clicks. It covers the data description and source, the pre-processing applied to obtain reliable data, and an exploratory analysis of the pre-processed data.

A. DATASET DESCRIPTION AND PRE-PROCESSING
In this research study, we used the TalkingData dataset acquired from Kaggle [62], explained in detail as follows. The TalkingData is an ad-tracking fraud dataset containing 200 million clicks with eight features over four days. The main features of the dataset are as follows: click's IP address, application identifier for marketing purposes, device type, installed operating system for the device, publisher channel, click time (timestamp (UTC)), app downloads time and is_attributed (target attribute). Table 2 summarizes the acquired data.
Next, a pre-processing module is developed to clean the raw data and convert it into a reliable, machine-readable format. During pre-processing, the time attribute was removed. The click time attribute was divided into four sub-columns: day, hour, minute, and second. Next, label encoding is employed to convert the labels into numerical, machine-readable form. Label encoding assigns a unique number to each category of data. A drawback is that, if the dataset is not well organized, this can cause problems in training: a label with a high value may be given higher priority than a label with a lower value [63]. In this way, label encoding is used to convert the categorical values of the Device, OS, and Channel attributes into numeric form.

Algorithm 1 Proposed CNN+BiLSTM-RF Algorithm
Input: Click data X; y is the target variable, F represents features and $E_F$ represents extracted features.
Output: Detection of real and fake clicks in the TalkingData dataset.
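The click-time decomposition and label encoding described in this subsection can be sketched with the standard library alone. The timestamp format string, the helper names `split_click_time` and `label_encode`, and the example values are assumptions for illustration; assigning codes in sorted order is one common convention, not necessarily the one the authors used.

```python
from datetime import datetime


def split_click_time(click_time: str) -> dict:
    """Split a click timestamp into day, hour, minute and second columns.

    Assumes the 'YYYY-MM-DD HH:MM:SS' layout; adjust the format string
    if the raw data uses a different one.
    """
    ts = datetime.strptime(click_time, "%Y-%m-%d %H:%M:%S")
    return {"day": ts.day, "hour": ts.hour,
            "minute": ts.minute, "second": ts.second}


def label_encode(values):
    """Map each distinct category to an integer code (sorted order)."""
    mapping = {category: code
               for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


parts = split_click_time("2017-11-07 09:30:38")  # illustrative timestamp
codes, mapping = label_encode(["android", "ios", "android", "windows"])
print(parts)
print(codes, mapping)
```

The integer codes carry no real ordering, which is the source of the priority problem noted above: a model may treat code 2 as "larger" than code 0 even though the categories are unordered.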
For the experiments, due to limited computational resources, the entire dataset was not considered. Instead, only 1 million data samples are used, with a class ratio that matches the ratio in the full 200 million samples. When dealing with unbalanced datasets, ML and DL algorithms become biased towards the dominant class [64]. To address this problem, we used the Synthetic Minority Oversampling Technique (SMOTE) to balance the unbalanced dataset. SMOTE is a popular oversampling technique that was proposed to improve on random oversampling; however, its behavior on high-dimensional data has not been thoroughly investigated [65].

B. EXPLORATORY ANALYSIS
In this subsection, the click data is analyzed visually through box plots and heat-maps. The primary purpose of the box plots in this article is to locate the central value of the data and to examine how the data is dispersed across samples by comparing the medians of the boxes. Box-plot analysis is widely used to present the five-value summary: the minimum, the lower quartile, the median, the upper quartile, and the maximum number of clicks. We investigate PPC according to time-interval groups in terms of Hourly Clicks (HC), Daily Clicks (DC) and Weekly Clicks (WC). It can be seen that the relationship between HC, DC, WC, and PPC varies because of the different structures of clicks. Heat-maps are used in many kinds of analytics but are most commonly used to show visitor clicks on particular web pages or website templates. They can display where visitors have clicked on a page and how far they have scrolled down it, or show the results of eye-tracking tests. Click analytics are useful for web activity analysis, marketing, software testing, market research, and user productivity analysis.
Therefore, the clicks data are analyzed at three temporal granularities: hourly, daily, and weekly. First, hourly clicks are visualized to analyze the total hourly traffic on PPC websites, with the click frequency on the y-axis and the hour on the x-axis. In this example, we look at hourly trends; for instance, 350k visitors click on a PPC ad per hour over 23 days. Fig. 7 shows the hourly clicks analysis. Daily clicks analysis aggregates a user's tracked behavior across a website. In this graph, we identify daily clicks on an ad between 6 and 9 DC. Similarly, Fig. 8 shows the daily clicks analysis.
Next, we switch the graph to ''Weekly Clicks Analysis'' to look at the data at the weekly level. In the weekly click analysis, we collect data from 3.5 million users from week one to week three. Fewer users clicked on the ad in the first week than in the second week. We examined the clicking behavior of users on pay-per-click ads, which became more noticeable in the third week. Fig. 9 illustrates the weekly click analysis. It shows the weekly distribution of the clicks data based on the five-value summary. The box plot analysis indicates that weeks 2 and 3 had the maximum number of clicks.
Moreover, Fig. 10 shows a correlation analysis that investigates the linear relationship between the temporal and output features. Correlation analysis is a statistical technique used to ascertain the strength of a linear relationship between two variables and to quantify their association. The temporal correlation matrix is a table of the correlation coefficients between variables; every cell in the table represents the relationship between two variables. A correlation matrix can also be used to summarize data, as an input to a more detailed analysis, or as a diagnostic for advanced analyses. Our goal is to summarize a substantial amount of data in order to identify patterns.
In our preceding example, the perceptible pattern is that all of the variables are highly correlated with one another. Pairwise correlation analysis is used to investigate the linear relationship between pairs of features. The pairwise correlation coefficient ranges from −1 to +1: a negative value indicates an inverse linear relationship and a positive value a direct one, while a magnitude close to 1 indicates a strong relationship and a value near 0 a weak one.
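A pairwise correlation matrix of the kind shown in Fig. 10 can be built from a plain correlation function. The sketch below assumes Pearson's coefficient, the usual measure of linear correlation; the paper does not name the coefficient, and the feature names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict that maps feature name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical temporal features and target, for illustration only.
features = {
    "hour": [1, 2, 3, 4],
    "minute": [8, 6, 4, 2],
    "is_fraud": [0, 0, 1, 1],
}
matrix = correlation_matrix(features)
```

In this toy data `hour` and `minute` are perfectly inversely related, so their coefficient is −1, and every feature has coefficient +1 with itself.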

V. EXPERIMENTAL ENVIRONMENT AND RESULTS ANALYSIS
A. IMPLEMENTATION ENVIRONMENT
The selected one-million-sample dataset is divided into 80% training data and 20% testing data. The results are obtained through 5-fold cross-validation. For normalization, we used the Min-Max scaler, and for balancing the dataset we used SMOTE oversampling. First, we train the CNN-BiLSTM model for 100 epochs on the training data. The SGD optimizer is used with a learning rate of 0.001, a batch size of 33, and categorical cross-entropy loss. After training and validation, the output layer of the CNN-BiLSTM model, which consists of 2 units with sigmoid activation and an l2 kernel regularizer (1e-4), is replaced with the RF classifier. The features extracted by the CNN-BiLSTM deep learning network are fed into the RF classifier for classification. The CNN extracts deep features, and the BiLSTM captures long-term dependencies in a data sequence. Finally, we run the experiment for RF with 200 estimators and random state 42. Table 3 presents a detailed summary of the implementation environment for our proposed architecture.
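As an illustration of the normalization step, min-max scaling maps each feature to the range [0, 1] using the minimum and maximum of that feature. The sketch below fits the ranges on the training split only and reuses them for the test split; this is a common practice assumed here, not a detail stated in the paper, and the function names and values are hypothetical.

```python
def fit_min_max(train_rows):
    """Return the (min, max) of each feature column in the training rows."""
    return [(min(column), max(column)) for column in zip(*train_rows)]


def apply_min_max(rows, ranges):
    """Scale each feature to [0, 1] relative to the fitted training ranges."""
    return [
        [
            (value - low) / (high - low) if high > low else 0.0
            for value, (low, high) in zip(row, ranges)
        ]
        for row in rows
    ]


# Hypothetical two-feature data.
train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
test = [[2.5, 40.0]]

ranges = fit_min_max(train)
train_scaled = apply_min_max(train, ranges)
test_scaled = apply_min_max(test, ranges)
```

A test value outside the training range, such as 40.0 in the second feature, is mapped outside [0, 1]; this is expected when the ranges come from the training data alone.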

B. RESULTS ANALYSIS
This subsection investigates the performance of the proposed ensemble model. First, the loss and accuracy of the DL models are evaluated using the training and validation datasets. Second, the performance of each implemented model is evaluated and compared using evaluation indicators such as the confusion matrix, accuracy, precision, recall, F1-score, and AUC.

1) PERFORMANCE ANALYSIS
The confusion matrices in Fig. 11 compare the performance of the proposed deep CNN-BiLSTM-RF with the other implemented DL models, namely the BiLSTM, CNN, and CNN-BiLSTM architectures, for click fraud detection. We evaluated the model performance on 39,910 testing samples (real: 19,959 clicks and fraudulent: 19,951 clicks). The dark green diagonal of each matrix represents the accurate classifications, whereas all other entries are misclassifications. As illustrated in Fig. 11a, when BiLSTM is applied individually to the testing data, 189 real clicks are inaccurately classified as fraudulent (false negative) and 1,778 fraudulent clicks are inaccurately classified as real (false positive). Similarly, the CNN model misclassifies 125 real clicks and 1,417 fraudulent clicks, as visualized in Fig. 11b, which indicates that the CNN model performed better than the BiLSTM. However, when CNN-BiLSTM-RF is applied to the testing data, only 12 real clicks are misclassified as fraudulent, whereas 154 fraudulent clicks are misclassified as real, as shown in Fig. 11d. Thus, the CNN-BiLSTM-RF model performs significantly better than the CNN-BiLSTM and the standalone BiLSTM and CNN models.
Furthermore, different evaluation measures, namely accuracy, precision, sensitivity, specificity, F1-score, and AUC, are employed to test and evaluate the results of the proposed model [66], [67]. Here T_p, T_n, F_p, and F_n denote the numbers of true positives, true negatives, false positives, and false negatives. Accuracy evaluates a predictor's ability to identify all instances correctly, either positive or negative, as shown in equation 7:

$$\text{Accuracy} = \frac{T_p + T_n}{T_p + T_n + F_p + F_n} \tag{7}$$

Sensitivity is the fraction of accurately predicted positive samples among all actual positive samples, as given in equation 8:

$$\text{Sensitivity} = \frac{T_p}{T_p + F_n} \tag{8}$$

Thus, it assesses the capacity of a predictor to identify positive samples. Similarly, specificity evaluates the capacity of a classifier to identify negative instances:

$$\text{Specificity} = \frac{T_n}{T_n + F_p}$$

Equation 9 shows the basic formula to estimate precision for a binary classification problem:

$$\text{Precision (PR)} = \frac{T_p}{T_p + F_p} \tag{9}$$

Furthermore, the harmonic mean of precision and recall (sensitivity) is called the F1-score, as given in equation 10:

$$\text{F1-score} = \frac{2 \times \text{PR} \times \text{Sensitivity}}{\text{PR} + \text{Sensitivity}} \tag{10}$$

Based on this evaluation, Table 4 analyzes the click fraud detection performance of the proposed CNN-BiLSTM and CNN-BiLSTM-RF models. Table 4 shows that when the CNN-BiLSTM is applied to the test data, the classification accuracy is only 98.09 ± 0.13% (with 96.56 ± 0.17% sensitivity and 99.74 ± 0.04% specificity) for real and fake clicks. As explained earlier, we used CNN-BiLSTM for feature extraction and fed the extracted features into the RF algorithm for classification. The CNN-BiLSTM in combination with RF yields 99.19 ± 0.08% accuracy (with a sensitivity of 98.50 ± 0.11% and a specificity of 99.89 ± 0.03%), which is the best result among the implemented models. An improvement in recall and precision is also noted: the CNN-BiLSTM achieved 99.74 ± 0.04% precision and a 98.12 ± 0.13% F1-score, while the CNN-BiLSTM-RF architecture obtained 99.89 ± 0.03% precision and a 99.19 ± 0.08% F1-score. This shows that the CNN-BiLSTM model extracts meaningful features that help to enhance the CNN-BiLSTM-RF performance. Table 4 shows the overall accuracy, precision, recall, F1-score, and AUC of the proposed and other implemented models.
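The measures in equations 7 to 10 can be computed directly from the four confusion-matrix counts. The sketch below is illustrative: the function name is hypothetical, and the example counts are derived from the Fig. 11d figures quoted earlier (12 of 19,959 real clicks and 154 of 19,951 fraudulent clicks misclassified), treating real clicks as the positive class to match the false negative and false positive labels used for Fig. 11a. That choice of positive class is an assumption.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, precision, and F1 from raw counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Counts derived from the Fig. 11d description (positive class = real clicks,
# which is an assumption made for this example).
tp = 19959 - 12    # real clicks classified as real
fn = 12            # real clicks classified as fraudulent
tn = 19951 - 154   # fraudulent clicks classified as fraudulent
fp = 154           # fraudulent clicks classified as real

metrics = classification_metrics(tp, tn, fp, fn)
```

Swapping the positive class exchanges sensitivity with specificity. Values computed from a single confusion matrix also need not equal the cross-validated means reported in Table 4.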

2) LOSS/ACCURACY ANALYSIS OF DL MODELS
The loss and accuracy of the implemented DL models are analyzed using the training and validation sample sets. Loss quantifies the prediction error, defined as the summation of errors, and measures how well or how badly the model is performing. In this study, categorical cross-entropy is used as the loss function for the given binary problem; it expects the class labels in one-hot encoded format, i.e., as vectors of 0's and 1's. Fig. 12 shows a loss analysis of the implemented individual and ensemble DL models. The training and validation loss of the proposed deep CNN-BiLSTM model is compared with that of BiLSTM, CNN, and CNN-BiLSTM (with 1 layer each). The training and validation loss of the proposed deep CNN-BiLSTM-RF model decreases significantly as the number of training epochs increases, compared to the standalone models. The training and validation loss of the proposed deep CNN-BiLSTM model varies between 0.01 and 0.35, which indicates that the proposed model detects fraudulent clicks effectively.
Similarly, accuracy is used as a performance metric that measures the performance of the model by comparing the predicted and ground-truth class labels. Furthermore, 50 epochs are used to calculate the training and validation accuracy of each implemented DL model. Fig. 13 compares the accuracy of each individual model and the proposed ensemble model, all trained with categorical cross-entropy loss. The comparison covers the training and validation accuracy of the proposed deep CNN-BiLSTM-RF architecture and the other implemented DL architectures. It is found that the training and validation accuracy of the proposed architecture increases as the number of training epochs increases. The accuracy of our proposed deep CNN-BiLSTM-RF model reached 99% for both the training and validation sets as the number of epochs reached 50.
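For reference, categorical cross-entropy over one-hot labels can be computed as in the following sketch. The function name, the clipping constant `eps`, and the example probabilities are assumptions for illustration; deep learning frameworks ship their own implementations.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy for one-hot labels and class probabilities."""
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        # Only the entry of the true class contributes, since the others are 0.
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(true_row, pred_row))
    return total / len(y_true)


# Two samples, two classes (hypothetical column order: real, fraudulent).
y_true = [[1, 0], [0, 1]]
y_pred = [[0.9, 0.1], [0.2, 0.8]]

loss = categorical_cross_entropy(y_true, y_pred)
```

The loss approaches 0 as the probability assigned to the true class approaches 1, which is why a falling loss curve indicates improving predictions.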

3) ROC CURVE ANALYSIS
The Receiver Operating Characteristic (ROC) curve is a performance measure for binary classification problems. It provides a visual way to understand the trade-off between sensitivity (true positive rate) and the false positive rate (1 − specificity). It uses different probability threshold values to expose this trade-off. It is effective and mostly used with balanced class distributions to investigate the rates of true and false positives. Higher values on the y-axis indicate that the model is more reliable, and a perfect classifier passes through the point (0, 1). Fig. 14 illustrates the proposed model's performance. The ROC curve analysis shows that the AUC of the proposed ensemble architecture is close to 1, which indicates that it achieves a high true positive rate at a low false positive rate compared to the other models. The standalone BiLSTM and CNN models achieved AUC scores of 96.27% and 95.07%, respectively, which indicates that the CNN performance is slightly lower than that of the BiLSTM. Similarly, CNN-BiLSTM achieved a 98.42% AUC score, while the CNN-BiLSTM-RF model obtained a 99.58% score. The ROC curve analysis indicates that our proposed deep CNN-BiLSTM-RF model outperformed the standalone and other ensemble DL models.
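The construction of a ROC curve and its AUC can be sketched as follows: the decision threshold is swept from the highest score downward, and the resulting (false positive rate, true positive rate) points are integrated with the trapezoidal rule. The function names and the four-sample example are hypothetical, and the code assumes that both classes are present.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    pairs = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Handle tied scores together so they produce a single point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels (1 = positive class) and predicted scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

curve = roc_points(labels, scores)
area = auc(curve)
```

In this example three of the four positive-negative pairs are ranked correctly, so the area is 0.75; a classifier that ranks every positive above every negative reaches 1.0.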

VI. DISCUSSION
The experimental findings and analyses reveal that our proposed deep CNN-BiLSTM-RF performed well compared to the CNN-BiLSTM architecture. The proposed deep CNN-BiLSTM-RF increased the accuracy by 3.31% and 3.45% compared to the standalone BiLSTM and CNN models, respectively, as shown in Fig. 15. It also achieved better accuracy than the ensemble CNN-BiLSTM model. The detection rate of the proposed model for fraudulent clicks is also improved by 3.1% and 3.3% compared to the BiLSTM and CNN models. In addition, our proposed model improved the F1-score by 3.3% and 3.5% compared to the BiLSTM and CNN models. Hence, our proposed deep CNN-BiLSTM-RF model achieved better performance than the BiLSTM, CNN, and CNN-BiLSTM models. Furthermore, Fig. 16 presents a comparison of accuracy and precision (detection rate) for fraudulent click detection. The comparison indicates that our proposed model achieved a high detection rate for fraudulent clicks of 99.60% compared to the other listed models.
Moreover, Table 5 compares our proposed CNN-BiLSTM-RF model with some recent approaches to click fraud detection on the basis of accuracy, recall, precision, F1-score, specificity, and AUC. The studies [5], [6], [9], [32], and [68] achieved moderate accuracies between 91% and 94%. Studies such as [49] and [43] achieved the highest accuracies among them, between 95% and 98%. In terms of accuracy, precision, sensitivity, specificity, and AUC, our proposed CNN-BiLSTM-RF achieved better results than these approaches.

VII. CONCLUSION AND FUTURE DIRECTION
Companies are shifting their focus to advertising their items and services on mobile applications and websites as the online advertising market continues to grow. As a result, the problem of click fraud has become highly prevalent in recent years. Click fraud is the malicious or illegal clicking on adverts that results in the advertiser's revenue being squandered. To address this problem, numerous approaches have been proposed to detect click fraud. By categorizing clicks as valid or invalid, click fraud detection methods can be employed to shield advertisers. We proposed a hybrid of CNN and BiLSTM combined with the RF classifier for click fraud detection. The combined CNN-BiLSTM-RF model gives the best results on the click fraud data. It benefits from the CNN's feature extraction capability as well as the BiLSTM's ability to capture long-term bidirectional dependencies. Besides, RF is an ensemble machine learning model that is more suitable for classification than the traditional classifier associated with deep learning networks. The proposed models were trained on one million samples of the TalkingData click fraud dataset. We compared two deep learning models, CNN-BiLSTM and CNN-BiLSTM-RF, across different configurations and concluded from the experimental results that the CNN-BiLSTM-RF model performs well, with an accuracy of 99.19%. The proposed architecture can therefore be utilized as a general model to combat click fraud in pay-per-click advertising and to protect advertisers from fraudsters who generate clicks on their advertisements illegally. In future work, the proposed model will be trained on other click fraud datasets to evaluate its performance. We will also develop a tool to detect click fraud in real-world internet and mobile advertising environments.
AMREEN BATOOL received the bachelor's degree from the GC University of Pakistan, the M.C.S. degree from the Virtual University of Pakistan, and the M.S. degree in computer science and technology from Tiangong University, Tianjin, China, in 2021. She is currently pursuing the Ph.D. degree with the Department of Electronic Engineering, Jeju National University, Republic of Korea. She is serving as a Project Coordinator at EUT Global Ltd. Her main role is to coordinate with clients and field engineers to plan project delivery. Her research interests include machine learning, deep learning, and blockchain technology. His research interests include AI machine learning, pattern recognition, blockchain and deep learning-based applications, big data and knowledge discovery, time series data analysis and prediction, image processing and medical applications, and recommendation systems.