Bi-GRU-APSO: Bi-Directional Gated Recurrent Unit with Adaptive Particle Swarm Optimization Algorithm for Sales Forecasting in Multi-Channel Retail

Abstract: In the present scenario, retail sales forecasting is of great significance to E-commerce companies. Precise retail sales forecasting enhances business decision making, storage management, and product sales. Inaccurate retail sales forecasting can lead to decreased customer satisfaction, inventory shortages, product backlogs, and unmet customer demand. To obtain better retail sales forecasts, deep learning models are preferred. In this manuscript, an effective Bi-GRU is proposed for accurate sales forecasting related to E-commerce companies. Initially, retail sales data are acquired from two benchmark online datasets: the Rossmann dataset and the Walmart dataset. From the acquired datasets, the unreliable samples are eliminated by interpolating missing data, removing outliers, normalization, and de-normalization. Then, feature engineering is carried out by implementing the Adaptive Particle Swarm Optimization (APSO) algorithm, the Recursive Feature Elimination (RFE) technique, and the Minimum Redundancy Maximum Relevance (MRMR) technique. Following that, the optimized active features from feature engineering are given to the Bi-Directional Gated Recurrent Unit (Bi-GRU) model for precise retail sales forecasting. The result analysis shows that the proposed Bi-GRU model achieves an R2 value of 0.98 and 0.99, a Mean Absolute Error (MAE) of 0.05 and 0.07, and a Mean Square Error (MSE) of 0.04 and 0.03 on the Rossmann and Walmart datasets, respectively. The proposed method supports retail sales forecasting by achieving superior results over the conventional models.


Introduction
Retail sales forecasting is one of the important factors in E-commerce companies due to economic hardship and strong competition [1]. Good sales forecasting enables an efficient production plan, increases the revenue from sales, and improves customer satisfaction. A timely and effective forecasting model is an indispensable and crucial tool for handling the inventory level [2,3], whereas an inappropriate forecasting model results in insufficient or redundant stock that directly affects competitive advantage and income [4]. In recent decades, researchers have introduced many analytical and statistical regression models for resolving the concerns of retail sales forecasting [5,6]. The existing models are generally categorized into two types: artificial intelligence models and classical regression models. Some of the classical regression models are exponential smoothing, weighted average, linear regression, moving average, Support Vector Regression (SVR), eXtreme Gradient Boosting (XGBoost), etc. [7,8]. On the other hand, the common statistical models employed in retail sales forecasting are Seasonal Auto-Regressive Integrated Moving Average (SARIMA) and Auto-Regressive Integrated Moving Average (ARIMA) [9,10].
Currently, artificial intelligence models, such as neural networks, fuzzy systems, expert systems, and hybrid models, are widely adopted for retail sales forecasting [11,12]. In artificial intelligence, neural networks are extensively utilized in sales forecasting because of their strong capability and flexibility in data mining. Additionally, hybrid models integrate the benefits of dissimilar approaches to improve forecasting performance [13,14]. Most of the existing studies integrate artificial intelligence and classical models to achieve better performance in retail sales forecasting [15]. However, E-commerce companies face distinctive issues in ensuring a precise forecast: (i) online sales are more complex and dynamic than offline sales, (ii) marketing behavior greatly influences the customers on E-business platforms, and (iii) an enormous amount of data must be collected to enhance the performance of retail sales forecasting. To achieve better retail sales forecasting and to reduce the time complexity, an effective automated regression model is implemented in this manuscript.
The contributions are listed below:
• Performed outlier removal, interpolation of missing data, normalization, and de-normalization on the acquired datasets. The forecasting accuracy is significantly improved by excluding outliers and interpolating missing data. The outlier removal aims to eliminate rare or unexpected instances in the datasets, and the missing data are replaced with a more typical interpolated value to avoid problems like bias and loss of power. Additionally, data integrity is improved by performing normalization and de-normalization using the Z-score normalization technique.
• Integrated the APSO algorithm, RFE, and MRMR for optimizing features in the acquired Rossmann and Walmart datasets. The feature engineering process minimizes the number of features in the pre-processed datasets and, in turn, decreases the computational complexity and processing time, alongside improving the performance of the regression model. The selection of active features decreases the complexity to linear, and the regression model consumes a minimal processing time of 20.11 s and 30.12 s on the Rossmann and Walmart datasets, respectively.
• The selected active features are passed to the Bi-GRU model for precise retail sales forecasting. Compared to other regression models, the Bi-GRU model consumes limited memory and is faster in data processing. The effectiveness of the proposed regression model is analyzed based on six evaluation measures: Coefficient of determination (R2), MSE, Normalized Deviation (ND), MAE, Root Mean Square Scale Error (RMSSE), and Normalized Root Mean Square Error (NRMSE).
The related papers on the topic "sales forecasting" are reviewed in Section 2. The details about the methodology, simulation analysis, and conclusion are discussed in Sections 3-5, respectively.

He et al. [16] integrated the PSO algorithm with a Long Short-Term Memory (LSTM) network for precise sales forecasting related to e-commerce companies. Here, the PSO algorithm was implemented for optimizing the number of iterations and the hidden neurons in the LSTM layers. The experiments conducted on real and online benchmark datasets demonstrated the effectiveness of the presented forecasting model over nine comparative models. The LSTM network was effective in sales forecasting but required more training samples to learn efficiently. Ji et al. [17] and Massaro et al. [18] employed the XGBoost 1.0.0 model for effective sales forecasting by using commodities sales features and data series. Here, sales forecasting was performed for optimizing inventory management by considering the relevant factors, like transportation time, delivery time, order lead-time, and inventory cost. Generally, the XGBoost model included regularization methods to prevent overfitting, which would otherwise lead to erroneous predictions on new sales data. In addition, the XGBoost model was memory intensive, particularly for large real-time datasets.
Wong and Guo [19] adopted an effective learning algorithm for accurate sales forecasting that incorporated the Extreme Learning Machine (ELM) and the Improved Harmony Search Algorithm (IHSA). The incorporation of ELM and IHSA enhanced the generalization ability of the network. The experiments performed on benchmark and real-time fashion retail datasets revealed that the presented model achieved superior performance compared to traditional neural networks. On the other hand, the incorporation of ELM and IHSA with the learning algorithm increased the time complexity of the system.
Loureiro et al. [20] explored the usage of a deep learning model (DLM) in fashion retail sales forecasting by considering a diverse and wide set of data. Here, the DLM performance was compared with other shallow models, and the numerical analysis showed that the DLM results were superior to those of the shallow models in terms of different evaluation measures. Nonetheless, the DLM was computationally costly and needed a higher-end system to deal with a diverse and wide set of data.
Lu [21] combined the SVR model with a variable selection approach for precise sales forecasting. This study forecasted the weekly sales of computer products like display cards, hard disks, main boards, notebooks, and liquid crystal displays. In addition, Luo et al. [22] integrated the LSTM network with an Extreme Deep Factorization Machine (XDeepFM) model for retail sales forecasting by effectively exploring the correlation between the sales factors. The LSTM network was employed for correcting the residuals to enhance the accuracy of the forecasting model. The numerical evaluation showed that the presented forecasting model was superior to other traditional forecasting models. As discussed in the earlier literature, the LSTM network was precise in sales forecasting but needed enormous numbers of training samples to learn efficiently.
Zhang et al. [23] integrated a Moth-Flame Optimization (MFO) algorithm with the ELM to predict the transaction volume of e-commerce. The use of a traditional MFO algorithm significantly improved the convergence accuracy but demanded a high execution time. Weng et al. [24] designed a new framework for precisely forecasting supply-chain sales; the framework incorporated the LSTM network with a Light Gradient Boosting Machine (LightGBM) model. The framework's performance was validated on three supply-chain sales datasets, and the obtained results demonstrated its efficacy over other traditional models in terms of accuracy, but at the cost of high time complexity.
Shilong [25] presented an accurate and efficient sales forecasting framework based on machine-learning models. At first, the vectors were extracted from the historical sales dataset by conducting feature engineering, and then the XGBoost model was implemented for forecasting future sales. The experiments performed on the online Walmart retail goods dataset showed that the presented framework obtained a more precise sales prediction than the existing models with limited memory resources and computing time. Nevertheless, the XGBoost model faced two major issues in retail forecasting, namely overfitting and outliers.
Punia et al. [26] integrated a random forest with the LSTM network for accurate retail-demand forecasting. The presented framework effectively analyzed complex relationships of both the regression and temporal types, which resulted in better prediction than the existing forecasting models. A multivariate real-time dataset was employed for analyzing the performance of the presented framework in light of different measures, such as bias, variance, and prediction accuracy. On the multivariate real-time dataset, the random forest overfitted on noisy regression tasks and, additionally, faced difficulty in handling categorical variables.
Karmy and Maldonado [27] suggested the SVR model for retail sales forecasting on hierarchical time-series data. Three models were formulated (middle-out, top-down, and bottom-up SVR) for forecasting retail sales related to the travel industry. The SVR model generally underperformed when the number of training samples was lower than the number of features in every data point.
Kilimci et al. [28] integrated eleven forecasting models that comprised DLMs, the SVR model, and time-series approaches for effective retail-demand forecasting. In addition, a novel decision strategy was implemented, inspired by ensemble models based on the concept of boosting. The presented system's performance was tested on a real-time dataset and evaluated by means of different error rates. The integration of DLMs, the SVR model, and time-series approaches increased the system complexity, which was one of the main issues in this literature.
Kohli et al. [29] introduced a new sales prediction framework based on K-Nearest Neighbor (KNN) and linear regression models to formulate commendable decisions and find potential risks in businesses. The efficacy of the presented framework was tested on the online Rossmann dataset by means of different error rates. As mentioned in the previous literature, the integration of two machine-learning models generally increased the system complexity and processing time.
From the overall analysis of the above-stated literature, we noted that the state-of-the-art models ELM, XDeepFM, LightGBM, XGBoost, LSTM, DLM, SVR, and KNN were applied for precise sales forecasting related to e-commerce companies. The integration of the above-stated learning models normally increases the system complexity and processing time. Furthermore, aggregated forecasting assists strategic decision making on the position which frequently relates to the operational decisions at the store level. These aspects affect the demand and promotional data and substantially increase the complexity, and, subsequently, the forecasters possibly face the dimensionality issue of too many constraints and not enough data. To address the aforementioned problems and to achieve better retail sales forecasting, an effective Bi-GRU model is proposed in this manuscript, which is described in the following sections.

Methods
Future sales forecasting is an important factor in retail business for identifying potential risks and making appropriate decisions. Precise demand forecasting is vital in organizing and planning the labor force, transportation, purchasing, and production of a successful and profitable retail business [30]. The proposed automated regression model for retail sales forecasting includes three important steps, which are listed below and graphically specified in Figure 1:
• Retail data pre-processing: interpolating missing data, outlier removal, normalization, and de-normalization.
• Feature engineering: MRMR, RFE, and the APSO algorithm.
• Retail sales forecasting: the Bi-GRU model.

Retail Data Pre-Processing
After acquiring the retail sales data from the Rossmann and Walmart datasets, data pre-processing was carried out to eliminate unreliable samples for better forecasting. In this scenario, the data pre-processing step included the processes of interpolating missing data, outlier removal, normalization, and de-normalization [31,32].
In the initial process, identifying outliers is important in statistics and data analysis because outliers have a significant impact on the results of statistical analyses. Removing the outliers involves excluding data points that deviate significantly from the norm to enhance the model's accuracy and generalization on new data. The outliers are the instances that deviate strongly from other instances in the Rossmann and Walmart datasets. In this study, the instances, $s_{i,j}$, are considered as outliers in the time-series data, $S_j$, and are eliminated from $S_j$ when the following condition is satisfied: $\mathrm{abs}(s_{i,j} - \mathrm{mean}(S_j)) > n \times \mathrm{std}(S_j)$. Here, the absolute value function is $\mathrm{abs}(\cdot)$, the mean function is $\mathrm{mean}(\cdot)$, $n = 3$, and the standard deviation function is $\mathrm{std}(\cdot)$. After the elimination of outliers, the missing data are interpolated because they affect the validity and generalizability of the data, which depend on the quantity, category, and pattern. The missing data are supplied by computing the mean value of the two nearest neighboring data points.
Following that, the data normalization is performed by implementing the z-score normalization technique, where the normalization process speeds up the training of the Bi-GRU model. The key concept of normalization is to remove potential biases and distortions that occur because of dissimilar feature scales. The z-score normalization technique normalizes the input and the output variables based on Equation (1), and the de-normalization is mathematically formulated in Equation (2).
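The pre-processing chain described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, missing values are assumed to be encoded as `None`, the population standard deviation is assumed for $\mathrm{std}(\cdot)$, and the z-score and its inverse are assumed to take the usual forms $z = (x - \mu)/\sigma$ and $x = z\sigma + \mu$, since Equations (1) and (2) are not reproduced in this text.

```python
from statistics import mean, pstdev


def remove_outliers(series, n=3.0):
    """Replace values with abs(s - mean) > n * std by None (treated as missing)."""
    observed = [s for s in series if s is not None]
    mu, sigma = mean(observed), pstdev(observed)
    return [None if s is not None and abs(s - mu) > n * sigma else s
            for s in series]


def interpolate_missing(series):
    """Fill each gap with the mean of its nearest observed neighbours."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = next((series[j] for j in range(i - 1, -1, -1)
                     if series[j] is not None), None)
        right = next((series[j] for j in range(i + 1, len(series))
                      if series[j] is not None), None)
        neighbours = [v for v in (left, right) if v is not None]
        if not neighbours:
            raise ValueError("series contains no observed values")
        # At a boundary only one neighbour exists, so its value is reused.
        filled[i] = mean(neighbours)
    return filled


def z_score(series):
    """Assumed form of Equation (1): z = (x - mean) / std."""
    mu, sigma = mean(series), pstdev(series)
    return [(s - mu) / sigma for s in series], mu, sigma


def de_normalize(z_values, mu, sigma):
    """Assumed form of Equation (2): x = z * std + mean."""
    return [z * sigma + mu for z in z_values]


# Hypothetical weekly sales with one gap and one extreme value.
sales = [10.0, 12.0, 11.0, None, 13.0, 12.0, 11.0, 10.0, 12.0, 13.0,
         11.0, 12.0, 10.0, 11.0, 13.0, 12.0, 11.0, 10.0, 12.0, 500.0]
cleaned = remove_outliers(sales)
filled = interpolate_missing(cleaned)
normalized, mu, sigma = z_score(filled)
restored = de_normalize(normalized, mu, sigma)
```

Note that with $n = 3$ a single extreme value can only be flagged when the series is long enough, because the largest possible z-score in a sample of $N$ points is bounded by $\sqrt{N - 1}$.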

Feature Engineering
After pre-processing the Rossmann and Walmart datasets, the feature selection is initially accomplished by employing the MRMR and RFE techniques. Generally, the feature selection process reduces the feature sets by choosing the most active features or eliminating the redundant features. MRMR is chosen because it identifies the relevant features and minimizes the redundancy, thus enhancing the accuracy. Therefore, MRMR outperforms the other techniques in terms of accuracy and the number of supportive features. MRMR is a feature measurement criterion; it computes the redundancy and correlation between the features based on mutual information. Here, the MRMR technique performs feature selection by following two conditions, namely maximal relevance ($\max R$) and minimal redundancy ($\min D$), which are mathematically specified in Equations (3) and (4). Moreover, $m(x, y)$ denotes the mutual information, and $f$ denotes a set of features. The mathematical expression of the mutual information, $m(x, y)$, is represented in Equation (5).
$$m(x, y) = \iint q(x, y) \log \frac{q(x, y)}{q(x)\, q(y)} \, dx \, dy \quad (5)$$
On the other hand, RFE is one of the effective feature selection techniques, which fits the model and eliminates irrelevant features until the discriminative active features are selected [33]. RFE is selected here because it efficiently reduces the dimensionality of high-dimensional datasets by removing redundant features, which lowers the computational and storage requirements. The RFE technique offers three major benefits: (i) complete elimination of irrelevant information in the data, (ii) ease in performing data visualization, and (iii) limited computational power requirements. By considering Equations (3) and (4), Equation (6) is generated. The RFE technique efficiently selects the optimal feature subsets using Equation (6).
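To make the mutual-information criterion concrete, the sketch below estimates a discrete counterpart of Equation (5) from value counts and uses it in a greedy relevance-minus-redundancy selection. It is an illustration under stated assumptions rather than the paper's procedure: the features are assumed to be discrete (or already binned), the greedy "difference" scoring is one common way of combining maximal relevance and minimal redundancy and is not taken from Equations (3), (4), or (6), and the names `mutual_information` and `mrmr_select` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Discrete estimate: sum over (a, b) of q(a, b) * log(q(a, b) / (q(a) q(b)))."""
    n = len(x)
    joint = Counter(zip(x, y))
    qx, qy = Counter(x), Counter(y)
    return sum((c / n) * math.log((c / n) / ((qx[a] / n) * (qy[b] / n)))
               for (a, b), c in joint.items())


def mrmr_select(features, target, k):
    """Greedily pick k feature names, trading relevance against redundancy.

    features: dict mapping a feature name to a list of discrete values.
    """
    selected, remaining = [], list(features)

    def score(name):
        relevance = mutual_information(features[name], target)
        if not selected:
            return relevance
        redundancy = sum(mutual_information(features[name], features[s])
                         for s in selected) / len(selected)
        return relevance - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: "a_copy" duplicates "a", so it adds no new information.
a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
target = [2 * ai + bi for ai, bi in zip(a, b)]
chosen = mrmr_select({"a": a, "a_copy": list(a), "b": b}, target, k=2)
```

In this toy case the duplicate feature is as relevant as the original, but its redundancy with the already selected feature cancels that relevance, so the independent feature is preferred in the second round.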
In addition to this, feature optimization is carried out utilizing the APSO algorithm, which selects active features from the pre-processed Rossmann and Walmart datasets. The APSO algorithm is applied here because it selects the algorithmic parameters at run time to enhance the exploitation efficiency, and it performs a global search over the entire search space with a higher convergence speed. As discussed in the previous sections, this process decreases the complexity of the Bi-GRU model and its processing time. The conventional PSO algorithm is one of the effective metaheuristic-based optimization algorithms [34] that generally mimics the behavior of bird flocking and fish schooling [35]. The velocity and the position of the particles are updated in the PSO algorithm by Equations (7) and (8).
where the global best positions of the particles are represented as $p_{gd}$; the present best positions of the particles are denoted as $p_{id}$; the random numbers are indicated as $r_1$ and $r_2$; the acceleration coefficients are denoted as $c_1$ and $c_2$; the inertia weight is represented as $I_w$, used for balancing the local and the global searches; and the iteration number is indicated as $n$.
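As a rough illustration of the velocity and position updates, the following sketch applies one iteration of the canonical PSO rule, $v \leftarrow I_w v + c_1 r_1 (p_{id} - x) + c_2 r_2 (p_{gd} - x)$ followed by $x \leftarrow x + v$. This canonical form is assumed here because Equations (7) and (8) are not reproduced in this text; pairing $c_1$ with $p_{id}$ and $c_2$ with $p_{gd}$ follows the usual convention and is likewise an assumption, as are the function name `pso_step` and the inertia value used in the example.

```python
import random


def pso_step(positions, velocities, personal_best, global_best,
             inertia, c1, c2, rng=random):
    """Return updated (positions, velocities) after one PSO iteration."""
    new_positions, new_velocities = [], []
    for x, v, p_best in zip(positions, velocities, personal_best):
        x_next, v_next = [], []
        for d in range(len(x)):
            r1, r2 = rng.random(), rng.random()  # fresh random numbers per dimension
            vel = (inertia * v[d]
                   + c1 * r1 * (p_best[d] - x[d])        # pull towards p_id
                   + c2 * r2 * (global_best[d] - x[d]))  # pull towards p_gd
            v_next.append(vel)
            x_next.append(x[d] + vel)
        new_positions.append(x_next)
        new_velocities.append(v_next)
    return new_positions, new_velocities


# One particle in two dimensions; inertia=0.7 is an illustrative placeholder.
positions, velocities = pso_step(
    positions=[[0.0, 0.0]], velocities=[[0.0, 0.0]],
    personal_best=[[1.0, 1.0]], global_best=[1.0, 1.0],
    inertia=0.7, c1=2.0, c2=3.0, rng=random.Random(0))
```

A particle that already sits on both best positions with zero velocity does not move, which is a quick sanity check for any implementation of this update.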
The APSO algorithm optimizes the features based on the Adaptive Uniform Mutation (AUM) function from the Human Group Optimization (HGO), where the particle's positions (features) are denoted as $p_i(n) = (p_{i,1}, p_{i,2}, \ldots, p_{i,D})$. The AUM function extends the ability of feature optimization in the exploration phase. Additionally, a nonlinear function, $p_m$, is employed for controlling the mutation range and decision in each particle. The nonlinear function, $p_m$, is updated in each iteration by performing Equation (9).
As the iteration number increases, the nonlinear function, $p_m$, tends to decrease; the maximum number of iterations is represented as $M_I$. The mutation randomly selects the active features from the datasets when the nonlinear function, $p_m$, is higher than a random number, which ranges between zero and one. The selected active features from the Rossmann and Walmart datasets are finally passed to the Bi-GRU model for retail sales forecasting. The APSO algorithm terminates when it reaches the maximum number of iterations (100).
The parameters considered in the APSO algorithm are as follows: the cognitive constant, $c_2$, is two; the social constant, $c_1$, is three; the size of the population is 100; and the number of iterations is 100. The features selected in the Walmart dataset are the date, Consumer Price Index (CPI), fuel prices, store, weekly_sales, and holiday_flag. Correspondingly, the features selected in the Rossmann dataset are the day of the week, open promo, customers, sales, and store number. The architecture of the APSO algorithm is shown in Figure 2.
Telecom 2024, 5, FOR PEER REVIEW 7
The step-by-step procedure of the APSO algorithm is specified as follows.
Step 1: The swarm particles, size, location, objective, and number of iterations are initialized, and the non-dominated solutions are saved into the archive.
Step 2: The Pareto domination relation is applied to update $p_{id}$.
Step 3: Because there are multiple solutions, $p_{gd}$ is chosen from the archive. First, the crowding distance is estimated, and then a binary tournament is used to choose $p_{gd}$.
Step 4: The decision value, which depends on $p_{gd}$, is then reset. Each value of the feature vector is treated as a binary value.
Step 5: Depending on step 4, the particle's position and velocity are updated.
Step 6: Uniform mutation is accomplished.
Step 7: Then, the external archive is updated by means of the crowding distance.
Step 8: Termination process: If the proposed method reaches the maximum number of iterations, the process is stopped; otherwise, step 2 is repeated. Therefore, the worst particles are removed by HGO.
After choosing the optimal features from the Rossmann dataset (day of week, open promo, customers, sales, and store number) and the Walmart dataset (date, CPI, fuel prices, store, weekly_sales, and holiday_flag) using MRMR, RFE, and APSO, forecasting is performed using Bi-GRU, which is described in the following section.
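The uniform mutation in step 6 can be sketched on a binary feature mask as follows. This is a loose illustration only: Equation (9) is not reproduced in this text, so the decay used for $p_m$ below is a hypothetical placeholder that merely decreases nonlinearly with the iteration number, and flipping a single randomly chosen bit is an assumed simplification of the mutation range. The function names are also hypothetical.

```python
import random


def mutation_probability(iteration, max_iterations, power=2.0):
    """Placeholder for p_m: decreases nonlinearly from 1 to 0 over the run.

    The true form is given by Equation (9) of the paper, which is not
    reproduced here; this decay is an assumption for illustration.
    """
    return (1.0 - iteration / max_iterations) ** power


def uniform_mutation(mask, iteration, max_iterations, rng):
    """Flip one randomly chosen feature bit when p_m exceeds a random number."""
    p_m = mutation_probability(iteration, max_iterations)
    mutated = list(mask)
    if p_m > rng.random():  # rng.random() lies in [0, 1)
        d = rng.randrange(len(mask))
        mutated[d] = 1 - mutated[d]
    return mutated


rng = random.Random(42)
mask = [1, 0, 1, 1, 0]  # 1 = feature kept, 0 = feature dropped
early = uniform_mutation(mask, iteration=0, max_iterations=100, rng=rng)
late = uniform_mutation(mask, iteration=100, max_iterations=100, rng=rng)
```

With this placeholder, mutation is certain at the first iteration and impossible at the last one, which mirrors the described shift from exploration to exploitation as the iteration count approaches $M_I$.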

Retail Sales Forecasting
The optimal features selected by MRMR, RFE, and APSO on the Rossmann and Walmart datasets are given as input to the Bi-GRU model for effective forecasting of retail sales. The Bi-GRU model has update and reset gates for performing sales forecasting, which also decrease the gradient dispersion and computational loss, while enabling the capability for shorter- and longer-term memory [36,37]. Also, Bi-GRU comprises a smaller number of parameters because it does not have a forget gate, which makes it computationally efficient, less prone to overfitting, and a suitable option for smaller datasets.
In the Bi-GRU model, the input and the forget gates of the LSTM network are replaced by the update gate, $d_{TS}$. The update gate helps the model in determining the past information that needs to be passed along with the future information. This process reduces the vanishing-gradient problem in the Bi-GRU model. The update gate, $d_{TS}$, is mathematically specified in Equation (10).
where the weight matrix is represented as $W_d$; the bias matrix is denoted as $b_d$; the input matrix (selected features) at the time step, $TS$, is indicated as $f_{TS}$; the sigmoid activation function is denoted as $\sigma$; and the hidden state at the previous time step, $TS - 1$, is indicated as $h_{TS-1}$. In the Bi-GRU model, the reset gate, $p_{TS}$, is utilized for controlling the historical time-series data and is responsible for the network's shorter-term memory in the hidden state. The reset gate, $p_{TS}$, is numerically expressed in Equation (11).
where the bias matrix and the weight matrix of the reset gate ($p_{TS}$) are denoted as $b_p$ and $W_p$. Then, the candidate of the hidden state, $\tilde{h}_{TS}$, is specified in Equation (12).
where the hyperbolic tangent activation function is represented as $\tanh$, the element-wise multiplication operation is denoted as $\odot$, and the bias matrix and weight matrix of the memory cell state are correspondingly denoted as $b_h$ and $W_h$. The output, $h_{TS}$, is obtained by linearly interpolating $\tilde{h}_{TS}$ and $h_{TS-1}$, and this process is indicated in Equation (13). The Bi-GRU model's architecture is shown in Figure 3.
Appropriate feature engineering for the Bi-GRU model is needed for retail sales forecasting to extract the implicit vectors and complex variances in the historical sequence data. The traditional GRU model extracts feature information only in the forward direction and ignores the backward historical time-series data. So, an adaptive Bi-GRU model was implemented in this study for precise retail sales forecasting. The proposed Bi-GRU can process several inputs more proficiently than the conventional models because it studies the input from both directions concurrently.
The proposed regression model extracts the knowledge between the variables from the forward and backward directions, as shown in Figure 3. In the Bi-GRU model, the forward GRU extracts prior information in the historical time-series data, and the backward GRU extracts future information in the historical time-series data. The numerical expression of the Bi-GRU output, $O_{TS}$, is specified in Equation (14).
where $A$ denotes the function that combines the outputs of the backward and forward directions, performing operations such as multiplication, averaging, or summation. In addition, the hidden states of the backward and forward GRUs are denoted as $\overleftarrow{h}$ and $\overrightarrow{h}$.
The parameters considered in the Bi-GRU model are as follows: the look-back is eight, the number of neurons is 80, the dropout rate is 0.5, the batch size is 50, the loss function is MSE loss, the optimizer is Adam, and the learning rate is 0.0001. The numerical results of the proposed regression model are specified in Section 4.
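The gate computations and the bi-directional combination can be sketched in plain Python as below. This is a minimal sketch under stated assumptions, not the trained model: Equations (10)-(14) are not reproduced in this text, so the standard GRU formulation is assumed, with gates computed from the concatenation $[h_{TS-1}, f_{TS}]$ and the output taken as $h_{TS} = (1 - d_{TS}) \odot h_{TS-1} + d_{TS} \odot \tilde{h}_{TS}$ (some references swap the roles of $d_{TS}$ and $1 - d_{TS}$). The parameter dictionary layout and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, v):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def gru_step(f_t, h_prev, params):
    """One GRU step: update gate d, reset gate p, candidate state, new state."""
    concat = h_prev + f_t
    d = [sigmoid(z) for z in affine(params["W_d"], params["b_d"], concat)]
    p = [sigmoid(z) for z in affine(params["W_p"], params["b_p"], concat)]
    gated = [pi * hi for pi, hi in zip(p, h_prev)] + f_t  # reset gate scales h_prev
    h_cand = [math.tanh(z) for z in affine(params["W_h"], params["b_h"], gated)]
    # Linear interpolation between the previous state and the candidate.
    return [(1.0 - di) * hp + di * hc for di, hp, hc in zip(d, h_prev, h_cand)]


def run_gru(sequence, params, hidden_size):
    """Return the hidden state after every time step, starting from zeros."""
    h, states = [0.0] * hidden_size, []
    for f_t in sequence:
        h = gru_step(f_t, h, params)
        states.append(h)
    return states


def bi_gru(sequence, fwd_params, bwd_params, hidden_size, combine):
    """Run a forward and a backward GRU and merge their states with `combine`."""
    forward = run_gru(sequence, fwd_params, hidden_size)
    backward = run_gru(sequence[::-1], bwd_params, hidden_size)[::-1]
    return [combine(hf, hb) for hf, hb in zip(forward, backward)]


def summation(hf, hb):
    """One possible choice for the combining function A."""
    return [a + b for a, b in zip(hf, hb)]


# Toy parameters: one hidden unit, one input feature, zero weights, and a
# large candidate bias so the candidate state saturates near 1.
toy = {"W_d": [[0.0, 0.0]], "b_d": [0.0],
       "W_p": [[0.0, 0.0]], "b_p": [0.0],
       "W_h": [[0.0, 0.0]], "b_h": [10.0]}
outputs = bi_gru([[0.3], [0.1], [0.7]], toy, toy, hidden_size=1, combine=summation)
```

Reversing the input for the backward pass and then reversing its states again keeps both directions aligned on the same time step before they are combined.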

Complexity and Convergence Analysis
The complexity of the method for creating a composite particle is calculated from the number of Euclidean distance evaluations, $T(N)$, among the particles. It is assumed that $N$ particles are arranged in a list, which is denoted as $L_I$. Consequently, the estimation of Euclidean distances is essential for this process: $T(N)$ is formulated as Equation (15).
The worst-case time complexity of the procedure is $O(N^2)$. PSO therefore incurs an $O(N^2)$ cost, which stems from evaluating $T(N)$. In spite of this, the partition technique inherently considers both distance and fitness, which provides an efficient way to make full use of the included particles. Hence, it is worthwhile to improve the adaptivity of PSO by incorporating another swarm intelligence algorithm for solving the optimization problems. To address the complexity, $O(N)$, and the convergence speed of the algorithm, a Human Group Optimization (HGO) algorithm is integrated with the PSO algorithm to influence the particles' positions. The HGO algorithm employs an AUM function, $d_i$, to improve the convergence speed, which eases the implementation process of PSO.
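The quadratic cost can be seen by counting distance evaluations directly. The sketch below assumes that every unordered pair of particles is evaluated exactly once, which gives N(N-1)/2 evaluations; this is an illustrative assumption consistent with the stated O(N^2) bound, not a reproduction of Equation (15). The function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Tuple

Particle = Sequence[float]


def euclidean(p: Particle, q: Particle) -> float:
    """Euclidean distance between two particle positions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pairwise_distances(particles: Sequence[Particle]) -> Dict[Tuple[int, int], float]:
    """Evaluate the distance for every unordered pair of particles once."""
    return {
        (i, j): euclidean(p, q)
        for (i, p), (j, q) in combinations(enumerate(particles), 2)
    }


def evaluation_count(n: int) -> int:
    """Number of pairwise evaluations for n particles (assumed all-pairs scheme)."""
    return n * (n - 1) // 2
```

Doubling the swarm from 10 to 20 particles raises the count from 45 to 190 evaluations, roughly a fourfold increase, which is the behavior an O(N^2) bound describes.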

Results and Discussion
The proposed regression model was implemented in Python 3.7 with the SciPy, Keras, TensorFlow, scikit-learn, Matplotlib, NumPy, Polars, and Pandas libraries. The model was analyzed on a system with an 8 TB hard disk, a Linux operating system, 128 GB RAM, and a 10th-generation Intel Core i7 processor. This equipment was sourced from SATA manufacturer, Mumbai, India. The efficacy of the Bi-GRU model was analyzed on the Rossmann and Walmart datasets by means of NRMSE, ND, RMSSE, MAE, R2, and MSE.

Evaluation Measures
The proposed Bi-GRU model's effectiveness was validated by using six evaluation measures, namely NRMSE, ND, RMSSE, MAE, R2, and MSE. NRMSE is one of the crucial measures in analyzing the proposed Bi-GRU model for sales forecasting; particularly, it is preferred while performing outlier removal. In NRMSE, the spread is the difference between the maximum, $y_{\max}$, and the minimum, $y_{\min}$, values in the training dataset. The error measure ND accounts for the scale difference between the actual and predicted values. The evaluation measures NRMSE and ND are mathematically formulated in Equations (16) and (17).
where $N$ represents the number of data points, and $y$ and $\hat{y}$ denote the actual and predicted values. The error measures MSE, MAE, and RMSSE determine the average squared difference, the mean absolute difference, and the scaled root mean squared error between the actual and the predicted values, respectively. They are mathematically specified in Equations (18)-(20), where $q$ denotes the period. R2 is a number ranging between zero and one; it measures how well the proposed Bi-GRU model predicts the outcomes and is numerically specified in Equation (21).
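The six measures can be sketched in a few lines. The definitions below follow the standard forms of these metrics and the verbal descriptions in this section (NRMSE scaled by the training spread, RMSSE scaled by lag-q differences of the training series); they are assumptions and may differ in detail from Equations (16)-(21). All function and argument names are hypothetical.

```python
import math
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def nrmse(y_true, y_pred, y_train) -> float:
    """RMSE divided by the spread (max - min) of the training data."""
    spread = max(y_train) - min(y_train)
    return math.sqrt(mse(y_true, y_pred)) / spread


def nd(y_true, y_pred) -> float:
    """Normalized deviation: total absolute error over total absolute actuals."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / sum(abs(a) for a in y_true)


def rmsse(y_true, y_pred, y_train, q: int = 1) -> float:
    """RMSE scaled by the mean squared lag-q difference of the training series."""
    diffs = [(y_train[t] - y_train[t - q]) ** 2 for t in range(q, len(y_train))]
    scale = sum(diffs) / len(diffs)
    return math.sqrt(mse(y_true, y_pred) / scale)


def r2(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - mean_true) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot
```

Note that under this definition R2 can fall below zero for a model that fits worse than the mean of the actual values, so the zero-to-one range applies to models that do at least as well as that baseline.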

Dataset Description
The proposed Bi-GRU model's efficacy was validated on the Rossmann and Walmart datasets. The Walmart dataset was released in 2014, and it contains the weekly sales data of 77 departments and 45 stores. The Walmart dataset includes features like unemployment (prevailing unemployment rate), temperature (temperature on a sale day), date (week of sales), CPI (prevailing CPI), fuel prices (fuel cost in a region), store (store number), weekly_sales (sales of a given store), and holiday_flag (whether the week is a non-holiday week, holiday week, or special holiday week). This historical dataset covers sales from 5 February 2010 to 1 November 2012, and it is available from https://www.kaggle.com/datasets/yasserh/walmart-dataset (accessed on 12 December 2023).
On the other hand, the Rossmann dataset was released in 2015, and it has the sales data of 1115 stores. This dataset includes seven features: day of the week, school holiday, state holiday, open promo, customers, sales, and store number. The Rossmann dataset is available from https://www.kaggle.com/competitions/rossmann-store-sales/data (accessed on 12 December 2023). The data statistics of the Rossmann and Walmart datasets are given in Table 1.

Quantitative Analysis
When managing complex sales prediction problems, several techniques are combined to complement the benefits of deep learning models and to enhance accuracy. Deep learning models are often selected because of their high precision in solving complex problems. In the proposed Bi-GRU model, the forward GRU extracts prior information from the historical time-series data, and the backward GRU extracts future information from the same data, in order to reduce the error measures (NRMSE, ND, RMSSE, MAE, and MSE) and raise R2 for sales forecasting, as shown in the tables below.
The efficacy of the Bi-GRU model was therefore validated on the Rossmann and Walmart datasets. The numerical analysis of different models on these datasets is given in Tables 2 and 3, where the Bi-GRU model is compared with other regression models, namely linear regression, LSTM, Bi-Directional LSTM (Bi-LSTM), and GRU, by means of NRMSE, ND, RMSSE, MAE, MSE, and R2. Table 2 shows that the proposed Bi-GRU model obtains a high R2 value of 0.98 and a minimum NRMSE, ND, RMSSE, MAE, and MSE of 0.08, 0.07, 0.08, 0.05, and 0.04, respectively, on the Rossmann dataset. As seen in Table 3, the Bi-GRU model achieves a high R2 value of 0.99 and a minimum NRMSE, ND, RMSSE, MAE, and MSE of 0.01, 0.03, 0.08, 0.07, and 0.03, respectively, on the Walmart dataset. Compared to the other regression models (linear regression, LSTM, Bi-LSTM, and GRU), the Bi-GRU model utilizes limited memory and is faster in regard to data processing. The Bi-GRU model is more precise when the datasets contain longer sequences. In addition, the Bi-GRU model effectively addresses the problems of overfitting and vanishing gradients.
In addition to the assessment of the individual models, a statistical test was used to compare the performance of all the models. For that reason, the Friedman test, a nonparametric statistical test, was applied in this research. The Friedman test uses the ranks of the data rather than the data themselves and tests the null hypothesis that the column effects are all identical; in other words, that all models perform comparably. In this test, the probability of obtaining the observed sample outcomes under the null hypothesis (the p-value) is computed as a scalar value in the range [0, 1]. Small p-values cast doubt on the null hypothesis. The Friedman test yields p = 0.0071, which indicates that the performance differences among the models are statistically significant and supports the superior performance of the proposed Bi-GRU model.
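The rank-based statistic behind the Friedman test can be sketched as follows. This is a minimal illustration under stated assumptions: each row holds the scores of all compared models on one evaluation block, ties receive average ranks, and no tie correction is applied to the statistic. The function names are hypothetical, and no values from the paper's experiments are used here. In practice, `scipy.stats.friedmanchisquare` computes both the statistic and the p-value.

```python
from typing import List, Sequence


def rank_row(values: Sequence[float]) -> List[float]:
    """Rank one block's scores (1 = smallest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2.0 + 1.0
        for idx in order[pos:end + 1]:
            ranks[idx] = average_rank
        pos = end + 1
    return ranks


def friedman_statistic(table: Sequence[Sequence[float]]) -> float:
    """Friedman chi-square statistic for n blocks (rows) and k models (columns)."""
    n = len(table)
    k = len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, rank in enumerate(rank_row(row)):
            rank_sums[j] += rank
    sum_sq = sum(r ** 2 for r in rank_sums)
    return 12.0 / (n * k * (k + 1)) * sum_sq - 3.0 * n * (k + 1)
```

Under the null hypothesis the statistic approximately follows a chi-square distribution with k - 1 degrees of freedom, so a large value corresponds to a small p-value.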
In addition, the numerical analysis of different feature engineering techniques with the Bi-GRU model on the Rossmann and Walmart datasets is presented in Table 5. Table 5 shows that the Bi-GRU model with the combined feature engineering techniques (APSO+RFE+MRMR) obtains superior retail sales forecasting results compared to the other combinations by means of NRMSE, ND, RMSSE, MAE, R2, and MSE.
The active features selected by APSO+RFE+MRMR offer four major benefits: (i) they improve the Bi-GRU model's performance by letting it learn meaningful patterns in the acquired datasets; (ii) they reduce the risk of overfitting; (iii) they improve computational efficiency, since the transformed features need fewer computational resources; and (iv) they improve the Bi-GRU model's interpretability. As mentioned in the earlier sections, the selection of active features reduces the processing time of the Bi-GRU model to 20.11 s and 30.12 s on the Rossmann and Walmart datasets, respectively, and reduces the model's complexity to linear.

Comparative Analysis
The comparative study between the existing regression models and the proposed Bi-GRU model is detailed in this section. Ozyegen et al. [38] analyzed the performance of three existing regression models in retail sales forecasting, namely LSTM, Gradient Boosted Regressor (GBR), and Time Delay Neural Network (TDNN). In that study, the efficacy of the existing regression models was analyzed by applying them to the Rossmann and Walmart datasets. The experiments show that the GBR model obtains the lowest mean NRMSE and ND values of the three regression models. The GBR model has a mean NRMSE of 0.12 and 0.01 and an ND of 0.20 and 0.10 on the Rossmann and Walmart datasets, respectively. The comparative results between the existing regression models (GBR, TDNN, and LSTM) and the proposed Bi-GRU model are stated in Table 6 and Figure 4.

Table 6. Comparative results between the existing regression models and the proposed Bi-GRU model by means of NRMSE and ND.

Dataset    Model        NRMSE    ND
Walmart    GBR [38]     0.01     0.10
Walmart    TDNN [38]    0.02     0.14
Walmart    LSTM [38]    0.01     0.13
Walmart    Bi-GRU       0.01     0.03

On the other hand, Yang [39] analyzed the sales prediction performance of XGBoost, random forest, and Ordinary Least Squares (OLS) models on the Walmart dataset. According to the numerical investigation, the XGBoost model has a minimal MAE of 0.12 and MSE of 0.06, and a high R2 of 0.98 on the Walmart dataset. These numerical results are superior to those of the random forest and OLS models.
In addition, Niu [40] performed feature engineering with the XGBoost model for precise sales forecasting on the Walmart dataset. The presented XGBoost model has a minimal RMSSE of 0.65, which is superior to other comparative regression models, like ridge and logistic regression. Compared with these existing regression models, the presented Bi-GRU model has a minimal error value and a high R2 value on the Rossmann and Walmart datasets. The comparative results between the XGBoost model and the proposed Bi-GRU model are specified in Table 7.

Discussion
In this study, the integration of Bi-GRU with the hybrid feature engineering techniques achieved a higher forecasting performance on the Rossmann and Walmart datasets when compared with Ozyegen et al. [38], Yang [39], and Niu [40]. The results are superior to those of the existing regression models, like linear regression, LSTM, Bi-LSTM, and GRU. The proposed Bi-GRU model has minimal NRMSEs of 0.08 and 0.01, NDs of 0.07 and 0.03, RMSSEs of 0.08 and 0.08, MAEs of 0.05 and 0.07, and MSEs of 0.04 and 0.03, as well as high R2 values of 0.98 and 0.99, on the Rossmann and Walmart datasets, which are shown in Tables 2 and 3, respectively.
The practical implications of the findings in the context of E-commerce companies are as follows: (i) improved customer satisfaction through timely product delivery, and (ii) a long-term strategic plan for reliable future growth. Furthermore, precise retail sales forecasting using the Bi-GRU-APSO model improves operations by allowing an E-commerce business to allocate resources effectively for managing cash flow and planning for the future. Additionally, retail sales forecasting yields an accurate estimation of revenue and costs based on long- and short-term performance. The five major benefits of timely retail sales forecasting are that it (i) eliminates the chances of panic sales, (ii) leverages real-time data, (iii) offers simple financial planning, (iv) stabilizes inventory management, and (v) speeds up the process of product delivery.

Conclusions
A Bi-GRU model was implemented in this study with effective feature engineering techniques for precise retail sales forecasting. In the initial phase, the unreliable samples in the Rossmann and Walmart datasets were eliminated by removing outliers, interpolating missing data, normalization, and de-normalization. Next, the feature engineering was performed using the APSO algorithm, RFE, and MRMR to select the active features from the pre-processed datasets, which were finally given to the Bi-GRU model for precise retail sales forecasting. The efficacy of the proposed regression model was analyzed using six evaluation measures: NRMSE, ND, RMSSE, MAE, MSE, and R2. The proposed Bi-

Figure 1. Workflow of the proposed automated regression model.

Figure 2. Architecture of the APSO algorithm.

Figure 3. Architecture of the Bi-GRU model.

Figure 4. Graphical comparison between the proposed and existing regression models.

Table 1. Data statistics of the Rossmann and Walmart datasets.

Table 2. Numerical analysis of different regression models on the Rossmann dataset.

Table 3. Numerical analysis of different regression models on the Walmart dataset.

Table 4 displays the numerical analysis of different models on the Citadel POS dataset using NRMSE, ND, RMSSE, MAE, MSE, and R2. The proposed Bi-GRU model uses less memory compared to other regression models. Table 4 shows that Bi-GRU obtained a high R2 value of 0.96, an NRMSE of 0.09, an ND of 0.08, an RMSSE of 0.08, an MAE of 0.07, and an MSE of 0.06.

Table 4. Numerical analysis of different regression models on the Citadel POS dataset.

Table 5. Numerical analysis of different feature engineering techniques on the Rossmann and Walmart datasets.

Table 7. Comparative results between the existing regression model and the proposed Bi-GRU model by means of R2, MAE, MSE, and RMSSE.