Towards Online Training for an RL-Based Query Optimizer

Join query optimization aims to find the best join order for the tables in a query, which is critical for query processing performance. Recently, reinforcement learning models have been proposed to address the challenges of query processing and join order optimization. However, changes in the data distribution can render the trained reinforcement learning models obsolete, resulting in longer execution times. In this paper, we propose a new training strategy that extends existing reinforcement learning models and improves their adaptation when the data distribution changes. The experiments show that the proposed strategy significantly decreases the training time required by the models after the data distribution changes.


Introduction
The order in which to perform the join operation across tables is one of the most difficult aspects of query optimization and query plan development [1,2]. Even if the query's results are the same, the order in which the query's tables are combined can have a significant impact on query execution time. The optimizer relies on pre-calculated statistics such as data distribution and cardinality estimation to determine the cost of each join order, since the number of intermediate rows generated at runtime is unknown in advance. Furthermore, the number of possible join orders increases exponentially with the number of tables, which makes it infeasible for the query optimizer to compute the costs of all combinations in order to select the best join order during query execution [1]. As a result, most optimizers use heuristics to reduce the search space, such as examining the structure of the query tree [3]. Recently, reinforcement learning and its reward mechanism [4-11] have been utilized to learn prediction models for the best join orders between tables over a snapshot of the database. These approaches use reinforcement learning, where the agent receives continuous feedback on actual response time or estimated cardinality rather than relying on pre-calculated statistics, to maximize the reward, which results in improved query performance. A challenge arises when such models are used to generate join orders: the data distribution may change as a result of insertions into and deletions from the tables. A severe change in the data distribution turns the trained models into obsolete models, as the generated join orders would no longer be consistent with the new data distribution.
Example 1: Consider the following query on the Store-Ordering database presented in Figure 1. The Store-Ordering database consists of four entities: "STORE", "ORDER", "CASHIER", and "CUSTOMER", with cardinalities of 1,000, 1,000,000, 1,000, and 100,000 rows respectively. To retrieve the store and cashier details of orders from the Store-Ordering database, we can use the following simple query in PostgreSQL:

SELECT * FROM Store CROSS JOIN Order CROSS JOIN Cashier
WHERE Store.id = Cashier.storeid AND Order.cashierid = Cashier.id

Based on the given cardinalities, the trained RL model may determine that the optimal execution order is Store, followed by Cashier, and then Order. However, if there is a change in the data distribution, such as an increase in the cardinality of the Cashier table, the optimal join order may no longer be the same, and the new optimal order may become Store, Order, and then Cashier. This paper addresses the challenge of data distribution changes in the tables after the prediction models have been trained by introducing an online training strategy that adapts to changes in the data while benefiting from the obsolete models.
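To make the effect of a cardinality change concrete, the following minimal sketch (not the paper's cost model; the selectivity values and helper name are hypothetical) estimates the cost of a left-deep join order as the sum of its estimated intermediate result sizes, considering only the predicate with the immediately preceding table as a simplification. Re-running such an estimate with the new cardinalities is what can make a previously optimal order lose to another one.

```python
# Minimal sketch of a textbook cost estimate for a left-deep join order:
# cost = sum of estimated intermediate result sizes.  The selectivities and
# cardinalities below are hypothetical placeholders, not values from the paper.
def estimated_cost(order, cardinality, selectivity):
    rows = cardinality[order[0]]
    cost = 0
    for prev, nxt in zip(order, order[1:]):
        # estimated size of joining the running result with the next table
        rows = rows * cardinality[nxt] * selectivity.get(frozenset((prev, nxt)), 1.0)
        cost += rows
    return cost

cardinality = {"STORE": 1_000, "ORDER": 1_000_000, "CASHIER": 1_000}
selectivity = {frozenset(("STORE", "CASHIER")): 1e-3,   # Store.id = Cashier.storeid
               frozenset(("ORDER", "CASHIER")): 1e-3}   # Order.cashierid = Cashier.id

print(estimated_cost(["STORE", "CASHIER", "ORDER"], cardinality, selectivity))
# Increasing CASHIER's cardinality changes every estimate that involves it,
# so the ranking of candidate orders has to be recomputed.
cardinality["CASHIER"] = 100_000
print(estimated_cost(["STORE", "CASHIER", "ORDER"], cardinality, selectivity))
```

Under a learned optimizer the same re-evaluation has to happen implicitly through retraining, which is what the proposed architecture automates.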
The rest of the paper is organized as follows: the background for the concepts used in this paper is presented in Section 2. Then, in Section 3, previous work related to the proposed model is discussed. Section 4 details the proposed model. In Section 5, the results of the performance evaluation of the proposed models are presented. Finally, Section 6 concludes the paper.

Background

Join Ordering Problem
"Join" means to merge rows from two tables based on a common column.A query that joins N tables is conducted through N-1 joins, whereas join only works with two tables at a time.The query optimizer must make the following two critical decisions: 1.The selection of a join order.2. The choice of a join algorithm.
The optimal join order that the optimizer must select is our main concern because it greatly affects query execution time. The number of alternative join combinations that the optimizer needs to explore and analyze grows exponentially with the number of tables [1]. Furthermore, because the number of intermediate rows (results) is not known until run time, the optimizer is forced to rely on pre-calculated statistics and cardinality estimation to estimate the cost.
According to the commutative property of joins, the operation A ⋈ B is identical to B ⋈ A; this property determines which of the two tables is accessed first. According to the associative property, for tables A, B, and C, (A ⋈ B) ⋈ C is equivalent to A ⋈ (B ⋈ C), and combined with commutativity it is also equivalent to (B ⋈ C) ⋈ A. Each possible permutation can have a different cost and performance impact, for instance depending on the size of the intermediate results.
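As a small illustration of this growth (not taken from the paper), the snippet below enumerates the left-deep join orders that commutativity and associativity make available for the four tables of Example 1 and shows how the count, N!, explodes as tables are added; bushy plans make the space even larger.

```python
# Count the left-deep join orders (one permutation of the tables each);
# bushy plans would multiply this number further.
from itertools import permutations
from math import factorial

tables = ["STORE", "ORDER", "CASHIER", "CUSTOMER"]
left_deep = list(permutations(tables))
print(len(left_deep), "left-deep orders for", len(tables), "tables")   # 24

for n in (4, 8, 12, 16):
    print(f"{n} tables -> {factorial(n):,} left-deep join orders")
```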

Reinforcement learning
Reinforcement learning is an area of machine learning in which an agent learns to take actions in different situations in order to maximize its cumulative reward. Reinforcement learning algorithms learn by taking actions and receiving rewards or penalties in the process. While interacting with the environment, the agent's main purpose is to maximize its cumulative reward. The key elements of reinforcement learning are agents, environments, states, actions, and a reward value [12], as shown in Figure 2.
The environment receives the agent's action and the current state as input and outputs a reward and the subsequent state. The reward function maps the state-action pair of the environment to a (negative or positive) reward value, defining the goal of a reinforcement learning problem [12].
Reinforcement learning problems are closely connected to optimal control problems, and both can be represented using the framework of Markov Decision Processes (MDPs). A reinforcement learning task can be represented as an MDP when it satisfies the Markov property, which implies that the future state is determined exclusively by the current state and does not depend on the history [12,13]. A Markov Decision Process is described by the following 4-tuple [12,13]:

< S, A, P(s, a), R(s, a) >    (1)

1. S: the set of states that an agent encounters while interacting with its environment. The states are assumed to possess the Markov property.
2. A: the set of actions available to the agent.
3. P(s, a): the transition function, which gives the probability distribution over the next state resulting from taking action a in state s.
4. R(s, a): the reward function, which gives the reward of taking action a in state s.
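For the join ordering problem specifically, these components can be instantiated roughly as follows. The sketch below is an assumed encoding for illustration, not the exact state representation used by the models discussed later: the state is the partial join order built so far, an action appends one of the remaining tables, and the reward is only observed once a complete plan is executed.

```python
# Minimal sketch (assumed encoding) of join ordering cast as an MDP.
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    joined: tuple          # tables already placed in the join order
    remaining: frozenset   # tables still to be placed

def actions(state: State):
    return sorted(state.remaining)          # one action per remaining table

def step(state: State, table: str):
    next_state = State(state.joined + (table,),
                       state.remaining - {table})
    done = not next_state.remaining
    # The reward is only known for a complete plan, by executing it on the
    # DBMS; here it is left as a placeholder.
    reward = 0.0
    return next_state, reward, done

s0 = State((), frozenset({"STORE", "ORDER", "CASHIER"}))
print(actions(s0))   # ['CASHIER', 'ORDER', 'STORE']
```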
Neural networks can be used as function approximators in reinforcement learning when the state space or action space is large [14]. Deep reinforcement learning is the result of employing deep neural networks in reinforcement learning: the deep neural network is the agent that learns to link state-action pairs to rewards. Depending on the outcome, the network is encouraged or discouraged from taking the same action on this input in the future.
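A generic Q-network of this kind can be sketched as below; the layer sizes, input encoding, and use of PyTorch are illustrative assumptions, not the architecture of any of the systems cited in this paper. The network maps an encoded state to one Q-value per candidate action, i.e. per table that could be joined next.

```python
# Generic Q-network sketch (assumed layer sizes, not the architecture of [4]).
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),   # one Q-value per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q = QNetwork(state_dim=32, num_actions=10)
print(q(torch.zeros(1, 32)).shape)   # torch.Size([1, 10])
```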

Related Work

In [6], deep neural networks were utilized to learn state representations of queries in order to find the best plans. Instead of relying solely on basic statistics to estimate costs, that work investigated the idea of training a deep reinforcement learning network to predict query cardinalities.
The ReJOIN model [5] employs deep reinforcement learning approaches to solve the join order selection problem. During the learning phase, ReJOIN used the traditional cost model based on cardinality estimation rather than real execution time.
The SkinnerDB system proposed in [16] uses reinforcement learning for query optimization. The model learns the appropriate join order while running the query: execution is divided into slices, and different join orders are tried on slices of the data until the best join order is discovered and adopted for the remaining data slices. Query performance is measured using regret bounds as the reward mechanism, taking into account the difference between the actual execution time and the time of an optimal join order.
Neo (Neural Optimizer), described in [7], is a supervised learning model that guides a search algorithm over a huge and complex plan space. Neo assumes the existence of a sample workload, made up of a group of queries that are representative of the overall workload. Neo builds its query optimization model on top of existing optimizers and learns from incoming queries. Given a sample workload and the expert optimizer's best query plans, the learned model attempts to generalize to a model that can infer the query plan with the shortest execution time.
Deep reinforcement learning is applied to query optimization problems in "Towards a Hands-Free Query Optimizer via Deep Learning" [17]. The Fully Observed Optimizer (FOOP) presented in [18] is based on a reinforcement learning model in which the reward function is defined by the cost model of a conventional DBMS optimizer.
Another model that incorporates reinforcement learning is presented in [9]. It proposes a learning-based strategy for join ordering that bootstraps the reinforcement learning model with the plans produced by the DBMS optimizer before fine-tuning it using real execution times.
Recently, a Deep Reinforcement Learning Based Query Optimizer (RL QOptimizer) has been presented [4]. The proposed method utilizes reinforcement learning and real-time feedback from the database to determine the best join order in query plans.
All previous models, with the exception of SkinnerDB, share the limitation that the model is trained only once and its performance is evaluated without taking changes in the data distribution into account. Data distribution refers to how data is spread within a dataset, and it can change through insertions, deletions, and changes in cardinality (modifying the number of distinct values). The distribution of data plays a crucial role in determining query execution plans and can greatly impact query performance. Consequently, the goal of this paper is to develop a novel architecture that handles the problem of changing data distribution by providing a new training strategy that adjusts to such changes.

Proposed Architecture
One of the most difficult aspects of learning-based database management systems is that the data is always subject to significant changes, necessitating an adaptable model that responds to these changes. The proposed architecture (Figure 3) provides a practical solution by using an online approach: it adapts to data changes by periodically training reinforcement learning models in the background on a previous database snapshot. The proposed architecture uses the DQN model presented in [4], a deep Q-learning model that uses a neural network to approximate the Q-value function and is therefore suitable for large state and action spaces.
The DQN model uses a neural network to estimate the Q-value function and help the agent select the best action. The agent takes actions to maximize the total reward, which is calculated by adding the immediate reward to the highest Q-value of the next state, using the following formula [19,20]:

Q(s, a) = r(s, a) + γ · max_{a'} Q(s', a')    (2)

The first part of the Q-value calculation, r(s, a), represents the immediate reward for taking a specific action a in a given state s. The second part, γ · max_{a'} Q(s', a'), represents the estimate of the optimal future value, adjusted by the discount factor γ, based on the highest potential Q-value in the next state s'. This second part is known as the discounted estimate of the optimal future value.
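The following minimal snippet renders formula (2) literally; the reward and next-state Q-values are hypothetical placeholders, and in the actual model the Q-values would come from the neural network rather than a list.

```python
# Training target for Q(s, a) according to formula (2): immediate reward plus
# the discounted maximum Q-value over the actions of the next state.
def q_target(reward, q_next_values, gamma=0.99):
    return reward + gamma * max(q_next_values)

# Example with placeholder numbers: the reward is the negative execution time
# (the architecture multiplies the measured execution time by -1).
print(q_target(reward=-1.7, q_next_values=[-3.2, -2.5, -4.1]))   # -4.175
```

During training, the network weights are adjusted so that Q(s, a) moves toward this target.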
According to the architecture suggested in Figure 3, the model comprises four main phases: the preparation of the data comes first, followed by the preparation of the model, the retraining phase, and the replacement phase.
During the data preparation phase, the system receives the training data from the log file, which represents all potential query execution plans, and syncs the most recent database snapshot. In the model preparation phase, the system then prepares for retraining by loading a copy of the previously trained model as a pre-trained model. Following that, the model starts training, as depicted in Figure 4, by examining each potential execution plan in the log file; the agent communicates with the DBMS to obtain the actual execution time for the plan, which in our model represents the reward, multiplied by -1 so that maximizing the reward minimizes the execution time. In the final phase, a new model trained with the configured background training technique replaces the model currently in production by updating the neural network weights so that the affected query strategies are adapted. The replacement can happen either on a regular basis or when the queries' average response time decreases by a certain threshold. The proposed architecture thus retrains the pre-trained model by optimizing its weights to adapt to changes in the data distribution: instead of starting the training from scratch, the existing knowledge about the database captured in the pre-trained model is reused.
In addition, the proposed architecture uses the log file to initialize the retraining process. Consequently, the pre-trained model is updated with respect to the workload queries actually requested by the database users. This should keep the model in a near-optimal state, resulting in efficient join order plans and, consequently, shorter response times for the users' most frequent queries.
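Putting the retraining and replacement phases together, a background loop along these lines could drive the process; all helper callables below (plan_for, execute, update) are hypothetical stand-ins for the log-file reader, the DBMS interaction, and the DQN weight update, and the threshold value is illustrative.

```python
# Sketch of the background retraining loop: replay the workload from the log,
# reward each plan with its negative execution time, and decide whether the
# retrained model should replace the production one.
def background_retrain(workload_queries, plan_for, execute, update,
                       old_avg_time, replace_threshold=0.1):
    """plan_for(query) asks the model being retrained for a join order,
    execute(plan) returns the actual execution time reported by the DBMS,
    and update(plan, reward) applies one DQN weight update."""
    for query in workload_queries:
        plan = plan_for(query)
        exec_time = execute(plan)      # feedback from the database
        update(plan, -exec_time)       # reward = execution time * -1
    new_avg = sum(execute(plan_for(q)) for q in workload_queries) / len(workload_queries)
    # Replace the production model when average response time improves enough.
    return (old_avg_time - new_avg) / old_avg_time >= replace_threshold

# Toy usage with stand-in callables:
queries = ["q1", "q2"]
swap = background_retrain(queries,
                          plan_for=lambda q: q + "-plan",
                          execute=lambda p: 1.0,
                          update=lambda p, r: None,
                          old_avg_time=2.0)
print(swap)   # True: the (fake) new average of 1.0 is a 50% improvement
```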

Performance Evaluation
The evaluation aims to compare the execution plans generated by the proposed online model with those generated by the primary model [4], used as a baseline, in order to demonstrate the effectiveness of the proposed model in terms of execution time. To do this, both models are used to generate plans for the same queries after the data distribution has been changed within the evaluation database. Then, the execution times of the plans generated by both models are collected and compared. The models are tested on a real database that has been used as a benchmark dataset for the join ordering problem [21].

Experiments Setup
Experiments are carried out on a laptop running Ubuntu 18.04.3 LTS and equipped with an 8-core Intel Core i7-8550U processor and 8 GB of RAM. The available memory per operator (work_mem) was set to 512 MB, and the buffer size was set to 1 GB.
The IMDb (Internet Movie Database) dataset [22] was used throughout the performance evaluation to evaluate the proposed architecture. It is a non-commercial, real-world online database that contains a vast quantity of information about films, television shows, and home videos. During the training process, we focused on the tables used in [4], with an average of more than 6 million records per table.
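Given this setup, execution times for the plans produced by each model can be collected along the following lines; the connection string, timeout value, and use of psycopg2 are assumptions for illustration, not details reported in the paper. Setting join_collapse_limit to 1 makes PostgreSQL keep the join order exactly as written, so each model's plan can be timed as-is.

```python
# Sketch of collecting wall-clock execution times for ordered join queries.
import time
import psycopg2

conn = psycopg2.connect("dbname=imdb")          # placeholder connection string
cur = conn.cursor()
cur.execute("SET work_mem = '512MB'")           # matches the setup above
cur.execute("SET join_collapse_limit = 1")      # do not reorder the given joins
cur.execute("SET statement_timeout = '300s'")   # treat slower plans as timeouts

def run_plan(sql: str) -> float:
    """Return the wall-clock execution time of one ordered join query."""
    start = time.perf_counter()
    cur.execute(sql)
    cur.fetchall()
    return time.perf_counter() - start

# Usage: time the same query under the join order produced by each model
# and compare the two measurements.
```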

Experimental Results
Two experiments were conducted to evaluate the performance of the proposed approach. The first experiment focused on increasing the cardinality of the "AKA TITLE" table fivefold, while the second experiment involved modifications to the data distribution in multiple tables.
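The cardinality changes used in both experiments could be produced along the following lines; this is a sketch only, with placeholder connection details and column lists, and it approximates the paper's random insertion by duplicating existing rows (any serial primary-key column must be excluded so new keys are generated).

```python
# Sketch of growing and shrinking table cardinalities in PostgreSQL.
import psycopg2

conn = psycopg2.connect("dbname=imdb")                 # placeholder
cur = conn.cursor()

def grow(table: str, columns: str, extra_copies: int) -> None:
    """Append `extra_copies` copies of the table's original rows."""
    cur.execute(f"CREATE TEMP TABLE snapshot AS SELECT {columns} FROM {table}")
    for _ in range(extra_copies):
        cur.execute(f"INSERT INTO {table} ({columns}) SELECT {columns} FROM snapshot")
    cur.execute("DROP TABLE snapshot")

def shrink(table: str, keep_fraction: float) -> None:
    """Randomly delete rows until roughly `keep_fraction` of them remain."""
    cur.execute(f"DELETE FROM {table} WHERE random() > %s", (keep_fraction,))

# e.g. grow("aka_title", "col1, col2", 4); shrink("movie_link", 0.25); conn.commit()
```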
In the first experiment, the cardinality of the "AKA TITLE" table was increased fivefold, and the online background training was then applied to the IMDb database. With the previously trained DQN model still in place, the average response time increased and queries began to time out. Consequently, the model was retrained in the background, and it was found that the retraining process required only 10% of the original model training time. This may be related to the fact that many queries do not use the "AKA TITLE" table and therefore did not have to change their execution strategy. Focusing on queries that use "AKA TITLE", as shown in Figure 5, the model trained in the background improved the results, and as a consequence "Query 24 - Q24", which could not complete within the maximum time on the original model, was successfully executed with the newly trained one.
In the second experiment, changes were applied to the data distribution of the IMDb database by using random insertion to increase the cardinalities of the "COMPLETE CAST", "MOVIE COMPANIES", and "AKA TITLE" tables by factors of 4, 2, and 3, respectively. In addition, deletion was applied to "MOVIE LINK" to reduce its cardinality to 25%. As shown in Figure 6, the model trained in the background improved the results, and as a consequence "Query 57 - Q57", which could not complete within the maximum time on the original model, was successfully executed with the newly trained one. As in the first experiment, the retraining process required only a small fraction of the original model training time, approximately 16% in this particular experiment. Since the model was not trained from scratch, the retraining process was able to fine-tune the knowledge acquired during the initial training.
The results of the performance evaluation demonstrate that the newly trained model performs better than the original model in determining appropriate join orders for various queries, including challenging cases such as Q24 and Q57, which the original model could not even execute within the maximum execution time. Query 57 illustrates the challenge faced by the previous model: it placed MOVIE COMPANIES first and did not account for that table's doubled cardinality. The original join order was MOVIE COMPANIES -> TITLE -> MOVIE INFO -> AKA TITLE -> KIND TYPE, whereas the enhanced model changed the order to TITLE -> AKA TITLE -> KIND TYPE -> MOVIE COMPANIES -> MOVIE INFO, leading to improved performance.

CONCLUSION
Determining the best join order is a difficult task due to the exponential growth of alternative join combinations as the number of tables increases; consequently, it becomes infeasible for the optimizer to consider all possible combinations. As a result, most optimizers use heuristics to narrow the search space, such as inspecting the query tree's structure. Reinforcement learning models [4] manage to recommend a query execution plan focusing on join ordering difficulties, but cannot adapt to data distribution changes such as insertions, updates, and deletions. This paper proposes an architecture that adds an online training strategy to such reinforcement learning models, allowing them to adapt to data distribution changes. The results of the performance evaluation show that the new technique outperforms RL QOptimizer in determining appropriate join orders for queries involving tables whose cardinality has been modified. Moreover, the retraining process needs only a small fraction of the original model training time, thanks to the ability of the model to reuse previously acquired knowledge: as shown in the experiments, retraining required only 10% of the original training time in the first experiment and approximately 16% in the second. This demonstrates the potential of retraining that treats the originally trained model as a pre-training step, which emerges as a useful and effective approach for adapting the DQN model to data distribution changes.

Fig. 2 Simplified Architecture of Reinforcement Learning.

Fig. 3 Architecture Overview -the architecture consists of four main phases: data preparation, model preparation, retraining, and replacement.

Fig. 4 Background Training Overview: Analyzing execution plans, rewarding based on execution times, and updating neural weights for model replacement.

Fig. 5 Comparison Between the Results of the RL Model and the Online RL Model on IMDB Database.

Fig. 6 Comparison Between the Results of The RL Model and The Online RL Model on IMDB Database by changing several tables.