Reinforcement Learning-Based Genetic Algorithm in Optimizing Multidimensional Data Discretization Scheme

State Key Laboratory Marine Resource Utilization in South China Sea, Haikou 570228, China; College of Information Science and Technology, Hainan University, Haikou 570228, China; Dept. of Obstetrics and Gynecology, The First Affiliated Hospital of Hainan Medical University, Haikou 570102, China; Big Data Lab, Department of Computer Science, Norwegian University of Science and Technology, 2815 Gjøvik, Norway


Introduction
The rapid development of the internet of things has produced massive amounts of large-scale data [1][2][3][4][5]. These data come mainly from various types of sensors and are characterized by high dimensionality, incompleteness, randomness, fuzziness, and strong interference, among other properties [6]. Despite the growing body of artificial intelligence research, how to extract and analyze valuable information from these massive amounts of complex sensor data is still a huge challenge in the field of artificial intelligence [7][8][9]. As one of the most influential data preprocessing technologies, feature discretization can reduce the complexity of data by transforming the continuous features in massive data to discrete features and obtain shorter, more accurate, and more comprehensible rules, so as to improve the efficiency of data mining and machine learning [10][11][12][13][14][15][16]. Therefore, feature discretization plays a key role in the application of artificial intelligence technology to these data.
With the development of artificial intelligence, more and more scholars are studying feature discretization [17][18][19]. Obtaining the optimal discretization scheme has been proved to be an NP-complete problem [20]. The choice of a discretization method for a dataset constrains the performance and accuracy of the subsequent learning task. Discretization technology can be classified as either supervised or unsupervised, according to whether the data contain category information [21].
EqualWidth and EqualFrequency are two commonly used unsupervised discretization methods [22]. They divide the range of an attribute into intervals of a given length or a given frequency, respectively. Although they are simple and convenient, they lead to uneven data distribution and loss of some important information. Supervised discretization has the advantage of making full use of the class label and target attribute information, and it facilitates finding the appropriate breakpoint position compared to unsupervised discretization.
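The two unsupervised methods can be illustrated with a minimal sketch. The function names (`equal_width_cuts`, `equal_frequency_cuts`, `assign_bins`) and the sample data are assumptions made for illustration and are not taken from the cited implementations.

```python
from bisect import bisect_right

def equal_width_cuts(values, k):
    """Return k-1 interior cut points that split [min, max] into k equal-length bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_cuts(values, k):
    """Return k-1 interior cut points so that each bin holds about len(values)/k samples."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

def assign_bins(values, cuts):
    """Map each value to the index of the bin it falls in (a value equal to a cut goes right)."""
    return [bisect_right(cuts, v) for v in values]

# Hypothetical attribute with one outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
ew = equal_width_cuts(data, 3)
ef = equal_frequency_cuts(data, 3)
print(assign_bins(data, ew))  # equal width: the outlier pushes almost all samples into one bin
print(assign_bins(data, ef))  # equal frequency: three samples per bin
```

With the outlier present, equal width leaves the middle bin empty, which is the uneven distribution mentioned above; equal frequency balances the counts but ignores class labels entirely.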
1R is the simplest supervised discretization algorithm [23]. It uses a greedy strategy to divide the attribute value range into intervals, each corresponding only to a decision class. However, because the partition standard of the interval is too simple, lacks flexibility, and does not consider the correlation between features, it cannot guarantee that the compatibility of an information system after discretization will not be destroyed. The information entropy-based method is based on the minimum description length principle (MDLP), and it uses the measure of information gain to determine the breakpoint [24]. Although it can largely ensure the consistency of samples in the interval, it is difficult to filter the noise by setting the threshold value of the partition.
ChiMerge [25], Chi2 [26], and Extended Chi2 [27] discriminate and merge adjacent intervals by measuring the degree of association between condition attributes and target attributes. They have the advantage that the structures of adjacent intervals are distinct, but they are sensitive to parameters.
CADD [28], CAIM [29], and CACC [30] use statistics to quantify class attribute correlation, which they can maximize after discretization, and they try to obtain the minimum number of discrete intervals. However, they only achieve the best discretization for a single attribute interval, lack the description of all of the data, do not consider the consistency of the data either before or after discretization, and will inevitably lose important information from the original data. These mainstream methods all rely on specific division criteria to discretize continuous features. However, in the process of multidimensional data discretization, the distribution of target attribute values is usually difficult to know, and the features have complex correlations. In addition, relatively fixed division criteria cannot provide a comprehensive measure of discrete intervals. Therefore, the discretization schemes obtained by most of the algorithms are not optimal in specific application scenarios, or they may even fail to meet the accuracy requirements of the system.
A particle swarm optimization (PSO) algorithm was applied to feature selection based on discretization, which can generate more powerful and compact representations in high-dimensional datasets, and thus achieve better classification performance [31]. Ant colony optimization (ACO) [32] is used to solve the problem of discretization of continuous features to obtain more concise decision rules and higher prediction accuracy. RS-GA is a mature discretization method, which uses the individual fitness function based on rough sets to evaluate the uncertainty of an information system in a genetic algorithm and searches for the optimal discretization scheme through individual evolution [33]. Although these swarm intelligence algorithms can achieve better results, it is difficult to formulate appropriate strategies without prior knowledge, which makes the search in multidimensional space inefficient, consumes computing resources, and makes it easy to fall into local optima.
Aiming at these problems, this paper proposes a reinforcement learning-based genetic algorithm (RLGA) to optimize the discretization scheme of multidimensional data. First, we binary code the attribute values of the multidimensional data and initialize the population. The binary code method can build an efficient mathematical model suitable for the problem of feature discretization. Second, it is difficult to form a proper strategy without prior knowledge as guidance, which causes the search to fall easily into local optima. We use rough sets [34] to construct individual fitness functions and design a control function to dynamically adjust the diversity of the population.
Then, we introduce a reinforcement learning mechanism [35] to crossover and mutation to determine the crossover fragments and mutation points of the discretization scheme to be optimized. We compare our method to state-of-the-art discretization methods on GF-2 and Landsat 8 images. Experimental results show that the proposed method can reduce the number of intervals and simplify the multidimensional dataset without decreasing the data consistency and classification accuracy of a discretization scheme. The remainder of this paper is arranged as follows. Section 2 describes basic concepts and reviews some related work. Section 3 explains the algorithm flow of the proposed work. Section 4 introduces the experimental environment and datasets. We analyze and discuss the experimental results in Section 5. Section 6 summarizes this paper and discusses future research.

Background and Related Work
We introduce the concept of feature discretization and present simple descriptions of rough sets, genetic algorithms, and reinforcement learning mechanisms. We also analyze problems that occur in the discretization of multidimensional data in the optimization process.

Definitions of Feature Discretization.
Feature discretization is the process of dividing a continuous attribute value (also called a continuous feature value) into a finite number of intervals according to a rule, and associating these intervals with a set of discrete values [10]. Considering the problem of $m$-dimensional data classification, the discretization algorithm divides the attribute values on the $i$th dimension into $n_i$ discrete, nonintersecting intervals:
$$D_i = \{[d_0, d_1], (d_1, d_2], \ldots, (d_{n_i - 1}, d_{n_i}]\}, \tag{1}$$
where $d_0$ and $d_{n_i}$ are the minimum and maximum attribute values. All of the values are arranged in ascending order in $D_i$, which is called a discretization scheme on the $i$th dimension. $D = \{D_1, D_2, \ldots, D_i, \ldots, D_m\}$ represents the whole discretization scheme of $m$-dimensional data. Obviously, the search space of $m$-dimensional data feature discretization is formed by all of the candidate breakpoints of each dimension, which are the distinct attribute values of each dimension in the training set.
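As a minimal sketch of these definitions, the following code collects the candidate breakpoints of each dimension and applies a chosen scheme to the data. The names `candidate_breakpoints` and `discretize`, the toy samples, and the choice to treat intervals as left-open and right-closed are assumptions made for illustration.

```python
from bisect import bisect_left

def candidate_breakpoints(column):
    """Candidate breakpoints of one dimension: its distinct values in ascending order."""
    return sorted(set(column))

def discretize(column, cuts):
    """Map each value to an interval index, treating intervals as (d_{j-1}, d_j]."""
    return [bisect_left(cuts, v) for v in column]

# Hypothetical 2-dimensional training samples.
samples = [[0.2, 5.0], [0.7, 3.0], [0.2, 9.0], [1.5, 3.0]]
columns = list(zip(*samples))
candidates = [candidate_breakpoints(c) for c in columns]

# A discretization scheme: the interior breakpoints kept on each dimension.
scheme = [[0.7], [3.0, 5.0]]
coded = [discretize(c, cuts) for c, cuts in zip(columns, scheme)]
print(candidates)
print(coded)
```

Using `bisect_left` places a value that equals a breakpoint in the interval to its left, which matches the right-closed convention; a different convention would only change which side ties fall on.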

Genetic Algorithm Using Binary Coding.
As a global optimization probability evolution algorithm, a genetic algorithm [36] has inherent implicit parallelism and a strong global search ability. It achieves good performance on many optimization problems [37]. We use binary encoding to encode the candidate breakpoints; the values 1 and 0 indicate that a breakpoint is selected or discarded, respectively. With $n_i$ candidate breakpoints in the $i$th dimension of the $m$-dimensional data, the chromosome structure in the genetic algorithm is shown in Figure 1, where colors represent different features in the $m$-dimensional data. The length of each chromosome is $\sum_{i=1}^{m} n_i$. The set of selected candidate breakpoints is a discretization scheme, according to which attribute values are classified into discrete intervals formed by the candidate breakpoints in the set.

Fitness Function Based on Rough Set.
A rough set [38] is based on the classification mechanism. It interprets the classification as indiscernibility relations in the space of features, and these relations form the division of the space. Consider a decision table $S = (U, R, V, f)$, where $U$ is a finite set of objects, i.e., a domain, $R$ is an attribute set including condition attribute set $C$ and decision attribute set $D$, $V$ is the set of attribute values, and $f: U \times R \to V$ is the information function. For each attribute subset $A \subseteq R$, the indiscernibility relation $\mathrm{IND}(A)$ and the equivalence classes of attribute subset $A$ in domain $U$ are defined as
$$\mathrm{IND}(A) = \{(x, y) \in U \times U \mid \forall a \in A,\ f(x, a) = f(y, a)\}, \qquad U/\mathrm{IND}(A) = \{[x]_A \mid x \in U\}. \tag{2}$$
According to the abovementioned decision table $S$, for each subset $X \subseteq U$ and attribute subset $A$, the lower and upper approximation sets of $X$ are defined as
$$\underline{A}(X) = \{x \in U \mid [x]_A \subseteq X\}, \tag{3}$$
$$\overline{A}(X) = \{x \in U \mid [x]_A \cap X \neq \emptyset\}. \tag{4}$$
Since the principle of selecting the optimal breakpoint set is to minimize the number of breakpoints without changing the indiscernibility relations of the decision table, the fitness function, equation (5), is determined by the number of breakpoints and the indiscernibility relations. In it, $N_I$ is the number of breakpoints in the initial breakpoint set, $N_D$ is the number of breakpoints obtained after chromosome decoding, and $\Delta N = N_I - N_D$ is the change in the number of breakpoints. $R_D$ takes the value 0 or 1, which, respectively, mean that the indiscernibility of the decision table changes or does not change after discretization. The judgment process is shown in Algorithm 1.
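The decoding of a binary chromosome and the check behind $R_D$ can be sketched as follows. This is an illustrative reading of the text, not the authors' Algorithm 1: the helper names (`decode`, `approximations`, `discretize_rows`, `r_d`) and the toy decision table are assumptions, and the fitness formula itself is deliberately not implemented.

```python
from bisect import bisect_left
from collections import defaultdict

def decode(chromosome, candidates):
    """Split a flat 0/1 chromosome into per-feature lists of selected breakpoints."""
    scheme, pos = [], 0
    for cand in candidates:
        genes = chromosome[pos:pos + len(cand)]
        scheme.append([c for c, g in zip(cand, genes) if g == 1])
        pos += len(cand)
    return scheme

def approximations(rows, labels):
    """Lower and upper approximation (sets of row indices) of every decision class."""
    blocks = defaultdict(set)          # equivalence classes of the condition attributes
    for idx, row in enumerate(rows):
        blocks[tuple(row)].add(idx)
    lower, upper = defaultdict(set), defaultdict(set)
    for members in blocks.values():
        classes = {labels[i] for i in members}
        for c in classes:
            upper[c] |= members        # block intersects class c
            if len(classes) == 1:
                lower[c] |= members    # block lies entirely inside class c
    return {c: (frozenset(lower[c]), frozenset(upper[c])) for c in set(labels)}

def discretize_rows(rows, scheme):
    """Replace every value by the index of its interval (d_{j-1}, d_j]."""
    return [tuple(bisect_left(cuts, v) for v, cuts in zip(row, scheme)) for row in rows]

def r_d(rows, labels, scheme):
    """1 if all approximation sets are unchanged by discretization, else 0."""
    before = approximations(rows, labels)
    after = approximations(discretize_rows(rows, scheme), labels)
    return 1 if before == after else 0

# Hypothetical decision table with two condition attributes and two classes.
rows = [(0.2, 5.0), (0.7, 3.0), (1.5, 3.0)]
labels = ["a", "a", "b"]
candidates = [[0.2, 0.7, 1.5], [3.0, 5.0]]
chromosome = [0, 1, 0, 0, 0]
scheme = decode(chromosome, candidates)
delta_n = sum(len(c) for c in candidates) - sum(chromosome)
print(scheme, delta_n, r_d(rows, labels, scheme))
```

In this toy case a single breakpoint keeps the two classes separable, whereas discarding every breakpoint merges all rows into one block and changes the approximation sets.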

Reinforcement Learning.
Reinforcement learning is a goal-driven, highly adaptive machine learning technique in the field of artificial intelligence [39], in which there are two basic elements: state and action. Performing an action in a certain state is a strategy. The learner must constantly explore to generate an optimal strategy. Different from supervised and unsupervised learning, it regards learning as a process of interaction between agents and the environment through exploration and evaluation. The operation mechanism is shown in Figure 2. The agent selects an action to be applied to the environment by sensing the current state of the environment. After the environment accepts the action, the state changes, and a reward is given to the agent. According to the new state of the environment, the agent continues to select the next action, and this is repeated until it reaches the terminal state. The goal of reinforcement learning is to maximize the accumulated rewards by adjusting strategies. Q-learning [40] is one of the most representative model-free reinforcement learning techniques. In the current state, the agent selects the next action according to the corresponding Q value of each action using the $\varepsilon$-greedy strategy and updates Q at each step in the learning process according to equation (6), where $s$ and $a$ are, respectively, the current state and action, $s'$ and $a'$ are the next state and action, $A$ is the action set, and $0 \le \gamma \le 1$ is the discount factor. Q-learning has broad application prospects in solving complex control and decision-making problems [41,42]. We use Q-learning to determine the cross fragments and mutation points of the discretization scheme to be optimized.
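For orientation, the following is a minimal sketch of tabular Q-learning in its standard textbook form, with $\varepsilon$-greedy action selection. The learning rate `alpha`, the default parameter values, and the state and action labels are assumptions; the paper's own update rule may differ in detail.

```python
import random

def epsilon_greedy(q_row, epsilon, rng):
    """With probability epsilon pick a random action, otherwise a highest-valued one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_row))
    best = max(q_row.values())
    return rng.choice([a for a, q in q_row.items() if q == best])

def q_update(q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """Standard update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = reward + gamma * max(q[s_next].values())
    q[s][a] += alpha * (target - q[s][a])
    return q[s][a]

# Hypothetical Q-table with two states and three actions, initialized to 0.
q = {s: {"G": 0.0, "H": 0.0, "I": 0.0} for s in ("s0", "s1")}
q_update(q, "s0", "H", reward=1.0, s_next="s1")
choice = epsilon_greedy(q["s0"], epsilon=0.0, rng=random.Random(0))
print(q["s0"], choice)
```

With `epsilon=0.0` the selection is purely greedy, so the action that was just rewarded is chosen; a positive `epsilon` keeps some exploration.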

Main Challenges.
The complexity of multidimensional data feature discretization increases sharply with the length of the attribute value interval and the association between attributes [36]. When using the genetic algorithm [43] to optimize the discretization scheme of multidimensional data, the main challenges are as follows.
(1) Improper control of population diversity causes premature convergence.
(2) Because multidimensional data contain many features, cross fragments and mutation points tend to focus on features with larger value intervals, which decreases the opportunity for breakpoints on other features to evolve.
(3) Due to the complex correlations between features, the crossover and mutation operations are relatively blind without prior knowledge as guidance, making it highly likely that some high-quality fragments on features are destroyed in the next generation of evolutionary operations.
Mathematical Problems in Engineering

Reinforcement Learning-Based Genetic Algorithm
A genetic algorithm is a general method to search for the optimal solution in the field of artificial intelligence. We add a control function to the fitness function based on rough sets to dynamically adjust the diversity of the population. In addition, according to the characteristics of multidimensional data, we introduce a reinforcement learning mechanism to the crossover and mutation operation to determine the cross fragments and mutation points of the discretization scheme to be optimized, which greatly improves the accuracy and speed of convergence. Below, we discuss the flow of the algorithm.

Evolution of Discretization Scheme to Be Optimized.
We perform evolutionary operations on the discretization scheme to be optimized and the initial population, as shown in Figure 3. The global variable preserves the best individual of the population, while the local variable preserves the historical best individual of the discretization scheme to be optimized. In each iteration, the fitness of the current population is calculated, the optimal individual is obtained, and the global variable is updated. Then, the discretization scheme to be optimized and the optimal individual of the population carry out the cross and mutation operations based on reinforcement learning, and the local variable is updated. If the termination condition is not reached, then the population will continue to carry out ordinary evolutionary operations. Otherwise, by comparing the global variable and local variable, the individual with the largest fitness will be output.

Selection Operator Based on the Control Function.
The selection operation is based on the evaluation of individual fitness in the population. Individuals with higher fitness are generally more likely to be selected. Roulette [44] is a simple, efficient, probability-based method that is often used to select individuals. If the population size is $n$ and the fitness value of individual $i$ is $f_i$, then the probability that individual $i$ is selected is
$$P_i = \frac{f_i}{\sum_{j=1}^{n} f_j}. \tag{7}$$
However, individuals with large fitness are more likely to be selected, which may result in the destruction of population diversity. Since $R_D$ takes the value 0 or 1, individuals with fitness value 0 cannot be selected in the evolution of the population. Although these individuals are not feasible solutions at the initial stage, they may be close to the optimal feasible solution, and may eventually evolve into it. Given the randomness and evenness of the initial population, one might expect feasible solutions with $R_D = 1$ and potential feasible solutions with $R_D = 0$ to be roughly equal in number. In practice, even with a small population, because of the complex correlation among the features of multidimensional data, there are far more potential feasible solutions with $R_D = 0$ than feasible solutions with $R_D = 1$. Therefore, when roulette is used for individual selection, the potential feasible solutions that are the majority of the population will be eliminated, thus destroying the diversity of the population.

Input: Discretization scheme, original decision table
…
for each category i do
    Compute the lower approximation set $\underline{C}(d_i)$ of the original decision table before discretization using equation (3);
    Compute the upper approximation set $\overline{C}(d_i)$ of the original decision table before discretization using equation (4);
end
Discretize the original decision table by the discretization scheme;
for each category i do
    Compute the lower approximation set $\underline{C}(d_i)'$ of the original decision table after discretization using equation (3);
    Compute the upper approximation set $\overline{C}(d_i)'$ of the original decision table after discretization using equation (4);
…
ALGORITHM 1
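Roulette selection as described in this subsection can be sketched in a few lines. The function names and the toy fitness values are assumptions made for illustration; the point of the example is that an individual whose fitness is 0 is never drawn.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def selection_probabilities(fitness):
    """Probability of each individual: its fitness divided by the total fitness."""
    total = sum(fitness)
    return [f / total for f in fitness]

def roulette_select(fitness, rng):
    """Draw one index with probability proportional to fitness."""
    cumulative = list(accumulate(fitness))
    r = rng.random() * cumulative[-1]
    # Guard against a floating-point draw that lands exactly on the total.
    return min(bisect_right(cumulative, r), len(fitness) - 1)

# Hypothetical population: the first individual has R_D = 0, hence fitness 0.
fitness = [0.0, 3.0, 1.0]
probs = selection_probabilities(fitness)
rng = random.Random(0)
picks = [roulette_select(fitness, rng) for _ in range(1000)]
print(probs, sorted(set(picks)))
```

This is the behavior that motivates the control function: without it, every potential feasible solution with fitness 0 is excluded from reproduction regardless of how close it is to a feasible one.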
According to the abovementioned analysis, to ensure the diversity of the population in the early stage and accelerate convergence of individuals to the optimal solution in the late stage of evolution, we expand the fitness function by adding the control function on the original basis, and we determine the control factors according to the proportion of feasible solutions in the population in the current stage of evolution.
The fitness function of chromosome $x$ is thus expanded to equation (8), in which a control function is added to the original fitness. The control function consists of control item $\varphi(x)$ and control factor $\sigma(p)$, where $p$ is the proportion of the population consisting of potential feasible solutions with $R_D = 0$, and $\sigma(p)$ is a function with $p$ as the independent variable. In the expressions for $\varphi(x)$ and $\sigma(p)$, $N_I$ is the number of breakpoints in the initial breakpoint set, $N_D^x$ is the number of breakpoints obtained after decoding chromosome $x$, $R_D^x$ indicates whether the indiscernibility relation of the decision table changes after discretization by chromosome $x$, $k$ is the number of classes, $d_i$ is the $i$th class, $\underline{C}(d_i)$ and $\overline{C}(d_i)$ are, respectively, the lower and upper approximation sets of $d_i$ before discretization, $\underline{C}(d_i)'$ and $\overline{C}(d_i)'$ are, respectively, the lower and upper approximation sets of $d_i$ after discretization, and setcmp(setA, setB) is a user-defined set comparison function whose value is 1 when setA and setB are equal and 0 otherwise. Since each class corresponds to a lower and an upper approximation set, the value range of $\mu$ is $0 \le \mu \le 1$. The following points are made regarding the extended fitness function.
(1) We want to prevent the unreasonable situation in which, when the number of breakpoints of the feasible solution with $R_D = 1$ is equal to that of the potential feasible solution with $R_D = 0$, the fitness value of the former is less than or equal to that of the latter. Hence we require that exponent $z$ of the control item satisfies $z > 1$. In this paper, $z = 2$.
(2) The value range of $\mu$ actually consists of $2k + 1$ discrete points from 0 to 1, spaced $1/(2k)$ apart. When $\mu = 0$, the indiscernibility of the decision table is destroyed after discretization. If $\mu = 0$, then $\mu^{1-p} = 0$, and the significance of adding a control function for potential feasible solutions with $R_D = 0$ is lost, since the control function should only take a smaller value than when $\mu$ takes its minimum positive value $\mu = 1/(2k)$, not vanish. Consequently, we change the expression $\mu^{1-p}$ to $(1/(2k \times \log_2 N_I))^{1-p}$. So, when $\mu = 0$, the value of the control function is not 0, and it is smaller than its value with $\mu = 1/(2k)$. Moreover, because the chromosome is binary coded, taking 2 as the base and $N_I$ as the argument of the logarithm ensures that the difference between the two control function values corresponding to $\mu = 0$ and $\mu = 1/(2k)$ is not too large, and this controls the selection of individuals relatively reasonably.
(3) When $\mu = 1$, $R_D = 1$ and $\sigma(p) = 0$, indicating that this individual is a feasible solution, and the value of the control item is 0. When $0 \le \mu < 1$, $\sigma(p)$ is an increasing function with $p$ as the independent variable. Consider that in the early stage of evolution, there are few or no feasible solutions in the population, i.e., $p$ is large. At this time, $\sigma(p)$ should be relatively large, so the potential feasible solution can be selected with a large probability in roulette, and the search is gradually guided to the area of feasible solutions. As evolution progresses, more and more feasible solutions appear in the population, i.e., $p$ becomes smaller and smaller. Accordingly, $\sigma(p)$ should become smaller and smaller, accelerating the movement from feasible to optimal solutions.
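The quantity $\mu$ can be illustrated under one explicit assumption: that it is the fraction of the $2k$ approximation sets (one lower and one upper per class) that are unchanged by discretization. This reading is inferred from the statement that $\mu$ takes $2k + 1$ values spaced $1/(2k)$ apart and is not confirmed by a formula in the text; the function `mu` and the toy sets below are hypothetical.

```python
def setcmp(set_a, set_b):
    """User-defined set comparison: 1 if the two sets are equal, 0 otherwise."""
    return 1 if set_a == set_b else 0

def mu(before, after):
    """Assumed form: share of lower/upper approximation sets left unchanged.

    `before` and `after` map each class to a (lower, upper) pair of sets.
    """
    k = len(before)
    unchanged = sum(
        setcmp(before[c][0], after[c][0]) + setcmp(before[c][1], after[c][1])
        for c in before
    )
    return unchanged / (2 * k)

# Hypothetical approximation sets for k = 2 classes, as sets of sample indices.
before = {"a": ({0, 1}, {0, 1}), "b": ({2}, {2})}
after = {"a": ({0, 1}, {0, 1, 2}), "b": (set(), {2})}
print(mu(before, after))   # two of the four sets are unchanged
print(mu(before, before))  # nothing changed
```

Under this assumption $\mu = 1$ coincides with $R_D = 1$, which agrees with point (3) above.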

Crossover Operator Based on Q-Learning.
The crossover operation in a genetic algorithm is the exchange of some genes between two matched chromosomes in a certain way, so as to form two new individuals. However, multidimensional data contain many features, and cross fragments tend to focus on features with large value intervals; hence, the breakpoints on other features lose the chance to cross. Moreover, there are complex correlations among features. Without prior knowledge as a guide, the cross operation of the discretization scheme to be optimized becomes blind, which causes some high-quality fragments to have a high probability of being destroyed. To this end, we use the decision-making ability of Q-learning to select some features of the discretization scheme to be optimized in each iteration for the cross operation.
(1) State: according to the abovementioned analysis, the crossover operation will give each feature a certain probability of change. We define the set of changing features in the crossover operation as a state. Assuming that the multidimensional data have $N$ features, the search space is divided into $2^N - 1$ states, each a combination of several features. For example, when $N = 3$, there are seven states:
$$\{f_1\},\ \{f_2\},\ \{f_3\},\ \{f_1, f_2\},\ \{f_1, f_3\},\ \{f_2, f_3\},\ \{f_1, f_2, f_3\},$$
where $f_i$ is the $i$th feature of multidimensional data, $1 \le i \le N$, and the elements in each $\{\cdot\}$ represent the features corresponding to the loci of the most recent crossover or mutation operation.
(2) Action: since multidimensional data generally contain a large number of features, according to the abovementioned definition of states, the number of states will increase exponentially. The transition between two states corresponds to one action; so, many actions must be defined, which increases the computational complexity. According to the previous analysis, we mainly want to avoid the situation that in each crossover operation, cross segments focus on features with a larger value range, which makes some breakpoints lose the chance of crossing. In addition, some high-quality segments will be destroyed without prior knowledge as guidance. Therefore, for the current state, the next state to jump to after performing an action should be mainly considered from three aspects. First, the feature set of the next state is a subset of that of the previous state. Second, the feature set of the next state is a subset of the complement of that of the previous state. Third, the intersection of the feature set of the next state and that of the previous state is not empty. Accordingly, there are three kinds of actions, represented by G, H, and I. The algorithm jumps to a new state by performing actions on the current state, and executes cross operations on all of the features contained in the new state. Suppose $S_t$ is the current state and $S_{t+1}$ is the next state. $G(S_t)$ represents a random jump to one of all of the subsets of the current state, $H(S_t)$ is a random jump to one of all of the subsets of the complementary set of the current state, and $I(S_t)$ is a random jump to a set whose intersection with the current state is not empty and that is not a subset of the current state:
$$S_{t+1} \in \begin{cases} \{S \mid S \subseteq S_t\}, & A_t = G, \\ \{S \mid S \subseteq \overline{S_t}\}, & A_t = H, \\ \{S \mid S \cap S_t \neq \emptyset,\ S \not\subseteq S_t\}, & A_t = I, \end{cases}$$
where $S$ ranges over the nonempty feature sets.
It is easy to see that the range of values for $G(S_t)$, $H(S_t)$, and $I(S_t)$ covers all of the states.
(3) Reward: to make a correct decision when selecting a feature set in each crossover operation so as to more quickly approach the optimal solution, we set a reward value for each state-action combination. The reward value is based on the change of individual fitness, which is mainly used to evaluate the search for the optimal solution of the algorithm. In equation (11), $P(S_t \to S_{t+1} \mid A_t)$ is the probability of jumping to the next state $S_{t+1}$ after performing action $A_t$ in the current state $S_t$. This probability is $1/N_{state}$, where $N_{state}$ is the number of all of the possible states to jump to. $fit(S_t)$ and $fit(S_{t+1})$ are, respectively, the individual fitness of the current and next state, and $lbest$ is the historical best fitness of the discretization scheme to be optimized.
According to the designed reward value, we can update Q when the discretization scheme to be optimized has a cross operation.
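The state and action definitions above can be sketched as follows. The function names `all_states`, `targets`, and `jump` are assumptions made for illustration; the reward itself is not implemented, only the uniform transition probability $1/N_{state}$ that it uses.

```python
import random
from itertools import combinations

def all_states(features):
    """All nonempty feature subsets: 2^N - 1 states for N features."""
    return [frozenset(c)
            for r in range(1, len(features) + 1)
            for c in combinations(features, r)]

def targets(state, action, states):
    """States reachable from `state` under action G, H, or I."""
    if action == "G":   # subsets of the current state
        return [s for s in states if s <= state]
    if action == "H":   # subsets of the complement of the current state
        return [s for s in states if not (s & state)]
    if action == "I":   # overlaps the current state but is not a subset of it
        return [s for s in states if (s & state) and not (s <= state)]
    raise ValueError(f"unknown action: {action}")

def jump(state, action, states, rng):
    """Random jump; returns the next state and the probability 1/N_state."""
    options = targets(state, action, states)
    if not options:     # e.g. action H from the full feature set
        return state, 0.0
    return rng.choice(options), 1.0 / len(options)

states = all_states(["f1", "f2", "f3"])
current = frozenset({"f1", "f3"})
reach = {a: targets(current, a, states) for a in "GHI"}
print(len(states), {a: len(v) for a, v in reach.items()})
print(jump(current, "H", states, random.Random(0)))
```

From the state containing f1 and f3, action H can only lead to the state containing f2, which matches the example used for Figure 4. The three target lists are disjoint and together contain all seven states.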

Mutation Operator Based on Q-Learning.
As with the crossover operator, we use the decision-making ability of Q-learning to select some features of the discretization scheme to be optimized in each iteration for the mutation operation. The state, action, and reward of the mutation and cross operation are consistent, the only difference being to change the cross operation on the feature to a mutation operation. They each maintain a Q-table. Figure 4 shows the update process of two Q-tables of a discretization scheme with three features in one iteration.
Both Q-tables are initialized to 0. After N iterations, the Q-tables corresponding to the cross and mutation operations are shown in Figures 4(a) and 4(b), respectively. Assuming that the current state is $\{f_1, f_3\}$, action H is selected for the cross operation. Accordingly, the state jumps to $\{f_2\}$, and the Q value of 1 is updated to 6 in the cross operation. Then, state $\{f_2\}$ selects action I for the mutation operation. Accordingly, the state randomly jumps to $\{f_2, f_3\}$, and the Q value of 2 is updated to 3 in the mutation operation. In this way, the two Q-tables are updated in one iteration.

Flow of RLGA Algorithm.
Algorithm 2 shows the flow of the proposed method. First, binary genetic coding is applied to the attribute values of the multidimensional data, and the state is generated according to the number of features. Then, in the current state of the discretization scheme to be optimized, the $\varepsilon$-greedy strategy is used to select the action, jump to the next state, and cross operate with the global optimal individual on the corresponding features. At the same time, the reward value is evaluated according to the fitness value after the cross operation, whose Q-table is then updated. Similarly, in the current state, an action is selected to continue the mutation operation, whose Q-table is updated. The population performs the conventional genetic operation and saves the global optimal individual in each iteration. Finally, the program outputs the larger of the global and local variables. While the algorithm optimizes the given discretization scheme, other individuals of the population evolve to enlarge the search scope and improve the probability of obtaining the optimal solution.

Experimental Design
We introduce the experimental data source, experimental environment configuration, and dataset used in the experiment.

Data Source.
The experimental data are from a Landsat 8 satellite image in the coastal area of the South China Sea on February 22, 2018, as shown in Figure 5. The image consists of seven bands. In the experiment, the objects on the image are divided into five categories: impervious surface, construction, bare land, water, and vegetation.

Configuration of Experimental Environment.
To verify the effectiveness of the algorithm in this paper, comparative experiments are carried out using an Intel Core i5-5200U CPU @ 2.20 GHz, 12 GB memory, and a 512 GB hard disk. The visualization, programming, simulation, testing, and calculation are realized in MATLAB R2016a. The radiometric calibration and atmospheric correction of images, training of classifiers for discrete results, and comparison of classification prediction accuracy are completed in an ENVI5.3 environment.

Input: Multidimensional data discretization scheme
Output: Optimal discretization scheme
Initialize: global variable = 0, local variable = 0, crossover Q-Table = null, mutation Q-Table = null, t = 0;
begin
    Get the initial breakpoints of the multidimensional data by sorting the values of each feature and removing duplicate values;
    Binary encode the initial breakpoints of multidimensional data according to the method in Part B of Section 2;
    Randomly generate initial population P(t);
    Calculate the fitness of each individual in P(t) using equation (8);
    Update global variable with the optimal individual fitness value in P(t);
    Generate state set based on the number of features of multidimensional data according to the definition of state in Part C of Section 3;
    Choose a state from the state set as the initial state S(t);
    while t is less than the user's termination iterations do
        Choose an action from the action set {G, H, I} by ε-greedy strategy according to the definition of action in Part C of Section 3;
        Execute the selected action on the current state S(t) to jump to the next state S(t + 1);
        Perform crossover operation with global variable on the features contained in state S(t + 1);
        Calculate the fitness of the multidimensional data discretization scheme after crossover operation using equation (5);
        Measure the corresponding reward using equation (11) according to the definition of reward in Part C of Section 3;
        Update crossover Q-Table using equation (6);
        if the fitness of the multidimensional data discretization scheme > local variable do
            Update local variable with the fitness of the multidimensional data discretization scheme;
        end
        Perform crossover operation in P(t);
        Calculate the fitness of each individual in P(t) using equation (8);
        Update global variable with the optimal individual fitness value in P(t);
        S(t) = S(t + 1);
        Choose an action from the action set {G, H, I} by ε-greedy strategy according to the definition of action in Part C of Section 3;
        Execute the selected action on the current state S(t) to jump to the next state S(t + 1);
        Perform mutation operation on the features contained in state S(t + 1);
        Calculate the fitness of the multidimensional data discretization scheme after mutation operation using equation (5);
        Measure the corresponding reward using equation (11) according to the definition of reward in Part C of Section 3;
        Update mutation Q-Table using equation (6);
        if the fitness of the multidimensional data discretization scheme > local variable do
            Update local variable with the fitness of the multidimensional data discretization scheme;
        end
        Perform mutation operation in P(t);
        Calculate the fitness of each individual in P(t) using equation (8);
        Update global variable with the optimal individual fitness value in P(t);
        t = t + 1;
    end
    Return Max(global variable, local variable);
end
ALGORITHM 2: RLGA algorithm process.

Preparation of Experimental Datasets.
We randomly select several areas from the image, covering the five categories specified above and integrate them into a set of experimental samples containing training and test sets. e training set includes 6331 samples, consisting of 935 impervious surface samples, 936 construction samples, 958 bare land samples, 2324 water samples, and 1178 vegetation samples. We sort the pixel values of the training set and delete duplicate values in each band to obtain the initial breakpoints of the seven bands, which are 1403, 1429, 1680, 1869, 2402, 2530, and 2240, for a total of 13553 breakpoints. We discretize this initial set and carry out the comparative experiments.
We compare the proposed method to the classical genetic algorithm (GA) [33] based on the consistency principle of the decision system. Then, we compare the optimal set of breakpoints obtained by our method with those of current mainstream supervised discretization methods, such as EDiRa [45], ChiMerge [46], 1R [23], NCAIC [47], FUDC [48], Cramer's V-Test [49], and Chi2 [50], mainly on the evaluation of the number of intervals and data consistency.
In addition, we use the proposed method to optimize the discretization results of the MFD-mvtR algorithm [51]. Finally, we train the neural network classifiers with the discretized samples of all of the methods, and verify the effectiveness of our method by comparing the classification accuracy of each method.

Results and Discussion
We compare the performance of our method with that of the classical genetic algorithm based on the consistency principle of the decision system, and we evaluate the results of seven state-of-the-art discretization methods on the Landsat 8 image. We also use our method to optimize the discretization results of MFD-mvtR [51] on the Gaofen-2 image. Finally, the effectiveness of our method is verified by comparing the classification accuracy of each method.

RLGA versus GA.
We set the population size of the two algorithms to 30 and the number of iterations to 500, and we run them 10 times independently. Figure 6 shows the number of iterations each algorithm needs to reach the theoretical optimal solution in the 10 independent experiments. Table 1 compares the convergence rates of the two algorithms. The search efficiency is expressed as

$$E = \frac{T - A}{T},$$

where A is the average number of iterations to obtain the optimal solution, T is the total number of iterations, and E is the search efficiency, i.e., the convergence speed. The larger the value, the faster the convergence. It can be seen from Table 1 that the average number of iterations for GA to obtain the optimal solution is 425.3, and the search efficiency is 0.149. The corresponding values for our method are 338.6 and 0.323, that is, about 87 fewer iterations on average and more than twice the search efficiency.
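As a quick check of the Table 1 figures quoted above, the sketch below assumes that the search efficiency is the unused fraction of the iteration budget, (T − A)/T, a form that reproduces both reported values. The function name `search_efficiency` is hypothetical.

```python
def search_efficiency(avg_iterations, total_iterations):
    """Fraction of the iteration budget left unused when the optimum is reached."""
    return (total_iterations - avg_iterations) / total_iterations


T = 500  # total iterations per run, as stated in the text
e_ga = round(search_efficiency(425.3, T), 3)    # GA:   0.149
e_rlga = round(search_efficiency(338.6, T), 3)  # RLGA: 0.323
```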

RLGA versus Mainstream Discretization Algorithms.
We compare the optimal set of breakpoints obtained by our method to mainstream supervised discretization methods, including EDiRa [45], ChiMerge [46], 1R [23], NCAIC [47], FUDC [48], Cramer's V-Test [49], and Chi2 [50], on the number of intervals and data consistency. Figure 7 shows the number of discrete intervals obtained by the eight algorithms in each band. RLGA obtains the minimum number of breakpoints in each band. Table 2 compares the overall number of intervals and data consistency of the eight algorithms. We can see that the number of discrete intervals obtained by RLGA is 2247, which is the fewest among all of the algorithms, and there is no data error. EDiRa obtains 3909 discrete intervals with 4 data errors. The number of discrete intervals obtained by ChiMerge is 4947, and the number of data errors is also 4. The number of discrete intervals obtained by 1R is 3053, which is the fewest except for RLGA, but it has the most data errors among all of the algorithms, at 31. The number of discrete intervals obtained by NCAIC is the largest among all of the algorithms, at 5041, with 4 data errors. FUDC obtains 4072 discrete intervals, with 4 data errors. Cramer's V-Test and Chi2 both obtain relatively small numbers of discrete intervals, 3858 and 3538, respectively, but they also produce more data errors, 7 and 19, respectively. Considering both the number of discrete intervals and the number of data errors, RLGA has the best discretization quality.
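The two quantities compared above, interval count and data errors, can be illustrated with a small sketch. The exact definition of a data error is not restated in this section, so the code assumes one common rough-set reading: samples that fall into the same combination of intervals but carry a minority class label count as errors. The names `discretize` and `count_inconsistencies` and the toy data are hypothetical.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def discretize(row, cuts_per_band):
    """Map each band value to the index of the interval it falls into."""
    return tuple(bisect_right(cuts, v) for v, cuts in zip(row, cuts_per_band))


def count_inconsistencies(rows, labels, cuts_per_band):
    """Count samples whose label disagrees with the majority label of their cell."""
    cells = defaultdict(Counter)
    for row, label in zip(rows, labels):
        cells[discretize(row, cuts_per_band)][label] += 1
    return sum(sum(c.values()) - max(c.values()) for c in cells.values())


# Toy single-band example (hypothetical values).
rows = [(1,), (2,), (8,), (9,)]
labels = ["water", "water", "vegetation", "water"]

coarse = [[5]]        # one breakpoint, two intervals
fine = [[5, 8.5]]     # two breakpoints, three intervals

errors_coarse = count_inconsistencies(rows, labels, coarse)
errors_fine = count_inconsistencies(rows, labels, fine)
```

With the coarse scheme the values 8 and 9 share an interval but have different labels, so one sample is inconsistent; the extra breakpoint in the fine scheme separates them. The trade-off reported in Table 2 is the same one at scale: fewer intervals are preferable only as long as the error count does not rise.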

Optimization of Breakpoints on Gaofen-2 Image.
We use RLGA to optimize the discretization results obtained by the MFD-mvtR algorithm on the Gaofen-2 image [51]. Because this image contains only four bands, the search space is divided into 15 states, far fewer than the 127 states obtained from the 7 bands of the Landsat 8 image. Therefore, we can set the number of iterations of RLGA to only 100 and still allow the algorithm to learn fully. The number of optimized breakpoints is shown in Tables 3 and 4. The MFD-mvtR algorithm [51] obtains 1345 breakpoints with 5 data errors on the Gaofen-2 image. Compared with other mainstream algorithms, the results of discretization are satisfactory. We take the results of MFD-mvtR as the discretization scheme to be optimized, use RLGA to perform the reinforcement learning-based crossover operation with the optimal individuals in the population, and carry out its own reinforcement learning-based mutation operation. After further optimizing the results of MFD-mvtR, the numbers of breakpoints in the four bands are reduced to 310, 301, 222, and 198, respectively. The total number of breakpoints is 314 less than before optimization, and the number of data errors is reduced to 0.
Table 5 shows the classification accuracy of the neural network after training on the discrete feature sets obtained by the abovementioned algorithms. We can see that after optimizing the discretization scheme of MFD-mvtR, the classification accuracy obtained from the confusion matrix is about 6 percentage points higher than the original, and the kappa coefficient is 0.8925. Our method reduces the number of breakpoints in the discretization scheme of MFD-mvtR while ensuring that there is no data error. It simplifies the dataset, and the classifier trained on the optimized discrete feature set effectively identifies the five types of areas on the image (impervious surface, construction, bare land, water, and vegetation), which improves the performance of the classifier.
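The state counts quoted for the two images (15 for four bands, 127 for seven) equal 2^n − 1, the number of non-empty subsets of n bands. The sketch below enumerates such subsets under the assumption that each state corresponds to one non-empty subset of features; the formal definition of a state is given in Section 3, and the function name `enumerate_states` is hypothetical.

```python
from itertools import combinations


def enumerate_states(n_features):
    """List every non-empty subset of feature indices as a tuple."""
    features = range(n_features)
    return [
        subset
        for size in range(1, n_features + 1)
        for subset in combinations(features, size)
    ]


gaofen_states = enumerate_states(4)   # four bands
landsat_states = enumerate_states(7)  # seven bands
print(len(gaofen_states), len(landsat_states))  # prints "15 127"

# Breakpoint totals reported for the Gaofen-2 optimization.
optimized_total = sum([310, 301, 222, 198])  # 1031
reduction = 1345 - optimized_total           # 314, as stated in the text
```

Under this reading, the number of states grows exponentially with the number of bands, which is consistent with the smaller iteration budget being sufficient for the four-band image.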

Conclusion and Future Work
In the process of multidimensional data discretization, due to the complex correlation among features and the performance bottleneck of the traditional discretization criteria, most discretization schemes obtained by the algorithms are not optimal in specific application scenarios, or they may even fail to meet the accuracy requirements of the system. Some swarm intelligence algorithms can achieve better results, but without prior knowledge as a guide, it is difficult to formulate appropriate strategies, which causes an inefficient search in multidimensional space, consumes many computing resources, and makes the search prone to falling into local optima. To solve these problems, this paper proposes a reinforcement learning-based genetic algorithm to optimize the discretization of multidimensional data. First, we binary-encode the attribute values of the multidimensional data and initialize the population. Second, we use rough sets to construct individual fitness functions and design control functions to dynamically adjust the diversity of the population. Then, we introduce the Q-learning reinforcement learning mechanism to the crossover and mutation operations to determine the crossover fragments and mutation points of the discretization scheme to be optimized. We conduct simulation experiments on Landsat 8 and Gaofen-2 images to compare RLGA to the traditional genetic algorithm and state-of-the-art discretization methods.
The experimental results show that our method can reduce the number of breakpoints and simplify the multidimensional dataset without decreasing the data consistency and classification accuracy of a discretization scheme.
Future research work includes the following: (1) test and improve the proposed method on different multidimensional datasets, expand its application scope, and make it more practical; (2) to ease the problems of high-dimensional datasets, improve the algorithm using deep Q-learning technology; and (3) improve the performance of the deep neural network by using RLGA to optimize its parameters.

Data Availability
The Landsat 8 and Gaofen-2 processing data used to support the findings of this study are included within the article.

Disclosure
The sponsors had no role in the design, execution, interpretation, or writing of the study.

Conflicts of Interest
The authors declare no conflicts of interest.