Offline reinforcement learning for industrial process control: a case study from steel industry

Flatness is a crucial indicator of strip quality that is challenging to regulate due to the high-speed process and the nonlinear relationship between flatness and process parameters. Conventional flatness control methods are based on first principles, empirical models, and predesigned rules, which adapt poorly to changing rolling conditions. To address this limitation, this paper proposes an offline reinforcement learning (RL) based data-driven method for flatness control. Using data collected from a factory, the offline RL method learns the process dynamics from data to generate a control policy. Unlike online RL methods, the proposed method does not require a simulator for training; the resulting policy can therefore be safer and more accurate, since a simulator involves simplifications that can introduce bias. To obtain steady performance, the proposed method incorporates ensemble Q-functions into policy evaluation to address uncertainty estimation. To address distributional shift, behavior cloning, weighted by Q-values from the ensemble Q-functions, is added to policy improvement. Simulation and comparison results show that the proposed method outperforms state-of-the-art offline RL methods and achieves the best performance in producing strips with lower flatness.


Introduction
In the steel-making industry, strip rolling is a manufacturing process in which steel strips are passed through rolling mills [1]. An industrial process control system monitors the strip in real time and controls the mills to produce strips with the thickness, crown, and flatness specified by customers. Currently, in real factories, the feedback control loop is designed using a Proportional-Integral (PI) controller [2]. To control the flatness of the strip, a PI controller computes process parameters for the final mill based on the measured flatness values. Compared with existing approaches, an offline RL method offers two advantages: 1. Offline RL learns an adaptive policy from the process data, while traditional PI control requires expert knowledge of the specific industrial case and complicated parameter tuning. 2. Offline RL learns from a static dataset, directly extracting information to generate a policy, while online RL requires a model of the real process for training; because such models rely on simplification, approximation, and black-box methods, the reality gap between the model and the real system reduces reliability.
This paper proposes a novel offline RL method to learn a stable and safe policy for flatness control of a cold rolling line. The data were obtained from the control system of an operational steel mill. Based on Twin Delayed DDPG with behavior cloning (TD3-BC) [17], ensemble Q-functions are used in policy evaluation to address problems in uncertainty estimation, and the resulting stable Q-values are used for policy improvement. Moreover, a data-driven black-box model of the strip rolling line is used to evaluate the trained RL policy. The main intended contributions of this paper are as follows: 1. To learn a reliable policy, a data-driven control method is proposed that uses offline RL to extract information from a pre-collected dataset rather than interacting with a simplified environment; a novel offline RL method is proposed to obtain stable Q-values for policy improvement. 2. To ensure safety, a data-driven evaluation is adopted for comparison and simulation to support the deployment readiness of the policy. 3. Comparisons with state-of-the-art (SOTA) methods and the existing PI controllers at the case-study factory were carried out; the proposed method showed the best performance in flatness control of strip rolling.
The rest of this paper is organized as follows. The existing methods, SOTA offline RL methods, and their applications in control are reviewed in Section 2. Section 3 discusses the basic principles of offline RL and the proposed method. A case study, including the details of the data and production line, as well as experimental settings, is presented in Section 4. Section 5 analyzes the results of the proposed method. Section 6 concludes the paper and describes future research plans.

Literature review
Conventional controllers such as PI are popular in industrial process control. However, to improve control performance and product quality and to lower manual engineering costs, the limitations of conventional controllers should be addressed. An intelligent, adaptive, and easy-to-implement control method is required to solve the problems of trial-and-error-based parameter tuning and the multi-field cooperation needed for system building. With an RL-based data-driven control method, the policy can capture the process dynamics and find optimal solutions [2,5]. In [2], an online RL method based on a black-box model was proposed for flatness control in strip rolling. Such online RL methods rely on approximated dynamic models; although they demonstrated the feasibility of RL in strip rolling control, the learned policy can be unsafe and unreliable for deployment.
Compared with online RL, which needs interactions between the agent and its environment, offline RL is similar to supervised learning and can extract information from a pre-collected dataset to learn a policy [16]. This means an environment is not required, and the limitations caused by simplified environments do not apply. Especially in industry, real data are available, but a high-fidelity environment is difficult to build; offline RL is thus a potentially promising tool for flatness control in strip rolling. Behavior cloning (BC) directly learns a policy by applying supervised learning to observation-action pairs. It is easy to implement, but the generalization of the policy is limited, and it cannot recover from errors [18]. In practice, off-policy RL algorithms (e.g., soft actor-critic (SAC) [19] and twin delayed deep deterministic policy gradient (TD3) [20]) can technically be trained on a fixed dataset. However, the accuracy of the Q-function regression targets depends on the Q-value estimates for actions outside the distribution of actions on which the Q-function has been trained. When the learned policy differs substantially from the behavior policy, this discrepancy can result in highly erroneous target Q-values [16,21]. The problem is further exacerbated by policy improvement, in which the learned policy is optimized by maximizing the Q-values.
To address the above problem, Bootstrapping Error Accumulation Reduction (BEAR) [21], Safe Policy Improvement with Baseline Bootstrapping (SPIBB) [22], and Twin Delayed DDPG with Behavioral Cloning (TD3-BC) [17] were proposed; they constrain the learned policy to the support of the behavior policy without changing the estimation of Q-values. In addition to constraining the policy, another approach is to add a regularizer to the Q-function to avoid overestimation for out-of-distribution (OOD) actions, as in Conservative Q-learning (CQL) [23] and Implicit Q-learning (IQL) [24]. These methods mainly focus on lowering or removing the Q-value estimates for OOD actions through complicated algorithmic modifications.
Unlike the general online [19,20] and offline [17,21-24] RL algorithms, which were designed and evaluated on benchmark environments, industrial users prefer practical methods that balance effectiveness and complexity. Offline RL has been applied to different fields, such as recommendation systems [25], smart grids [26], power control [27], robotics [28], and autonomous driving [29]. However, there is a lack of offline RL research for process industry applications such as strip rolling.
In this paper, motivated by the problems discussed above, a novel offline RL framework is proposed for strip rolling control. Owing to the real data collected from the factory, the learned policy of offline RL can be safer, more reliable, and more practical for deployment.

Strip rolling
Flatness control in strip rolling is chosen as the case to evaluate the RL-based data-driven method. As shown in Fig. 1, the finishing rolling process consists of five finishing rolling mills, through which the strips move in sequence. The controller computes process parameters for the mills after receiving the flatness values measured by the sensor. Readings are taken from four flatness sensors to obtain the state vector, defined as S = (s1, s2, s3, s4). The process parameters, which include the rolling force, the bending forces of the work roll and intermediate roll, and the roll gap tilting, are recorded by the process control system and define the action vector A. The flatness of a strip is computed as follows [30]:

flatness = (ΔL / L_ref) × 10^5,

where L_ref is the length of the strip used as a reference and ΔL is the difference between the length of a given strip and the reference strip. As discussed in Sections 1 and 2, the current control method in the factory is based on the PI controller, which has three main limitations: expert knowledge is required, parameter tuning is inefficient, and the controller is less adaptive to changing rolling conditions [31]. Moreover, hard-to-control factors of the rolling process, such as vibration of the running mills, the changing temperature of the rollers, and the product surface, can affect the control performance and the quality of the strips. To satisfactorily capture the impact of process parameters and hard-to-control factors on the flatness of the strip, a new control method is needed that avoids redesigning the control logic and system. Table 1 describes the parameters of the studied production line. The RL controller receives the flatness values (states) and outputs the process parameters (actions). In the strip rolling industry, optimal strips have flatness values of 0, and the control target is to compute new actions for the mills to produce strips with flatness values as close to zero as possible.
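The flatness definition above can be sketched in code. The function name and the example values are illustrative, assuming the standard I-unit scaling of 10^5:

```python
def flatness_i_units(delta_l: float, l_ref: float) -> float:
    """Flatness in I-units: relative length difference of a strip fibre
    versus the reference fibre, scaled by 1e5."""
    return delta_l / l_ref * 1e5

# Example: a fibre 0.2 mm longer than a 10 m (10000 mm... units must match)
# reference fibre, both expressed in metres here.
value = flatness_i_units(delta_l=0.0002, l_ref=10.0)  # ≈ 2.0 I-units
```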

Offline RL for flatness control
According to Sections 1 and 2, online RL applications have been studied in the steel-making industry and other fields [2,12-15,17-29,32]. The left part of Fig. 2 shows the general framework for online RL applications. Firstly, experts from specific domains collaborate to build a simulator for the studied industrial process, which requires simplification, discretization, and trial-and-error. Secondly, an RL method is used to train a policy by interacting with the simulator. Thirdly, the learned policy is evaluated with the simulator. In other words, the learned policy is fully trained and evaluated by the simulator.
However, for online RL in industries, especially in the safety-critical processes (e.g., strip rolling), simplification, discretization, and approximation could generate bias between the simulator and the real process, and modeling is time-consuming. Moreover, a new simulator is needed when studying a new target. Therefore, the learned policy is unsafe to be deployed in real systems. The real process can be more complicated, so this policy could endanger the safety of operators and damage the machines. Addressing the reality gap problem in the online RL framework becomes a significant topic.
To address the problems of online RL, offline RL for a safety-critical process is studied in this paper. In Fig. 2, the right part shows an offline RL training framework. Firstly, process data are collected from the server of the control system, the data contain process and quality parameters that will be used for training. Secondly, offline RL is trained using the data without interacting with an environment. Thirdly, a model-based evaluation is adopted to evaluate the policy. Compared to online RL, this framework can directly learn from real industrial data rather than building a simulator. Thus, the bias problem in online RL can be reduced. Since the data are collected from the real process, key information about the process is captured in the data.

Offline RL
A Markov Decision Process (MDP) is defined by a tuple (S, A, R, P, γ), where S is the state space, A is the action space, R is the reward function, P is the transition dynamics, and γ is the discount factor. RL trains a policy by maximizing the expected cumulative discounted reward in the MDP. D = {(s, a, r, s′)} is a dataset of tuples from trajectories. In an actor-critic algorithm, the Bellman operator is used to update the parametric Q-function Q_θ(s, a) in policy evaluation:

θ ← argmin_θ E_(s,a,r,s′)∼D [ ( r + γ Q_θ̄(s′, a′) − Q_θ(s, a) )² ],  a′ = π_ϕ(s′),   (1)

where Q_θ̄ denotes the target network. Based on the Q-values, the parametric policy π_ϕ(a|s) is improved via policy improvement, which maximizes the expected Q-values:

ϕ ← argmax_ϕ E_s∼D [ Q_θ(s, π_ϕ(s)) ].

In offline settings, the agent can only learn from the static dataset D: it cannot correct its actions by taking them and obtaining rewards from the environment. OOD actions a′ in Eq. (1) do not exist in D; they are generated by the learned policy π and used to compute the target Q-values, resulting in highly erroneous targets. Meanwhile, the policy is optimized by maximizing these Q-values, further exacerbating the problem. As discussed in Section 2, CQL addresses Q-value overestimation by learning conservative Q-values in policy evaluation [23], while IQL directly removes OOD actions from policy evaluation [24]. Different from CQL and IQL, which use complex regularizers and terms, BEAR, SPIBB, and TD3-BC constrain the learned policy to keep it close to the behavior policy [17,21,22]. In particular, TD3-BC, which adds a behavior cloning term on top of TD3, has been called a minimalist approach to offline RL [17]. It is simple but effective, which is potentially significant in industry.
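The two steps above can be sketched with plain arrays. This is a toy illustration with made-up numbers, not the paper's networks: policy evaluation regresses Q towards a Bellman target, and policy improvement picks the action that maximizes the learned Q-values.

```python
import numpy as np

def bellman_target(r: float, gamma: float, q_next: float) -> float:
    # Target for policy evaluation: y = r + γ Q_θ'(s', a'), with a' = π(s')
    return r + gamma * q_next

# Policy improvement: choose the action with the highest Q-value.
q_values = np.array([0.1, 0.7, 0.4])      # Q(s, a) for three candidate actions
best_action = int(np.argmax(q_values))    # argmax_a Q(s, a) → action index 1
```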

Stable Q-learning
In this paper, considering the industrial application, a safe, simple, effective, and adaptive control method is sought after. As discussed in Section 4.1, an offline RL framework will be adapted to the strip rolling by learning from the real data. In this section, considering the uncertainty estimation, a novel offline RL method (stable Q-learning, SQL) is proposed for this framework to obtain a safe and reliable policy for the studied problem.
In RL, deep ensembles have achieved the best performance in addressing uncertainty estimation [20]. In offline RL, ensemble Q-functions can be used to address the overestimation of Q-values: each Q-function is updated separately, and due to random initialization, pessimistic Q-value estimates can be obtained from the different outputs of the Q-functions [33]. In this paper, the aim is to improve the generalization of the networks for estimating stable Q-values. N Q-functions are updated separately using a minibatch, and the objective of policy evaluation in the proposed method is:

θ_i ← argmin_θ_i E [ ( r + γ min_j=1,…,N Q_θ̄_j(s′, π_ϕ(s′)) − Q_θ_i(s, a) )² ],  i = 1, …, N,

where θ_i and θ̄_i denote the parameters of the i-th Q-network and its target network. Since TD3 was designed with two Q-functions and two target Q-functions, the minimum target Q-value over all Q-functions is used for policy evaluation. This paper explores addressing uncertainty estimation by adding more Q-functions and corresponding target Q-functions, because the parametric Q-functions are approximated by neural networks and can be biased, which decreases the generalization capability of the network.
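A sketch of the pessimistic ensemble target; the per-network Q outputs below are made-up numbers standing in for the target networks' predictions:

```python
def ensemble_target(r: float, gamma: float, q_next_per_net) -> float:
    # y = r + γ min_i Q_θ̄i(s', π(s')): taking the minimum over the N target
    # networks yields a pessimistic value for uncertain (e.g. OOD) actions.
    return r + gamma * min(q_next_per_net)

# Four target networks disagree; the most pessimistic one sets the target.
y = ensemble_target(r=-1.0, gamma=0.99, q_next_per_net=[-60.0, -58.5, -61.2, -59.0])
```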
Compared with the policy improvement in TD3, which uses the output of only one Q-function to optimize the policy, SQL averages all the Q-values:

ϕ ← argmax_ϕ E_s∼D [ (1/N) Σ_i=1..N Q_θ_i(s, π_ϕ(s)) ].

Moreover, the policy regularizer of TD3-BC [17] is added to SQL, pushing the policy towards actions contained in the dataset D. The new policy improvement step is:

ϕ ← argmax_ϕ E_(s,a)∼D [ λ (1/N) Σ_i=1..N Q_θ_i(s, π_ϕ(s)) − (π_ϕ(s) − a)² ],  λ = α / ( (1/M) Σ_i=1..M |Q(s_i, a_i)| ),

where (π_ϕ(s) − a)² is the behavior cloning term, (s_i, a_i) are the M transitions in the minibatch, α is a hyperparameter, and the normalization term λ, based on the average absolute Q-value, normalizes the Q-values. Algorithm 1 shows the pseudocode of SQL.
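The λ-weighted objective can be sketched on a toy minibatch as follows; the array shapes and the small ε added to the denominator are our assumptions, not the paper's exact implementation:

```python
import numpy as np

def sql_policy_objective(q_ensemble, policy_actions, data_actions, alpha=2.5):
    """Objective to maximize: λ times the mean ensemble Q-value, minus the
    behavior-cloning penalty (π(s) − a)²."""
    q_mean = np.mean(q_ensemble, axis=0)            # average over the N Q-functions
    lam = alpha / (np.mean(np.abs(q_mean)) + 1e-8)  # λ = α / mean|Q| (ε for safety)
    bc_penalty = np.mean((policy_actions - data_actions) ** 2)
    return lam * np.mean(q_mean) - bc_penalty
```

With perfect behavior cloning (policy actions equal to dataset actions) the penalty vanishes and the objective reduces to the normalized mean Q-value.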

Algorithm 1
Initialize N Q-functions Q_θ_i and a policy π_ϕ; create target networks Q_θ̄_i and π_ϕ̄; load the offline RL dataset D generated by the behavior policy π_β.
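The initialization and one policy-evaluation update of Algorithm 1 can be sketched as a runnable toy skeleton; the linear Q-functions, the 1-D transition, and all numeric values are illustrative stand-ins for the paper's networks and data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, gamma, tau, lr = 4, 0.99, 0.005, 0.1

# Initialize N Q-functions and their target networks (linear stand-ins).
q_params = [rng.normal(size=2) for _ in range(N)]
q_targets = [p.copy() for p in q_params]

def q_value(p, s, a):
    return p[0] * s + p[1] * a

# One policy-evaluation step on a single transition (s, a, r, s', a'=π(s')).
s, a, r, s2, a2 = 0.5, 0.1, -1.0, 0.4, 0.05
y = r + gamma * min(q_value(p, s2, a2) for p in q_targets)      # pessimistic target
for i in range(N):
    grad = (q_value(q_params[i], s, a) - y) * np.array([s, a])  # grad of ½(Q − y)²
    q_params[i] = q_params[i] - lr * grad                       # gradient step
    q_targets[i] = (1 - tau) * q_targets[i] + tau * q_params[i] # Polyak averaging
```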

Experiment setting
Due to the flatness adjustment used in the production line [34], the flatness values at different reference points at the same time step follow a similar trend, so the flatness performance at each time step can be estimated by averaging the values at the reference points. The reward function is designed as:

r = −(1/4) Σ_i=1..4 |s_i|,

where s_i is the i-th state and four reference points are studied in the strip rolling line. Using this reward function, the learned policy is trained to generate actions that produce strips with flatness values close to 0. An MDP is a discrete-time stochastic control process defined by a tuple (S, A, R, P, γ). In the flatness control case, the state S holds the flatness values of the strips, which need to be optimized. The action A comprises the process parameters (rolling force, bending forces of the work roll and intermediate roll, and roll gap tilting), which control the rolling mill so that it produces strips with the desired flatness. The reward function R computes the indicator of strip quality. The transition dynamics P represent the model of the process: given an action, the next state can be obtained through P. The discount γ is used to compute the discounted return in RL and determines the present value of future rewards [8].
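Under the assumption that the reward is the negative mean absolute flatness over the four reference points (consistent with the negative reward values reported in Section 5), it can be sketched as:

```python
import numpy as np

def flatness_reward(state: np.ndarray) -> float:
    # r = -(1/4) Σ |s_i|: flatter strips (values near 0) earn higher reward.
    return -float(np.mean(np.abs(state)))

# Four reference-point flatness readings; a perfectly flat strip gives r = 0.
r = flatness_reward(np.array([12.0, -8.0, 4.0, -16.0]))  # → -10.0
```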
Although the bias problem can be reduced by adopting offline RL, it is risky to deploy the policy to the real system for evaluation. In this paper, a black-box data-driven approach is developed to model the real process. A probabilistic neural network (PNN) [35] is trained that receives the current state and the chosen action as input and outputs the next state. To address uncertainty, the output is a parameterized Gaussian distribution conditioned on s and a:

f_θ(s_t+1 | s_t, a_t) = N( μ_θ(s_t, a_t), Σ_θ(s_t, a_t) ),

where the mean and the standard deviation are parameterized by θ. As suggested by [36], the model is trained to predict the difference from the current state, Δs_t+1 = f_θ(s_t, a_t), so that s_t+1 = s_t + Δs_t+1.
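A sketch of how such a probabilistic dynamics model is typically used at prediction time; the mean and log-std values here are placeholders for the network outputs, not trained parameters:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_next_state(s_t, mu, log_std):
    # Δs ~ N(μ_θ(s, a), σ_θ(s, a)²); the model predicts the state difference,
    # so the next state is the current state plus the sampled delta.
    delta = rng.normal(mu, np.exp(log_std))
    return s_t + delta

# Four flatness states, zero predicted mean change, small predicted std.
s_next = sample_next_state(s_t=np.zeros(4), mu=np.zeros(4), log_std=np.full(4, -2.0))
```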
In this paper, the proposed method (SQL) will be compared to the SOTA methods (BC, CQL, IQL, TD3-BC). First, key hyperparameters of the proposed SQL method are studied to find the optimal setting, including the ensemble number N of Q-functions, and α in the policy improvement step. Then, SQL will be compared with SOTA. Finally, an experiment of the proposed method will be analyzed and compared with the existing method used in the factory.

Analysis of SQL
This section investigates the parameter sensitivity of the proposed method (SQL). The number of Q-functions and α in policy improvement were studied separately by grid search. Fig. 3(a) presents the analysis of the number of Q-functions, where values of {2, 4, 6, 8, 10} were tested and α was fixed at 2.5, as recommended in [17]. The models exhibit a similar trend as the number of Q-functions increases from 2 to 10, with a decrease around the 20th timestep. Nonetheless, an increase in reward was observed during the first 20 timesteps when the number of Q-functions was raised from 2 (blue line) to 4 (orange line). With 4 Q-functions, a temporary dip in reward occurred between the 21st and the 30th timestep, but the reward thereafter increased, yielding the highest final reward among all models. In contrast, as the number of Q-functions was further increased from 4 to 10, an overall decrease was observed. Based on these findings, the model with 4 Q-functions was deemed to achieve the best performance.
The effect of α on the performance of SQL is analyzed in Fig. 3(b). The models were evaluated using 4 Q-functions and varying values of α from {1.5, 2.5, 3.5, 5, 10, 20, 30}. The results show that increasing α reduces the fluctuations in reward; in particular, when α was set to 20, a significant reduction in fluctuation was observed. The model with α = 20 also achieved the highest final reward of −60. The key hyperparameters of the proposed method that achieved the optimal performance are shown in Table 2.

Comparison of the SOTA methods
In this section, the proposed method (SQL) is compared with the SOTA methods (BC, CQL, TD3-BC, IQL) in terms of reward; the comparison results are shown in Fig. 4. The reward was designed to assess the overall flatness: first, the absolute values of the flatness at the different reference points (states 1 to 4) were computed; then, the mean of these absolute values was obtained; finally, the negative of the mean was used as the reward. For the studied case, the optimal flatness is 0, so the flatness should be as close to 0 as possible, leading to high reward values. In RL, the reward is used as an indicator during training, and high-quality strips yield high reward values.
Although BC shows an uptrend from the 10th to the 20th step and IQL from the 35th to the 50th step, both have an overall downtrend during training, suggesting that they are not effective in addressing the problem. CQL exhibits a steady increase in reward from −92 to −78 with a shrinking shaded area. Compared with CQL, TD3-BC reaches a higher final reward of −75 but fluctuates heavily during training: the reward decreases from the 10th to the 20th step and then increases, and before the 40th step the large shaded area indicates unsteady performance. Notably, the shaded area of TD3-BC decreases significantly from the 40th step, meaning that the standard deviation across training runs with different random seeds decreases. Among the SOTA methods, CQL has an overall steady performance, but its final reward is lower than that of TD3-BC; TD3-BC fluctuates heavily during training, but its final reward is higher and its shaded area is smaller at the end of training.
The results presented in Fig. 4 demonstrate the superior performance of SQL compared to the SOTA methods. Firstly, SQL successfully mitigated the decrease in reward observed in TD3-BC from the 10th to the 20th step and reduced the uncertainty, as evidenced by the smaller shaded area. Secondly, from the 1st to the 20th step, SQL exhibits a trend in reward comparable to CQL but with a higher reward throughout the process, resulting in a final reward of −62. CQL shows lower uncertainty, as indicated by its smaller shaded area, but its final reward of −78 is significantly lower than that of SQL (−62). These findings support the conclusion that SQL is the optimal control method for the studied case.

Experiment analysis
In this section, simulation results of the existing method (PI) commonly used in the factory and the proposed SQL method are analyzed. For the states, either positive or negative values are acceptable, and the optimal value is 0; in other words, the optimization goal is to produce high-quality strips with flatness values as close to 0 as possible. The results are shown in Fig. 5. For state 1, 80 % of the flatness values (from the 1st to the 800th step) are negative, and SQL demonstrated a remarkable ability to improve the flatness by pushing the values closer to 0. Similar performance was observed in state 4. In states 2 and 3, over 90 % of the flatness values are positive, and SQL successfully reduced them, demonstrating its ability to improve strip quality. The results in Fig. 5 suggest that SQL consistently pushes the values towards 0, regardless of whether their initial values are negative or positive. These findings indicate that SQL is a promising tool for addressing the studied problem.
Another evaluation of the proposed method (SQL) and the existing method (PI) was performed by computing the average and standard deviation (STD) of the absolute flatness values in each state. As per the established quality standards, high-quality steel strips should have flatness values close to 0. Fig. 6 compares the average flatness of the two methods in the four states. The results indicate that the average flatness values produced by SQL are lower than those produced by PI in three states, with reductions of 8.48 %, 5.48 %, and 5.13 % for states 1, 2, and 3, respectively. Even though the flatness of state 4 produced by SQL increased by 0.27 %, it is still lower than that of state 3 (1.331); a balance between the overall and individual flatness was thus obtained. Fig. 7 shows the STD of the flatness in the four states. The STD of the flatness values produced by SQL is lower than that of PI in all four states, with reductions of 7.69 %, 6.55 %, 10.53 %, and 2.66 % for states 1 to 4, respectively. A lower STD indicates improved steadiness of performance, leading to smoother steel strips. In conclusion, the evaluation of the average and STD of the flatness values demonstrates that the proposed method (SQL) produces steel strips with flatness values closer to 0 and with improved steadiness compared to the existing method (PI). The purpose of this paper is to propose a novel and adaptive method for changeable manufacturing conditions; the proposed method and findings are important for safety-critical applications and are expected to raise quality standards and strengthen the export competitiveness of the factory.
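The two summary metrics can be computed as below on toy numbers; whether the STD is taken over the absolute values or the raw values is our assumption:

```python
import numpy as np

# One state's flatness trace over a few time steps (illustrative values only).
flatness_trace = np.array([3.0, -1.0, 2.0, -2.0])
avg_abs = float(np.mean(np.abs(flatness_trace)))  # average absolute flatness → 2.0
std_abs = float(np.std(np.abs(flatness_trace)))   # spread of the absolute values
```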

Conclusion and future work
This paper studied a novel offline RL method for flatness control in the strip rolling industry. By using a pre-collected dataset for training rather than an approximated environment, the proposed method reduced the bias introduced by the environment. A Q-ensemble was adopted to address the problem of uncertainty in Q-value estimates, and a behavior cloning term was added to the policy improvement step to push the policy towards actions contained in the dataset. In the comparison, the reward of the proposed method increased without sharp drops and reached the highest final value, from which it can be concluded that the method learns a policy more stably and effectively than the SOTA methods. In the experiment, the results showed that the proposed method can produce strips with more flatness values close to 0 than the traditional control method, and the lower STD means its performance is more stable.
The next steps of our work will focus on the reality gap between the experiment and the real process. Although a simulator for training is avoided, the offline RL used in this paper relies heavily on real data, which means that the quality of the data can affect the reliability of the policy. Before deployment to the system, two plans are available to improve the policy: firstly, to collect more data from the real system so that the policy can explore the problem more fully and find optimal solutions; secondly, to adopt an additional step (e.g., transfer learning) to optimize the policy.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
The data that has been used is confidential.