An Online Control Approach for Forging Machine Using Reinforcement Learning and Taboo Search

Offline training followed by online implementation is the dominant method in data-driven control. However, inconsistency between the offline data and the online data may degrade the control performance. To address this issue, this study develops an online control strategy in which the control parameters are updated online from real-time measurements to ensure satisfactory control performance. Specifically, an online control algorithm is proposed to control the pressing-down speed of the forging machine within the framework of reinforcement learning, which is capable of building a complete mapping from the state space to the action space using only neighbouring samples. Instead of trial and error, which is too slow for online implementation, a taboo search is used to speed up the learning-working process by directly searching for the control at the current states, subject to stability conditions derived from Lyapunov stability theory. A coarse model, used only to obtain the cost information for the reinforcement learning, makes the best of the mechanism information and prevents the occurrence of invalid states that do not conform to the system characteristics. The effectiveness of the algorithm is demonstrated on an ultra-low forging machine, where it outperforms conventional approaches such as PID and neural network control. The proposed algorithm also has advantages in parameter adjustment, so it is easier to implement in a practical system.


I. INTRODUCTION
The forging machine, an electro-hydraulic hybrid system with nonlinearity and multi-field coupling, is an essential piece of equipment in the forging industry [1]. Control of the forging machine guarantees the quality of the forgings, which is vital in high-reliability areas such as aviation, space exploration and the nuclear industry. To meet the need for precise forgings, advanced algorithms such as sliding mode control [2], [3], back-stepping control [4] and feedback linearization [5] have been used to control forging machines instead of conventional PID-based control [6] and fuzzy-based control [7], [8]. However, the approaches in [2]-[5] are model-based control algorithms, which depend strongly on the accuracy of the model. Unfortunately, it is hard to build an accurate model in complex engineering practice. For example, the viscosity of hydraulic oil is easily influenced by temperature, which leads to model bias. In addition, the forging machine usually faces different forging batches, which further increases the difficulty of producing an accurate model.
The associate editor coordinating the review of this manuscript and approving it for publication was Tao Zhou.
Compared with an established model, collected data better reflect the real states arising from the interaction between the system and its surroundings. Because advanced measurement techniques [9] have made it easy to obtain large-scale data online, data-driven approaches [10], [11] have been introduced to the forging machine field in recent years. Reference [12] developed two online-updated backpropagation (BP) neural network algorithms to accurately control a die forging hydraulic press machine. The weights of the neural networks were initially trained offline and then updated online according to an error backpropagation algorithm. A novel least squares support vector machine (LS-SVM) control method was proposed in [13] for general unknown nonlinear systems, and it was further proved that the control error equals the LS-SVM modeling error. In [14], a novel online probabilistic extreme learning machine (ELM) method was proposed to model batch forging processes. Using the characteristics of the online ELM, a strategy was developed to update the distribution model as new forging process data were collected. In [15], a combination of a neural network and genetic algorithms was employed to optimize the forging force.
These data-driven approaches always work by offline learning and online working. Offline learning forms an implication relation from the historical data. Once this relation has been obtained by supervised or unsupervised learning, it is used as a black-box model for online working. Both supervised and unsupervised learning require a large volume of training data; however, such data are difficult to obtain because the forging machine often deals with different forging batches. Firstly, for a new forging process, no training data exist owing to the lack of a historical process, while for some special forging processes, the training data are not available because of differences in the experimental conditions or tests. Secondly, it is inevitable that the working situation is not consistent with the training condition, which leads to performance degradation, or even faults, of the forging machine under the previously well-trained controller. As a result, it is a challenge to develop a control strategy for the forging machine that depends neither on an accurate model nor on the offline-learning and online-working mode of traditional data-driven approaches.
Online learning and online working is a feasible solution for forging machine control because the forging machine always works slowly, owing to its large mechanical inertia and slow hydraulic action. Compared with this slowness, the computer is fast enough to make online learning and online working possible. Methods that rely on an accurate model or on a large amount of historical data have to be abandoned because of the aforementioned limitations of the forging machine.
To the best of our knowledge, reinforcement learning (RL) [16] is able to support both offline learning (Q algorithm) and online learning (e.g., the Sarsa algorithm): it approaches the stage reward by adjusting the action, using the difference between adjacent sampling instants as an error correction. RL does not need an accurate model; it only needs the effect of the action, which reduces the precision required of a traditional model. With the development of deep learning techniques, RL has been extended to deep reinforcement learning (DRL). Reference [17] developed a novel artificial agent, termed a deep Q-network, that can learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning. Reference [18] presented a brief survey of the advances in the area of deep learning. In engineering applications, RL/DRL has shown excellent performance after good training in UAVs [19], air-conditioning refrigeration [20], smart power control [21], fault-tolerant control [22], [23] and other fields [24], [25].
The difficulty in applying RL to a practical system is its slow training speed, whether offline or online. RL aims to build a complete mapping between the state space and the action space by trial-and-error training in order to deal with an unknown environment. Training methods are divided into value-based methods and policy-based methods. Compared with the value-based method, the policy-based method is dominant owing to its simplicity and intuitiveness. Most algorithms for policy optimization can be classified into three broad categories [26]: (1) policy iteration methods, which alternate between estimating the value function under the current policy and improving the policy [27]; (2) policy gradient methods, which use an estimator of the gradient of the expected return (total reward) obtained from sample trajectories [28]; and (3) derivative-free optimization methods, such as the cross-entropy method (CEM) and covariance matrix adaptation (CMA), which treat the return as a black-box function to be optimized in terms of the policy parameters [29]. Generally, all these methods take a long time to train, which is often unacceptable for a real-time system. Moreover, the trial actions in the training process may put the system at risk because the effect of an action on the system is not known in advance.
In fact, it is not necessary to spend much time building a complete mapping from the state space to the action space, because under control the succeeding states follow from the states that have already occurred, and these form a subset of the complete mapping. Searching for a control in this subset speeds up the training by removing a large number of redundant states. The taboo search (TS) proposed by Glover and Laguna [30] is an effective stochastic optimization method [31] that does not need the historical data used to train data-driven methods. The TS searches efficiently: a flexible storage structure and the corresponding taboo criteria prevent circuitous search. It also escapes local extrema by extending the local optimization to a global optimization. As a result, the TS algorithm is selected as a substitute for trial and error.
The above discussions show an evolution of control on the forging machine from the model-based control to the data-driven strategy in which most studies focused on the way of offline-learning and online-working. Motivated by overcoming the difficulties of the inconsistence between the training and the working for forging machine, a novel approach is proposed to implement an online control of the forging machine in this study. By integrating reinforcement learning with taboo search, the RL is taken as the evaluation of the actions, and the taboo search is used to improve the learning efficiency. On the other hand the computer simulation technology provides the way of forecasting the system state without a real action on the system which avoids the danger of system out of order from the training actions. The advantages of proposed approach are summarized as follows: (1) This is an online approach with the combination of the data and model which breaks through the conventional mode of offline learning and online working. The optimal control VOLUME 8, 2020 will be achieved in the common process of the learning and working.
(2) All control vectors are limited to the range required by the current states, which guarantees system stability during the learning process.
(3) The learning process is sped up to meet the real-time requirement by introducing the taboo search, which discards the redundant states that are independent of the current states.
The remainder of this article is organized as follows. In Section 2, the forging machine model is presented and the relation between the states and the controls is derived under the stability condition. Section 3 describes the proposed approach, including the reinforcement learning, the taboo search, the structure and the algorithm. Case studies are presented in Section 4, followed by conclusions in Section 5.

A. THE MODEL OF FORGING MACHINE
The ultra-low forging machine, with its heavy force and slow speed, is equipment for semi-solid metallic confectioning by constant-speed isothermal forging, an important forging technique, particularly for light-weight alloy confectioning in the aerospace industry. The typical structure of the ultra-low forging machine is depicted in figure 1. The forging machine is divided into a power sub-system, a sliding block sub-system, a control sub-system and an auxiliary sub-system. The power sub-system consists of an oil source, which produces high-pressure working oil through a constant-rate pump driven by a motor, and the pipe that delivers the high-pressure working oil to the operating mechanisms. The sliding block sub-system is made up of a hydraulic cylinder, which produces a high pressing force at a sliding velocity, and a huge slide block that acts directly on the forgings. The control sub-system includes all kinds of valves, sensors and control algorithms; the switch valves perform the logic functions of the process, and the proportional servo valves control the speed of the slide block by adjusting the valve openings. The auxiliary sub-system implements additional functions other than the pressing process, such as push-out and moving.
The pressing-down phase is the key process in semi-solid metallic confectioning constant-speed isothermal forging, which usually includes six phases: fast-down phase, slow-down phase, pressing-down phase, keep-pressure phase, fast-up phase and slow-up phase. The pressing-down phase involves a long pipeline filled with working oil, a proportional servo valve, and a hydraulic cylinder.
For a long pipeline with working oil, the dynamic process can be described by equations (1) and (2) [5], where $q_1$ is the oil flow of the pipe, $p_1$ is the inlet pressure of the proportional servo valve, $q_2$ is the flow of the proportional servo valve, and the other parameters are defined in Table 1. For a proportional servo valve, the dynamics can be described by equation (3) [5], where the symbols are defined in Table 1. For a hydraulic cylinder, the dynamic processes can be described by equations (4) and (5) [5], where $q_2$ is the flow velocity, $v$ is the speed of the slide block, and the other parameters are explained in Table 1.
In terms of equations (1)-(5), the compact state-space model (6) can be given, in which the state $q_1$ is the oil flow of the pipe and $p_1$ is the inlet pressure of the proportional servo valve. The meanings of the model parameters are given in Table 1.

B. THE CONDITION OF STABILITY
The relation between the states and the control variables is obtained from the Lyapunov stability condition. Let a Lyapunov function $V$ be chosen, where $P$ is a semi-definite matrix with $p_{22} > 0$, $p_{33} > 0$, $p_{44} > 0$, $p_{55} > 0$ and $p_{66} > 0$.
According to the physical meaning of the states $x_i = 0$ ($i = 2, 3, 4, 5, 6$), $\dot{V}$ can be written in terms of two parts, I and II. If I ≤ 0 and II < 0, then $\dot{V} < 0$, which means the system is Lyapunov stable. For I and II, using (6) and (11), one can obtain formula (13), which shows a partial expansion of the Lyapunov stability condition based on the forging machine model. Solving formula (13) yields (14). As a result, $P$ can be selected as in (15). Substituting (15) into (12) gives (16), and solving (16) gives (17). As a result, the Lyapunov stability condition is satisfied, which means the system is stable subject to formula (17).
Remark 1: Formula (17) shows the relationship between the control variable, the states and the load under system stability, and it is regarded as a constraint condition in the taboo search. The states $x_2$, $x_5$ and $x_4$ in formula (17) are measurable by the flow sensors and pressure sensors, and the parameters are obtained from the design of the forging machine.

III. THE PROPOSED APPROACH
A. REINFORCEMENT LEARNING
The basic idea of reinforcement learning is simply to capture the most important aspects of the agent, which include sensation, action, and goal. The basic framework of reinforcement learning is shown in Fig. 2 [16].
An agent receives an evaluation of how good or bad its behavior is in the environment and learns through experience, without a teacher to show it what to do. In each training session, named an episode, the agent explores/exploits the environment by changing the action $u(k)$ and receives the state $x(k+1)$ and the immediate cost $R_{k+1}(x(k+1), x(k), u(k))$ based on $x(k)$. The purpose of the training is to enhance the 'brain' of the agent. The goal of the agent is to minimize/maximize the cost received in the long run, from step $k$ to step $k+T$. This process is considered a Markov decision process MDP(X, U, P, R) with a control $u$ and cost $R$, in which X is a set of states, U is a set of controls, and P is the set of transition probabilities. In order to evaluate how good or bad a behavior (often named an action or control) is, the value of a control $V^{u}_{k}(x(k))$ is defined as in (18), where $R_i(x(i+1), x(i), u(i))$ is abbreviated to $R_i$ because the relation among $x(k+1)$, $x(k)$ and $u(k)$ is not stressed.
The optimal controls are achieved by alternating policy evaluation and policy improvement using formulas (19) and (20), where γ is a discount factor with 0 ≤ γ < 1 to ensure convergence. For a deterministic system, relation (21) evidently holds. Therefore, formulas (19) and (20) can be simplified to (22) and (23).
Remark 2: The optimal control $u(k)$ can be obtained using only the state information and the immediate cost, because only $x(k)$, $x(k+1)$ and $R_k$ appear in formulas (22) and (23). A general approach is to iterate until convergence. This is time-consuming because of the large number of iterations, which is the main disadvantage of RL. In fact, once $x(k)$ is determined, $u(k)$ lies within a feasible space owing to the system limitations. One can directly seek an appropriate control $u(k)$ that optimizes the cost function, which can be done by random optimization search. Here the taboo search is chosen for its high search efficiency, as it avoids duplicate search in an unknown space.
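The idea in Remark 2 can be illustrated with a minimal Python sketch. It assumes the standard deterministic form in which a control is scored by the immediate cost plus the discounted value of the successor state; the exact formulas (22) and (23) are not reproduced here, and the names `greedy_control`, `toy_step` and `toy_cost` are hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def greedy_control(
    x: int,
    controls: Iterable[int],
    step: Callable[[int, int], int],
    cost: Callable[[int, int, int], float],
    value: Dict[int, float],
    gamma: float = 0.9,
) -> Tuple[Optional[int], float]:
    """Return the control minimising R(x_next, x, u) + gamma * V(x_next).

    Assumed standard deterministic backup, not the paper's exact formulas.
    """
    best_u, best_q = None, float("inf")
    for u in controls:
        x_next = step(x, u)  # deterministic successor state
        q = cost(x_next, x, u) + gamma * value.get(x_next, 0.0)
        if q < best_q:
            best_u, best_q = u, q
    return best_u, best_q


# Hypothetical toy system: integer state driven towards the target state 0.
def toy_step(x: int, u: int) -> int:
    return x + u


def toy_cost(x_next: int, x: int, u: int) -> float:
    return abs(x_next)


u_star, q_star = greedy_control(3, [-1, 0, 1], toy_step, toy_cost, value={})
```

Only the feasible controls for the current state are enumerated, which is the point of the remark: no mapping over the whole state space is built before a control is chosen.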

B. TABOO SEARCH
There are more complex versions of the taboo search that improve its searching capability. Here the basic taboo search algorithm is applied to demonstrate its use in finding the optimal solution. For an element $x$ in the discrete space $X$, the goal is given by (24), and the optimal states are found by continuously moving to a neighbour according to (25), where $w$ is the step length and $d$ is the direction. A taboo list, whose goal values are updated according to the first-in first-out (FIFO) rule, is designed to prevent loop search. However, the aspiration $A(s, x)$, which records the best solution in the history, is not limited by the taboo list. The basic taboo search is summarized as Procedure 1.

Procedure 1
Step 1: Generate an initial x, x ∈ X, then let the optimal x* = x and set the taboo list to empty, T = ∅.
Step 2: Choose a neighbor solution s(x) according to formula (25).
Step 4: If C(s(x)) < A(s, x), s(x) ∈ T and C(s(x)) < C(x), let x = s(x) and A(s, x) = C(s(x)).
Step 5:
Step 6: Update the taboo list by storing x in the last place of the taboo list T.
Step 7: Repeat Step 2 to Step 6 until one of the termination conditions is met, that is, (a) the predetermined number of moves is reached; or (b) there is no improvement in the goal as the number of moves increases.
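A basic taboo search of this kind can be sketched in Python as follows. This is an illustrative one-dimensional version under stated assumptions, not the authors' implementation: the neighbour rule is taken to be x ± step, the aspiration level is the best cost found so far, and the function and parameter names are hypothetical.

```python
from collections import deque
from typing import Callable, Tuple


def tabu_search(
    cost: Callable[[int], float],
    x0: int,
    lower: int,
    upper: int,
    step: int = 1,
    tabu_len: int = 5,
    max_moves: int = 200,
    patience: int = 30,
) -> Tuple[int, float]:
    """Basic taboo search on an integer interval [lower, upper]."""
    tabu = deque(maxlen=tabu_len)  # FIFO taboo list: oldest entry drops out first
    x = x0
    best_x, best_c = x0, cost(x0)  # aspiration level = best cost in the history
    stall = 0
    for _ in range(max_moves):  # termination (a): fixed number of moves
        candidates = []
        for d in (-1, 1):  # neighbour rule: s = x + step * d
            s = x + step * d
            if not lower <= s <= upper:
                continue
            c = cost(s)
            if s in tabu and c >= best_c:
                continue  # taboo move that does not beat the aspiration level
            candidates.append((c, s))
        if not candidates:
            break
        c, s = min(candidates)  # best admissible neighbour, even if worse than x
        tabu.append(x)  # store the point being left at the end of the list
        x = s
        if c < best_c:
            best_x, best_c = s, c
            stall = 0
        else:
            stall += 1
            if stall >= patience:  # termination (b): no recent improvement
                break
    return best_x, best_c


# Hypothetical cost landscape with a local minimum at index 1 and the global one at 6.
landscape = [5, 3, 4, 6, 2, 1, 0, 1, 3]
best = tabu_search(lambda i: landscape[i], x0=0, lower=0, upper=8)
```

Because the best admissible neighbour is accepted even when it is worse than the current point, and the points just left are taboo, the search climbs out of the local minimum at index 1 instead of oscillating around it.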

C. THE PROPOSED APPROACH
The structure of the proposed approach is shown in figure 3. Starting from the states $x(k)$ and $x(k+1)$ at sample times $k$ and $k+1$, the optimal control $u^*$ is found by adjusting $u$ to minimize $C(x)$ according to the RL. Instead of policy iteration or the gradient method, the taboo search is used to find the optimal action in the action space, which is a table in the discrete system.

1) ACTION SPACE, VALUE FUNCTION AND REWARD
The values of the control variable are limited by the digital-to-analog (DA) conversion accuracy. For an n-bit DA converter, the action space lies within the range $[2^{-n}, 2^{n}]$.
The velocity of the forging machine is determined by the properties of the forging material, which require either a constant pressing speed within a certain temperature range or a given speed curve. Therefore, the immediate cost is selected as the absolute value of the error between the actual speed and the reference speed. Based on the coarse model (6) and formula (18), the cost functions $V_k(x, u)$ and $V_{k+1}(x+1, u)$ are readily obtained.
Noting that the coarse model expresses the tendency better than the exact state, the time series error with TD(0) is selected as the cost that serves as the goal of the taboo search.
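As a minimal sketch, the two quantities described above can be written in Python as follows. The immediate cost follows the text directly; the TD(0) error is written in its standard textbook form, which is an assumption here because the paper's own expression is not reproduced, and the function names are hypothetical.

```python
def immediate_cost(v_actual: float, v_ref: float) -> float:
    """Absolute error between the actual speed and the reference speed."""
    return abs(v_actual - v_ref)


def td0_error(r: float, v_now: float, v_next: float, gamma: float = 0.9) -> float:
    """Standard TD(0) error: r + gamma * V_{k+1} - V_k (assumed form)."""
    return r + gamma * v_next - v_now


# Example with the 0.03 mm/s reference speed used in the case study.
r_k = immediate_cost(0.0302, 0.03)
delta_k = td0_error(r=1.0, v_now=2.0, v_next=3.0, gamma=0.5)
```

The TD(0) error depends only on the difference between two consecutive value estimates, so a coarse model that predicts the trend correctly can still rank candidate controls even when its absolute predictions are biased.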

2) NEIGHBORHOOD FUNCTION, TABOO OBJECT, TABOO LIST AND ASPIRATION CRITERION
Formula (25) provides a neighbour search, but it suffers from the curse of dimensionality as the dimension increases. A mode of coding and crossing with changing positions is usually used to avoid the curse of dimensionality in the taboo search. Let $s_i = u_i$, where $u_i \in [2^{-n}, 2^{n}]$; this mode of the neighbour rule is given by formula (30). The taboo object is selected as the current control variable $u_i$, which is put into the taboo list. If the length $l$ of the taboo list is too long, the search is prone to being trapped in a local optimum. If the length $l$ of the taboo list is too short, it is prone to trap in [...] $A(s, x)$ is selected as the best state in the history in order to unlock the process when all the candidates are locked.

3) LONG TERM LIST AND STRICT LIST
The basic TS has an excellent local search ability but a poorer global search ability. A long term list that stores the initial values of each stage is proposed to improve the global search ability of the TS by generating initial values as far as possible from those of the past stages, as given by formula (31), where B is a set of selected initial solutions and K is a set of randomly generated initial values, K ∈ R.
In order to reduce the search range and speed up the search, a strict list on the control $u$, formula (32), is built based on the system stability result in Section 2.2. If this condition cannot be met during the neighbor search, the $u$ is abandoned immediately without further work.
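The two lists can be illustrated with a short Python sketch. It is only one possible reading of the text: "as far as the past stages" is interpreted as choosing the random candidate whose nearest past starting value is farthest away, and the stability condition (17) is replaced by a hypothetical predicate `is_admissible`, since neither formula (31) nor formula (32) is reproduced here.

```python
from typing import Callable, List, Sequence


def farthest_restart(past_starts: Sequence[float], candidates: Sequence[float]) -> float:
    """Long term list: pick the candidate whose nearest past start is farthest away."""
    return max(candidates, key=lambda k: min(abs(k - b) for b in past_starts))


def strict_filter(
    candidates: Sequence[float], is_admissible: Callable[[float], bool]
) -> List[float]:
    """Strict list: drop controls that fail the (hypothetical) stability check."""
    return [u for u in candidates if is_admissible(u)]


# B = past starting values, K = randomly generated candidates (fixed here for clarity).
restart = farthest_restart(past_starts=[10, 12], candidates=[11, 20, 50, 13])
kept = strict_filter([1, 5, 9, 12], lambda u: u <= 9)
```

Filtering before evaluation means that rejected controls never reach the cost computation, which is where the reduction in search effort comes from.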

4) THE PROCESS OF METHOD
The proposed algorithm is summarized as Procedure 2.
Procedure 2
Step 1: Give a state x(k).
Step 2: Select an action u(k) randomly.
Step 6: Compute the time series error C(x).
Step 7: Search the neighborhood of u(k) and find a new action u(j) according to formula (30).
Step 8: If u(j) satisfies the strict list of formula (32), then go to Step 7; else repeat Steps 5 to 6.
Step 9: Carry out the taboo search according to Procedure 1.
Step 10: If the stage of the long term list is reached, then reset u(k) according to formula (31); else go to Step 7.
Step 11: Repeat Steps 7 to 10 until the termination condition is satisfied and the optimal u*(k) is finally obtained.
Step 12: Set the next state x(k + 1) as the current state x(k) and the optimal u*(k) as u(k).
Step 13: Repeat Steps 3 to 12 until the process ends.
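The overall flow of an online loop of this kind can be sketched in Python as below. This is a heavily simplified illustration under stated assumptions, not the authors' algorithm: the first-order `coarse_model`, the `admissible` range check standing in for the stability constraint, and all numerical values are hypothetical, the long term list is omitted, and the plant is replaced by the coarse model so that the example is self-contained.

```python
import random
from collections import deque
from typing import List, Tuple


def coarse_model(x: float, u: int) -> float:
    """Hypothetical coarse model: predicted next speed for speed x and control u."""
    return 0.5 * x + 0.01 * u


def admissible(x: float, u: int) -> bool:
    """Hypothetical stand-in for the stability constraint (strict list)."""
    return 0 <= u <= 100


def search_control(x: float, v_ref: float, u0: int, moves: int = 300, tabu_len: int = 10) -> int:
    """Taboo-style neighbourhood search on the control, scored on the coarse model."""

    def score(u: int) -> float:
        return abs(coarse_model(x, u) - v_ref)

    tabu = deque(maxlen=tabu_len)
    u, best_u, best_c = u0, u0, score(u0)
    for _ in range(moves):
        neighbours = [s for s in (u - 1, u + 1) if admissible(x, s) and s not in tabu]
        if not neighbours:
            break
        tabu.append(u)
        u = min(neighbours, key=score)  # move even if the score gets worse
        if score(u) < best_c:
            best_u, best_c = u, score(u)
    return best_u


def run_online(x0: float, v_ref: float, steps: int, seed: int = 0) -> List[Tuple[int, float]]:
    """Repeat search-then-apply at every sampling instant."""
    rng = random.Random(seed)
    x, u = x0, rng.randint(0, 100)  # random initial action
    history = []
    for _ in range(steps):
        u = search_control(x, v_ref, u)  # start the search from the previous control
        x = coarse_model(x, u)  # stand-in for applying u to the real machine
        history.append((u, x))
    return history


history = run_online(x0=0.0, v_ref=0.5, steps=5)
```

In this toy setting the control settles at the value that holds the speed at the reference after the first two sampling instants; on a real machine the state update would come from measurements rather than from the model.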

IV. CASE STUDIES
An ultra-low forging machine is used as the test bed. It is controlled by a combination of an S7-300 PLC, which performs the electric logic control of the process, and a Trio MC224 special controller, which implements the pressing-down phase using the proposed approach. This special controller was added alongside the S7-300 PLC because the PLC cannot run this complex algorithm owing to its limited computation capability. The MC224 and the PLC share the collected data over a Modbus connection and communicate with the supervisory computer through Profibus. The structure of the test bed is indicated in figure 4. The pressure transmitter is a YN-type fog-proof pressure gauge with an accuracy of class 0.1. The flow transmitter is of the LWGYC type with an accuracy of class 0.5. The displacement sensor is an MTS product with a minimum resolution of 0.002 mm. The proportional servo valve is a Rexroth product with a response time of less than 10 ms. The ultra-low forging machine works at a slow or ultra-low speed and takes hours to complete a forging. During the long pressing process, the forging is kept at a suitable temperature by the mold heating technology, as depicted in figure 5. According to the assembly drawing of the ultra-low forging machine, the main oil pipe keeps an almost constant diameter of 0.042 m, and there are protective measures at the turns to reduce the pressure loss of the pipeline; therefore, the actual main pipe is treated as an ideal long pipeline. The pipe between the proportional servo valve and the hydraulic cylinder is omitted because the proportional servo valve is close to the hydraulic cylinder, which leads to little pressure loss. The mass of the slide block, the plunger size of the hydraulic cylinder and the geometric parameters of the oil pipe, such as the diameter and the length, are obtained from the drawing annotations.
The material properties, such as the Young's modulus of the equivalent oil volume and the density of the oil, come from the design handbook. The parameters of the proportional servo valve are obtained from the charts in the product manual. The other physical parameters correspond to the designed working point. For example, Ps is held at the designed 32 MPa by adjusting the set value of the relief valve. The friction coefficient is determined according to machine design criteria. The parameters of the coarse model are listed in table 2. Note that formula (17) carries an implicit condition that the sampling time is small enough, which means that the states should change only slightly between two adjacent samples. In practice, the adjustment interval on the ultra-low forging machine should not exceed 5 minutes to ensure the forging quality. As a result, the ultra-low forging process always changes slowly. This indicates that the practical machine works consistently with the assumptions of formula (17), although there is no theoretical proof. An interval of 2 minutes is chosen as the sampling time because this is the minimum time needed to obtain a valid control on our computer, although the transmitters and actuators are capable of faster operation.

A. SCENARIO OF A CONSTANT SPEED
A pressing-down process with the slide block working at an ultra-low speed of 0.03 mm/s is used to test the proposed approach. In this scenario, only a small oil flow through the servo valve is pumped into the upper chamber of the hydraulic cylinder to achieve the ultra-low speed of the slide block. The small opening of the servo valve causes a pressure loss, which results in insufficient pressure acting on the forgings. As a result, the control of the servo valve is a compromise between the pressure loss and the working pressure. The proposed approach follows Procedure 2. However, although the TS is an efficient search algorithm, it is in essence a random search. In order to verify that the results obtained are reliable, the pressing-down experiment is repeated 7 times. Figures 6 and 7 show the speed and the control output of each experiment as curves of different colors. In figure 6 the speed stays around 0.03 mm/s with a small fluctuation; the extreme values are 0.0302 mm/s (first experiment) and 0.0298 mm/s (fourth experiment), a relative error of about 0.7%. It is seen from figure 7 that the curves do not overlap each other, showing some differences in each control. However, they all converge to around 20.5 with fluctuations, and the differences between them do not prevent the set speed from being met.
As mentioned above, the control of the servo valve is a compromise between the pressure loss and the working pressure. However, it is difficult to find this compromise for different forging processes because of the influences of the resistance and the machine characteristics. A practical approach is to look for appropriate parameter values by trial and error during equipment debugging. These parameters are recorded in a table and called up when required. For example, the resistance of titanium alloy changes with the pressing speed, following a curve known from the related field. Therefore, some typical speeds on the curve are controlled as key indicators in the debugging process, and the others are determined by interpolation and improved by fine-tuning based on the working conditions. With the conventional PID, this debugging process takes a long time (often several months or even years) because many scenarios have to be tested one by one. Fuzzy-based approaches were applied to improve these parameter values, but failed to meet the accuracy requirement. As more data become available, it is feasible to introduce the NN as a tool owing to its excellent nonlinear fitting ability. Therefore, the conventional PID and a neural network (NN) are used for comparison in this study.
Here an ultra-low speed of 0.03 mm/s is taken as an example. The parameters of the PID are adjusted by trial in order to achieve the best possible performance. A three-layer feed-forward backpropagation network with an input layer, a hidden layer and an output layer is chosen as the NN controller, whose input layer includes the states (q 1 , p 1 , q 2 , q 2 , p 2 , v) and whose output layer is the control variable (O p ). The hidden layer consists of 20 nodes fully connected to the input layer and the output layer; this number was chosen by trial because there is no mature theory to follow. The NN is trained by the classical Levenberg-Marquardt method with random weight initialization. The training database is built from 4000 data points selected from the fine control by PID in order to ensure an excellent training database, of which 3500 are used for training and 500 for testing. After many trials with different weight initializations, a well-trained NN suitable for use as a controller is obtained.
The mean $\bar{x}$ and the variance $\sigma$, computed according to formulas (33) and (34), are used to evaluate the performance. The relative error $\delta$ between the mean $\bar{v}$ and the reference $v_r$ is computed according to formula (35). The results are shown in table 3. It is seen from figure 8 and table 3 that all three methods, namely the traditional PID, the NN and the proposed approach, are able to meet the speed accuracy requirement (relative errors < 3%). In fact, even after the debugging stage, more parameter values for different practical cases continue to be collected in order to deal with the difference between offline training and online implementation. Throughout this process, it is difficult to adjust the parameters of the PID, and the NN depends heavily on an excellent training database and weight initialization. In contrast, the proposed approach realizes automatic control according to the current states. As a result, the proposed algorithm is in a superior position.
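The evaluation metrics can be reproduced with a few lines of Python. This is a sketch under an assumption: the variance is taken as the population variance, because the paper's exact expressions for the mean and variance are not reproduced here, and the function names are hypothetical.

```python
from statistics import fmean, pvariance
from typing import Sequence, Tuple


def relative_error(value: float, v_ref: float) -> float:
    """Relative error between a speed value and the reference speed."""
    return abs(value - v_ref) / v_ref


def speed_metrics(samples: Sequence[float], v_ref: float) -> Tuple[float, float, float]:
    """Return (mean, population variance, relative error of the mean)."""
    mean_v = fmean(samples)
    var_v = pvariance(samples, mu=mean_v)  # assumption: population variance
    return mean_v, var_v, relative_error(mean_v, v_ref)


# Check against the constant-speed scenario: spikes of 0.0302 and 0.0298 mm/s at 0.03 mm/s.
spike_error_percent = 100 * relative_error(0.0302, 0.03)
mean_v, var_v, delta = speed_metrics([0.0302, 0.0298, 0.0300], 0.03)
```

The relative error of the 0.0302 mm/s spike evaluates to roughly 0.67%, consistent with the 0.7% quoted for figure 6.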

B. SCENARIO OF VARIANT SPEEDS
A variant speed with the range from 0.08mm/s to 0.06mm/s via 0.04mm/s is to test the proposed approach with sampling times of 2 minutes. The reference v r follows the following formula according to the craft requirement.
This kind of pressing-down process is seldom in ultra-low forging and there is no effective approach to implement until now. In practice an experimental engineer is required to monitor this process and adjust the PID parameters online to meet the craft curve based on the experiments. First the result of proposed approach is presented. The pressing-down process is repeated 5 times in order to verify the reliability of proposed approach due to the random essence of the TS. The results of the speed and output under control are shown in figure 9 and figure 10. The cyan color, pink color, green color, red color and blue color represent the results from tests 1 to 5 respectively.
It is seen from figure 9 that the curves with different colors have the same tendency which achieves the reference speed under the different constant level and the changing speed period. During the interval from 1min to 30 mins, the maximum speed spikes are 0.0812 mm/s (at the 5 th test) and 0.0788 mm/s (at the 1 st , 3 th , 4 th and 5 th tests) with relative errors of 1.5%. The maximum peak speeds are 0.0406 mm/s (at the 3 rd test) and 0.0394 mm/s (at the 1 st , 2 nd , 3 th and 4 th tests) during the interval between 50 mins and 80 mins, while the speeds vary between 0.0609 mm/s (max) and 0.0394 mm/s (min) during the interval from 80 mins to 100 mins. All the relative errors are less than 1.5%. Figure 10 shows the output under control with different colors  at each test. The blue curve is taken for a further analysis based on the points representing the samples. The variance at the different intervals of 1-30 mins, 50-80 mins, and 90-100 are respectively 231.87, 20.5296, and 38.7686. The variance reduces as the reference speeds are down. The similar case happens on the other curves. One can find the reason from the working principle of the pressing-down process. The pressing-down speed is determined by the load resistance and the upper chamber pressure of the hydraulic cylinder. The upper chamber pressure is the rest of the pressure of the power sub-system taking away the pressure loss of servo valve (the pressure loss of the pipe is omitted because it is far less than that of the servo valve). On the other hand, the slide block is pressing down as a result of the space expansion of the upper chamber with the accumulation of hydraulic oil which can be controlled through the opening of the servo valve. Bigger is the opening of servo valve, less is the pressure loss of the servo valve, and more hydraulic oil will pump into the upper chamber of the hydraulic cylinder. This will VOLUME 8, 2020 widen the tuneable range and lead to a relatively easy control. 
The means and variances of the speed and the control output are shown in table 4. Next, a conventional PID controller is tested on the speed control of the pressing-down process. The neural network is not considered here because 1) a good training database is lacking; 2) it is an offline control strategy. The results are shown in figure 8. The red, blue, and green curves are the reference speed, the PID control, and the online control approach respectively. It is seen from figure 11 that PID achieves good control accuracy during the periods from 15 mins to 60 mins, from 100 mins to 160 mins, and from 180 mins to 200 mins, when the speed is stable. The mean, the relative error and the variance at stable speeds are shown in table 5. (The data of the proposed approach are based on time3.) Table 5 shows that both PID and the proposed approach provide fine control, with a relative error <3%. However, figure 11 shows that PID performs worse during the transient process because it is difficult to obtain appropriate PID parameters. In contrast, the proposed online control tracks the reference well throughout the whole process.

C. INFLUENCES OF SAMPLING PERIOD
In this subsection, different sampling periods are tested to show their effect on the controlled speed. The sampling periods are chosen from 2 minutes (the minimum interval for obtaining the right control) to 5 minutes (the maximum interval allowed by the forging quality). The reference speed is set to 0.04 mm/s. The RL selects a random action at the beginning and then goes into autonomous control according to procedure 2. Figure 12 shows the pressing-down speed for the different sampling periods. Figure 13 shows the controller outputs for the different sampling periods. In figure 12, the pink, green, red and blue curves represent the speed of the slide block at sampling periods of 2 minutes, 3 minutes, 4 minutes, and 5 minutes respectively. The stars, the crosses, the triangles, and the squares are the sample points. All four curves approach the reference speed (0.04 mm/s) after a transient process. The mean and the variance in the stable process are shown in table 5. The transient processes differ. The transient of speed1, the pink curve, lasts about 18 minutes and is shorter than the others (about 25 minutes for the green curve, about 50 minutes for the red curve, and about 70 minutes for the blue curve). The reason is that the proposed approach provides a control output at each sampling period, so a shorter period lets it adjust the control output sooner, which weakens the accumulated effect that the previous control has on the forging machine over a longer period.
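The transient durations compared above could be read off sampled data with a simple settling-time rule. The following is a minimal sketch, not the procedure used in the experiments: the function name settling_time, the 3% tolerance band, and the sample values are all hypothetical and serve only to illustrate the calculation.

```python
def settling_time(times, speeds, reference, tolerance=0.03):
    """Return the first sample time after which every later sample stays
    within tolerance * |reference| of the reference, or None otherwise."""
    band = tolerance * abs(reference)
    settled_at = None
    for t, v in zip(times, speeds):
        if abs(v - reference) <= band:
            # Remember only the start of the current in-band run.
            if settled_at is None:
                settled_at = t
        else:
            # A sample left the band, so the transient is not over yet.
            settled_at = None
    return settled_at


# HYPOTHETICAL samples at a 2-minute sampling period (minutes, mm/s).
times = [0, 2, 4, 6, 8, 10, 12]
speeds = [0.020, 0.055, 0.046, 0.0385, 0.0405, 0.0398, 0.0401]

print(settling_time(times, speeds, reference=0.04))
```

Applying the same rule to the curves recorded at each sampling period would give one transient duration per curve, which makes the comparison between 2, 3, 4 and 5 minutes explicit.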

V. CONCLUSION
A data-driven online control strategy has been proposed for the forging machine to deal with the difficulty of adjusting parameters under large batch changes. This online-learning and online-working algorithm is realized by reinforcement learning, which obtains the control from only two consecutive samples, and the learning process is based on computer simulation instead of trial and error. The mapping space between states and controls has been reduced to a local space by relating the states to the controls through Lyapunov stability theory based on the coarse model, which ensures that the system is stable and prevents it from going out of control. The taboo search has been used to remove the need for historical data, since it finds the control directly. Compared with the finely tuned PID and the well-trained NN controller, the proposed approach realizes automatic control according to the current states, without the burden of repeatedly adjusting parameters to follow the working condition in order to keep good performance. The proposed algorithm is thus reliable and convenient to implement. Its disadvantage is that the taboo search still needs some time to find an optimum, so the proposed approach can only be applied to slow physical processes. The next step is to speed up the search to meet the needs of general real-time control systems.