Maximum Power Point Tracking of Photovoltaic System Based on Reinforcement Learning

The maximum power point tracking (MPPT) technique is often used in photovoltaic (PV) systems to extract the maximum power in various environmental conditions. The perturbation and observation (P&O) method is one of the most well-known MPPT methods; however, it may face problems of large oscillations around maximum power point (MPP) or low-tracking efficiency. In this paper, two reinforcement learning-based maximum power point tracking (RL MPPT) methods are proposed by the use of the Q-learning algorithm. One constructs the Q-table and the other adopts the Q-network. These two proposed methods do not require the information of an actual PV module in advance and can track the MPP through offline training in two phases, the learning phase and the tracking phase. From the experimental results, both the reinforcement learning-based Q-table maximum power point tracking (RL-QT MPPT) and the reinforcement learning-based Q-network maximum power point tracking (RL-QN MPPT) methods have smaller ripples and faster tracking speeds when compared with the P&O method. In addition, for these two proposed methods, the RL-QT MPPT method performs with smaller oscillation and the RL-QN MPPT method achieves higher average power.


Introduction
Sustainable energy such as solar energy is often seen as one of the solutions to reduce pollution caused by thermal power generation. A photovoltaic (PV) module is able to convert solar energy into electrical energy without generating greenhouse gases and coal dust, and it is wildly used since the deployment is relatively easy compared to other sustainable energy sources such as tidal energy and biogas energy. However, low efficiency is the main drawback of a PV system. Therefore, several maximum power point tracking (MPPT) methods are proposed in order to extract maximum power from the PV module. The perturbation and observation (P&O) method is one of the most common MPPT methods and can be implemented model-free [1,2]. However, a large size perturbation setting will lead to large oscillations near the maximum power point (MPP), while a small step size perturbation setting will slow down the tracking speed. Therefore, several adaptive methods are proposed to improve the P&O method [3][4][5]. The adaptive P&O method basically modifies the step size based on the amount of the power difference between two perturbations. However, to achieve the best performance, the ratio between the step size and power difference needs to be tuned according to the actual model. Additionally, several fuzzy P&O methods are proposed to perform the MPPT [6,7]. Although fuzzy logic control is a model-free control method, expert knowledge is required when designing the fuzzy parameters.

Model Description of PV Module
When a PV module is exposed to the sunlight, the electrons inside it will absorb the energy and jump to a higher energy state. Some free mobile electrons will be released through a connected wire and thus form a current. This phenomenon is known as the photovoltaic effect [14].
A solar cell is a device to generate electrical power based on the photovoltaic effect. Figure 1 shows the single-diode model [15] of a solar cell, where I ph is the photo-generated current, I D S is the current through the diode D s , V D S is the voltage across D s . I sh is the current through shunt resistor R sh , and I is the current through series resistor R s . In addition, I and V are the output current and output voltage of the solar cell, respectively. According to Kirchhoff's current law, the output current can be expressed as The current I D S can be expressed by the Shockley equation below: ηKT − 1 = I 0 e q(V+R S I) ηKT − 1 (2) where I 0 represents the reverse saturation current, q is the elementary charge, K is Boltzmann's constant, T is the temperature and η is the diode ideality factor. Note that η = 1 ∼ 2 and η = 1 for an ideal diode [16]. Substituting (2) into (1) yields the I-V characteristic expression of a solar cell model shown as The photo-generated current I ph is affected by solar irradiance S and environment temperature T as below: where I scr is the short circuit current, K i is the temperature coefficient of the short circuit current, and T r is the reference temperature. A PV module usually consists of several solar cells in series or in parallel, i.e., composed of M rows and N columns of solar cells as depicted in Figure 2. The output voltage V sm of M solar cells in series and the output current I sm of N solar cells in parallel are respectively given as Substituting (5) and (6) into (3) yields: which is the I-V expression of an M × N PV module. As described in (7), the I-V curve of a PV module is strongly affected by environmental factors, including the variance of the solar irradiance and the module temperature. Figure 3a shows the I-V curves with different temperatures under the same irradiance condition. As the temperature rises, the curve moves left and vice versa. In addition, the plot of I-V curves under several distinct irradiance conditions with a constant temperature is presented in Figure 3b. It is obvious that the escalation of solar irradiance will cause a raise in the I-V curve, correspondingly. From Figure 3a,b, the maximum power point changes as the irradiance and module temperature vary, thus the variance of irradiance and module temperature will be considered in the proposed MPPT.

Maximum Power Point Tracking (MPPT) Control
As previously mentioned, the input impedance of the converter can be tuned by varying the duty ratio. If a PV module's output impedance is matched with the converter's input impedance, the maximum power can be extracted. As illustrated in Figure 4, the operating point of a PV module is the intersection of the I-V curve and the load line. The purpose of maximum power point tracking (MPPT) is to reach the operating point where the PV module has maximum power output. In this paper, the MPPT is implemented by adjusting the duty ratio of the converter. The P&O method will be discussed in Section 3.1, and the RL MPPT methods will be proposed in Section 3.2.  Figure 5 briefly illustrates the concept of the perturbation and observation method (P&O method), which is one of the well-known tracking methods. At time t, a fixed-sized perturbation is performed according to the measured power difference ∆P t−1 and voltage difference ∆V t−1 . Then the change of the power and the voltage will be observed. With a new power difference ∆P t and a new voltage difference ∆V t , the system can be perturbed accordingly. To drive the operating point toward the MPP, the system first measures the present power P(n) and voltage V(n) and then calculates the difference of power and the difference of voltage. If ∆P > 0, there are two cases, operating points A and B, as depicted in Figure 6a. To move toward the MPP, the duty ratio should be decreased for point A with ∆V > 0, and increased for point B with ∆V < 0. On the other hand, if ∆P < 0, there are two cases, operating points C and D, as depicted in Figure 6b. To move toward the MPP, the duty ratio should be decreased for point C with ∆V < 0, and increased for point D with ∆V > 0.

Design of the Reinforcement Learning MPPT System
Sequential decision-making is a common problem in real life, for example, an infant trying to walk by stretching a leg forward and move his body. By taking a series of actions, the infant will have a chance to reach his goal (keep moving forward). The Markov decision process (MDP) provides a systematic framework for describing this sequential decision-making problem. To solve an MDP, a reinforcement learning (RL) method is proposed by [8]. Learning by interacting with the environment is a human's intuitive learning skill. When facing an unknown system, the human will interact with it according to their own understanding of the system, and then receive feedback. This feedback signal provides a criterion for judging how "good" or "bad" an action is under a specific circumstance. The actions with better outcomes will have a larger chance to be chosen, i.e., they are reinforced, while the actions with worse outcomes will have smaller chance of being done in the future.
With the description of RL provided above, the concept of the variable step size tracking method based on RL is shown in Figure 7. The perturbation size can be chosen according to the observed system state and the previous experiences. Once the perturbation is done, the system change including the change of the state and the feedback signal will be observed and the interaction experience will be learned. To provide a detailed description of the RL MPPT, the background knowledge of RL will be described from Section 3.2.1 to Section 3.2.5, including the description of the system, the interaction, and the evaluation of the interaction experience. Finally, the design of the RL MPPT methods, including the RL-QT MPPT method and the RL-QN MPPT method will be proposed in Section 3.3.

Markov Decision Process and Reinforcement Learning Problem
A Markov decision process (MDP) [17,18] consists of S, A, T, and R, where S represents the set of environment's state description, and A is the set of available actions the agent can take. T is the transition function, which indicates the system's probability distribution of jumping from any state to all possible states after applying all available actions, and it is denoted as T : S × A × S → [0, 1] . For example, the probability of jumping from the state s ∈ S to state s ∈ S after applying the action α ∈ A can be written as T(s, a, s ). For an MDP, the Markov properties hold. Consider a discrete-time series t = 1, 2, . . ., the next state s t+1 only depends on the current state s t and the current action a t , as shown below: P(s t+1 |s t , a t , s t−1 , a t−1 , . . . ) = P(s t+1 |s t , a t ) = T(s t , a t , s t+1 ). (8) A reward R is a scalar numerical signal received from doing an action at a state, and the reward function R can be formally represented as R : In RL, the one who actively takes actions that affect the environment is called an agent, and the environment is an object that reacts passively to the agent. The interaction between the agent and the environment can be described under the MDP framework. The agent-environment interface is shown in Figure 8. The agent and the environment interact discretely in a time series t = 0, 1, 2, 3. For each time step t, the description of the environment's current condition s t is obtained by the agent. The agent will decide which action a t should be taken based on some rules. Then a numerical reward signal r t+1 and a new state s t+1 are brought out by the environment.   The rule followed by an agent is called a policy. Formally, an agent's policy π is the mapping of the current state to the selection probability of the available actions. Hence, the probability of selecting a t when the agent is at the state s t can be denoted as π t (a t |s t ). For example, under a policy π, the probability of selecting a 1 at state s can be written as π(a 1 |s), as shown in Figure 10. With the experience (such as the tuple (s t , a t , r t+1 , s t+1 )) gathered, the policy can be modified. The goal of reinforcement learning is to obtain a policy, such that the received cumulative rewards are maximized.

Long Term Rewards
The expected return G t is the sum of the discounting rewards the agent expected to receive at time t, as below: where γ is the discounting parameter, 0 ≤ γ ≤ 1. The rewards are discounted in order to avoid infinite cumulated rewards. With a smaller γ, the agent focuses on the recent rewards more, while a larger γ will make the agent be farsighted, and thus long term rewards will be considered. Also, the expected return can be written as an iterative form, which indicates that the current return is equal to the sum of the immediate reward r t+1 and the successor state return G t+1 .

Action Value and Optimal Action Value
The action value is the agent's expected return starting from the state s, with the action a chosen and thereafter following the policy π, as described in (10): where r is the immediate reward received by doing action a, and γ a π(a |s )q π (s , a ) is the discounted action value starting from all possible next states s .
The backup diagram of the action value is shown in Figure 11. The top white circle and black dot are the state s and the action a being chosen under the state s, respectively. r is the reward received consequently. The white circles in the middle are the successor states s the agent may obtain after doing action a. After that, the agent does actions according to the policy π, and the black dots in the bottom are the possible actions a being chosen. To an action value function, a policy π is better than the other policy π , or π > π , if and only if q π (s, a) ≥ q π (s, a), ∀s ∈ S, a ∈ A. There always exists a policy that is better or equal to all the other policies and that is an optimal policy. The optimal action value function is defined as the expected return starting from the state s with the action a chosen, thereafter following the optimal policy, such that the expected returns in the following states are maximized, as shown in (11) q * (s, a) = max π q π (s, a) = s ,r p(s , r|s, a ) r + γ max a q * (s , a ) (11) where, max a q * (s , a ) is the optimal action value function of the next state s with optimal action a chosen. The backup diagram of the optimal action value function is depicted in Figure 12 with the blue background. The arc between different a is a representation of choosing the optimal a such that the return starting from s is maximized. Figure 12. The backup diagram of the optimal action value function.

Introduction to Q-learning
Q-learning [19] is a model free temporal difference (TD) method to perform RL. A Q-table is constructed through bootstrapping to store the optimal action value of any state-action pair. The one-step update of Q-learning is shown as below: where Q(s, a) directly approximates the optimal action value q * and is independent of the policy followed. For the state s , an optimal action a _ is expected to be selected within action set A(s') so that the Q value at s can be maximized, i.e., max . The updating rate of the Q value is α, 0 ≤ α < 1, and Q has been proven to converge to q * with a probability 1 by [19]. The backup diagram of Q-learning is depicted in Figure 13 with a blue background. Through actual interactions, the experiences (s, a, r, s ) can be acquired. According to the interaction experiences, the Q values can be updated by (12) and stored in a tabular form, which is called a Q-table. Once the Q-table is fully constructed, the optimal policy can be extracted by greedily choosing the action with the largest optimal action value for each state, i.e., argmax a∈A(s) Q(s, a). Table 1 is an example of policy extraction from a well-constructed Q-table. The agent will take the action A3 when it is at the state S1 since argmax a∈{A1,A2,A3} Q(S1, a) = A3. Similarly, A1 should be taken at states S2 and S4, and A2 should be taken at state S3.

Q(s,a)
A1 A2 A3 S1 Q(S1,A1) = 0.5 Q(S1,A2) = 1 Q(S1,A3) One of the issues in RL is the exploration-exploitation trade-off [8]. For some RL problems, the policy changes with time, thus off-line learning is not suitable for them. Therefore, for each time step, the system is designed to be ε-greedy, i.e., choosing actions randomly to explore new possible rules in a probability ε and otherwise following the learned policy and always choosing the action with the largest Q value.
The algorithm of Q-learning is shown in Algorithm 1. First, the elements in the Q-table are initialized. The ε-greedy, the learning rate α and the discount factor γ are also initialized. Note that for a non-episodic MDP, there is no ending state, therefore, γ should be less than 1 to avoid infinite expected return.
The agent observes the current state first. With a probability ε, the agent will randomly choose the available actions, otherwise, it will choose the action with the largest Q value according to the Q-table. After applying the action, the environment will generate the reward signal r t+1 , and then a new state s t+1 can be observed. With (s t , a t , r t+1 , s t+1 ) obtained, the element Q(s t , a t ) in the Q-table can be updated using (34). Finally, the current state s t is replaced by the new state s t+1 , and one step of an update is completed.

Algorithm 1: Non-episodic Q-learning algorithm
Initialize Q(s, a), ∀s ∈ S, a ∈ A Initialize ε, α, γ, 0 ≤ ε ≤ 1, 0 ≤ α < 1, 0 < γ < 1 Observe s t Repeat (for each time step t) Apply a t , observe r t+1 and s t+1 Q(s t , a t ) ← Q(s t , a t ) + α r t+1 + γ max An example of the agent-environment model with the Q-table is illustrated in Figure 14. The state representation generated by the environment is required to be discretized to perform table lookup on the Q-table, which is impractical in some cases, such as the state space is too large or the state space is continuous. Therefore, an approximation of the Q-table using a neural network, named Q-network [13], has been implemented in this paper.

Q-Learning with Neural Network Approximation
The Q-table in Figure 14 can be approximated as the Q-network in Figure 15. The state representation can be directly used as the input of the Q-network without discretization. The number of input nodes is the dimension of the state representation. The output of the Q-network Q(s, a; θ) is used to approximate the optimal action value q * , where θ is the weight of the network, i.e.,Q(s, a; θ) ≈ q * (s, a).  The loss function, also known as the cost function [20], of the Q-network in iteration i is defined as: where y i = r + γmax a Q(s , a ; θ i−1 ) is the target Q value based on the current reward r and the maximum Q value at the next step S generated by the Q-network in iteration i − 1, and Q(s, a; θ i ) is the estimated Q value provided by the Q-network in iteration i. For every i, (13) should be minimized in order to approximate the estimate Q value to the target Q value. The target Q-network parameter of the previous iteration, θ i−1 , should hold fixed during the training in iteration i. This may cause the oscillation or divergence of the policy since the target Q value is affected immediately after updating the Q-network for every iteration.
To stabilize the learning process, a fixed target Q-network technique is proposed by [13], and the loss function is written as where θ − is the old parameter several iterations before, and it will update to the current value for every C T iterations, C T > 1. Therefore, the frequent update of the Q-network Q will have less effect on the target Q valueQ, and the training process will be more stable.
To break down the correlation between training samples, the experience replay technique is proposed by [12]. For each time step t, an experience sample is gathered and stored by the agent into a data set E = {e 1 , . . . e t }, where e t = (s t , a t , r t+1 , s t+1 ). For each learning iteration, several samples are taken randomly as a mini-batch s j , a j , r j+1 , s j+1 to perform mini-batch gradient descent [21], and the loss function can be rewritten as Finally, a Q learning algorithm using a Q-network as the function approximation is shown in Algorithm 2. The parameter of Q andQ, θ and θ − , are initialized. The target Q parameter update period C T , the experience data set E, the ε-greedy policy, the learning rate α, the discount factor γ and the experience replay threshold pth are also initialized. The agent then observed the current state s t . With a probability ε, the agent will do the action randomly, otherwise, it will choose the action with the largest Q value according to the output of the Q-network, i.e., a t = argmax a_ Q(s t , a_; θ). After applying an action, the agent observed the reward and the representation of the next state, and (s t , a t , r t+1 , s t+1 ) will be stored into E. After cumulating enough experiences, an experience replay can be performed by sampling a mini-batch of experiences s j , a j , r j+1 , s j+1 from E randomly. The mini-batch target Q values y j are calculated by theQ instead of Q because of the performing of the fixed target Q-network technique, and then a mini-batch gradient descent [21] is applied to minimize the loss function shown in (37). The target Q-network parameter θ − will be updated for every C T iterations. Finally, the state is updated, and a new iteration will begin. Algorithm 2: Q-learning using Q-network approximation with fixed target Q-networks and experience replay Apply a t , observe r t+1 and s t+1 Store (s t , a t , r t+1 , s t+1 ) in E If t > pth Randomly Sample mini-batch s j , a j , r j+1 , s j+1 from E Calculate loss function as (15) Perform mini-batch gradient descent to optimize the loss function Every C T step updateQ ← Q s t+1 ← s t

Design of an RL MPPT System
To perform MPPT based on RL, the system must be able to be described by MDP. The element needed in the RL MPPT system is defined in Table 2. The PV module and the converter can be seen as the environment, and the controller is the agent as depicted in Figure 16. The goal of the agent is to reach the MPP through interacting with the environment. Action ∆D Reward ∆P = P − P Figure 16. Agent and environment defined in the RL MPPT system.
The system's condition, the state, is described by the solar irradiance, the module temperature and the duty ratio D since the I-V curve is affected by the solar irradiance and the module temperature as mentioned previously. The system's operating point is at the intersection of the I-V curve and the load line, and the load line is controlled by the duty ratio of the converter.
In this study, the action is defined as a set of duty ratio changing step ∆D with different step sizes. Therefore, the tracking progress can be seen as a sequential decision-making problem, i.e., the MPP of the system can be reached by applying a series of variable step sizes ∆D appropriately.
The reward is a numerical signal that helps the agent judge how "good" or "bad" an action is. The action that moves the operating point close to the MPP is better than the action that moves the operating point away from the MPP. Therefore, the power difference ∆P = P − P is defined as the reward since it provides not only the moving direction of the operating point but also the numerical scaling representation of the effect caused by applying the action, for example, a larger step size may lead to a larger power difference.
Also, the Markovian property holds since the current state is only affected by the state and the action taken one step before. Through combining the elements of the MDP model designed in Table 2 and the concept of the RL MPPT shown in Figure 7, the detail of the RL MPPT can be described as Figure 17. The perturbation step size is chosen according to the Q value of the current irradiance, temperature, and duty ratio D. After applying the change of D, the power difference ∆P and the new state description s can be observed. In the Q-table approach, the experience will be used to update the Q value immediately, but in the Q-network approach, it will be stored to perform experience replay. With the simple workflow of the RL MPPT described as above, the flowcharts of the RL MPPT using Q-table (RL-QT MPPT) and the RL MPPT using Q-network (RL-QN MPPT) are shown in Figures 18 and 19. The flowcharts essentially follow the Q-learning algorithms provided in Algorithm 1 and Algorithm 2, respectively.  The proposed RL MPPT methods in this paper include two phases, the learning phase and tracking phase. In the learning phase, the experiences are learned, i.e., the Q-table or the Q-network is updated. However, to speed up the tracking speed, the update process is skipped in the tracking phase. The description of the learning phase and the tracking phase of the RL-QT MPPT and RL-QN MPPT are shown below: (a) Learning Phase of RL-QT MPPT First, the Q-table, the discount factor γ, the learning rate α, the ε-greedy policy, and the duty ratio D are initialized, and the solar irradiance and the temperature of the PV module are sensed to form the state representation (irradiance, temperature, D). The output power of the PV module is sensed and stored as P. With a probability of ε, the agent will randomly choose an action and change the perturbation step size. Otherwise it will follow the Q-table and choose the action with the largest Q value. After applying the new duty ratio, the irradiance, the temperature and the output power P are sensed again. Therefore, the succeeding state s is obtained, and the reward r can be calculated as P − P. Finally, with (s, a, r, s ), the corresponding element in the Q-table, i.e., Q(s, a), can be updated by (34).

(b) Tracking Phase of RL-QT MPPT
In the tracking phase, the agent selects the action by looking up the Q-table, and D is changed accordingly. Then the iteration ends without updating the Q-table.
(c) Learning Phase of RL-QN MPPT The RL-QN MPPT is similar to the RL-QT MPPT. First, the fixed target Q-network update iteration C T , the experience dataset E, the network parameters θ and θ − , the experience replay threshold pth, the discount factor γ, the learning rate α, the ε-greedy policy, and the duty ratio D are initialized, and the iteration counter is also initialized. Then the state (irradiance, temperature, D) is obtained by acquiring the solar irradiance data and the module temperature data. The output power of the PV module is calculated as P = VI. Under the ε-greedy policy, the action will be chosen randomly in a probability ε, otherwise the action with the largest Q value will be taken. The subsequent state representation s and the reward r can be obtained after changing the duty ratio of the converter. The experience (s, a, r, s ) is stored in E, and if the experiences in E are enough, the experience replay techniques will be performed. For every C T step, the fixed target Q-network weight θ − will be updated, and finally, it will be increased to move onto the next iteration.

(d) Tracking Phase of RL-QN MPPT
In the tracking phase, the agent follows the policy approximated by the Q-network, i.e., always choose the action with the largest Q value. Then D is modified by ∆D. The system's operating point is changed, but the experience is not stored to speed up tracking. Then a new iteration will begin.

System Configuration
The environment configuration of both simulation and experiment is shown in Table 3, including the description of the PV module and the parameter of the DC-DC boost converter. The agent configuration is shown in Table 4, which is identical for both simulation and experiment. The corresponding hardware structure is depicted in Figure 20. For the environment described in Table 3, the configuration of the PV module, the DC-DC boost converter and the resistive load are identical for both simulation and experiment, while two large resistor R 1 and R 2 are added into the actual circuit as a voltage divider to measure the voltage, as shown in Figure 20.   As for the configuration of the agent, the range of the duty ratio is set between 0.2 and 0.9 due to the limit of the hardware. The P&O method and the RL MPPT methods are performed one time per second for both simulation and experiment. ∆D of the P&O method is fixed at 0.05, while the RL MPPT methods provide a set of actions with different step sizes, i.e., A = {∆D = 0, ∆D = ±0.01, ∆D = ±0.05, ∆D = ±0.1}. The state, reward, ε-greedy, and the discount factor γ are the same in the RL-QT MPPT and RL-QN MPPT. However, the discretization of the state is needed to perform in the RL-QT MPPT method. For the RL-QT MPPT method, the irradiance is discretized into 10 levels between 0 and 1000 W/m 2 , and the temperature is divided into 6 levels, which are below 20 • C, 20~30 • C, 30~40 • C, 40~50 • C, 50~60 • C, and beyond 60 • C, respectively. The level of D is discretized by 0.01, the minimum value of non-zero ∆D, and is limited between 0.2~0.9 as mentioned before. Therefore, the size of the Q-table can be obtained by calculating the size of the state space and the action space, which is 4260*7.
A multi-layer perceptron is used as the Q-network in this study. After several attempts and adjustments, a 5-layer structure with 3 hidden layers is applied in both simulation and experiment. The input layer consists of 3 neurons, which are the irradiance, the temperature, and duty ratio input, respectively. Each hidden layer is constructed by 40 neurons, and the output layer has 7 neurons, which represent the Q value of 7 actions,∆D = 0, ∆D = ±0.01, ∆D = ±0.05, ∆D = ±0.1, respectively. The learning rate α is 0.9 in RL-QT MPPT method and 0.0001 in RL-QN MPPT method. C T , E and pth are also initialized for the Q-network method to perform the experience replay and the fixed target Q-network techniques. The total amount of experiences gathered by RL-QN MPPT in the learning phase are 18,646 and 26,838 for the simulation and the experiment, respectively.

Simulation Result
The performance of the P&O, RL-QT MPPT, and RL-QN MPPT methods are simulated by MATLAB and Simulink R2017b with AMD Ryzen Threadripper 1920X processor, 3.50 GHz, 64 GB of DRAM memory and Microsoft Windows 10 operating system. With the irradiance and the temperature signal and the simulation configuration provided in Figure 21, Tables 3 and 4

Experimental Result
The hardware specification for the actual experiment setup are listed in Table 5. An analog to digital converter (ADC) is needed since there is no built-in ADC in Raspberry Pi 3 Model B [22]. The voltage is measured by the ADC, and the analog output of the current sensor is transferred into the digital signal by the ADC as well. An ambient light sensor module MAX44009 GY-49 is used to measure the illuminance, which can be converted to the solar irradiance with a ratio of 0.0079 W/m 2 per lux [23]. To measure the temperature, an infrared thermometer non-contact module MLX90614 GY-906 is used to measure the surface temperature of the PV module. The power MOSFET IRF840 is driven by the photocoupler TLP250. With a high current capability, 1N5408 is an appropriate choice for the diode D1 in the boost converter. Finally, the actual experiment setup is shown in Figure 25.  The experiments of the P&O method, the RL-QT MPPT method, and the RL-QN MPPT method were conducted under similar environmental conditions. The irradiance was about 650 W/m 2 , and the surface temperature of the PV module was about 48 • C. The result is shown in Figures 26-28. Similar to the simulation result, the step sizes are selected appropriately by the RL MPPT methods during the tracking process, which leads to a better performance comparing to the P&O method.    Table 6 shows the comparison of the experiment results of the three solar MPPT tracking methods. Both the RL-QT MPPT and the RL-QN MPPT are capable of tracking the MPP with fewer tracking steps than the P&O method and remain stable near the MPP. The average power around MPP of the two RL MPPT methods are near that of the P&O method, which indicate that the proposed methods are able to track the MPP.

Conclusions
In this paper, two MPPT methods based on model-free reinforcement learning are proposed. The tracking process can be seen as a sequential decision-making problem since the MPP can be achieved through selecting an appropriate perturbation step size for every time step. Therefore, an MDP model is suitable for describing the interaction between the circuit connected to the PV module and the controller which is able to choose ∆D and change the duty ratio D of the circuit. An MDP model consists of four elements, which are state, action, transition, and reward. With the MDP model described, an RL-QT MPPT method is proposed by constructing the Q-table to perform MPPT control. However, the state representation is needed to be discretized for the tabular method, which may cause the loss of MPPT control accuracy. Therefore, a Q-network-based MPPT method is proposed. In the RL-QN MPPT method, the Q-table is approximated by a neural network, so that the discretization of the states are not needed. For both RL-QT MPPT and RL-QN MPPT, the tracking method consists of the learning phase and the tracking phase, which is able to expedite the tracking process since the Q value will not be updated in the tracking phase. A conclusion can be drawn from the simulation and experimental results that the RL MPPT methods are more effective than the traditional P&O method to track the MPP since the proposed RL MPPT methods possess smaller ripples and faster tracking speeds. Furthermore, for the proposed RL MPPT methods, the RL-QT MPPT method performs with smaller oscillations and the RL-QN MPPT method achieves a higher average power.