An Analysis of Double Q-Learning-Based Energy Management Strategies for TEG-Powered IoT Devices

The study presents a self-learning controller for managing the energy in an Internet of Things (IoT) device powered by energy harvested from a thermoelectric generator (TEG). The device’s controller is based on a double Q-learning (DQL) method; the hardware incorporates a TEG energy harvesting subsystem with a dc/dc converter, a load module with a microcontroller, and a LoRaWAN communications interface. The model is controlled according to adaptive measurements and transmission periods. The controller’s reward policy evaluates the level of charge available to the device. The controller applies and evaluates various learning parameters and reduces the learning rate over time. Using four years of historical soil temperature data in an experimental simulation of several controller configurations, the DQL controller demonstrated correct operation, a low learning rate, and high cumulative rewards. The best energy management controller operated with a completed cycle and missed cycle ratio of 98.5%. The novelty of the presented approach is discussed in relation to state-of-the-art methods in adaptive ability, learning processes, and practical applications of the device.

Michal Prauzek, Senior Member, IEEE, Jaromir Konecny, Member, IEEE, and Tereza Paterova

Index Terms: Energy harvesting, energy management, Internet of Things (IoT), reinforcement learning, thermoelectric generator (TEG).

I. INTRODUCTION
THE APPLICATION of machine learning (ML) methods in combination with embedded Internet of Things (IoT) devices remains a challenging task due to the limited computational resources, low-power demands, and self-operating requirements of these devices. This study is an extended version of a pilot study [1] presented at the 2022 IEEE Symposium Series on Computational Intelligence and delivers a more detailed and complex analysis of Q-learning (QL) performance, an improved reward policy, and a double QL (DQL) policy tested over four years.

The authors are with the Department of Cybernetics and Biomedical Engineering, VSB-Technical University of Ostrava, 70800 Ostrava-Poruba, Czechia (e-mail: michal.prauzek@vsb.cz; jaromir.konecny@vsb.cz; tereza.paterova@vsb.cz).
Digital Object Identifier 10.1109/JIOT.2023.3283599

Fig. 1. Energy management principle applied by the ML controller in the energy harvesting TEG-powered IoT device.
The study investigated methods of powering IoT platforms with thermoelectric generators (TEGs) [2] according to the scheme depicted in Fig. 1. The amount of energy harvested by a TEG is a dynamic parameter which depends on the temperature in the surrounding environment [3]. An energy management system which specifies how an IoT device should behave at certain times should, therefore, be applied according to the energy which is available to the device [4]. The controller described in the study applied a DQL-based strategy to manage the duty cycle in a TEG-powered IoT device. The device itself was designed to monitor environmental parameters and transmit collected data via a wireless communications interface for storage in a cloud and subsequent advanced data processing.
The algorithm used by the controller applied real-time self-learning principles. The energy harvesting IoT sensor, therefore, did not require any energy system or hardware customization or modification (capacitor size, energy harvester type, replacement of aging hardware, etc.) to suit the device's application or deployment location. Self-learning also avoided many disadvantages of state-of-the-art methods, for example, the need for historical data to train a neural network, the time-consuming and computationally intensive processes needed to optimize a fuzzy-based controller, or the need to predict ambient energy in a prediction-based controller. The proposed DQL implementation has low computational and memory requirements, which makes it well suited to a low-cost embedded IoT platform and to demanding low-cost, low-power designs.
The study's contribution is summarized in the following.
1) A novel, self-learning DQL-based approach designed for low-power, low-cost TEG-powered IoT nodes managed with a wake-up scenario.
2) Comparison of the proposed solution's performance with a static energy management configuration and a state-of-the-art fuzzy-based controller tested with a simulation and soil environmental data.
3) Discussion of the features of the proposed approach in relation to the results obtained from the device's self-learning ability, model-free design, and computational cost requirements.
The article is organized as follows. Section I presents the aim, novelty, and benefits of the study; Section II summarizes related studies and the state-of-the-art; Section III describes the device's DQL principles, learning policy, and adapted DQL energy management algorithm; Section IV describes the study's experiment and input data and provides an evaluation of the device's performance; Section V discusses the experimental results in relation to the device's learning parameters, a time domain analysis, and comparison with a reference solution; Section VI discusses the results and the article's contribution; and Section VII concludes this article and outlines potential future work.

II. RELATED WORKS
Energy management policies in energy harvesting IoT sensors are designed to provide a continuous and uninterrupted supply of energy [5]. Due to the unpredictable and dynamic nature of the harvesting environments, successful operation of adaptive energy management algorithms in IoT sensors remains a challenge [6]. For adaptive energy management, ML methods can be applied. Table I summarizes the current research and state-of-the-art ML methods. These methods are categorized into offline and online learning approaches. Offline methods exploit a neural network or fuzzy logic to predict energy management system parameters. Online methods are based on self-learning algorithms, such as deep reinforcement learning or QL.
Neural networks are mainly used for predictive analysis. In neural network applications, the quantity of available energy can be predicted from energy harvesting nodes [7] or multiple energy harvesting sources [8]. Output power can be predicted from hybrid energy harvesting sources [9]. The high computational demands of neural networks mean it is not always feasible to deploy this approach with energy-constrained devices [23].
Fuzzy logic is a suitable method for building adaptive algorithms that are used to achieve a continuous energy source for sensor nodes and, thus, prolong sensor node lifetime [10]. Genetic algorithms are applied to optimize fuzzy rule-based controllers [11], forecast next-day solar energy availability using evolutionary fuzzy rules [12], or predict the quantity of available energy in IoT devices [13]. Algorithms based on fuzzy logic can also be applied to manage the operation of wireless sensor nodes equipped with energy harvesting devices [15]. Systems based on fuzzy logic can provide optimal operational strategies that assess the node's current resource requirements, current battery status, or expected energy charge [14]. However, because neural networks and fuzzy rule-based systems do not possess self-learning abilities, they are not suitable for adaptive energy management algorithms in dynamic environments.
Self-learning algorithms suitable for IoT-embedded platforms are commonly based on semi-supervised reinforcement learning approaches. One of these approaches is deep reinforcement learning, which combines neural network and reinforcement learning principles. Deep reinforcement learning algorithms can be used to manage energy in battery-less event detection sensors [16], manage resources in hybrid energy wireless networks [17], or jointly optimize data offloading and resource allocation in renewable-energy-aware IoT devices [18]. Although this approach is self-learning and suitable for multidimensional environments, the algorithm is not fully optimized due to the computational complexity involved in real-time neural network training.
Another online and self-learning approach is QL, which is suitable for coordinating the energy consumption in wireless sensor networks [19], allocating resources for energy harvesting device communications in IoT networks [20], and managing energy in solar-powered environmental wireless sensor network nodes [21]. This approach is suitable for managing energy in dynamic environments because it applies a self-learning algorithm that is not computationally intensive due to its semi-supervisory nature and a Q-table updated with single values according to the current reward. This can be demonstrated in the energy and transmission management of a node which has been programmed using QL and exhibits appropriate data and energy flushing in queues to achieve better throughput and fewer lost packets [22].
Hybrid ML approaches, such as the combination of fuzzy logic and reinforcement learning techniques, can also be applied. These fuzzy-based reinforcement learning mechanisms first prioritize tasks using fuzzy logic. A reinforcement learning mechanism is then used to solve the problem of tasks with high dimensionality in a dynamic environment [24]. Besides conventional computational models, task schedulers can also be used. This type of approach is able to intelligently schedule application tasks to avoid power failures and maintain forward progress in IoT devices [25].

III. PROPOSED MODEL
This section describes the DQL method, its learning policy, and its implementation for wake-up scheduling in an IoT device. The section includes a reference solution based on a fuzzy logic controller.

A. Double Q-Learning
The DQL method belongs to the reinforcement learning algorithm family. Hasselt [26] proposed this method as a modification to avoid the overestimation of action values produced by the QL algorithm. The approach differs by using two Q-tables instead of one. The Q-tables are denoted $Q^{A}$ and $Q^{B}$ and are applied to each state/action pair. The DQL finds the action $a^{*}$, which is the maximal valued action in the next state $s'$, according to the value function

$$a^{*} = \arg\max_{a} Q^{A}(s', a). \tag{1}$$

A similar process is applied to $b^{*}$, according to

$$b^{*} = \arg\max_{a} Q^{B}(s', a). \tag{2}$$

Each Q function is updated with a value from the other Q function for the next state. $Q^{B}$ is used to update $Q^{A}$, according to the equation

$$Q^{A}(s, a) \leftarrow Q^{A}(s, a) + \alpha \left[ R + \gamma\, Q^{B}(s', a^{*}) - Q^{A}(s, a) \right]. \tag{3}$$

This is performed conversely for $Q^{B}$, according to the equation

$$Q^{B}(s, a) \leftarrow Q^{B}(s, a) + \alpha \left[ R + \gamma\, Q^{A}(s', b^{*}) - Q^{B}(s, a) \right] \tag{4}$$

where $\alpha$ is the learning rate, $\gamma$ weights the future (cumulative) reward, and $R$ is the reward.

The expected value (E) of $Q^{B}$ for action $a^{*}$ is mathematically proven to be less than or equal to the maximum value of $Q^{A}(s', a)$ [26]

$$E\left[ Q^{B}(s', a^{*}) \right] \le \max_{a} Q^{A}(s', a). \tag{5}$$

If a large number of iterations are executed, the expected value of $Q^{B}(s', a^{*})$ will be less than the maximum value of $Q^{A}(s', a)$. It means that $Q^{A}(s, a)$ is never updated with a maximum value and, thus, never overestimated. This also applies conversely to $Q^{B}(s, a)$. To select the action for the next run, the appropriate Q-table is used.

Fig. 2 shows the learning phase of the DQL algorithm. Generally, DQL algorithms use two estimators instead of one to eliminate any overestimation of rewards [26]. The advantage of two estimators is that the first is used to select an action and the second is used for evaluation. The estimators are switched after each iteration. When the Q-table for $Q^{A}$ updates, the future reward value is taken from the Q-table for $Q^{B}$, and vice versa. Fig. 2 shows an example of two learning phases over two consecutive iterations. The Q-table for $Q^{A}$ is updated during the first iteration, the Q-table for $Q^{B}$ during the second.

Fig. 2. DQL algorithm learning process: updating the Q-tables for $Q^{A}$ and $Q^{B}$ over two iterations.
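The update rule described above can be illustrated with a minimal tabular sketch. It is not the authors' implementation: the function name `double_q_update`, the dictionary-based Q-tables, and the boolean flag that chooses which table to update are assumptions made for illustration. The only behavior taken from the text is that one table selects the maximal action in the next state while the other table supplies the value used in the update.

```python
from collections import defaultdict


def double_q_update(q_a, q_b, state, action, reward, next_state,
                    actions, alpha, gamma, update_a):
    """Apply one double Q-learning step to Q_A (update_a=True) or Q_B.

    q_a and q_b map (state, action) pairs to values. The table being
    updated selects the best next action; the other table evaluates it.
    """
    selector, evaluator = (q_a, q_b) if update_a else (q_b, q_a)
    # Maximal valued action in the next state according to the selector.
    best_next = max(actions, key=lambda a: selector[(next_state, a)])
    # The value of that action is taken from the *other* table.
    target = reward + gamma * evaluator[(next_state, best_next)]
    selector[(state, action)] += alpha * (target - selector[(state, action)])
    return selector[(state, action)]


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
q_a = defaultdict(float)
q_b = defaultdict(float)
q_a[(1, 10)] = 1.0   # Q_A prefers action 10 in state 1 ...
q_b[(1, 10)] = 2.0   # ... and Q_B supplies its value for the update.
new_value = double_q_update(q_a, q_b, state=0, action=60, reward=0.5,
                            next_state=1, actions=[10, 60],
                            alpha=0.5, gamma=0.8, update_a=True)
print(new_value)
```

Alternating `update_a` between calls mirrors the statement that the estimators are switched after each iteration; Hasselt's original formulation picks the table at random instead.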

B. Learning Policy
An integral part of DQL is the learning policy. Generally, the learning policy defines the actions, states, and the reward policy. The DQL controller's aim is to optimize the use of energy. Actions are defined for the next sleep time, i.e., the next period duration. The action set is defined as follows:

$$A = \{720, 480, 240, 120, 60, 10\}\ \text{(min)}. \tag{6}$$

States are defined according to the energy stored in the supercapacitor. The maximum energy which can be stored is calculated from the equation

$$E_{max} = \frac{1}{2} C_{store} V_{max}^{2} \tag{7}$$

where $E_{max}$ is the maximum stored energy in joules, $C_{store}$ is the electrical capacitance in farads, and $V_{max}$ is the maximum supercapacitor voltage generated by the dc/dc converter.
For the experiment in the current study, an LTC3109 dc/dc converter supplied electrical output until the supercapacitor voltage dropped below a desired output voltage. The supercapacitor, therefore, had to hold a minimum amount of energy, calculated according to the equation

$$E_{min} = \frac{1}{2} C_{store} V_{out}^{2} \tag{8}$$

where $E_{min}$ is the required stored energy in joules in the supercapacitor corresponding to the desired output voltage $V_{out}$. The state of energy storage (SoES) is computed according to

$$\mathrm{SoES} = \begin{cases} 0, & E_{store} < E_{min} \\ \dfrac{E_{store} - E_{min}}{E_{max} - E_{min}}, & \text{otherwise.} \end{cases} \tag{9}$$

SoES is normalized to the interval $\langle 0, 1 \rangle$ and divided into six states

$$S = \{S_{1}, S_{2}, S_{3}, S_{4}, S_{5}, S_{6}\}. \tag{10}$$

If SoES is less than 1/6, then the state is $S_{1}$.
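As a worked illustration of the state definition, the following sketch computes the stored energy, the SoES, and the resulting state index. Several points are assumptions rather than details confirmed by the text: the capacitor energy formula E = C V^2 / 2, the range normalization between the minimum and maximum energy, six states of equal width, and the example capacitance and voltages, which are hypothetical.

```python
def stored_energy(capacitance_f, voltage_v):
    """Energy in joules held by a capacitor (assumed E = C * V^2 / 2)."""
    return 0.5 * capacitance_f * voltage_v ** 2


def soes(e_store, e_min, e_max):
    """State of energy storage, range-normalized to the interval [0, 1]."""
    if e_store < e_min:
        return 0.0
    return min((e_store - e_min) / (e_max - e_min), 1.0)


def soes_state(soes_value, n_states=6):
    """Map SoES to a state index 1..n_states (assumed equal-width bins)."""
    return min(int(soes_value * n_states), n_states - 1) + 1


# Hypothetical hardware values, not taken from the paper.
C_STORE = 1.0   # F
V_MAX = 5.0     # V
V_OUT = 3.0     # V

e_max = stored_energy(C_STORE, V_MAX)   # 12.5 J
e_min = stored_energy(C_STORE, V_OUT)   # 4.5 J
level = soes(stored_energy(C_STORE, 4.0), e_min, e_max)
print(level, soes_state(level))
```

With these numbers a supercapacitor at 4.0 V holds 8.0 J, which gives SoES = 3.5 / 8 = 0.4375 and falls into the third of six states.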
The reward policy is based on the current SoES value and the SoES value from the previous cycle; the reward is calculated from (11), where $\mathrm{SoES}_{cycle}$ is the supercapacitor's range-normalized remaining energy, $\mathrm{SoES}_{cycle-1}$ is the supercapacitor's range-normalized remaining energy from the previous cycle, and $R$ is the reward. The policy is defined according to two conditions: 1) when SoES rises, the controller obtains a positive reward and 2) an action with a longer period increases the probability that the SoES is higher after the performed action. The relationship between reward and incoming energy is straightforward. If a sudden temperature difference occurs on the TEG, caused by, for example, blowing wind, the reward will also be high. A high reward may cause overestimation, but using DQL instead of QL decreases the probability of overestimation.

C. DQL Energy Management Algorithm
This section presents a DQL algorithm dedicated to controlling the behavior of an IoT node. The principle behind adapting DQL to this purpose is that the algorithm uses the current and previous step states instead of the current and future states.

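Algorithm 1: DQL energy management algorithm
Initialize variables and the Q-tables $Q^{A}$ and $Q^{B}$
loop
    Wake up the IoT node
    Obtain the reward R and the current state s
    if the previous state is known then
        Update $Q^{A}$ or $Q^{B}$ (learning phase)
    end if
    Select action a according to the Q-tables and s (ε-greedy policy)
    Start action a
    Store s as the previous state
    Sleep time according to executed action
end loop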
Algorithm 1 defines the DQL process for controlling an IoT node's behavior. After initializing variables and the Q-tables, the IoT node is woken up. The reward R and current state s are checked. The next step is a learning phase which updates one of the Q-tables ($Q^{A}$ or $Q^{B}$). In the first iteration, the previous state is unknown; the learning phase is, therefore, omitted.
Action a is then selected from the appropriate Q-table. Finally, the selected action is performed, the current state is stored (as the previous state for the next learning phase), and the IoT node enters sleep mode. After a certain period has elapsed, the IoT node is again woken up, and the algorithm repeats.
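The wake-up cycle described above can be sketched as a small controller object. This is a hedged illustration, not the authors' code: the class and method names are hypothetical, the Q-tables are initialized to zero, the two tables are updated alternately, the greedy action is taken from the sum of both tables, and the value 0.98 is interpreted as the probability of taking the greedy action. The action set and the default values of α and γ are taken from the text.

```python
import random

ACTIONS_MIN = [720, 480, 240, 120, 60, 10]  # action set A (sleep time, min)
N_STATES = 6                                 # six SoES states


class DqlWakeupController:
    """Sketch of a DQL wake-up scheduler using current and previous states."""

    def __init__(self, alpha=0.3, gamma=0.8, greedy_prob=0.98, seed=0):
        self.alpha, self.gamma, self.greedy_prob = alpha, gamma, greedy_prob
        self.rng = random.Random(seed)
        n = len(ACTIONS_MIN)
        self.q_a = [[0.0] * n for _ in range(N_STATES)]
        self.q_b = [[0.0] * n for _ in range(N_STATES)]
        self.prev_state = None    # unknown during the first iteration
        self.prev_action = None
        self.update_a = True      # assumption: the tables alternate

    def _learn(self, reward, state):
        """Update one Q-table for the previous state/action pair."""
        sel, ev = (self.q_a, self.q_b) if self.update_a else (self.q_b, self.q_a)
        best = max(range(len(ACTIONS_MIN)), key=lambda i: sel[state][i])
        target = reward + self.gamma * ev[state][best]
        old = sel[self.prev_state][self.prev_action]
        sel[self.prev_state][self.prev_action] = old + self.alpha * (target - old)
        self.update_a = not self.update_a

    def on_wakeup(self, reward, state):
        """Run one cycle; state is 0..5. Returns the sleep time in minutes."""
        if self.prev_state is not None:      # learning is skipped at first
            self._learn(reward, state)
        if self.rng.random() < self.greedy_prob:
            # Assumption: greedy choice over the sum of both tables.
            action = max(range(len(ACTIONS_MIN)),
                         key=lambda i: self.q_a[state][i] + self.q_b[state][i])
        else:
            action = self.rng.randrange(len(ACTIONS_MIN))
        self.prev_state, self.prev_action = state, action
        return ACTIONS_MIN[action]


controller = DqlWakeupController()
sleep_minutes = controller.on_wakeup(reward=0.0, state=2)
print(sleep_minutes)
```

On a real node, the returned value would configure the wake-up timer before the device enters sleep mode; the reward and state passed to `on_wakeup` would come from the SoES measurement.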

D. Fuzzy Logic Controller Reference Solution
This section describes a state-of-the-art energy management solution based on a fuzzy controller, adapted from previously published research articles [10], [11], [14]. The reference solution uses a fuzzy logic controller instead of a DQL controller to schedule the next wake-up time.
The reference fuzzy logic controller has two inputs and a single output. The inputs have been designed to respond to the information available from the IoT node according to the DQL reward policy, where the first input contains the fuzzy sets of $\mathrm{SoES}_{cycle}$ and the second contains $\mathrm{SoES}_{cycle-1}$. The output is the next period duration discretized into {10, 20, 30, . . ., 720} min.
The shape of the input and output fuzzy sets depicted in Fig. 3 is triangular, and both inputs contain low, middle, and high sets. The output contains three fuzzy sets, representing slow, middle, and fast operation.

TABLE II RULES FOR THE FUZZY CONTROLLER REFERENCE SOLUTION
Table II lists the rules for the fuzzy logic controller. The knowledge base contains five rules of the form: if $\mathrm{SoES}_{cycle}$ is X and $\mathrm{SoES}_{cycle-1}$ is Y, then the next period duration is Z. To make the controller's behavior as comparable as possible to the proposed DQL approach, these rules follow the same DQL policy. When the current SoES is low, the fuzzy controller selects slow operation; when it is high, it selects fast operation. When the current SoES is in the middle, the fuzzy logic controller's behavior depends on the previous SoES: an increase in the SoES results in fast operation and a decrease in slow operation. When the SoES is balanced and both input sets are in the middle, then operation is also in the middle.

IV. EXPERIMENTAL PROCEDURE
This section describes the experimental hardware parameters applied in the simulation, the input data, and the performance evaluation parameters for the input data.

A. Experimental Setup and Data
The experiment used a hardware model (Fig. 4) composed of three main modules: 1) a TEG; 2) a dc/dc converter; and 3) a load. The parameters of this device were applied in a simulation for analysis. The TEG hardware is a TEC1-12706 [27] module which generates electrical energy when it is exposed to temperature differences. The study in [28] described the properties of this TEG module through an experimental analysis of its current and voltage characteristics by exposing it to a range of temperature differences. A standalone TEG module is able to produce an open circuit voltage in the range of tens to hundreds of millivolts. This voltage range, however, is not sufficient to directly supply electrical devices, such as microcontrollers (MCUs) or transmission modules, and a dc/dc converter is needed to boost the voltage. The hardware model's dc/dc module is based on an LTC3109 converter which converts electrical energy from extremely low input voltage sources such as TEGs [29]. The dc/dc converter module in the experiment was a mathematical model designed according to a physical LTC3109 module and respected its basic functionality.
Harvested energy is consumed by the load module. The load module is composed of an MCU, an environmental sensor, nonvolatile memory, and a wireless communications interface. An NXP KL25Z MCU [30] was selected for its low power consumption and wide range of integrated peripherals. The NXP KL25Z also fulfilled the requirements of the architecture and provides a number of low power modes, a feature which allows fine tuning of the energy profile.
Serving as a data collector, a Bosch BME688 4-in-1 [31] environmental sensor, which measures ambient temperature, air humidity, and atmospheric pressure, is connected to the MCU via the I2C bus. This device also integrates a gas sensor which detects volatile organic and sulfur compounds and various other gases. The experiment did not require any specific sensor model; the only requirement was an advanced sensor design.
The model performs two operations: 1) data measurement and 2) data transmission. A memory buffer implemented in a 24CW1280 EEPROM [32] synchronizes these two operations. FRAM technology was also considered, but no FRAM device is currently capable of operating at low voltages (e.g., at 1.8 V). Communications are provided by LoRaWAN, currently one of the most popular communication technologies for IoT devices, which offers three communication classes (classes A, B, and C) to cover various use cases. The communications link is established with a Semtech SX1261 [33] LoRa transceiver connected via a serial peripheral interface (SPI).
The experimental data contained air temperature and near-surface soil temperature measurements (measured at several depths 0.05, 0.5, 0.1, and 0.2 m) collected over a period of four years (2016-2019). These data were obtained from the Czech Hydrometeorological Institute [34] of the Ministry of the Environment of the Czech Republic and used as input for the experimental model. The data were measured at 10-min intervals at the Churanov monitoring station.

B. Performance Evaluation
To evaluate the hardware's performance in the simulation, several criteria and performance characteristics were analyzed. Performance was assessed according to successful/completed cycles and unsuccessful/missed cycles. A missed cycle is a period during which transmission is required but the available energy is insufficient. The ratio of both indicators (completed and missed cycles) is calculated according to (12). The hardware model's energy consumption characteristics, especially unused energy, were also evaluated. The quantity of unused energy $E_{U}$ is the sum of the energy that is produced while the supercapacitor is fully charged and that the load does not use. This sum of unused energy is then expressed as a ratio to the sum of produced energy.
The average SoES level, the average period between two successful transmissions (P), the percentage of time during which the power-good pin is active ($P_{GOOD}$), and the average supercapacitor voltage ($V_{STORE}$) were also monitored to allow an analysis of energy consumption, the average data availability in the cloud, and the status variables of the IoT node.

V. RESULTS
This section evaluates the performance and variations in the learning parameters over time and discusses the best-performing controller. The learning parameters, represented by α and γ, determine the learning rate and cumulative reward preferences of the DQL algorithm. The ablation study demonstrates the performance of the solution without the ML control algorithm, considering different static settings of the wake-up period. The variations in the learning parameters over time are reflected in the $\alpha_{R}$ policy, which adjusts the algorithm's learning capability. The time domain analysis provides detailed insights into the behavior within a 200-day interval and the selected types of actions.

A. Learning Parameters Performance
The performance of the learning parameters was investigated by applying different values of α and γ in the experimental model. To evaluate learning speed, α was set to a value in the range 0-1, with a step of 0.1. To test sensitivity to the cumulative reward, γ was set in the range 0.1-0.9, also with a step of 0.1. The ε-greedy policy was set to 0.98 to help reduce the number of random actions, since the experimental model was a control system contained within a measurement device and nondeterministic behavior was undesirable. Each variant of the experiment was repeated 100 times to reduce any stochasticity caused by DQL behavior.
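The size of this experiment can be made concrete with a short sketch of the parameter sweep. The grid bounds, step, and repetition count follow the text; the `sweep` helper and the `run_once` callable (standing in for one simulation run that returns a score such as the completed/missed ratio) are hypothetical names introduced only for illustration.

```python
import itertools

ALPHAS = [round(0.1 * i, 1) for i in range(11)]      # 0.0, 0.1, ..., 1.0
GAMMAS = [round(0.1 * i, 1) for i in range(1, 10)]   # 0.1, 0.2, ..., 0.9
REPEATS = 100                                        # repetitions per variant
GRID = list(itertools.product(ALPHAS, GAMMAS))


def sweep(run_once, repeats=REPEATS):
    """Average a score over repeated runs for every (alpha, gamma) pair.

    run_once(alpha, gamma, seed) is a user-supplied callable that runs
    one simulation and returns a numeric score.
    """
    results = {}
    for alpha, gamma in GRID:
        scores = [run_once(alpha, gamma, seed) for seed in range(repeats)]
        results[(alpha, gamma)] = sum(scores) / repeats
    return results


print(len(GRID), len(GRID) * REPEATS)
```

The grid contains 11 × 9 = 99 parameter combinations, so 100 repetitions of each variant correspond to 9900 simulation runs.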
Fig. 5 depicts the results for the performance of the learning parameters. Fig. 5(a) indicates a high number of completed cycles but also a high number of missed cycles. This is especially evident at a high α (high learning rate) and low γ (low cumulative reward). Fig. 5(b) indicates that a lower α (slow learning process) produced fewer successful cycles and also far fewer missed cycles. Fig. 5(c) shows the ratio of completed/missed cycles calculated from (12).
Table III lists the ten best-performing controllers according to the ratio of completed/missed cycles defined in Section IV-B. Each candidate differs in its configured learning parameters (α, γ). These candidates were evaluated according to the number and ratio of completed and missed cycles, unused energy ($E_{U}$), average SoES, and average period. It is clear that the majority of the best controllers had α parameters distributed in the interval 0.2-0.6, representing slower learning rates. In terms of γ, all the best controllers preferred a cumulative reward in the range 0.7-0.9.
In terms of average SoES, a relationship between the total number of cycles and average SoES is evident. The higher the number of cycles, the lower the average SoES, indicating better energy use. The $V_{STORE}$ parameter reflects the same principle and corresponds with the average SoES. A relationship is also evident between the number of cycles and the average transmission period. A higher number of cycles produces a shorter average transmission period and, thus, higher data availability in the cloud. However, the more aggressive strategy with a maximum number of cycles also produced a higher number of missed cycles.

B. Ablation Study and Reference Algorithm
The purpose of the ablation study is to compare the DQL approach with control methods that do not rely on ML. The previous study [1] defined several reference controllers which applied static duty cycle periods. In the current study, the authors compared these solutions plus a reference solution based on a fuzzy logic controller to a solution containing a DQL controller. Table IV compares the performance results of reference static controllers and a reference solution based on fuzzy logic. The lowest average SoES and $V_{STORE}$ corresponded to the expected static controller behavior. Short-duty-cycle controllers decreased the SoES, resulting in short average periods; the long-duty-cycle controllers did not use incoming energy, resulting in long average periods. The average SoES of the fuzzy reference solution was similar to that of the DQL controllers; however, the average period was approximately four times longer. In terms of $P_{GOOD}$, configurations with longer wake-up periods achieve a higher $P_{GOOD}$, which leads to more reliable operation. The fuzzy solution achieves a $P_{GOOD}$ of 77.5%, which corresponds to the range of 180 to 240 min in the static configurations.
Fig. 6 graphs the results for the reference solution in relation to the DQL controller's results. The reference solution is marked in blue and indicates static operating periods (720, 480, 240, 120, 60, or 10 min); the fuzzy controller is marked as yellow diamonds; the DQL controller and its dynamic operating periods are marked in brown. The DQL solution is not clearly visible in the upper graph in Fig. 6(a); therefore, Fig. 6(b) provides a scaled detail of this graph and indicates the best DQL cases with the points A-J. For completed cycles and missed cycles, the blue curve splits the graph area into two parts. Controllers below the blue curve achieved higher performance than the static controllers. The reference fuzzy controller's results are also below the blue curve and indicate that this method achieved higher performance than the static controllers. The results for the best-case DQLs fall below the blue curve to the right and indicate that the DQL algorithm's performance was greater than that of both the static controllers and the fuzzy controller.

C. Changes in the Learning Policy
In this section, the algorithm's ability to learn in time by reducing the learning factor in each learning cycle ($\alpha_{R}$ principle) is discussed. This experiment used the controller (α = 0.3 and γ = 0.8) which provided the best performance in the previous experiments. Reduction of the learning rate was based on the hypothesis that in a repetitive and conservative environment, preserving obtained knowledge and reducing the ability to learn is advantageous. By contrast, when a controller operates in a dynamic environment, it is better to maintain the learning rate at the initial level. The $\alpha_{R}$ principle is defined according to (13), where $\alpha_{i}$ is the initial value of the learning factor, r is the reduction coefficient, G is the gradient coefficient, and n is the number of simulation steps. Table V lists the numerical results for the $\alpha_{R}$ configurations which were applied to determine the algorithm's ability to learn over time. The policy was modified by the reduction coefficient, which sets how quickly the learning rate is reduced. Parameter G was set to 0.5, resulting in a gradual decrease of the learning factor and eventual reduction to half in r cycles. In this experiment, the r coefficient was set to 500, 1000, 2000, 4000, 8000, 16 000, and 32 000. The best controller attained approximately 45 thousand learning cycles over four years of operation; a reduction coefficient of around 11 300, therefore, means that the learning rate is reduced by half in approximately one year.

TABLE VI NUMBER OF INDIVIDUAL PERIODS SELECTED DURING CONTROLLER OPERATION

Fig. 7 shows a graph of the seven $\alpha_{R}$ configurations for determining the algorithm's ability to learn over time. The results indicate that a fast reduction in the learning rate produced poorer performance in terms of missed cycles and the ratio of completed/missed cycles. The best ratios were achieved with slow reduction (reduction coefficients of 8000-32 000). This behavior demonstrates that a quick reduction (reduction coefficients of 500-4000) in the learning rate is not suitable for energy management controllers based on a TEG. It is possible that the controller may benefit from a long-term reduction policy; for example, the best configuration, with 16 000 reduction cycles, required 1.42 years to reduce the learning rate to half. However, it should be noted that the ratio of completed/missed cycles was very close to the best result without a reduction policy, which demonstrates that a constant learning rate is also a suitable solution.
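The relation between the reduction coefficient and calendar time can be checked with a few lines of arithmetic. The 45 000 learning cycles over four years and the halving after r cycles for G = 0.5 come from the text. The function `alpha_reduced` is only one schedule that is consistent with that description (an exponential decay); it is an assumption and is not claimed to be the authors' formula.

```python
LEARNING_CYCLES = 45_000   # approximate cycles of the best controller
YEARS = 4                  # length of the simulated period

cycles_per_year = LEARNING_CYCLES / YEARS   # 11 250, i.e. roughly 11 300


def years_to_halve(r):
    """Years needed to halve the learning rate when halving takes r cycles."""
    return r / cycles_per_year


def alpha_reduced(alpha_i, n, r, g=0.5):
    """ASSUMED schedule: alpha_i * g**(n / r); halves every r steps for g=0.5."""
    return alpha_i * g ** (n / r)


for r in (500, 1000, 2000, 4000, 8000, 16_000, 32_000):
    print(r, round(years_to_halve(r), 2))
```

For r = 16 000 the halving time is 16 000 / 11 250 ≈ 1.42 years, which matches the value quoted for the best configuration.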

D. Time Domain Analysis
This section discusses the behavior of the best-performing controller, which used the settings α = 0.3 and γ = 0.8.
Table VI lists the number of individual periods selected during controller operation. The controller operated according to defined output actions in periods of 10, 60, 120, 240, 480, or 720 min. The second column indicates the number of times this period was selected by the controller. The third column specifies the percentage of times this period was selected by the controller and indicates that the controller selected the fastest action in 91.60% of cases. When energy was unavailable, the controller selected slower actions to obtain a better reward. In 3.26% of cases, the controller slowed down the operating period to 720 min to prevent an outage in the IoT device.
Fig. 8 graphs the results of the simulation for the best DQL candidate over days 1200-1400. The blue and red curves represent the waveforms of $V_{store}$ and $P_{good}$, respectively. $V_{store}$ is the supercapacitor voltage which corresponds to the SoES, and $P_{good}$ is the output signal which indicates whether the output voltage is at a sufficient level. The yellow curve indicates whether a cycle was active (completed) or missed. The purple circles indicate executed actions (period).
At a high level of $V_{store}$, actions with a 10-min period were executed with high density; at a low level of $V_{store}$, actions with a 720-min period were executed with high density. These results correspond with the expected behavior.

VI. DISCUSSION
This section compares the presented DQL approach with state-of-the-art methods and discusses the performance, features, and applications of DQL in TEG-powered IoT nodes.

A. Comparison With State-of-the-Art Approaches
The research from related studies and experiments opens several discussion points. From a review of the literature on advanced methods, the current study is novel in three aspects. Table VII provides a comparison of the related studies listed in Section II with the proposed DQL approach. The individual ML methods are compared according to the design needs of the models (model-free design), computational complexity, the ability to learn continuously (dynamic learning), the ability to learn without cloud assistance (on-site updates), and compatibility with TEG-powered systems (TEG harvesting compatibility). Apart from the QL-based strategies, none of the methods are model-free; they therefore require models for development. Approaches based on neural networks are characterized by high computational complexity and are, therefore, not suitable for implementation in low-cost, low-power IoT devices.
In general, approaches based on reinforcement learning are suitable for embedded applications. For IoT devices powered using energy harvesting methods, the embedded energy management algorithm must be able to adapt to the dynamic nature of the environment where the device is located by learning at every step and adapting to the surrounding conditions. Algorithms based on reinforcement learning (deep reinforcement learning, QL, and DQL) satisfy this condition. Another significant parameter in a low-power IoT device is whether it depends on cloud technology. The ML approaches that learn by themselves without the assistance of cloud technology belong to the reinforcement learning family. Methods based on neural networks and fuzzy logic lack this capability.
Energy harvesting based on TEG technology is characterized by sudden incoming peaks of energy caused by changes in weather conditions. Energy management strategies must, therefore, be robust and avoid overestimating such events. Neural network and fuzzy logic strategies are developed offline and are, therefore, resistant to this type of adaptation in principle. The knowledge base created through reinforcement learning methods may, however, be compromised by the overestimation of external events. The proposed DQL approach, which uses two Q-tables, offers an effective solution for suppressing overestimation during sudden changes in incoming energy.
Table VIII presents a summary of the key parameters for the static, fuzzy, and DQL approaches. To compare the ML approaches, two static controllers (20 and 180 min) were selected based on a comparable average period P. The fuzzy controller has a P comparable with that of the 180-min static configuration, but its overall reliability, as indicated by the ratio parameter related to missed cycles, is significantly higher. In terms of P_GOOD and E_U, the fuzzy controller and the 180-min static configuration show negligible differences. These facts clearly demonstrate that a dynamic approach is more suitable for controlling TEG-powered IoT nodes. Similar observations can be made when comparing the DQL approach to the corresponding static 20-min approach. There is a significant difference in the ratio and in the number of missed cycles, despite the comparable average period. This finding further confirms that DQL is an appropriate solution for IoT energy management. The comparison of the DQL and fuzzy approaches reveals that DQL outperforms the fuzzy approach in terms of all key parameters.

B. DQL Performance and Features
The study's results demonstrate that the real-time self-learning algorithm designed for IoT devices deployed in environments with variable sources of energy is a suitable solution. This conclusion is based on the study's α_R experiment, which produced superior results without any reduction in the algorithm's learning ability. This feature permits application to a wide range of IoT sensors and deployment scenarios. The controller's adaptability is an advantage for IoT sensors whose hardware configurations differ in energy harvester type, capacitor size, hardware age, and other factors. This adaptability is a consequence of DQL principles and a semi-supervised approach driven only by relative state variables from the reward policy.
Self-learning algorithms provide an alternative to ML methods that require additional data sets (e.g., training data sets for neural networks). The proposed solution uses online self-learning principles and, therefore, performs semi-supervised learning within the deployed device itself. This feature not only eliminates the need for a training data set but also produces different learning results in each IoT device. This approach also eliminates the time-consuming and computationally demanding process of optimizing the design (e.g., fuzzy rule-based controllers) or providing ambient energy predictions (e.g., prediction-based controllers).
In terms of required computational resources, the DQL controller is suitable for IoT devices with hardware limitations. The memory implementation includes two data arrays representing Q-tables with floating-point variables. In each learning step, the Bellman equation updates only one variable selected from the data arrays. Finally, actions are selected by averaging and sorting the data arrays. This simple procedure is more efficient than state-of-the-art approaches, such as fuzzy rule-based controllers or neural network evaluation. Overall, DQL can be implemented on computationally limited, low-cost hardware with low-power specifications.
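The memory layout and update described above can be illustrated with a minimal sketch of tabular double Q-learning. It follows the generic textbook form of the method rather than the study's implementation: the state discretization (`n_states`), the ε-greedy exploration (`epsilon`), the random choice of which table to update, and all names are assumptions made for illustration. Only the action set and the default α = 0.3 and γ = 0.8 are taken from the text.

```python
import random

# Output actions (periods in minutes) named in the text.
PERIODS_MIN = (10, 60, 120, 240, 480, 720)


class DoubleQController:
    """Tabular double Q-learning with two Q-tables (illustrative sketch)."""

    def __init__(self, n_states, alpha=0.3, gamma=0.8, epsilon=0.1, seed=0):
        # n_states, epsilon, and seed are assumptions, not values from the study.
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)
        n_actions = len(PERIODS_MIN)
        # Two arrays of floating-point values, one per Q-table.
        self.q_a = [[0.0] * n_actions for _ in range(n_states)]
        self.q_b = [[0.0] * n_actions for _ in range(n_states)]

    def select_action(self, state):
        """Epsilon-greedy choice over the average of both tables."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(PERIODS_MIN))
        avg = [(a + b) / 2.0 for a, b in zip(self.q_a[state], self.q_b[state])]
        return max(range(len(avg)), key=avg.__getitem__)

    def update(self, state, action, reward, next_state):
        """Update exactly one entry of one table per learning step."""
        if self.rng.random() < 0.5:
            learn, evaluate = self.q_a, self.q_b
        else:
            learn, evaluate = self.q_b, self.q_a
        # The table being updated picks the greedy next action ...
        row = learn[next_state]
        best = max(range(len(row)), key=row.__getitem__)
        # ... and the other table evaluates it, which limits overestimation.
        target = reward + self.gamma * evaluate[next_state][best]
        learn[state][action] += self.alpha * (target - learn[state][action])


if __name__ == "__main__":
    ctrl = DoubleQController(n_states=4, epsilon=0.0)
    ctrl.update(state=0, action=2, reward=1.0, next_state=1)
    print(PERIODS_MIN[ctrl.select_action(0)])
```

With six actions, each table stores six floating-point values per state, so the memory footprint grows linearly with the number of discretized states.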

VII. CONCLUSION
The study presented a reinforcement learning principle designed to optimize energy management in IoT devices and experimentally tested a hardware model for such a device. The model consisted of a TEG energy harvesting subsystem with a dc/dc converter, a load module with an MCU, and a LoRaWAN communications interface. The device followed a reward strategy which compared its current charge status to the charge status in the previous learning step.
The study also presented a DQL-based approach with configurable learning parameters. The results showed that the best-performing DQL controller operated with a 98.5% success rate derived from the ratio of completed/missed operation cycles. The novelty of the solution was discussed in relation to state-of-the-art methods and their properties.
Future work includes two possible directions. First, because the proposed approach demonstrated its ability to adapt to the surrounding environment and the specific application, DQL-based methods applied in other domains could be evaluated with simulations that use other hardware or large data sets from a range of deployment locations. The second research opportunity involves the long-term deployment and evaluation of an IoT device to study the differences between simulated data and real-life data, which contain several observed parameters (e.g., disturbances, malfunctions, and temperature changes).

Manuscript received 27 October 2022; revised 21 April 2023 and 18 May 2023; accepted 5 June 2023. Date of publication 7 June 2023; date of current version 24 October 2023. This work was supported in part by the "Development of Algorithms and Systems for Control, Measurement and Safety Applications IX" of the Student Grant System, VSB-TU Ostrava under Project SP2023/009, in part by the "Development of a System for Monitoring and Evaluation of Selected Risk Factors of Physical Workload in the Context of Industry 4.0" of the Technology Agency of the Czech Republic under Project FW03010194, and in part by the European Union's Horizon 2020 Research and Innovation Programme under Grant 856670. (Corresponding author: Michal Prauzek.)

Algorithm 1: DQL Algorithm Adapted to the Controller of an IoT Node
Initialize Q(s, a), Q_A(s, a), and Q_B(s, a) for each s ∈ S, a ∈ A
while true do
    Wake up
    Observe R, s
    if UpdateA then
        Define

Fig. 3. Shape of the input and output fuzzy sets representing the current and previous SoES, with next period duration output.

Fig. 4. Hardware model for an IoT device composed of a TEG, dc/dc converter, and load.

Fig. 5. Performance for varying learning parameters (α, γ): (a) number of completed cycles; (b) number of incomplete cycles (failures); and (c) ratio of completed/missed cycles.

Fig. 6. (a) Performance comparison of the reference solutions [1] and DQL solution, indicating completed and missed cycles. (b) Detail of the upper graph indicating the best DQL controllers.

Fig. 7. Seven α_R configurations for determining the algorithm's ability to learn.

Fig. 8. Selected time window for IoT device operation with the best DQL candidate (α = 0.3 and γ = 0.8). The upper part of the chart shows the time parameters V_store, P_good, and active and missed cycles. The lower part of the chart indicates the actions performed (period) over days 1200-1400.

TABLE III CONTROLLER CANDIDATES WITH THE BEST PERFORMANCE BASED ON THE RATIO OF COMPLETED/MISSED CYCLES

TABLE IV ABLATION STUDY AND REFERENCE SOLUTION BASED ON FUZZY LOGIC COMPARISON

number of missed cycles. P_GOOD indicates the percentage of time during which the IoT node works properly. The best-performing controllers have a P_GOOD in the range from 77.7% to 79.6%.

TABLE V SEVEN α_R CONFIGURATIONS FOR DETERMINING THE ALGORITHM'S ABILITY TO LEARN IN TIME

TABLE VII COMPARISON OF FEATURES IN STATE-OF-THE-ART METHODS AND THE PROPOSED APPROACH