Deep Reinforcement Learning on FPGA for Self-Healing Cryogenic Power Amplifier Control

Wireless sensing and communication for space exploration in areas inaccessible to humans often suffer from severe performance degradation due to cryogenic effects on the transmitters' circuits. To survive extreme temperatures, programmable radio frequency (RF) power amplifiers (PAs) can be built into the transmitter, and intelligent PA controllers need to be integrated into the system to interact with the environment and restore the PA's functionality. This problem can be modeled as a controller acting on an environment (configuring the PA) to maximize a reward (signal quality), which makes reinforcement learning (RL) a natural solution. This paper presents a cryogenic and energy-efficient RL module on a Field Programmable Gate Array (FPGA) that can directly program the PA. By characterizing a self-healing PA in a liquid nitrogen environment, we generated an RF signal data set and built an interactive RL environment that models the PA's behavior across its configurations and cryogenic temperatures down to −197 °C. We developed a deep RL model with the high generalization capability introduced by neural networks to control the PA and restore its performance. The RL model, with fixed-point training and inference, is implemented on an FPGA to survive cryogenic conditions and carry out fast, low-power training and inference for PA control. All functionalities of the programmed FPGA operate correctly in the cryogenic testing environment.


I. INTRODUCTION
Space exploration tasks require wireless communication devices that can operate in extreme temperatures. Off-the-shelf electronic devices typically have an industrial temperature rating between −45 °C (228 K) and 80 °C (353 K), which is inadequate for the extreme temperatures encountered during space missions. For example, the lunar surface has an equatorial noon temperature of approximately 397 K (124 °C) and a polar region minimum temperature of around 50 K (−223 °C) [1]. To ensure reliable communication in these environments, it is necessary to develop wireless transceivers that can mitigate electronic failures and performance degradation caused by cryogenic effects.
The power amplifier (PA) is a critical component of wireless communication systems, and previous studies have shown that cryogenic effects can significantly impact its operation by changing transistor parameters and causing performance deviations from nominal operating characteristics [2], [3], [4]. Although few works in the literature address PA design for environments with extreme temperatures, a significant amount of research has been done on cryogenic low-noise amplifier (LNA) design for very low-noise applications, either through the use of alternative technologies like SiGe HBTs or by incorporating cryogenic MOSFET modeling data into the design process [5], [6], [7], [8], [9], [10]. In this paper, we propose leveraging a reinforcement learning agent to program a highly configurable PA design that can operate across a wide temperature range. By using an RL-based approach, we can optimize the PA's performance across various cryogenic temperatures and ensure reliable communication in space exploration missions.
The primary temperature-dependent MOSFET parameters that significantly impact large-signal operation and PA behavior are threshold voltage and mobility. Prior works on cryogenic MOSFET characterization consistently report significant increases in both parameters at low temperatures across CMOS technologies [11], [12]. To allow the PA to accommodate these changes, the main PA transistor is sliced into nine nominally enabled and three redundant selectable elements. RDACs are used to apply forward body bias to counter changes in threshold voltage. Fig. 1(a) illustrates the simplified programmable PA. In this PA, there are four controllable parameters: the number of enabled elements, the supply voltage (V_dd), the bias voltage applied at the gate (V_g), and the RDAC-controlled body voltage (V_b). The system diagram of a transmitter equipped with such a PA is shown in Fig. 1(b).
Because it is unknown to the user which combination will produce the optimal performance in the operating environment, a controller is needed to characterize and configure the PA subsets to achieve maximum performance in different environments. Although our study focuses only on the effect of temperature on the PA, other factors like circuit noise, humidity, and device aging also influence the PA's characteristics, and they are unpredictable and difficult to characterize after deployment. To eliminate the need for human intervention to tune the PA after deployment, the on-device controller is required to actively learn the PA's characteristics and control the PA to adapt to new environments by itself. We believe this problem is best modeled as reinforcement learning (RL), which requires minimal human supervision to perform the task. In this work, we present an RL-based PA controller on FPGA, which reconfigures the PA and recovers its nominal performance in extreme temperatures.
FPGAs are among the trending platforms for machine learning applications. They also have a proven ability to tolerate cryogenic temperatures and are used in many cryogenic instruments and tasks as major computing or control units [13], [14], [15]. The works in [13] and [15] performed a series of experiments demonstrating that all FPGA components maintain their functionality at temperatures down to 4 K (−269 °C). Our work uses the same FPGA (Artix-7 XC7A100T) as the hardware platform to carry out energy-efficient RL agent training and inference. Our testing results also show that the programmed FPGA's functionality is unaffected by liquid nitrogen temperatures.
The key contributions of this work are as follows:
• To the best of the authors' knowledge, this is the first work to demonstrate using reinforcement learning to control a self-healing combinatorial CMOS device. This paper presents the RL model, feature selection, and reward function implementation in detail.
• To prove the idea of using reinforcement learning to control the PA, we built a large synthetic data set and RL environment covering 2,400 PA states at each of the 12 measured temperature points from +80 °C to −197 °C by characterizing a test circuit at liquid nitrogen temperatures.
• An FPGA implementation of this deep reinforcement learning task is presented to support fast and energy-efficient training and inference of the RL agent. While it is common to fine-tune neural network models on edge devices using stochastic gradient descent (SGD), our FPGA implementation can train the model from scratch with fixed-point model parameters using the Adaptive Moment Estimation (Adam) optimizer [16], which has better convergence but higher computational complexity.
The remainder of this paper is organized as follows. Section II introduces the self-healing PA and describes the RL data set and environment that model the RF PA's behaviors in liquid nitrogen temperatures. The implementation of the RL model is presented in Section III, where the feature selection, model parameters, and reward functions are discussed in detail. In Section IV, we discuss and evaluate the FPGA implementation of the RL model with the Adam optimizer and quantized training using fixed-point model parameters. Section V concludes the paper with future directions.

II. SELF-HEALING PA DATA SET AND RL ENVIRONMENT

A. CONFIGURABLE PA SYNTHETIC DATA SET
The schematic of the configurable CMOS PA is shown in Fig. 2. A linear Class-A driver stage drives a highly configurable Class-AB main stage with off-chip input and output matching networks tuned for 10.5 GHz. Off-chip microstrip components were simulated in Sonnet Lite to generate S-parameter data files for use in simulation with a 65 nm PDK. With the designed off-chip networks and the PA set to the nominal configuration (nine enabled transistor elements, body bias disabled), the amplifier exhibits a P1dB of 12.4 dBm with the third harmonic 49.3 dB below the fundamental while consuming 256 mW in the main stage. The PA's transfer function can be altered by selecting one or more elements (from 1 to 12), adjusting the supply voltage V_dd and bias voltage V_g, and setting the on-chip 3-bit RDAC levels (0 V-1.05 V). To obtain data on transistor behavior under cryogenic conditions, we characterized the transistor circuit shown in Fig. 3 in a Sun Systems EC1A temperature chamber. We obtained the current-voltage (I_ds vs. V_ds) curves across selected V_gs values at different temperatures to modify the BSIM4 [17] model card parameters from the PDK for each temperature. These parameters were obtained by fitting BSIM1-style drain current models for the triode and saturation regions to the measured data using a mixture of the 3-point Hamer method [18] and global optimization in MATLAB; a representative form of such a model is shown below.
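For orientation, a representative BSIM1-style drain current model with first-order mobility degradation is given below. This is an assumed, commonly used form for illustration, not necessarily the exact expressions fitted in this work:

```latex
% Representative BSIM1-style drain current model (assumed form for illustration)
\begin{align}
  \beta &= \frac{\mu_0 C_{ox} (W/L)}{1 + \theta\,(V_{gs} - V_{th})} \\
  I_{ds,\mathrm{triode}} &= \beta \Big[(V_{gs} - V_{th})V_{ds} - \tfrac{1}{2}V_{ds}^{2}\Big],
      \quad V_{ds} < V_{gs} - V_{th} \\
  I_{ds,\mathrm{sat}} &= \tfrac{1}{2}\,\beta\,(V_{gs} - V_{th})^{2},
      \quad V_{ds} \ge V_{gs} - V_{th}
\end{align}
```

Here the threshold voltage V_th and the low-field mobility μ_0 (through β) are the primary temperature-dependent fitting targets, consistent with the parameters highlighted in Section I.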
These BSIM1-style extracted model parameters were then used as a basis for modifying the existing BSIM4 model parameters for the RF NMOS transistors to enable data set generation for temperatures down to 77 K.
Ideally, V_dd and V_g are continuous values, and each enabled element in the PA has unique internal parameters. However, it would be difficult to simulate such a data set exhaustively due to limitations of simulation time and computer storage. In this work, we restrict V_dd and V_g to a set of typical discrete values, and all enabled elements share the same internal parameters in each PA configuration. The discrete parameter values used in the simulation are the following (a sketch of how the resulting configuration space can be enumerated appears after this list):
• Number of enabled elements: from one to 12.
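The following minimal sketch illustrates how such a discrete configuration space is swept to generate one data-set entry per configuration and temperature. The grid sizes are chosen only so the count matches the 2,400 states per temperature mentioned earlier; the actual discrete V_dd, V_g, and RDAC values used in this work are not reproduced here.

```python
from itertools import product

# Hypothetical parameter grids: 12 x 5 x 5 x 8 = 2,400 states per temperature.
# The specific voltage values below are placeholders, not the paper's values.
N_ELEMENTS = range(1, 13)                  # number of enabled PA elements
VDD_GRID   = [1.0, 1.1, 1.2, 1.3, 1.4]     # supply voltage (V), placeholder
VG_GRID    = [0.4, 0.5, 0.6, 0.7, 0.8]     # gate bias voltage (V), placeholder
RDAC_GRID  = range(8)                      # 3-bit RDAC code (0..7)

def enumerate_configurations():
    """Yield every discrete PA configuration as (n_elements, vdd, vg, rdac)."""
    yield from product(N_ELEMENTS, VDD_GRID, VG_GRID, RDAC_GRID)

if __name__ == "__main__":
    configs = list(enumerate_configurations())
    # Each configuration is simulated at every measured temperature point to
    # produce one record of output power, efficiency, and harmonic distortions.
    print(f"{len(configs)} configurations per temperature")   # -> 2400
```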

B. RL ENVIRONMENT WITH PA STOCHASTIC FAILURES
Based on the collected data, a synthetic, interactive PA environment was built to verify the RL model's performance. The environment simulates the behavior of the PA at ambient temperatures from +80 °C to −197 °C. The inputs to the environment are the current temperature and the discrete PA parameters, and the output is the PA's current state (feedback features). The RL model can directly interact with the environment during learning by altering the configurable PA parameters and receiving feedback from the environment.
Under extreme temperatures and radiation effects, we expect large variations of the PA's transistor parameters and assume that the PA elements and other circuits may experience unexpected fail-to-configure circuit failures, causing the PA to enter faulty states. The PA controller is expected to detect the subnominal state of the output signal, learn to configure the PA to achieve better performance, and recover the PA from these noisy faulty states.
To make the environment more comprehensive, we assumed several failures the PA could encounter and added them to the synthetic environment. To model these electrical failures and test the RL agent's ability to recover the PA's state in unexpected scenarios, we assigned each controllable PA parameter a stochastic property that models random failures of the PA element. After the controller takes an action to change a PA parameter, the environment runs a function that models the PA's stochastic behavior to determine the final values of the PA parameters. Two parameters control this stochastic behavior: alpha (α) and sigma (σ). α is the probability that the configuration process for the PA parameters has completely failed and the PA has entered a totally random state. σ is the standard deviation of a normal distribution centered at 0. During interaction with the environment, random numbers are sampled from this normal distribution to determine how far the actually set values are from the desired values. In these faulty cases, the controller needs to reconfigure the PA and restore its performance.
An example of this stochastic failure modeling using normal distribution functions is shown in Fig. 4. In this example, the controller's objective is to set the PA's V_dd = 1.2 V. Due to stochastic failures, the actual value being set may move up or down by some number of discrete units, determined by the normal distribution function that governs this stochastic behavior. We set fixed zones in the distribution to determine how the continuous sample maps to discrete value changes. The probabilities of the sampled random number landing in different zones are determined by the standard deviation of the normal distribution, and they can be altered by changing the stochastic parameter σ of each configurable PA parameter. Fig. 4 provides two plots with different σ values. The effects of the zones based on the sampled number are as follows (a code sketch of this model follows the list):
• Within ±1: set value = desired value (1.2 V).
• Between ±1 and ±2: set value = 1 unit away from the desired value (1.08 V or 1.32 V).
• Beyond ±2: set value ≥ 2 units away from the desired value (0.96 V or 1.44 V).
In this work, α is set to 0.01 and σ is set to 1. The RL environment is more comprehensive and challenging with this stochastic PA behavior. We expect a robust controller to be able to characterize the PA with stochastic failures in this environment and restore its performance in arbitrary scenarios.
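A minimal sketch of this stochastic failure model is given below, assuming the ±1/±2 zone thresholds described above; the function and parameter names are illustrative rather than the paper's implementation.

```python
import random

ALPHA = 0.01   # probability the configuration completely fails (random state)
SIGMA = 1.0    # std. dev. of the zero-mean normal controlling partial failures

def apply_action(desired_index, n_levels, rng=random):
    """Return the discrete level actually set for one PA parameter.

    desired_index : index of the desired discrete value (e.g., Vdd = 1.2 V)
    n_levels      : number of discrete levels available for this parameter
    """
    # Total configuration failure: the PA lands in a completely random state.
    if rng.random() < ALPHA:
        return rng.randrange(n_levels)

    # Partial failure: a sample from N(0, sigma) decides how many discrete
    # units the set value drifts from the desired value.
    z = rng.gauss(0.0, SIGMA)
    if abs(z) <= 1.0:
        offset = 0                      # within +/-1: set value = desired value
    elif abs(z) <= 2.0:
        offset = 1 if z > 0 else -1     # between +/-1 and +/-2: one unit away
    else:
        offset = 2 if z > 0 else -2     # beyond +/-2: two (or more) units away

    return min(max(desired_index + offset, 0), n_levels - 1)
```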

III. REINFORCEMENT LEARNING MODEL
In this work, the problem is defined as follows: the PA is exposed to the environment, and the controller learns from the PA's behavior and controls the PA to achieve nominal or above-nominal performance. We believe this problem is best solved with reinforcement learning methods. Deep reinforcement learning, specifically the deep Q-network (DQN) [19], [20], was first proposed to derive efficient representations of the environment from sensory inputs and generalize experience to new situations. It was demonstrated to be capable of human-level performance on Atari video games. In DQN, the agent is trained to take specific actions in certain states to maximize rewards from an environment, which directly fits the concept of our work. In this section, we discuss the RL development for the PA controller system and present experimental results.

A. INPUT FEATURE SELECTION
This study evaluates the quality of the PA configurations by examining the output signal's power level, power efficiency, and nonlinearity. As the work is conducted via simulation, the feedback features are currently extracted from the FFT of collected transient signals. In a more realistic setup, the PA power efficiency and nonlinearity features could instead be obtained using power and intermodulation distortion information from analog sensors such as current monitors, envelope detectors, and out-of-band leakage power detectors, which add less hardware overhead to the system than an FFT implementation.
The FFT-based feedback features used in this work are as follows:
• RF output power (dBm)
• Power efficiency (dB), computed as P_RF (dBm) − P_DC (dBm)
• The second, third, and fifth harmonic distortions (dBc)
RF output power and power efficiency are the two most straightforward metrics for evaluating PA performance; they determine the communication range and the PA's power consumption. The three most prominent harmonic distortion features are directly related to the nonlinearity of the output signal. At each temperature, the goal is to maximize the RF output power and power efficiency while minimizing undesirable harmonic distortions. Histograms of the output features' distributions at three temperatures are shown in Fig. 5 to illustrate the selected features and the differences between PA configurations. A general trend is that the PA exhibits better linearity at lower temperatures but suffers from power degradation.
The controller (RL agent) also needs to know the current temperature and PA parameters to decide how to control the PA. Therefore, the input to the controller includes the current ambient temperature (°C), the four current control register values, and the five feedback signal features, and the output of the controller is the action that changes the PA parameters. A sketch of this input vector is shown below.
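This sketch assembles the 10-element observation vector described above; the element ordering is an assumption, as it is not specified here.

```python
import numpy as np

def build_observation(temp_c, pa_registers, feedback):
    """Assemble the controller's 10-element input vector.

    temp_c       : ambient temperature in degrees C
    pa_registers : the 4 current control values (n_elements, vdd, vg, rdac)
    feedback     : the 5 FFT-based features (P_rf, efficiency, HD2, HD3, HD5)
    """
    assert len(pa_registers) == 4 and len(feedback) == 5
    return np.array([temp_c, *pa_registers, *feedback], dtype=np.float32)
```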

B. REWARD FUNCTION
The selection of the reward function is important, as it is the only way to provide appropriate feedback to the RL agent about the actions it takes. We evaluate the reward for the agent's actions using the five previously mentioned feedback features: RF output power, power efficiency, and the second, third, and fifth harmonic distortions. Upon the RL agent's action, the PA's current output feedback features are compared to its previous feedback features and to the feedback features of the nominal operating condition. The nominal operating condition of the PA is defined as follows:
The general rule of thumb for designing the reward function is that the controller should be encouraged to drive the PA to better performance than the nominal state, and each step the controller takes should produce better results than the previous state. Therefore, for each feedback feature, the feature's reward is designed as the sum of two terms: a progress reward and a performance reward. The progress reward evaluates how much progress has been made by taking the new action. The performance reward evaluates the quality of the new output signal compared to the nominal-state output signal. The simplified reward function is depicted as follows, where α_i represents the weight of each feedback feature in the reward calculation; currently, α_i = 1 and all feedback features share the same weight. Fig. 6 provides an example of how the reward for the RF power feature is calculated. Note that the feature values related to harmonic distortions (lower = better) are inverted during the reward calculation for simplicity.
In equation (5), the performance reward is always a nonpositive value. This prevents the agent from occasionally receiving excessive rewards and encourages it to seek progress rewards instead. For example, without the min() function, the agent could take an action and still receive a high reward if the new harmonic distortion performance is much better than the nominal operating condition while the RF power is subnominal. This situation should be avoided because the agent should seek a configuration with balanced performance in every aspect. The performance reward does not give a positive reward when a feature exceeds nominal performance, but any shortcoming in the new output brings down the final reward. The progress reward in equation (4) prompts the controller to take better actions at each step so that the PA arrives at a better state and earns positive rewards. A few exceptions apply to the progress reward calculation in equation (4); they prevent reward overestimation for a similar reason as the performance reward calculation. The carefully designed reward function provides appropriate feedback to the controller and is used throughout the training process. A code sketch of the per-feature reward follows.
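A minimal sketch of the per-feature reward described above, assuming a simple difference for the progress term and a difference against the nominal state, clamped by min(), for the performance term; the exact equations (4)-(5) and their exception cases are not reproduced here.

```python
def feature_reward(new, prev, nominal, weight=1.0, lower_is_better=False):
    """Reward contribution of one feedback feature (e.g., RF power or HD3)."""
    # Harmonic-distortion features are inverted so that higher always means better.
    if lower_is_better:
        new, prev, nominal = -new, -prev, -nominal

    progress_reward = new - prev                  # did this action improve the feature?
    performance_reward = min(new - nominal, 0.0)  # never positive: shortcomings vs. the
                                                  # nominal state pull the reward down
    return weight * (progress_reward + performance_reward)

def total_reward(new_feats, prev_feats, nominal_feats, lower_is_better_flags):
    """Sum the per-feature rewards (all weights equal to 1 in this work)."""
    return sum(
        feature_reward(n, p, m, lower_is_better=flag)
        for n, p, m, flag in zip(new_feats, prev_feats, nominal_feats,
                                 lower_is_better_flags)
    )
```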

C. MODEL EVALUATION
The RL agent network is an empirically determined fully connected neural network with three hidden layers of 128 neurons each. The agent network selects an output from the one-hot encoded output array, allowing one controllable parameter to change value at each step. The Adam optimizer [16] is used in the model, as our tests showed that it converges much better than SGD. All hyper-parameters of the Adam optimizer are set to their default values during training. The training process simulates the deployed PA controller system at various temperatures, with the RL agent acting as the controller interacting with the PA in the synthetic environment. An experience replay buffer that stores intermediate action-output observation transitions is used for DQN training. During training, batches of 32 transitions are randomly sampled from the replay buffer to train the agent. An episode of controller-PA interaction is defined as the PA being initialized in a random state and the controller configuring the PA from that state until a terminal state is reached. The terminal state is reached when (1) the controller's new action makes no difference to the PA parameters, which indicates no further action is required, and (2) the previous and new feedback features are close, which indicates the PA is in a stable state. A sketch of the agent network and one DQN update step follows.
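The following PyTorch sketch shows the agent network and a single DQN update step under the hyperparameters stated above. The action-space size, discount factor, and the smooth-L1 (Huber) loss are assumptions for illustration; only the 10-dimensional input, the 3×128 hidden layers, the batch size of 32, and the default-hyperparameter Adam optimizer come from the text.

```python
import torch
import torch.nn as nn

N_FEATURES = 10    # temperature + 4 control registers + 5 feedback features
N_ACTIONS  = 8     # size of the one-hot action array (illustrative assumption)
GAMMA      = 0.99  # discount factor (assumed; not stated in the text)

class QNetwork(nn.Module):
    """Empirically chosen three-hidden-layer fully connected network, 128 neurons each."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_FEATURES, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, N_ACTIONS),
        )

    def forward(self, x):
        return self.net(x)

policy_net = QNetwork()
target_net = QNetwork()
target_net.load_state_dict(policy_net.state_dict())
optimizer = torch.optim.Adam(policy_net.parameters())  # default Adam hyperparameters

def dqn_update(batch):
    """One training step on a batch of 32 transitions sampled from the replay buffer."""
    state, action, reward, next_state, done = batch
    q_sa = policy_net(state).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_state).max(dim=1).values
        target = reward + GAMMA * q_next * (1.0 - done)
    loss = nn.functional.smooth_l1_loss(q_sa, target)  # Huber loss (assumed choice)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```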
The metric for evaluating the controller's performance is the percentile rank of the final arrived-at PA state. After the model is trained to convergence, the controller is tested at each temperature for 1,000 episodes with random initial states. In each episode, the performance of the final arrived-at PA state is compared against all available configurations, and its percentile rank is obtained. A percentile rank of 100 indicates that the final configuration is the optimal configuration; a percentile rank of 0 indicates the configuration with the worst performance at the current temperature (a code sketch of this metric is given at the end of this subsection).
Using the currently available data set and synthetic PA environment, we set up several scenarios to exercise different training strategies for the controller and report the testing results. The scenarios are as follows:
1) The PA and controller are exposed to each temperature in the synthetic environment. The controller actively learns at each temperature, and its performance is tested at each temperature.
The results are summarized as box plots in Fig. 7. In scenarios 1, 2, and 3, even with the modeled faulty behavior of the PA in extreme temperatures, the controller can still recover the PA's performance. The controller is able to find configurations with average percentile ranks better than the nominal configuration (except at 20 °C, where the nominal configuration receives the highest rank). Scenarios 2 and 3 indicate that although the controller has only learned at intermittent temperatures, the trained model can still generalize to other temperatures and configure the PA to achieve beyond-nominal performance. In this work, we define scenario 2 as the baseline scenario, where the controller learns at interleaved temperatures in the data set with temperature intervals of 40 °C. Fig. 8 shows the average training metrics of the controller's interaction with the environment every 2,500 episodes for the baseline scenario. During learning, the average reward earned per action increases over the episodes, indicating that the controller's ability to guide the PA successfully evolves over time.
Table 1 provides a real example of the trained RL controller reconfiguring the PA elements to recover the PA's performance in the −197 °C testing environment. The PA was initialized to the nominal configuration, which exhibits high performance at room temperature but behaves poorly at −197 °C. The controller took actions to reconfigure the PA, and the final configuration's performance surpasses the nominal configuration in every aspect, with a percentile rank of 94.08. As mentioned in Section II-B, the synthetic environment also models stochastic circuit failures. Table 2 provides a case where a fail-to-configure event happened during the reconfiguration. The controller was still able to control the PA and achieve beyond-nominal performance, with a percentile rank of 99.08.
Fig. 7(d) shows that if the controller only learns at room temperature, its ability to generalize the learned information to other temperatures is limited. Therefore, the controller must be equipped with online training to constantly update itself in new environments. In the next section, we present the energy- and resource-efficient implementation of the model with online training on FPGA.
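As a reference for the evaluation metric, the following sketch computes the percentile rank of a final configuration, assuming each configuration at the test temperature has already been reduced to a scalar performance score; how the five features are combined into that score is an assumption, not specified in the text.

```python
def percentile_rank(final_score, all_scores):
    """Percentile rank of the final arrived-at configuration.

    all_scores : performance scores of every available configuration at the
                 current temperature, including the final configuration itself.
    Returns 100 for the best configuration and 0 for the worst (no ties assumed).
    """
    worse = sum(score < final_score for score in all_scores)
    return 100.0 * worse / (len(all_scores) - 1)
```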

IV. FPGA IMPLEMENTATION
FPGAs are widely used in ML applications for their programmability, high performance, and low power. Commercial FPGA devices have been tested in cryogenic and radiation environments [21], [22], and space-grade FPGAs have also become available for various applications [23], [24]. This section presents a resource- and energy-efficient implementation of the RL model on FPGA.

A. FIXED-POINT REPRESENTATION
As the controller of the PA, the RL module on FPGA is expected to be deployed near the PA and is therefore extremely area- and energy-constrained. The deep reinforcement learning model, specifically the DQN in this work, is inherently less hardware-friendly than common supervised learning. First, the DQN model contains two copies of the neural network parameters, one for the policy network and another for the target network. This doubles the memory requirement compared to supervised learning, where all updates are made in place on a single neural network. Second, the Adam optimizer used in this work promises good convergence but has extra computational overhead compared to simpler optimizers like SGD. The Adam optimizer's parameter update in an iteration is shown in equation (6), reproduced below for reference. With the Adam optimizer, each model parameter effectively has an independent learning rate that is recalculated in every iteration. Adam tracks m_t and v_t, the first- and second-moment estimates of the gradient for each model parameter. This not only adds two more parameter-sized state tensors to store but also adds extra computation to the model. Therefore, training deep learning models with such optimizers is usually done on high-performance computers, and it is challenging to implement on low-power devices.
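For reference, the standard per-iteration Adam update from [16], the form referred to here as equation (6), is:

```latex
% Standard Adam update (Kingma & Ba [16]) for parameter theta with gradient g_t
\begin{align}
  m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
  v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2} \\
  \hat{m}_t &= m_t / (1-\beta_1^{t}), \qquad \hat{v}_t = v_t / (1-\beta_2^{t}) \\
  \theta_t &= \theta_{t-1} - \alpha\, \hat{m}_t / \big(\sqrt{\hat{v}_t} + \epsilon\big)
\end{align}
```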
To address the memory and computational requirements of a resource- and energy-efficient FPGA implementation, we use fixed-point representations for all model parameters and computations. Reducing the bit-width of the model parameters with sub-32-bit number representations such as half-precision or fixed-point directly reduces memory usage. Additionally, fixed-point arithmetic is much more efficient in hardware than floating-point arithmetic, and the FPGA has dedicated hardware such as DSP slices to accelerate the computation. However, compared to the floating-point format, the fixed-point format usually comes with a much lower dynamic range and precision. Problems such as inaccurate or vanishing gradients caused by low precision and numerical underflow can easily occur during training. Therefore, training neural network models with fixed-point representations requires extra effort to tune the model parameters. In recent years, quantizing trained deep learning models for inference using integer or fixed-point numbers has become popular [25], [26], but these techniques are not yet well adapted for training. Reference [27] shows that when neural network parameters have limited bit-width, even fine-tuning pre-trained neural networks with quantized fixed-point parameters becomes difficult. Training neural network models with fixed-point representations usually needs a higher bit-width. Reference [28] demonstrates that, given sufficient bit-width (e.g., 16 bits), training with fixed-point numbers can perform well on some open-source data sets.
Several adjustments are required to adapt the RL model in this work to efficient fixed-point training. First, the computational efficiency of the Adam optimizer needs to be improved for fixed-point operations, and some hyperparameters require adjustment. Then, appropriate bit-widths need to be determined for the model parameters and computations to achieve near-floating-point performance.

B. ADAM WITH LOOKUP TABLE
As suggested in [16], the per-iteration update efficiency of Adam can be improved by changing the last three lines of equation (6) into the reordered form in equation (7). However, the remaining computation is still expensive. We propose replacing α_t = α · √(1 − β_2^t)/(1 − β_1^t) with a lookup table, which eliminates the power-of-t, square root, and division operations and improves the accuracy, efficiency, and latency of the model parameter updates. Fig. 9 shows how α_t changes as the optimizer iterates 5,000 times. By default, β_1 = 0.9 and β_2 = 0.999. α_t asymptotically approaches α, the preset learning rate. After 3,916 iterations, α_t exceeds 0.99·α and from that point can be approximated as α_t = α. We use a lookup table to fetch α_t for the first 4,096 optimizer updates; after 4,096 iterations, α_t is taken as α without computation. We further decimated (down-sampled) the α_t sequence over the first 4,096 iterations by 4× to reduce the lookup table to 1,024 entries while preserving the α_t estimate. The lookup table access pattern is shown in Fig. 10 and sketched in code after this paragraph. There are 1,024 entries in the lookup table. Assuming the iteration number t is represented as a 16-bit unsigned integer, the lookup table access index is t[15:2], which drops the 2 least significant bits in accordance with the decimation by 4. We observed no performance loss with this trick, and more aggressive decimation factors such as 8 and 16 also work well and save additional resources. With an 18-bit fixed-point representation, a single 18 Kb BRAM block implements the lookup table, avoiding the excessive resource usage and long latency of computing α_t and improving the efficiency of the Adam optimizer. A few additional hyperparameters are adjusted in the fixed-point version of the Adam optimizer in equation (7): ε (the numerical stability term that prevents divide-by-zero), β_2 (the exponential decay rate for the second-moment estimate), and α (the learning rate).
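The sketch below shows how the decimated α_t lookup table can be generated offline and indexed at run time, assuming the decimate-by-4 scheme described above; quantization to the FPGA's fixed-point format is omitted, and the table contents are what would be loaded into the BRAM.

```python
import math

BETA1, BETA2 = 0.9, 0.999
ALPHA = 2e-3          # preset learning rate (fixed-point training uses 2e-3; see text)
LUT_DEPTH = 1024      # 4,096 iterations decimated by 4

def alpha_t(t):
    """Bias-corrected step size from the reordered Adam update (equation (7))."""
    return ALPHA * math.sqrt(1.0 - BETA2**t) / (1.0 - BETA1**t)

# Offline: one entry per 4 iterations; this table would be initialized in BRAM.
ALPHA_LUT = [alpha_t(4 * i + 1) for i in range(LUT_DEPTH)]

def lookup_alpha_t(t):
    """Run-time step size: LUT access for t < 4096, plain alpha afterwards."""
    if t >= 4 * LUT_DEPTH:
        return ALPHA          # alpha_t > 0.99 * alpha beyond ~3,916 iterations
    return ALPHA_LUT[t >> 2]  # drop the 2 LSBs, matching the decimate-by-4 access
```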
Fixed-point numbers are prone to arithmetic underflow due to their limited precision. ε has a default value of 1e-8 in the FP32 version of the model; however, this value underflows to zero in the fixed-point format unless more than 27 fractional bits are used. Also, when the denominator underflows, the update term m_t/(√v_t + ε) effectively becomes m_t/ε, which overflows when ε is a small number near zero. To address these problems, in the low-precision fixed-point version of Adam, we increase the fixed-point ε to 1e-2 to meet the low-precision conditions and improve numerical stability. For a similar reason, to prevent underflow of (1 − β_2), where β_2 has a default value of 0.999 in the v_t update of equation (6), β_2 is set to max(1 − 2^(−fractional bits), 0.999) in the fixed-point implementation.
The fixed-point training process can also produce vanishing gradients (g_t) due to arithmetic underflow, which in turn diminishes the biased moment estimates (m_t and v_t) in equation (6). Therefore, during fixed-point training, we increase the learning rate from the default 1e-3 to 2e-3 to compensate for these precision-related effects in equations (6) and (7).

C. TRAINING WITH FIXED-POINT NUMBERS
The fixed-point version of the model is implemented with Xilinx Vitis HLS. The Xilinx fixed-point number representation has the format <W, I>, where W is the total number of bits, I is the number of integer bits (before the binary point) including the sign bit, and (W − I) is the number of fractional bits after the binary point. In this work, the quantization mode is set to round-to-nearest, and the overflow mode is set to saturation. The values of W and I are determined empirically. A software model of this quantization behavior is sketched below.
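This is a sketch of the <W, I> format for intuition, assuming ap_fixed-style behavior with round-to-nearest quantization and saturation; it is not the HLS implementation itself.

```python
def quantize(x, W=18, I=9):
    """Quantize x to a signed <W, I> fixed-point value (round-to-nearest, saturate).

    W : total bits, I : integer bits including the sign bit,
    W - I : fractional bits.
    """
    frac_bits = W - I
    scale = 1 << frac_bits
    q = round(x * scale)                     # round-to-nearest
    q_max = (1 << (W - 1)) - 1               # largest representable code
    q_min = -(1 << (W - 1))                  # most negative representable code
    q = max(q_min, min(q_max, q))            # saturation instead of wrap-around
    return q / scale

# Example: the smallest positive step of the symmetric <18, 9> format is
# 2**-9 ~= 0.00195, which motivates the epsilon and beta_2 adjustments above.
```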
With the aforementioned optimizer and hyperparameter adjustments, we swept the fixed-point number format of the model across different bit-widths to explore its bit-width requirement. To simplify the performance vs. bit-width tests, the integer and fractional bits are set symmetrically (W = 2 × I), so the number of bits in the signed integer part equals the number of fractional bits. Because the input features do not share the same unit, no uniform normalization is applied to the input vectors of the neural network. However, we scale down the temperature and feedback-feature elements of the input vector by 2 bits (4×) because these values are usually much larger than the PA parameters (e.g., the lowest temperature in the data set is −197 °C). This gives the input vector elements a similar low-magnitude range and also prevents overflow at the input layer or intermediate layers when the fixed-point representation has a limited dynamic range. In our experiments, the quantized model's parameters were initialized from PyTorch with the same random seed (0) and trained from scratch; the initial values are included in the FPGA bitstream for power-on initialization. Fig. 11 shows the model's performance vs. bit-width for the baseline scenario. The model achieves average percentile ranks above 90 once the bit-width reaches 18 bits. 18-bit is also a sweet spot for Xilinx FPGA implementation, as a single DSP slice supports up to 18-bit × 25-bit fixed-point multiplication. The 18-bit version of the model achieves an average percentile rank of 91.43, a loss of 2.63 compared to the floating-point implementation.

D. PERFORMANCE EVALUATION
Using low-bit-width fixed-point throughout the RL implementation brings the benefit of a smaller memory footprint as well as low-complexity computing units on the FPGA. Therefore, we can unroll the whole model, including two copies of the neural network parameters and the Adam optimizer on a single Arty A7-100T.
The diagram and data flow of the FPGA implementation's major blocks are shown in Fig. 12. Currently, the experience replay buffer is assumed to be an external peripheral, and the controller on the FPGA provides all RL agent inference and learning functionality. The FPGA implementation of the 18-bit fixed-point model has an inference latency of 0.42 ms and a power consumption of 54 mW, enabling fast PA performance recovery in cryogenic environments. The training latency for a batch of 32 transitions is 33.74 ms. The performance metrics of the FPGA implementation are summarized in Table 3. To ensure the proposed cryogenic RL module on FPGA can operate effectively in extreme environments, we tested all functionalities of the controller, including inference, gradient calculation, and model parameter updates. As shown in Fig. 13, the testing was conducted in a liquid nitrogen environment, with the lowest ambient temperature reaching −197 °C and the FPGA's on-die temperature sensor reading −178 °C. The testing results in Table 4 confirm that all functionalities of the controller worked correctly under these extreme conditions, demonstrating the robustness and reliability of the proposed FPGA-based RL module for programming RF PAs for wireless sensing and communication at extreme temperatures.

V. CONCLUSION
This paper introduces a novel self-healing PA system that uses a deep RL-based controller to characterize and control a reconfigurable PA for reliable operation in cryogenic environments. To evaluate the performance of the RL controller, a comprehensive synthetic environment was developed from real-world measurements. The careful selection of input features and the reward function contributed to successful reinforcement learning of the PA controller. Across multiple scenarios, we demonstrated that the controller can drive the PA to beyond-nominal performance at extreme temperatures and that the controller's learned experience generalizes to other conditions. The Adam optimizer promotes convergence of the model and is optimized for fixed-point training on FPGA. Fixed-point model parameters are used throughout the FPGA implementation to deliver an energy- and resource-efficient controller with near-floating-point performance.
Future work will focus on designing a more comprehensive data set covering a wider range of PA combinations with a finer-grained discrete or continuous action space to enable a more thorough modeling of the PA's behavior. Additionally, the RL model will be altered to adapt to the continuous action space for optimal control granularity. Moreover, we plan to fabricate a physical prototype of the PA, incorporating a customized ASIC controller design that can tolerate cryogenic environments for low-power and low-area system implementation.