Artificial Neural Networks for Microwave Computer-Aided Design: The State of the Art

This article presents an overview of artificial neural network (ANN) techniques for microwave computer-aided design (CAD). ANN-based techniques are becoming useful for performing forward/inverse modeling of active/passive components to enhance circuit design. With measured or simulated data of microwave devices, ANNs can be trained to learn relevant microwave relationships that are otherwise computationally expensive to evaluate or for which efficient analytical formulas are not available. Fundamental concepts of ANN structure and training, such as feedforward neural networks (FFNNs), recurrent neural networks (RNNs)/dynamic neural networks (DNNs)/time-delay neural networks (TDNNs), deep neural networks, and neural network training and extrapolation, are described. Knowledge-based neural networks (KBNNs) are described for improving the accuracy and reliability of modeling and design optimization. Various advanced ANN techniques, such as neuro-transfer function (neuro-TF) modeling, neural network inverse modeling, and deep neural network modeling, are discussed. The existing and emerging applications of ANNs in microwave CAD are identified, such as electromagnetic (EM)/multiphysics modeling, modeling of nonlinear circuits and transistors, filter design, very large-scale integration (VLSI) interconnects, oscillator, transmitter, and receiver modeling, and CAD applications involving gallium nitride (GaN) high electron-mobility transistors (HEMTs), wireless power transfer (WPT), microelectromechanical systems (MEMS), and substrate-integrated waveguides (SIWs).


I. INTRODUCTION
ARTIFICIAL neural networks (ANNs) are important techniques for microwave computer-aided design (CAD), used to perform forward/inverse modeling of active/passive components to enhance circuit design [1], [2], [3], [4], [5]. With measured or simulated data of microwave devices, ANNs can be trained to learn relevant microwave relationships that are otherwise computationally expensive to evaluate or for which efficient analytical formulas are not available. By training an ANN using data from electromagnetic (EM)/physics simulations, one can use the trained ANN as a model of the microwave device in place of the EM/physics models, which are typically CPU-intensive, and thereby significantly accelerate circuit design with EM/physics-level accuracy.
ANNs can help address two of the frequently encountered challenges in microwave CAD: one is the high computational cost of forward modeling, and the other is the lack of analytical equations for inverse design. The types of ANNs used in microwave CAD have ranged from fundamental neural networks, such as multilayer perceptrons (MLPs) and radial basis function (RBF) networks, to advanced neural networks, such as knowledge-based neural networks (KBNNs) for handling a reduced amount of training data and deep neural networks for handling a large number of input variables. In this article, we review the state of the art of ANN techniques for microwave CAD, from fundamental neural networks to advanced KBNN, neural network-based inverse modeling, and deep neural networks.
The ANN has also been trained to learn the complex and high-dimensional relationships between inputs and outputs in the inverse problems [73], [74], [75], [76], [77], [78]. In many situations, the inverse input-output relationship in the training data may have non-uniqueness problems, which means that different training samples with the same input values have contradictory output values. A fundamental issue in this research direction is that, for the same input values, it is impossible to have a trained ANN model match contradictory output values simultaneously. Therefore, it is often challenging to obtain an inverse model with a small training error and high model accuracy. Various advanced inverse modeling techniques [79], [80], [81], [82], [83] have addressed this nonuniqueness problem in the microwave area.
The last decade has witnessed great research efforts devoted to the development of neural network models with many hidden layers (known as deep neural network models) [84], [85], [86], [87], [88], [89], [90]. Deep neural networks have been recognized as a very powerful tool for modeling intricate relationships in large datasets [84]. The advantages of deep neural networks have since been leveraged to address the challenges in high-dimensional microwave component modeling [91].
In the subsequent part of this article, we first describe the fundamental concepts of ANN structure and training in Section II. The various KBNN methods are discussed in Section III. The formulation and training algorithm of neuro-TF are described in Section IV. The neural network-based inverse modeling algorithms are reviewed in Section V.
In Section VI, the structure and deep learning algorithm of the various deep neural network techniques for microwave modeling are presented. Several major categories of applications of ANN in microwave CAD are highlighted in the review in Section VII. Section VIII concludes this article and discusses the future directions of ANN in microwave CAD.

II. ANN STRUCTURE AND TRAINING
A neural network consists of a number of neurons and a corresponding set of links connecting them. Different neuron types, or different manners in which the neurons are connected, form different neural network structures [1].

B. Dynamic/Recurrent/Time-Delay Neural Networks
The time-domain dynamic behavior of nonlinear devices or circuits is typically characterized by dynamic neural networks (DNNs) [130], [131], recurrent neural networks (RNNs) [26], [132], [133], [134], [135], [136], and time-delay neural networks (TDNNs) [137], [138], [139]. The DNN is a special type of feedback neural network, which uses a continuous-time formulation to model dynamic microwave behavior. The neural network output of the DNN is a function of the input-output signals and their time derivatives. The RNN is another type of neural network with feedback from output to input. It has the capability of representing a dynamic system's behavior through a suitable learning process. The RNN uses a discrete-time formulation to represent the dynamic behavior of microwave circuits. The neural network output of the RNN is a function of its present input and a history of its input and output. The TDNN is an FFNN whose output is formulated from a time-delayed series of inputs in a discrete-time formulation, as in the RNN. The TDNN is simpler than the RNN, since the output of the TDNN is a function of its present input and the history of only its input.
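The difference between the TDNN and RNN formulations can be illustrated by how the network input vector is assembled from sampled signals. The following is a minimal sketch, not taken from the cited works: the function names `tdnn_regressors` and `rnn_regressors`, the number of delays, and the toy signals are all assumptions made for illustration. Here the output history is taken from recorded data; in a trained RNN it would typically be fed back from the model's own output.

```python
import numpy as np

def tdnn_regressors(u, n_delay):
    """Rows [u(t), u(t-1), ..., u(t-n_delay)]: present input plus input history only."""
    u = np.asarray(u, dtype=float)
    return np.array([u[t - n_delay:t + 1][::-1] for t in range(n_delay, len(u))])

def rnn_regressors(u, y, n_delay):
    """Rows [u(t), ..., u(t-n_delay), y(t-1), ..., y(t-n_delay)]: input and output history."""
    u = np.asarray(u, dtype=float)
    y = np.asarray(y, dtype=float)
    rows = []
    for t in range(n_delay, len(u)):
        past_u = u[t - n_delay:t + 1][::-1]  # u(t) down to u(t-n_delay)
        past_y = y[t - n_delay:t][::-1]      # y(t-1) down to y(t-n_delay)
        rows.append(np.concatenate([past_u, past_y]))
    return np.array(rows)

u = np.arange(6.0)         # toy input samples u(0..5)
y = 10.0 * np.arange(6.0)  # toy output samples y(0..5)
X_tdnn = tdnn_regressors(u, n_delay=2)
X_rnn = rnn_regressors(u, y, n_delay=2)
```

With two delays, each TDNN input vector has three entries, while each RNN input vector has five because it also carries two past output values.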

C. Deep Neural Network
A deep neural network is defined as a neural network with many hidden layers [84], [88], [89], [90], [140], [141], [142], [143]. It has been shown that deep neural networks have much better performance than shallow neural networks (i.e., neural networks with a few hidden layers) in dealing with modeling problems exhibiting very intricate relationships. One type of deep neural networks is the deep MLP [89], [91], [142], [144]. The selection of hidden neuron activation functions in deep MLP could affect the ultimate training accuracy. In deep MLP, one of the most popular activation functions is the rectified linear unit (ReLU). The ReLU helps solve one of the main problems, i.e., the vanishing gradient problem, in training deep neural networks. A convolutional neural network (CNN) is another type of deep neural network [77], [145], [146]. A typical CNN is structured as a series of stages. Each stage usually includes three different layers: a convolutional layer, a nonlinearity layer, and a pooling layer [84], [145]. CNN can be used to learn EM problems where the inputs and/or outputs can be treated in image-like representations [74], [146]. A deep belief network (DBN) is also in the class of deep neural networks. In contrast to the FFNNs, such as deep MLP and CNN, DBNs are hybrid networks that contain both directed and undirected layers [147]. DBN has shown advantages in handling the inverse modeling problem for microwave CAD [148]. Another type of deep neural network is the long short-term memory (LSTM) [149], which is an advanced type of RNN. LSTMs can handle the issue of vanishing gradient encountered during the training procedure of traditional RNNs [149]. LSTM has been introduced into the microwave area for nonlinear device modeling [150] and extrapolation of frequency domain EM responses [151].
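As a rough numerical illustration of why ReLU mitigates the vanishing gradient problem, the sketch below multiplies the activation-function derivatives along a chain of layers. It is a simplified assumption-laden example: weight matrices are ignored, every layer is given the same positive pre-activation value, and the depth of 20 layers is arbitrary.

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid; never larger than 0.25."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive pre-activations, 0 otherwise."""
    return (z > 0).astype(float)

n_layers = 20
z = np.full(n_layers, 0.5)  # same positive pre-activation in every layer (toy assumption)

# Backpropagation multiplies one activation derivative per layer.
sigmoid_chain = np.prod(sigmoid_grad(z))
relu_chain = np.prod(relu_grad(z))
```

In this toy setting the sigmoid chain shrinks by several orders of magnitude per handful of layers, whereas the ReLU chain stays at one for active neurons.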
An advanced knowledge-based modeling approach that integrates neural networks with transfer functions (shortened as neuro-transfer function or neuro-TF) can be used to build effective parametric models of EM structures [54], [55], [56], [57], [58], [59], [60]. This knowledge-based approach can be used even if accurate equivalent circuits or empirical models are unavailable. Further description of neuro-TF is included in Section IV.

E. Discussions for Different Neural Network Structures
Different ANN structures can be chosen for different applications. When the relationship to be modeled is simple and data generation is not expensive, the FFNN is the simplest and most efficient neural network structure to use. A sufficient amount of training data is the key to ensuring the quality of the neural network training and the accuracy of the trained neural network. When data generation is computationally expensive, KBNN becomes more suitable, since the prior knowledge can help decrease the amount of training data needed while maintaining good modeling accuracy. When the equivalent circuit/empirical model is not practically available and the input-output response of the training data is highly nonlinear with sharp ripples with respect to frequency, neuro-TF can be applied and can be trained with a smaller amount of data than pure ANN structures. To address the situation where the dimension of model inputs is high and/or the amount of training data is large, deep neural networks typically become better choices. When the neural network outputs represent time-domain responses, especially those having memory effects, DNN/RNN/TDNN structures are needed.

F. Neural Network Training and Data Generation
A neural network can represent the behavior of a considered microwave device or circuit only after it is well trained with corresponding data [1]. Neural network training involves the use of training data to guide the adjustment of the neural network's internal connection weights (or synaptic connection weights). A separate set of data, called the test data, is used to assess the quality of the trained neural network. The test data should not be used during training, so that the trained neural network is assessed fairly. The process of developing a neural model generally involves six steps. They are as follows.
Step 1): Identify inputs and outputs of the microwave modeling problem.
Step 2): For the identified inputs and outputs, generate training and testing data. Data generation is an important issue for ANN model development. Loosely speaking, an ANN model's accuracy improves as the number of data samples increases. However, generating large numbers of data samples is inevitably time-consuming and resource-demanding (e.g., 3-D EM simulations). Various data sampling strategies have been introduced for ANN model development, such as uniform/nonuniform grid distribution [1], star distribution [37], random distribution [152], Latin hypercube sampling (LHS) [153], [154], adaptive sampling [155], [156], [157], and design of experiment (DoE) with orthogonal distribution [55], [70]. In grid distribution, each input parameter is sampled at equal or unequal intervals. In star distribution, sample points are generated by perturbing a central point twice along each input dimension, once toward the positive direction, and once toward the negative direction. When the input dimension increases, the number of data samples using grid distribution grows exponentially, while the number of data samples using star distribution increases linearly. The grid distribution provides the most complete coverage for training but is not practically usable when the input dimension is high. The star distribution cannot provide sufficient data for pure neural network training, but it can be used for training neural networks combined with prior knowledge and requires the fewest training data among all the sampling methods, especially when the input dimension is high. In the random distribution, each input sample is a vector of random variables within the range of input parameters. LHS improves the uniformity of distributed samples in the sampling space compared with random distribution. Adaptive sampling uses the interaction between the intermediate surrogate model and the available data to adaptively and iteratively build an optimized training dataset.
It can distinguish where data are needed in the input modeling space. Compared with grid sampling and random sampling, adaptive sampling helps to achieve the required model accuracy using the fewest data samples, making the data generation process flexible and efficient. DoE with orthogonal distribution can be used for generating multiple sample points in high-dimensional parameter spaces. In DoE, the subspace divisions are orthogonal and sampled with the same density. The use of orthogonal distribution around the center point allows DoE to have far fewer sampling points than grid distribution does. Due to this, the surrogate model developed by DoE can be valid in a much larger neighborhood than that developed by star distribution. With the increasing model dimension and complexity, data generation in ANN modeling faces new challenges. How to keep a good balance between the cost of data sampling and the accuracy of a model is an ongoing direction to be further addressed in the future. A robust automated model generation (AMG) algorithm [158], [159] for neural network training can be used to significantly reduce the intensive human effort demanded by the conventional step-by-step neural modeling approach. The algorithm inherently distinguishes between smooth and nonlinear regions of model behavior. It then uses relatively fewer samples in smooth subregions and more samples in nonlinear subregions. The AMG algorithm can systematically generate a neural network model satisfying the user-desired accuracy. Active learning is widely used in deep learning area and can adaptively perform model generation. In active learning, adaptive sampling is used to create a feedback loop between the deep learning model and the data [160], [161]. As more samples are collected, the adaptive sampling method learns how to improve the sampling process by deciding where to sample next. 
This enables faster training convergence, as more data are generated in the regions that have maximum uncertainty. Incorporation of active learning into the AMG process is an interesting further step.
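The growth of sample counts for the grid and star distributions, and the stratification behind LHS, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the parameter ranges, and the choice to include the central point in the star set are hypothetical, and the LHS routine is a basic textbook variant rather than the specific schemes in the cited references.

```python
import itertools
import numpy as np

def grid_samples(lower, upper, n_per_dim):
    """Full grid: n_per_dim ** d points for d input dimensions."""
    axes = [np.linspace(lo, hi, n_per_dim) for lo, hi in zip(lower, upper)]
    return np.array(list(itertools.product(*axes)))

def star_samples(center, step):
    """Central point plus one positive and one negative perturbation per dimension."""
    center = np.asarray(center, dtype=float)
    step = np.asarray(step, dtype=float)
    points = [center.copy()]
    for i in range(center.size):
        for sign in (+1.0, -1.0):
            p = center.copy()
            p[i] += sign * step[i]
            points.append(p)
    return np.array(points)

def latin_hypercube(lower, upper, n_samples, rng):
    """One sample per stratum in every dimension; strata are shuffled per dimension."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    d = lower.size
    strata = np.array([rng.permutation(n_samples) for _ in range(d)]).T
    unit = (strata + rng.random((n_samples, d))) / n_samples
    return lower + unit * (upper - lower)

lower, upper = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]          # hypothetical geometry ranges
grid = grid_samples(lower, upper, n_per_dim=5)           # 5**3 = 125 points
star = star_samples([1.5, 3.0, 4.5], [0.25, 0.5, 0.75])  # 2*3 + 1 = 7 points
lhs = latin_hypercube(lower, upper, n_samples=10, rng=np.random.default_rng(0))
```

For three inputs, the grid needs 125 simulations at five levels per input, while the star set needs only seven; the LHS set size is chosen freely and each input range is covered by exactly one sample per stratum.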

G. Extrapolation for Neural Networks
Typically, a standard neural network model is accurate only inside the region where it is trained and is unreliable outside this training region. Various extrapolation methods [162] for neural network models have been introduced to address this issue, such as model-level extrapolation [163], [164], neuron-level extrapolation [165], and multidimensional extrapolation [166]. To facilitate the model extrapolation, the model-level extrapolation methods use the neural network and its derivative information at the training region's boundaries [163], [164]. The neuron-level extrapolation method addresses the case in which the training boundaries are irregular [165]. This method detects the necessity of extrapolation inside each hidden neuron and performs extrapolation using a modified neuron activation function. The simplification of the problem into a combination of several 1-D first-order extrapolations makes this method simple and easy to implement. To address the modeling challenges in EM optimization and large-signal harmonic balance simulation, an advanced multidimensional extrapolation technique for microwave modeling and design has been used [166]. With good model accuracy being guaranteed inside the training region, the extrapolation technique [166] can further force the model to be smooth along all the directions outside the training region. Besides the various extrapolation methods, the KBNN method [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53] exploits microwave empirical or equivalent models to help the overall KBNN model achieve better extrapolation capability than pure neural networks. Furthermore, CNN- and LSTM-based neural network structures have been exploited for design space extrapolation and frequency extrapolation [151].

III. KNOWLEDGE-BASED NEURAL NETWORKS
The KBNN incorporates the available microwave information in the form of microwave equivalent circuit/semianalytical models (as prior knowledge) into neural network structures. There are several ways in which microwave knowledge and neural networks are combined, as shown in Fig. 1. Let y represent the outputs of the KBNN model. Let x represent the inputs of the KBNN model, i.e., geometrical parameters. Let d represent the values of y in the training or validation data (such as data generated from EM simulation). Fig. 1(a) shows the general structure of KBNN, followed by special structures of various KBNN methods in Fig. 1(b)-(g).

A. Basic KBNN Structures
The difference method [9] uses ANN to learn the difference between EM data and prior-knowledge model outputs, illustrated in Fig. 1(b). The inputs of ANN are physical parameters, and the outputs are the differences between the prior-knowledge model and the EM simulation. The difference method is very efficient in dealing with modeling problems where the differences between the outputs of the prior knowledge and the EM simulation have a simpler relationship with respect to the inputs than the outputs of the EM simulation itself.

Fig. 1. (a) General structure of KBNN. (b) Structure of the difference method [9]. (c) Structure of the multiknowledge embedding approach for KBNN [36]. (d) Structure of the PKI method [28]. (e) Structure of the knowledge-based model with adjoint neural networks [58]. (f) Structure of the SMNN method [37], [167]. (g) Structure of KBNN with adaptive mappings [44].
The multiknowledge embedding approach [36] for KBNN embeds multiple pieces of prior knowledge into separate subspaces of the neural network model, resulting in an overall model covering a larger space than any of the individual knowledge models alone, as shown in Fig. 1(c). Compared with the difference method, the multiknowledge embedding approach embeds the prior knowledge into the internal structure of the neural network, allowing the boundaries between the subspaces for each piece of prior knowledge to be learned as well.
In the prior-knowledge input (PKI) method [28], the prior-knowledge model outputs are used as the inputs to the ANN, in addition to the original problem (EM data) inputs, as illustrated in Fig. 1(d). The PKI method is more suitable when dealing with modeling problems where the similarities between the prior-knowledge output and the EM output lead to simpler training of the ANN.
The space-mapped neural network (SMNN) concept [22], [37], [167] divides the modeling problem into a coarse model (prior-knowledge models) and a fine model (EM simulations). An ANN is used to map between the fine model and the coarse model, as shown in Fig. 1(f). SMNN concepts can be applied for input mapping, output mapping, and implicit mapping. By exploiting the vast set of available knowledge models, SMNN methods reduce the number of fine model data needed for training, improve the model generalization ability, and simplify the ANN topology with respect to pure ANN modeling methods.
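The way the difference method and the PKI method use prior knowledge can be sketched with a toy problem. Everything in this example is an assumption made for illustration: the "fine" and "coarse" functions are invented, and a low-order polynomial least-squares fit stands in for ANN training so that the example stays short and deterministic.

```python
import numpy as np

def fine_model(x):
    """Stand-in for EM simulation data (toy function, assumption)."""
    return np.sin(3.0 * x) + 0.1 * x**2

def coarse_model(x):
    """Stand-in for the prior-knowledge model (toy function, assumption)."""
    return np.sin(3.0 * x)

def poly_features(x, degree=2):
    return np.vander(x, degree + 1, increasing=True)

def pki_features(x):
    """PKI: the coarse-model output is appended as an extra model input."""
    return np.column_stack([poly_features(x), coarse_model(x)])

def fit(features, targets):
    """Least-squares fit, used here in place of ANN training."""
    coef, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return coef

def max_error(y):
    return np.max(np.abs(y - fine_model(x_test)))

x_train = np.linspace(-2.0, 2.0, 9)
x_test = np.linspace(-2.0, 2.0, 101)

# Pure model: learn the fine response directly.
y_pure = poly_features(x_test) @ fit(poly_features(x_train), fine_model(x_train))

# Difference method: learn (fine - coarse), then add the coarse model back.
w_diff = fit(poly_features(x_train), fine_model(x_train) - coarse_model(x_train))
y_diff = coarse_model(x_test) + poly_features(x_test) @ w_diff

# PKI method: learn the fine response from the inputs plus the coarse output.
y_pki = pki_features(x_test) @ fit(pki_features(x_train), fine_model(x_train))

err_pure, err_diff, err_pki = max_error(y_pure), max_error(y_diff), max_error(y_pki)
```

In this constructed case the residual between fine and coarse models is a simple quadratic, so both knowledge-based variants fit it with nine samples, whereas the same low-order model without prior knowledge cannot follow the oscillating response.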

B. Knowledge-Based Model With Adjoint Neural Networks
To further speed up the KBNN development process, EM sensitivity information is incorporated into the KBNN model [58], as shown in Fig. 1(e). Let d represent the values of derivatives of y with respect to x in the training or validation data (such as data generated from EM sensitivity analysis).
The knowledge-based model with adjoint neural networks [92] allows robust parametric model development by learning both the input-output behavior and the derivatives of the EM modeling problem simultaneously. Compared with the KBNN modeling without EM sensitivity, the same modeling accuracy can be achieved with fewer training data by exploiting EM sensitivities. The knowledge-based model with adjoint neural networks is only utilized during the training process. The final model for a user is the simple original knowledge-based model, which can be further used in high-level designs.
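One plausible form of a training objective that learns the input-output behavior and the derivatives simultaneously is sketched below. It is an assumed illustration rather than the formulation of the cited work: the symbol $d'_k$ for the derivative data, the weighting factor $\lambda$, the weight vector $w$, and the number of samples $n_s$ are notation introduced here.

```latex
% Assumed combined objective: output error plus sensitivity (derivative) error.
% y(x_k, w): knowledge-based model output at sample x_k with trainable weights w
% d_k:       EM response data;  d'_k: EM sensitivity data (hypothetical notation)
E(w) = \frac{1}{2} \sum_{k=1}^{n_s} \left\| y(x_k, w) - d_k \right\|^2
     + \frac{\lambda}{2} \sum_{k=1}^{n_s}
       \left\| \frac{\partial y(x_k, w)}{\partial x} - d'_k \right\|_F^2
```

The second term is what the adjoint neural network supplies during training; once training is finished, only $y(x, w)$ is kept as the final model.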

C. KBNN With Adaptive Mappings
In KBNN modeling, there are several different ways to map the difference between the knowledge/coarse model and the EM data, such as input mapping, output mapping, frequency mapping, or a combination of any two or three of these mappings [3], and each mapping can be linear or nonlinear. Different modeling problems or different knowledge/coarse models can lead to different types and complexities of the mapping in KBNN. However, the appropriate mapping structure for a KBNN is not known in advance. To address this issue, KBNN with adaptive mappings [44] helps to systematically determine the type and topology of the mapping structure in KBNN. The structure of the KBNN with adaptive mappings combines the empirical model with one input mapping, one frequency mapping, and one output mapping to include all cases of mappings. To include both linear and nonlinear mapping, each mapping ANN is an extension of the conventional MLP with an additional direct connection between the input layer and the output layer, as shown in Fig. 1(g). A two-stage training strategy with $l_1$ optimization is presented in [44] to train the model. Using the properties of $l_1$ norms, the training method automatically determines whether to use linear or nonlinear mapping in various parts of the model in the first-stage training and also determines whether a linear mapping, if it exists, can be deleted or not in the second-stage training. After the two-stage training, the appropriate mapping structure of the knowledge-based model for the modeling problem is achieved.
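A mapping network with a direct input-to-output connection, together with an $l_1$ penalty that can switch off its nonlinear branch, may be sketched as follows. This is an assumed illustration, not the formulation of [44]: the function names, the choice of penalizing only the hidden-to-output weights, the tolerance, and the toy dimensions are all hypothetical.

```python
import numpy as np

def mapping_ann(x, W_direct, b_out, W_hidden, b_hidden, V_out):
    """MLP with an additional direct (linear) connection from input to output."""
    hidden = np.tanh(W_hidden @ x + b_hidden)
    return W_direct @ x + b_out + V_out @ hidden

def l1_penalty(V_out):
    """l1 norm of the hidden-to-output weights (the nonlinear branch)."""
    return np.sum(np.abs(V_out))

def is_linear_mapping(V_out, tol=1e-6):
    """If training drives all nonlinear-branch weights to zero, the mapping is linear."""
    return bool(np.all(np.abs(V_out) < tol))

rng = np.random.default_rng(0)
x = np.array([0.3, -0.7])
W_direct = np.eye(2)  # unit mapping: output equals input
b_out = np.zeros(2)
W_hidden = rng.standard_normal((4, 2))
b_hidden = rng.standard_normal(4)

V_linear = np.zeros((2, 4))            # nonlinear branch removed
V_nonlinear = 0.5 * np.ones((2, 4))    # nonlinear branch active

y_linear = mapping_ann(x, W_direct, b_out, W_hidden, b_hidden, V_linear)
```

Because an $l_1$ penalty tends to drive small weights exactly to zero, adding `l1_penalty(V_out)` to the training error is one way such a scheme could decide between a linear and a nonlinear mapping.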

D. Discussions
This section provides a brief qualitative characterization of the advantages and disadvantages of the different kinds of KBNNs reviewed in this article. Table I shows the comparisons of pure neural network and different KBNN methods. A main advantage of KBNN over a pure neural network is the reduced amount of training data needed. The efficiency of KBNN also depends on the choice of KBNN methods. The difference method uses the difference between the prior knowledge and the EM simulations to form the model, resulting in a smaller range of the ANN outputs and a simpler input-output relationship. This method is the simplest form of KBNN and is expected to give good results when the difference has a simpler input-output relationship as a function of the inputs than the EM simulation data. The multiknowledge embedding approach allows multiple knowledge functions in different regions of the model. Since these knowledge functions are used for some neurons instead of standard activation functions, the KBNN needs to be trained using methods other than the conventional backpropagation. The advantage is its ability to accommodate multiple knowledge models for different subregions of the overall modeling space and to allow the boundaries of subregions to be learned during training. The PKI method makes direct correction of the output responses of the prior knowledge. Compared with the difference method, the PKI is slightly more complex. However, the PKI is applicable even if the difference between prior knowledge and EM output has a complicated relationship with respect to inputs. Knowledge-based modeling with an adjoint neural network incorporates the EM sensitivity information into the KBNN model development process. Since more information is provided, the knowledge-based modeling with an adjoint neural network needs fewer EM data to achieve an accurate model than the knowledge-based modeling without sensitivity information.
Therefore, this method is suitable if the data generation is very expensive, and sensitivity information is available from the data generator. SMNN uses input mapping, output mapping, or implicit mapping for different microwave modeling problems to construct mapping functions from fine model (EM simulations) to coarse model (the prior knowledge), modify the coarse model, or align the responses between the fine model and the coarse model. When the mapping structure in the SMNN for a microwave modeling problem cannot be determined in advance, KBNN with an adaptive mapping method can be applied to automatically determine the optimal mapping structure for the specific modeling problem. Instead of sequential trials of linear and nonlinear mappings every time for every mapping structure, KBNN with an adaptive mapping method effectively increases the efficiency of knowledge-based model development.

E. Application Example of KBNN With Adaptive Mappings for Microwave Modeling
As an example, the method of KBNN with adaptive mappings is applied to develop a parametric model of a microwave low-pass filter with double microstrip sections [44], as shown in Fig. 2(a). The geometrical input parameters of the model are the lengths and widths of the microstrip lines of the filter. The output of the model is the magnitude of $S_{21}$. Fig. 2(b) shows the prior-knowledge model, which is an equivalent circuit of the low-pass filter using simple transmission lines. For comparison purposes, KBNN models are developed for two different cases. In Case 1, the range of the geometrical input parameters is small, while in Case 2, the range is large. Using the modeling method of KBNN with adaptive mappings, a linear input mapping is good enough to achieve a KBNN with 1.92% testing error for Case 1. For Case 2, the modeling method chooses a nonlinear input mapping together with a nonlinear frequency mapping to achieve a KBNN with 1.97% testing error [44]. From these two modeling examples, it is demonstrated that the modeling method of KBNN with adaptive mappings can adaptively determine the mapping structure of the KBNN model for specific modeling problems. Fig. 2(c) shows modeling results at four different geometrical values for the low-pass filter example.

IV. NEURO-TF MODELING METHODS
As an advanced knowledge-based modeling approach, the neuro-TF method introduces transfer functions to perform the parametric modeling of the frequency response of microwave components [54], [55], [56], [57], [58], [59], [60], [68], [69], [70], [71], [72]. To develop the neuro-TF model, a transfer function is needed as the prior knowledge to represent the highly nonlinear relationship between EM responses and frequency. Using the knowledge of transfer functions, the remaining relationships between the transfer function parameters and geometrical parameters are less nonlinear and much easier for the neural networks to learn, which makes the training of neuro-TF much more efficient.
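Let the order of transfer functions be denoted as N. Let p be a vector containing all the transfer function parameters (e.g., rational function coefficients, or pole/residues, or gain/pole/zeros). The neuro-TF model response is expressed as

$$y = H(p(x, w), s)$$

where H is the transfer function response; p(x, w) denotes the transfer function parameters produced by the neural networks from the geometrical inputs x with neural network weights w; s is the Laplace domain frequency. Fig. 3 shows the structure of the neuro-TF model in general format.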

A. Neuro-TF in Different Transfer Function Formats
The neuro-TF in rational format consists of neural networks and rational transfer functions. Let the neural network outputs represent the coefficients of the rational transfer function. The rational transfer function for neuro-TF is relatively simple. However, its order of the transfer function or its frequency bandwidth is limited to preserve neuro-TF accuracy [54], [59].
Another structure is the neuro-TF model in a pole/residue format, which uses the pole-residue-based transfer function [55] instead of the rational transfer function. The neuro-TF model in the pole/residue format consists of neural networks and pole/residue-based transfer functions. The neuro-TF in pole/residue format has low sensitivities of transfer function parameters with respect to geometrical parameters even when the order of transfer function is high, which can result in a good modeling accuracy and robustness.
To extract the feature parameters from the neuro-TF model to assist EM design optimization, the neuro-TF in pole/zero format is introduced [56], [68], [69]. The neuro-TF in pole/zero format can efficiently extract the transfer function zeros through a calculation to obtain the feature parameters (e.g., feature frequencies) [68], which gives neuro-TF-based optimization a better chance of escaping local minima [69]. Table II summarizes the formulations and comparisons for the neuro-TF model in different transfer function formats. When a transfer function of relatively low order can represent the EM frequency responses of microwave applications, the neuro-TF in rational format is efficient to use, because the rational coefficients are automatically sorted correctly among different geometrical samples. The neuro-TF model in rational format thus automatically avoids the mismatch issue (i.e., different sequences of transfer function parameters among different geometrical samples) that can arise in pole/residue or pole/zero format. When a transfer function of relatively high order is needed to represent the EM frequency responses of microwave applications, the neuro-TF model in pole/residue format can achieve better accuracy and robustness, because the response of the transfer function is much less sensitive to the pole/residues than to the rational coefficients. When feature parameters need to be extracted from a neuro-TF model to assist the design optimization, the neuro-TF model in pole/zero format is effective and efficient to use.
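The three transfer function formats describe the same response in different parameterizations, which the following sketch checks numerically for a second-order example. The textbook forms used here (ratio of polynomials, sum of residue over pole terms, and gain times products of zero and pole factors) and all numerical values are assumptions for illustration; the exact formulations of Table II are not reproduced.

```python
import numpy as np

def tf_rational(s, num, den):
    """Ratio of polynomials; coefficients ordered from highest power to lowest."""
    return np.polyval(num, s) / np.polyval(den, s)

def tf_pole_residue(s, poles, residues):
    """Sum of r_i / (s - p_i)."""
    s = np.asarray(s, dtype=complex)
    return sum(r / (s - p) for p, r in zip(poles, residues))

def tf_pole_zero(s, gain, zeros, poles):
    """gain * prod(s - z_i) / prod(s - p_i)."""
    s = np.asarray(s, dtype=complex)
    num = np.prod([s - z for z in zeros], axis=0)
    den = np.prod([s - p for p in poles], axis=0)
    return gain * num / den

# Hypothetical complex-conjugate pole/residue pair
poles = np.array([-0.1 + 2.0j, -0.1 - 2.0j])
residues = np.array([0.5 - 0.2j, 0.5 + 0.2j])

# Convert pole/residue -> rational coefficients -> gain/zeros
den = np.poly(poles)
num = sum(r * np.poly(np.delete(poles, i)) for i, r in enumerate(residues))
gain = num[0]
zeros = np.roots(num)

s = 1j * np.linspace(0.5, 4.0, 8)  # frequency samples on the imaginary axis
H_pr = tf_pole_residue(s, poles, residues)
H_rat = tf_rational(s, num, den)
H_pz = tf_pole_zero(s, gain, zeros, poles)
```

In the pole/zero form, the transfer function zeros are directly available as parameters, which is what makes feature extraction convenient in that format.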
As an alternative way to model the frequency response of microwave components apart from neuro-TF, a discrete frequency ANN approach can be used to consider each frequency point as a separate output dimension [151], [168]. By removing the frequency variable and converting the information to discretized output format, the training of the neural network becomes simplified. However, it may result in the loss of continuity in the frequency variable, and the dimension in the ANN outputs becomes high when the frequency range increases. Further steps, such as frequency interpolation and frequency extrapolation, would be applied when necessary.

B. Neuro-TF With Different Neural Network Structures
In the abovementioned neuro-TF methods, different formats of transfer function are used to develop accurate parametric models for different applications, while MLPs are used as the neural network structures [54], [55], [56].
Besides MLPs, different neural network structures are also introduced in neuro-TF methods to expand the methods into broader applications. One of the alternative neural network structures suitable for the neuro-TF method is the ELM [127], which fixes the input weights and hidden layer biases of a single-hidden-layer FFNN. Furthermore, a dynamic adjustment kernel ELM is also used as the neural network structure, which contains increased learning, reduced learning, and hybrid learning in the interior structure [128]. The neuro-TF model with ELM can speed up the training process over that with MLP.
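The reason an ELM (extreme learning machine) trains quickly is that only the output weights are solved for, and this reduces to a linear least-squares problem. The sketch below shows a basic ELM on a toy 1-D regression task; the function names, hidden layer size, activation, and target function are assumptions for illustration and do not reproduce the neuro-TF-specific variants in the cited works.

```python
import numpy as np

def elm_train(X, T, n_hidden, rng):
    """Single-hidden-layer FFNN with fixed random input weights and biases."""
    W_in = rng.standard_normal((X.shape[1], n_hidden))  # fixed, never trained
    b = rng.standard_normal(n_hidden)                   # fixed, never trained
    H = np.tanh(X @ W_in + b)                           # hidden layer outputs
    beta, *_ = np.linalg.lstsq(H, T, rcond=None)        # only output weights are solved
    return W_in, b, beta

def elm_predict(X, W_in, b, beta):
    return np.tanh(X @ W_in + b) @ beta

rng = np.random.default_rng(0)
X = np.linspace(-1.0, 1.0, 40).reshape(-1, 1)  # toy inputs
T = np.sin(np.pi * X)                          # toy targets
W_in, b, beta = elm_train(X, T, n_hidden=20, rng=rng)
rms_error = np.sqrt(np.mean((elm_predict(X, W_in, b, beta) - T) ** 2))
```

No iterative gradient-based optimization is involved, which is consistent with the training speed-up over MLP noted above.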
Another alternative neural network structure suitable for the neuro-TF method is the RBF-NN. For specific training of the RBF-NN in neuro-TF, a semi-supervised training, including both supervised and unsupervised training, is implemented [123]. By incorporating the semi-supervised training, the modeling accuracy and efficiency can be improved with the same number of training data.

C. Two-Stage Training Process for Neuro-TF Models
In general, the neuro-TF model is trained using a two-stage training process [55]. Before training the neuro-TF model, vector fitting [169] is used to extract the transfer function parameters (rational coefficients, pole/residues, or gain/pole/zeros) from the EM responses of all the training samples. The first training stage is the preliminary training process, where the relationships of the transfer function parameters versus the geometrical parameters are learned by neural networks. The second training stage is the model refinement of the overall neuro-TF model. The model refinement is performed by optimizing the weighting parameters in the neural networks to minimize the training error, denoted as $E$, which accumulates the difference between the neuro-TF model response and the training data over all the training samples and frequency samples. Here, $n_s$ represents the number of training samples of various geometrical parameters; $n_f$ represents the number of frequency samples; and $d_{k,j}$ is the training data generated using EM simulators at the $k$th sample and the $j$th frequency. Fig. 4 illustrates the flowchart of the development process for the neuro-TF model [55].
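The refinement-stage objective can be sketched as a function of the model responses and the EM data. The sum-of-squares form and the factor 0.5 used here are assumptions for illustration and may differ from the normalization used in [55]; the function name and the numerical values are hypothetical.

```python
import numpy as np

def neuro_tf_training_error(y_model, d):
    """Sum-of-squares error over n_s geometrical samples and n_f frequencies.

    y_model[k, j]: neuro-TF response at the kth sample and jth frequency.
    d[k, j]:       EM training data at the kth sample and jth frequency.
    ASSUMPTION: E = 0.5 * sum_k sum_j |y_model[k, j] - d[k, j]|^2.
    """
    y_model = np.asarray(y_model, dtype=complex)
    d = np.asarray(d, dtype=complex)
    if y_model.shape != d.shape:
        raise ValueError("y_model and d must both have shape (n_s, n_f)")
    return 0.5 * float(np.sum(np.abs(y_model - d) ** 2))

# Hypothetical data with n_s = 2 samples and n_f = 2 frequencies.
d = np.array([[1.0 + 0.0j, 0.5j], [0.2 + 0.1j, -0.3 + 0.0j]])
y_model = d + np.array([[0.1, 0.0], [0.0, 0.2j]])
E = neuro_tf_training_error(y_model, d)
```

In the second stage, an optimizer would adjust the neural network weights so that the responses computed from the predicted transfer function parameters reduce this value.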

D. Application Examples for Different Neuro-TF Models
The first example illustrates the development of the neuro-TF model for parametric modeling of the EM behavior of an ultrawideband (UWB) antenna [55], shown in Fig. 5(a), where $R_l$ and $d_l$ are the radius of and the distance between the two large pads, respectively; $t$ is the substrate thickness; $g_s$ is the gap between the small and large pads connecting to the port; and $r_s$ is the small pad radius. According to the vector fitting results, the order of the transfer function is high in this example. Therefore, the neuro-TF in pole/residue format is suggested. Fig. 5(b) shows that the neuro-TF in pole/residue format can achieve better accuracy than that in rational format, since the sensitivities of the transfer function output with respect to the pole/residues are lower than those with respect to the rational coefficients.
For the second example, we perform parametric modeling and design optimization using neuro-TF of the sixth-order cavity filter [69], whose structure is presented in Fig. 5(c). The neuro-TF model has seven geometrical variables as its inputs, i.e., $x = [H_{c1}\ H_{c2}\ H_{c3}\ W_1\ W_2\ W_3\ W_4]^T$, and one output, i.e., $y = |S_{11}|$. Since feature parameters need to be extracted to assist the design optimization in this example, the neuro-TF in pole/zero format is suggested. As shown in Fig. 5(d), the design optimization using the feature-assisted neuro-TF in pole/zero format successfully achieves the final EM optimal solution, which satisfies the design specifications.

V. NEURAL NETWORK-BASED INVERSE MODELING
We start the description of inverse modeling by contrasting it to forward modeling. A neural network-based forward model is a neural network trained to represent the original microwave device where the inputs are the physical or geometrical parameters, and the outputs are the electrical parameters. Let $n$ and $m$ represent the number of inputs and outputs of the forward model, respectively. The forward model is expressed as $y = f_{\text{ANN}}(x, w)$, where $x$ represents an $n$-vector containing the inputs of the forward neural network model; $y$ represents an $m$-vector containing the outputs; $f_{\text{ANN}}$ represents the ANN functions for the forward model; and $w$ represents a vector containing the neural network weights. For the parameter extraction or design purposes of microwave devices, the information is often processed in the reverse way to find the physical or geometrical parameters from desired outputs, which is called the inverse problem. The inverse problem can usually be solved either by an inverse modeling method or by an optimization method. For inverse modeling, the model inputs are the electrical parameters, and the model outputs are the physical or geometrical parameters. After a good inverse model is developed, it can obtain the physical or geometrical parameters immediately. This makes inverse modeling much faster than the iterative optimization process. ANNs have been employed as useful tools for inverse modeling of microwave components [73], [74], [75], [76], [77], [78], [148], [170]. In recent years, various advanced neural network-based inverse modeling techniques have been developed for microwave CAD to address the challenging non-uniqueness problem for inverse modeling [79], [80], [81], [82], [83], [171].

A. Formulation of Neural Network-Based Inverse Model
Since the inverse input-output relationship is usually unknown and hard to formulate analytically, a neural network becomes a logical choice to develop the inverse model by learning the input-output relationship based on the training data. For the inverse model, let $x$ represent an $m$-vector containing the inputs, where $m$ is the number of inputs. The inputs are usually the electrical parameters. Let $y$ represent an $n$-vector containing the outputs, where $n$ is the number of outputs. The outputs are usually the physical or geometrical parameters. The neural network-based inverse model is expressed as $y = f^{-1}_{\text{ANN}}(x, w)$, where $f^{-1}_{\text{ANN}}$ represents the ANN functions for the inverse modeling.

B. Non-Uniqueness Problem in Inverse Modeling
The training data are generated from the forward model (or simulation of forward problem). The input-output data are then swapped, such that the input data from the forward model are used as the output data during the inverse model training process. When the input-output relationship is unique, i.e., the inverse problem is simple, the training of such an inverse model is easy, and the model is accurate.
For general situations, there will be the non-uniqueness problem of the inverse input-output relationship in the training data. The non-uniqueness problem happens when the different training samples with the same input values have multivalued solutions (contradictory output values). The ANN model cannot learn the contradictory output values simultaneously for the same input values, which results in large training error and low model accuracy. This is the reason why training an inverse model is typically much more challenging than training a forward model.
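The effect of contradictory samples can be seen in a small numerical experiment. This is an illustrative sketch only: the forward relationship e = g², the symmetric sampling range, and the use of a cubic polynomial fit as a stand-in for a single-valued ANN are all assumptions made for the example.

```python
import numpy as np

# Hypothetical forward problem: geometrical parameter g -> electrical response e.
g = np.linspace(-1.0, 1.0, 201)
e = g ** 2

# Direct inverse modeling: swap inputs and outputs and fit g as a function of e.
# Every e > 0 now has two contradictory targets, +sqrt(e) and -sqrt(e).
coeffs = np.polyfit(e, g, deg=3)   # stand-in for a single-valued inverse model
g_pred = np.polyval(coeffs, e)

# Least squares settles on the average of the contradictory targets,
# so the fitted inverse model matches neither solution branch.
rmse = float(np.sqrt(np.mean((g_pred - g) ** 2)))
```

The fitted model returns values near zero for every input, which is the average of the two valid solutions, and the training error stays large regardless of the model capacity.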

C. Derivative-Based Training Method of Neural Network for Inverse Modeling
The derivative-based training method of neural network for inverse modeling addresses the problem of non-uniqueness by dividing the data containing contradictory output values into groups, such that the data in each divided group do not have the problem of contradictory output values [80]. The training data are divided into groups according to the derivatives of outputs versus inputs of the forward model. A neural network inverse submodel is trained for each group of training samples. Since there are no contradictory output values within each group, the training of each inverse submodel is simple, and the training accuracy can be high.
After the submodels are trained, the multiple inverse submodels need to be combined to represent the overall model. For this purpose, an algorithm is developed to select the right one among the multiple inverse submodels for a given input $x$. This is achieved with the help of the forward model. Fig. 6 depicts the inverse submodel combining method for a two-submodel system [80]. The input $x$ is given to each inverse submodel, and the outputs from them are fed into accurately trained forward models, respectively, which generate different values of $y$. These outputs of the forward models are compared with the input data $x$. For the given input, the correct inverse submodel is the one with the least error between $y$ and $x$.
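The grouping and selection steps can be sketched on a one-dimensional problem. This is a simplified illustration, not the algorithm of [80]: the forward function, the sampling range, and the helper names are hypothetical, the exact forward function is used in place of a trained forward model, and linear interpolation (`np.interp`) is used as a stand-in for each trained inverse submodel.

```python
import numpy as np

def forward(g):
    """Hypothetical forward relationship: geometry g -> electrical response."""
    return g ** 2

# Training data from the forward problem on an asymmetric range.
g_train = np.linspace(-1.0, 2.0, 301)
e_train = forward(g_train)

# Divide the samples into groups according to the sign of the derivative
# of the forward output with respect to the forward input.
slope = np.gradient(e_train, g_train)
groups = [slope < 0.0, slope >= 0.0]

def make_submodel(e, g):
    """Stand-in inverse submodel: interpolate g versus e within one group."""
    order = np.argsort(e)
    e_sorted, g_sorted = e[order], g[order]
    return lambda x: float(np.interp(x, e_sorted, g_sorted))

submodels = [make_submodel(e_train[m], g_train[m]) for m in groups]

def inverse(x):
    """Choose the submodel whose output best reproduces x through the forward model."""
    candidates = [sub(x) for sub in submodels]
    errors = [abs(forward(c) - x) for c in candidates]
    return candidates[int(np.argmin(errors))]

g_hat = inverse(2.25)  # only the increasing branch covers this response value
```

For the input 2.25, the submodel trained on the decreasing branch is queried outside its training range and returns a geometry whose forward response does not match the input, so the forward check selects the other submodel.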

D. Multivalued Neural Network Inverse Modeling Technique
In [83], a simpler and more automated multivalued neural network inverse modeling technique is developed to solve the non-uniqueness problem and to simplify the inverse modeling process. In this technique, the inverse model produces multiple values of the outputs for a given input, where $y_i$ is the $i$th value of the multivalued inverse model and $g_{\text{ANN}}$ represents the ANN functions of the multivalued inverse model. Fig. 7 shows an example of the multivalued inverse model. Let $T_r$ be the index set of all the training samples. Let $(x_k, d_k)$, where $k \in T_r$, represent the $k$th training sample pair. To solve the non-uniqueness problem, such that different values $y_i$ can match different conflicting samples for the same or very similar input $x$, the training error of the multivalued neural network is defined as [83]

$$E(w) = \sum_{k \in T_r} E_k(w) \qquad (6)$$

where $E_k(w)$ represents the training error for the $k$th sample, which is defined in terms of the errors $e_{i,k}(w)$ between $y_i$ and $d_k$, with $y_i$ being the $i$th value of the inverse neural network model $g_{\text{ANN}}(x_k, w)$ for the input $x_k$. The error function focuses on matching the training sample with the one value of the multivalued inverse model that is closest to the current training sample, while ignoring the other values. In this way, the non-uniqueness problem is solved, and the developed multivalued inverse model can systematically fit multiple contradictory samples.
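The "match the closest value and ignore the others" behavior can be sketched in a few lines. The exact expressions for $E_k(w)$ and $e_{i,k}(w)$ in [83] are not reproduced here; this sketch assumes a squared-error $e_{i,k}$ with a factor 0.5 and a hard minimum over the values, and the function names and numbers are hypothetical.

```python
import numpy as np

def sample_error(y_values, d_k):
    """Per-sample error for a multivalued model.

    y_values: (p, n) array holding the p candidate output vectors y_i.
    d_k:      (n,) target vector of the kth training sample.
    ASSUMPTION: e_ik = 0.5 * ||y_i - d_k||^2 and E_k = min_i e_ik.
    """
    diff = np.asarray(y_values, dtype=float) - np.asarray(d_k, dtype=float)
    e = 0.5 * np.sum(diff ** 2, axis=1)
    return float(np.min(e)), int(np.argmin(e))

def training_error(y_all, d_all):
    """E(w): sum of the per-sample errors over all training samples."""
    return sum(sample_error(y, d)[0] for y, d in zip(y_all, d_all))

# Two samples with the same input but contradictory targets.
y_values = np.array([[-1.0], [1.0]])   # the two values produced by the model
d_all = [np.array([0.9]), np.array([-1.1])]
E = training_error([y_values, y_values], d_all)
matched = [sample_error(y_values, d)[1] for d in d_all]
```

Each contradictory sample is matched by a different value of the model (`matched` is `[1, 0]`), so both samples are fitted closely at once, which a single-valued model could not do.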

E. Invertible Neural Nets for Inverse Modeling
Invertible neural network models are also developed to handle the inverse modeling problem [171]. Invertible neural networks consist of reversible blocks. In the forward process, the reversible block computes the output values from the input values. In the inverse process, the same block computes the input values from the output values without additional computational complexity compared with the forward process. The bijective structure of the invertible neural network is suitable for bidirectional operation and training. Therefore, both the forward and inverse processes can be well addressed.
The invertible neural network is trained in a bidirectional way in each training iteration. The invertible neural network learns the transformation between the sample distribution and the prior distribution of a latent variable. After the training process, the conditional posterior distribution of the design parameters can be predicted for the given target. This makes this technique especially advantageous for solving the non-uniqueness and degeneracies when multiple candidate solutions exist in the inverse problem.
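One common way to build a reversible block is an affine coupling layer. The sketch below assumes this construction for illustration; the text above does not specify the block used in [171], and the class name, the linear scale/shift maps, and the dimensions are hypothetical.

```python
import numpy as np

class AffineCoupling:
    """Reversible block: the first half of the vector passes through unchanged
    and parameterizes an affine transform of the second half."""

    def __init__(self, dim, rng):
        self.k = dim // 2
        self.Ws = 0.1 * rng.normal(size=(self.k, dim - self.k))  # scale map
        self.Wt = 0.1 * rng.normal(size=(self.k, dim - self.k))  # shift map

    def forward(self, u):
        u1, u2 = u[:, :self.k], u[:, self.k:]
        s, t = np.tanh(u1 @ self.Ws), u1 @ self.Wt
        return np.hstack([u1, u2 * np.exp(s) + t])

    def inverse(self, v):
        v1, v2 = v[:, :self.k], v[:, self.k:]
        s, t = np.tanh(v1 @ self.Ws), v1 @ self.Wt  # same maps as in forward
        return np.hstack([v1, (v2 - t) * np.exp(-s)])

rng = np.random.default_rng(0)
block = AffineCoupling(dim=4, rng=rng)
u = rng.normal(size=(5, 4))
v = block.forward(u)
u_rec = block.inverse(v)
```

The inverse pass reuses the same scale and shift computations as the forward pass, so it recovers the input exactly (up to rounding) at essentially the same cost, which is the property described above.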

F. Discussions
Table III compares different neural network-based inverse modeling methods. Here, we discuss the advantages of the different inverse modeling methods and the situations in which each is suitable. The simplest ANN method for inverse modeling is the direct inverse modeling method, where an ANN model is trained using data obtained by swapping the input and output data from the forward problem. When the input-output relationship is unique, i.e., the inverse problem is simple, the training of such an inverse model is easy, and the model is accurate. When the inverse problem is complex and challenging due to the non-uniqueness problem, the derivative-based training method, the multivalued neural network method, or the invertible neural network method is needed. The derivative-based method addresses the non-uniqueness problem by dividing the inverse model into different inverse submodels according to derivative criteria. Therefore, the derivative-based method requires data preprocessing. If the data preprocessing for the modeling problem is easy, the derivative-based training method is efficient. If the data preprocessing is very intensive and complex, especially in a multidimensional input space, the multivalued neural network is more efficient, because it deals with the non-uniqueness problem without the need for data preprocessing. To address both forward and inverse modeling problems simultaneously, the invertible neural network model can be used due to its bidirectional ability.

VI. DEEP NEURAL NETWORK TECHNIQUES FOR MICROWAVE MODELING
Neural networks with one or two hidden layers have been used to successfully solve many application problems in microwave CAD. Further advances in microwave CAD have led to the use of deep neural networks with many hidden layers to address more complex microwave modeling problems. Various deep neural network methods have been explored for microwave CAD, such as deep MLP [89], [90], [91], [142], RNN [133], [140], [150], and CNN [74], [77], [85], [151], [172].

A. Hybrid Deep MLP for Microwave Modeling
A fundamental type of deep neural network is the deep MLP. A deep MLP has at least three hidden layers. Training a deep MLP is hard due to the vanishing gradient problem [141]. A hybrid deep MLP can be used for microwave modeling to further solve the vanishing gradient problem and maintain the smooth outputs and continuous derivatives simultaneously [91]. The hybrid deep MLP technique [91] for microwave modeling uses both sigmoid functions and smooth ReLUs as activation functions, so that the vanishing gradient problem [141] can be overcome, and the number of hidden neurons can be reduced. The smooth ReLU formulated in [91] contains a parameter $\alpha$, whose value is greater than zero. The structure of the hybrid deep MLP model is shown in Fig. 8 [91]. Let $q_s$ represent the number of hidden layers using sigmoid functions, which are the layers close to the input layer, as shown in Fig. 8. Let $q_r$ represent the number of hidden layers using smooth ReLUs, which are the layers close to the output layer, as shown in Fig. 8. The total number of layers in the hybrid deep MLP is expressed as $q_s + q_r + 2$.
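The layer arrangement of the hybrid deep MLP can be sketched as a forward pass. This is an illustration under stated assumptions and not the model of [91]: an ELU-style function is used as a stand-in for the smooth ReLU (with this stand-in the derivative is continuous at zero only for α = 1), the output layer is assumed linear, and the function names, weight initialization, and layer sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smooth_relu(z, alpha=1.0):
    """ASSUMPTION: ELU-style stand-in, z for z >= 0 and alpha*(exp(z)-1) otherwise."""
    return np.where(z >= 0.0, z, alpha * (np.exp(np.minimum(z, 0.0)) - 1.0))

def init_hybrid_mlp(n_in, n_hidden, n_out, q_s, q_r, rng):
    """One (weights, biases) pair per connection between consecutive layers."""
    sizes = [n_in] + [n_hidden] * (q_s + q_r) + [n_out]
    return [(rng.normal(scale=1.0 / np.sqrt(a), size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def hybrid_mlp_forward(x, params, q_s, alpha=1.0):
    """Sigmoid hidden layers near the input, smooth-ReLU hidden layers near the output."""
    h = x
    for i, (W, b) in enumerate(params[:-1]):
        z = h @ W + b
        h = sigmoid(z) if i < q_s else smooth_relu(z, alpha)
    W, b = params[-1]
    return h @ W + b  # linear output layer (assumption)

# Hypothetical sizes: 41 inputs, 20 neurons per hidden layer, 3 outputs.
rng = np.random.default_rng(0)
q_s, q_r = 8, 4
params = init_hybrid_mlp(n_in=41, n_hidden=20, n_out=3, q_s=q_s, q_r=q_r, rng=rng)
y = hybrid_mlp_forward(rng.normal(size=(5, 41)), params, q_s)
n_layers = len(params) + 1  # input layer + hidden layers + output layer
```

Counting the input layer, the hidden layers, and the output layer gives `n_layers == q_s + q_r + 2`, in agreement with the expression above.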
To deal with the vanishing gradient problem that may exist during training of the hybrid deep MLP, a three-stage deep learning algorithm can be used [91]. In Stage I, the deep MLP with $q_s$ sigmoid hidden layers is trained. Usually, $q_s$ is initialized as $q_s = 3$. New sigmoid hidden layers are added until the vanishing gradient problem begins to appear or until the model accuracy satisfies the requirement. If the model accuracy satisfies the requirement, the training is finished. Otherwise, proceed to Stage II. In Stage II, hidden layers with conventional non-smooth ReLUs are added until the accuracy of the deep MLP model satisfies the requirement. In Stage III, all the conventional non-smooth ReLUs in the trained deep MLP model are replaced by the smooth ReLUs. Then, the hybrid deep MLP with sigmoid functions and smooth ReLUs is further trained to refine the model accuracy.

B. Local-Feedback Deep RNN
To model the dynamic behavior of a nonlinear microwave device, RNNs have been explored. The RNN structure contains feedback loops involving delay units. Due to the feedback loops, RNN structures are naturally deep in the time dimension, because the current state of the RNN is related to all of its past states. An example of a deep RNN is the local-feedback deep RNN (LFDRNN) developed to capture more complex time dependencies [133]. The LFDRNN structure consists of multiple local-feedback RNN layers. In the LFDRNN structure, time-delayed local feedbacks are incorporated to improve the convergence of the RNN model [173]. Similar to other ANN structures, the training of the LFDRNN optimizes an objective function by adjusting the neural network weight parameters. To train the LFDRNN, the gradients of the model outputs with respect to the weights should be computed both through time and through layer weights [133]. By introducing the time-delayed local feedbacks, the LFDRNN can improve the ability to model complex nonlinear device behaviors as compared with conventional RNNs.
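The idea of stacking recurrent layers, each with its own delayed feedback, can be sketched as follows. This is a generic illustration and not the LFDRNN formulation of [133]: it assumes a one-step delay, tanh activations, a linear output layer, and hypothetical function names and sizes.

```python
import numpy as np

def lfdrnn_forward(u, layers, W_out):
    """u: (T, n_in) input sequence; layers: list of (W_in, W_fb, b) per layer.

    Each layer feeds its own state from the previous time step back into
    itself (local feedback), so the model has depth in time and across layers.
    """
    states = [np.zeros(W_fb.shape[0]) for (_, W_fb, _) in layers]
    outputs = []
    for t in range(u.shape[0]):
        h = u[t]
        for l, (W_in, W_fb, b) in enumerate(layers):
            # states[l] still holds this layer's state at time t - 1 here.
            states[l] = np.tanh(h @ W_in + states[l] @ W_fb + b)
            h = states[l]
        outputs.append(h @ W_out)  # linear output layer (assumption)
    return np.array(outputs)

# Hypothetical sizes: 1 input, two recurrent layers of 6 neurons, 1 output.
rng = np.random.default_rng(0)
n_in, n_h, n_out = 1, 6, 1
layers = [(0.5 * rng.normal(size=(n_in, n_h)), 0.5 * rng.normal(size=(n_h, n_h)), np.zeros(n_h)),
          (0.5 * rng.normal(size=(n_h, n_h)), 0.5 * rng.normal(size=(n_h, n_h)), np.zeros(n_h))]
W_out = rng.normal(size=(n_h, n_out))

u = np.sin(0.3 * np.arange(10)).reshape(-1, 1)
y = lfdrnn_forward(u, layers, W_out)

u_changed = u.copy()
u_changed[3] += 1.0  # perturb the input at one time step
y_changed = lfdrnn_forward(u_changed, layers, W_out)
```

Perturbing the input at one time step leaves the earlier outputs unchanged and affects the later ones through the stored states, which is the memory effect that makes such structures suitable for dynamic modeling.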

C. Augmented CNN
An example of CNN for microwave CAD is the augmented CNN, developed for behavioral modeling and digital predistortion of concurrent multiband power amplifiers [172]. To reduce the model complexity and improve the linearization performance of the digital predistortion model, the augmented CNN is composed of an input layer, a convolutional layer, a pooling layer, a fully connected layer, and an output layer. In the input layer, the input signals are reshaped into 2-D matrices. The layer following the input layer is the convolutional layer. For the purposes of convolution, sliding convolutional filters are employed. Most of the computations of the CNN structure are performed by the convolutional layer. The convolved outcomes are added with the biases. The convolutional layer is followed by the pooling layer, which is a many-to-one mapping used to decrease the size of the output matrix from the convolutional layer. The pooling layer can also reduce the risk of overfitting during training. The pooling layer is followed by the fully connected layer. The fully connected layer is followed by the output layer, where linear functions are used as the activation functions.
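The sequence of layers described above can be sketched as a plain forward pass. This is a generic illustration rather than the augmented CNN of [172]: the tanh activations, the max pooling, the kernel sizes, and all names and dimensions are assumptions made for the example, and the sliding filter is implemented as cross-correlation, as is common in CNN libraries.

```python
import numpy as np

def conv2d_valid(x, kernel, bias):
    """Slide the kernel over x (no padding) and add the bias to each result."""
    kh, kw = kernel.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel) + bias
    return out

def max_pool(x, size):
    """Many-to-one mapping: keep the maximum of each size-by-size block."""
    H, W = x.shape[0] // size * size, x.shape[1] // size * size
    blocks = x[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

def cnn_forward(signal, shape2d, kernels, biases, W_fc, b_fc, W_out, b_out):
    x = np.asarray(signal, dtype=float).reshape(shape2d)   # input layer: reshape to 2-D
    maps = [np.tanh(conv2d_valid(x, k, b)) for k, b in zip(kernels, biases)]
    pooled = np.concatenate([max_pool(m, 2).ravel() for m in maps])
    hidden = np.tanh(pooled @ W_fc + b_fc)                  # fully connected layer
    return hidden @ W_out + b_out                           # linear output layer

# Hypothetical sizes: 36 input samples reshaped to 6x6, two 3x3 filters.
rng = np.random.default_rng(0)
kernels = [rng.normal(size=(3, 3)) for _ in range(2)]
biases = [0.0, 0.1]
W_fc, b_fc = rng.normal(size=(8, 5)), np.zeros(5)
W_out, b_out = rng.normal(size=(5, 2)), np.zeros(2)
y = cnn_forward(rng.normal(size=36), (6, 6), kernels, biases, W_fc, b_fc, W_out, b_out)
```

Each 6x6 input produces a 4x4 map per filter, which pooling reduces to 2x2, so the fully connected layer receives 8 values instead of 32. This reduction is what lowers the parameter count of the later layers.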

D. Discussions
Table IV compares the different deep neural network techniques. Here, we discuss the advantages of the different deep neural network techniques and the situations in which each is suitable. If the dimension of the model inputs is relatively low, the shallow MLP (i.e., an MLP with one or two hidden layers) is easy to use for modeling. When the dimension of the model inputs becomes high, the deep MLP is needed for accurate modeling. When the training of a deep MLP becomes hard due to the vanishing gradient problem, a hybrid deep MLP can be used to solve the vanishing gradient problem and maintain the smooth outputs and continuous derivatives simultaneously. Since the hybrid deep MLP is a fully connected neural network, the connections between neurons are easy to establish, and the training complexity is relatively low. Compared with the deep MLP, the CNN is a partially connected neural network where the connections are established between subgroups of neurons. The size of the subgroups and the connection patterns between the subgroups affect the structure of the CNN model. Therefore, the CNN has a better ability to extract features/patterns than the deep MLP. Besides, the CNN is specifically good at addressing modeling problems with image-like inputs. For time-domain modeling with memory effects, the RNN is effective to use, since the model outputs are functions of the present and past time series of the model inputs and outputs. Compared with the conventional RNN with depth only in the time dimension, the LFDRNN can capture more complex time dependencies, because it has depth both in the time and in the space dimensions.

E. Application Example of the Hybrid Deep MLP for Microwave Modeling
In this example, the hybrid deep MLP technique is used to develop a parameter-extraction model for the sixth-order multicoupled cavity filter. The center frequency of the filter is 11785.5 MHz, and the bandwidth is 56.2 MHz [174]. The ideal coupling matrix for this filter is  The model inputs are $S_{11}$ in dB at 41 frequency samples, i.e., $x = [\text{dB}(S_{11}(f_1))\ \text{dB}(S_{11}(f_2))\ \cdots\ \text{dB}(S_{11}(f_{41}))]^T$. In this example, each hidden layer contains 200 hidden neurons. After training with the three-stage deep learning algorithm, the structure of the hybrid deep MLP model for this sixth-order example is shown in Fig. 9. In the developed deep neural network model, eight hidden layers employ sigmoid functions, and four hidden layers employ smooth ReLUs. The training and test errors of the developed model are 1.31% and 1.79%, respectively [91].
After the deep MLP parameter-extraction model is developed, a detuned filter is used to examine the developed model. Fig. 10 shows the comparisons between the desired S-parameters and the S-parameters calculated from the extracted coupling matrix (i.e., outputs of the hybrid deep MLP model) for the detuned filter.

A. EM Analysis and Parametric Modeling
EM parametric modeling provides the EM behavior of an EM structure for varying values of its physical/geometrical parameters. Such a model is important in EM-based design. The use of neural networks can help avoid repetitive EM simulations by learning the relationship between the EM response and the varying values of the geometrical parameters in advance. Many research activities have been devoted to such applications, for example, high-speed interconnects [93], CPW circuit components [94], practical multilayered shielded microwave circuits [124], spiral inductors [45], [95], EM interference (EMI) estimation [96], internally decomposed EM structure [97], metasurfaces [175], differential via holes [46], and couplers [98].
Bandler et al. [37] have presented a modeling method of microwave circuits using a combined ANN and space mapping (SM) technique. The SM concept combines the computational efficiency of coarse models (typically empirical functions/equivalent circuit models) with the accuracy of fine models (typically EM/physics models). Besides the SM technique, the RBF-NN models have been presented by Soliman et al. [125] for the efficient filling of the coupling matrix of the method of moments (MoM). Ding et al. [99] have introduced the ANN approach to EM-based modeling of passive components in both frequency and time domains.

B. Behavioral Modeling of Nonlinear Circuits
The behavioral modeling of nonlinear circuits is an interesting problem where the circuit responses are dependent on signal amplitude and terminations as well as frequencies. Simple analytical formulas often do not exist. Therefore, a neural network becomes a good choice for this task. Various ANN techniques have been developed for behavioral modeling of nonlinear circuits. Novel neural network structures have been introduced to represent the dynamic relationships between nonlinear circuit inputs and outputs, for example, RNN [132], [134], [135], DNN [130], nonlinear model-order reduction method [107], neural network-based bidirectional and dispersive behavior models [25], real-valued TDNN (RVTDNN) [137], adjoint neural networks [176], envelope-domain time-series model [177], and the Wiener-type DNN [131].
Gallium nitride (GaN) devices, such as GaN HEMTs, are important for next-generation wireless communication systems and microwave power device applications. However, GaN HEMT devices exhibit strong thermal effects and trapping effects that would deteriorate the performance of RF and microwave circuits. As a consequence, an accurate large-signal model for GaN HEMT devices capable of characterizing the sophisticated thermal and trapping effects is the key to high-reliability RF and microwave circuit design. Examples of ANN-based GaN HEMT modeling are the ANN-based electrothermal model [106], [189], [190], [191], the ANN-based large-signal model [192], [193], the hybrid modeling approach using ANN [194], the SM technique using decomposed mappings [195], and the ANN-based consistent gate charge model [196].

D. Filter Modeling and Design
Filter design is one of the most fundamental parts of microwave CAD. However, the full-wave EM simulation-based approach to filter design is computationally expensive. Various researchers have exploited the ANN approach to help filter modeling and design. For example, homotopy optimization based on the ANN model overcomes the problem that the initial value is not close enough to the optimal solution [197]. An efficient hybrid sampling method for ANN-based microwave component modeling and optimization can achieve an accurate ANN model with limited training data [116]. An efficient ELM with transfer functions improves the learning speed and accuracy [127], [128]. A high-dimensional modeling problem has been addressed in [50] and [198]. A novel deep Q-network (DQN)-based fine-tuning approach is developed to reduce the computation time even with different initial values [143]. Specific KBNN-based approaches are developed for different filter applications, for example, high-power dielectric resonator filters [199], narrow-bandwidth four-pole Ku-band bandpass filters [200], and even multiplexers [201].

E. VLSI Interconnects
Neural network techniques have also been introduced to very large-scale integration (VLSI) interconnects. For example, Veluswami et al. [93] have discussed a neural network-based approach to the EM simulation and optimization of interconnects in high-speed VLSI circuits. Ding et al. [99] have applied ANN to EM-based signal integrity analysis and optimization of a high-speed VLSI interconnect circuit with embedded passive terminations and a nonlinear buffer. Ilumoka [202] has applied a modular ANN model, consisting of groups of networks competing to learn different aspects of a problem, to the complex interactions of 3-D interconnects. To alleviate the time-consuming iterations of the full-wave simulation procedure, Hsu et al. [203] have utilized a neural network-based approach for the efficient design of reflectionless via structures with single-ended interconnections. The deep neural network approach has been applied by Lu et al. [204] to high-speed channel modeling for signal integrity analysis, which avoids both complex circuit simulations and the need for substantial domain knowledge. Chiu et al. [205] have adopted ANN optimization to facilitate the design of signal integrity. Swaminathan et al. [206] have discussed the neural network applications for signal and power integrity in packaging. To overcome the problem that the reliability simulation process takes too much time and resources, Tian et al. [207] have introduced a neural network modeling method to analyze ultralarge-scale integration (ULSI) interconnect reliability.

F. Behavioral Modeling Related to Oscillators
Behavioral modeling of oscillators is an important application of neural network techniques. For example, Yu et al. [117], [118] have developed a modeling method for voltage-controlled oscillators (VCOs) using augmented neural networks to improve computational efficiency. FFNNs are used to predict the oscillatory output waveforms, and RNNs are used to capture the nonlinear dynamic current-voltage relation at the output port. Sen et al. [119] have reported a neural-network-based parasitic modeling and extraction verification for RF/millimeter-wave integrated circuit design. They demonstrate the verification approach using automatically generated passive test structures and ring oscillators. This method helps improve the extraction-verification efficiency and accuracy.

G. Transmitter and Receiver Modeling and Design
Transmitter and receiver modeling and design using neural networks have gained attention in recent years. For example, Yin et al. [208] have discussed the behavioral modeling of millimeter-wave beamforming transmitters. They have utilized concurrent dynamic configurations with heterogeneous neural networks and achieved good performance on both training and validation sets, which provides a promising high-accuracy OTA model for mmW beamforming transmitters with concurrent configurations. Zhang et al. [209] have presented the use of an RVTDNN for the calibration of multiport wideband receivers. They have considered all the system impairments, including the non-ideal six-port correlator, the nonlinearity of the diodes, and the memory effects under the excitation of wideband signals. The performance achieved using the method in [209] for the most commonly used communication signals is better than that of the previous methods.

H. Neural Network-Based Multiphysics Modeling
Multiphysics simulation of microwave components involves the simultaneous solutions of EM and other physics domains [210]. Multiphysics simulation can provide more accurate results but is more time-consuming than pure EM simulation. The ANN technique has also been introduced to the multiphysics modeling and design of microwave components. For example, neural network-based modeling approaches have been introduced to improve the efficiency of multiphysics modeling with limited training samples [100], [101]. The multiphysics optimization technique has also been developed using KBNNs to accelerate the multiphysics design process [102], [103], [104].

I. Further Applications of ANN in Microwave CAD
There are further applications of ANN in microwave CAD. For example, Kabir et al. [120] have discussed a neural network method for RFIC inductors to make the large space model efficient. The neural network model is used to compute coefficient values for the empirical equation, which produces the values of lumped elements of the double Pi network. An ANN-based modeling approach for the design of RF microelectromechanical system (MEMS) switches has been presented by Lee and Filipovic [105] to improve the modeling efficiency while maintaining the modeling accuracy. Gall et al. [211] have applied neural networks to spectrum-based direction-of-arrival estimation for automotive radar to achieve super-resolution performance and the capability of estimating. Jeong et al. [212] have discussed the implementation of a neural network-based machine learning strategy for real-time range-adaptive automatic impedance matching of wireless power transfer (WPT) applications. This method can effectively predict the optimal parameters of a tunable matching network and select the range of adaptive transmitting coils. Viveros-Wacher et al. [121] have applied an analog gross fault diagnosis method based on ANNs and constrained parameter extraction to analog fault diagnosis. Wei et al. [213] have introduced self-attention bidirectional LSTM (Bi-LSTM) networks for radar signal modulation recognition. The developed model has better recognition performance, especially at low SNRs, with much lower computational complexity. Klinefelter and Nanzer [214] have applied the ANN-based approach to design millimeter-wave interferometric radar for automotive velocity sensing to achieve better estimation than the Bayesian estimator.

VIII. CONCLUSION AND FUTURE DIRECTIONS
This article has reviewed the state of the art of the ANN techniques for microwave CAD. Fundamental concepts and advanced techniques of ANN structure and training have been introduced. The KBNN has been presented for improving the accuracy and reliability of modeling and design optimization. Various ANN techniques, such as neuro-TF, neural network inverse modeling, and the hybrid deep neural network technique, have been discussed. Further applications of ANN in microwave CAD have also been reviewed in this article. The literature review in this article covers a sample of the existing works on microwave ANN. However, our review is not exhaustive, and there are more papers in the literature that are also valuable contributions to microwave ANN.
There are tremendous opportunities for further innovations and applications of ANN for microwave modeling and design, from advanced microwave-oriented ANN structures and training algorithms to novel applications. New methods will be needed for ANN models capable of wider input ranges and higher input dimensions while, at the same time, allowing further reductions in the need for large amounts of training data. Since the quality of an ANN inherently depends on the quality and adequacy of the training data, yet the expense of data generation in microwave problems is typically high, further advances in microwave-oriented ANN structures, data sampling, and training methods will remain important. Further advances in knowledge-based modeling methods continue to be another important direction, exploring new ways to incorporate different microwave knowledge into various intermediate parts of neural structures and training, and more generic methods to combine mapping of different subspaces of knowledge and neural network models. Incorporation of EM and multiphysics internal formulations into the neural-based modeling process may lead to new solutions to conquer the expense of EM/multiphysics-based parametric modeling. Design is inherently an inverse problem to simulation. Consequently, exploration of inverse modeling for fast and direct design solutions continues to be an attractive direction. Coupled with this design-oriented direction, there are many design steps that presently rely heavily on human experience and judgment. The exploration of machine learning algorithms to mimic the process of knowledge abstraction and knowledge utilization to accelerate the design process continues to be an open and strategic direction.

ACKNOWLEDGMENT
The author Qi-Jun Zhang would like to thank Prof. Jianguo Ma, Editor-in-Chief of the IEEE TRANSACTIONS ON MICROWAVE THEORY AND TECHNIQUES, for the initial invitation to write this review paper.