Proof of concept of a fast surrogate model of the VMEC code via neural networks in Wendelstein 7-X scenarios

In magnetic confinement fusion research, the achievement of high plasma pressure is key to reaching the goal of net energy production. The magnetohydrodynamic (MHD) model is used to self-consistently calculate the effects the plasma pressure induces on the magnetic field used to confine the plasma. Such MHD calculations serve as input for the assessment of a number of important physics questions. The VMEC code is the most widely used to evaluate 3D ideal-MHD equilibria, such as those prominently present in stellarators. However, considering its computational cost, it is rarely used in large-scale or online applications. Access to fast MHD equilibria is a challenging problem in fusion research, one which machine learning could effectively address. In this paper, we present artificial neural network (NN) models able to quickly compute the equilibrium magnetic field of W7-X. Magnetic configurations that extensively cover the device operational space, and plasma profiles with volume-averaged normalized plasma pressure $\langle \beta \rangle$ ($\beta$ = $\frac{2 \mu_0 p}{B^2}$) up to 5% and non-zero net toroidal current, are included in the data set. By using convolutional layers, the spectral representation of the magnetic flux surfaces can be efficiently computed with a single network. To discover better models, a Bayesian hyper-parameter search is carried out, and 3D convolutional neural networks are found to outperform feed-forward fully-connected neural networks. The achieved normalized root-mean-squared error ranges from 1% to 20% across the different scenarios. The model inference time for a single equilibrium is on the order of milliseconds. Finally, this work shows the feasibility of a fast NN drop-in surrogate model for VMEC, and it opens up new operational scenarios where target applications could make use of magnetic equilibria at unprecedented scales.


Introduction
The computation of magnetohydrodynamic (MHD) equilibria is central in magnetic confinement fusion, where it represents the core component of most modeling and experimental applications. In the stellarator community, the 3D ideal-MHD variational moments equilibrium code (VMEC) [1] is the most widely used, e. g., to infer plasma parameters [2,3], to reconstruct magnetic equilibria [4][5][6][7][8], and to design future devices [9][10][11]. VMEC is also employed for equilibrium studies in perturbed (and hence no longer two-dimensional) axisymmetric configurations [12][13][14][15][16][17][18][19][20][21][22]. However, a single VMEC equilibrium evaluation can take up to O(10) minutes^I even on a high-performance computing (HPC) facility, especially for a reactor-relevant high-β plasma configuration. Table 1 reports the orders of magnitude of VMEC total iterations and wall-clock time typically encountered in target applications. The high computational cost limits an exhaustive exploration of the use case input space. A parallel version of VMEC has recently been developed [23]; however, the wall-clock time of a single free-boundary equilibrium reconstruction, in both stellarator and tokamak scenarios, is still on the order of hours [24,25].
In this paper, we use artificial neural networks (NNs) (see section 2.3) as function approximators to build a fast surrogate model for VMEC. A reduction in run times of up to 6 orders of magnitude can be achieved. The models are trained on VMEC runs from two independent data sets: D config and D β (see 2.2.1). D config includes a wide range of vacuum magnetic configurations, while D β covers a distribution of plasma profiles for a fixed magnetic configuration. To find better models, and to take the human out of the loop, a Bayesian hyper-parameter (HP) search is performed (see section 2.4).
I Run time on the Max Planck computing and data facility (MPCDF) cluster "DRACO", using the small partition and 16 cores.

Table 1:
  Bayesian inference [26]:           10^4;  10^5 - 10^8
  Equilibrium reconstruction [4,25]: 10^0;  10^1 - 10^4
  Stellarator optimization [27]:     10^3;  10^4 - 10^7

Since neural networks extrapolate poorly beyond the training data, a large and experimentally relevant data set is essential for good out-of-sample performance (see section 2.2). Training runs are sampled using the Bayesian scientific modeling framework Minerva [28,29], aiming to reduce the covariate shift between the training and test data sets. The magnetic configurations are sampled from a large hyper-rectangle around the nine Wendelstein 7-X (W7-X) reference configurations [30], while the plasma profiles are modeled as Gaussian processes (GPs) [31], and domain knowledge is embedded in the training data through virtual observations [32][33][34][35].
Since VMEC assumes nested magnetic flux surfaces, magnetic islands in the equilibrium field are not included by design. Furthermore, an ideal coil geometry (i. e., no coil misalignment or electro-magnetic deformations) is considered, while ideal coil currents and plasma profiles (i. e., error-free measurements) are assumed. The relaxation of these assumptions is not in the scope of this paper.
In the past, Sengupta et al. successfully regressed single Fourier coefficients (FCs) of the VMEC output magnetic field, using function parametrization (FP) with quadratic or cubic polynomials for vacuum [36] and finite-β [37] magnetic configurations. The regression of the full VMEC output was broken down into subproblems, where an FP model was derived for each FC, leading to many free parameters to learn. In this work, on top of the previously mentioned components (i. e., physics-like plasma profiles and HP search), the learning task is to infer the full magnetic field geometry with a single multiple-input multiple-output (MIMO) model, where all the VMEC output FCs are regressed at once (see section 2.2.2). Using a single model drastically reduces the number of free parameters to learn, and it forces the NN to efficiently share them among the outputs. Contrary to FP, it is well known that sufficiently wide or deep NNs can approximate a broad class of functions [38][39][40][41][42]. In addition, convolutional neural networks (CNNs), as powerful tools of current deep learning methods, are better suited to extract and reproduce translation-invariant spatial features from grid data, and to share their free parameters between the features while reducing overfitting. Furthermore, from the user standpoint, a single model can more easily be improved, adapted, and deployed.
For real-time plasma control, having access to low-cost magnetic equilibria can improve traditional strategies, and enable completely new data-driven approaches (e. g., reinforcement learning (RL) based control). In fusion research, the use of NN models to compute the plasma topology [43][44][45] and to speed up slow workflows [46][47][48][49][50] is not a novel idea, nevertheless, to our knowledge this paper represents the first which effectively addresses the 3D MHD physics in W7-X scenarios.

Methods
In the following, the relevant concepts and employed methodologies are described.

The VMEC code
The equilibrium problem under the ideal-MHD model is characterized by the force balance equation, Ampère's law and Gauss's law:

$\nabla p = \vec{j} \times \vec{B}$, (1)
$\mu_0 \vec{j} = \nabla \times \vec{B}$, (2)
$\nabla \cdot \vec{B} = 0$. (3)

VMEC uses a variational principle to solve the inverse formulation, which computes the mapping $f : \vec{\zeta} \to \vec{x}$ between flux coordinates $\vec{\zeta} = (s, \theta, \phi)$ — normalized toroidal flux ($s = \Phi/\Phi_{edge}$, where $\Phi(s)$ is the toroidal magnetic flux enclosed between the magnetic axis and the flux surface labeled $s$), poloidal and toroidal angle, respectively — and real-space cylindrical coordinates $\vec{x} = (R, \phi, Z)$, major radius, azimuth and height above mid-plane, respectively. VMEC adopts a spectral representation of $\vec{x}$ along the poloidal and toroidal angles. Assuming stellarator symmetry, the cylindrical coordinates can be expressed as

$R(s, \theta, \phi) = \sum_{m,n} R_{mn}(s) \cos(m\theta - n N_{fp} \phi)$, (4)
$Z(s, \theta, \phi) = \sum_{m,n} Z_{mn}(s) \sin(m\theta - n N_{fp} \phi)$, (5)

where $N_{fp} \in \mathbb{N}$ is the number of field periods. Furthermore,

$\lambda(s, \theta, \phi) = \sum_{m,n} \lambda_{mn}(s) \sin(m\theta - n N_{fp} \phi)$ (6)

is an angle renormalization parameter such that $\theta^* = \theta + \lambda(s, \theta, \phi)$ represents the poloidal angle for which magnetic field lines are straight in $(s, \theta^*, \phi)$ [1]. The equilibrium magnetic field $\vec{B}$ can be written in contravariant form

$\vec{B} = B^s \vec{e}_s + B^\theta \vec{e}_\theta + B^\phi \vec{e}_\phi$, (7)

where $\vec{B} \cdot \nabla p \propto B^s = 0$ under the assumption of nested magnetic flux surfaces. The non-zero components are given by [1]

$B^\theta = \frac{\Phi'}{\sqrt{g}} \left( \iota(s) - \frac{\partial \lambda}{\partial \phi} \right)$, (8)
$B^\phi = \frac{\Phi'}{\sqrt{g}} \left( 1 + \frac{\partial \lambda}{\partial \theta} \right)$, (9)

where $\iota$ is the rotational transform, the prime denotes $\partial/\partial s$, and $\sqrt{g} = (\nabla s \cdot \nabla\theta \times \nabla\phi)^{-1}$ is the Jacobian of the coordinate transformation $f$. The covariant representation of $\vec{B}$ can be obtained from equations (8) and (9) and the metric tensor $g_{ij} = \vec{e}_i \cdot \vec{e}_j = \frac{\partial \vec{x}}{\partial \zeta_i} \cdot \frac{\partial \vec{x}}{\partial \zeta_j}$ as follows:

$B_\theta = g_{\theta\theta} B^\theta + g_{\theta\phi} B^\phi$, (10)
$B_\phi = g_{\phi\theta} B^\theta + g_{\phi\phi} B^\phi$. (11)

Finally, the magnetic field strength is given by

$|B| = \sqrt{B^\theta B_\theta + B^\phi B_\phi}$. (12)

As in the case of $\vec{x}$, the magnetic field strength is described by VMEC using a spectral representation:

$B(s, \theta, \phi) = \sum_{m,n} B_{mn}(s) \cos(m\theta - n N_{fp} \phi)$. (13)

Like in [1], $\vec{x}$ is redefined as $\vec{x} = (R, \lambda, Z)$, where the angle renormalization parameter $\lambda$ replaces the toroidal angle $\phi$.
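For illustration, the spectral representation of equations (4) to (6) and (13) can be evaluated on a $(\theta, \phi)$ grid as in the following numpy sketch. The coefficients below are hypothetical (a circular-cross-section surface), not VMEC output; only the mode conventions follow the equations above.

```python
import numpy as np

def eval_fourier(coeffs, m, n, theta, phi, n_fp=5, parity="cos"):
    """Evaluate a stellarator-symmetric Fourier series, cf. eqs. (4)-(6), (13).

    coeffs, m, n: 1D arrays of Fourier coefficients and their mode numbers
    theta, phi:   1D arrays of poloidal and toroidal angles
    """
    # angle kernel m*theta - n*N_fp*phi, broadcast over the (theta, phi) grid
    angle = (m[:, None, None] * theta[None, :, None]
             - n[:, None, None] * n_fp * phi[None, None, :])
    basis = np.cos(angle) if parity == "cos" else np.sin(angle)
    return np.einsum("k,kij->ij", coeffs, basis)

# Hypothetical single-surface example: R = R0 + a*cos(theta), Z = a*sin(theta)
theta = np.linspace(0.0, 2 * np.pi, 18, endpoint=False)
phi = np.linspace(0.0, 2 * np.pi / 5, 9, endpoint=False)
R = eval_fourier(np.array([5.5, 0.5]), np.array([0, 1]), np.array([0, 0]),
                 theta, phi, parity="cos")
Z = eval_fourier(np.array([0.5]), np.array([1]), np.array([0]),
                 theta, phi, parity="sin")
```

The same evaluation, applied per flux surface, is what maps the regressed FCs back to real-space quantities.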

Data set generation
To generate a large and W7-X relevant data set of magnetic configurations and plasma profiles, Minerva [28,29] is used. Within Minerva, models are described as directed, acyclic graphs. Each node can be deterministic (e. g., a diagnostic model or a physics code) or probabilistic (e. g., plasma parameters or diagnostic observed quantities). The edges define the dependencies between nodes. Model free parameters can be described via probabilistic nodes, where the node a priori distribution encodes the domain knowledge on the parameter. In the forward mode, observed quantities can be computed, while in the inverse mode, the model free parameters can be inferred with different inversion techniques (e. g., maximum a posteriori (MAP) and Markov chain Monte Carlo (MCMC) methods).
Using Minerva to generate physics relevant samples for NNs training has already been explored [51]. Here, a VMEC node is included in a Minerva model. Free parameters are represented by the magnetic configuration and the plasma profiles. The model is relatively simple and can be built as a stand-alone object, yet Minerva allows embedding domain knowledge (i. e., the prior distribution of the model free parameters) in the NN surrogate by reducing the covariate shift between the training data set D and the target application data set D target . This approach is similar to that described in [52], where experimental data have been used to populate the training data set. However, in this work, experimental data are not used directly, but simulated data drawn from experimentally validated distributions are used instead. This allows a dense coverage of the input parameter space, while restricting its extension to physically relevant regions only.
W7-X possesses an $N_{fp} = 5$-fold stellarator symmetry, i. e., the main coil system comprises five identical modules, each of which is point-symmetric with respect to the module center (see Section 2.1). The resulting magnetic field has a five-fold symmetry along the toroidal direction. Each half module includes five different non-planar and two planar coils. The vacuum field depends only on the currents $I_{1\ldots5}$ and $I_{A,B}$, respectively the currents in the non-planar and the planar coils. Except for a scaling of the magnetic field strength, the vacuum magnetic configuration does not depend on the absolute values of the coil currents but only on their ratios with respect to $I_1$, $i_{2\ldots5}$ and $i_{A,B}$. The current ratios are uniformly sampled from a hyper-rectangle whose boundaries are provided in Table 2. These boundaries cover the nine reference configurations of W7-X [30], while extending to a larger set of conceivable configurations. To obtain a magnetic field strength of approximately 2.5 T on axis, at ϕ = 0 and for the standard configuration, the normalization coil current $I_1$ is set to $I_1$ = 13 770 A.
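The configuration sampling can be sketched as below. The bounds are hypothetical placeholders for the hyper-rectangle boundaries of Table 2, which are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical bounds for the six independent current ratios (i2..i5, iA, iB);
# the actual hyper-rectangle boundaries are those given in Table 2.
low = np.array([0.9, 0.9, 0.9, 0.9, -0.3, -0.3])
high = np.array([1.1, 1.1, 1.1, 1.1, 0.3, 0.3])

def sample_configs(n):
    """Draw n vacuum configurations, each a vector (i2, i3, i4, i5, iA, iB)."""
    return rng.uniform(low, high, size=(n, 6))

ratios = sample_configs(1000)
# Scale back to physical coil currents with the normalization current I1
I1 = 13770.0  # [A], ~2.5 T on axis for the standard configuration
coil_currents = np.column_stack([np.full(1000, I1), ratios * I1])
```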
The plasma profiles cover a broad range of W7-X discharge scenarios, and include plasma pressure on axis up to 200 kPa, corresponding to volume-averaged β of approximately 5 %, and a net toroidal current ranging from −10 to 10 kA. The profiles are defined as a function of the normalized toroidal flux s: p(s) is the pressure of the plasma at the flux surface labeled s, and I(s) is the enclosed toroidal current flowing inside the surface s. With this definition, I(s = 1) is the total toroidal current in the plasma, which we will refer to as I tor .
Theoretically, all possible continuous functions for s ∈ [0, 1] should be sampled. The exploitation of domain knowledge obtained from experience with W7-X discharges allows us to restrict the function space of the profiles, and to sample the region of interest more densely compared to an unconstrained parametrization. The profile shapes are modeled via GPs [53], stochastic processes for which every finite, linear combination of random variables is jointly distributed as a multivariate Gaussian. GPs are usually employed in the modeling context, as they can be seen as distributions over functions. For example, a one-dimensional function $f : s \in \mathbb{R} \to \mathbb{R}$ with a GP prior is distributed as

$f(s) \sim \mathcal{GP}(\mu(s), \Sigma(s, s'))$,

where $\mu(s)$ is the mean function and $\Sigma(s, s')$ is the covariance function. Then, for a set $S_* := \{s \in \mathbb{R}\}$, the corresponding function values $F_*$ are distributed as

$F_* \sim \mathcal{N}(\mu(S_*), \Sigma(S_*, S_*))$,

where the covariance function matrix is computed elementwise.
In Minerva, plasma profiles are usually specified as GPs with zero mean, which does not restrict the mean of the posterior process to be zero [53], and a squared exponential covariance function [54]. In particular, since profiles can have substantially different gradients in the core and edge regions [55], a non-stationary covariance function [56] is used [57]. Here, the GP mean and covariance functions are

$\mu(s) = 0$,
$\Sigma(s, s') = \sigma_f^2 \sqrt{\frac{2 \sigma_x(s) \sigma_x(s')}{\sigma_x(s)^2 + \sigma_x(s')^2}} \exp\left( -\frac{(s - s')^2}{\sigma_x(s)^2 + \sigma_x(s')^2} \right) + \sigma_y^2 \delta_{ss'}$,

where $\sigma_y$ is usually fixed to $\sigma_y = 10^{-3} \sigma_f$ [34], and $\sigma_x$, which represents the length scale function, is a hyperbolic tangent:

$\sigma_x(s) = \frac{l_{core} + l_{edge}}{2} - \frac{l_{core} - l_{edge}}{2} \tanh\left( \frac{s - s_0}{s_w} \right)$,

where $l_{core}$ and $l_{edge}$ are the core and edge length scales, respectively, $s_0$ is the transition location and $s_w$ represents the length scale of the transition. The domain knowledge on the plasma profiles is encoded via the HPs of the GP used to represent them, which define the distributions from which the profiles are drawn. The values of the GP HPs are uniformly sampled from a hyper-rectangle, whose boundaries are given in Table 2. These values are adapted from previous works where the plasma profiles in W7-X are modeled via GPs [34,49,54,58].
The profiles are further constrained by the use of virtual observations [34], such that the GP prior is refined with "virtual diagnostic measurements", described by a normal distribution. As usually observed in W7-X experiments, the electron and ion density and temperature profiles are peaked^II in the core [59][60][61]. Therefore, the normalized pressure profile is constrained to 0 at the last closed flux surface (LCFS) and 1 on axis. Conversely, the normalized toroidal current profile is set to 0 on axis, and 1 at the LCFS. Figure 1 shows a subset of the normalized plasma profiles, which are independently sampled from the two refined GPs. Finally, the profiles are scaled to the desired values: the pressure profile is multiplied by p 0 , the pressure value on axis, and I tor , which is the total toroidal current enclosed by the plasma, is provided as input parameter to VMEC.
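The profile sampling described above can be sketched in numpy as follows, assuming the Gibbs form of the non-stationary squared exponential kernel and illustrative HP values; the virtual observations are implemented as a GP conditioning that pins the normalized pressure to 1 on axis and 0 at the LCFS.

```python
import numpy as np

def length_scale(s, l_core=0.3, l_edge=0.1, s0=0.7, sw=0.1):
    # hyperbolic-tangent length scale: l_core near the axis, l_edge near the LCFS
    return 0.5 * (l_core + l_edge) - 0.5 * (l_core - l_edge) * np.tanh((s - s0) / sw)

def gibbs_kernel(sa, sb, sigma_f=1.0):
    # non-stationary squared exponential (Gibbs) covariance function
    la, lb = length_scale(sa)[:, None], length_scale(sb)[None, :]
    l2 = la**2 + lb**2
    return sigma_f**2 * np.sqrt(2 * la * lb / l2) * np.exp(-((sa[:, None] - sb[None, :]) ** 2) / l2)

def sample_profiles(n, s, s_obs, y_obs, sigma_y=1e-3, seed=0):
    """Sample zero-mean GP profiles conditioned on virtual observations (s_obs, y_obs)."""
    K_oo = gibbs_kernel(s_obs, s_obs) + sigma_y**2 * np.eye(len(s_obs))
    K_so = gibbs_kernel(s, s_obs)
    K_ss = gibbs_kernel(s, s)
    mean = K_so @ np.linalg.solve(K_oo, y_obs)
    cov = K_ss - K_so @ np.linalg.solve(K_oo, K_so.T)
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov + 1e-8 * np.eye(len(s)), size=n)

# Normalized pressure profiles: pinned to 1 on axis and 0 at the LCFS
s = np.linspace(0, 1, 50)
profiles = sample_profiles(5, s, s_obs=np.array([0.0, 1.0]), y_obs=np.array([1.0, 0.0]))
```

Scaling by $p_0$ (or providing $I_{tor}$ to VMEC) then follows as in the text.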
All VMEC calculations are performed in free boundary mode, where the confined region is characterized by the total enclosed magnetic toroidal flux, Φ edge = Φ(s = 1). Given the large input space, VMEC runs which are not relevant for W7-X, e. g., runs that did not converge or that exhibit values for the plasma volume and minor radius outside the boundaries given in Table 3, are discarded.

Training scenarios
To decouple the regression complexity of the 3D ideal-MHD equilibrium from the vacuum field computation, the problem is broken down into two different scenarios, a null-β and a finite-β case, which lead to two independent data sets, D config and D β . D config is populated with vacuum magnetic configurations, i. e., pressure and plasma current profiles are constantly 0. This scenario targets two applications: discharges with low β values, which could be effectively studied with a vacuum field, and further investigations of the properties of the vacuum configurations of W7-X. In particular, the use of a slightly modified model is envisioned to further explore the richness of the vacuum magnetic configurations of W7-X, searching for optimized equilibria in terms of, e. g., neoclassical transport via the effective helical ripple amplitude $\epsilon_{eff}$ [62] or ideal-MHD stability via the magnetic well [63]. In D β , the standard magnetic configuration (EJM+252) [30] is fixed, and the data set is populated with plasma profiles as described in Section 2.2. This scenario covers discharges with volume-averaged β up to 5 % and net toroidal current up to 10 kA. The numbers of VMEC simulations for the two scenarios are 11 360 and 11 709, respectively. Of these, only 10 339 and 9675 converged. Finally, after filtering out the equilibria based on the ranges given in Table 3, the data sets contain |D config | = 9589 and |D β | = 9332 valid runs, respectively.

II A globally decreasing function of the radial profile, not to be confused with a high "peaking factor" as used in the fusion community.
In this work, the two dimensions that characterize a W7-X magnetic configuration, the vacuum field geometry and the plasma profiles, are independently explored in D config and D β . Given the large vacuum magnetic configuration space probed in D config and the relatively low values of β included in D β , the spread of the FCs describing the equilibrium field is expected to be higher in D config than in D β . Furthermore, W7-X is an optimized stellarator where the plasma influence on the magnetic configuration has been strongly reduced by the minimization of the bootstrap current and the Shafranov shift [64]. Hence, the equilibrium field coefficients are expected to be smooth functions of the main parameters characterizing the plasma, p 0 and I tor , in contrast to the flexibility of the vacuum magnetic configurations of W7-X. Looking ahead to the next steps of this proof of concept, working models in these two extreme cases can give valuable insights on the use of NNs for the regression of the equilibrium magnetic field in an arbitrary finite-β configuration.

Models inputs and outputs
In D config , the inputs are represented by Φ edge and the six independent coil current ratios, while in D β , Φ edge , p 0 , I tor , and the normalized pressure and toroidal current profiles are used. In both scenarios, the regressed outputs are the iota profile ι(s), the Fourier series of the flux surface coordinates, represented by R mn (s), λ mn (s) and Z mn (s), and the Fourier series of the magnetic field strength, B mn (s). The output FCs are regressed instead of the real-space values for the following reasons: first and foremost, the Fourier series profiles are a compressed representation of the magnetic field, thus letting the network learn a reduced number of independent outputs. Furthermore, we seek to replace VMEC with a model having a similar input and output signature as the original code, such that our application can serve as a drop-in replacement for existing use cases. For example, in the context of the application of this work to the inference of plasma parameters, the flux surface coordinates are needed to map real-space diagnostic measurements to flux coordinates [54,65], and the magnetic field strength plays a crucial role in the analysis of many diagnostics (e. g., electron cyclotron emission (ECE) [26]).
For the generation of the data set, the resolution of the VMEC output is set to N s = 99 flux surfaces and m pol = n tor = 12, where |m| < m pol and |n| ≤ n tor are the poloidal and toroidal Fourier modes, respectively. Since all the outputs are real quantities, $o_{mn} = (o_{-m,-n})^*$ for o ∈ {R, λ, Z, B}. This limits the independent FCs to a subplane (usually m ≥ 0). Despite this symmetry consideration, 28 512 coefficients remain per output^III. Figure 2 shows the FCs of the three coordinates for one sample in the data set, evaluated at the LCFS. However, it has been argued that m pol = n tor = 6 modes are sufficient to represent the magnetic field in case of W7-X configurations [37]. In this work, the sufficient Fourier resolution is further investigated. For the radial profile, a subset of $\bar{N}_s$ flux surfaces is selected, while up to $\bar{m}_{pol}$ and $\bar{n}_{tor}$ poloidal and toroidal modes are used for the FCs. To more densely cover the plasma region near the axis, the flux surfaces are selected such that their radial locations s follow a quadratic progression in [0, 1]. To compute the loss of information due to the downscaling, the reduced representation is upscaled to match the full resolution by setting $x_{mn} = 0$ for x ∈ {R, λ, Z} if $m \geq \bar{m}_{pol}$ or $|n| > \bar{n}_{tor}$. Then, the outputs R, λ, Z and B are evaluated with equations (4) to (6) and (13) on a grid along the θ and ϕ angles, using N θ = 18 poloidal and N ϕ = 9 toroidal points per period. Finally, the full radial resolution is recovered by cubic interpolation along s. Similarly, a reduced resolution of the iota profile is investigated, using $\bar{N}_s$ flux surfaces (the same as those employed for x and B). To compare the reduced to the full resolution, the iota profile is then upscaled via cubic interpolation.

III The number of FCs per output scales as O(N s · m pol · n tor ).
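The mode-truncation upscaling step can be sketched as a zero-padding of the reduced coefficient array back to the full resolution. The array layout and sizes below are illustrative assumptions, not the paper's actual data format.

```python
import numpy as np

def upscale_modes(coeff_red, m_pol=12, n_tor=12):
    """Zero-pad a reduced (N̄_s, m̄_pol, 2*n̄_tor+1) coefficient array to the
    full (N_s subset, m_pol, 2*n_tor+1) resolution, i.e., x_mn = 0 for all
    modes dropped by the truncation."""
    ns, m_red, n_red = coeff_red.shape  # n_red = 2*n̄_tor + 1
    n_bar = (n_red - 1) // 2
    full = np.zeros((ns, m_pol, 2 * n_tor + 1))
    # toroidal modes are stored as n = -n_tor ... n_tor, so center the reduced block
    full[:, :m_red, n_tor - n_bar:n_tor + n_bar + 1] = coeff_red
    return full

# Hypothetical reduced representation of R_mn: N̄_s = 10, m̄_pol = 6, n̄_tor = 4
r_red = np.random.default_rng(0).normal(size=(10, 6, 9))
r_full = upscale_modes(r_red)
```

The upscaled coefficients can then be evaluated on the (θ, ϕ) grid via equations (4) to (6) and (13), followed by cubic interpolation along s.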
Given a set $Y = \{y \in \mathbb{R}^K\}$ of generic quantities y with true or reference value $y^*$, the root-mean-squared error (rmse) between y and $y^*$ is computed as

$\mathrm{rmse}(Y) = \sqrt{\frac{1}{|Y| K} \sum_{i=1}^{|Y|} \sum_{k=1}^{K} (y_{ik} - y^*_{ik})^2}$.

Here it is used to compare the two resolutions, where for each output, y is the reduced-resolution representation of $y^*$, and K is the number of evaluation points: $K_\iota = \bar{N}_s$, while K for the flux surface coordinates and the magnetic field strength is given by the number of grid evaluation points. Figure 3 shows the rmse for different values of the resolution parameters.
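A possible implementation of the rmse, together with the normalized variant used later in the Results section (here assumed to be normalized by the spread of the reference data), is:

```python
import numpy as np

def rmse(y, y_star):
    """Root-mean-squared error over |Y| samples with K evaluation points each;
    y and y_star have shape (|Y|, K)."""
    y, y_star = np.asarray(y), np.asarray(y_star)
    return np.sqrt(np.mean((y - y_star) ** 2))

def nrmse(y, y_star):
    """rmse normalized by the spread of the reference data around its
    per-point mean (one possible convention for the nrmse of the Results
    section; the exact normalization is an assumption here)."""
    y, y_star = np.asarray(y), np.asarray(y_star)
    denom = np.mean((y_star - y_star.mean(axis=0)) ** 2)
    return np.sqrt(np.mean((y - y_star) ** 2) / denom)
```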
In the case of the iota profile, $\bar{N}_s = 20$ flux surfaces are sufficient for a deviation of approximately $\mathrm{rmse}^*_\iota = 10^{-4}$; $\bar{N}_s = 10$, $\bar{m}_{pol} = 6$ and $\bar{n}_{tor} = 4$ are needed for the flux surface coordinates to achieve $\mathrm{rmse}^*_{R,Z} = 10^{-3}$ m and $\mathrm{rmse}^*_\lambda = 10^{-3}$ rad; and $\bar{N}_s = 10$, $\bar{m}_{pol} = 6$ and $\bar{n}_{tor} = 12$ are used for B to obtain $\mathrm{rmse}^*_B = 10^{-3}$ T. These choices result in 20 locations for the iota profile, while 1500 FCs describe the flux surface coordinates and the magnetic field strength^IV. This resolution represents a practical trade-off between the complexity of the regression task and the reconstruction fidelity. It is important to note that $\mathrm{rmse}^*_\iota$, $\mathrm{rmse}^*_{R,Z}$, $\mathrm{rmse}^*_\lambda$ and $\mathrm{rmse}^*_B$ represent a lower bound on the reconstruction error that can be achieved by the models presented in this work.
Given the two data sets, D config and D β , and the three output quantities, ι, x and B, six independent regression tasks are defined: config-iota and β-iota, config-surfaces and β-surfaces, and config-B and β-B. In the config-iota and β-iota tasks, a NN is trained to compute the reduced-resolution iota profile, using respectively D config and D β as data set. Similarly, in the config-surfaces, β-surfaces, config-B and β-B tasks, a NN is trained to compute the FCs of the reduced-resolution x or magnetic field strength B, using D config and D β , respectively.
Considering the scope of this paper which attempts to develop a VMEC proof-of-concept surrogate model, it is useful to investigate the performance on independent subproblems. In future works, a single NN could be trained to compute all outputs and to handle both vacuum and finiteβ runs.

NN architectures
In general, given two quantities $\vec{\psi} \in \mathbb{R}^K$ and $\vec{\gamma} \in \mathbb{R}^D$, and a set of N observations $(\vec{\psi}_i, \vec{\gamma}_i)$ sampled from a fixed but unknown distribution $p(\vec{\psi}, \vec{\gamma})$, a NN, parametrized with a set of free parameters w, can be employed to learn a mapping $\hat{f} : \mathbb{R}^K \to \mathbb{R}^D$ which minimizes the empirical loss

$L(w) = \frac{1}{N} \sum_{i=1}^{N} \| \hat{f}(\vec{\psi}_i; w) - \vec{\gamma}_i \|^2$.

In this work, the expressive power of NNs is exploited to learn a low-cost approximation of a known function, using observations sampled from a known distribution. A NN usually employs successive layers of artificial neurons to create the mapping $\hat{f}$, where each neuron computes a nonlinear transformation of the neurons from the previous layer. The NN free parameters w are derived during the training process to minimize the empirical loss on the given training set. For a detailed introduction to NNs, please refer to [66]. Two NN architectures are adopted herein. One is a feed-forward fully-connected neural network (FF-FC), which is composed of a sequence of dense blocks, each comprising a dense layer with L 2 regularization and a non-linear activation function. The number of hidden units is halved for each successive block. The activation function for the last block is the identity. Figure 4 illustrates the architecture for the config-iota task, where a network with five of such blocks is shown.
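As a sketch, the FF-FC architecture for the config-iota task could be written in Keras as follows. The layer sizes, activation and regularization strength are illustrative; in the paper these are set by the HP search.

```python
import tensorflow as tf

def build_ff_fc(n_inputs=7, n_outputs=20, units0=256, n_blocks=5, l2=1e-4):
    """FF-FC network: dense blocks with L2 regularization, halving the
    hidden units per block, identity activation on the last block.
    Sizes are illustrative assumptions, not the searched HP values."""
    inputs = tf.keras.Input(shape=(n_inputs,))  # e.g., Phi_edge + 6 current ratios
    x = inputs
    units = units0
    for _ in range(n_blocks - 1):
        x = tf.keras.layers.Dense(
            units,
            activation="selu",
            kernel_regularizer=tf.keras.regularizers.l2(l2),
        )(x)
        units //= 2
    # last block: linear activation to regress the 20 iota locations
    outputs = tf.keras.layers.Dense(n_outputs, activation=None)(x)
    return tf.keras.Model(inputs, outputs)

model = build_ff_fc()
```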
The FF-FC architecture is used on the iota reconstruction, where the regressed output is composed of only 20 elements. However, its number of free parameters grows linearly with the dimensionality of the output. Thus, more efficient architectures are needed for the surfaces and magnetic field strength reconstruction, where each sample has 1500 output elements. Hence, 3D CNNs [67,68] and encoder-decoder like architectures [69][70][71] are explored. In these architectures an encoder processes variable-length input features and generates a fixed-length, flattened representation. Conditioned on the encoded representation, the decoder then builds the required outputs.
For these tasks, the x coordinates are stacked. Figure 5 displays an example of such an architecture for the β-surfaces task, where a CNN architecture with transposed convolutions is used. In the encoder tree, high-level features are extracted from the plasma profiles via consecutive 1D convolutional blocks and concatenated back with the scalar inputs into a flattened representation. Then, a decoder tree gradually builds up the output via consecutive transposed convolutional blocks. Finally, the output shape is matched via a 3D cropping operation. Each encoder block comprises a 1D convolutional layer, batch normalization [72], and a non-linear activation function with dropout [73]. Similarly, a decoder block is composed of a 3D transposed convolutional layer, batch normalization, and a non-linear activation function with dropout. For the last block, batch normalization is not included and the identity activation function is used. For each block, the number of filters is doubled in the encoder tree, while halved in the decoder tree. Convolutional layers with stride are employed over up-sampling operations, as suggested by [74].
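A simplified Keras sketch of such an encoder-decoder architecture is given below. Dropout is omitted for brevity, and the input sizes, filter counts and the target shape (here a hypothetical 10 × 6 × 25 grid of $B_{mn}$ coefficients) are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_encoder_decoder(profile_len=20, n_scalars=3):
    """Encoder: strided 1D convolutions over the stacked profiles.
    Decoder: strided 3D transposed convolutions building the (s, m, n)
    coefficient volume, cropped to the reduced FC resolution."""
    profiles = tf.keras.Input(shape=(profile_len, 2))  # normalized p(s), I(s)
    scalars = tf.keras.Input(shape=(n_scalars,))       # e.g., Phi_edge, p0, I_tor

    x = profiles
    for filters in (16, 32):  # doubling filters in the encoder tree
        x = layers.Conv1D(filters, 3, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("selu")(x)
    x = layers.Concatenate()([layers.Flatten()(x), scalars])

    x = layers.Dense(2 * 1 * 4 * 64)(x)  # seed of the decoder volume
    x = layers.Reshape((2, 1, 4, 64))(x)
    for filters in (32, 16):  # halving filters in the decoder tree
        x = layers.Conv3DTranspose(filters, 3, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("selu")(x)
    # last block: no batch normalization, identity activation
    x = layers.Conv3DTranspose(1, 3, strides=2, padding="same", activation=None)(x)
    # (16, 8, 32) -> crop to the hypothetical reduced resolution (10, 6, 25)
    x = layers.Cropping3D(((3, 3), (1, 1), (3, 4)))(x)
    return tf.keras.Model([profiles, scalars], x)

model = build_encoder_decoder()
```

Strided (transposed) convolutions double or halve each spatial dimension per block, so the final cropping layer trims the generated volume to the exact output grid.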
The stacking of consecutive convolutional layers acting on inputs of different length scales, in conjunction with a scaling of the feature channels, is a common approach in modern deep convolutional neural network (DCNN) architectures. The use of the SELU activation function, as discovered by the HP search, leads to whitened layer input distributions, which improve the convergence of the training process [79].
This structure decreases the number of free parameters by forcing the model to learn a hierarchical representation of high-and low-level features, while imposing a regularizing effect during training. A subset of the NN architecture HPs is not fixed a priori, but optimized via HP search. The lists of the explored HPs (e. g., the layer non-linear activation function) are provided in section 6.1.
In both architectures all weights are uniformly initialized as suggested by [75], while the bias terms, where present, are initialized to zero. The weights are then optimized via the Adam optimizer [76], while reducing the learning rate by a fixed multiplier factor once a validation loss plateau is reached. Early stopping [77] is employed during training. The NN models are built, trained and evaluated via the open source software package Tensorflow [78] on a single NVIDIA RTX8000P virtual graphical processing unit (GPU).

Training and evaluation pipeline
For each task defined in Section 2.2, the training and evaluation pipeline consists of an HP search followed by a repeated k-fold cross-validation of the best-performing model.

Hyper-parameter search
The large number of HPs and candidate architectures makes a manual model optimization procedure hardly effective. Therefore, to standardize the search for better-performing models, an automated approach to HP search is used in this work. In particular, the tree-structured Parzen estimator (TPE) [82] algorithm, provided via the open source software package hyperopt [83], is employed. TPE is a sequential model-based optimization (SMBO) algorithm, where the true fitness function, e. g., the model training and evaluation, is approximated with a proxy model that is cheaper to evaluate. The proxy model is then numerically optimized to retrieve new configurations to be evaluated. In contrast to other SMBO strategies, where the fitness function is directly learned, TPE models the distribution of configuration values given classes of optimal and non-optimal fitness function values. It then optimizes the expected improvement (EI) criterion [84] with a heuristic procedure. Its main advantage over other HP search approaches is the sampling efficiency on tree-structured configuration spaces [82], i. e., spaces in which not all dimensions are well-defined for all configurations (e. g., the number of hidden units in the second layer of a single-layer FF-FC model). For a detailed description of the algorithm, please refer to [82]. On each learning task, 30 search iterations are performed. The data set is split into 20 % for testing, 10 % for validation and 70 % for training. For each search iteration, the training data is used to train the model, while the validation data is used to assess the model regression error and inform the search strategy. The best-performing model is then adopted in the cross-validation scheme. To ease the computational cost of the search, a simple mean-squared error (mse) loss is used for training and HP validation.

Repeated k-fold cross-validation
To estimate the regression error on out-of-sample data, a five-fold cross-validation evaluation is repeated 10 times. In a k-fold cross-validation scheme [85,86], the data set is partitioned into k folds of equal cardinality. Then, the training process is repeated k times, each time using one fold as test set and the remaining folds for the training and validation sets. The estimate of the regression error is the average of the test error on each fold. However, the cross-validation estimate of the regression error can be highly variable due to the single partition of the data set into the k folds [87]. To overcome this limitation, in the repeated k-fold cross-validation scheme, the k-fold cross-validation is repeated n times, partitioning the data set into different k folds each time. The average of the test error on each fold is then used as the final estimate.
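The scheme can be sketched as a plain numpy split generator; this is only a sketch of the partitioning, with the paper's k = 5 folds and n = 10 repetitions as defaults.

```python
import numpy as np

def repeated_kfold(n_samples, k=5, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation:
    the data set is re-partitioned into k folds n_repeats times."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        perm = rng.permutation(n_samples)
        for fold in np.array_split(perm, k):
            test = np.sort(fold)
            train = np.setdiff1d(perm, fold)  # already sorted and unique
            yield train, test

splits = list(repeated_kfold(100))  # 5 folds x 10 repetitions = 50 splits
```

The final error estimate is then the average test error over all 50 splits.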

Data and code availability
The data sets and code needed to reproduce this work are available at https://gitlab.com/amerlo94/vmecfastsurrogate.

Results
The results achieved on each task are now presented. It is important to remember that D β includes plasma profiles for a fixed magnetic configuration (the standard configuration), while D config explores the rich space of W7-X vacuum magnetic configurations. The changes in D β , induced by finite-β effects, are then small compared to those in D config , induced by coil currents (i. e., finite-β effects span a space that only slightly expands around the vacuum solution). Therefore, the spread of the output data in the finite-β cases is smaller than in the vacuum scenarios: the coil system of W7-X has been designed to allow a large flexibility in the vacuum magnetic configuration space [88,89], while the W7-X optimization explicitly targeted robustness against changes in plasma profiles, in particular pressure profiles [64,90]. These features are expected to make the output data in the finite-β tasks more difficult to resolve because of the smaller spread. Therefore, to quantitatively compare the results across all tasks, the normalized root-mean-squared error (nrmse) is used instead. Given $Y = \{y \in \mathbb{R}^K\}$ (see section 2.2.2), the nrmse between the predicted y and the true or reference $y^*$ is computed as

$\mathrm{nrmse}(Y) = \sqrt{\frac{\sum_{i} \sum_{k} (y_{ik} - y^*_{ik})^2}{\sum_{i} \sum_{k} (y^*_{ik} - \bar{y}^*_{k})^2}}, \quad \bar{y}^*_k = \frac{1}{|Y|} \sum_{i} y^*_{ik}$.

The ι profile is evaluated along the radial profile with $\bar{N}_s$ flux surfaces, so $K_\iota = 20$. As employed in section 2.2.2, an evaluation grid with N θ = 18 and N ϕ = 9 is used for the flux surface coordinates and the magnetic field strength. The use of the nrmse allows us to aggregate the regression error on the three flux surface coordinates, and to compare the results across all outputs and scenarios. Table 4 summarizes the results for all tasks. As expected, on each output, the nrmse in the vacuum scenario is lower compared to the finite-β case. Moreover, a nrmse below 10 % is consistently achieved for the ι profile and the magnetic field strength. In the flux surface coordinate tasks, nrmse values between 14 % and 20 % are achieved instead.

Iota regression
Given the relatively small size of the data sets and of the NNs, the model training time is on the order of minutes, and in any case less than an hour. More importantly, the inference time, even in the most conservative evaluation (i.e., with a single thread on one CPU core and a batch size of 1), is on the order of a few milliseconds. Parallel computation (e.g., batched inference and GPU deployment) and pre- and post-training optimizations (e.g., model pruning and quantization) are expected to deliver speed-ups of several orders of magnitude [92,93]. These optimizations are beyond the scope of this paper.
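As a rough illustration of the conservative timing setup (single thread, batch size of 1), the following sketch times a small, hypothetical fully-connected network in NumPy; it is not the actual model, but shows why millisecond-scale single-sample inference is plausible for networks of this size:

```python
import time
import numpy as np

rng = np.random.default_rng(42)

# hypothetical FF-FC network: 14 inputs -> two hidden layers -> 20 outputs
shapes = [(14, 64), (64, 64), (64, 20)]
W = [rng.normal(size=s) * 0.1 for s in shapes]
b = [np.zeros(s[1]) for s in shapes]

def forward(x):
    for Wi, bi in zip(W[:-1], b[:-1]):
        x = np.tanh(x @ Wi + bi)  # hidden layers with tanh activation
    return x @ W[-1] + b[-1]      # linear output layer

x = rng.normal(size=(1, 14))      # a single input sample (batch size 1)
forward(x)                        # warm-up call

n_calls = 1000
t0 = time.perf_counter()
for _ in range(n_calls):
    forward(x)
dt_ms = (time.perf_counter() - t0) / n_calls * 1e3
print(f"~{dt_ms:.3f} ms per single-sample forward pass")
```

Batching many samples into one call (e.g., an `x` of shape `(1024, 14)`) amortizes the per-call overhead, which is one form of the parallel-computation speed-up mentioned above.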
Tables 5 to 10 list the HP values for the best performing models discovered via the HP search (see section 2.4). As reported in table 4, the nrmse obtained during the search is compatible with the nrmse estimated via cross-validation (i.e., its value is within the 95 % interval of the distribution). This means that the HP search procedure did not overfit to the validation data (which would manifest as a high variance of the model error on unseen data); instead, HP values which perform well on the whole data set were found.

Iota regression
In the following, the fidelity of the different NNs is inspected in closer detail and the major influences on the regression error are identified. Figure 6 shows the rmse profile along the radial flux coordinate for the config-iota and β-iota tasks. Although the average nrmse in the β-iota case is higher than in the config-iota case, the rmse is on the order of 10^−3 for both. In the config-iota scenario, the rmse increases from the axis to the edge. This may be caused by the characteristic shear profile of W7-X magnetic configurations and the hence increasing spread of the ι profiles in the data from the axis to the edge. In the β-iota task, instead, the toroidal current (and partially the pressure) profile is the main parameter affecting ι. By data set construction, these profiles have a larger spread at mid-radius (see figure 1), which is reflected in the rmse maximum at s ≈ 0.4. In both cases, this work shows that even shallow FF-FC NNs can regress the ι profile with high accuracy.

The qualitative fitness of the model can be visualized in figure 7, which shows the worst and median predicted ι profiles for the worst performing cross-validation fold. In addition, as highlighted in figure 8, even in case of the worst predicted sample in the worst performing cross-validation fold (i.e., the worst possible scenario included in the data set), the model is still able to capture the main features of the ι profile (e.g., the ι shear).

Figure 6: Results for the config-iota (blue) and β-iota (red) tasks. Lines show the rmse mean and 95 % confidence interval as a function of the radial coordinate s. While the rmse in the β-iota task is generally lower than 10^−3, the rmse in the config-iota scenario increases along the radial profile, following the characteristic vacuum W7-X ι profile.

Flux surfaces regression

In all coordinates, apart from R in the β-surfaces task, the rmse is higher at s = 1, i.e., at the LCFS. We find it worth investigating this in more detail, and hence examine the poloidal and toroidal dependency of the rmse specifically at the LCFS in figure 10. In order to emphasize the error, the worst performing cross-validation fold is shown, and a grid with N_θ = 36 poloidal and N_ϕ = 18 toroidal points per period has been used. The error for R is almost flat on the surface, with maxima at ϕ ≈ 0 rad, representing the tips of the bean-shaped cross section. On the other hand, the error for Z and λ shows an m = 1, n = 1 dependency. In the config-surfaces scenario, a higher rmse is observed at ϕ/2π ≈ 0.03 (and at ϕ/2π ≈ 0.17, following the symmetry). In the β-surfaces task, while the rmse for Z still shows a poloidal and toroidal dependency similar to that observed in the vacuum case, the dominant rmse factor for λ is a poloidal m = 1 term.

To further investigate the poloidal and toroidal dependency of the rmse, figure 11 shows the regression error on the FCs of (R, λ, Z) evaluated at the LCFS. In this figure, to effectively compare the error on both low-order and high-order modes (see figure 2), the nrmse is used. Again, the worst performing cross-validation fold is shown. It is important to note that the FCs are the actual quantities which the NN learned. In both cases, the leading FCs are regressed with a nrmse below 20 %. However, there are some regions in the (m, n) space which the model struggles to reconstruct, in particular in the β-surfaces task (see figure 11b). The regression performance is visualized in figures 12 and 13, where the true and regressed flux surfaces at the bean-shaped (ϕ = 0 rad) and triangular (ϕ/2π = 0.10) cross sections are represented. The worst (left), median (center), and best (right) regressed samples from the worst performing cross-validation fold are shown. The LCFS shows the largest inconsistency (as already observed in figure 9), and in particular the R coordinate at the high-field side of the bean-shaped cross section (i.e., θ = π) appears to be the most complicated feature to resolve (as previously observed in figure 11).
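For reference, mapping regressed FCs back to real-space flux surface coordinates can be sketched as follows (a minimal illustration assuming the stellarator-symmetric representation used by VMEC, with R in a cosine and Z in a sine series; the mode set and coefficient values below are toy numbers, and N_FP = 5 field periods for W7-X):

```python
import numpy as np

N_FP = 5  # number of field periods (5 for W7-X)

def eval_surface(R_mn, Z_mn, m, n, theta, phi):
    """Evaluate R and Z on a (theta, phi) grid from Fourier coefficients.

    Stellarator-symmetric representation:
      R = sum_mn R_mn * cos(m*theta - n*N_FP*phi)
      Z = sum_mn Z_mn * sin(m*theta - n*N_FP*phi)
    R_mn, Z_mn, m, n: 1D arrays over the retained (m, n) modes.
    """
    th, ph = np.meshgrid(theta, phi, indexing="ij")
    angle = m[:, None, None] * th - n[:, None, None] * N_FP * ph
    R = np.sum(R_mn[:, None, None] * np.cos(angle), axis=0)
    Z = np.sum(Z_mn[:, None, None] * np.sin(angle), axis=0)
    return R, Z

# toy circular cross section: R = 5.5 + 0.5*cos(theta), Z = 0.5*sin(theta)
m = np.array([0, 1]); n = np.array([0, 0])
R_mn = np.array([5.5, 0.5]); Z_mn = np.array([0.0, 0.5])
theta = np.linspace(0, 2 * np.pi, 18, endpoint=False)   # N_theta = 18
phi = np.linspace(0, 2 * np.pi / N_FP, 9, endpoint=False)  # N_phi = 9
R, Z = eval_surface(R_mn, Z_mn, m, n, theta, phi)
print(R.shape, Z.shape)  # (18, 9) (18, 9)
```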
Of the three flux surface coordinates, λ is the most arduous to reconstruct. Although not needed to compute the location of the flux surfaces, it provides information on the direction of the magnetic field lines. In particular, the λ_0n FCs are poorly regressed in both flux surface tasks. Earlier works have encountered similar challenges, and the lack of spectral minimization for λ in VMEC is presumed to cause such difficulties [37].

Magnetic field strength regression
The variance of the magnetic field strength contained in the vacuum and finite-β data sets is on different orders of magnitude. In config-B, the magnetic field strength exhibits an average spread of σ_config-B = 252 mT, while in β-B, the spread is only σ_β-B = 28 mT. This mainly derives from the rich vacuum magnetic configuration space of W7-X and the low impact of pressure and toroidal current on the equilibrium field [89]. Therefore, the magnetic field strength in the β-B task is more difficult to resolve. Indeed, even though the achieved nrmse for the β-B task is higher than in config-B, the rmse in β-B is considerably lower than in config-B, as figure 14 shows, due to the smaller spread in the data set. Additionally, in β-B, the rmse does not show any radial dependency, while in config-B the regression error increases from the magnetic axis towards the edge.
The topology of the regression error of the magnetic field strength at the LCFS, for the worst performing cross-validation fold, is visualized in figure 15. In the config-B task, in addition to a non-zero baseline error (i.e., m = 0 and n = 0), n = 1 toroidal and m = 1 poloidal terms are visible. This stems from the fact that the main FCs of W7-X, besides B_00, are B_01 and B_10, while the other B_mn are in general much smaller [89]. In contrast, in the β-B task, the rmse surface is more indented and higher B_mn error terms become the dominant influence on the regression error. Figure 16 qualitatively captures the regression of the leading FCs, where the true and predicted FC profiles are plotted for the worst and median samples of the worst performing cross-validation fold. As observed in figure 15, B_01 shows the largest discrepancy.
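The kind of (m, n) decomposition shown in figure 15 can be sketched with a 2D FFT of the pointwise error on the surface (a minimal illustration on a toy error field; the grid sizes and the mode sign convention are assumptions, not the paper's exact setup):

```python
import numpy as np

def error_mode_amplitudes(err, m_max=4, n_max=4):
    """Decompose an error field err(theta, phi) on a flux surface
    into its (m, n) Fourier amplitudes via a 2D FFT.

    err: array of shape (N_theta, N_phi) with the pointwise
    regression error, e.g., |B_pred - B_true| at the LCFS.
    Returns |amplitudes| for the low-order modes with m, n >= 0.
    """
    spec = np.fft.fft2(err) / err.size  # normalized 2D DFT
    return np.abs(spec[: m_max + 1, : n_max + 1])

# toy error field: a constant baseline (m = 0, n = 0) plus a single
# coupled poloidal-toroidal harmonic landing in the (1, 1) bin
theta = np.linspace(0, 2 * np.pi, 36, endpoint=False)
phi = np.linspace(0, 2 * np.pi, 18, endpoint=False)
th, ph = np.meshgrid(theta, phi, indexing="ij")
err = 0.2 + 0.1 * np.cos(th + ph)
amp = error_mode_amplitudes(err)
print(amp[0, 0], amp[1, 1])  # baseline 0.2 and harmonic amplitude 0.05
```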

Summary and outlook
This paper investigates the feasibility of building a fast surrogate NN model of the MHD equilibrium code VMEC in W7-X magnetic configurations. It extends earlier works [36,37] by using physics-constrained plasma profiles, modern NN architectures and workflows, and by employing single models to reconstruct multiple output quantities. The decomposition of the problem into a vacuum and a finite-β data set allows the independent study of the two limiting cases, both of which must be viable for a future VMEC surrogate model.
The reconstruction of the ι profile shows a nrmse between 1 % and 5 %. Regression of the flux surface coordinates (R, λ, Z) gives nrmse values between 14 % and 20 %, where the λ coordinate appears to be the most problematic to regress. For the magnetic field strength B, nrmse values between 3 % and 10 % are obtained. In almost all outputs, the regression error increases from the magnetic axis towards the edge. As expected, the regression of the finite-β samples proves to be more challenging than that of the vacuum cases. However, the observed rmse values were often similar for the two scenarios. Limited to the investigated scenarios, a relatively small data set (e.g., 10 k samples) seems to be adequate.
The promising results of this paper show that NNs can be used to deploy a drop-in surrogate model for VMEC, although additional questions have to be investigated. First, the capability of such models to resolve both the vacuum magnetic configuration and the finite-beta effects on the equilibrium magnetic field has to be assessed by using a data set which comprises both vacuum and finite-β samples. Second, to define a quantitative required accuracy for the models, which strongly depends on the target application, the degree to which physics quantities of interest, such as MHD stability or neoclassical transport rates, are faithfully reproduced has to be characterized. This verification represents a key metric to gauge the use of NN models to provide fast, yet physics-preserving, MHD equilibria.
Given this largely unexplored application space, several paths can still be investigated. First, multiple output quantities (e.g., ι and x) can be regressed at once with a single model, thus exploiting the correlation between those quantities. Second, to obtain self-consistent equilibrium magnetic fields and flux surface geometries, the magnetic field strength B could be computed directly from the model's ι and x instead of being regressed (see equations (8) to (12)). Third, domain knowledge and physics constraints could be embedded in both the NN architecture and the training process [94], and the coil system geometry could be extended to a generic device geometry, thus opening up the possibility to use such a surrogate model in a generic stellarator optimization workflow. Fourth, to reduce the dimensionality of the problem, the radial dependency of the output quantities could be cast as an additional predictor, also gaining the ability to compute analytical derivatives with respect to the radial coordinate.
Furthermore, a broader HP search and an ensemble of NNs could further improve the performance over a single base learner [95], and optimization techniques, such as pruning and quantization, are expected to deliver improved inference times. Moreover, the results of this work suggest that the full Fourier resolution is within reach if larger NNs and longer training times are accessible. Finally, the use of fast MHD surrogate models can impact multiple applications: fast Bayesian inference of plasma parameters and equilibrium reconstruction workflows for intra-shot analysis VI, access to large and rich optimization spaces for present and future magnetic confinement devices, millisecond-range MHD equilibrium computations for real-time plasma control, and the generation of very large data sets of equilibrium computations necessary to investigate machine learning (ML) control strategies (e.g., reinforcement learning) VII.

VI If target physics quantities are not adequately reproduced by the surrogate model, a two-stage approach should be pursued: employ the model predictions to extensively explore the target input space, then switch to a high-fidelity equilibrium computation to refine the solution. This may be applied to provide fast transformations for diagnostics, a broad exploration of the posterior probability distribution in a Bayesian framework, or good initial configurations for a more rapid convergence of equilibrium codes.

VII It is important to note that when a sufficiently large data set is accessible, given the relatively low training time, the proposed NN models could be trained to target specific data distributions expected for a use-case application, thus reducing the covariate shift between the training and test sets.

Figure 12: True (pink) and predicted (green) flux surfaces for the bean-shaped (upper) and triangular (bottom) cross sections on the config-surfaces task. The worst (left), median (center) and best (right) regressed samples are shown from the worst performing cross-validation fold.

Acknowledgements
We wish to acknowledge the helpful discussions on VMEC and MHD equilibrium with J. Geiger. Furthermore, we are indebted to the communities behind the multiple open-source software packages on which this work depends.
The data sets were generated on the MPCDF cluster "DRACO", Germany. Financial support by the European Social Fund (ID: ESF/14-BM-A55-0007/19) and the Ministry of Education, Science and Culture of Mecklenburg-Vorpommern, Germany, via the project "NEISS" is gratefully acknowledged. This work has been carried out within the framework of the EUROfusion Consortium and has received funding from the Euratom research and training programme.

Hyper-parameter values
Tables 5 to 10 report the HP values of the best performing model on each task discovered via HP search.