Neural network performance enhancement for limited nuclear fusion experiment observations supported by simulations

It has recently been shown that artificial neural networks (NNs) are able to establish nontrivial connections between the heat fluxes and the magnetic topology at the edge of Wendelstein 7-X (W7-X) (Böckenhoff et al 2018 Nucl. Fusion 58 056009), a first step in the direction of real-time control of heat fluxes in this device. We report here on progress in improving the performance of these NNs. A particular challenge is that of generating a suitable training set for the NN. At present, experimental data are sparse, and simulated data, which are much more abundant, do not match the experimental data closely. It is found that the NNs show significantly improved performance on experimental data when experimental and simulated data are combined into a common training set, relative to training performed on only one of the two data sets. It is also found that appropriate pre-processing of the data improves performance. The architecture of the NN is also discussed. Overall, a significant improvement in NN performance was seen: the normalized error was reduced by more than a factor of three relative to the previous results. These results are important since heat flux control in W7-X, as well as in a future fusion power plant, is likely a key issue, and must start with a very limited set of experimental training data, complemented by a larger, but not necessarily fully realistic, set of simulated data.


Introduction
In machine learning, the size of the data set on which an artificial neural network (NN) is trained is crucial for good performance. However, there are many cases where the cardinality of real (i.e. measured) data is limited. This issue may be addressed by complementing the available data with simulations. The underlying model of a simulation is, however, generally only an approximation of reality. This is the case for Wendelstein 7-X (W7-X) [5], which recently went into operation. Experimental data are still scarce, and the interactions between the edge plasma and the plasma-facing components are multifaceted and complicated, involving at least plasma physics, atomic physics, chemistry, and solid-state physics, so that state-of-the-art codes such as EMC3-Eirene [6, 7] do not yet capture all the important dynamics. The focus of this work is on developing NNs that, trained on simulated data supplemented with a very limited number of experimental data, generalize well to experimental data. Our specific example is the training of a NN using simulated and experimental observations of W7-X heat load patterns to reconstruct ι̅, a property of the magnetic field at the plasma edge that determines the heat load patterns. The future practical application of this NN will be as part of a real-time control system, ensuring the control and safety of all W7-X plasma-facing components.
In the next section we briefly explain the essentials of W7-X, the retrieval of heat load patterns from infrared cameras, and a proper formulation of ι̅, followed by a definition of the data set composition. The NN architectures used are then described and the parametrizations introduced, before the NN performance is presented and analyzed.

Context for the neural network application
W7-X is a world-leading, relatively large experiment of the stellarator type [5] (16 m outer diameter, 30 m³ confinement volume). It has a carefully tailored magnetic field configuration designed to confine hot plasma, and it aims to explore whether this concept can be scaled up to yield net energy production from thermonuclear fusion processes. This paper focuses on the optimization of a neural network that is to become an important piece of a real-time control system for the heat and particles exhausted by the plasma onto components specially designed to absorb these heat and particle loads. For the present studies, these plasma-intercepting components were so-called limiters (figure 1(c)); later, divertor modules are used. Perhaps the most important parameter determining the spatial heat load distribution onto these components is the magnetic winding number ι̅. It describes how many full poloidal turns a magnetic field line performs during one full toroidal turn, and is around 0.9 for these studies. We therefore started investigating whether a NN could determine ι̅ given, as input, measurements by infrared (IR) cameras of heat load patterns. Figure 1(a) shows an overview of W7-X and the IR camera views. Our first results, recently published, showed that this is indeed possible [1].

Figure 1. Overview of essential parts of W7-X in its first experimental campaign. (a) Top-down CAD view of the W7-X inner vessel, showing sight lines of the IR camera system for the limiter setup, with cutaways in modules three and five as used in the first experimental campaign. On this scale and view, the limiters are small (green). One segment of the total of 50 modular (blue) and 20 planar coils (black) is overlaid in modules 1 and 2. (b) Coils contained in one module, with modular coils 1-5 (red) and planar coils A, B (blue). One module is point-symmetric about its center. Adapted with permission from [8]. (c) Side view of the W7-X limiter in module 5.
The present paper is a systematic attempt to further improve the performance of NNs for this application, by investigating how the pre-processing of the input data, the character and quantity of training sets, and the NN architecture affect the performance.
Following the previous paper, we train the NNs to estimate the current in coil set B (figure 1(b)), I_B, appropriately normalized, instead of ι̅ itself, but this is a technical detail. For these studies, numerical and experimental, there is a one-to-one correspondence between the two.
For details regarding the underlying physics and the origin of the data (experimental as well as synthetic), we refer to [1]. Figure 2 shows an example of a heat load pattern on the limiter of W7-X module 5 for a vacuum field-line diffusion simulation (figure 2(a)) as well as for IR data (figure 2(b)) at the same magnetic configuration. The right half of the limiter is shadowed from the IR camera view. Some basic characteristics are similar, e.g. the maximum heat load is located at the third limiter tile. In detail, however, the structure of the IR observation is not reproduced.

Data sets
The investigated data sets result from experimental and simulated ι̅ scans of W7-X. The scans were performed by varying the current in one coil set, planar coil set B (see [1, section 2.3]). The simulation set S was created by the field-line diffusion approach described in [1, section 2.4.2], with |S| = 3993.
The experimental set of processed IR data I comes from six different experimental magnetic configurations that were investigated. For each magnetic configuration, two to five sets of data were taken, giving a total of 16 IR videos (see [1]). Each IR video, corresponding to one experiment, contributes on average 20 frames for the same value of I_B, leading to a total cardinality of |I| = 319.
A subset I′ ⊂ I with |I′| = 190 is defined such that I′ as well as I \ I′ cover all six experimental settings. Each of these two sets contains either all or none of the frames of each single experiment.
A mixed set is defined as M = (S ∪ I) \ I′.
To determine NN quality, three disjoint subsets, namely a training set, a validation set and a test set, have to be defined.
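The per-experiment split described above can be sketched as follows. This is illustrative bookkeeping, not the authors' code; the function and argument names are ours. Whole experiments go either into I′ or its complement, so no experiment is split across the two sets, and the mixed set is M = (S ∪ I) \ I′.

```python
def split_by_experiment(frames, holdout_experiments, simulation_set):
    """frames: list of (experiment_id, frame) pairs.
    Frames of experiments in `holdout_experiments` form I'; the rest of
    I joins the simulation set S to build the mixed set M = (S U I) \\ I'."""
    I_prime = [f for eid, f in frames if eid in holdout_experiments]
    I_rest = [f for eid, f in frames if eid not in holdout_experiments]
    M = list(simulation_set) + I_rest  # (S U I) \ I'
    return I_prime, I_rest, M
```

By construction, I′ and M are disjoint, which is what makes I′ usable as an untouched test set.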

NN architectures
Three NN architectures were compared. Since the data set is very limited, the number of free parameters must be kept low to avoid overfitting. The first is a fully-connected feedforward neural network (FFNN) with three layers of 64 nodes each, with tanh activation functions and L2-regularization with a factor of 0.001. The second is mainly based on convolutional layers [9], as shown in figure 3. Three consecutive convolutional layers are followed by two fully-connected layers with 64 and 1 nodes, respectively. The numbers of filters in the convolutions are (in ascending order) 4, 8 and 16, with a kernel stride of 1. The third NN architecture starts with an inception module as defined in [10], but instead of a very deep structure, as suggested for the large ImageNet data set [11], it is followed by pooling [12], a 1 × 1 convolutional layer and two fully-connected layers, as shown in figure 4. The number of filters in all convolutions within the inception module is 4, it is 8 for the following convolution, and there are 16 and 1 nodes in the last two fully-connected layers, respectively. The pooling kernel within the inception module has dimension 3 × 3, and the pooling kernel dimension is 2 × 2 for the one following the inception module. All kernel strides are 1, except for the pooling following the inception module, which has a stride of 2. In the last two architectures, all activation functions except for the last layer are rectified linear units (ReLU). Because the NNs are designed to solve a regression problem, the activation function of the last layer is the identity. The analysis of the parameters presented in the following section 2.4 is mainly performed using the convolutional neural network. For the other two networks only examples are shown.
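The concern about the number of free parameters can be made concrete with a small counting sketch. This is our illustration, not the paper's code; in particular, the convolutional kernel size is our assumption (the paper only states a stride of 1), so the conv count is indicative only.

```python
def ffnn_params(n_in, hidden=(64, 64, 64), n_out=1):
    """Weights + biases of the fully-connected FFNN described above:
    three hidden layers of 64 nodes, one linear output node."""
    total, prev = 0, n_in
    for h in list(hidden) + [n_out]:
        total += prev * h + h  # weight matrix + bias vector
        prev = h
    return total

def conv_params(in_ch=1, filters=(4, 8, 16), kernel=(3, 3)):
    """Parameters of three stacked convolutional layers with 4, 8 and 16
    filters; the 3x3 kernel size is an assumption, not from the paper."""
    kh, kw = kernel
    total, prev = 0, in_ch
    for f in filters:
        total += kh * kw * prev * f + f  # kernel weights + biases
        prev = f
    return total
```

For a 9 × 5 input of one channel, the FFNN already needs over ten thousand parameters, while the convolutional stack needs only a few thousand; this is the weight-sharing advantage that motivates the convolutional architectures for such small data sets.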
Weights are initialized randomly as recommended in [13], and biases are initialized to zero. The weights and biases are iteratively improved by the Adam optimizer [14] during the training process. Early stopping is applied as soon as the validation loss function increases. The implementation is done in TensorFlow [15]. The calculations are performed on a workstation with two Intel Xeon E5-2650 CPUs and four NVIDIA GeForce GTX 1080 Ti GPUs.
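The early-stopping rule stated above ("stop as soon as the validation loss increases") can be written as a small generic loop. This is a sketch in plain Python; `do_epoch` and `val_loss` are hypothetical callables standing in for one optimizer pass and one validation-loss evaluation.

```python
def train_with_early_stopping(do_epoch, val_loss, max_epochs=500):
    """Run training epoch by epoch and stop at the first increase of
    the validation loss, returning the best loss and the epochs run."""
    best = float("inf")
    epochs_run = 0
    for _ in range(max_epochs):
        do_epoch()
        epochs_run += 1
        loss = val_loss()
        if loss > best:  # validation loss increased -> stop
            break
        best = loss
    return best, epochs_run
```

In practice one would also keep a checkpoint of the weights belonging to the best validation loss, since the final epoch is by definition one step past the optimum.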

Parametrization
The heat load is given on an unstructured triangular grid that represents the CAD structure of one limiter. These data are transformed affinely from W7-X coordinates to an orthogonal (ξ, η, ζ) coordinate system, where ζ points in the direction of n_m, the normalized mean of the normals of all triangles forming the limiter. The rotation is achieved by a matrix R. Each axis is scaled such that the minimal value in each coordinate is 0 and the maximal value is 1. An example of such an affine transformation is shown in figure 5 for a half cylinder with a remote resemblance to the W7-X limiter. Only triangles within a tight bounding box around the limiter are considered for the parametrization. There are no divisions in the ζ direction. The first three columns in table 1 show one-dimensional partitionings. The inception NN is applied only for inputs of dimensionality of at least 5 × 5. However, for the one-dimensional inputs the convolutional kernel dimensionality is reduced from that presented in figure 3 such that it does not exceed the input dimensionality.
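The affine mapping described above can be sketched as follows. This is our reconstruction under stated assumptions, not the paper's exact R: we build a Rodrigues rotation taking the mean triangle normal n_m onto the ζ axis (assuming n_m is not anti-parallel to ζ) and then min-max scale each axis to [0, 1].

```python
import numpy as np

def affine_to_unit_box(vertices, triangles):
    """Rotate a triangle mesh so the mean normal n_m points along
    (0, 0, 1), then scale each axis to the unit interval."""
    v = np.asarray(vertices, float)
    t = np.asarray(triangles, int)
    # per-triangle normals from the cross product of two edge vectors
    e1 = v[t[:, 1]] - v[t[:, 0]]
    e2 = v[t[:, 2]] - v[t[:, 0]]
    normals = np.cross(e1, e2)
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)
    n_m = normals.mean(axis=0)
    n_m /= np.linalg.norm(n_m)
    # Rodrigues rotation taking n_m onto the zeta axis (0, 0, 1);
    # assumes n_m is not anti-parallel to that axis
    b = np.array([0.0, 0.0, 1.0])
    w = np.cross(n_m, b)
    c = float(n_m @ b)
    K = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])
    R = np.eye(3) + K + K @ K / (1.0 + c)
    rotated = v @ R.T
    lo, hi = rotated.min(axis=0), rotated.max(axis=0)
    return (rotated - lo) / (hi - lo), R
```

After this step, partitioning into n_ξ × n_η cells is simple integer binning of the first two scaled coordinates.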

Extracted characteristics.
For each element p of the partitionings, characteristic values can be extracted. One is the heat-load-weighted spatial mean

μ_p = Ξ_p^⊤ W_p,   with weights W_{p,i} = A_{p,i} q_{p,i} / Σ_{j=1}^{n_p} A_{p,j} q_{p,j},

where the rows of Ξ_p are the triangle centroids, n_p is the number of triangles per partitioning element, A_{p,i} are the triangle areas, and q_{p,i} are the triangle heat loads (in W m⁻²). The weighted covariance matrix is defined as

Cov(Ξ_p) = (Ξ_p − J_{n_p} μ_p^⊤)^⊤ diag(W_p) (Ξ_p − J_{n_p} μ_p^⊤),

with the vector of ones J_{n_p} of dimension n_p. Another statistical characteristic is the spatial standard deviation

σ_p = sqrt(diag(Cov(Ξ_p))),

with the element-wise square root and the operator diag(), which extracts the diagonal elements of a matrix. The direction vector δ_p, calculated as the eigenvector corresponding to the largest eigenvalue λ_max of Cov(Ξ_p), can serve as a characteristic as well. This parameter is inspired by divertor heat load patterns, which show a more complicated shape [16]. The last examined parameter is the relative heat load

ρ_p = Q_p / max_{j=1,…,m} Q_j,   with Q_p = Σ_{i=1}^{n_p} A_{p,i} q_{p,i}

the absolute heat load and m the number of partitions.
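The characteristics above are standard weighted statistics and can be computed directly. The following is our reconstruction for one partition element (names are ours); it follows the definitions of μ_p, σ_p, δ_p and ρ_p given in the text.

```python
import numpy as np

def characteristics(centroids, areas, q):
    """Heat-load-weighted mean mu_p, spatial standard deviation sigma_p,
    and dominant direction delta_p for one partition element."""
    w = areas * q
    w = w / w.sum()                 # weights W_p, summing to one
    mu = w @ centroids              # weighted spatial mean
    d = centroids - mu              # centered centroids
    cov = (w[:, None] * d).T @ d    # weighted covariance matrix
    sigma = np.sqrt(np.diag(cov))   # spatial standard deviation
    evals, evecs = np.linalg.eigh(cov)
    delta = evecs[:, -1]            # eigenvector of the largest eigenvalue
    return mu, sigma, delta

def relative_heat_load(areas_per_part, q_per_part):
    """rho_p: absolute heat load Q_p of each partition element divided
    by the maximum Q over all m elements."""
    Q = np.array([(a * qq).sum() for a, qq in zip(areas_per_part, q_per_part)])
    return Q / Q.max()
```

Note that ρ is scale-free: multiplying all heat loads by a constant leaves it unchanged, which plausibly helps when simulated and experimental magnitudes differ.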
Three combinations of these parameters are studied as NN input: (μ, σ), (μ, δ) and ρ alone. In figure 6, ρ is shown for simulation and experiment at the same physical conditions. Note that the input dimension of the NNs is batch size × n_ξ × n_η × input channels, with 6 input channels for the cases (μ, σ) and (μ, δ) but 1 for ρ. In the case of ρ as parameter, the 1 × 1 convolutions within the inception module (see section 2.3) perform rather trivial operations. Whether this affects the performance was not investigated. Pooling layers are used for all partitionings to achieve comparable architectures, although they may not be important for smaller partitionings. The number of free parameters resulting from the chosen NN type, partitioning and characteristic is shown in figure 7. We define the following notation to describe the NN settings: f(train, validate, test)^{parametrization}_{partition, architecture}, where f can be any function.
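The input-tensor layout stated above (batch size × n_ξ × n_η × input channels) can be illustrated for the (μ, σ) case: both μ and σ have three spatial components per cell, and concatenating them along the channel axis yields the 6 input channels. This is a sketch with hypothetical names, not the paper's code.

```python
import numpy as np

def stack_input(mu, sigma):
    """mu and sigma are (n_xi, n_eta, 3) arrays of per-cell
    characteristics; return a (1, n_xi, n_eta, 6) NN input tensor."""
    x = np.concatenate([mu, sigma], axis=-1)  # 3 + 3 = 6 channels
    return x[None, ...]                       # prepend the batch axis
```

For the ρ parametrization the same layout applies with a single channel, i.e. shape (batch, n_ξ, n_η, 1).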
For example, rmse(S_90, S_10, I)^ρ_{9×5, inception} refers to the root mean square error of an inception NN trained and validated on samples from set S and tested with experimentally observed IR data I, requiring a 9 × 5 input of relative intensities ρ. S_90 ⊂ S and S_10 ⊂ S with S_90 ∩ S_10 = ∅ refer to two disjoint sets for training and validation, consisting of 90% and 10% of the samples within S, respectively. Giving a single term in the brackets implies that the training, validation and test data sets are disjoint subsets of the same superset. If one describing parameter is omitted, the entirety of all possible parameters of that kind is referred to. So (S)^ρ_{9×5} describes all NNs parametrized by ρ and partitioned into 9 × 5 parts, with both of the two considered architectures; they are trained, validated and tested on subsets of S.

Results
The convolutional NN performance for all tested settings is shown in figure 8. These two graphics depict the dependence of the rmse on the partitionings defined in section 2.4.1. The NN performances, measured in terms of the rmse, are divided into 10 groups representing the partitionings. Within each group, the parameter choices are shown. To avoid false conclusions due to statistical outliers, an ensemble of 27 NNs trained with the same settings has been calculated; the variations are the randomness of the weight initialization and the mini-batch sampling, as well as different learning rates and batch sizes. Markers and bars indicate the mean rmse and the associated 95% confidence interval for the mean, respectively. The confidence interval has been calculated by bootstrapping [17]. The true possible values of I_B range between −0.05 and 0.18. In order to facilitate the evaluation of the NN reconstruction quality, a reference value of 10% of the total I_B range is marked by the dotted green line. Figures 9-11 depict subsets of the outcomes shown in figure 8 to clarify the observations, while figure 13 compares the different NN architectures as an additional hyperparameter.
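The bootstrapped confidence interval for the ensemble-mean rmse can be sketched as a standard percentile bootstrap. This is a generic sketch of the technique cited above ([17]), not the authors' implementation; the parameter defaults are ours.

```python
import numpy as np

def bootstrap_ci(samples, n_boot=10000, alpha=0.05, seed=0):
    """Percentile-bootstrap (1 - alpha) confidence interval for the
    mean of `samples`: resample with replacement, recompute the mean,
    and take the alpha/2 and 1 - alpha/2 quantiles."""
    rng = np.random.default_rng(seed)
    samples = np.asarray(samples, float)
    means = np.array([
        rng.choice(samples, size=samples.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)
```

With only 27 ensemble members per setting, such a nonparametric interval is a reasonable choice, since no distributional assumption on the rmse values is needed.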

Simulation trained NNs
The results of the NNs trained on S are shown in figure 8(a). It can be seen that (S)^ρ performs well with both NN architectures for partitionings up to 36 × 12. The rmse is minimal at the partitioning 9 × 5 and increases with finer as well as coarser resolution. The NNs (S)^{μ&σ} and (S)^{μ&δ} perform especially well for the coarse partitionings between 2 × 1 and 9 × 5. For finer partitionings, the performance decreases gradually. This can be understood by considering the decreasing information content per section with shrinking section size.
On the basis of figure 8(a), we investigate the performance of NNs characterized by (S) versus (S_90, S_10, I). Good performance of the NNs (S_90, S_10, I) would be advantageous, since it would indicate applicability to new, never-conducted experiments. Although it would be preferable, the evaluation indicates that this is not the case.
We observe not only that rmse(S) ≪ rmse(S, S, I), but also that rmse(S, S, I) significantly exceeds the 10% I_B range. The NNs (S_90, S_10, I) are specialized to patterns of S; those patterns are not suitable for determining I_B from experimental data. The fundamentally different magnitude of the width of the rmse confidence intervals CI(S, S, I) compared to CI(S) points towards the same reason.
Both parametrizations including μ show decreasing performance with growing resolution. Only a marginal difference between rmse(S)^{μ&δ} and rmse(S)^{μ&σ} can be observed. The rmse(S) range seems independent of the architecture.

NN trained with simulation and experiment
Since S-based training did not lead to sufficient NN performance for application to I, some samples from I are provided during training and validation, i.e. M and I′ as defined in section 2.2 are used. With this procedure, we intend to force the NNs to consider patterns present in both S and I during training. The performance of the NNs (M_90, M_10, I′) is compared to NNs trained, validated and tested with the small amount of available experimental data only, i.e. (I_37, I_4, I′). Note that the performance is tested with the same set I′. Figure 8(b) depicts this comparison. The upper end of the occurring rmse range is reduced by two orders of magnitude in this figure as compared to figure 8(a). As in the case of NNs trained with S, the parametrizations by ρ yield the best rmse for partitionings between 9 × 5 and 36 × 12, while the μ-based parametrizations are best for one-dimensional partitionings, decreasing in performance with growing resolution, as visualized in figure 10.
For two-dimensional partitionings, we observe in figure 11 that rmse(M_90, M_10, I′) < rmse(I), as intended by training on M. In the case of coarse partitionings, training with M has no advantage over direct training with I.
The best results were found for resolutions between 9 × 5 and 36 × 12, for (M_90, M_10, I′)^ρ_{conv}. In order to display the good reconstruction quality, we choose a representative rmse, namely the median rmse of the NNs (M_90, M_10, I′)^ρ_{18×8, conv}, as seen in figure 12. The rmse is 0.008, which is 3.5% of the I_B range, so the median rmse is clearly below the 10% I_B range threshold. These results clearly outperform those of the NN in [1], where the best rmse was 0.029.
In table 2, the mean training and validation times of NNs with 18 × 8 and 144 × 30 data input are listed; these represent the best-performing NNs and the largest NNs, respectively. The training time is measured from the beginning to the end of the optimization process, while the validation time represents the time it takes to calculate the loss for the validation set once. Training time is on the order of minutes and depends not only on the NN size but also on the early-stopping criteria. The complete validation set is evaluated within milliseconds, which guarantees real-time applicability.

Changing the NN architecture to an inception model seems to slightly, but not significantly, improve the performance for finer partitionings, as seen in figure 13. This behavior is not completely unexpected, as the inception model does not take advantage of the 'parallelization' unless very large problem sizes occur. Such larger sizes will be relevant in the future when dealing with the more complex divertor of W7-X instead of the limiter. The FFNN is clearly outperformed by both the convolutional NN and the inception NN.
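The validation-time measurement described above amounts to timing one loss evaluation over the validation set. A minimal sketch with hypothetical names (this is not the paper's instrumentation):

```python
import time

def time_validation(loss_fn, n_repeats=10):
    """Average wall-clock time of one validation-loss evaluation;
    `loss_fn` stands in for evaluating the loss on the validation set."""
    t0 = time.perf_counter()
    for _ in range(n_repeats):
        loss_fn()
    return (time.perf_counter() - t0) / n_repeats
```

Averaging over several repeats smooths out scheduler jitter, which matters when the quantity of interest is in the millisecond range.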

Conclusion and future work
It was shown here that it is possible to reconstruct an important property of the W7-X edge magnetic field structure from limiter heat load patterns with even better accuracy than earlier reported in [1]. The main challenge was to deal with the sparsity of the available experimental data. A naive approach, applying NNs trained and validated with synthetic data to experimental data, showed good performance only in a minority of cases: after such a training process, most NNs focus on patterns not present in the experimental observations.
For a more targeted training and validation, a mixture of experimental and synthetic data was formed for the training process. This approach resulted in convincing NN performance for certain kinds of NN input processing. Partitioning the limiter with resolutions between 9 × 5 and 36 × 12 and defining the NN input as the heat load in each part divided by the maximum heat load over all parts results in better performance compared to NNs trained, validated and tested with experimental data only. The low number of experimental results probably leads to overfitting in the latter nets, but the added simulation data diminish these effects. We created NNs that extract relevant patterns from experimental as well as synthetic data sets to reconstruct an important parameter of the magnetic field at the edge. With this systematic approach, NNs were found that outperform the results reported in [1].
The upgraded W7-X with installed divertors will be the next object of interest. We will start that investigation with a parametrization based on a two-dimensional partitioning of the heat load. Favoring one of the two examined NN architectures a priori and excluding the other is not possible at this stage, because neither consistently outperforms the other. The results reached are satisfactory; however, it remains future work to investigate other methods, such as generative adversarial nets [18], to further enhance the reconstruction performance when dealing with simulated and experimental data. These could learn to generate 'experimental data' and provide an additional supplement to real experimental data for training NNs that reconstruct plasma properties.