Developing Deep Learning Algorithms for Inferring Upstream Separatrix Density at JET

Predictive and real-time inference capability for the upstream separatrix electron density, $n_\text{e, sep}$, is essential for design and control of core-edge integrated plasma scenarios. In this study, both supervised and semi-supervised machine learning algorithms are explored to establish direct mapping as well as indirect compressed representation of the pedestal profiles for predictions and inference of $n_{\text{e, sep}}$. Based on the EUROfusion pedestal database for JET, a tabular dataset was created, consisting of machine parameters, fraction of ELM cycle, high resolution Thomson scattering profiles of electron density and temperature, and $n_{\text{e, sep}}$ for 608 JET shots. Using the tabular dataset, the direct mapping approach provides a mapping of machine parameters and ELM percentage to $n_{\text{e, sep}}$. Through representation learning, a compressed representation of the experimental pedestal electron density and temperature profiles is established. By conditioning the representation with machine control parameters, a probabilistic generative predictive model is established. For prediction, the machine parameters can be used to establish a conditional distribution of the compressed pedestal profiles, and the decoder that is trained as part of the algorithm can be used to decode the compressed representation back to full pedestal profiles. Although, in this work, a proof-of-principle for predicting and inferring $n_{\text{e, sep}}$ is given, such a representation learning can be used also for many other applications as the full pedestal profile is predicted. An implementation of this work can be found at https://github.com/fusionby2030/moxie.


Introduction
The electron density at the upstream separatrix, n e, sep , is a key parameter for core-edge integration in tokamaks. The power exhaust properties and scrape-off layer (SOL) are strongly regulated by n e, sep and collisionality, with power exhaust requirements typically favoring elevated n e, sep [8]. On the other hand, elevated n e, sep /n e, ped and n sep /n GW ratios have been observed to lead to degraded core and pedestal performance in high confinement mode (H-mode) plasmas [16,10]. Therefore, predictive and real-time inference capability for n e, sep is essential for design and control of core-edge integrated plasma scenarios.
Due to the multiple physical processes impacting n e,sep , there are no first-principles predictive models that would predict n e,sep based on machine control parameters only. 2D SOL multi-fluid simulation packages, such as SOLPS-ITER [17], can in principle calculate a value for n e,sep . However, the predicted value depends on user-specified phenomenological parameters, such radial diffusion and * Corresponding author: adam.kit@helsinki.fi heat conduction coefficients, as well as assumptions on recycling and pumping. To address the gap of predictive capability for n e,sep , empirical scaling laws have been derived based on H-mode databases [1,11]. The scaling law by Frassinetti et al. is based on the JET pedestal database (JET-PDB) [1]. The data is in tabular form with rows (entries) of shots that have columns (features) corresponding to machine control parameters and measurements of various plasma parameters averaged over a stationary time window. The plasma parameters and machine parameters come from a time window corresponding to a flat-top (normally around 1-2 seconds) of an H-mode pulse. The plasma parameters, n e, sep included, are determined from a modified hyperbolic tangent (mtanh) [6,7] fit of the electron temperature, T e , and density, n e , profiles from the High Resolution Thomson Scattering (HRTS) system at JET [1,16]. The profiles used in the curve fitting are those that fall between 75-99% of an Edge Localized Mode (ELM) cycle within the flat-top window. In this analysis, we use JET-PDB to go beyond log-linear regression using advanced machine learning (ML) tools, as well as use the pulses and time-windows given in the database to create Figure 1: An example of the ne, sep estimations (solid colored X) for JET pulse 81836 for varying times within an ELM cycle. The characteristic crashing of density gradients in ELMy H-modes will lead to variations in ne, sep. The assumption of Te ∼ 100 eV is not expected to be valid throughout the ELM cycle due to the fluctuations of the power crossing the separatrix.Also in pulses with very high ELM frequency, it is possible that the linear approximation method breaks due to the rapid fluctuation of the measured plasma quantities, relative to the HRTS measurement frequency. These shortcomings are acknowledged here and will be taken into account in future studies. a larger dataset that is used to fit ML models to predict n e,sep , as well as full HRTS n e and T e profiles.
In JET-PDB, n e,sep is determined from the fitted HRTS profiles by assuming a conduction limited SOL and calculating the upstream T e based on Spitzer-Härm heat conductivity. This can be considered a two-point model approach for determining the separatrix location [8]. The method of determining n e,sep from HRTS profiles is outlined in detail in [12,13,1], but relevant steps are restated here. As HRTS measures n e and T e for the same spatial location, power balance arguments can be used to align T e with separatrix. Due to the strong temperature dependence of Spitzer-Härm heat conductivity ∝ T 5/2 e , the upstream T e in conduction limited SOL scales strongly sublinearly as a function of heat-flux in the SOL, T e ∝ q 2/7 || . As a result, for typical JET H-mode conditions, T e can be approximated to be around 100 eV at the upstream separatrix. Fitting the profiles that fall between 75-99% of the ELM cycles within the given flat-top H-mode time window, the value of the fitted T e closest to 100 eV is, therefore, the location of the corresponding density point at the separatrix, n e,sep . The corresponding machine parameters for this entry are considered to be the average across the entire time window.

Dataset
In this analysis we avoid mtanh curve fitting and ELM filtering and instead calculate n e,sep via a linear approximation of the HRTS measurements, which increases the amount of data points, w.r.t the JET-PDB, by an order of magnitude. This approach neglects the impact of the nonzero instrument function that would be taken into account in the mtanh fitting procedure, but the impact of this correction is a second order effect compared to shot to shot variation due to different engineering parameters [6]. To estimate n e,sep for a given time slice without time/elmaveraging, we apply the following linear approximation method (Fig 1): 1. Make an initial guess of where the separatrix is located r sep ≈ 1 2 (r top + r bot ), where r top , r bot are pedestal top and bottom determined from the second derivative of density profile, respectively. 2. Find the closest two temperature points, T L , T R , to the initial r sep , s.t., T L < 100 eV < T R . 3. Find weights, w L , w R , via linear approximation, s.t., w L T L + w R T r = 100 eV and w L + w R = 1. 4. Use weights: w L n L + w R n R = n e, sep and w L r L + w R r R = r sep , where n L , n R , T L , T R are the electron densities and temperatures located at radial positions r L , r R .
This approximation relies heavily on the assumption that T e, sep = 100 eV, which is often not the case during the course of an ELM cycle. Future studies will address this issue by applying other models for T e, sep . A possible pathway is to apply SOLPS-ITER, or a fast ML surrogate model for SOLPS-ITER [14], to provide a more sophisticated prediction of T e, sep than is obtained by the simple two-point model. Additionally, n e, sep values determined from the linear approximation are typically smaller than those found in the JET-PDB, as in the JET-PDB only profiles from 75-99% of the ELM cycle are used. For now, we use this linear approximation to estimate n e, sep for a given profile, and move to constructing a dataset to be able to map machine parameters to each profile/n e, sep . A dataset was created, including machine parameters, n e, sep , and HRTS measurements in the edge, using a subset of pulses from the JET-PDB [1]. The following criteria were applied: JET-ILW, deuterium fuelling, no seeding or nitrogen seeding, and no ELM mitigation techniques (kicks, pellets, or resonant magnetic perturbations) applied. This subset consists of over 600 shots, with a total of around 20000 time slices. Each entry in the dataset consists of time slice of HRTS measurements with corresponding fraction of ELM cycle. For a given time slice, the machine parameters are the average of those that fall between a window of 20 ms before the HRTS measurement (the HRTS measurement frequency); an overview of the domains is given in Table 1. For each time slice, only the last 20 points of the HRTS profile is used, from which n e, sep is calculated via the linear approximation. Thus, Table 1: Machine parameters and their domains for the constructed dataset. Unitless parameters are denoted as [-]. All variables except the gas puff and power parameters are taken from the equilibrium reconstruction for each pulse. The ELM fraction is the percentage of the ELM cycle that the HRTS measurement was taken. q cycl is the cylindrical approximation for q 95 , since it could be argued that q 95 would contain information regarding the separatrix. We calculate q cycl using a cylindrical approximation for the tokamak cross section, parameterized by relevant machine parameters, therefore q cycl : q cycl (κ, a, δ, B T , R 0 , I P ). Although this information is essentially encoded by other parameters, this can be seen as a form of feature engineering.

Machine Learning for Predictive Modeling
To predict n e, sep from machine parameters, we propose two approaches; direct mapping and representation learning. The direct mapping approach aims to learn a mapping from machine parameters directly to n e, sep . The representation learning approach aims to learn a lower dimensional representation, denoted as z, of the HRTS profiles, as opposed to only predicting n e, sep as per the direct mapping. The lower dimensional representation is conditioned on machine parameters and can be decoded to generate profiles. From the generated profiles, n e, sep is determined. Both approaches leverage recent advances in machine learning and deep neural networks.
The direct mapping approach aims to learn a function f : X → n e, sep , where X are the given set of machine parameters that lead to a particular n e, sep . No profile data is given in this approach. We aim to learn f by using the created dataset via supervised learning. Regression based supervised learning has seen major breakthroughs via artificial neural networks and decision tree ensembles in the last decades. We choose a decision tree ensemble, namely XGBoost [2], to learn the direct mapping. XG-Boost has shown good performance across many different types of tabular datasets [15]. The learning goal for the direct mapping approach boils down to minimizing the mean-squared-error (MSE) of predicted n e, sep against the 'true' n e, sep for a given set of machine parameters.
The representation learning approach aims to learn a latent variable z, with prior distribution p(z) = N (0, 1), that represents profiles y via p(y, z) = p(y | z)p(z). This idea is implemented with the variational autoencoder (VAE) [3] framework (Fig. 2), using an encoder q(z | y), and decoder p(y | z) distribution, parameterized by artificial neural networks. Contextually, this approach would take profiles as inputs, encode the information to a lower dimensional latent variable, then decode from the latent variable to profiles again. As the true likelihood of the data p(y, z) is intractable, we optimize a lower bound, the Evidence Lower Bound (ELBO). For our chosen prior, maximizing the ELBO corresponds to minimizing the following function: i.e., the model learns to minimize the mean-squared-error (MSE) of predicted profilesŷ via the lower-dimension representation q(z | y) determined from an encoder, against the 'real' profile y, passed to the encoder. The latent representation q(z | y) is regularized via the Kullback-Leibler divergence (KL) relative to a standard normal distribution, optimizing the latent distribution to closely follow a normal distribution with mean of 0 and variance of 1.
We add an additional learning goal of predicting machine parameters from the latent variable z. This is done via an auxiliary regressor, which takes the encoded representation, z, and predicts machine parameters, i.e., p(x | z). Ideally, this enforces the compressed representation to be informed about the machine configuration, i.e., the latent representation should not only encode profile information, but also encode the corresponding machine parameters for the given profile. The learning goal is updated to reflect this: where the MSE between predicted machine parametersx and real machine parameters, x, is added.
As per the DIVA framework [4], we split the latent variable z into different sub-spaces, namely, z mach and z stoch . The auxiliarly regressor previously mentioned then only uses the latent variables from z mach , and is therefore parameterized as p(x | z mach ). Ideally, all information pertaining to machine parameters would be contained in the latent variable z mach and information that represents the stochastic nature of the data would be contained in z stoch . The learning goal can be updated to reflect this: In order to learn a mapping of machine parameters to profiles, we introduce a conditional prior, x : p(z mach | x), to go from machine parameters to z. As per the DIVA framework, we want to learn a latent variable that is split by domain by using conditional priors depending on the domain of the data (machine parameters) during optimization. To learn this split we use conditional priors and the KL divergence. The conditional prior is the initial distribution found through machine parameters, then we optimize the approximate posterior (z found through the regular encoder) to be as close to the prior via the KL divergence. Thus, via the conditional prior, we can pass machine parameters to generate profiles for prediction. The resulting learning goal for the representation learning approach is to minimize via the following function: Row 1 enforces the model to learn a representation to best 'reconstruct' profiles and machine parameters. Row 2 enforces that the priors determined via the profile encoding are as close as possible to those suggested via the conditional prior found from machine parameters. Row 3 enforces the stochastic split latent variable, z stoch , to be close to a standard normal distribution. We further impose physics constraints calculated on the generated profiles and machine parameters in order to enforce the model to learn to generate 'physically realistic' or consistent profiles and machine parameters. This is done via the following three approximations: • static stored pressure at the pedestal/SOL, W ped e := W ped e (n e , T e ) = k B n k=0 p e,k for k radial measurements along the profile, and p e = n e T e , the plasma pressure. This constraint uses the outputs of the decoder, i.e, the predicted profiles, and ideally would enforce the model to learn that pressure is a conserved quantity, i.e., if density increases, temperature decreases.
• poloidal magnetic field, B ped θ := B ped θ (I P , κ, a, δ U , δ L ) = µ0I P C , where C is the approximation of circumference of a D-shaped cross section of JET. This constraint uses the outputs of the auxiliary regressor, i.e., the predicted machine parameters, and ideally would enforce the model to learn to scale the shaping parameters w.r.t to current, i.e., if one of these quantity changes and B θ,ped would like to be conserved, then the other parameters need to be scaled relative to the changes.
• ratio of magnetic pressure to plasma pressure β ped where p e,0 is the plasma electron pressure measurement closest to the high field side. B ped θ is the same as calculated above. This constraint uses both predicted machine parameters and profiles from the model, and ideally would enforce the model to learn to encode the scalings of parameters similarly to suggested above, i.e., if a plasma with different magnetic field or current has the sameβ ped e , then other parameters need to change to reflect this conserved quantity.
The learning goal then gains the following additional mean-squared-error terms: Ideally this enforces the model to learn a representation that is physically consistent. There are many other physics based constraints that could be introduced, but these three show the proof-of-principle on how to introduce physics into the model directly via the learning goal. The benefit of representation learning is that it learns a physically interpretative lower dimensional representation of the plasma state that can be sampled for probabilistic predictions.
Physical interpretation: Full density and temperature profiles are generated for each prediction, unlike the direct mapping. This allows us to validate the output of the model in terms of physical quantities. For example, we can inspect the conservation of the electron pressure (p e = n e T e ), i.e., if density increases, temperature should decrease. Another example is to inspect the inferred relationship between current, I P , and density (Fig. 3).
Latent representation: Through the lower dimensional compressed representation we can identify nonlinear relations between the plasma state and machine parameters. For example, we could correlate latent dimensions to machine parameters and potentially find a set of machine parameters that leads to a specific n e, sep (Fig. 4). This has enormous potential, as the representation could be used as a basis for a control-like algorithm.
Probabilistic: We can quantify the uncertainty in the profile predictions as well as the machine parameters through latent variable z stoch . This is done by repeatedly sampling from z stoch , and taking the standard deviation across the samples (Fig. 4).

Results
For predicting n e, sep , the direct mapping model currently outperforms the representation learning model (Fig.  5). This is likely due to the added constrains for the latter. There, while predicting n e, sep we pass through latent variable z, which has a Gaussian prior and a relatively low dimensionality. This assumption about the data need not align with the true process that describes the plasma behavior: We can expect that this representation compresses away information about the exact location of n e, sep . On the other hand, the direct mapping has no such limitation. Additionally, inferred separatrix location from the generated profile is very sensitive to the generated location T e ∼ 100 eV, and any error in this is propogated through the analysis. In the future, further constraints will be considered to better constrain the separatrix location in the generated profile.
The direct mapping approach performs similar when fit using the JET pedestal database and the dataset created The architecture of the model used in the representation learning approach. This is a modification of the DIVA framework [4]. For conditional generation, only the PriorReg and Decoder are used, in which machine parameters are passed through the PriorReg, from which latent variables z are determined, which are then passed to the decoder, which produces density and temperature profiles.  for this analysis (Fig. 5). Note that the created dataset includes the fraction of the ELM cycle as a machine parameter, whereas the JET pedestal database does not. By expanding the database to include entries within the flattop time window, the direct mapping can provide a decent approximation for inter-ELM n e, sep when given machine parameters and fraction of ELM cycle. Expanding existing databases and methods outside of the mtanh shows promise for training advanced machine learning methods.
The representation learning model is observed to have learned the linear proportionality of plasma current and electron density, I P ∝ n e (Fig. 3). The model is able to extrapolate this proportionality outside of the I P domain of the dataset (Table 1) to currents above and below 4 and 1 [MA] respectively. However, it would continue to increase the density with increasing current to arbitrarily low edge safety factors, whereas an instability would be observed in reality. This is likely because the data used for training does not explicitly contain the information about stability limits. However, these limits could be parameterized as part of the learning goal. Future studies will explore regularization of the latent space using parameterization for the boundaries of the stable operational domain.
Additionally, the model is observed to represent the cyclical crashing of the density over the course of an ELM cycle well (Fig. 3). This shows that the learned representation of the plasma state potentially has the capability to generate full ELM-cycle consistent density profiles for a given set of machine parameters. However, this cannot decisively be said about the predicted temperature.

Conclusion
Direct mapping and representation learning approach to predicting and inferring n e,sep have been explored. We evaluated both approaches on experimental data from JET, based on the EUROfusion pedestal database [1]. For solely predicting n e,sep , the direct mapping performed significantly better than the more comprehensive representation learning approach. This indicates that for specific regressions task on a well labeled datasets, supervised, direct mapping machine learning algorithms can be very powerful. However, the representation learning approach was able to learn a compressed representation of the pedestal electron density and temperature profiles, which could potentially have further applications in addition to predicting n e,sep , e.g., control. Furthermore, future development of these algorithms and approaches are also expected to improve the prediction accuracy through inclusion of additional experimental information, e.g., temporal evolution of the plasma parameters. Additionally, we show how to propagate physics based constraints on deep learning models through approximations of the static pressure of the pedestal top, poloidal magnetic field and β. The results of predicting ne, sep with advanced machine learning techniques show that the direct mapping approach via XG-Boost performs better than representation learning, while both outperform the log-linear scaling law from [1]. Left: Log-linear regression developed by Frassinetti et. al., [1] on the existing JET pedestal database. Middle: The representation learning approach via DIVA framework is made on the created dataset, with predicted ne, sep estimated from conditionally generated profiles via the same linear approximation algorithm used in the dataset creation. Right: A direct mapping via XGBoost is made on the created dataset. Since the JET pedestal database has significantly less entries than the created dataset, entries are randomly sampled from the created dataset for a more reasonable qualitative comparison. In each figure, the solid black line represents unity, and the dashed black lines ±20% error. The approximations of real ne, sep are noticeably larger in magnitude in the JET PDB compared to those in the created dataset. This is due to the lack of ELM averaging techniques for profile selection in this analysis, whereas the JET PDB approximations use only profiles from 75-99% of the ELM cycle [1].