RADYNVERSION: Learning to Invert a Solar Flare Atmosphere with Invertible Neural Networks

During a solar flare, it is believed that reconnection takes place in the corona followed by fast energy transport to the chromosphere. The resulting intense heating strongly disturbs the chromospheric structure and induces complex radiation hydrodynamic effects. Interpreting the physics of the flaring solar atmosphere is one of the most challenging tasks in solar physics. Here we present a novel deep learning approach, an invertible neural network, to understanding the chromospheric physics of a flaring solar atmosphere via the inversion of observed solar line profiles in Hα and Ca II λ8542. Our network is trained using flare simulations from the 1D radiation hydrodynamics code RADYN as the expected atmospheres and line profiles. This model is then applied to single pixels from an observation of an M1.1 solar flare taken with the SST/CRISP instrument just after the flare onset. The inverted atmospheres obtained from observations provide physical information on the electron number density, temperature, and bulk velocity flow of the plasma throughout the solar atmosphere, ranging from 0 to 10 Mm in height. The density and temperature profiles appear consistent with the expected atmospheric response, and the bulk plasma velocity provides the gradients needed to produce the broad spectral lines while also predicting the expected chromospheric evaporation from flare heating. We conclude that we have taught our novel algorithm the physics of a solar flare according to RADYN and that this can be confidently used for the analysis of flare data taken in these two wavelengths. This algorithm can also be adapted for a menagerie of inverse problems, providing extremely fast (∼10 μs) inversion samples.


Introduction
The current and next generation of solar observations, with their high spatial, temporal, and spectral resolution, present a significant analysis challenge, as does the increasing complexity and realism of the models with which the data are confronted. The two go hand-in-hand: ever-increasing resolution reveals observational phenomena that cannot be understood using convenient theoretical simplifications, while the inclusion of "realistic physics" in models (often taken to mean, e.g., multifluid effects and nonequilibrium processes) motivates observational testing at higher and higher resolution. The challenge of model-data comparison grows accordingly and drives us to seek new approaches.
This paper deals specifically with combining models and observations to learn about the structure of the solar atmosphere during a solar flare. The underlying motivation for such investigations is to understand how the energy released in a flare is transported through and dissipated in the solar atmosphere, primarily in the solar chromosphere, where most of the flare's radiation originates (appearing mostly in the optical and UV; e.g., Kretzschmar 2011; Milligan et al. 2014). However, the route to this is complicated. The observed chromospheric radiation, a combination of optically thin (mostly extreme UV) and optically thick (mostly UV to optical) emission, carries information about the temperature, density, and velocity structure of the solar chromosphere, which evolves rapidly with time as it heats. This structure is determined by the pre-flare chromosphere and the characteristics of the flare energy input. The task is to work out the chromospheric structure from the radiation emitted and use this to constrain the properties of the energy input. The picture is complicated because the heating is very intense, between 10¹⁰ and 10¹² erg cm⁻² s⁻¹ (Fletcher et al. 2007; Krucker et al. 2011), compared to the ∼10⁷ erg cm⁻² s⁻¹ (Withbroe & Noyes 1977) needed to balance radiative losses in the nonflaring chromosphere. In addition, there is abundant evidence for nonthermal particles and flows close to the sound speed, meaning that simplifying assumptions such as hydrostatic or local thermodynamic equilibrium are unlikely to be valid.
We focus here on optically thick emission lines from the upper photosphere and chromosphere. These lines encode information about the atmospheric structure; typically, the emergent radiation in the line core is formed higher up in the atmosphere than in the line wings. A number of techniques exist for "inverting" optically thick line profiles to recover the structure of the atmosphere that emitted them, though most have been developed for the inversion of spectropolarimetric information to also include the magnetic field, which is not our concern at present. These include analytic methods employing the Milne-Eddington approximation for frequency-independent opacity in an LTE atmosphere (e.g., Skumanich & Lites 1987), the non-LTE codes NICOLE (Socas-Navarro et al. 2000) and HAZEL (Asensio Ramos et al. 2008), and the non-LTE code STiC (de la Cruz Rodriguez et al. 2019), which can treat multiple atomic species and a complex atmospheric stratification. In essence, these all iterate the output of a forward model toward the observed spectropolarimetric line profiles (note that an alternative approach for solving the inverse problem for the chromospheric temperature structure from an integral form was demonstrated by Metcalf et al. 1990). They have also not been developed with the flare chromosphere in mind, though NICOLE has been used by Kuridze et al. (2017, 2018) for flares. While non-LTE calculations are included in many codes, hydrostatic equilibrium is uniformly assumed. Instead, the most frequently used approach for flares is forward modeling with codes such as RADYN (Carlsson & Stein 1992; Allred et al. 2005, 2015) in an attempt to match the observed spectral lines. The energy input to the model is specified according to observed properties, when possible (i.e., the energy input by nonthermal electrons deduced from hard X-rays). This approach has produced some notable insights into the properties of the flare chromosphere from both line and continuum emission (e.g., Kuridze et al. 2015; da Costa et al. 2016; Kowalski et al. 2017; Simões et al. 2017). However, iterating these models toward agreement with observations is not practical, and in some cases, reproducing features of the observations pushes the models in ways that are difficult to justify observationally (e.g., the long beam injection times required by Kennedy et al. 2015). Also, while manageable for small samples of data, this "trial-and-error" approach cannot realistically be scaled up to take advantage of the high volumes of data from new instruments. Furthermore, in cases where the energy input by nonthermal electrons cannot be constrained because of a lack of complementary observations, it is hard to know where to start among the vast range of model possibilities.
Here we take a different track, exploiting developments in machine learning to efficiently recover RADYN-like atmospheres from spectral-line profiles. We design and train an invertible neural network (INN; similar to that introduced in Dinh et al. 2016 and Ardizzone et al. 2018) to learn the output Hα and Ca II 8542 Å spectral lines corresponding to many thousands of RADYN atmospheric solutions, and vice versa. The network proves capable of inverting model RADYN spectral-line profiles to accurately generate the corresponding RADYN atmospheric parameters, giving us confidence in its ability to recover reasonable, realistic atmospheres from observed flare spectral data. We demonstrate the method on data taken by the CRisp Imaging SpectroPolarimeter (CRISP) instrument on the Swedish Solar Telescope (Scharmer et al. 2003, 2008). The method is fast, producing both atmospheric parameters and a measure of their uncertainties in about 44.7 μs measurement⁻¹ on a GPU. This makes application to large data sets feasible.
This initial paper is intended to demonstrate proof of concept, underpinning future in-depth analysis of flares. In Section 2, we describe the principles of INNs, and Section 3 covers how our network is trained and validated on RADYN models. In Section 4, we present the first inversion using this method of real flare data, and we end with discussion and conclusions in Section 5.

INNs
An inverse problem is one in which a set of measurements is used to deduce the properties of the system that caused them. It is usually the case that information about the system is missing because of the properties of the medium or the complexity of the physics involved. The example presented in this paper is that of deducing the plasma parameters of the chromosphere, which are three-dimensional quantities, whereas we only observe the chromosphere as two-dimensional images at a given wavelength from an instrument such as the Swedish Solar Telescope CRISP instrument (Scharmer et al. 2003, 2008). We wish to learn about this missing information, as it will constrain our model of the physical system producing the observations. Formally, for any process, there exists a function y = f(x) that maps the input of physical parameters x to the output of observations y: this function is known as the forward process. The forward process does not define a bijective function, meaning that we cannot find a unique mapping from the output to the input; i.e., there are many possible x for a single y. This proves to be important, since a traditional neural network trained on such a problem will only learn to find one of the possible solutions or an average of multiple correct but physically incompatible solutions. Furthermore, with a traditional neural network, it is impossible to ever know if the connections being made are the correct ones, as the network is trying to learn an ill-defined problem.
We circumvent this issue in our work by introducing a latent space z that captures all of the information lost in the forward process (Dinh et al. 2014 and references therein). The latent space z represents the space of all information loss in the forward process, such that a sample from the latent space combined with the observation y can be mapped to the correct input parameters x. As a result of the introduction of latent variables, we now have a bijective mapping [x] ↔ [y, z]. This means that we have transformed the inverse process into a deterministic function (a function that has a definite result for a set of inputs). Consequently, sampling different values from the latent space will lead to a sampling of the distribution of the input parameters corresponding to a given output observation. This deterministic function x = g(y, z) is thus invertible, and we can learn the function g⁻¹ as the forward process and g as the inverse process that will directly track where the lost information is obtained from the latent space. This is characterized by our network assuming that the latent variables z are drawn from the unit multivariate Gaussian distribution 𝒩(0, I_N) for an N-dimensional data space in the reverse direction. Here g⁻¹ will populate the true latent space z_true with the information lost in the forward process. Our network is then trained in such a way (see Section 3) as to learn this mapping from the true latent distribution to the unit Gaussian latent distribution. After sufficient training, sampling the unit Gaussian distribution will be equivalent to sampling the true latent distribution, since they differ by only a known mapping. The choice of drawing from the unit multivariate Gaussian is an arbitrary one. It is true that any distribution could be used here, but we choose a Gaussian because it is smooth and continuous. The architecture we choose to learn this is our INN.
Like traditional neural networks, INNs are composed of interconnected layers of neurons that aim to learn a function from input to output. The key difference is the composition of the hidden layers between the input and output. These take the form of affine coupling layers (Dinh et al. 2014, 2016). Affine coupling layers are simple yet powerful tools. By construction, in learning the function from the input to the output with an affine coupling layer, we get the inverse function learned for free. This is due to the reversibility of the blocks, illustrated in Figure 1. We base our layers on the form first presented in Ardizzone et al. (2018). The input x is split into two equal parts [x₁, x₂] that are propagated through the forward direction of the block. This leads to x₂ undergoing an affine transformation before combining with x₁ to obtain half of the output, y₁. Then, y₁ is subject to its own affine transformation and combination with x₂ to get the second half of the output, y₂. This is illustrated in the upper panel of Figure 1. There is now a simple relation between the input and output for this layer,

y₁ = x₁ ⊙ exp(s₂(x₂)) + t₂(x₂),
y₂ = x₂ ⊙ exp(s₁(y₁)) + t₁(y₁),

where ⊙ denotes the element-wise multiplication of two tensors (which are represented by matrices in our problem), and the functions sᵢ, tᵢ are arbitrarily complex and differentiable (i ∈ {1, 2}). After obtaining the pair of outputs [y₁, y₂], they are then concatenated to give the total output y. The inverse of this operation is then simple, and we can also map from the output y to the input x,

x₂ = (y₂ − t₁(y₁)) ⊘ exp(s₁(y₁)),
x₁ = (y₁ − t₂(x₂)) ⊘ exp(s₂(x₂)),

where ⊘ denotes the element-wise division of two matrices.
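To make this reversibility concrete, the following is a minimal PyTorch-style sketch of a single coupling block implementing the relations above. It is illustrative only: our actual network is built with the FrEIA package (Section 3), and the class and argument names here are hypothetical.

```python
import torch
import torch.nn as nn

class AffineCouplingLayer(nn.Module):
    """Minimal sketch of one reversible affine coupling block.

    s1, t1, s2, t2 are arbitrary (non-invertible) sub-networks;
    invertibility of the block does not depend on their form.
    """

    def __init__(self, s1, t1, s2, t2):
        super().__init__()
        self.s1, self.t1, self.s2, self.t2 = s1, t1, s2, t2

    def forward(self, x1, x2):
        # y1 = x1 * exp(s2(x2)) + t2(x2); y2 = x2 * exp(s1(y1)) + t1(y1)
        y1 = x1 * torch.exp(self.s2(x2)) + self.t2(x2)
        y2 = x2 * torch.exp(self.s1(y1)) + self.t1(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Exact inverse: element-wise division undoes the multiplication.
        x2 = (y2 - self.t1(y1)) / torch.exp(self.s1(y1))
        x1 = (y1 - self.t2(x2)) / torch.exp(self.s2(x2))
        return x1, x2
```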
We have now defined a setup in which the inverse is easily calculable. This is extremely useful for inverse problems, as it is rarely easy to find the inverse function for a forward model. This means that the only problem we now need to deal with is learning what the latent space is to make sure that our network produces the correct inversion; see Section 3 for more information. Since the functions s i , t i do not need to be inverted themselves to calculate the inversion, they can be as complex and arbitrary a function as needed. To fill this role, we look to fully connected artificial neural networks (ANNs).
ANNs can learn complex classification and regression problems via a method known as backpropagation, and are therefore universal function approximators (Rumelhart et al. 1986;Cybenko 1989). They are an example of supervised machine learning, meaning that the network is trained on a data set where the answers to the functions we want to learn are known. In backpropagation, the input data is fed through a neural network, where linearities and nonlinearities are applied to it until it reaches the output, where it is compared with the known answers. This comparison is then surmised by a loss function, which is minimized by changing the values of the weights in each layer of the network to produce a different result (Schmidhuber 2015). There have been innumerable successes of ANNs learning complex functions via this method, so we use randomly initialized ANNs as our complex s i and t i functions in the INN.
In our network, the functions sᵢ and tᵢ are defined by four-layer fully connected networks (FCNs). An FCN is a type of ANN where all neurons in the previous layer are connected to all neurons in the current layer. The basic architecture for the FCNs utilized in our network is shown in Figure 2. The activation function (the function that determines to what extent the nodes pass on information to the next layer) after the first three layers in our deep networks is a leaky rectified linear unit (leaky ReLU), which is the identity for positive inputs and linear with a small positive slope for negative inputs; the activation after the fourth layer is a standard ReLU, f(x) = max(0, x), where x is the input. These activations are used, as they are sparse and thus speed up computation. Furthermore, ReLU activation and its variants are popular, as they are better at avoiding the vanishing gradient problem (when the gradients of the loss are small enough, they do not affect the update of the weights, leading to the optimizer getting stuck in the loss space). The functional forms of sᵢ and tᵢ differ by a clamping inverse tangent function applied at the end of the sᵢ networks. This clamping function stops the exponential terms dominating the affine transform while still being smooth (i.e., gradients are still easy to calculate). These networks are trained as normal via backpropagation (see Section 3), and they learn the optimal representation of the affine transform that will approximate the forward physical model. This representation is then also optimal for the inverse problem, as the FCNs apply to the inverse problem too. Our network is composed of five stacked affine coupling layers. Stacking these layers allows us to approximate more complex tasks (this is the standard pillar of deep learning; Raschka 2015). This means that the network depends on 20 deep neural networks to approximate our inverse problem. Between each pair of subsequent affine coupling layers, we have what is known as a permutation layer. This introduces channel mixing into our network by permuting the order of the inputs to each new layer. Channel mixing is when the inputs are shuffled into a different order. This is done because the input to the affine coupling layers is split in two, meaning that if there were no permutation, these two halves would remain independent throughout the network. The permutations are done by shuffling the input dimensions of our network in a random but fixed way (Dinh et al. 2014, 2016). Each permutation is different from the previous one. This increases the generalization properties of our network. The architecture of the INN is shown in Figure 3. The flow of the forward model is shown by the black arrows, and the flow of the inverse is shown by the cyan arrows.
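As an illustration of how the sᵢ/tᵢ subnetworks and the fixed random permutations described above might be constructed, here is a hedged PyTorch sketch. The layer widths, clamping scale, and leaky-ReLU slope are placeholders rather than the values used in our trained model.

```python
import math
import torch
import torch.nn as nn

class ClampArctan(nn.Module):
    """Smooth clamp c * (2/pi) * atan(x), applied after the s_i networks."""
    def __init__(self, clamp=2.0):  # clamp scale is an assumed placeholder
        super().__init__()
        self.clamp = clamp

    def forward(self, x):
        return self.clamp * (2.0 / math.pi) * torch.atan(x)

class FixedPermutation(nn.Module):
    """Random but fixed shuffle of the feature dimensions (channel mixing)."""
    def __init__(self, dim):
        super().__init__()
        perm = torch.randperm(dim)
        self.register_buffer("perm", perm)
        self.register_buffer("inv_perm", torch.argsort(perm))

    def forward(self, x):
        return x[:, self.perm]

    def inverse(self, x):
        return x[:, self.inv_perm]

def make_subnet(dim_in, dim_hidden, dim_out, clamp=False):
    """Four-layer FCN used for the s_i (clamp=True) and t_i (clamp=False)
    functions: leaky ReLU after the first three layers, ReLU after the fourth."""
    layers = [
        nn.Linear(dim_in, dim_hidden), nn.LeakyReLU(),
        nn.Linear(dim_hidden, dim_hidden), nn.LeakyReLU(),
        nn.Linear(dim_hidden, dim_hidden), nn.LeakyReLU(),
        nn.Linear(dim_hidden, dim_out), nn.ReLU(),
    ]
    if clamp:
        layers.append(ClampArctan())
    return nn.Sequential(*layers)
```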

Training an INN Using Synthetic Flare Data
This section describes the methods used to train and validate an INN to learn a bijective mapping between atmospheric profiles and two spectral lines. The training data consist of synthetic flaring solar atmospheres and spectral-line profiles generated from the one-dimensional radiation hydrodynamic model RADYN.

Training Data
The state-of-the-art forward models for simulating the atmospheric response and radiation originating from solar flares are one-dimensional radiation hydrodynamic models that solve the equations of hydrodynamics coupled with the equations of radiative transfer (outside local thermodynamic equilibrium and statistical equilibrium). Among these models are RADYN (Carlsson & Stein 1992; Allred et al. 2005, 2015), FLARIX (Varady et al. 2010; Heinzel et al. 2016), and HYDRAD (Bradshaw & Cargill 2013). Due to the preexisting grid of RADYN simulations and its widespread acceptance, we have chosen to use RADYN as the forward model for training here. These RADYN simulations all start from a modified VAL3C quiet Sun atmosphere (Vernazza et al. 1981).
For the simulations in the RADYN grid, the dynamic atmospheric response to an electron beam from a flare is computed, where
1. the distribution of electron energies in the beam is modeled as a power law with variable total energy flux (in the range 3×10¹⁰ to 1×10¹² erg cm⁻²),
2. the beam low-energy cutoff is E_c = {10, 15, 20, 25} keV,
3. the beam spectral index is δ = {3, 4, 5, 6, 7, 8},
4. the beam flux is a symmetric triangular pulse lasting for 20 s and peaking at 10 s, and
5. the simulation lasts for 50 s, with data available every 0.1 s.
Some of the simulations with high total energy, lower values for E c , and higher values for δ did not complete and therefore are not available in the grid. This leaves 81 simulations, with 40,500 total timesteps to use as our training data. Of these timesteps, 20% are separated and used to independently verify the training.
RADYN uses an adaptive spatial grid (Dorfi & Drury 1987) to accurately represent the atmosphere, but due to the way in which our INN learns shapes, these data must first be interpolated onto a fixed, static grid. As we are primarily interested in the chromosphere and transition region, where the plasma parameters vary rapidly in space, we interpolate onto 45 linearly spaced points below 3.5 Mm with a grid spacing of 79.2 km. Five further points are used to represent the rest of the corona, and these are spaced exponentially from 3.5 up to 10 Mm.
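A small NumPy sketch of how such a fixed altitude grid could be constructed is given below; the exact node placement of the exponentially spaced coronal points and the grid endpoints are assumptions here.

```python
import numpy as np

# 45 linearly spaced altitude nodes below 3.5 Mm plus 5 exponentially spaced
# nodes covering 3.5-10 Mm; exact endpoint handling is an assumption.
z_chrom = np.linspace(0.0, 3.5e8, 45)                        # cm
z_corona = 3.5e8 * (10.0 / 3.5) ** (np.arange(1, 6) / 5.0)   # cm, up to 10 Mm
z_grid = np.concatenate([z_chrom, z_corona])                 # 50 nodes total

# Each RADYN snapshot (on its adaptive grid z_radyn) is then interpolated onto
# z_grid, e.g. np.interp(z_grid, z_radyn, temperature), for each plasma parameter.
```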
The plasma parameters extracted from the RADYN simulations are the electron density n_e [cm⁻³], the temperature T [K], and the velocity v [cm s⁻¹] as a function of altitude and time on the interpolated grid. The line profiles from these simulations, for Hα 6563 Å and Ca II 8542 Å, are each interpolated onto 30 linearly spaced points in wavelength across wavelength ranges with half-widths of 1.4 and 1.0 Å, respectively. The assumption of energy input specifically by an electron beam originating in the corona results in a characteristic Coulomb collisional energy deposition profile in the chromosphere determining n_e, T, and v. For the spectral lines we use, RADYN calculates both the thermal and nonthermal (i.e., direct beam electron impact) collisional rates.
To reduce the dynamic range of these profiles and improve the performance of the INN, we first map n_e → log₁₀(n_e), T → log₁₀(T), and v → sign(v) log₁₀(1 + |v|/10⁵). For each timestep in each simulation, the line profiles are divided by the maximum intensity across the two profiles, so that the profiles' relative intensities are preserved on a [0, 1] scale.
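A hedged NumPy sketch of these scalings follows; the inverse transforms, used to map network output back to physical units, follow directly.

```python
import numpy as np

def scale_atmosphere(n_e, T, v):
    """Compress the dynamic range of the RADYN plasma parameters.

    n_e in cm^-3, T in K, v in cm/s (positive = upflow).
    """
    n_e_s = np.log10(n_e)
    T_s = np.log10(T)
    v_s = np.sign(v) * np.log10(1.0 + np.abs(v) / 1e5)  # ~log of |v| in km/s
    return n_e_s, T_s, v_s

def scale_lines(halpha, ca8542):
    """Normalize both line profiles by their joint maximum, preserving the
    relative intensity of the two lines on a [0, 1] scale."""
    peak = max(halpha.max(), ca8542.max())
    return halpha / peak, ca8542 / peak
```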

MMD
Training the INN is made possible by the use of the maximum mean discrepancy (MMD). The MMD is a statistic used for computing the distance between two probability distributions based on a set of randomly drawn samples from each distribution, by mapping the samples into a high- or infinite-dimensional space through a nonlinear feature map. Our implementation is explained in depth in the Appendix, drawing on Gretton et al. (2012) and lectures given at the Machine Learning Summer School 2018.

Training
Our INN is trained similarly to Ardizzone et al. (2018) and is based on their Framework for Easily Invertible Architectures (FrEIA). Herein, we provide a more in-depth description of the training method and the slight differences in the MMD loss used.

(Figure 2 caption: Basic architecture of the FCNs used for the sᵢ and tᵢ functions. These are deep neural networks with four hidden layers. The sᵢ networks contain a smooth clamping function after their output, in the form of the inverse tangent; this clamps the output such that the exponential term in our affine transform does not overshadow the linear term (as this would make the linear term null). The input dimension is half the input dimension of the affine coupling layer, due to the splitting of the input shown in Figure 1, and the hidden layer size is double this.)
The INN is trained with the preprocessed simulation data, alternating forward and backward iterations. We define the input x as the concatenation of the electron density, temperature, and velocity profiles at a certain timestep. The output y is the concatenation of the normalized line profiles at this timestep. The latent space z is currently defined to be the same length as x, although this is still an area of investigation tied to the intrinsic dimensionality of the problem. The output of the INN is then the vector [z, y]. Both the input and output vectors are zero-padded to provide the network blocks with additional dimensionality over which to represent the learned mapping, as well as to fix the input and output vectors to the same length, as the structure of the affine coupling layers requires this. We write these zero-padded vectors as x_p = [x, 0, 0, …] and y_p = [z, 0, 0, …, y], and in our network, these are padded to have a length of 384.
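For concreteness, a small sketch of how the padded vectors could be assembled is shown below; the individual dimensions are inferred from the grid and wavelength sampling described above (50 altitude nodes, 3 parameters, 2×30 wavelength points), while only the total padded length of 384 is stated in the text.

```python
import numpy as np

TOTAL = 384          # padded length used by the network
N_ATMOS = 50 * 3     # 50 altitude nodes x (n_e, T, v)  -> len(x) = 150
N_LINES = 30 * 2     # 30 wavelengths x (Halpha, Ca II) -> len(y) = 60
N_LATENT = N_ATMOS   # z is defined to have the same length as x

def pad_x(x):
    """x_p = [x, 0, 0, ...] padded to the common length."""
    assert x.size == N_ATMOS
    return np.concatenate([x, np.zeros(TOTAL - x.size)])

def pad_y(z, y):
    """y_p = [z, 0, 0, ..., y]: latent first, zeros, line profiles last."""
    assert z.size == N_LATENT and y.size == N_LINES
    return np.concatenate([z, np.zeros(TOTAL - z.size - y.size), y])
```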
The forward and backward training directions are both constrained by two loss functions. A loss function is a function that the neural network optimizer attempts to minimize during training so as to minimize the distance between the output from the ANN and the expected output. In the forward direction, we apply an L2 loss (‖y − y_true‖²₂) to the zero-padding and line profiles in the generated y_p vector against the expected y_p training vector from the forward model. An MMD loss is also applied between batches of the generated [y, z] and batches of [y_true, z′], with z′ drawn from the unit multivariate Gaussian. During backpropagation (modification of the weights in the ANN layers guided by the gradients at these nodes), the gradients on the generated y due to the MMD loss are ignored, so as to train the neurons learning the mapping from the true latent distribution to the normal distribution without hindering the training of the forward model x → y. The convergence of this MMD loss ensures the independence of z from y.
The backward direction is trained similarly. The vector of y true and the latents z generated by the forward iteration is propagated through the network in reverse, and an L2 loss is applied between x p and a zero-padded vector containing x true . Another vector of y true with latents z drawn from the normal distribution is also propagated in reverse, and an MMD loss is computed between x and x true . This second MMD loss serves to ensure that the distributions of x across the batch look alike (while taking into account internal variability within the true distribution).
The kernel used in our MMD loss is the same as that of Tolstikhin et al. (2017) and Ardizzone et al. (2018), the inverse multiquadric (IMQ) kernel, as it has been found to be the most effective for comparing sample quality in these problems. In the example provided by Ardizzone et al. (2018), the kernel used is a sum of IMQ kernels with different α (due to the properties of the reproducing kernel Hilbert space (RKHS) over which the MMD is defined, this sum is also a kernel); however, we had difficulty isolating a set of values for α that were effective in training the latent distribution to match the expected distribution without dependence on y. By plotting the MMD for the same x and y samples but different values of α, it was found that the biased sample estimate of the MMD between x and y drawn from similar but perturbed distributions produced a peak for certain values of α. We therefore compute the value of α at the turning point of MMD²(α) (for which the MMD is maximal) during the training of the net and update α every five epochs. This approach is supported by Sriperumbudur et al. (2009), as the kernel of a family that yields the greatest distinction between the two differing distributions is the one for which the MMD estimate is maximal. Our INN is trained using the Adam optimizer (Kingma & Ba 2014) with β₁ = β₂ = 0.8 and ε = 1×10⁻⁶, where the β hyperparameters control the momentum of the first and second moments of the gradients and ε prevents division by zero. A hyperparameter is a parameter that is set prior to training, possibly evolving in a predictable fashion, and is not optimized by the training process. The values of these parameters are typically determined empirically and may well not be optimal, but they have been chosen to lead to convergence of the model. The learning rate η (the size of the steps taken in descending the gradient) is initially set to 1.5×10⁻³ and decays by a factor of γ = 0.004^(1/1333) every 12 epochs; thus, for the model presented in this paper, trained for 11,400 epochs, the final learning rate is η ≈ 3.38×10⁻⁵. This model does not appear to be very sensitive to variations in the learning rate, and multiple variations of γ have been used with success. We used a minibatch size of 500, with 20 minibatches per epoch, and the backpropagation took place every minibatch. In contrast to traditional training, where the model is trained on the entire training set every epoch and accumulates gradients over the entire training set before backpropagation, minibatch training shows the model multiple small subsets of the data each epoch, with gradient accumulation and backpropagation between each of these minibatches.

(Figure 3 caption: Architecture of our INN. We have five affine coupling layers, with a permutation layer sandwiched between each pair of adjacent affine coupling layers (four in total). The forward process mapping the input to the output is illustrated by the black arrows. The inverse process mapping a combination of the output and the latent space to the input is illustrated by the cyan arrows.)
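A hedged PyTorch sketch of the optimizer and learning-rate schedule described above is given below; the hyperparameters follow the text, while the stand-in `model` and the skeleton training loop are illustrative assumptions.

```python
import torch

model = torch.nn.Linear(384, 384)   # stand-in for the INN module

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1.5e-3,              # initial learning rate eta
    betas=(0.8, 0.8),       # beta_1 = beta_2 = 0.8
    eps=1e-6,               # epsilon preventing division by zero
)

# Decay the learning rate by gamma = 0.004**(1/1333) every 12 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=12, gamma=0.004 ** (1.0 / 1333.0)
)

# for epoch in range(n_epochs):
#     for batch in minibatches:        # 20 minibatches of 500 per epoch
#         optimizer.zero_grad()
#         loss = ...                   # forward + backward losses (see below)
#         loss.backward()
#         optimizer.step()
#     scheduler.step()
```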
The two losses computed for each of the forward and backward iterations need to be combined into a single loss in each direction for the backpropagation. We use this as an opportunity to add additional hyperparameters with which to weight the various losses when combining them. We therefore define three weights: w_pred, w_latent, and w_rev. The combined losses in each direction are then weighted sums of the corresponding L2 and MMD terms, with the backward MMD loss, MMD_b, modulated by a fade-in function ξ(n), where n is the current epoch and N_fade is the number of epochs in the initial training stage. The function ξ(n) helps to avoid the initially large gradients in MMD_b from steering the net away from the correct solution. In practice, it was found that this function was not strictly necessary but improved convergence. Additionally, the zero padding was set to 5×10⁻² (1 − ξ(n)) 𝒩(0, 1) to increase the activations of these neurons during early training and therefore push their outputs toward zero. The exact values of these parameters were determined empirically but with an emphasis on minimizing the L2 losses.
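As an illustration, a hedged sketch of how such a weighted combination with a fade-in could be written is shown below; the functional form of ξ(n) and the exact grouping of terms in our training code are assumptions, not the precise implementation.

```python
def xi(n, n_fade):
    """Fade-in factor rising from 0 to 1 over the first n_fade epochs.
    The precise functional form used in training is an assumption here."""
    return min(1.0, n / n_fade)

def combined_losses(l2_fwd, mmd_fwd, l2_rev, mmd_rev, n, n_fade,
                    w_pred=4000.0, w_latent=900.0, w_rev=1000.0):
    """Weighted forward and backward losses; the backward MMD term is
    modulated by xi(n) to suppress its initially large gradients."""
    loss_forward = w_pred * l2_fwd + w_latent * mmd_fwd
    loss_backward = w_rev * (l2_rev + xi(n, n_fade) * mmd_rev)
    return loss_forward, loss_backward
```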
The first 800 epochs were treated as a fade-in stage, during which ξ(n) grew to 1 and the padding became zero. For this phase, the loss weightings were set to w_pred = 4000, w_latent = 900, and w_rev = 1000. After this initial phase, the net was trained in batches of 400 epochs up to 4800 epochs, increasing w_pred by 1000 each batch. This process was then repeated with batches of 600 epochs up to a total of 12,000 epochs. Finally, the model that performed best on the unseen validation set was chosen as the final model. This model was trained for 11,400 epochs.

Validation
The first stage in validating the training of the model is to test the forward model against ground truths on the unseen testing data. Figure 4 shows the results of the forward model. The top panels are the electron number density, temperature, and flow speed from an unseen RADYN snapshot, and the bottom panels compare the "ground-truth" RADYN output line profiles with the network's forward process. The mean squared error is 5.73×10⁻⁵ in the scaled intensity at each wavelength point. Note that for all figures in this paper, wavelength axes show the wavelength in a vacuum, and positive velocities represent upflows.
It is somewhat more difficult to evaluate the model's ability to reproduce an atmosphere when given the line profiles due to the aforementioned ambiguity of the problem, as one set of line profiles may have been produced by a variety of atmospheres. To understand the range of solutions, we draw random samples from the latent space multiple times and use these samples with the line profiles to generate a histogram of atmospheric properties predicted by the INN. Figure 5 shows the results and verification of the inversion of data from the unseen testing set. In the top panels, the input line profiles are plotted in dashed blue lines on top of horizontal bars representing the line profiles calculated using the recovered atmospheric solutions. The recovered solutions are shown in the bottom panels, plotted as two-dimensional colored histograms representing the probability density of the solution at each altitude node. The regions of highest density in these parameters are therefore the most likely values. Superposed on this are the ground-truth values for each parameter, plotted as dashed lines. The data in the histograms are accumulated for every solution for the atmospheric profile produced from different draws of the latent space and represent 10,000 sampled solutions.
To better show the range of outlying solutions, all of the histograms were gamma-corrected (with γ = 0.3) to reduce contrast. As can be seen from the dashed black lines in the bottom panels of Figure 5, the peak density of the solutions is close to the ground truth, and the narrowness of the histograms shows that the solution is well constrained through the atmosphere up to around 3 Mm above the photosphere. However, the spectral lines used do not constrain the problem well in the upper atmosphere, and although the solutions align very well with the ground truth, the histograms are broader, particularly for the velocity profile at 4 Mm and above. The histograms underneath the input line profiles in the top panels of Figure 5, so narrow as to look like single bars, are obtained by applying the forward model to each atmosphere produced by the inverse process and gamma-corrected in the same way. They reproduce the input line profiles very closely, demonstrating the self-consistency of the model's solutions.
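A hedged sketch of this sampling procedure is given below; function names such as `inn_inverse` are placeholders for the trained network's reverse pass, not part of our actual code.

```python
import numpy as np

def sample_inversions(y_obs, inn_inverse, n_samples=10_000, n_latent=150):
    """Draw repeated latent samples z ~ N(0, 1) and invert the observed line
    profiles y_obs, returning one atmosphere per draw for histogramming."""
    atmospheres = []
    for _ in range(n_samples):
        z = np.random.standard_normal(n_latent)
        atmospheres.append(inn_inverse(y_obs, z))   # x = g(y, z)
    return np.stack(atmospheres)

# 2D histograms of, e.g., temperature vs. altitude can then be built with
# np.histogram2d over the sampled profiles at each altitude node.
```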

Single-pixel Inversion of Real Flare Data
We have demonstrated above that the INN has successfully learned the synthetic flare model from RADYN. The next step is to apply our learned model to real spectroscopic data, with the intention of characterizing the atmosphere that produced it, and eventually learning about the physics of a flaring chromosphere. As our problem is only defined in Hα and Ca II 8542 Å, and these are mostly formed in the chromosphere (cores) and upper photosphere (wings), we will focus specifically on our results for atmospheric parameters below around z≈2 Mm. We do not attach much significance to the results from the small number of points in the corona.
The flare data we use are from the M1.1 two-ribbon solar flare SOL2014-09-06T17:09, which occurred in NOAA AR 12157 at heliocentric coordinates (−732″, −302″). Data were taken by CRISP (Scharmer 2006; Scharmer et al. 2008) mounted on the Swedish 1 m Solar Telescope (Scharmer et al. 2003) on La Palma. CRISP produced imaging spectroscopy data in both Hα and Ca II. The Hα data consist of 15 wavelength positions sampled at intervals of 200 mÅ from the line core, and the Ca II data consist of 25 wavelength positions sampled at intervals of 100 mÅ from the line core. The cadence of these observations is 11.54 s, with a spatial sampling of 0.057″ pixel⁻¹ (giving a spatial resolution of 0.114″). The data set is open access and available from the F-CHROMA solar flare database (Cauzzi et al. 2014), where it has been preprocessed and reconstructed using Multi-Object Multi-Frame Blind Deconvolution (MOMFBD; Van Noort et al. 2005) and the CRISPRED data reduction pipeline (de la Cruz Rodriguez et al. 2015). We assume that the intensity calibration of the two lines is done as well as possible in the same way through the CRISPRED pipeline. Therefore, we are assuming that the relative intensities between the two lines are physically meaningful, as assumed by our inversion technique. This event was previously analyzed by Kuridze et al. (2015), who presented the time evolution of the Hα and Ca II 8542 Å lines in small flaring regions and compared these with RADYN forward modeling, driven by an electron beam with properties deduced from the observed hard X-ray spectrum, commenting primarily on the relationship between plasma flows and line asymmetries. Figure 6 shows the wing and core images of Ca II and Hα at ∼16:56 UTC, just after the onset of the flare at ∼16:54 UTC. These images clearly show the presence of two flare ribbons during the time of the observation. We chose two pixels to invert: one on the flare ribbon and one off the flare ribbon. These are indicated in Figure 6 by a circle and square, respectively. The spectral-line profiles from the two pixels are extracted, normalized to the maximum value of the two lines, and interpolated to the RADYN grid. These are shown in Figure 7.
The lines in the top panels of Figure 7 are from a point on the flare ribbon, and those in the bottom panels are from a point off the flare ribbon (the circle and square, respectively, in Figure 6). The Ca II 8542 Å line profile for the circular point is characteristic of a flare. It is fully in emission, and the core is slightly blueshifted (with respect to the vacuum wavelength) by ∼3.51 km s⁻¹, with a slight wing asymmetry. The Hα profile is highly asymmetric, with the blue peak of the central reversal being much higher in emission than the red peak. For the square point, both profiles are heavily in absorption (indicative of the quiet Sun). The Ca II and Hα cores are slightly redshifted here (by ∼1.26 and ∼2.18 km s⁻¹, respectively), and both profiles have some asymmetry between the wings.
To calculate the asymmetries in the profiles, we use a technique similar to that described in Mein et al. (1997), De Pontieu et al. (2009), and Kuridze et al. (2015), comparing the integrated intensities of the blue and red wings, where λ_0B and λ_0R are the center wavelengths of the blue and red wings, respectively, and δλ is the half-width of each wing about its center wavelength. The wings are defined as the regions of the line more than one standard deviation σ away from the calculated line-core wavelength λ_0. The end of the blue wing and the start of the red wing are then defined by λ_0 − σ and λ_0 + σ, respectively, allowing us to calculate the central wavelengths of the wings and the half-width of the wings (i.e., λ_0B, λ_0R, and δλ). These values, along with the intensity ratio of the wings, I_B/I_R, are presented in Table 1. The off-ribbon profiles both have red asymmetries of ∼1.8% for calcium and ∼1.7% for Hα. This corresponds to small positive velocity gradients or downflows in the region where the wings of these lines are formed. The calcium profile on the ribbon has an ∼3.2% blue asymmetry, while the Hα profile has a red asymmetry of ∼0.4%. This corresponds to small negative velocity gradients or upflows in the region where the wings of calcium are formed. It has been shown that the spectral lines we are considering should be symmetric about the line core in a static atmosphere (Canfield et al. 1984; Fang et al. 1993; Cheng et al. 2006), implying that the velocity field in the flaring atmosphere is responsible for the observed asymmetries. This is likely linked to chromospheric evaporation (Neupert 1968; Fisher et al. 1985; Graham & Cauzzi 2015) and condensation (Ichimoto & Kurokawa 1984; Wulser & Marti 1989), which are the bulk expansion flows that occur in the rapidly heated flare chromosphere. However, mapping between the observed asymmetry and the flow direction is complicated by absorption and emission in the moving plasma. For example, a blue asymmetry, as is observed in the Ca II line on the flare ribbon, could be due to emission from upflowing plasma or absorption by downflowing plasma, as argued for this flare by Kuridze et al. (2015).
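A hedged sketch of one way to compute such a wing asymmetry from a sampled profile is given below; the exact asymmetry metric used by the cited works may differ in detail, and the moment-based estimates of λ_0 and σ are assumptions for illustration.

```python
import numpy as np

def wing_asymmetry(wavelength, intensity):
    """Return the relative blue/red wing asymmetry of a line profile.

    The line-core wavelength lam0 and width sigma are estimated from the
    intensity-weighted moments of the profile; the wings are the regions
    more than one sigma from lam0. Positive values indicate a brighter
    blue wing.
    """
    w = intensity / intensity.sum()
    lam0 = np.sum(w * wavelength)                          # first moment
    sigma = np.sqrt(np.sum(w * (wavelength - lam0) ** 2))  # second moment

    blue = wavelength < (lam0 - sigma)
    red = wavelength > (lam0 + sigma)
    i_blue = intensity[blue].sum()
    i_red = intensity[red].sum()
    return (i_blue - i_red) / (i_blue + i_red)
```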
These observed spectral-line profiles were propagated in the backward direction through our INN (see Figure 3) 20,000 times each, with different random draws from the unit Gaussian latent space (i.e., 20,000 inversions). The inversion of a single pixel takes ∼893 ms on an NVIDIA GTX1050Ti and ∼84.5 s on an Intel Core i7-8700 CPU. The results of the inversions for the point on the flare ribbon are shown in Figure 8 and for the point off the flare ribbon in Figure 9. As in the case of the model validation in Section 3.4, the results are plotted as two-dimensional histograms (top panels of Figures 8 and 9). The dashed lines show the median profile for the parameters. This gives an approximation to the true solution from our inversion, as the median profile will pass through the bins with the highest densities. The bottom panels of these figures are plots of the observed spectral lines (dotted blue lines) and the densities of the round-trip profiles obtained by passing the results of the inversion back through the network in the forward direction. This shows that each of the atmospheres we produce is viable for the production of these spectral lines, with some curves being less likely due to the lack of density in the bins of the histogram (i.e., models with specific points in less dense bins are less likely to be the true solution).
Examining the atmospheric profiles obtained from the inversions helps us interpret the line profiles generated. Looking first at the line asymmetries, we have previously remarked that for the on-ribbon pixel, the Ca II line is slightly blueshifted, with a blue asymmetry in the wings. According to Kerr et al. (2016), the Ca II 8542 Å line during a flare is formed between 0.2 and 1.0 Mm above the photosphere, with the wings beyond ±0.3 Å from line center formed between 0.2 and 0.4 Mm, i.e., in the upper photosphere/lower chromosphere. The line core within ±0.3 Å of line center is formed above that. A steep positive velocity gradient in the area of core formation (0.9-1 Mm) explains the blueshifted core of our flare ribbon calcium profile. In the region of formation of the wings of this line, we observe a small upflow that would cause the observed blue asymmetry due to the emitting material moving upward. Kuridze et al. (2015) indicated that the Hα profile forms below 1.2 Mm, with the wings forming below 0.95 Mm and the core forming above this height. The wings of the on-ribbon Hα profile are very slightly asymmetric in favor of the red wing. In the region where the wings are formed, there is a small positive velocity gradient. This leads us to believe that there has been chromospheric evaporation in this region, leading to an increase in optical depth in the region of the blue wing, meaning that there will be more absorption in the blue wing.
For our off-ribbon pixel, both profiles have small red asymmetries. This can be explained in our inverted atmosphere due to a turbulent flow where the lines are formed, which would also explain the asymmetries. Our velocity solution here is quite oscillatory. RADYN has an underlying 2 km s −1 microturbulent velocity, so the line profiles it produces are not as broad as those observed. Having learned that flows produce shifted emission, this oscillation is our model's attempt at making the lines the correct width.

Discussion and Conclusions
We have presented a novel approach to obtaining the distribution of solar atmospheric properties n e , T, and bulk flow speed v from observed Hα and Ca II 8542 Å spectral-line profiles using an INN trained on RADYN flare models. The network learns a bijective approximation to the forward and inverse problems of mapping atmospheric snapshots to (observable) spectral-line profiles, and vice versa. Our initial results are very promising when tested on a flare previously analyzed by Kuridze et al. (2015), aligning well with their results, as discussed in Section 4.
The INN method of atmospheric inversion represents a significant theoretical step forward in the field of inversion. Taking the process of training and applying the INN as a whole, it is comparable to the process performed by existing non-LTE inversion tools, which are typically composed of a forward model for computing the line profiles from an atmosphere, such as RH (Uitenbroek 2001), and an "inversion engine" that is responsible for determining the necessary perturbations to the
atmosphere to produce a best-fit line profile. Our INN first learns the forward process from our training data, but due to the bijective nature of the mapping, a perturbative solution approach is not required, as all of the information lost in the forward process can be restored through the latent space. The models that take this "inversion engine" approach, such as STiC (de la Cruz Rodriguez et al. 2019) and NICOLE (Socas-Navarro et al. 2015), are effectively performing a walk through the latent space guided by their "inversion engines." There is no guarantee of solution uniqueness from those approaches, as the entire latent space is not visited. With the INN approach, the useful extent of the latent space is learned during training, and it is therefore trivial to span the latent space with multiple draws of the unit multivariate normal distribution. As our INN was trained on RADYN data, it is important to stress that it can only generate RADYN-like solutions, and this should be taken into account when analyzing any atmospheric inversions performed. The RADYN training atmospheres also include the specific assumption of heating and nonthermal excitations by an electron beam from the corona. As a counterpoint to this, it is important to note that the INN does not simply ingest the grid of RADYN simulations and return a closely matched or interpolated template (an approach used, for example, by Beck et al. 2015 in the fast inversion of Ca II 8542 Å spectropolarimetric data). Instead, the INN has learned a bijective mapping between the input space containing the atmospheric parameters and the output space containing the line profiles and the explicit latent space. In the inverse process, the line profiles are complemented by the latent space to remove ambiguities due to information lost in the forward process. The model's validation on the unseen testing set should ensure that the atmospheres recovered are physically reasonable, and that the model has learned to relate the emergent line profiles with the properties of the atmosphere.
The INN method is fast, as it "front-loads" a large portion of the computational work by requiring a large training set in the form of RADYN simulations, followed by approximately 1 day of training on an NVIDIA GTX1050Ti GPU. The result of this precomputation is that inference is then extremely rapid while still drawing on a very complex physical model. The complex model is needed for the flare problem, where assumptions of hydrostatic and local thermodynamic equilibrium cannot hold and steep gradients are expected to form. This presents a further advantage of the INN method for flares, since, to reduce the size of the parameter space and allow an "inversion engine" to converge in a reasonable amount of time, all other inversion codes currently assume that the atmosphere is in hydrostatic equilibrium (Socas-Navarro et al. 2015; de la Cruz Rodriguez et al. 2019) and use <10 nodes in the atmosphere where the parameters are computed, with various interpolation techniques used between these.

(Figure 8 caption: Inversion of the pixel on the flare ribbon. The top panels show the atmospheric parameters obtained from the inversion: the top left panel shows the electron density and temperature plotted on log scales, and the top right panel shows the net velocity flow in our plasma. The plots were made by sampling the latent space 20,000 times and plotting the results of the inversions as a two-dimensional histogram; the bins with the greatest density are the most likely values for the parameters at a given height. The black dotted lines show the median profiles for each quantity. The bottom panels show the lines that were inverted: the blue dotted lines are the true line profiles, and the black bins are the round-trip generation of the spectral lines produced by performing the forward process on the sets of atmospheric parameters obtained from the inversion.)
As found in Brown et al. (2018), the nonequilibrium level population and ionization effects present in RADYN, including those due to direct excitations by nonthermal electrons, cause significant deviations between the line profiles computed with these populations and those computed under the assumption of statistical equilibrium in RH (Uitenbroek 2001). Because our model is trained on RADYN data, the associated line profiles are based on RADYN's nonequilibrium formalism and assumption of complete redistribution (i.e., the frequency of an absorbed photon that leads to an excited state and that of the resulting emitted photon are assumed to be independent). These effects are therefore learned by the INN. It is interesting that, even with limited atmospheric information, i.e., n e , T, and v, which are a far-from-complete description of the state of the atmosphere, the INN was nevertheless able to very successfully reproduce the emission from the unseen RADYN snapshots from the F-CHROMA grid. This implies that sufficient non-LTE and non-hydrostatic equilibrium information about local "microscopic" (ionization, level populations), "macroscopic" (gas pressure, opacity), and nonlocal physics (conduction, radiative back warming) must be encoded in these three parameters and their variation through the atmosphere.
Inversions of pixels on the flare ribbon performed in Section 4 suggest significant oscillations in the velocity profile in the transition region (e.g., Figure 8). These oscillations do not simply appear on the median line but appear with a similar form on many of the individual velocity profiles obtained from the inversion. This may in part be due to RADYN using a conservative 2 km s⁻¹ microturbulent velocity throughout the atmosphere. Studies with the Interface Region Imaging Spectrograph (De Pontieu et al. 2014) have required significantly higher values to explain the nonthermal broadening in Mg II h & k in chromospheric plage. Carlsson et al. (2015) found a value of ∼7 km s⁻¹, and the inversions performed with STiC (de la Cruz Rodriguez et al. 2019) suggest a value of ∼8 km s⁻¹ for the same observation. We suggest, then, that the INN needs to broaden the line to match observations and uses an oscillating velocity and higher temperature in the τ = 1 region to achieve this. Better constraining the parameters in the upper chromosphere and transition region requires the computation of additional lines such as Mg II h & k.

Appendix

Training the INN is made possible by the use of the MMD. The MMD is a technique for determining the distance between probability distributions P and Q using observations X = {x₁, …, x_m} and Y = {y₁, …, y_n} drawn in an independent and identically distributed fashion from P and Q, respectively. The MMD can be mathematically expressed as

MMD²(P, Q) = ‖μ_P − μ_Q‖²_ℋ, (14)

where ℋ is an RKHS known as the feature space, with elements known as features; ⟨·, ·⟩_ℋ denotes the inner product in the feature space; and μ_A represents the expectation vector of the features of ℋ evaluated for the distribution A.
Let 𝒳 be a nonempty set with a positive definite kernel k: 𝒳 × 𝒳 → ℝ and φ: 𝒳 → ℋ a feature map; then, for all x, y ∈ 𝒳,

k(x, y) = ⟨φ(x), φ(y)⟩_ℋ. (15)
The feature spaces of kernels such as the Gaussian kernel, k(x, y) = exp(−γ‖x − y‖²₂) with γ > 0, are in fact infinite-dimensional, but the kernel trick of Equation (15) allows the inner product between vectors in this space to be written in closed form. The reproducing property of the RKHS states simply that, under the inner product of features in ℋ, the kernel will always be recovered. For a positive definite kernel, there is a unique RKHS ℋ with reproducing kernel k whose features are a subset of ℋ; therefore, a feature map is not unique, but the kernel is. Then, μ_P from Equation (14) can be written in terms of the features of ℋ as (μ_P)_i = 𝔼_P[φ_i(x)], where 𝔼_P denotes the expectation value of its argument with respect to P and φ_i is the ith feature of φ. From this definition, we can write

MMD²(P, Q) = 𝔼_{x,x′∼P}[k(x, x′)] + 𝔼_{y,y′∼Q}[k(y, y′)] − 2 𝔼_{x∼P, y∼Q}[k(x, y)],

which is estimated in practice from the biased sample statistic

MMD²_b(X, Y) = (1/m²) Σ_{i,j} k(x_i, x_j) + (1/n²) Σ_{i,j} k(y_i, y_j) − (2/mn) Σ_{i,j} k(x_i, y_j).

The bias on this statistic simply increases the expected MMD result but has the advantage of remaining positive even when P = Q, which works better with the optimizer used to train the INN.
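For reference, a hedged PyTorch sketch of a biased MMD estimate with an IMQ kernel follows, including the kind of α-scan used to pick the kernel scale (Section 3.3). The specific IMQ parameterization and the candidate grid of α values are assumptions, not the exact form used in our training code.

```python
import torch

def imq_kernel(a, b, alpha):
    """Inverse multiquadric kernel k(a, b) = alpha / (alpha + ||a - b||^2)."""
    d2 = torch.cdist(a, b) ** 2
    return alpha / (alpha + d2)

def mmd2_biased(x, y, alpha):
    """Biased sample estimate of MMD^2 between batches x and y."""
    k_xx = imq_kernel(x, x, alpha).mean()
    k_yy = imq_kernel(y, y, alpha).mean()
    k_xy = imq_kernel(x, y, alpha).mean()
    return k_xx + k_yy - 2.0 * k_xy

def best_alpha(x, y, alphas):
    """Return the alpha maximizing MMD^2 over a grid of candidate scales,
    following the idea of choosing the kernel that best separates the two
    distributions (cf. Sriperumbudur et al. 2009)."""
    vals = torch.stack([mmd2_biased(x, y, a) for a in alphas])
    return alphas[int(torch.argmax(vals))]
```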